
EFFICIENT SPEECH TRANSLATION WITH DYNAMIC LATENT PERCEIVERS
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa
Universitat Politècnica de Catalunya, Barcelona
{ioannis.tsiamas,gerard.ion.gallego,jose.fonollosa}@upc.edu

Marta R. Costa-jussà
Meta AI, Paris
costajussa@meta.com
ABSTRACT
Transformers have been the dominant architecture for Speech Translation in recent years, achieving significant improvements in translation quality. Since speech signals are longer than their textual counterparts, and due to the quadratic complexity of the Transformer, a down-sampling step is essential for its adoption in Speech Translation. Instead, in this research, we propose to ease the complexity by using a Perceiver encoder to map the speech inputs to a fixed-length latent representation. Furthermore, we introduce a novel way of training Perceivers, with Dynamic Latent Access (DLA), unlocking larger latent spaces without any additional computational overhead. Speech-to-Text Perceivers with DLA can match the performance of Transformer baselines across three language pairs in MuST-C. Finally, a DLA-trained model is easily adaptable to DLA at inference, and can be flexibly deployed with various computational budgets, without significant drops in translation quality.
Index Terms: Speech Translation, Efficiency, Perceiver
1. INTRODUCTION
Speech Translation (ST) has traditionally relied on a cascade approach, using two separate systems: an Automatic Speech Recognition (ASR) model for transcription and a Machine Translation (MT) model for text translation. Recently, the end-to-end approach, with a single model, has attracted more interest, having several advantages such as faster inference and no error propagation [1, 2]. The Transformer [3] has been crucial for this change, becoming the standard model in end-to-end ST.
One of the Transformer's key features is its ability to model token-to-token interactions with attention matrices, which imposes a quadratic complexity with respect to the sequence length. Since speech sequences are much longer than text sequences, directly processing speech with a Transformer becomes problematic. Thus, a modification is usually necessary, such as down-sampling the speech signal at the input of the encoder [4] or at the input of the attention modules
[5].

Fig. 1. Speech-to-Text Perceiver

Work at UPC was supported by the Spanish State Research Agency (AEI) project PID2019-107579RB-I00 / AEI / 10.13039/501100011033.

In this research, we take an alternative approach and propose to map the input speech to a fixed-length latent representation using a Perceiver encoder [6]. This mapping swaps the quadratic complexity from the sequence length to the number of latents and makes the model only linearly dependent on the sequence length. We demonstrate that a
Perceiver encoder coupled with a Transformer decoder can obtain competitive results across three language pairs in end-to-end ST. To further ease the computational burden of the proposed model, we introduce a novel way of training and doing inference with Perceivers, called Dynamic Latent Access (DLA). By enabling Perceivers to have access to a large latent space but only use a small part of it at each training step, we can increase the model's expressive power without incurring additional computational costs. We also show that a diversity-based DLA can be utilized during inference to achieve significant improvements in efficiency with minimal reduction in translation quality. Finally, we investigate the complementary nature of DLA at training and inference and show that combining the two can create a single, flexible model that can be used in various scenarios with varying computational budgets. Our code is publicly available.1
1https://github.com/mt-upc/s2t-perceiver
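To make the complexity argument concrete, the following is a minimal NumPy sketch (not the paper's implementation) of a Perceiver-style cross-attention read: K learned latents query L input frames, so the score matrix is K x L and the cost is O(K * L * d), linear in L rather than the O(L^2 * d) of self-attention. The final lines illustrate the idea behind Dynamic Latent Access, activating only a subset of a larger latent bank per step; the random selection here is a hypothetical placeholder, as the paper's actual selection rule (e.g., the diversity-based variant at inference) is not specified in this section.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_cross_attention(latents, inputs):
    """One cross-attention read: K latents query L input frames.

    The score matrix is (K, L), so the cost is O(K * L * d):
    linear in the input length L, unlike O(L^2 * d) self-attention.
    """
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)   # (K, L)
    return softmax(scores, axis=-1) @ inputs   # (K, d)

rng = np.random.default_rng(0)
L, K, d = 1000, 64, 32                        # illustrative sizes
inputs = rng.standard_normal((L, d))          # long speech sequence
latent_bank = rng.standard_normal((256, d))   # large latent space

# Hypothetical dynamic latent access: activate only K of the 256
# latents for this step, so the per-step cost stays that of K latents.
active = rng.choice(256, size=K, replace=False)
out = perceiver_cross_attention(latent_bank[active], inputs)
print(out.shape)  # (64, 32): fixed-length output regardless of L
```

Doubling L doubles the cost of the score matrix but leaves the output shape (K, d) unchanged, which is what lets the downstream decoder operate on a fixed-length representation.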
arXiv:2210.16264v2 [cs.CL] 14 Mar 2023