EFFICIENT SPEECH TRANSLATION WITH DYNAMIC LATENT PERCEIVERS

Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa
Universitat Politècnica de Catalunya, Barcelona
{ioannis.tsiamas,gerard.ion.gallego,jose.fonollosa}@upc.edu

Marta R. Costa-jussà
Meta AI, Paris
costajussa@meta.com
ABSTRACT

Transformers have been the dominant architecture for Speech Translation in recent years, achieving significant improvements in translation quality. Since speech signals are longer than their textual counterparts, and due to the quadratic complexity of the Transformer, a down-sampling step is essential for its adoption in Speech Translation. Instead, in this research, we propose to ease the complexity by using a Perceiver encoder to map the speech inputs to a fixed-length latent representation. Furthermore, we introduce a novel way of training Perceivers, with Dynamic Latent Access (DLA), unlocking larger latent spaces without any additional computational overhead. Speech-to-Text Perceivers with DLA can match the performance of Transformer baselines across three language pairs in MuST-C. Finally, a DLA-trained model is easily adaptable to DLA at inference, and can be flexibly deployed with various computational budgets, without significant drops in translation quality.

Index Terms— Speech Translation, Efficiency, Perceiver
1. INTRODUCTION

Speech Translation (ST) has traditionally relied on a cascade approach, using two separate systems: an Automatic Speech Recognition (ASR) model for transcription and a Machine Translation (MT) model for text translation. Recently, the end-to-end approach, with a single model, has attracted more interest, as it offers several advantages, such as faster inference and no error propagation [1, 2]. The Transformer [3] has been crucial for this change, becoming the standard model in end-to-end ST.
Work at UPC was supported by the Spanish State Research Agency (AEI) project PID2019-107579RB-I00 / AEI / 10.13039/501100011033.

Fig. 1. Speech-to-Text Perceiver

One of the Transformer's key features is its ability to model token-to-token interactions with attention matrices, which imposes a quadratic complexity with respect to the sequence length. Since speech sequences are much longer than text sequences, directly processing speech with a Transformer becomes problematic. Thus, a modification is usually necessary, such as down-sampling the speech signal at the input of the encoder [4] or at the input of the attention modules [5]. In this research, we take an alternative approach and propose to map the input speech to a fixed-length latent representation using a Perceiver encoder [6] (Fig. 1). This mapping
swaps the quadratic complexity from the sequence length to the number of latents and makes the model only linearly dependent on the sequence length. We demonstrate that a Perceiver encoder coupled with a Transformer decoder can obtain competitive results across three language pairs in end-to-end ST. To further ease the computational burden of the proposed model, we introduce a novel way of training and doing inference with Perceivers, called Dynamic Latent Access (DLA). By enabling Perceivers to have access to a large latent space, but to use only a small part of it at each training step, we can increase the model's expressive power without incurring additional computational costs. We also show that a diversity-based DLA can be utilized during inference to achieve significant improvements in efficiency with minimal reduction in translation quality. Finally, we investigate the complementary nature of DLA at training and inference, and show that combining the two creates a single, flexible model that can be used in various scenarios with varying computational budgets. Our code is publicly available.¹

¹ https://github.com/mt-upc/s2t-perceiver
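The complexity argument above can be illustrated with a small numpy sketch. This is a simplification, not the paper's implementation: the array shapes, the single cross-attention step, and all names are ours. A fixed set of N latents attends over T input frames, so the attention matrix has shape (N, T) rather than (T, T), i.e. the cost grows only linearly in T for a fixed N.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_cross_attention(latents, inputs):
    """Latents (N, d) attend over input frames (T, d).

    The attention matrix is (N, T): linear in the input length T
    for a fixed number of latents N, instead of the (T, T) matrix
    of Transformer self-attention.
    """
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)   # (N, T)
    return softmax(scores, axis=-1) @ inputs   # (N, d)

rng = np.random.default_rng(0)
N, T, d = 64, 1500, 16           # few latents, long speech sequence
latents = rng.normal(size=(N, d))
frames = rng.normal(size=(T, d))
out = perceiver_cross_attention(latents, frames)
print(out.shape)                 # (64, 16): fixed length, regardless of T
```

Doubling T doubles the size of the (N, T) score matrix, while the output stays (N, d), which is what lets a standard decoder sit on top of an arbitrarily long speech input.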
arXiv:2210.16264v2 [cs.CL] 14 Mar 2023
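The training side of Dynamic Latent Access can be sketched as follows. This is a minimal illustration under our own assumptions: we select the active subset uniformly at random, whereas the paper's actual selection strategy (and its diversity-based inference variant) may differ; all names and sizes are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_BANK = 512      # total latents the model has access to
ACTIVE = 64            # latents actually used at each training step
d = 16

latent_bank = rng.normal(size=(LATENT_BANK, d))

def sample_active_latents(bank, k, rng):
    """Pick k distinct latents from the bank for one training step.

    Per-step compute scales with the k active latents, not with the
    full bank, so enlarging the bank adds no per-step cost.
    """
    idx = rng.choice(bank.shape[0], size=k, replace=False)
    return idx, bank[idx]

idx, active = sample_active_latents(latent_bank, ACTIVE, rng)
print(active.shape)    # (64, 16): step cost depends on ACTIVE only
```

Only the selected rows would participate in the cross-attention of that step, which is how a larger latent space can be trained without additional computational overhead.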