
EFFICIENT SPEECH TRANSLATION WITH DYNAMIC LATENT PERCEIVERS
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa
Universitat Politècnica de Catalunya, Barcelona
{ioannis.tsiamas,gerard.ion.gallego,jose.fonollosa}@upc.edu

Marta R. Costa-jussà
Meta AI, Paris
costajussa@meta.com
ABSTRACT
Transformers have been the dominant architecture for Speech Translation in recent years, achieving significant improvements in translation quality. Since speech signals are longer than their textual counterparts, and due to the quadratic complexity of the Transformer, a down-sampling step is essential for its adoption in Speech Translation. Instead, in this research, we propose to ease the complexity by using a Perceiver encoder to map the speech inputs to a fixed-length latent representation. Furthermore, we introduce a novel way of training Perceivers, with Dynamic Latent Access (DLA), unlocking larger latent spaces without any additional computational overhead. Speech-to-Text Perceivers with DLA can match the performance of Transformer baselines across three language pairs in MuST-C. Finally, a DLA-trained model is easily adaptable to DLA at inference, and can be flexibly deployed with various computational budgets, without significant drops in translation quality.
Index Terms: Speech Translation, Efficiency, Perceiver
1. INTRODUCTION
Speech Translation (ST) has traditionally relied on a cascade approach, using two separate systems: an Automatic Speech Recognition (ASR) model for transcription and a Machine Translation (MT) model for text translation. Recently, the end-to-end approach, with a single model, has attracted more interest, having several advantages such as faster inference and no error propagation [1, 2]. The Transformer [3] has been crucial for this change, becoming the standard model in end-to-end ST.
One of the Transformer's key features is its ability to model token-to-token interactions with attention matrices, which imposes a quadratic complexity with respect to the sequence length. Since speech sequences are much longer than text sequences, directly processing speech with a Transformer becomes problematic. Thus, a modification is usually necessary, such as down-sampling the speech signal at the input of the encoder [4] or at the input of the attention modules
[5].

Fig. 1. Speech-to-Text Perceiver

Work at UPC was supported by the Spanish State Research Agency (AEI) project PID2019-107579RB-I00 / AEI / 10.13039/501100011033.

In this research, we take an alternative approach and propose to map the input speech to a fixed-length latent representation using a Perceiver encoder [6]. This mapping swaps the quadratic complexity from the sequence length to the number of latents and makes the model only linearly dependent on the sequence length. We demonstrate that a
Perceiver encoder coupled with a Transformer decoder can obtain competitive results across three language pairs in end-to-end ST. To further ease the computational burden of the proposed model, we introduce a novel way of training and doing inference with Perceivers, called Dynamic Latent Access (DLA). By enabling Perceivers to have access to a large latent space but only use a small part of it at each training step, we can increase the model's expressive power without incurring additional computational costs. We also show that a diversity-based DLA can be utilized during inference to achieve significant improvements in efficiency with minimal reduction in translation quality. Finally, we investigate the complementary nature of DLA at training and inference and show that combining the two can create a single, flexible model that can be used in various scenarios with varying computational budgets. Our code is publicly available.1
1https://github.com/mt-upc/s2t-perceiver
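To make the complexity argument concrete, the following is a minimal NumPy sketch (not the paper's implementation) of a Perceiver-style cross-attention read: K learned latents query L input frames, so the score matrix is K x L and the cost is O(K * L * d), linear in L rather than the O(L^2 * d) of self-attention. The final lines illustrate the idea behind Dynamic Latent Access, activating only a subset of a larger latent bank per step; the random selection here is a hypothetical placeholder, as the paper's actual selection rule (e.g., the diversity-based variant at inference) is not specified in this section.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_cross_attention(latents, inputs):
    """One cross-attention read: K latents query L input frames.

    The score matrix is (K, L), so the cost is O(K * L * d):
    linear in the input length L, unlike O(L^2 * d) self-attention.
    """
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)   # (K, L)
    return softmax(scores, axis=-1) @ inputs   # (K, d)

rng = np.random.default_rng(0)
L, K, d = 1000, 64, 32                        # illustrative sizes
inputs = rng.standard_normal((L, d))          # long speech sequence
latent_bank = rng.standard_normal((256, d))   # large latent space

# Hypothetical dynamic latent access: activate only K of the 256
# latents for this step, so the per-step cost stays that of K latents.
active = rng.choice(256, size=K, replace=False)
out = perceiver_cross_attention(latent_bank[active], inputs)
print(out.shape)  # (64, 32): fixed-length output regardless of L
```

Doubling L doubles the cost of the score matrix but leaves the output shape (K, d) unchanged, which is what lets the downstream decoder operate on a fixed-length representation.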
arXiv:2210.16264v2 [cs.CL] 14 Mar 2023