
NAMED ENTITY DETECTION AND INJECTION FOR DIRECT SPEECH TRANSLATION
Marco Gaido∗†‡, Yun Tang⋆, Ilia Kulikov⋆, Rongqing Huang⋆, Hongyu Gong⋆, Hirofumi Inaguma⋆
⋆Meta AI, USA, †Fondazione Bruno Kessler, Italy, ‡University of Trento, Italy
mgaido@fbk.eu, {yuntang,kulikov,rhuangq,hygong,hirofumii}@meta.com
ABSTRACT
In a sentence, certain words are critical to its semantics.
Among them, named entities (NEs) are notoriously challenging
for neural models. Despite their importance, their accurate
handling has been neglected in speech-to-text (S2T) translation
research, and recent work has shown that S2T models perform
poorly on locations and, notably, person names, whose spelling
is challenging unless known in advance. In this work, we
explore how to leverage dictionaries of NEs known to likely
appear in a given context to improve S2T model outputs. Our
experiments show that we can reliably detect NEs likely present
in an utterance starting from S2T encoder outputs. Indeed, we
demonstrate that the current detection quality is sufficient to
improve NE accuracy in the translation, with a 31% reduction in
person name errors.
Index Terms: speech translation, named entities
1. INTRODUCTION
Translation is the process of conveying the semantic meaning of
a source sentence in a target language. In this process, named
entities (NEs), which identify real-world people, locations,
organizations, etc., play a paramount role, and their correct
translation is crucial to express the accurate meaning [1]. On
the other hand, current neural translation systems are known to
struggle in the presence of rare words [2], as NEs often are.
These motivations drove researchers to study dedicated
solutions that exploit additional information available at
inference time, such as bilingual dictionaries [3, 4, 5, 6]. All
these works, however, target text-to-text (T2T) translation and
assume that the dictionary entities present in the source
sentence can be easily retrieved with pattern matching. This
assumption does not hold for the speech-to-text (S2T)
translation task, where the source modality is audio.
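The pattern-matching assumption made by T2T approaches can be illustrated with a minimal sketch. The function name, the toy en→it dictionary, and the example sentence below are all hypothetical and are not taken from the cited works; the point is only that such a lookup needs the source as a string, which is not available when the input is audio.

```python
import re
from typing import Dict


def find_dictionary_entries(source: str, dictionary: Dict[str, str]) -> Dict[str, str]:
    """Return the dictionary entries whose source form occurs in the sentence."""
    found = {}
    for src_form, tgt_form in dictionary.items():
        # Require non-word characters (or string edges) around the match,
        # so that a short name does not match inside a longer word.
        pattern = r"(?<!\w)" + re.escape(src_form) + r"(?!\w)"
        if re.search(pattern, source, flags=re.IGNORECASE):
            found[src_form] = tgt_form
    return found


# Hypothetical bilingual NE dictionary (source form -> target form).
ne_dictionary = {
    "London": "Londra",
    "Marco Gaido": "Marco Gaido",
    "Trento": "Trento",
}

sentence = "Marco Gaido flew from London yesterday."
matches = find_dictionary_entries(sentence, ne_dictionary)
print(matches)
```

With an audio source there is no `sentence` string to search, so the retrieval step has to operate on a different representation of the utterance.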
The S2T task was initially accomplished by a cascade of
automatic speech recognition (ASR) and T2T translation systems.
However, end-to-end (or direct) S2T solutions have recently
progressed to the point of achieving similar translation quality
[7], with the benefits of a simpler architecture and lower
latency. Cascade and direct models have been shown to struggle
equally with NEs [8], even more than T2T ones, especially
regarding person names [9], which are particularly hard to
recognize from speech. Despite this and the importance of NEs,
to the best of our knowledge, no work has so far explored how
to exploit contextual dictionaries of NEs available at
inference time in S2T. In addition, existing methods designed
for T2T are not applicable due to the different input modality.
∗ Work done during an internship at Meta AI.
Motivated by the practical relevance of the problem and the
lack of existing solutions, we present the first approach to
exploit contextual information, in the form of a bilingual
dictionary of NEs, in direct S2T. Specifically, our main focus
is the detection of the NEs present in an utterance, among
those in a given contextual dictionary. Performing this task
allows us to rely on existing solutions to inject the correct
translations for the NEs. To show that the quality of our NE
detector is sufficient to be useful, we adopt a decoder
architecture similar to Contextual Listen Attend and Spell
(CLAS) [10] and provide it with the list of translated NEs
considered present by our detector module. Experimental results
on 3 language pairs (en→es,fr,it) demonstrate that we can
improve NE accuracy by up to 7.1% over a base S2T model, and
reduce the errors on person names by up to 31.3% over a strong
baseline exploiting the same inference-time contextual data.
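For readers unfamiliar with this kind of figure, a reduction in errors is conventionally reported relative to the baseline error rate. The sketch below shows that arithmetic under the assumption that the error rate is one minus the accuracy; this excerpt does not give the exact metric definition, and the accuracy values used are made-up toy numbers, not results from the experiments.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Fraction of the baseline errors removed by the new system.

    Assumes error rate = 1 - accuracy (an assumption of this sketch).
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error == 0.0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_error - new_error) / baseline_error


# Toy values only: accuracy rising from 50% to 75% halves the errors.
reduction = relative_error_reduction(0.50, 0.75)
print(f"{reduction:.1%}")
```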
2. ENTITY DETECTION FOR S2T TRANSLATION
Two operations are necessary to exploit a dictionary of NEs
likely to appear in an utterance: i) detect the relevant NEs
among those in the dictionary, ii) look at the corresponding
translations to accurately generate them. Accordingly, we add
two modules to the S2T model: i) a detector identifying the
NEs present in the utterance, and ii) a module informing the
decoder about the forms of the NEs in the target language.
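The two-step flow above can be sketched as follows. This is only an illustration of the data flow, not the architecture used in this work: the detector here is a stand-in that thresholds the cosine similarity between a pooled utterance vector and one vector per dictionary entry, and the vectors, threshold, and function names are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# (source form, target form, vector representing the source form)
Entry = Tuple[str, str, Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_entities(utterance: Vector, entries: List[Entry],
                    threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Step i): keep the dictionary entries judged present in the utterance.

    Stand-in detector: similarity thresholding, assumed for illustration.
    """
    return [(src, tgt) for src, tgt, vec in entries
            if cosine(utterance, vec) >= threshold]


def build_decoder_context(detected: List[Tuple[str, str]]) -> List[str]:
    """Step ii): collect the target-language forms to hand to the decoder."""
    return [tgt for _, tgt in detected]


# Toy 2-dimensional vectors, made up for this example.
utterance_vec = [1.0, 0.0]
dictionary_entries = [
    ("London", "Londra", [0.9, 0.1]),
    ("Trento", "Trento", [0.0, 1.0]),
]

detected = detect_entities(utterance_vec, dictionary_entries)
context = build_decoder_context(detected)
print(context)
```

Keeping the two steps separate means the second one only ever sees the entries selected by the first, whatever detector is used in its place.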
2.1. Entity Detection
A recent research direction in S2T consists of training models
that jointly perform S2T and T2T to improve the quality of
direct S2T [11, 12]. These speech/text-to-text (ST2T) models
include auxiliary tasks to force the encoder outputs of
different modalities to be close when the text/audio content is
the same. Fig. 1 confirms that encoder outputs for text (the
text is actually converted into phonemes before being fed to the
arXiv:2210.11981v2 [cs.CL] 11 Mar 2023