
NAMED ENTITY DETECTION AND INJECTION FOR DIRECT SPEECH TRANSLATION
Marco Gaido∗†‡, Yun Tang⋆, Ilia Kulikov⋆, Rongqing Huang⋆, Hongyu Gong⋆, Hirofumi Inaguma⋆
⋆Meta AI, USA, †Fondazione Bruno Kessler, Italy, ‡University of Trento, Italy
mgaido@fbk.eu,{yuntang,kulikov,rhuangq,hygong,hirofumii}@meta.com
ABSTRACT
In a sentence, certain words are critical to its semantics.
Among them, named entities (NEs) are notoriously
challenging for neural models. Despite their importance,
their accurate handling has been neglected in speech-to-text
(S2T) translation research, and recent work has shown that
S2T models perform poorly on locations and, notably, on person
names, whose spelling is challenging unless known in advance.
In this work, we explore how to leverage dictionaries
of NEs known to likely appear in a given context to improve
S2T model outputs. Our experiments show that we can reli-
ably detect NEs likely present in an utterance starting from
S2T encoder outputs. We further demonstrate that the current
detection quality is sufficient to improve NE accuracy in the
translation, with a 31% reduction in person name errors.
Index Terms—speech translation, named entities
1. INTRODUCTION
Translation is the process of conveying the semantic meaning
of a source sentence in a target language. In this process,
named entities (NEs) – which identify real-world people, lo-
cations, organizations, etc. – play a paramount role and their
correct translation is crucial to express the accurate meaning
[1]. On the other hand, current neural translation systems are
known to struggle in the presence of rare words [2], which NEs
often are. These considerations have driven researchers to study dedicated
solutions that exploit additional information available at
inference time, such as bilingual dictionaries [3, 4, 5, 6]. All these
works, however, target text-to-text (T2T) translation
and assume that the dictionary entities present in the source
sentence can be easily retrieved with pattern matching. This
assumption does not hold for the speech-to-text (S2T) trans-
lation task, where the source modality is audio.
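To make the pattern-matching assumption concrete, the following is a minimal sketch of how dictionary entries can be retrieved from a textual source sentence. The function name, the toy dictionary, and the whole-word, case-insensitive matching rule are illustrative assumptions and are not taken from the cited works. No equivalent string lookup is available when the source is an audio signal.

```python
import re


def find_dictionary_entities(source: str, dictionary: dict[str, str]) -> dict[str, str]:
    """Return the dictionary entries whose source-side form occurs in `source`.

    Hypothetical helper: whole-word, case-insensitive matching is an
    assumption made for this illustration only.
    """
    found = {}
    for src_form, tgt_form in dictionary.items():
        # Match the source form only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(src_form) + r"(?!\w)"
        if re.search(pattern, source, flags=re.IGNORECASE):
            found[src_form] = tgt_form
    return found


# Toy English-to-Italian dictionary of NEs (illustrative entries).
ne_dict = {"Munich": "Monaco di Baviera", "Nice": "Nizza", "Turin": "Torino"}
matches = find_dictionary_entities("She moved from Turin to Munich.", ne_dict)
print(matches)
```

With text input, the matched entries can be handed directly to a T2T system that supports dictionary injection; with speech input, the same selection has to be made from the audio itself.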
The S2T task was initially accomplished by a cascade of
automatic speech recognition (ASR) and T2T translation systems.
However, end-to-end (or direct) S2T solutions have recently
progressed to the point of achieving similar translation quality
[7], with the benefits of a simpler architecture and lower latency.
Cascade and direct models have been shown to struggle equally
with NEs [8], even more than T2T ones, especially
regarding person names [9], which are particularly hard to
recognize from speech. Despite this and the importance of NEs,
to the best of our knowledge, no work has so far explored
how to exploit contextual dictionaries of NEs available at
inference time in S2T. In addition, existing methods designed
for T2T are not applicable due to the different input modality.
∗Work done during an internship at Meta AI.
Motivated by the practical relevance of the problem and
the lack of existing solutions, we present the first approach to
exploit contextual information – in the form of a bilingual dic-
tionary of NEs – in direct S2T. Specifically, our main focus is
the detection of the NEs present in an utterance, among those
in a given contextual dictionary. Performing this task allows
us to rely on the existing solutions to inject the correct trans-
lations for the NEs. To showcase that the quality of our NE
detector is sufficient to be useful, we adopt a decoder archi-
tecture similar to Contextual Listen Attend and Spell (CLAS)
[10] and provide it with the list of translated NEs considered
present by our detector module. Experimental results on 3
language pairs (en→es,fr,it) demonstrate that we can improve
NE accuracy by up to 7.1% over a base S2T model, and re-
duce the errors on person names by up to 31.3% over a strong
baseline exploiting the same inference-time contextual data.
2. ENTITY DETECTION FOR S2T TRANSLATION
Two operations are necessary to exploit a dictionary of NEs
likely to appear in an utterance: i) detect the relevant NEs
among those in the dictionary, ii) retrieve the corresponding
translations so as to generate them accurately. Accordingly, we add
two modules to the S2T model: i) a detector identifying the
NEs present in the utterance, and ii) a module informing the
decoder about the forms of the NEs in the target language.
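The two operations can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the implementation used in this work: `score_fn` stands in for the detector, `decode_fn` for the context-aware decoder, and the 0.5 threshold, the `Entry` type, and the toy scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass(frozen=True)
class Entry:
    source: str  # NE form in the source language
    target: str  # NE form in the target language


def select_entities(
    encoder_out: Any,
    dictionary: Sequence[Entry],
    score_fn: Callable[[Any, Entry], float],
    threshold: float = 0.5,  # assumed value, for illustration only
) -> List[str]:
    """Step i): keep the target forms of the entries the detector deems present."""
    return [e.target for e in dictionary if score_fn(encoder_out, e) >= threshold]


def translate_with_context(
    encoder_out: Any,
    dictionary: Sequence[Entry],
    score_fn: Callable[[Any, Entry], float],
    decode_fn: Callable[[Any, List[str]], str],
    threshold: float = 0.5,
) -> str:
    """Step ii): pass the selected target forms to the decoder as context."""
    detected = select_entities(encoder_out, dictionary, score_fn, threshold)
    return decode_fn(encoder_out, detected)


# Toy stand-ins for the detector and the decoder (hypothetical).
toy_scores = {"Turin": 0.9, "Munich": 0.2}


def toy_score(encoder_out: Any, entry: Entry) -> float:
    return toy_scores[entry.source]


def toy_decode(encoder_out: Any, context: List[str]) -> str:
    return f"<decoded with context {context}>"


dictionary = [Entry("Turin", "Torino"), Entry("Munich", "Monaco di Baviera")]
detected = select_entities(None, dictionary, toy_score)
output = translate_with_context(None, dictionary, toy_score, toy_decode)
print(detected, output)
```

Keeping detection separate from injection means the decoder only sees the entries judged present, rather than the full dictionary.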
2.1. Entity Detection
A recent research direction in S2T consists in training models
that jointly perform S2T and T2T to improve the quality of
direct S2T [11, 12]. These speech/text-to-text (ST2T) models
include auxiliary tasks to force the encoder outputs of differ-
ent modalities to be close when the text/audio content is the
same. Fig. 1 confirms that encoder outputs for text (the text
is actually converted into phonemes before being fed to the
arXiv:2210.11981v2 [cs.CL] 11 Mar 2023