Does Joint Training Really Help Cascaded Speech Translation?
Viet Anh Khoa Tran, David Thulke, Yingbo Gao, Christian Herold, Hermann Ney
Human Language Technology and Pattern Recognition Group
Computer Science Department
RWTH Aachen University
D-52056 Aachen, Germany
{vtran|thulke|gao|herold|ney}@i6.informatik.rwth-aachen.de
Abstract
Currently, in speech translation, the straightforward approach of cascading a recognition system with a translation system delivers state-of-the-art results. However, fundamental challenges such as error propagation from the automatic speech recognition system still remain. To mitigate these problems, attention has recently turned to direct data, and various joint training methods have been proposed. In this work, we seek to answer the question of whether joint training really helps cascaded speech translation. We review recent papers on the topic and also investigate a joint training criterion that marginalizes over the transcription posterior probabilities. Our findings show that a strong cascaded baseline can diminish any improvements obtained using joint training, and we suggest alternatives to joint training. We hope this work can serve as a refresher of the current speech translation landscape and motivate research into more efficient and creative ways to utilize the direct data for speech translation.
1 Introduction
Speech translation (ST) is the task of automatically translating speech in some source language into some other target language (Stentiford and Steer, 1988; Waibel et al., 1991). Traditionally, a cascaded approach is used, where an automatic speech recognition (ASR) system first transcribes the speech, followed by a machine translation (MT) system that translates the transcripts (Sperber and Paulik, 2020). The problem of error propagation has been at the center of discussion in the ST literature (Ney, 1999; Casacuberta et al., 2004; Matusov et al., 2005; Peitz et al., 2012; Sperber et al., 2017), and instead of passing only the discrete symbols of the source language between the two systems, ideas such as using n-best lists, lattices, and neural network hidden representations have been investigated (Saleem et al., 2004; Kano et al., 2017; Anastasopoulos and Chiang, 2018; Zhang et al., 2019; Sperber et al., 2019). For a more systematic review of ST development, we refer the readers to Sperber and Paulik (2020).
With recent efforts in expanding ST data collection (Di Gangi et al., 2019; Beilharz et al., 2020), more and more direct ST data is available. Such direct data comes as pairs of source speech and target translation, and often as triplets further including source transcriptions.
Various joint training methods have been proposed to use such data to improve cascaded systems, with the hope that uncertainties during transcription can be passed on to translation to be resolved there. Here, what we call "joint training" is often referred to as "end-to-end training" in the literature, where the direct ST data is utilized in the joint optimization of the ASR and MT models (Kano et al., 2017; Berard et al., 2018; Anastasopoulos and Chiang, 2018; Inaguma et al., 2019; Sperber et al., 2019; Bahar et al., 2019; Wang et al., 2020; Bahar et al., 2021).
In this work, we revisit the principal question of
whether or not joint training really helps cascaded
speech translation.
2 Cascaded Approach
In traditional cascaded systems, an ASR model $p(f_1^J \mid x_1^T)$ and an MT model $p(e_1^I \mid f_1^J)$ are trained separately, where we denote speech features as $x_1^T$, transcriptions as $f_1^J$, and translations as $e_1^I$. The decoding is done in two steps:

$$\hat{f}_1^J = \operatorname*{argmax}_{[f_1^J]} \; p(f_1^J \mid x_1^T)$$

$$\hat{e}_1^I = \operatorname*{argmax}_{[e_1^I]} \; p(e_1^I \mid \hat{f}_1^J)$$

The $\operatorname{argmax}$ is approximated using beam search for computational reasons, and we will assume a fixed beam size $N$ for the decoding of both transcriptions and translations.
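As an illustration, a minimal Python sketch of this two-step decoding follows. The `beam_search` helper and the `asr_step`/`mt_step` scoring interfaces are our own assumptions for exposition, not the implementation used in our experiments.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a step function maps a prefix of output tokens
# to the top candidate (token, log-probability) continuations.
StepFn = Callable[[Sequence[int]], List[Tuple[int, float]]]


def beam_search(step: StepFn, eos: int, beam_size: int, max_len: int) -> List[int]:
    """Approximate the argmax over output sequences with a fixed beam size N."""
    beams: List[Tuple[List[int], float]] = [([], 0.0)]  # (prefix, log-prob)
    finished: List[Tuple[List[int], float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step(prefix):
                candidates.append((prefix + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates:
            if prefix[-1] == eos:
                finished.append((prefix, score))   # complete hypothesis
            elif len(beams) < beam_size:
                beams.append((prefix, score))      # keep the N best prefixes
        if not beams:                              # all hypotheses finished
            break
    return max(finished or beams, key=lambda c: c[1])[0]


def cascaded_decode(x, asr_step: Callable, mt_step: Callable,
                    eos: int, beam_size: int, max_len: int) -> List[int]:
    # Step 1: transcribe, f_hat = argmax_f p(f | x), via beam search.
    f_hat = beam_search(lambda prefix: asr_step(x, prefix), eos, beam_size, max_len)
    # Step 2: translate the 1-best transcript, e_hat = argmax_e p(e | f_hat).
    return beam_search(lambda prefix: mt_step(f_hat, prefix), eos, beam_size, max_len)
```

Note that only the single best transcription $\hat{f}_1^J$ crosses the ASR-MT interface, so all transcription uncertainty is discarded at this point; this is precisely where error propagation enters the cascade.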