Does Joint Training Really Help Cascaded Speech Translation?
Viet Anh Khoa Tran David Thulke Yingbo Gao Christian Herold Hermann Ney
Human Language Technology and Pattern Recognition Group
Computer Science Department
RWTH Aachen University
D-52056 Aachen, Germany
{vtran|thulke|gao|herold|ney}@i6.informatik.rwth-aachen.de
Abstract
Currently, in speech translation, the straightforward approach of cascading a recognition system with a translation system delivers state-of-the-art results. However, fundamental challenges such as error propagation from the automatic speech recognition system remain. To mitigate these problems, researchers have recently turned their attention to direct data and proposed various joint training methods. In this work, we seek to answer the question of whether joint training really helps cascaded speech translation. We review recent papers on the topic and also investigate a joint training criterion that marginalizes over the transcription posterior probabilities. Our findings show that a strong cascaded baseline can diminish any improvements obtained using joint training, and we suggest alternatives to joint training. We hope this work can serve as a refresher on the current speech translation landscape and motivate research into more efficient and creative ways to utilize direct data for speech translation.
1 Introduction
Speech translation (ST) is the task of automatically translating speech in some source language into some other target language (Stentiford and Steer, 1988; Waibel et al., 1991). Traditionally, a cascaded approach is used, where an automatic speech recognition (ASR) system first transcribes the speech, followed by a machine translation (MT) system that translates the transcripts (Sperber and Paulik, 2020). The problem of error propagation has been at the center of discussion in the ST literature (Ney, 1999; Casacuberta et al., 2004; Matusov et al., 2005; Peitz et al., 2012; Sperber et al., 2017), and instead of using the discrete symbols of the source language, ideas like passing n-best lists, lattices, and neural network hidden representations have been investigated (Saleem et al., 2004; Kano et al., 2017; Anastasopoulos and Chiang, 2018; Zhang et al., 2019; Sperber et al., 2019). For a more systematic review of ST development, we refer the readers to Sperber and Paulik (2020).
With recent efforts in the expansion of ST data collection (Di Gangi et al., 2019; Beilharz et al., 2020), more and more direct ST data is available. Such direct data comes as pairs of source speech and target translation, and often as triplets further including source transcriptions.

Various joint training methods have been proposed to use such data to improve cascaded systems, with the hope that uncertainties during transcription can be passed on to translation and resolved there. What we call "joint training" here is often referred to as "end-to-end training" in the literature, where the direct ST data is utilized in the joint optimization of the ASR and MT models (Kano et al., 2017; Berard et al., 2018; Anastasopoulos and Chiang, 2018; Inaguma et al., 2019; Sperber et al., 2019; Bahar et al., 2019; Wang et al., 2020; Bahar et al., 2021). In this work, we revisit the principal question of whether or not joint training really helps cascaded speech translation.
2 Cascaded Approach
In traditional cascaded systems, an ASR model $p(f_1^J \mid x_1^T)$ and an MT model $p(e_1^I \mid f_1^J)$ are trained separately, where we denote speech features as $x_1^T$, transcriptions as $f_1^J$, and translations as $e_1^I$. The decoding is done in two steps:

$$\hat{f}_1^J = \operatorname*{argmax}_{[f_1^J]} \; p(f_1^J \mid x_1^T)$$

$$\hat{e}_1^I = \operatorname*{argmax}_{[e_1^I]} \; p(e_1^I \mid \hat{f}_1^J)$$

The $\operatorname{argmax}$ is approximated using beam search for computational reasons, and we will assume a fixed beam size $N$ for the decoding of both transcriptions and translations.
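As a minimal illustration of this two-step pipeline (all transcripts, translations, and probabilities below are hypothetical toy values, not the authors' systems), note how the second argmax conditions only on the single best transcript, discarding the ASR model's uncertainty:

```python
# Toy sketch of two-step cascaded decoding. The probability tables
# stand in for full ASR/MT models over one utterance; in practice both
# argmax steps would be approximated by beam search.

# p(f | x): ASR posterior over candidate transcripts (hypothetical).
asr_posterior = {
    "i like cats": 0.6,
    "i like hats": 0.3,
    "i like bats": 0.1,
}

# p(e | f): MT posterior over translations given a transcript (hypothetical).
mt_posterior = {
    "i like cats": {"ich mag katzen": 0.9, "ich mag huete": 0.1},
    "i like hats": {"ich mag huete": 0.8, "ich mag katzen": 0.2},
    "i like bats": {"ich mag fledermaeuse": 1.0},
}

def cascade_decode(asr_post, mt_post):
    # Step 1: f_hat = argmax_f p(f | x)
    f_hat = max(asr_post, key=asr_post.get)
    # Step 2: e_hat = argmax_e p(e | f_hat). The ASR score is discarded
    # here, which is exactly where transcript uncertainty is lost.
    e_hat = max(mt_post[f_hat], key=mt_post[f_hat].get)
    return f_hat, e_hat

f_hat, e_hat = cascade_decode(asr_posterior, mt_posterior)
print(f_hat, "->", e_hat)
```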
arXiv:2210.13700v2 [eess.AS] 24 Nov 2022
3 Joint Training Approaches
3.1 Top-K Cascaded Translation
Assume we have pre-trained an ASR and an MT model, and some direct ST training data is available. The pre-trained ASR model is used to produce a $K$-best list of ASR hypotheses $F_1, F_2, \ldots, F_K$ using beam search with beam size $N \geq K$. While there is no unique method to make use of the top-$K$ transcripts, we describe Top-$K$-Train, a straightforward algorithm similar to re-ranking. We obtain the score $\tilde{p}$ for each ASR hypothesis with length normalization and normalize the scores locally within the top-$K$ hypotheses:

$$p(F_k \mid x_1^T) = \frac{\tilde{p}(F_k \mid x_1^T)}{\sum_{k'=1}^{K} \tilde{p}(F_{k'} \mid x_1^T)} \qquad (1)$$
During training, $p(e_i \mid F_k; e_0^{i-1})$ is the MT model output. Given the ASR hypotheses $F_1, \ldots, F_K$, the following training objective is maximized:

$$\log \left( \sum_{k=1}^{K} p(F_k \mid x_1^T) \prod_{i=1}^{I} p(e_i \mid F_k; e_0^{i-1}) \right)$$

We hypothesize that this objective (a) exposes different transcriptions and potential ASR errors to the MT model and (b) encourages the ASR model to produce hypotheses closer to the expectations of the MT model, thus reducing model discrepancy. Since discrete ASR hypotheses are passed to the MT model from a previous beam search, the error signal to the ASR model is passed via the renormalized transcript scores during backpropagation.
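The marginalized objective can be sketched numerically as follows (a toy computation with made-up scores for $K = 2$ hypotheses and a 3-token target; not the authors' implementation, which would operate on model logits):

```python
import math

# Hypothetical renormalized ASR scores p(F_k | x) from Equation 1, and
# per-token MT probabilities p(e_i | F_k, e_<i) for the same reference
# target sentence under each of the K = 2 transcript hypotheses.
asr_scores = [0.7, 0.3]
mt_token_probs = [
    [0.9, 0.8, 0.95],  # target tokens scored given hypothesis F_1
    [0.5, 0.4, 0.60],  # the same target tokens given hypothesis F_2
]

def top_k_train_objective(asr_scores, mt_token_probs):
    # log sum_k p(F_k | x) * prod_i p(e_i | F_k, e_<i):
    # the transcript is marginalized out, so hypotheses the MT model
    # translates well receive more of the probability mass.
    marginal = sum(
        p_f * math.prod(tok_probs)
        for p_f, tok_probs in zip(asr_scores, mt_token_probs)
    )
    return math.log(marginal)

loss = -top_k_train_objective(asr_scores, mt_token_probs)
print(round(loss, 4))
```

In a real system one would work in log space (e.g. with a log-sum-exp over per-hypothesis log scores) to avoid underflow on long sentences.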
Similarly, we introduce Top-$K$-Search. We obtain an MT hypothesis $E_k$ for each $F_k$ using beam search. The final hypothesis is obtained as:

$$\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{E_k} \left\{ p(F_k \mid x_1^T) \cdot p(E_k \mid F_k) \right\}$$

Here, $p(F_k \mid x_1^T)$ is obtained as in Equation 1 and $p(E_k \mid F_k)$ is the length-normalized translation score from the MT model. Observe that this search is applicable to any cascade architecture and is thus independent of the training criterion. In our experiments, we always use Top-$K$-Search when decoding models trained with Top-$K$-Train. The idea of generating the top-$K$ ASR hypotheses during search has also been explored in the literature (e.g. Section 3.3).
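A minimal sketch of this selection step (with hypothetical candidate tuples; the real system would obtain each $E_k$ by beam search):

```python
# Top-K-Search sketch: combine renormalized ASR scores with
# length-normalized MT scores and keep the translation whose joint
# score is highest. Tuples are (F_k, p(F_k | x), E_k, p(E_k | F_k)),
# all values hypothetical.
candidates = [
    ("i like cats", 0.55, "ich mag katzen", 0.70),
    ("i like hats", 0.35, "ich mag huete", 0.90),
    ("i like bats", 0.10, "ich mag fledermaeuse", 0.95),
]

def top_k_search(candidates):
    # argmax over k of p(F_k | x) * p(E_k | F_k)
    best = max(candidates, key=lambda c: c[1] * c[3])
    return best[2]

print(top_k_search(candidates))
```

Note that a transcript ranked lower by the ASR model can still win the joint argmax if the MT model is sufficiently more confident in its translation, which is how this search differs from plain 1-best cascading.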
3.2 Tight-Integration
Another way to train the cascade architecture using direct ST data is the tight integrated cascade approach (Bahar et al., 2021). We introduce an exponent $\gamma$ that controls the sharpness of the distribution of the conditional probabilities. Thus, instead of passing the 1-best hypothesis of the ASR system as a sequence of one-hot vectors, we pass the renormalized probabilities to the MT model:

$$p(f_j \mid f_1^{j-1}; x_1^T) = \frac{\tilde{p}^{\gamma}(f_j \mid f_1^{j-1}; x_1^T)}{\sum_{f' \in V_F} \tilde{p}^{\gamma}(f' \mid f_1^{j-1}; x_1^T)}$$

Here, $V_F$ is the vocabulary of the ASR system.
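The renormalization itself is a simple temperature-like sharpening; a sketch (with a hypothetical 3-token vocabulary posterior):

```python
def sharpen(probs, gamma):
    # Renormalize a token distribution with exponent gamma, as in the
    # tight-integration formula: gamma = 1 keeps the distribution
    # unchanged, larger gamma shifts mass toward the 1-best token, and
    # gamma -> infinity approaches the one-hot 1-best vector.
    powered = [p ** gamma for p in probs]
    z = sum(powered)
    return [p / z for p in powered]

# Hypothetical ASR posterior over a tiny vocabulary at one position j.
probs = [0.6, 0.3, 0.1]
print(sharpen(probs, 1.0))  # ~ [0.6, 0.3, 0.1]
print(sharpen(probs, 2.0))  # sharper: more mass on the top token
```

Passing these soft vectors to the MT input embedding keeps the pipeline differentiable, so the ST loss can backpropagate into the ASR model.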
3.3 Searchable Hidden Intermediates
Dalmia et al. (2021) propose passing the final decoder representations of the $N$-best ASR hypotheses (i.e. the searchable hidden intermediates) directly to the MT system, bypassing the MT input embedding.

Additionally, they extend the multi-task learning approach by allowing the MT decoder to attend to the ASR encoder states, which in turn are optimized using beam search in training. They show that during decoding, a higher ASR beam size indeed leads to better ST performance.
4 Experimental Results and Analysis
We focus on the MuST-C English-German speech translation task (Di Gangi et al., 2019) in the domain of TED talks and evaluate on test-HE and test-COMMON. We use an in-house filtered subset of the IWSLT 2021 English-German dataset as in Bahar et al. (2021), which contains 1.9M segments (2300 hours) of ASR data and 24M parallel sentences of MT data. The in-domain ASR data comprises MuST-C, TED-LIUM, and IWSLT TED, while the out-of-domain ASR data consists of EuroParl, How2, LibriSpeech, and Mozilla Common Voice. For translation, the dataset contains 24M parallel sentences of in-domain translation data (MuST-C, TED-LIUM, and IWSLT TED), as well as out-of-domain translation data (NewsCommentary, EuroParl, WikiTitles, ParaCrawl, CommonCrawl, Rapid, OpenSubtitles2018). For ST data, we only use MuST-C. We provide further details in Appendix A. Depending on whether or not the models are fine-tuned on in-domain ASR and MT data, we split our experiments into two sets: A1-A5 and B1-B5.