Does Joint Training Really Help Cascaded Speech Translation?
Viet Anh Khoa Tran, David Thulke, Yingbo Gao, Christian Herold, Hermann Ney
Human Language Technology and Pattern Recognition Group
Computer Science Department
RWTH Aachen University
D-52056 Aachen, Germany
{vtran|thulke|gao|herold|ney}@i6.informatik.rwth-aachen.de
Abstract
Currently, in speech translation, the straightforward approach of cascading a recognition system with a translation system delivers state-of-the-art results. However, fundamental challenges such as error propagation from the automatic speech recognition system still remain. To mitigate these problems, attention has recently turned to direct data, and various joint training methods have been proposed. In this work, we seek to answer the question of whether joint training really helps cascaded speech translation. We review recent papers on the topic and also investigate a joint training criterion that marginalizes over the transcription posterior probabilities. Our findings show that a strong cascaded baseline can diminish any improvements obtained using joint training, and we suggest alternatives to joint training. We hope this work can serve as a refresher of the current speech translation landscape and motivate research into more efficient and creative ways to utilize the direct data for speech translation.
1 Introduction
Speech translation (ST) is the task of automatically translating speech in some source language into some other target language (Stentiford and Steer, 1988; Waibel et al., 1991). Traditionally, a cascaded approach is used, where an automatic speech recognition (ASR) system first transcribes the speech, followed by a machine translation (MT) system that translates the transcripts (Sperber and Paulik, 2020). The problem of error propagation has been at the center of discussion in the ST literature (Ney, 1999; Casacuberta et al., 2004; Matusov et al., 2005; Peitz et al., 2012; Sperber et al., 2017), and instead of passing only the discrete symbols of the source language between the two systems, ideas such as using n-best lists, lattices, and neural network hidden representations have been investigated (Saleem et al., 2004; Kano et al., 2017; Anastasopoulos and Chiang, 2018; Zhang et al., 2019; Sperber et al., 2019). For a more systematic review of ST development, we refer the readers to Sperber and Paulik (2020).
With recent efforts in expanding ST data collection (Di Gangi et al., 2019; Beilharz et al., 2020), more and more direct ST data is available. Such direct data comes as pairs of source speech and target translation, and often as triplets further including source transcriptions.
Various joint training methods have been proposed to use such data to improve cascaded systems, with the hope that uncertainties during transcription can be passed on to translation to be resolved there. Here, what we call "joint training" is often referred to as "end-to-end training" in the literature, where the direct ST data is utilized in the joint optimization of the ASR and MT models (Kano et al., 2017; Berard et al., 2018; Anastasopoulos and Chiang, 2018; Inaguma et al., 2019; Sperber et al., 2019; Bahar et al., 2019; Wang et al., 2020; Bahar et al., 2021).
In this work, we revisit the principal question of
whether or not joint training really helps cascaded
speech translation.
2 Cascaded Approach
In traditional cascaded systems, an ASR model $p(f_1^J \mid x_1^T)$ and an MT model $p(e_1^I \mid f_1^J)$ are trained separately, where we denote speech features as $x_1^T$, transcriptions as $f_1^J$, and translations as $e_1^I$. The decoding is done in two steps:

$$\hat{f}_1^J = \operatorname*{argmax}_{[f_1^J]} \; p(f_1^J \mid x_1^T)$$

$$\hat{e}_1^I = \operatorname*{argmax}_{[e_1^I]} \; p(e_1^I \mid \hat{f}_1^J)$$

The $\operatorname{argmax}$ is approximated using beam search for computational reasons, and we will assume a fixed beam size $N$ for the decoding of both transcriptions and translations.
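As an illustration, a minimal Python sketch of this two-step decoding follows. The `beam_search` helper and the `asr_step`/`mt_step` scoring interfaces are our own assumptions for exposition, not the implementation used in our experiments.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a step function maps a prefix of output tokens
# to the top candidate (token, log-probability) continuations.
StepFn = Callable[[Sequence[int]], List[Tuple[int, float]]]


def beam_search(step: StepFn, eos: int, beam_size: int, max_len: int) -> List[int]:
    """Approximate the argmax over output sequences with a fixed beam size N."""
    beams: List[Tuple[List[int], float]] = [([], 0.0)]  # (prefix, log-prob)
    finished: List[Tuple[List[int], float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step(prefix):
                candidates.append((prefix + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates:
            if prefix[-1] == eos:
                finished.append((prefix, score))   # complete hypothesis
            elif len(beams) < beam_size:
                beams.append((prefix, score))      # keep the N best prefixes
        if not beams:                              # all hypotheses finished
            break
    return max(finished or beams, key=lambda c: c[1])[0]


def cascaded_decode(x, asr_step: Callable, mt_step: Callable,
                    eos: int, beam_size: int, max_len: int) -> List[int]:
    # Step 1: transcribe, f_hat = argmax_f p(f | x), via beam search.
    f_hat = beam_search(lambda prefix: asr_step(x, prefix), eos, beam_size, max_len)
    # Step 2: translate the 1-best transcript, e_hat = argmax_e p(e | f_hat).
    return beam_search(lambda prefix: mt_step(f_hat, prefix), eos, beam_size, max_len)
```

Note that only the single best transcription $\hat{f}_1^J$ crosses the ASR-MT interface, so all transcription uncertainty is discarded at this point; this is precisely where error propagation enters the cascade.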