
CS and focus mostly on inter-sentential CS.
Our contributions are the following:
To deal with inter-sentential CS, instead of first recognizing which language
is spoken and then applying a language-dependent recognizer and a subsequent
translation component (or a speech recognition or speech translation component
as in figure 1, left), we propose a Language Agnostic end-to-end Speech
Translation (LAST) model, which treats speech recognition and speech
translation as one unified end-to-end speech translation problem (see figure 1,
right). By training with both input languages, we decode speech into one
output target language, regardless of whether the input speech is in the same
or a different language. The unified system delivers comparable recognition
and speech translation accuracy in monolingual usage, while considerably
reducing latency and error rate when CS occurs. Furthermore, the pipeline is
simplified considerably.
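
As an illustration, the following sketch shows how such a language-agnostic
training set can be assembled. The data layout and function name are
illustrative assumptions, not the actual implementation: German ASR examples
and English-to-German ST examples are merged into one pool of (audio, German
text) pairs, so no language label is attached to any example.

    # Illustrative sketch (assumed data layout, not the paper's implementation):
    # every training example maps audio to German text, no matter whether the
    # audio is German (ASR data) or English (ST data), so no language label or
    # language identification step is needed.
    def build_last_training_set(asr_examples, st_examples):
        # asr_examples: iterable of (german_audio, german_transcript) pairs
        # st_examples:  iterable of (english_audio, german_translation) pairs
        unified = []
        for audio, de_text in asr_examples:   # German speech -> German text
            unified.append((audio, de_text))
        for audio, de_text in st_examples:    # English speech -> German text
            unified.append((audio, de_text))
        return unified                        # train one seq2seq model on this pool

At inference time the same model is decoded once, without any preceding
language identification step.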
This is shown by evaluating on a testset denoted tst-inter, which we created
from tst-COMMON for language agnostic speech translation and in which the
audio contains language switches. While performing comparably on ASR and
speech translation (ST) testsets, LAST increases performance by 7.3 BLEU on
tst-inter compared to a human-annotated LID followed by an ASR or ST model.
Furthermore, we use a data augmentation strategy to increase performance on
utterances that contain multiple input languages. With this augmentation
strategy, which concatenates the audio and corresponding labels of multiple
utterances with different source languages into one new utterance, the
performance of LAST increases by 3.3 BLEU on tst-inter.
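
The following sketch illustrates this augmentation under simple assumptions
(waveform arrays with a shared sample rate and German target strings; all
names are illustrative, not the actual implementation):

    import random
    import numpy as np

    # Illustrative sketch: join utterances from different source languages into
    # one new utterance and join their German targets accordingly.
    def concat_augment(en_examples, de_examples, n_new, seed=0):
        rng = random.Random(seed)
        augmented = []
        for _ in range(n_new):
            en_wav, en_tgt = rng.choice(en_examples)    # English source utterance
            de_wav, de_tgt = rng.choice(de_examples)    # German source utterance
            pair = [(en_wav, en_tgt), (de_wav, de_tgt)]
            rng.shuffle(pair)                           # random language order
            wav = np.concatenate([w for w, _ in pair])  # concatenate the audio
            tgt = " ".join(t for _, t in pair)          # concatenate German labels
            augmented.append((wav, tgt))
        return augmented

Since both targets are already in the target language, no language tag has to
be inserted at the switch point.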
The paper is structured as follows: in section 2 we discuss related work, in
section 3 we describe the data and model used and report results and
limitations, and in section 4 we draw conclusions.
2. RELATED WORK
Since only a few CS datasets are available, there has not been much research
on language pairs without such data. [7] propose a model whose target labels
are the union of the graphemes of all languages plus language-specific tags.
To improve performance on CS, they suggest artificially generating training
data that contains CS utterances. To achieve this, they combine full-length
utterances of different languages; when concatenating the corresponding
targets, the language-specific token is also added before the target sequence
of the respective utterance. For our LAST approach this is not necessary,
since we have only one language in the label. The authors of [8] use ASR
models with a separate TDNN-LSTM [9] as an acoustic model, as well as a
separate language model. Thus they are able to utilize CS speech-only data to
enhance the acoustic model, and they enhance their language model with CS
text-only data that they artificially created using different approaches.
Most of the work on CS, however, focuses on language pairs where some
transcribed CS data is available. In [10] the authors aim to improve CS
performance using a multi-task learning (MTL) approach; they investigate
training a model that predicts a sequence of labels as well as a language
identifier at different levels. [11] propose to use the Learning without
Forgetting [12] framework to adapt a model trained on monolingual data to CS.
In [13] the authors propose to train a CTC model [14] for speech recognition
and to linearly adjust the posteriors using a frame-level language
identification model. The authors of [15] modify the self-attention of the
decoder to reduce multilingual context confusion and to improve the
performance of the CS ASR model.
Most similar to our work is the model E2E BIDIRECT SHARED of [16, figure 3G].
However, in contrast to our work, [16] use CS data for which they need
transcriptions and translations, as well as annotations of which words belong
to which language, and they focus on intra-sentential CS. Furthermore, they
first generate a transcription and therefore have to explicitly detect which
language is spoken in each part of the audio. Errors in this transcription
step can lead to worse translation performance.
Corpus                           Utterances    Speech data [h]
A: Training Data: ASR            949k          1825
   Europarl                      64k           148
   Librivox                      225k          512
   Common Voice                  511k          685
   LT                            149k          480
B: Training Data: ST             1196k         1995
   MuST-C v1                     230k          400
   MuST-C v2                     251k          450
   Europarl-ST                   33k           77
   ST TED                        142k          210
   CoVoST v2                     272k          404
   TED LIUM                      268k          454
C: Test Data
   tst-COMMON (EN to DE)         2580          4.2
   tst2013 (DE to DE)            1369          1.9
   tst2014 (DE to DE)            1414          2.5
   tst2015 (DE to DE)            4486          3.0
   tst-inter (EN and DE to DE)   284 (746)     0.9
Table 1. Summary of the datasets used for training and testing. The tst-inter
dataset we created contains 746 segments when split by language.