CODE-SWITCHING WITHOUT SWITCHING:
LANGUAGE AGNOSTIC END-TO-END SPEECH TRANSLATION
Christian Huber1, Enes Yavuz Ugan1, and Alexander Waibel1,2
1Interactive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany
2Carnegie Mellon University, Pittsburgh PA, USA
firstname.lastname@kit.edu, alexander.waibel@cmu.edu
ABSTRACT
We propose a) a Language Agnostic end-to-end Speech Translation model (LAST), and b) a data augmentation strategy to increase code-switching (CS) performance.
With increasing globalization, multiple languages are increasingly used interchangeably during fluent speech. Such CS complicates traditional speech recognition and translation, as one must first recognize which language was spoken and then apply a language-dependent recognizer and a subsequent translation component to generate the desired target-language output. Such a pipeline introduces latency and errors. In this paper, we eliminate the need for it by treating speech recognition and translation as one unified end-to-end speech translation problem. By training LAST with both input languages, we decode speech into one target language, regardless of the input language. LAST delivers comparable recognition and speech translation accuracy in monolingual usage, while reducing latency and error rate considerably when CS is observed.
Index Terms— speech translation, language agnostic input
1. INTRODUCTION
Due to increasing globalization, multiple languages are increasingly used interchangeably during fluent speech. This is referred to as code-switching (CS).
From a linguistic perspective, CS can be divided into multiple categories [1]:
Inter-sentential CS: The switch between languages happens at sentence boundaries. Usually, the speaker is aware of the language shift.
Intra-sentential CS: Here the second language is included in the middle of the sentence. This switch mostly occurs without the speaker being aware of it. Additionally, the word borrowed from the second language may also be adapted to the grammar of the first language.
[Figure 1: the example input "I'm Jessi und das hier ist mein Koffer." is processed on the left by the baseline (LID, then ASR or ST) and on the right by LAST; both output "Ich bin Jessi und das hier ist mein Koffer."]
Fig. 1. Illustration of the information flow. Left: baseline: language identification (LID) followed by either the speech translation (ST) or automatic speech recognition (ASR) model. Right: our LAST approach. For the green boxes, we use a transformer-based encoder-decoder model. The transcripts in the middle and at the bottom are only shown for illustration purposes (the models don't have access to them).
Extra-sentential CS: In this case, a tag element from a second language is included, for example at the end of a sentence. Such a word remains more detached from the main language.
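To make the baseline in figure 1 (left) concrete, here is a minimal sketch of such a cascade; the lid, asr, and st callables are hypothetical placeholders for the three components, not the authors' actual models:

```python
def baseline_pipeline(audio, lid, asr, st):
    # Cascade from figure 1 (left): first identify the spoken language,
    # then route the audio to the matching component. Both paths end in
    # German text, but the extra LID step adds latency, and an LID error
    # sends the audio to the wrong model.
    language = lid(audio)     # e.g. "de" or "en"
    if language == "de":
        return asr(audio)     # German speech -> German transcript
    return st(audio)          # English speech -> German translation
```

With inter-sentential CS, this routing decision has to be made again at every language switch, which is exactly the overhead the unified approach removes.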
As of today, there are only a few CS datasets. Some example corpora available are [2] for CS between French and Algerian speech, SEAME [3] containing utterances switching between Mandarin and English, and the data gathered in [4] with CS between English and Cantonese. The Fisher CS dataset [5] and the Bangor Miami CS dataset [6] contain CS automatic speech recognition (ASR) transcripts in English and Spanish and their translations.
Since these datasets are limited in size and available languages, we instead train our model with data not containing CS and focus mostly on inter-sentential CS.
Our contributions are the following:
To deal with inter-sentential CS, instead of recognizing which language was spoken first and then applying a language-dependent recognizer and subsequent translation component (or applying a speech recognition or speech translation component as in figure 1, left), we propose a Language Agnostic end-to-end Speech Translation model (LAST) which treats speech recognition and speech translation as one unified end-to-end speech translation problem (see figure 1, right). By training with both input languages, we decode speech into one output target language, regardless of whether the input speech is from the same or a different language. The unified system delivers comparable recognition and speech translation accuracy in monolingual usage, while reducing latency and error rate considerably when CS is observed. Furthermore, the pipeline is simplified considerably.
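The paper does not include code; as an illustration only, a minimal PyTorch sketch of such a unified model could look as follows, where all layer sizes and the single German subword vocabulary are assumptions made for the example:

```python
import torch.nn as nn

class LASTSketch(nn.Module):
    """One encoder-decoder for both input languages: English or German
    log-mel features go in, German subword tokens come out, so no
    language identification step is needed."""

    def __init__(self, n_mels=80, d_model=256, vocab_size=8000):
        super().__init__()
        self.frontend = nn.Linear(n_mels, d_model)      # acoustic projection
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        self.embed = nn.Embedding(vocab_size, d_model)  # German targets only
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tgt_ids):
        # mel: (batch, frames, n_mels); tgt_ids: (batch, len) German token ids.
        # Training mixes ASR pairs (German audio, German text) with ST pairs
        # (English audio, German text), so the target side is always German.
        mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        dec = self.transformer(self.frontend(mel), self.embed(tgt_ids),
                               tgt_mask=mask)
        return self.out(dec)                            # logits over vocab
```

Because the decoder only ever produces German text, the model never has to decide explicitly which language it is hearing.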
We show this by evaluating on a test set denoted tst-inter, which we created from tst-COMMON for language agnostic speech translation; its audio contains language switches. While performing comparably on ASR and speech translation (ST) test sets, LAST increases performance by 7.3 BLEU on tst-inter, compared to a human-annotated LID followed by an ASR or ST model. Furthermore, we use a data augmentation strategy to increase performance on utterances which contain multiple input languages: concatenating the audio and corresponding labels of multiple utterances with different source languages into one new utterance. With this augmentation strategy, the performance of LAST increases by 3.3 BLEU on tst-inter.
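A minimal sketch of this concatenation augmentation, shown here for two utterances and assuming a hypothetical data layout in which each example is a (waveform tensor, German target text) pair:

```python
import random
import torch

def concat_augment(de_examples, en_examples):
    # Stitch one German-source and one English-source utterance together
    # in random order, simulating an inter-sentential language switch.
    # Both labels are already German, so the new target is simply their
    # concatenation -- no language tags are needed.
    pair = [random.choice(de_examples), random.choice(en_examples)]
    random.shuffle(pair)
    audio = torch.cat([wav for wav, _ in pair])   # one longer waveform
    text = " ".join(txt for _, txt in pair)       # German throughout
    return audio, text
```

The tst-inter test set described above likewise contains audio that switches language within an utterance.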
The paper is structured as follows: in the following section we look at related work, in section 3 we report the data used, the model, results and limitations, and in section 4 we conclude.
2. RELATED WORK
Since there are only a few CS datasets available, there has been little research on language pairs without such data. [7] propose a model which has the union of the graphemes of all languages, plus language-specific tags, as target labels. To gain performance on the CS task, they suggest artificially generating training data that contains CS utterances by combining full-length utterances of different languages; when concatenating the corresponding targets, the language-specific token is also added before the target sequence of the respective utterance. For our LAST approach, this is not necessary, since we have only one language in the label. The authors of [8] used ASR models with a separate TDNN-LSTM [9] acoustic model as well as a separate language model. They are thus able to utilize CS speech-only data to enhance the acoustic model, and used artificially created CS text-only data, generated with different approaches, to enhance their language model.
Most of the work on CS, however, focuses on language pairs where some transcribed CS data is available. In [10] the authors aim at improving CS performance using a multi-task learning (MTL) approach, investigating training a model that predicts a sequence of labels as well as a language identifier at different levels. [11] propose to use the Learning without Forgetting [12] framework to adapt a model trained on monolingual data to CS. In [13] the authors propose to train a CTC model [14] for speech recognition and to linearly adjust the posteriors using a frame-level language identification model. The authors of [15] modify the self-attention of the decoder to reduce multilingual context confusion and improve the performance of the CS ASR model.
Most similar to our work is the model E2E BIDIRECT SHARED of [16, figure 3G]. However, in contrast to our work, [16] uses CS data, for which they need transcriptions and translations as well as annotations of which words are from which language, and they focus on intra-sentential CS. Furthermore, they first generate a transcription and therefore have to explicitly detect which language is spoken in each part of the audio. Errors in the transcription step can lead to worse translation performance.
Corpus                          Utterances   Speech data [h]
A: Training Data: ASR              949k          1825
   Europarl                         64k           148
   Librivox                        225k           512
   Common Voice                    511k           685
   LT                              149k           480
B: Training Data: ST              1196k          1995
   MuST-C v1                       230k           400
   MuST-C v2                       251k           450
   Europarl-ST                      33k            77
   ST TED                          142k           210
   CoVoST v2                       272k           404
   TED LIUM                        268k           454
C: Test Data
   tst-COMMON (EN to DE)           2580           4.2
   tst2013 (DE to DE)              1369           1.9
   tst2014 (DE to DE)              1414           2.5
   tst2015 (DE to DE)              4486           3.0
   tst-inter (EN and DE to DE)     284 (746)      0.9
Table 1. Summary of the datasets used for training and testing. The tst-inter dataset we created contains 746 segments when splitting by the languages.