
CS and focus mostly on inter-sentential CS.
Our contributions are the following:
To deal with inter-sentential CS, instead of first recognizing which language
is spoken and then applying a language-dependent recognizer and a subsequent
translation component (or a speech recognition or speech translation component
as in figure 1, left), we propose a Language Agnostic end-to-end Speech
Translation (LAST) model, which treats speech recognition and speech
translation as one unified end-to-end speech translation problem (see figure 1,
right). By training with both input languages, we decode speech into one
output target language, regardless of whether the input speech is in the same
or a different language. The unified system delivers comparable recognition
and speech translation accuracy in monolingual usage, while considerably
reducing latency and error rate when CS occurs. Furthermore, the pipeline is
simplified considerably.
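
As an illustration, the following sketch shows how such a language-agnostic
training set can be assembled. The data layout and function name are
illustrative assumptions, not the actual implementation: German ASR examples
and English-to-German ST examples are merged into one pool of (audio, German
text) pairs, so no language label is attached to any example.

    # Illustrative sketch (assumed data layout, not the paper's implementation):
    # every training example maps audio to German text, no matter whether the
    # audio is German (ASR data) or English (ST data), so no language label or
    # language identification step is needed.
    def build_last_training_set(asr_examples, st_examples):
        # asr_examples: iterable of (german_audio, german_transcript) pairs
        # st_examples:  iterable of (english_audio, german_translation) pairs
        unified = []
        for audio, de_text in asr_examples:   # German speech -> German text
            unified.append((audio, de_text))
        for audio, de_text in st_examples:    # English speech -> German text
            unified.append((audio, de_text))
        return unified                        # train one seq2seq model on this pool

At inference time the same model is decoded once, without any preceding
language identification step.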
This is shown by evaluating on a testset denoted tst-inter, which we created
from tst-COMMON for language agnostic speech translation and in which the
audio contains language switches. While performing comparably on ASR and
speech translation (ST) testsets, LAST increases performance by 7.3 BLEU on
tst-inter compared to a human-annotated LID followed by an ASR or ST model.
Furthermore, we use a data augmentation strategy to increase performance on
utterances that contain multiple input languages. With this augmentation
strategy, which concatenates the audio and corresponding labels of multiple
utterances with different source languages into one new utterance, the
performance of LAST increases by 3.3 BLEU on tst-inter.
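
The following sketch illustrates this augmentation under simple assumptions
(waveform arrays with a shared sample rate and German target strings; all
names are illustrative, not the actual implementation):

    import random
    import numpy as np

    # Illustrative sketch: join utterances from different source languages into
    # one new utterance and join their German targets accordingly.
    def concat_augment(en_examples, de_examples, n_new, seed=0):
        rng = random.Random(seed)
        augmented = []
        for _ in range(n_new):
            en_wav, en_tgt = rng.choice(en_examples)    # English source utterance
            de_wav, de_tgt = rng.choice(de_examples)    # German source utterance
            pair = [(en_wav, en_tgt), (de_wav, de_tgt)]
            rng.shuffle(pair)                           # random language order
            wav = np.concatenate([w for w, _ in pair])  # concatenate the audio
            tgt = " ".join(t for _, t in pair)          # concatenate German labels
            augmented.append((wav, tgt))
        return augmented

Since both targets are already in the target language, no language tag has to
be inserted at the switch point.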
The paper is structured as follows: in section 2 we discuss related work, in
section 3 we describe the data and model used and report results and
limitations, and in section 4 we draw conclusions.
2. RELATED WORK
Since only a few CS datasets are available, there has not been much research
on language pairs without such data. [7] propose a model whose target labels
are the union of the graphemes of all languages plus language-specific tags.
To improve performance on CS, they suggest artificially generating training
data that contains CS utterances. To achieve this, they combine full-length
utterances of different languages; when concatenating the corresponding
targets, the language-specific token is also added before the target sequence
of the respective utterance. For our LAST approach this is not necessary,
since we have only one language in the label. The authors of [8] use ASR
models with a separate TDNN-LSTM [9] as an acoustic model, as well as a
separate language model. Thus they are able to utilize CS speech-only data to
enhance the acoustic model, and they enhance their language model with CS
text-only data that they artificially created using different approaches.
Most of the work on CS, however, focuses on language pairs where some
transcribed CS data is available. In [10] the authors aim to improve CS
performance using a multi-task learning (MTL) approach; they investigate
training a model that predicts a sequence of labels as well as a language
identifier at different levels. [11] propose to use the Learning without
Forgetting [12] framework to adapt a model trained on monolingual data to CS.
In [13] the authors propose to train a CTC model [14] for speech recognition
and to linearly adjust the posteriors using a frame-level language
identification model. The authors of [15] modify the self-attention of the
decoder to reduce multilingual context confusion and to improve the
performance of the CS ASR model.
Most similar to our work is the model E2E BIDIRECT SHARED of [16, figure 3G].
However, in contrast to our work, [16] use CS data for which they need
transcriptions and translations, as well as annotations of which words belong
to which language, and they focus on intra-sentential CS. Furthermore, they
first generate a transcription and therefore have to explicitly detect which
language is spoken in each part of the audio. Errors in this transcription
step can lead to worse translation performance.
Corpus                           Utterances    Speech data [h]
A: Training Data: ASR            949k          1825
   Europarl                      64k           148
   Librivox                      225k          512
   Common Voice                  511k          685
   LT                            149k          480
B: Training Data: ST             1196k         1995
   MuST-C v1                     230k          400
   MuST-C v2                     251k          450
   Europarl-ST                   33k           77
   ST TED                        142k          210
   CoVoST v2                     272k          404
   TED LIUM                      268k          454
C: Test Data
   tst-COMMON (EN to DE)         2580          4.2
   tst2013 (DE to DE)            1369          1.9
   tst2014 (DE to DE)            1414          2.5
   tst2015 (DE to DE)            4486          3.0
   tst-inter (EN and DE to DE)   284 (746)     0.9
Table 1. Summary of the datasets used for training and testing. The tst-inter
dataset we created contains 746 segments when split by language.