
Simultaneous Translation for Unsegmented Input:
A Sliding Window Approach
Sukanta Sen∗
University of Edinburgh
sukantasen10@gmail.com
∗Work performed while at the University of Edinburgh.
Ondřej Bojar
Charles University, MFF ÚFAL
bojar@ufal.mff.cuni.cz
Barry Haddow
University of Edinburgh
bhaddow@ed.ac.uk
Abstract
In the cascaded approach to spoken language translation (SLT), the ASR output is typically punctuated and segmented into sentences before being passed to MT, since the latter is typically trained on written text. However, erroneous segmentation, due to poor sentence-final punctuation by the ASR system, leads to degradation in translation quality, especially in the simultaneous (online) setting where the input is continuously updated. To reduce the influence of automatic segmentation, we present a sliding window approach to translate raw ASR outputs (online or offline) without needing to rely on an automatic segmenter. We train translation models using parallel windows (instead of parallel sentences) extracted from the original training data. At test time, we translate at the window level and join the translated windows using a simple approach to generate the final translation. Experiments on English-to-German and English-to-Czech show that our approach improves by 1.3–2.0 BLEU points over the usual ASR-segmenter pipeline, and that the fixed-length window considerably reduces flicker compared to a baseline retranslation-based online SLT system.
1 Introduction
For machine translation (MT) with textual input, it is usual to segment the text into sentences before translation, with the boundaries of sentences in most text types indicated by punctuation. For spoken language translation (SLT), in contrast, the input is audio, so there is no punctuation provided to assist segmentation. Segmentation thus has to be guessed by the ASR system or a separate component. Perhaps more importantly, for many speech genres the input cannot easily be segmented into well-formed sentences as found in MT training data, giving a mismatch between training and test.
In order to address the segmentation problem in SLT, systems often include a segmentation component in their pipeline, e.g. Cho et al. (2017). In other words, a typical cascaded SLT system consists of automatic speech recognition (ASR, which outputs lowercased, unpunctuated text), a punctuator/segmenter (which adds punctuation and so defines segments), and an MT system. The segmenter can be a sequence-to-sequence model, and its training data is easily synthesised from punctuated text. However, adding segmentation as an extra step has the disadvantage of introducing an extra component to be managed and deployed. Furthermore, errors in segmentation have been shown to contribute significantly to overall errors in SLT (Li et al., 2021), since neural MT is known to be susceptible to degradation from noisy input (Khayrallah and Koehn, 2018).
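To make the synthesis step concrete, the sketch below shows one way such segmenter training pairs could be derived from punctuated text. This is a minimal illustration of the general idea; the function `make_segmenter_example` and its details are our own assumptions, not the implementation of any cited system.

```python
import re

def make_segmenter_example(punctuated_text: str):
    """Build one (input, target) pair for training a punctuator/segmenter."""
    # Simulate raw ASR output: strip punctuation and lowercase.
    asr_like = re.sub(r"[^\w\s]", "", punctuated_text).lower()
    asr_like = re.sub(r"\s+", " ", asr_like).strip()
    # The target is the original punctuated text, from which
    # segment boundaries can be read off.
    return asr_like, punctuated_text

src, tgt = make_segmenter_example("Hello there. How are you?")
# src: "hello there how are you"
# tgt: "Hello there. How are you?"
```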
These issues with segmentation can be exacerbated in the online or simultaneous setting. This is an important use case for SLT, where we want to produce translations from live speech as the speaker is talking. To minimise the latency of the translation, we would like to start translating before the speaker has finished their sentence. Some online low-latency ASR approaches will also revise their output after it has been produced, creating additional difficulties for the downstream components. In this scenario, the segmentation into sentences will be more uncertain, and we are faced with the choice of waiting for the input to stabilise (thus increasing latency) or translating early (potentially introducing more errors, or having to correct the output when the ASR is extended and updated).
To address the segmentation issue in SLT, Li et al. (2021) proposed a data augmentation technique which simulates bad segmentation in the training data. They concatenate two adjacent source sentences (along with the corresponding targets) and then truncate the start and end of the concatenated sentences proportionally.
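As a sketch of this style of augmentation, the following applies word-level truncation with randomly chosen fractions; the function name `truncate_concat` and the exact truncation scheme are our assumptions, not necessarily Li et al.'s precise recipe.

```python
import random

def truncate_concat(src1, tgt1, src2, tgt2, max_frac=0.3):
    """Concatenate two adjacent sentence pairs, then cut the start and
    end of the result to simulate badly segmented test-time input."""
    src = (src1 + " " + src2).split()
    tgt = (tgt1 + " " + tgt2).split()
    # Fractions of tokens to drop from the head and the tail.
    head = random.uniform(0.0, max_frac)
    tail = random.uniform(0.0, max_frac)

    def cut(tokens):
        n = len(tokens)
        return tokens[int(n * head):n - int(n * tail)]

    # The same fractions are applied to source and target, so the
    # truncation is proportional on both sides of the parallel pair.
    return " ".join(cut(src)), " ".join(cut(tgt))

src, tgt = truncate_concat("the cat sat", "die Katze sass",
                           "on the mat", "auf der Matte")
```

Training on such truncated pairs exposes the MT model to inputs whose boundaries do not align with sentence boundaries.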