robust models and better segmentation strategies
and that an isolated comparison may not be
informative enough. Consequently, we explore
different combinations of these two approaches
for two different SLT models and present results
in five language pairs. Figure 1 shows the four
segmentation methods we study in this work (see
also Section 3.3). Our experiments follow the
popular retranslation approach (Niehues et al.,
2016, 2018; Arivazhagan et al., 2020a,b) where
a partial segment is retranslated every time new
audio becomes available. Retranslation has the
advantage of being a simple approach to online
SLT, which can use a standard MT inference
engine. As a side-effect, the previous translation
can change in later retranslations and the resulting
“flicker” (i.e. sudden translation changes in the
output of previous time steps) is also considered in
our evaluation of different strategies.
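To make the retranslation setup concrete, the following minimal Python sketch (illustrative only; the translate callback stands in for an actual SLT engine and is not part of our system) shows how the growing audio prefix is retranslated whenever new audio arrives and how the output of earlier time steps can change, i.e. flicker:

from typing import Callable, List

def retranslate_stream(
    audio_chunks: List[bytes],
    translate: Callable[[bytes], str],
) -> List[str]:
    # Retranslation baseline: each time a new chunk arrives, the whole
    # partial segment received so far is translated from scratch, so
    # earlier outputs may change ("flicker").
    outputs: List[str] = []
    prefix = b""
    for chunk in audio_chunks:
        prefix += chunk                    # growing partial segment
        outputs.append(translate(prefix))  # retranslate the full prefix
    return outputs

# Toy stand-in for a translation engine so the sketch runs without a model:
# "translating" simply decodes the bytes received so far.
chunks = [b"hello ", b"wor", b"ld"]
for t, hyp in enumerate(retranslate_stream(chunks, lambda audio: audio.decode())):
    print(f"t={t}: {hyp}")
# t=1 prints "hello wor", t=2 prints "hello world": the earlier output
# changed between time steps, which is the flicker we measure.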
Our main contributions are:
• We explore various combinations of segmentation strategies and robustness-finetuning approaches for translating unsegmented audio in an online SLT setup.
• We find that the advantage of dedicated audio segmentation models over a fixed-window approach becomes much smaller if the translation model is context-aware, and that merging translations of overlapping windows can perform comparably to the gold segmentation.
• We discuss issues with the evaluation of delay in an existing evaluation toolkit for retranslation when different segmentations are used and show how these can be mitigated.
2 Related Work
In recent years, the IWSLT shared task organisers
have stopped providing gold segmented test sets
for the offline speech translation task, which has
led to an increased research focus on audio segmen-
tation (Ansari et al., 2020; Anastasopoulos et al.,
2021, 2022). One obvious strategy to segment au-
dio is to create fixed windows of the same duration,
but previous research has mostly relied on more
elaborate methods. Typically, methods with voice
activity detection (VAD) (Sohn et al., 1999) were
employed to identify natural breaks in the speech
signal. However, VAD models do not guarantee
breaks that align with complete utterances and can
produce segments that are too long or too short,
which is why hybrid approaches that also consider
the length of the predicted utterance can be helpful
(Potapczyk and Przybysz, 2020; Gaido et al., 2021;
Shanbhogue et al., 2022). Most recently, Tsia-
mas et al. (2022b) finetune a wav2vec 2.0 model
(Baevski et al., 2020) to predict gold segmentation-
like utterance boundaries, an approach which out-
performs several alternative segmentation methods
and was widely adopted in the 2022 IWSLT offline
SLT shared task (Tsiamas et al., 2022a; Pham et al.,
2022; Gaido et al., 2022).
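As a point of reference for the fixed-window strategy mentioned above, the following sketch splits audio into windows of equal duration, optionally with an overlap so that translations of neighbouring windows can later be merged. The window and overlap lengths are arbitrary illustrative values, not the settings used in our experiments:

import numpy as np

def fixed_windows(samples: np.ndarray, sample_rate: int,
                  window_sec: float = 20.0, overlap_sec: float = 0.0):
    # Fixed-window segmentation: no VAD and no learned model, just
    # equally long windows; consecutive windows may overlap.
    win = int(window_sec * sample_rate)
    hop = int((window_sec - overlap_sec) * sample_rate)
    assert 0 < hop <= win, "overlap must be shorter than the window"
    return [samples[start:start + win]
            for start in range(0, len(samples), hop)]

# 65 s of audio at 16 kHz, 20 s windows with 5 s overlap: windows start
# at 0 s, 15 s, 30 s, 45 s and 60 s; the last one is shorter.
audio = np.zeros(65 * 16000, dtype=np.float32)
segments = fixed_windows(audio, 16000, window_sec=20.0, overlap_sec=5.0)
print([round(len(s) / 16000, 1) for s in segments])  # [20.0, 20.0, 20.0, 20.0, 5.0]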
Apart from improving automatic audio segmen-
tation methods, previous research has also focused
on making SLT models more robust toward seg-
mentation errors. Gaido et al. (2020) and Zhang
et al. (2021) both explore context-aware end-to-
end SLT models and show that context can help to
better translate VAD-segmented utterances. Sim-
ilarly, training on artificially truncated data can
improve segmentation robustness both in cas-
caded setups (Li et al., 2021) and in end-to-end
models (Gaido et al., 2020). While this approach
can introduce misalignments between source audio
and target text, such misalignments in the training
data are not necessarily harmful to SLT models as
Ouyang et al. (2022) recently showed in an evalua-
tion of the MuST-C dataset (Di Gangi et al., 2019).
Both of these approaches – improving auto-
matic segmentation and making models more ro-
bust toward segmentation errors – can be combined.
For example, Papi et al. (2021) show that contin-
ued finetuning on artificial segmentation can help
narrow the gap between hybrid segmentation ap-
proaches and manual segmentation. However, a
combination of both methods is not always equally
beneficial. Gaido et al. (2022) repeat the analysis
of Papi et al. (2021) with the segmentation model
proposed by Tsiamas et al. (2022b) and show that for
this segmentation strategy, continued finetuning on
resegmented data does not lead to an improvement
in translation quality.
In our work, we aim to extend these efforts
and test various combinations of segmentation and
model finetuning strategies. We are especially inter-
ested in fixed-window segmentations, which have
largely been ignored in SLT research but are attrac-
tive from a practical point of view because they
do not require an additional model to perform seg-
mentation. To the best of our knowledge, we are