Don’t Discard Fixed-Window Audio Segmentation in Speech-to-Text
Translation
Chantal Amrhein1 and Barry Haddow2
1Department of Computational Linguistics, University of Zurich
2School of Informatics, University of Edinburgh
amrhein@cl.uzh.ch, bhaddow@ed.ac.uk
Abstract
For real-life applications, it is crucial that end-to-end spoken language translation models perform well on continuous audio, without relying on human-supplied segmentation. For online spoken language translation, where models need to start translating before the full utterance is spoken, most previous work has ignored the segmentation problem. In this paper, we compare various methods for improving models’ robustness towards segmentation errors and different segmentation strategies in both offline and online settings and report results on translation quality, flicker and delay. Our findings on five different language pairs show that a simple fixed-window audio segmentation can perform surprisingly well given the right conditions.1
1 Introduction
End-to-end spoken language translation (SLT) has seen considerable advances in recent years. To apply these findings to real online and offline SLT settings, we need to be able to process continuous audio input. However, most previous work on end-to-end SLT makes use of human-annotated, sentence-like gold segments both at training and test time which are not available in real-life settings. Unfortunately, SLT models that were trained on such gold segments often suffer a noticeable quality loss when applied to artificially split audio segments (Zhang et al., 2021; Tsiamas et al., 2022b). This also highlights that a good segmentation is more important for SLT than for automatic speech recognition (ASR) because we need to split the audio into “translatable units”. For a cascade system, a segmenter/punctuator can be inserted between the ASR and machine translation (MT) model (Cho et al., 2017) in order to create suitable segments for the MT model. However, for end-to-end SLT systems, it is still not clear how to best translate continuous input.

1 We publicly release our code and model outputs here: https://github.com/ZurichNLP/window_audio_segmentation

Figure 1: Visualisation of the different audio segmentation methods studied in this paper (gold, SHAS, fixed, merged).
Solving this problem is very much an active research field that has mainly been tackled from two sides: (1) improving SLT models to be more robust towards segmentation errors (Gaido et al., 2020; Li et al., 2021; Zhang et al., 2021) and (2) developing strategies to split streaming audio into segments that resemble the training data more closely (Gaido et al., 2021; Tsiamas et al., 2022b). Both types of approaches were successfully used in recent years for the IWSLT offline SLT shared task (Ansari et al., 2020; Anastasopoulos et al., 2021, 2022) to translate audio without gold segmentations. However, they have not yet been tested systematically in the online SLT setup where translation starts before the full utterance is spoken. Recent editions of the IWSLT simultaneous speech translation shared task focused more on evaluation using the gold segmentation rather than unsegmented audio (Anastasopoulos et al., 2021, 2022). Segmenting streaming audio is especially interesting in online SLT because aside from effects on translation quality, different segmentations can also influence the delay (or latency) of the generated translation.
In this paper, we aim to fill this gap and
focus on the end-to-end online SLT setup. We
suspect that there is an interplay between more
robust models and better segmentation strategies and that an isolated comparison may not be informative enough. Consequently, we explore different combinations of these two approaches for two different SLT models and present results in five language pairs. Figure 1 shows the four segmentation methods we study in this work (see also Section 3.3). Our experiments follow the popular retranslation approach (Niehues et al., 2016, 2018; Arivazhagan et al., 2020a,b) where a partial segment is retranslated every time new audio becomes available. Retranslation has the advantage of being a simple approach to online SLT, which can use a standard MT inference engine. As a side-effect, the previous translation can change in later retranslations and the resulting “flicker” (i.e. sudden translation changes in the output of previous time steps) is also considered in our evaluation of different strategies.
Our main contributions are:
• We explore various combinations of segmentation strategies and robustness-finetuning approaches for translating unsegmented audio in an online SLT setup.

• We find that the advantage of dedicated audio segmentation models over a fixed-window approach becomes much smaller if the translation model is context-aware, and merging translations of overlapping windows can perform comparably to the gold segmentation.

• We discuss issues with the evaluation of delay in an existing evaluation toolkit for retranslation when different segmentations are used and show how these can be mitigated.
2 Related Work
In recent years, the IWSLT shared task organisers have stopped providing gold segmented test sets for the offline speech translation task, which has led to increased research focus on audio segmentation (Ansari et al., 2020; Anastasopoulos et al., 2021, 2022). One obvious strategy to segment audio is to create fixed windows of the same duration, but previous research has mostly relied on more elaborate methods. Typically, methods with voice activity detection (VAD) (Sohn et al., 1999) were employed to identify natural breaks in the speech signal. However, VAD models do not guarantee breaks that align with complete utterances and can produce segments that are too long or too short, which is why hybrid approaches that also consider the length of the predicted utterance can be helpful (Potapczyk and Przybysz, 2020; Gaido et al., 2021; Shanbhogue et al., 2022). Most recently, Tsiamas et al. (2022b) finetune a wav2vec 2.0 model (Baevski et al., 2020) to predict gold segmentation-like utterance boundaries, an approach which outperforms several alternative segmentation methods and was widely adopted in the 2022 IWSLT offline SLT shared task (Tsiamas et al., 2022a; Pham et al., 2022; Gaido et al., 2022).
Apart from improving automatic audio segmentation methods, previous research has also focused on making SLT models more robust toward segmentation errors. Gaido et al. (2020) and Zhang et al. (2021) both explore context-aware end-to-end SLT models and show that context can help to better translate VAD-segmented utterances. Similarly, training on artificially truncated data can be beneficial to segmentation robustness in cascaded setups (Li et al., 2021) but also in end-to-end models (Gaido et al., 2020). While this approach can introduce misalignments between source audio and target text, such misalignments in the training data are not necessarily harmful to SLT models, as Ouyang et al. (2022) recently showed in an evaluation of the MuST-C dataset (Di Gangi et al., 2019).
Both of these approaches – improving automatic segmentation and making models more robust toward segmentation errors – can be combined. For example, Papi et al. (2021) show that continued finetuning on artificial segmentation can help narrow the gap between hybrid segmentation approaches and manual segmentation. However, a combination of both methods is not always equally beneficial. Gaido et al. (2022) repeat Papi et al. (2021)’s analysis with the segmentation model proposed by Tsiamas et al. (2022b) and show that for this segmentation strategy, continued finetuning on resegmented data does not lead to an improvement in translation quality.
In our work, we aim to extend these efforts and test various combinations of segmentation and model finetuning strategies. We are especially interested in fixed-window segmentations, which have largely been ignored in SLT research but are attractive from a practical point of view because they do not require an additional model to perform segmentation. To the best of our knowledge, we are the first to perform such an extensive segmentation-focused analysis for online SLT, considering delay, flicker and translation quality for the evaluation.

            train                   test
          # talks   # segments    # talks   # segments
en-de       2,043      229,703         27        2,641
es-en         378       36,263         15          996
fr-en         250       30,171         11        1,041
it-en         221       24,576         11          979
pt-en         279       30,855         11        1,022
multi       1,128      121,865         48        4,038

Table 1: Overview of dataset statistics. The last row shows the total numbers for the multilingual model on es-en, fr-en, it-en and pt-en combined.
3 Experiment Setup
3.1 Data
We run experiments with TED talk data in five different language pairs where the task is to translate a TED talk as an incoming stream without having any gold sentence segmentation.

For English-to-German, we use the data from the MuST-C corpus (Di Gangi et al., 2019) version 1.0 (https://ict.fbk.eu/must-c/). This dataset is built from TED talk audio with human-annotated transcriptions and translations. For testing, we use the “tst-COMMON” test set. For Spanish-, French-, Italian- and Portuguese-to-English, we use the data from the mTEDx corpus (Salesky et al., 2021) (http://www.openslr.org/100). This dataset is also based on TED talks and provides human-annotated transcriptions and translations of the audio files. For testing, we use the “iwslt2021” test set from the IWSLT 2021 multilingual speech translation shared task (Anastasopoulos et al., 2021). The dataset statistics can be seen in Table 1.
3.2 Spoken Language Translation Models
We base all our experiments on the joint speech- and text-to-text model (Tang et al., 2021a,b,c) released by Meta AI. For the English-German experiments, we use the model provided by Tang et al. (2021b) (https://github.com/facebookresearch/fairseq/blob/main/examples/speech_text_joint_to_text/docs/ende-mustc.md) and for the other language pairs, we use the multilingual model provided by Tang et al. (2021a) (https://github.com/facebookresearch/fairseq/blob/main/examples/speech_text_joint_to_text/docs/iwslt2021.md). We refer to these models as the original models. These models are trained on full segments that mostly comprise one sentence:

And like with all powerful technology, this brings huge benefits, but also some risks.
To investigate the effects of different segmentation strategies combined with segmentation-robust models, we finetune three different variants based on each model. In each case, the finetuning data is augmented with artificially segmented data, but no segments cross the boundaries between the individual TED talks.
prefix: This model is finetuned on a 50-50 mix of original segments and synthetically created prefixes (i.e. sentences where the end is arbitrarily chopped off). Finetuning on prefixes should help for translating artificially segmented audio where the segment stops in the middle of an utterance. We create prefixes of the original segments by randomly sampling a new duration for an audio segment and using the length ratio to extract the corresponding target text. An example for a prefixed version of the original segment can be seen here:

And like with all
context: This model is finetuned on a mix of original segments and synthetically created longer segments. Context was already shown to help with segmentation errors by Zhang et al. (2021). This model should be able to translate segments that consist of multiple utterances. For each segment in the original training set, we randomly either use the original segment (50% of the time) or an extended segment created by prepending the previous segment (25% of the time) or the 2 previous segments (also 25% of the time). We then add context-prefixed segments for each of these (possibly-extended) segments, by truncating the last concatenated segment. An example for a context-prefixed version of the original segment can be seen here:

We work every day to generate those kinds of technologies, safe and useful. And like with all powerful technology, this brings huge benefits,
windows: This model is finetuned on a 50-50 mix of original segments and windows of random duration. We split the audio into windows by starting at the beginning of the audio and then sampling the duration of the first window. The end of this window then becomes the start of the next window and we repeat this process until we reach the end of a TED talk. For every such window, we extract the corresponding target text from the time-aligned gold segment(s) via length ratios. This mirrors the conditions at inference time with a fixed-window segmentation, where a segment can start and end anywhere in an utterance and can also comprise multiple utterances. The segment durations are sampled uniformly between 10 and 30 seconds. Note that this model will see the qualitatively poorest data out of all finetuned models because both the end and the beginning of the segment depend on length ratios, which can introduce alignment errors (see the sketch after this list). An example for a window version of the original segment can be seen here:

or death diagnosis without the help of artificial intelligence. We work every day to generate those kinds of technologies, safe and useful. And like with all powerful technology, this brings huge benefits, but also some risks. I don’t know how this debate ends, but what I’m sure of, is that the game
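To make the three augmentation variants concrete, the following is a minimal sketch of how such length-ratio-based data could be created. Only the sampling choices described above (50/25/25 context mix, uniform 10–30 second windows) come from the paper; the segment representation, the function names, the sampled prefix fraction and the character-level truncation are illustrative assumptions rather than the exact scripts used for our experiments.

```python
import random

# Hypothetical representation of a gold-aligned talk: a list of segments, each
# with an audio start/end time in seconds and its target-language translation.

def make_prefix(segment):
    """prefix: keep a random initial portion of the audio and extract the
    corresponding target prefix via the audio length ratio."""
    duration = segment["end"] - segment["start"]
    keep = random.uniform(0.1, 1.0)            # fraction of the segment to keep (assumed)
    n_chars = max(1, int(len(segment["target"]) * keep))
    return {"start": segment["start"],
            "end": segment["start"] + keep * duration,
            "target": segment["target"][:n_chars]}

def add_context(talk, i):
    """context: 50% original segment, 25% one previous segment prepended,
    25% two previous segments prepended. (Context-prefixed variants would
    additionally truncate the last concatenated segment, e.g. with make_prefix.)"""
    r = random.random()
    n_prev = 0 if r < 0.5 else 1 if r < 0.75 else 2
    first = max(0, i - n_prev)
    return {"start": talk[first]["start"],
            "end": talk[i]["end"],
            "target": " ".join(s["target"] for s in talk[first:i + 1])}

def make_windows(talk, min_dur=10.0, max_dur=30.0):
    """windows: cut the whole talk into consecutive windows of random duration
    and extract each window's target from the time-aligned gold segments
    via length ratios."""
    windows, start, talk_end = [], talk[0]["start"], talk[-1]["end"]
    while start < talk_end:
        end = min(start + random.uniform(min_dur, max_dur), talk_end)
        parts = []
        for seg in talk:
            if seg["end"] <= start or seg["start"] >= end:
                continue                       # no overlap with this window
            seg_dur = seg["end"] - seg["start"]
            lo = max(0.0, (start - seg["start"]) / seg_dur)
            hi = min(1.0, (end - seg["start"]) / seg_dur)
            text = seg["target"]
            parts.append(text[int(lo * len(text)):int(hi * len(text))])
        windows.append({"start": start, "end": end, "target": " ".join(parts)})
        start = end                            # window end becomes the next start
    return windows
```

In this sketch the window targets can contain partial sentences at both the beginning and the end, which is exactly the kind of noise the windows model is meant to become robust to.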
All models are trained from the original checkpoint for an additional 20k steps and the last two checkpoints are averaged if more than one is saved. We do this finetuning by continuing training with the config file of the original model. For the English-German MuST-C model, we train on the audio as well as the corresponding phoneme sequences based on the transcript; however, we do not use additional parallel text data during finetuning. For the multilingual mTEDx model, we only train on data for the selected language pairs and only on audio (no phoneme sequences) because this model was already finetuned on the spoken language translation task. The validation sets only contain gold segments and all models stop training due to the step limit before early stopping is triggered.
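Checkpoint averaging here simply means taking the element-wise mean of the saved parameter tensors of the last two checkpoints. The sketch below is a minimal, generic PyTorch version of this step, not the exact averaging script used in our setup; it assumes the parameters live under a "model" key, as in fairseq-style checkpoints, and the file names are placeholders.

```python
import torch

def average_checkpoints(paths, out_path, key="model"):
    """Element-wise average of the parameter tensors of several checkpoints.
    Assumes the parameters live under ckpt[key]; adjust `key` if your
    checkpoints are structured differently."""
    avg, last_ckpt = None, None
    for path in paths:
        last_ckpt = torch.load(path, map_location="cpu")
        state = last_ckpt[key]
        if avg is None:
            avg = {name: p.clone().float() for name, p in state.items()}
        else:
            for name, p in state.items():
                avg[name] += p.float()
    for name in avg:
        avg[name] /= len(paths)
    last_ckpt[key] = avg                 # reuse the last checkpoint's metadata
    torch.save(last_ckpt, out_path)

# e.g. average the last two saved checkpoints (placeholder file names):
# average_checkpoints(["checkpoint19.pt", "checkpoint20.pt"], "checkpoint_avg.pt")
```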
3.3 Segmentation Strategies
We consider four different inference-time segmentation strategies in our experiments, visualised in Figure 1:
gold: These are human-annotated segmentation boundaries that are released as part of the MuST-C and mTEDx data. This segmentation can be viewed as an oracle segmentation even though it may not necessarily be the best segmentation for all models. Using the gold segmentation in practice is unrealistic, especially in the online setting where there would be no time for a human to segment the audio before translation.
SHAS: This segmentation method was recently proposed by Tsiamas et al. (2022b). The authors finetune a pretrained wav2vec 2.0 model (Baevski et al., 2020) on the gold segmentations and train it to predict probabilities for segmentation boundaries. SHAS can be used both in offline and online setups using different algorithms to determine the segmentation boundaries based on the model’s probabilities. Since we perform our experiments in an online setup, we use the pSTREAM algorithm to identify segments with SHAS. We set the maximum segment length to 18 seconds, which the authors reported as best-performing.
fixed: This is a simple approach that splits the audio stream into independent fixed windows of a given duration. In our experiments, we use a window duration of 26 seconds, which performed best in experiments by Tsiamas et al. (2022b).
merged: Similarly to above, we consider fixed-size windows for this segmentation strategy, but here we construct overlapping windows. We use a duration of 15 seconds and shift the window with a stride of 2 seconds at a time; we found empirically that this works better than a duration of 26 seconds as for fixed windows, with both increased translation quality and reduced flicker (see Appendix B). The translations of these overlapping windows are merged before the next window is translated (see Section 3.5).
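To illustrate the difference between the fixed and merged strategies, the sketch below generates the segment boundaries for both. The 26-second fixed duration and the 15-second window with a 2-second stride come from the setup above; the function names and the (start, end) tuple representation are illustrative assumptions.

```python
def fixed_windows(total_duration: float, window: float = 26.0):
    """Non-overlapping fixed windows: [0, 26), [26, 52), ... until the stream ends."""
    boundaries, start = [], 0.0
    while start < total_duration:
        boundaries.append((start, min(start + window, total_duration)))
        start += window
    return boundaries

def merged_windows(total_duration: float, window: float = 15.0, stride: float = 2.0):
    """Overlapping windows shifted by a 2-second stride: [0, 15), [2, 17), ...
    The translations of consecutive windows are later merged (Section 3.5)."""
    boundaries, start = [], 0.0
    while start < total_duration:
        boundaries.append((start, min(start + window, total_duration)))
        start += stride
    return boundaries

# e.g. for a 60-second stream:
# fixed_windows(60.0)  -> [(0.0, 26.0), (26.0, 52.0), (52.0, 60.0)]
# merged_windows(60.0) -> [(0.0, 15.0), (2.0, 17.0), (4.0, 19.0), ...]
```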
3.4 Retranslation
We employ a retranslation strategy (Niehues et al., 2016, 2018; Arivazhagan et al., 2020a,b) for our end-to-end SLT experiments. This means that we retranslate the incoming audio at fixed time intervals. In our experiments, we retranslate every 2 seconds to be consistent with the 2-second stride from the merging windows approach.
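The following is a minimal sketch of such a retranslation loop under a fixed-window segmentation. Only the 2-second update interval and the 26-second window come from the setup described here; audio_stream(t0, t1) and translate(audio) are hypothetical stand-ins for the audio buffer and the SLT model, not a real API.

```python
UPDATE_INTERVAL = 2.0   # retranslate every 2 seconds of newly arrived audio
WINDOW = 26.0           # fixed-window segment duration

def retranslation_loop(audio_stream, translate, total_duration):
    """Yield (finalised, partial) translations every UPDATE_INTERVAL seconds.
    `audio_stream(t0, t1)` and `translate(audio)` are hypothetical stand-ins."""
    finalised = []                     # translations of segments that are closed
    segment_start, now = 0.0, 0.0
    while now < total_duration:
        now = min(now + UPDATE_INTERVAL, total_duration)
        # retranslate the open segment from its start up to the current time;
        # earlier partial output may be revised, which is what causes flicker
        partial = translate(audio_stream(segment_start, now))
        if now - segment_start >= WINDOW or now == total_duration:
            finalised.append(partial)  # the segment is complete: freeze it
            segment_start = now
            partial = ""
        yield finalised, partial
```

Roughly speaking, the merged strategy would instead translate the most recent 15-second window at every 2-second step and merge the new translation with the previously produced output (Section 3.5).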
Because of such retranslations of the full audio segment — from the start of the segment up to the current time