Don’t Discard Fixed-Window Audio Segmentation in Speech-to-Text
Translation
Chantal Amrhein1 and Barry Haddow2
1Department of Computational Linguistics, University of Zurich
2School of Informatics, University of Edinburgh
amrhein@cl.uzh.ch, bhaddow@ed.ac.uk
Abstract
For real-life applications, it is crucial that end-to-end spoken language translation models perform well on continuous audio, without relying on human-supplied segmentation. For online spoken language translation, where models need to start translating before the full utterance is spoken, most previous work has ignored the segmentation problem. In this paper, we compare various methods for improving models’ robustness towards segmentation errors and different segmentation strategies in both offline and online settings and report results on translation quality, flicker and delay. Our findings on five different language pairs show that a simple fixed-window audio segmentation can perform surprisingly well given the right conditions.1
1 Introduction
End-to-end spoken language translation (SLT) has seen considerable advances in recent years. To apply these findings to real online and offline SLT settings, we need to be able to process continuous audio input. However, most previous work on end-to-end SLT makes use of human-annotated, sentence-like gold segments both at training and test time which are not available in real-life settings. Unfortunately, SLT models that were trained on such gold segments often suffer a noticeable quality loss when applied to artificially split audio segments (Zhang et al., 2021; Tsiamas et al., 2022b). This also highlights that a good segmentation is more important for SLT than for automatic speech recognition (ASR) because we need to split the audio into “translatable units”. For a cascade system, a segmenter/punctuator can be inserted between the ASR and machine translation (MT) model (Cho et al., 2017) in order to create suitable segments for the MT model. However, for end-to-end SLT systems, it is still not clear how to best translate continuous input.

1 We publicly release our code and model outputs here: https://github.com/ZurichNLP/window_audio_segmentation

Figure 1: Visualisation of the different audio segmentation methods studied in this paper (gold, SHAS, fixed, merged).
Solving this problem is very much an active research field that has mainly been tackled from two sides: (1) improving SLT models to be more robust towards segmentation errors (Gaido et al., 2020; Li et al., 2021; Zhang et al., 2021) and (2) developing strategies to split streaming audio into segments that resemble the training data more closely (Gaido et al., 2021; Tsiamas et al., 2022b). Both types of approaches were successfully used in recent years for the IWSLT offline SLT shared task (Ansari et al., 2020; Anastasopoulos et al., 2021, 2022) to translate audio without gold segmentations. However, they have not yet been tested systematically in the online SLT setup where translation starts before the full utterance is spoken. Recent editions of the IWSLT simultaneous speech translation shared task focused more on evaluation using the gold segmentation rather than unsegmented audio (Anastasopoulos et al., 2021, 2022). Segmenting streaming audio is especially interesting in online SLT because aside from effects on translation quality, different segmentations can also influence the delay (or latency) of the generated translation.
In this paper, we aim to fill this gap and
focus on the end-to-end online SLT setup. We
suspect that there is an interplay between more
robust models and better segmentation strategies and that an isolated comparison may not be informative enough. Consequently, we explore different combinations of these two approaches for two different SLT models and present results in five language pairs. Figure 1 shows the four segmentation methods we study in this work (see also Section 3.3). Our experiments follow the popular retranslation approach (Niehues et al., 2016, 2018; Arivazhagan et al., 2020a,b) where a partial segment is retranslated every time new audio becomes available. Retranslation has the advantage of being a simple approach to online SLT, which can use a standard MT inference engine. As a side-effect, the previous translation can change in later retranslations and the resulting “flicker” (i.e. sudden translation changes in the output of previous time steps) is also considered in our evaluation of different strategies.
Our main contributions are:
• We explore various combinations of segmentation strategies and robustness-finetuning approaches for translating unsegmented audio in an online SLT setup.

• We find that the advantage of dedicated audio segmentation models over a fixed-window approach becomes much smaller if the translation model is context-aware, and merging translations of overlapping windows can perform comparably to the gold segmentation.

• We discuss issues with the evaluation of delay in an existing evaluation toolkit for retranslation when different segmentations are used and show how these can be mitigated.
2 Related Work
In recent years, the IWSLT shared task organisers have stopped providing gold segmented test sets for the offline speech translation task, which has led to increased research focus on audio segmentation (Ansari et al., 2020; Anastasopoulos et al., 2021, 2022). One obvious strategy to segment audio is to create fixed windows of the same duration, but previous research has mostly relied on more elaborate methods. Typically, methods with voice activity detection (VAD) (Sohn et al., 1999) were employed to identify natural breaks in the speech signal. However, VAD models do not guarantee breaks that align with complete utterances and can produce segments that are too long or too short, which is why hybrid approaches that also consider the length of the predicted utterance can be helpful (Potapczyk and Przybysz, 2020; Gaido et al., 2021; Shanbhogue et al., 2022). Most recently, Tsiamas et al. (2022b) finetune a wav2vec 2.0 model (Baevski et al., 2020) to predict gold segmentation-like utterance boundaries, an approach which outperforms several alternative segmentation methods and was widely adopted in the 2022 IWSLT offline SLT shared task (Tsiamas et al., 2022a; Pham et al., 2022; Gaido et al., 2022).
Apart from improving automatic audio segmentation methods, previous research has also focused on making SLT models more robust toward segmentation errors. Gaido et al. (2020) and Zhang et al. (2021) both explore context-aware end-to-end SLT models and show that context can help to better translate VAD-segmented utterances. Similarly, training on artificially truncated data can be beneficial to segmentation robustness in cascaded setups (Li et al., 2021) but also in end-to-end models (Gaido et al., 2020). While this approach can introduce misalignments between source audio and target text, such misalignments in the training data are not necessarily harmful to SLT models, as Ouyang et al. (2022) recently showed in an evaluation of the MuST-C dataset (Di Gangi et al., 2019).
Both of these approaches – improving automatic segmentation and making models more robust toward segmentation errors – can be combined. For example, Papi et al. (2021) show that continued finetuning on artificial segmentation can help narrow the gap between hybrid segmentation approaches and manual segmentation. However, a combination of both methods is not always equally beneficial. Gaido et al. (2022) repeat Papi et al. (2021)’s analysis with the segmentation model proposed by Tsiamas et al. (2022b) and show that for this segmentation strategy, continued finetuning on resegmented data does not lead to an improvement in translation quality.
In our work, we aim to extend these efforts and test various combinations of segmentation and model finetuning strategies. We are especially interested in fixed-window segmentations, which have largely been ignored in SLT research but are attractive from a practical point of view because they do not require an additional model to perform segmentation. To the best of our knowledge, we are the first to perform such an extensive segmentation-focused analysis for online SLT, considering delay, flicker and translation quality for the evaluation.

            train                   test
          # talks   # segments    # talks   # segments
en-de       2,043      229,703         27        2,641
es-en         378       36,263         15          996
fr-en         250       30,171         11        1,041
it-en         221       24,576         11          979
pt-en         279       30,855         11        1,022
multi       1,128      121,865         48        4,038

Table 1: Overview of dataset statistics. The last row shows the total numbers for the multilingual model on es-en, fr-en, it-en and pt-en combined.
3 Experiment Setup
3.1 Data
We run experiments with TED talk data in five different language pairs where the task is to translate a TED talk as an incoming stream without having any gold sentence segmentation.

For English-to-German, we use the data from the MuST-C corpus (Di Gangi et al., 2019) version 1.0 (https://ict.fbk.eu/must-c/). This dataset is built from TED talk audio with human-annotated transcriptions and translations. For testing, we use the “tst-COMMON” test set. For Spanish-, French-, Italian- and Portuguese-to-English, we use the data from the mTEDx corpus (Salesky et al., 2021) (http://www.openslr.org/100). This dataset is also based on TED talks and provides human-annotated transcriptions and translations of the audio files. For testing, we use the “iwslt2021” test set from the IWSLT 2021 multilingual speech translation shared task (Anastasopoulos et al., 2021). The dataset statistics can be seen in Table 1.
3.2 Spoken Language Translation Models
We base all our experiments on the joint speech- and text-to-text model (Tang et al., 2021a,b,c) released by Meta AI. For the English-German experiments, we use the model provided by Tang et al. (2021b) (https://github.com/facebookresearch/fairseq/blob/main/examples/speech_text_joint_to_text/docs/ende-mustc.md) and for the other language pairs, we use the multilingual model provided by Tang et al. (2021a) (https://github.com/facebookresearch/fairseq/blob/main/examples/speech_text_joint_to_text/docs/iwslt2021.md). We refer to these models as the original models. These models are trained on full segments that mostly comprise one sentence:

And like with all powerful technology, this brings huge benefits, but also some risks.
To investigate the effects of different segmentation strategies combined with segmentation-robust models, we finetune three different variants based on each model. In each case, the finetuning data is augmented with artificially segmented data, but no segments cross the boundaries between the individual TED talks.
prefix: This model is finetuned on a 50-50 mix of original segments and synthetically created prefixes (i.e. sentences where the end is arbitrarily chopped off). Finetuning on prefixes should help for translating artificially segmented audio where the segment stops in the middle of an utterance. We create prefixes of the original segments by randomly sampling a new duration for an audio segment and using the length ratio to extract the corresponding target text. An example for a prefixed version of the original segment can be seen here:

And like with all
context: This model is finetuned on a mix of original segments and synthetically created longer segments. Context was already shown to help with segmentation errors by Zhang et al. (2021). This model should be able to translate segments that consist of multiple utterances. For each segment in the original training set, we randomly either use the original segment (50% of the time) or an extended segment created by prepending the previous segment (25% of the time) or the 2 previous segments (also 25% of the time). We then add context-prefixed segments for each of these (possibly-extended) segments, by truncating the last concatenated segment. An example for a context-prefixed version of the original segment can be seen here:

We work every day to generate those kinds of technologies, safe and useful. And like with all powerful technology, this brings huge benefits,
windows: This model is finetuned on a 50-50 mix of original segments and windows of random duration. We split the audio into windows by starting at the beginning of the audio and then sampling the duration of the first window. The end of this window then becomes the start of the next window and we repeat this process until we reach the end of a TED talk. For every such window, we extract the corresponding target text from the time-aligned gold segment(s) via length ratios. This mirrors the conditions at inference time with a fixed-window segmentation, where a segment can start and end anywhere in an utterance and can also comprise multiple utterances. The segment durations are sampled uniformly between 10 and 30 seconds. Note that this model will see the qualitatively poorest data out of all finetuned models because both the end and the beginning of the segment depend on length ratios, which can introduce alignment errors (see the sketch after this list). An example for a window version of the original segment can be seen here:

or death diagnosis without the help of artificial intelligence. We work every day to generate those kinds of technologies, safe and useful. And like with all powerful technology, this brings huge benefits, but also some risks. I don’t know how this debate ends, but what I’m sure of, is that the game
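To make the three augmentation variants concrete, the following is a minimal sketch of how such length-ratio-based data could be created. Only the sampling choices described above (50/25/25 context mix, uniform 10–30 second windows) come from the paper; the segment representation, the function names, the sampled prefix fraction and the character-level truncation are illustrative assumptions rather than the exact scripts used for our experiments.

```python
import random

# Hypothetical representation of a gold-aligned talk: a list of segments, each
# with an audio start/end time in seconds and its target-language translation.

def make_prefix(segment):
    """prefix: keep a random initial portion of the audio and extract the
    corresponding target prefix via the audio length ratio."""
    duration = segment["end"] - segment["start"]
    keep = random.uniform(0.1, 1.0)            # fraction of the segment to keep (assumed)
    n_chars = max(1, int(len(segment["target"]) * keep))
    return {"start": segment["start"],
            "end": segment["start"] + keep * duration,
            "target": segment["target"][:n_chars]}

def add_context(talk, i):
    """context: 50% original segment, 25% one previous segment prepended,
    25% two previous segments prepended. (Context-prefixed variants would
    additionally truncate the last concatenated segment, e.g. with make_prefix.)"""
    r = random.random()
    n_prev = 0 if r < 0.5 else 1 if r < 0.75 else 2
    first = max(0, i - n_prev)
    return {"start": talk[first]["start"],
            "end": talk[i]["end"],
            "target": " ".join(s["target"] for s in talk[first:i + 1])}

def make_windows(talk, min_dur=10.0, max_dur=30.0):
    """windows: cut the whole talk into consecutive windows of random duration
    and extract each window's target from the time-aligned gold segments
    via length ratios."""
    windows, start, talk_end = [], talk[0]["start"], talk[-1]["end"]
    while start < talk_end:
        end = min(start + random.uniform(min_dur, max_dur), talk_end)
        parts = []
        for seg in talk:
            if seg["end"] <= start or seg["start"] >= end:
                continue                       # no overlap with this window
            seg_dur = seg["end"] - seg["start"]
            lo = max(0.0, (start - seg["start"]) / seg_dur)
            hi = min(1.0, (end - seg["start"]) / seg_dur)
            text = seg["target"]
            parts.append(text[int(lo * len(text)):int(hi * len(text))])
        windows.append({"start": start, "end": end, "target": " ".join(parts)})
        start = end                            # window end becomes the next start
    return windows
```

In this sketch the window targets can contain partial sentences at both the beginning and the end, which is exactly the kind of noise the windows model is meant to become robust to.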
All models are trained from the original checkpoint for an additional 20k steps and the last two checkpoints are averaged if more than one is saved. We do this finetuning by continuing training with the config file of the original model. For the English-German MuST-C model, we train on the audio as well as the corresponding phoneme sequences based on the transcript; however, we do not use additional parallel text data during finetuning. For the multilingual mTEDx model, we only train on data for the selected language pairs and only on audio (no phoneme sequences) because this model was already finetuned on the spoken language translation task. The validation sets only contain gold segments and all models stop training due to the step limit before early stopping is triggered.
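Checkpoint averaging here simply means taking the element-wise mean of the saved parameter tensors of the last two checkpoints. The sketch below is a minimal, generic PyTorch version of this step, not the exact averaging script used in our setup; it assumes the parameters live under a "model" key, as in fairseq-style checkpoints, and the file names are placeholders.

```python
import torch

def average_checkpoints(paths, out_path, key="model"):
    """Element-wise average of the parameter tensors of several checkpoints.
    Assumes the parameters live under ckpt[key]; adjust `key` if your
    checkpoints are structured differently."""
    avg, last_ckpt = None, None
    for path in paths:
        last_ckpt = torch.load(path, map_location="cpu")
        state = last_ckpt[key]
        if avg is None:
            avg = {name: p.clone().float() for name, p in state.items()}
        else:
            for name, p in state.items():
                avg[name] += p.float()
    for name in avg:
        avg[name] /= len(paths)
    last_ckpt[key] = avg                 # reuse the last checkpoint's metadata
    torch.save(last_ckpt, out_path)

# e.g. average the last two saved checkpoints (placeholder file names):
# average_checkpoints(["checkpoint19.pt", "checkpoint20.pt"], "checkpoint_avg.pt")
```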
3.3 Segmentation Strategies
We consider four different inference-time segmentation strategies in our experiments, visualised in Figure 1:
gold: These are human-annotated segmentation boundaries that are released as part of the MuST-C and mTEDx data. This segmentation can be viewed as an oracle segmentation even though it may not necessarily be the best segmentation for all models. Using the gold segmentation in practice is unrealistic, especially in the online setting where there would be no time for a human to segment the audio before translation.
SHAS: This segmentation method was recently proposed by Tsiamas et al. (2022b). The authors finetune a pretrained wav2vec 2.0 model (Baevski et al., 2020) on the gold segmentations and train it to predict probabilities for segmentation boundaries. SHAS can be used both in offline and online setups using different algorithms to determine the segmentation boundaries based on the model’s probabilities. Since we perform our experiments in an online setup, we use the pSTREAM algorithm to identify segments with SHAS. We set the maximum segment length to 18 seconds, which the authors reported as best-performing.
fixed: This is a simple approach that splits the audio stream into independent fixed windows of a given duration. In our experiments, we use a window duration of 26 seconds, which performed best in experiments by Tsiamas et al. (2022b).
merged: Similarly to above, we consider fixed-size windows for this segmentation strategy, but here we construct overlapping windows. We use a duration of 15 seconds and shift the window with a stride of 2 seconds at a time; we found empirically that this works better than a duration of 26 seconds as for fixed windows, with both increased translation quality and reduced flicker (see Appendix B). The translations of these overlapping windows are merged before the next window is translated (see Section 3.5).
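To illustrate the difference between the fixed and merged strategies, the sketch below generates the segment boundaries for both. The 26-second fixed duration and the 15-second window with a 2-second stride come from the setup above; the function names and the (start, end) tuple representation are illustrative assumptions.

```python
def fixed_windows(total_duration: float, window: float = 26.0):
    """Non-overlapping fixed windows: [0, 26), [26, 52), ... until the stream ends."""
    boundaries, start = [], 0.0
    while start < total_duration:
        boundaries.append((start, min(start + window, total_duration)))
        start += window
    return boundaries

def merged_windows(total_duration: float, window: float = 15.0, stride: float = 2.0):
    """Overlapping windows shifted by a 2-second stride: [0, 15), [2, 17), ...
    The translations of consecutive windows are later merged (Section 3.5)."""
    boundaries, start = [], 0.0
    while start < total_duration:
        boundaries.append((start, min(start + window, total_duration)))
        start += stride
    return boundaries

# e.g. for a 60-second stream:
# fixed_windows(60.0)  -> [(0.0, 26.0), (26.0, 52.0), (52.0, 60.0)]
# merged_windows(60.0) -> [(0.0, 15.0), (2.0, 17.0), (4.0, 19.0), ...]
```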
3.4 Retranslation
We employ a retranslation strategy (Niehues et al., 2016, 2018; Arivazhagan et al., 2020a,b) for our end-to-end SLT experiments. This means that we retranslate the incoming audio at fixed time intervals. In our experiments, we retranslate every 2 seconds to be consistent with the 2-second stride from the merging windows approach.
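The following is a minimal sketch of such a retranslation loop under a fixed-window segmentation. Only the 2-second update interval and the 26-second window come from the setup described here; audio_stream(t0, t1) and translate(audio) are hypothetical stand-ins for the audio buffer and the SLT model, not a real API.

```python
UPDATE_INTERVAL = 2.0   # retranslate every 2 seconds of newly arrived audio
WINDOW = 26.0           # fixed-window segment duration

def retranslation_loop(audio_stream, translate, total_duration):
    """Yield (finalised, partial) translations every UPDATE_INTERVAL seconds.
    `audio_stream(t0, t1)` and `translate(audio)` are hypothetical stand-ins."""
    finalised = []                     # translations of segments that are closed
    segment_start, now = 0.0, 0.0
    while now < total_duration:
        now = min(now + UPDATE_INTERVAL, total_duration)
        # retranslate the open segment from its start up to the current time;
        # earlier partial output may be revised, which is what causes flicker
        partial = translate(audio_stream(segment_start, now))
        if now - segment_start >= WINDOW or now == total_duration:
            finalised.append(partial)  # the segment is complete: freeze it
            segment_start = now
            partial = ""
        yield finalised, partial
```

Roughly speaking, the merged strategy would instead translate the most recent 15-second window at every 2-second step and merge the new translation with the previously produced output (Section 3.5).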
Because of such retranslations of the full audio segment — from the start of the segment up to the current time