
Simultaneous Translation for Unsegmented Input:
A Sliding Window Approach
Sukanta Sen∗
University of Edinburgh
sukantasen10@gmail.com
∗Work performed while at the University of Edinburgh.
Ondřej Bojar
Charles University, MFF ÚFAL
bojar@ufal.mff.cuni.cz
Barry Haddow
University of Edinburgh
bhaddow@ed.ac.uk
Abstract
In the cascaded approach to spoken language translation (SLT), the ASR output is typically punctuated and segmented into sentences before being passed to MT, since the latter is typically trained on written text. However, erroneous segmentation, due to poor sentence-final punctuation by the ASR system, leads to degradation in translation quality, especially in the simultaneous (online) setting where the input is continuously updated. To reduce the influence of automatic segmentation, we present a sliding window approach to translate raw ASR outputs (online or offline) without needing to rely on an automatic segmenter. We train translation models using parallel windows (instead of parallel sentences) extracted from the original training data. At test time, we translate at the window level and join the translated windows using a simple approach to generate the final translation. Experiments on English-to-German and English-to-Czech show that our approach improves by 1.3–2.0 BLEU points over the usual ASR-segmenter pipeline, and that the fixed-length window considerably reduces flicker compared to a baseline retranslation-based online SLT system.
1 Introduction
For machine translation (MT) with textual input, it is usual to segment the text into sentences before translation, with the boundaries of sentences in most text types indicated by punctuation. For spoken language translation (SLT), in contrast, the input is audio, so there is no punctuation provided to assist segmentation. Segmentation thus has to be guessed by the ASR system or a separate component. Perhaps more importantly, for many speech genres the input cannot easily be segmented into well-formed sentences as found in MT training data, giving a mismatch between training and test.
In order to address the segmentation problem in SLT, systems often include a segmentation component in their pipeline, e.g. Cho et al. (2017). In other words, a typical cascaded SLT system consists of automatic speech recognition (ASR, which outputs lowercased, unpunctuated text), a punctuator/segmenter (which adds punctuation and so defines segments), and an MT system. The segmenter can be a sequence-to-sequence model, and its training data is easily synthesised from punctuated text. However, adding segmentation as an extra step has the disadvantage of introducing an extra component to be managed and deployed. Furthermore, errors in segmentation have been shown to contribute significantly to overall errors in SLT (Li et al., 2021), since neural MT is known to be susceptible to degradation from noisy input (Khayrallah and Koehn, 2018).
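To make the synthesis step concrete, the sketch below shows one way such segmenter training pairs could be derived from punctuated text. This is a minimal illustration of the general idea; the function `make_segmenter_example` and its details are our own assumptions, not the implementation of any cited system.

```python
import re

def make_segmenter_example(punctuated_text: str):
    """Build one (input, target) pair for training a punctuator/segmenter."""
    # Simulate raw ASR output: strip punctuation and lowercase.
    asr_like = re.sub(r"[^\w\s]", "", punctuated_text).lower()
    asr_like = re.sub(r"\s+", " ", asr_like).strip()
    # The target is the original punctuated text, from which
    # segment boundaries can be read off.
    return asr_like, punctuated_text

src, tgt = make_segmenter_example("Hello there. How are you?")
# src: "hello there how are you"
# tgt: "Hello there. How are you?"
```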
These issues with segmentation can be exacerbated in the online or simultaneous setting. This is an important use case for SLT, where we want to produce translations from live speech as the speaker is talking. To minimise the latency of the translation, we would like to start translating before the speaker has finished their sentence. Some online low-latency ASR approaches will also revise their output after it has been produced, creating additional difficulties for the downstream components. In this scenario, the segmentation into sentences will be more uncertain, and we are faced with the choice of waiting for the input to stabilise (thus increasing latency) or translating early (potentially introducing more errors, or having to correct the output when the ASR is extended and updated).
To address the segmentation issue in SLT, Li et al. (2021) proposed a data augmentation technique which simulates bad segmentation in the training data. They concatenate two adjacent source sentences (along with the corresponding targets) and then truncate the start and end of the concatenated sentences proportionally.
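As a sketch of this style of augmentation, the following applies word-level truncation with randomly chosen fractions; the function name `truncate_concat` and the exact truncation scheme are our assumptions, not necessarily Li et al.'s precise recipe.

```python
import random

def truncate_concat(src1, tgt1, src2, tgt2, max_frac=0.3):
    """Concatenate two adjacent sentence pairs, then cut the start and
    end of the result to simulate badly segmented test-time input."""
    src = (src1 + " " + src2).split()
    tgt = (tgt1 + " " + tgt2).split()
    # Fractions of tokens to drop from the head and the tail.
    head = random.uniform(0.0, max_frac)
    tail = random.uniform(0.0, max_frac)

    def cut(tokens):
        n = len(tokens)
        return tokens[int(n * head):n - int(n * tail)]

    # The same fractions are applied to source and target, so the
    # truncation is proportional on both sides of the parallel pair.
    return " ".join(cut(src)), " ".join(cut(tgt))

src, tgt = truncate_concat("the cat sat", "die Katze sass",
                           "on the mat", "auf der Matte")
```

Training on such truncated pairs exposes the MT model to inputs whose boundaries do not align with sentence boundaries.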