Simultaneous Translation for Unsegmented Input:
A Sliding Window Approach

Sukanta Sen*
University of Edinburgh
sukantasen10@gmail.com

Ondřej Bojar
Charles University, MFF ÚFAL
bojar@ufal.mff.cuni.cz

Barry Haddow
University of Edinburgh
bhaddow@ed.ac.uk

* Work performed while at the University of Edinburgh.
Abstract
In the cascaded approach to spoken language translation (SLT), the ASR output is typically punctuated and segmented into sentences before being passed to MT, since the latter is typically trained on written text. However, erroneous segmentation, due to poor sentence-final punctuation by the ASR system, leads to degradation in translation quality, especially in the simultaneous (online) setting where the input is continuously updated. To reduce the influence of automatic segmentation, we present a sliding window approach to translate raw ASR outputs (online or offline) without needing to rely on an automatic segmenter. We train translation models using parallel windows (instead of parallel sentences) extracted from the original training data. At test time, we translate at the window level and join the translated windows using a simple approach to generate the final translation. Experiments on English-to-German and English-to-Czech show that our approach improves by 1.3–2.0 BLEU points over the usual ASR-segmenter pipeline, and that the fixed-length window considerably reduces flicker compared to a baseline retranslation-based online SLT system.
1 Introduction
For machine translation (MT) with textual input, it is usual to segment the text into sentences before translation, with the boundaries of sentences in most text types indicated by punctuation. For spoken language translation (SLT), in contrast, the input is audio, so there is no punctuation provided to assist segmentation. Segmentation thus has to be guessed by the ASR system or a separate component. Perhaps more importantly, for many speech genres the input cannot easily be segmented into well-formed sentences as found in MT training data, giving a mismatch between training and test.
In order to address the segmentation problem in SLT, systems often include a segmentation component in their pipeline, e.g. Cho et al. (2017). In other words, a typical cascaded SLT system consists of automatic speech recognition (ASR, which outputs lowercased, unpunctuated text), a punctuator/segmenter (which adds punctuation and so defines segments), and an MT system. The segmenter can be a sequence-to-sequence model, and training data is easily synthesised from punctuated text. However, adding segmentation as an extra step has the disadvantage of introducing an extra component to be managed and deployed. Furthermore, errors in segmentation have been shown to contribute significantly to overall errors in SLT (Li et al., 2021), since neural MT is known to be susceptible to degradation from noisy input (Khayrallah and Koehn, 2018).
These issues with segmentation can be exacerbated in the online or simultaneous setting. This is an important use case for SLT where we want to produce the translations from live speech, as the speaker is talking. To minimise the latency of the translation, we would like to start translating before the speaker has finished their sentence. Some online low-latency ASR approaches will also revise their output after it has been produced, creating additional difficulties for the downstream components. In this scenario, the segmentation into sentences will be more uncertain and we are faced with the choice of waiting for the input to stabilise (so increasing latency) or translating early (potentially introducing more errors, or having to correct the output when the ASR is extended and updated).
To address the segmentation issue in SLT, Li et al. (2021) proposed a data augmentation technique which simulates bad segmentation in the training data. They concatenate two adjacent source sentences (and also the corresponding targets), and then truncate the start and end of the concatenated pair proportionally, as in the sketch below.
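A minimal sketch of this kind of augmentation, assuming token-list inputs; the function name and the truncation bounds are ours for illustration, not values taken from Li et al. (2021):

```python
import random

def augment_pair(src_a, tgt_a, src_b, tgt_b, max_trunc=0.3):
    """Simulate erroneous segmentation in the style of Li et al. (2021):
    concatenate two adjacent sentence pairs, then truncate the start and
    end proportionally. Inputs are token lists; max_trunc is illustrative."""
    src = src_a + src_b
    tgt = tgt_a + tgt_b
    # One fraction per end, shared by both sides, so that source and
    # target are truncated by (roughly) the same proportion.
    head = random.uniform(0.0, max_trunc)
    tail = random.uniform(0.0, max_trunc)
    def cut(tokens):
        n = len(tokens)
        return tokens[int(n * head): n - int(n * tail)]
    return cut(src), cut(tgt)
```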
We use a sliding window approach to translate unsegmented input. In this approach, we translate the ASR output as a series of overlapping windows, using a merging algorithm to turn the translated windows into a single continuous (but still sometimes updated) stream. The process is illustrated in Figure 1. To generate the training data, we convert the sentence-aligned training data into window-window pairs, and remove punctuation and casing from the source. We explain our algorithms in detail in Section 2.
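In outline, test-time processing could look like the following sketch, where `translate` and `merge` are hypothetical stand-ins for the MT system and the merging algorithm of Section 2:

```python
def sliding_window_translate(asr_hypotheses, translate, merge, window=20):
    """High-level sketch of window-level translation over unsegmented
    ASR output. `translate` is any MT model trained on parallel windows;
    `merge` splices a translated window into the running output stream
    (see Figure 1). Both helpers are assumptions, not code from the paper."""
    stream = []                                  # growing target-side output
    for tokens in asr_hypotheses:                # current ASR hypothesis
        src_window = tokens[-window:]            # slide over the last n tokens
        hyp = translate(" ".join(src_window))    # translate the window
        stream = merge(stream, hyp.split())      # splice into the stream
    return " ".join(stream)
```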
For online SLT, we use a retranslation approach (Niehues et al., 2016; Arivazhagan et al., 2020a), where the MT system retranslates a recent portion of the input each time there is an update from ASR. This approach has the advantage that it can use standard MT inference, including beam search, and does not require a modified inference engine as in streaming approaches (e.g. Ma et al. (2019)). Retranslation may introduce flicker, i.e. potentially disruptive changes of displayed text, when outputs are updated. Flicker can be traded off with latency by masking the last k words of the output (Arivazhagan et al., 2020a).¹ Our sliding window approach is easily combined with retranslation to create an online SLT system which can operate on unsegmented ASR. Each time there is an update from ASR, we retranslate the last n tokens and merge the latest translation into the output stream. Using the fixed-size window has the advantage of reducing flicker, since we control how much of the output stream can change on each retranslation.

¹ This paper also introduced the idea of biased beam search, where the translation of an extended prefix is soft-constrained to stay close to the translation of the prefix. Biased beam search significantly reduces flicker, but it requires that the ASR output has a fixed segmentation, and uses a modified MT inference engine.
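As a toy illustration of the mask-k trade-off (the name `mask_k` and the value k=3 are ours, not from the paper):

```python
def mask_k(translation_tokens, k=3):
    """Mask-k (Arivazhagan et al., 2020a): withhold the last k tokens of
    each retranslation, since the suffix is the part most likely to
    change when the ASR input is extended. Larger k reduces flicker at
    the cost of higher latency."""
    # e.g. mask_k(["wir", "haben", "das", "gesehen", "und"], k=2)
    #      -> ["wir", "haben", "das"]
    return translation_tokens[:-k] if k > 0 else translation_tokens
```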
Experiments on English→Czech and English→German show that our sliding window approach improves BLEU scores for both online and offline SLT. For the online case, our approach improves the tradeoff between latency and flicker.
2 Window-Based Translation
2.1 Preprocessing
To make the parallel corpus resemble ASR output, we remove all punctuation (and other special characters) from the source sentences and replace them with spaces. We then remove repeated spaces, and lowercase the source.
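A normaliser along these lines would suffice; the exact character class removed is our assumption, since the paper does not enumerate the special characters:

```python
import re

def asr_normalise(line: str) -> str:
    """Make written source text resemble ASR output: replace punctuation
    and other special characters with spaces, collapse repeated spaces,
    and lowercase."""
    line = re.sub(r"[^\w\s]", " ", line)  # punctuation/specials -> space
    line = re.sub(r"\s+", " ", line)      # collapse repeated whitespace
    return line.strip().lower()

# asr_normalise("Hello, world!")  ->  "hello world"
```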
[Figure 1 diagram omitted; only the caption is reproducible in text.]

Figure 1: Example of how our proposed window-based translation works at test time in the cases of a match and a no-match between the translations of two subsequent windows. The text inside the rectangular box is the source window at time t, which is translated into the output window (Tt) by the MT system. The text in blue (dark) shade shows the common segment between the output window (Tt) and the output stream (Ot) at time t. The text in red shade shows the segment newly added from the output window Tt into the output stream Ot+1. With no common segment between Tt and Ot ("No match"), we extend the input window into the history and translate again. "…" indicates there are more tokens. Note that we used characters here (instead of tokens) just for explanation.
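One possible realisation of this match / no-match logic, as we read it from the figure; this is our reconstruction under stated assumptions, not the paper's exact algorithm:

```python
def merge_window(stream, hyp, min_match=1):
    """Illustrative reconstruction of the merging step in Figure 1.
    Look for the longest suffix of the output stream O_t that also
    occurs in the translated window T_t; on a match, splice in the
    part of T_t that follows it. `stream` and `hyp` are token lists."""
    if not stream:
        return list(hyp), True  # nothing output yet: take the whole window
    for k in range(min(len(stream), len(hyp)), min_match - 1, -1):
        suffix = stream[-k:]
        for i in range(len(hyp) - k + 1):
            if hyp[i:i + k] == suffix:
                return stream + hyp[i + k:], True  # match: append new tail
    # No match: the caller extends the source window one token into the
    # history and retranslates, as in the lower half of Figure 1.
    return stream, False
```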
2.2 Generating the Window Pairs for Training
To convert the parallel corpus into a set of parallel windows, we use a word-alignment-based approach. We first word-align the pre-processed parallel corpus using fast_align (Dyer et al., 2013), then we concatenate each side of the corpus to give two long lines. Note, however, that the word alignments never cross sentence boundaries. We randomly select windows of length 15–25 from the target side, and use the word alignment to get the corresponding source window. The algorithms are described in Appendix B, and sketched below.
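A sketch of the extraction step; the data layout and details here are our assumptions (the actual algorithms are in Appendix B of the paper):

```python
import random

def extract_window_pair(src, tgt, align, min_len=15, max_len=25):
    """Pick a random target window of length 15-25 and project it to the
    source side via word alignments. `src`/`tgt` are the concatenated
    token lists of the two corpus sides (assumed much longer than a
    window) and `align` is a list of (src_idx, tgt_idx) links from
    fast_align over those token indices."""
    length = random.randint(min_len, max_len)
    t_start = random.randrange(len(tgt) - length)
    t_end = t_start + length
    # Source positions linked to any target token inside the window.
    linked = [s for s, t in align if t_start <= t < t_end]
    if not linked:
        return None  # unaligned window; skip and sample again
    src_window = src[min(linked): max(linked) + 1]
    return " ".join(src_window), " ".join(tgt[t_start:t_end])
```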
A subtle detail is whether the original corpus was or was not shuffled at the level of sentences. An original, non-shuffled corpus provides the MT