Information-Transport-based Policy for Simultaneous Translation
Shaolei Zhang 1,2, Yang Feng 1,2
1Key Laboratory of Intelligent Information Processing
Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
2University of Chinese Academy of Sciences, Beijing, China
{zhangshaolei20z,fengyang}@ict.ac.cn
Abstract

Simultaneous translation (ST) outputs translation while receiving the source inputs, and hence requires a policy to determine whether to translate a target token or wait for the next source token. The major challenge of ST is that each target token can only be translated based on the currently received source tokens, so the received source information directly affects the translation quality. Naturally, how much source information has been received for the translation of the current target token should be the pivotal evidence for the ST policy to decide between translating and waiting. In this paper, we treat translation as information transport from source to target and accordingly propose Information-Transport-based Simultaneous Translation (ITST). ITST quantifies the transported information weight from each source token to the current target token, and then decides whether to translate the target token according to its accumulated received information. Experiments on both text-to-text ST and speech-to-text ST (a.k.a. streaming speech translation) tasks show that ITST outperforms strong baselines and achieves state-of-the-art performance.¹
1 Introduction

Simultaneous translation (ST) (Cho and Esipova, 2016; Gu et al., 2017; Ma et al., 2019; Arivazhagan et al., 2019), which outputs translation while receiving the streaming inputs, is essential for many real-time scenarios, such as simultaneous interpretation, online subtitles, and live broadcasting. Compared with conventional full-sentence machine translation (MT) (Vaswani et al., 2017), ST additionally requires a read/write policy to decide whether to wait for the next source input (a.k.a. READ) or generate a target token (a.k.a. WRITE).

Corresponding author: Yang Feng.
¹Code is available at https://github.com/ictnlp/ITST
[Figure 1: Schematic diagram of ITST (e.g., δ = 0.7). Transported information weights (e.g., 0.15, 0.28, 0.02, 0.33) flow from the streaming inputs to the current target token; if the accumulated received information ≥ δ, then WRITE, else READ.]
The goal of ST is to achieve high-quality translation under low latency; however, the major challenge is that the low-latency requirement restricts the ST model to translating each target token only based on the currently received source tokens (Ma et al., 2019). To mitigate the impact of this restriction on translation quality, ST needs a reasonable read/write policy to ensure that, before translating, the received source information is sufficient to generate the current target token (Arivazhagan et al., 2019). To achieve this, the read/write policy should measure the amount of received source information: if the received source information is sufficient for translation, the model translates a target token; otherwise the model waits for the next input.

However, previous read/write policies, whether fixed or adaptive, often lack an explicit measure of how much source information has been received for the translation. A fixed policy decides READ/WRITE according to predefined rules (Ma et al., 2019; Zhang and Feng, 2021c) and sometimes forces the model to start translating even though the received source information is insufficient, thereby hurting the translation quality. An adaptive policy can dynamically adjust READ/WRITE (Arivazhagan et al., 2019; Ma et al., 2020c) to achieve better performance. However, previous adaptive policies often directly predict a variable based on the inputs to indicate the READ/WRITE decision (Arivazhagan et al., 2019; Ma et al., 2020c; Miao et al., 2021), without explicitly modeling the amount of information that the received source tokens provide to the currently generated target token.
On these grounds, we aim to develop a reasonable read/write policy that takes the received source information as the evidence for READ/WRITE. In the ST process, source tokens provide information while target tokens receive information and are then translated, so the translation process can be treated as information transport from source to target. Along this line, if we know how much information is transported from each source token to the target token, it is natural to grasp the total information provided by the received source tokens for the current target token, thereby ensuring that the source information is sufficient for translation.

To this end, we propose Information-Transport-based Simultaneous Translation (ITST). Borrowing the idea from the optimal transport problem (Villani, 2008), ITST explicitly quantifies the transported information weight from each source token to the current target token during translation. Then, ITST starts translating after judging that the amount of information provided by the received source tokens for the current target token has reached a sufficient proportion. As shown in the schematic diagram in Figure 1, assuming that 70% of the source information is sufficient for translation, ITST first quantifies the transported information weight from each source token to the current target token (e.g., 0.15, 0.28, ...). With the first three source tokens, the accumulated received information is 45%, less than 70%, so ITST selects READ. After receiving the fourth source token, the accumulated information received by the current target token becomes 78%, so ITST selects WRITE and translates the current target token. Experiments on both text-to-text and speech-to-text simultaneous translation tasks show that ITST outperforms strong baselines and achieves state-of-the-art performance.
2 Background

Simultaneous Translation  For the ST task, we denote the source sequence as x = (x_1, ..., x_J) and the corresponding source hidden states as z = (z_1, ..., z_J), with source length J. The model generates a target sequence y = (y_1, ..., y_I) and the corresponding target hidden states s = (s_1, ..., s_I), with target length I. Since the ST model outputs translation while receiving the source inputs, we denote the number of received source tokens when translating y_i as g_i. Then, the probability of generating y_i is p(y_i | x_{≤g_i}, y_{<i}; θ), where θ denotes the model parameters, x_{≤g_i} is the first g_i source tokens, and y_{<i} is the previous target tokens. Accordingly, the ST model is trained by minimizing the cross-entropy loss:

\mathcal{L}_{ce} = -\sum_{i=1}^{I} \log p(y_i^{\star} \mid \mathbf{x}_{\le g_i}, \mathbf{y}_{<i}; \theta),    (1)

where y_i^{\star} is the ground-truth target token.
Cross-attention  Translation models often use cross-attention to measure the similarity between the target token and the source token (Vaswani et al., 2017), thereby weighting the source information (Wiegreffe and Pinter, 2019). Given the target hidden states s and source hidden states z, the attention weight α_ij between y_i and x_j is calculated as:

\alpha_{ij} = \mathrm{softmax}\!\left(\frac{(s_i W^{Q})(z_j W^{K})^{\top}}{\sqrt{d_k}}\right),    (2)

where W^{Q} and W^{K} are projection parameters and d_k is the dimension of the inputs. Then the context vector o_i is calculated as o_i = \sum_{j=1}^{J} \alpha_{ij}\, z_j W^{V}, where W^{V} are projection parameters.
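To make Eq.(2) concrete, here is a minimal single-head PyTorch sketch of this cross-attention (batch dimension omitted); the function and tensor names are illustrative rather than taken from any released implementation:

```python
import torch
import torch.nn.functional as F

def cross_attention(s, z, W_Q, W_K, W_V):
    """s: target hidden states (I, d); z: source hidden states (J, d).
    Returns attention weights alpha (I, J) and context vectors o (I, d_v)."""
    d_k = W_K.shape[1]
    q = s @ W_Q                        # (I, d_k)
    k = z @ W_K                        # (J, d_k)
    v = z @ W_V                        # (J, d_v)
    scores = q @ k.T / d_k ** 0.5      # scaled dot products, Eq.(2) before softmax
    alpha = F.softmax(scores, dim=-1)  # normalize over source positions j
    o = alpha @ v                      # o_i = sum_j alpha_ij (z_j W^V)
    return alpha, o

# Toy usage:
# s, z = torch.randn(5, 64), torch.randn(7, 64)
# W_Q = W_K = W_V = torch.randn(64, 64)
# alpha, o = cross_attention(s, z, W_Q, W_K, W_V)
```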
3 The Proposed Method

We propose Information-Transport-based Simultaneous Translation (ITST) to explicitly measure the source information transported to the currently generated target token. During the ST process, ITST models the information transport to grasp how much information is transported from each source token to the current target token (Sec. 3.1). Then, ITST starts translating a target token once its accumulated received information is sufficient (Sec. 3.2). Details of ITST are as follows.
3.1 Information Transport

Definition of Information Transport  Borrowing the idea of the optimal transport (OT) problem (Dantzig, 1949), which aims to find a transport matrix that transforms one probability distribution into another while minimizing the transport cost, we treat the translation process in ST as information transport from source to target. We denote the information transport as the matrix T = (T_{ij})_{I×J}, where T_{ij} ∈ (0, 1) is the transported information weight from x_j to y_i. Then, we assume that the total information received by each target token² for translation is 1, i.e., \sum_{j=1}^{J} T_{ij} = 1.

²Since the participation degree of each source token in translation is often different, we relax the constraint on the total information provided by each source token (Kusner et al., 2015).

[Figure 2: Schematic diagram of learning the information transport matrix T from both translation and latency. Panels: (a) cross-attention matrix; (b) information transport matrix under the translation constraints (each row sums to 1); (c) latency cost matrix under the latency constraints.]

Under this definition, ITST quantifies the transported information weight T_{ij} based on the current target hidden state s_i and source hidden state z_j:

T_{ij} = \mathrm{sigmoid}\!\left(\frac{(s_i V^{Q})(z_j V^{K})^{\top}}{\sqrt{d_k}}\right),    (3)

where V^{Q} and V^{K} are learnable parameters.
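Below is a minimal PyTorch sketch of Eq.(3), computing the whole I×J transport matrix at once; V_Q and V_K stand in for the learnable projections, and the names are illustrative rather than copied from the released ITST code:

```python
import torch

def information_transport(s, z, V_Q, V_K):
    """s: target hidden states (I, d); z: source hidden states (J, d).
    Returns T (I, J) with each T_ij in (0, 1), as in Eq.(3)."""
    d_k = V_K.shape[1]
    scores = (s @ V_Q) @ (z @ V_K).T / d_k ** 0.5
    # sigmoid is applied element-wise, so rows of T are not yet normalized;
    # the soft constraint sum_j T_ij = 1 is pushed by the L_norm term during training.
    return torch.sigmoid(scores)
```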
Constraints on Information Transport  Similar to the OT problem, modeling information transport in translation also requires transport costs to constrain the transported weights. Especially for ST, we should constrain the information transport T from the aspects of translation and latency, where the translation constraints ensure that the information transport correctly reflects the translation process from source to target, and the latency constraints regularize the information transport to avoid anomalous translation latency.

For the translation constraints, the information transport T should learn which source token contributes more to the translation of the current target token, i.e., reflect the translation process. Fortunately, the cross-attention α_ij in the translation model is used to control the weight that source token x_j provides to the target token y_i (Abnar and Zuidema, 2020; Chen et al., 2020; Zhang and Feng, 2021b), so we integrate the information transport into the cross-attention. As shown in Figure 2(a), we multiply T_ij with the cross-attention α_ij and then normalize to get the final attention β_ij:

\hat{\beta}_{ij} = \alpha_{ij} \times T_{ij}, \qquad \beta_{ij} = \hat{\beta}_{ij} \Big/ \sum_{j=1}^{J} \hat{\beta}_{ij}.    (4)

Then the context vector is calculated as o_i = \sum_{j=1}^{J} \beta_{ij}\, z_j W^{V}. In this way, the information transport T can be jointly learned with the cross-attention in the translation process through the original cross-entropy loss \mathcal{L}_{ce}.
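A small sketch of Eq.(4), assuming alpha and T come from the two sketches above; the renormalized beta replaces alpha when computing the context vectors:

```python
def fuse_attention_with_transport(alpha, T, z, W_V, eps=1e-9):
    """alpha, T: (I, J); z: source hidden states (J, d); W_V: value projection.
    Returns the fused attention beta (I, J) and context vectors o (I, d_v)."""
    beta_hat = alpha * T                                          # element-wise product
    beta = beta_hat / (beta_hat.sum(dim=-1, keepdim=True) + eps)  # renormalize over sources
    o = beta @ (z @ W_V)                                          # o_i = sum_j beta_ij (z_j W^V)
    return beta, o
```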
For the latency constraints, the information transport T also affects the translation latency, since the model should start translating only after receiving a certain amount of information. Specifically, for the current target token, if too much information is provided by source tokens lagging behind, waiting for those source tokens will cause high latency, while too much information provided by the front source tokens will make the model start translating prematurely, resulting in extremely low latency and poor translation quality (Zhang and Feng, 2022c). Therefore, we aim to avoid too much information weight being transported from source tokens that are located too early or too late relative to the position of the current target token, thereby obtaining a suitable latency.

To this end, we introduce a latency cost matrix C = (C_{ij})_{I×J} in diagonal form to softly regularize the information transport, where C_{ij} is the latency cost of transporting information from x_j to y_i, related to their relative offset:

C_{ij} = \frac{1}{I \times J} \max\!\left(\left|\, j - \frac{i \times J}{I} \right| - \xi,\; 0\right).    (5)

|j - \frac{i \times J}{I}| is the relative offset between x_j and y_i. ξ is a hyperparameter that controls the acceptable offset (i.e., transports inside it cost 0), and we set ξ = 1 in our experiments. In the example latency cost matrix shown in Figure 2(c), a transported weight costs 0 when the relative offset is less than 1, and the cost of other transports is positively related to the offset. We compare different settings of the latency cost in Sec. 5.1 and Appendix A.1.

Given the latency cost matrix C, the latency loss \mathcal{L}_{latency} of the information transport T is:

\mathcal{L}_{latency} = \sum_{i=1}^{I} \sum_{j=1}^{J} T_{ij} \times C_{ij}.    (6)
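A sketch of Eqs.(5) and (6) under the reconstruction above (zero cost within the acceptable offset ξ, 1-based positions); this is an illustrative implementation, not the released code:

```python
import torch

def latency_cost_matrix(I, J, xi=1.0):
    """C_ij = max(|j - i*J/I| - xi, 0) / (I*J), Eq.(5)."""
    i = torch.arange(1, I + 1, dtype=torch.float32).unsqueeze(1)  # (I, 1)
    j = torch.arange(1, J + 1, dtype=torch.float32).unsqueeze(0)  # (1, J)
    offset = (j - i * J / I).abs()                                # relative offset of x_j vs. y_i
    return torch.clamp(offset - xi, min=0.0) / (I * J)

def latency_loss(T, C):
    """L_latency = sum_ij T_ij * C_ij, Eq.(6)."""
    return (T * C).sum()
```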
Algorithm 1: Read/Write Policy of ITST
Input: streaming inputs x, threshold δ; i = 1, j = 1, y_0 = ⟨BOS⟩
Output: target outputs y
while y_{i-1} ≠ ⟨EOS⟩ do
    calculate the information transport (T_{i1}, ..., T_{ij}) as in Eq.(3);
    if \sum_{l=1}^{j} T_{il} ≥ δ then          ▷ WRITE
        translate y_i with (x_1, ..., x_j);
        i ← i + 1;
    else                                        ▷ READ
        wait for the next source input x_{j+1};
        j ← j + 1;
    end if
end while

Learning Objective  Accordingly, the learning of the ST model θ with the proposed information transport T can be formalized as:

\min_{\theta, T} \;\; \mathcal{L}_{ce} + \mathcal{L}_{latency}    (7)
\text{s.t.} \quad \sum_{j=1}^{J} T_{ij} = 1, \quad 1 \le i \le I    (8)
\qquad\;\;\; T_{ij} \ge 0, \quad 1 \le i \le I,\; 1 \le j \le J    (9)

Eq.(8) constrains the total information transported to each target token to be 1 (refer to the definition), and Eq.(9) constrains the transported weights to be positive, which is realized by sigmoid(·) in Eq.(3). Then, we convert the normalization constraint on T_{ij} in Eq.(8) into the following regularization term:

\mathcal{L}_{norm} = \sum_{i=1}^{I} \left( \sum_{j=1}^{J} T_{ij} - 1 \right)^{2}.    (10)

Therefore, the total loss \mathcal{L}_{ITST} is calculated as:

\mathcal{L}_{ITST} = \mathcal{L}_{ce} + \mathcal{L}_{latency} + \mathcal{L}_{norm}.    (11)
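Putting the objective together, a sketch of Eqs.(10)–(11); the cross-entropy term is assumed to be computed by the underlying translation model:

```python
def itst_loss(ce_loss, T, C):
    """L_ITST = L_ce + L_latency + L_norm, Eqs.(6), (10) and (11)."""
    latency = (T * C).sum()                    # Eq.(6)
    norm = ((T.sum(dim=-1) - 1.0) ** 2).sum()  # Eq.(10): soft row-normalization of T
    return ce_loss + latency + norm            # Eq.(11)
```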
3.2 Information Transport based Policy

Read/Write Policy  After grasping the information transported from each source token to the current target token, we propose an information-transport-based policy accordingly. With streaming inputs, ITST receives source tokens one by one and transports their information to the current target token; ITST then starts translating when the accumulated received information is sufficient. To obtain a controllable latency (Ma et al., 2019) during testing, a threshold δ is introduced to indicate what proportion of source information is sufficient for translation. Therefore, as shown in Algorithm 1, ITST selects WRITE once the accumulated received source information of the current target token, \sum_{l=1}^{j} T_{il}, is greater than the threshold δ; otherwise ITST selects READ.

ITST can perform translation under different latency by adjusting the threshold δ. With a larger δ, ITST tends to wait for more transported information, so the latency becomes higher; conversely, the latency becomes lower with a smaller δ.
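A Python sketch of this read/write policy (Algorithm 1); compute_transport_row, translate_step, and the stream interface are hypothetical placeholders for the underlying ST model and input source:

```python
def itst_decode(model, stream, delta, max_len=200):
    """Greedy ITST inference: WRITE y_i once sum_{l<=j} T_il >= delta, otherwise READ."""
    x = [stream.next_source_token()]               # j = 1: read the first source token
    y = ["<BOS>"]
    while y[-1] != "<EOS>" and len(y) <= max_len:
        T_row = model.compute_transport_row(x, y)  # (T_i1, ..., T_ij) via Eq.(3)
        if sum(T_row) >= delta or stream.finished():
            y.append(model.translate_step(x, y))   # WRITE the next target token
        else:
            x.append(stream.next_source_token())   # READ the next source token
    return y[1:]                                   # drop <BOS>
```

The extra stream.finished() check simply forces WRITE once the source is exhausted, a detail the pseudocode leaves implicit.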
Curriculum-based Training  Besides a reasonable read/write policy, the ST model also requires the capability of translating based on incomplete source information. Therefore, we apply the threshold in training as well, denoted as δ_train, and accordingly mask out the rest of the source tokens once the accumulated information of each target token exceeds δ_train. Formally, given δ_train, y_i is translated based on the first g_i source tokens, where g_i is:

g_i = \operatorname*{argmin}_{j} \;\; \sum_{l=1}^{j} T_{il} \ge \delta_{train}.    (12)

Then, we mask out the source tokens x_j with j > g_i during training to simulate the streaming inputs.
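A sketch of Eq.(12) in PyTorch: g_i is the smallest prefix length whose accumulated transport reaches δ_train, and source positions beyond g_i are masked for that target token (names are illustrative):

```python
import torch

def prefix_lengths_and_mask(T, delta_train):
    """T: (I, J) with non-negative entries. Returns g (I,) and a boolean mask (I, J)
    that is True where source position j is visible to target position i."""
    cum = torch.cumsum(T, dim=-1)              # accumulated information per source prefix
    # g_i = smallest j with cum_ij >= delta_train (cumsum is monotone since T >= 0);
    # fall back to J if the threshold is never reached.
    g = ((cum < delta_train).sum(dim=-1) + 1).clamp(max=T.shape[1])
    j_idx = torch.arange(1, T.shape[1] + 1).unsqueeze(0)  # 1-based source positions, (1, J)
    mask = j_idx <= g.unsqueeze(1)                        # mask out x_j with j > g_i
    return g, mask
```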
Regarding how to set δ_train during training, unlike previous methods that train multiple separate ST models for different thresholds (Ma et al., 2019, 2020c) or randomly sample different thresholds (Elbayad et al., 2020; Zhang and Feng, 2021c), we propose curriculum-based training for ITST to train one universal model that can perform ST under arbitrary latency (i.e., various δ during testing).

The proposed curriculum-based training follows an easy-to-hard schedule. At the beginning of training, we let the model preferentially focus on learning translation and information transport with richer source information. Then, we gradually reduce the source information as training progresses to let the ST model learn to translate with incomplete source inputs. Therefore, δ_train is dynamically adjusted according to an exponentially decaying schedule during training:

\delta_{train} = \delta_{min} + (1 - \delta_{min}) \times \exp\!\left(-\frac{N_{update}}{d}\right),    (13)

where N_update is the number of update steps and d is a hyperparameter that controls the decay rate. δ_min is the minimum amount of information required, and we set δ_min = 0.5 in the experiments. Thus, during training, the information received by each target token gradually decays from 100% to 50%.
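A sketch of the decay schedule in Eq.(13); the value of d in the usage comment is arbitrary, since the paper treats it as a tunable hyperparameter:

```python
import math

def delta_train(n_update, d, delta_min=0.5):
    """delta_train = delta_min + (1 - delta_min) * exp(-n_update / d), Eq.(13)."""
    return delta_min + (1.0 - delta_min) * math.exp(-n_update / d)

# Example: with d = 30000, delta_train is 1.0 at step 0 and about 0.57 after 60k updates,
# so each target token gradually sees less of the source as training progresses.
```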
4 Experiments

4.1 Datasets

We conduct experiments on both text-to-text ST and speech-to-text ST tasks.

Text-to-text ST (T2T-ST)  IWSLT15³ English→Vietnamese (En→Vi) (133K pairs) (Cettolo et al., 2015). We use TED tst2012 as the validation set (1553 pairs) and TED tst2013 as the test set (1268 pairs). Following the previous setting (Raffel et al., 2017; Ma et al., 2020c), we replace tokens whose frequency is less than 5 with ⟨unk⟩; the vocabulary sizes are 17K and 7.7K for English and Vietnamese, respectively.

WMT15⁴ German→English (De→En) (4.5M pairs). We use newstest2013 as the validation set (3000 pairs) and newstest2015 as the test set (2169 pairs). 32K BPE (Sennrich et al., 2016) is applied and the vocabulary is shared across languages.

Speech-to-text ST (S2T-ST)  MuST-C⁵ English→German (En→De) (234K pairs) and English→Spanish (En→Es) (270K pairs) (Di Gangi et al., 2019). We use dev as the validation set (1423 pairs for En→De, 1316 pairs for En→Es) and tst-COMMON as the test set (2641 pairs for En→De, 2502 pairs for En→Es), respectively. Following Ma et al. (2020b), we use Kaldi (Povey et al., 2011) to extract 80-dimensional log-mel filter bank features for speech, computed with a 25 ms window size and a 10 ms window shift, and we use SentencePiece (Kudo and Richardson, 2018) to generate a unigram vocabulary of size 10000 for the source and target text, respectively.

³nlp.stanford.edu/projects/nmt/
⁴www.statmt.org/wmt15/
⁵https://ict.fbk.eu/must-c
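As a reference for the speech preprocessing above, a minimal sketch using torchaudio's Kaldi-compatible frontend (assumed here in place of the Kaldi toolkit itself):

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

def extract_fbank(wav_path):
    """Returns (num_frames, 80) log-mel filter bank features, 25 ms window / 10 ms shift."""
    waveform, sample_rate = torchaudio.load(wav_path)
    return kaldi.fbank(
        waveform,
        num_mel_bins=80,
        frame_length=25.0,       # window size in ms
        frame_shift=10.0,        # window shift in ms
        sample_frequency=sample_rate,
    )
```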
4.2 Experimental Settings

We conduct experiments on the following systems. All implementations are based on Transformer (Vaswani et al., 2017) and adapted from the Fairseq library (Ott et al., 2019).

Offline  Full-sentence MT (Vaswani et al., 2017), which waits for the complete source inputs and then starts translating.

Wait-k  Wait-k policy (Ma et al., 2019), the most widely used fixed policy, which first READs k source tokens and then alternately READs one token and WRITEs one token.

Multipath Wait-k  An efficient training method for wait-k (Elbayad et al., 2020), which randomly samples different k between batches during training.

Adaptive Wait-k  A heuristic composition of multiple wait-k models (k = 1, ..., 13) (Zheng et al., 2020), which decides whether to translate according to the generation probabilities of the wait-k models.

MoE Wait-k⁶  Mixture-of-experts wait-k policy (Zhang and Feng, 2021c), the state-of-the-art fixed policy, which applies multiple experts to learn multiple wait-k policies during training.

MMA⁷  Monotonic multi-head attention (Ma et al., 2020c), which predicts a Bernoulli variable to decide READ/WRITE; the Bernoulli variable is jointly learned with multi-head attention.

GSiMT  Generative ST (Miao et al., 2021), which also predicts a Bernoulli variable to decide READ/WRITE; the variable is trained within a generative framework via dynamic programming.

RealTranS  End-to-end simultaneous speech translation with the Wait-K-Stride-N strategy (Zeng et al., 2021), which waits for N frames at each step.

MoSST  Monotonic-segmented streaming speech translation (Dong et al., 2022), which uses an integrate-and-fire method to segment the speech.

ITST  The proposed method in Sec. 3.

⁶github.com/ictnlp/MoE-Waitk
⁷github.com/pytorch/fairseq/tree/master/examples/simultaneous_translation
T2T-ST Settings  We apply Transformer-Small (4 heads) for En→Vi and Transformer-Base/Big (8/16 heads) for De→En. Note that we apply a unidirectional encoder for the Transformer to enable simultaneous decoding. Since GSiMT involves dynamic programming, which makes its training expensive, we report GSiMT on WMT15 De→En (Base) (Miao et al., 2021). For T2T-ST evaluation, we report BLEU (Papineni et al., 2002) for translation quality and Average Lagging (AL, token) (Ma et al., 2019) for latency. We also give the results with SacreBLEU in Appendix B.

S2T-ST Settings  The proposed ITST can perform end-to-end speech-to-text ST in two manners: fixed pre-decision and flexible pre-decision (Ma et al., 2020b). For fixed pre-decision, following Ma et al. (2020b), we apply ConvTransformer-Espnet (4 heads) (Inaguma et al., 2020) for both En→De and En→Es, which adds a 3-layer convolutional network before the encoder to capture the speech features. Note that the encoder is also unidirectional for simultaneous decoding. The convolutional layers and encoder are initialized from the pre-trained ASR task. All systems make a fixed pre-decision of READ/WRITE every 7 source tokens (i.e., every 280 ms). For flexible pre-decision,