Wait-info Policy: Balancing Source and Target at Information Level for Simultaneous Machine Translation

Shaolei Zhang 1,2, Shoutao Guo 1,2, Yang Feng 1,2
1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
2 University of Chinese Academy of Sciences, Beijing, China
{zhangshaolei20z,guoshoutao22z,fengyang}@ict.ac.cn

arXiv:2210.11220v1 [cs.CL] 20 Oct 2022
Abstract
Simultaneous machine translation (SiMT) outputs the translation while receiving the source inputs, and hence needs to balance the received source information and translated target information to make a reasonable decision between waiting for inputs or outputting translation. Previous methods always balance source and target information at the token level, either directly waiting for a fixed number of tokens or adjusting the waiting based on the current token. In this paper, we propose a Wait-info Policy to balance source and target at the information level. We first quantify the amount of information contained in each token, named info. Then during simultaneous translation, the decision of waiting or outputting is made based on the comparison between the total info of previous target outputs and that of the received source inputs. Experiments show that our method outperforms strong baselines and achieves a better balance via the proposed info.
1 Introduction
Simultaneous machine translation (SiMT) (Cho and Esipova, 2016; Gu et al., 2017; Ma et al., 2019) outputs the translation while receiving the source sentence, aiming at the trade-off between translation quality and latency. Therefore, a policy is required for SiMT to decide between waiting for the source inputs (i.e., READ) or outputting translations (i.e., WRITE), the core of which is to wisely balance the received source information and the translated target information. When the source information is less, the model should wait for more inputs for a high-quality translation; conversely, when the translated target information is less, the model should output translations for a low latency.
Corresponding author: Yang Feng.
Code is available at https://github.com/ictnlp/Wait-info.

Figure 1: Schematic diagram of Wait-info vs. Wait-k. (a) Wait-k policy: treats each token equally, and lags k tokens. (b) Wait-info policy: quantifies the information in each token, named info (e.g., 0.5, 1.7, ...), and keeps the target information always less than the received source information by K info.
Existing SiMT policies, involving fixed and adaptive, always balance source and target at the token level, i.e., treating each source and target token equally when determining READ/WRITE. Fixed policies decide READ/WRITE based on the number of received source tokens (Ma et al., 2019; Zhang and Feng, 2021c); for example, the wait-k policy (Ma et al., 2019) simply considers each source token to be equivalent and lets the target outputs always lag the source inputs by k tokens, as shown in Figure 1(a). Fixed policies are always limited by the fact that the policy cannot be adjusted according to complex inputs, making it difficult for them to get the best trade-off. Adaptive policies predict READ/WRITE according to the current source and target tokens (Arivazhagan et al., 2019; Ma et al., 2020) and thereby get a better trade-off, but they often ignore and under-utilize the difference between tokens when deciding READ/WRITE. Besides, existing adaptive policies always rely on complicated training (Ma et al., 2020; Miao et al., 2021) or additional labeled data (Zheng et al., 2019; Zhang et al., 2020; Alinejad et al., 2021), making them more computationally expensive than fixed policies.
Treating each token equally when balancing source and target is not the optimal choice for the SiMT policy. Many studies have shown that different words have significantly different functions in translation (Lin et al., 2018; Moradi et al., 2019; Chen et al., 2020), and words are often divided into content words (i.e., noun, verb, ...) and function words (i.e., conjunction, preposition, ...), where the former express more important meaning and the latter are less informative. Accordingly, tokens with different amounts of information should also play different roles in the SiMT policy, where more informative tokens should play a more dominant role because they bring more information to the SiMT model (Zhang and Feng, 2022a,b). Therefore, explicitly differentiating various tokens rather than treating them equally when determining READ/WRITE will be beneficial to developing a more precise SiMT policy.
In this paper, we differentiate various source and target tokens based on the amount of information they contain, aiming to balance received source information and translated target information at the information level. To this end, we propose the wait-info policy, a simple yet effective policy for SiMT. As shown in Figure 1(b), we first quantify the amount of information contained in each token through a scalar, named info, which is jointly learned with the attention mechanism in an unsupervised manner. During simultaneous translation, READ/WRITE decisions are made by balancing the total info of the translated target outputs and the received source inputs. If the received source information exceeds the translated target information by K info or more, the model outputs translation; otherwise the model waits for the next input. Experiments and analyses show that our method outperforms strong baselines and effectively quantifies the information contained in each token.
2 Related Work
SiMT Policy
Recent policies fall into fixed and adaptive. For fixed policies, Ma et al. (2019) proposed the wait-k policy, which first READs k source tokens and then alternates READ and WRITE one token at a time. Elbayad et al. (2020) proposed an efficient multi-path training for the wait-k policy that randomly samples k during training. Zhang et al. (2021) proposed future-guided training for the wait-k policy, which introduces a full-sentence MT model to guide training. Zhang and Feng (2021a) proposed a char-level wait-k policy. Zhang and Feng (2021c) proposed a mixture-of-experts wait-k policy to develop a universal SiMT model. For adaptive policies, Gu et al. (2017) trained an agent to decide READ/WRITE via reinforcement learning. Arivazhagan et al. (2019) proposed MILk, which predicts a Bernoulli variable to determine READ/WRITE. Ma et al. (2020) proposed MMA to implement MILk on Transformer. Zhang and Feng (2022c) proposed dual-path SiMT to enhance MMA with dual learning. Zheng et al. (2020) developed adaptive wait-k through a heuristic ensemble of multiple wait-k models. Miao et al. (2021) proposed a generative framework to generate READ/WRITE decisions. Zhang and Feng (2022a) proposed Gaussian multi-head attention to decide READ/WRITE based on alignments.
Previous policies always treat each token equally when determining READ/WRITE, ignoring the fact that tokens with different amounts of information often play different roles in the SiMT policy. Our method aims to develop a more precise SiMT policy by differentiating the importance of various tokens when determining READ/WRITE.
Information Modeling in NMT
Linguistics divides words into content words and function words according to their information and functions in the sentence. Therefore, modeling the information contained in each word is often used to improve NMT performance. Moradi et al. (2019) and Chen et al. (2020) used word frequency to indicate how much information each word contains, where words with lower frequencies contain more information. Liu et al. (2020) and Kobayashi et al. (2020) found that the norm of the word embedding is related to the token information in NMT. Lin et al. (2018) and Zhang and Feng (2021b) argued that the attention mechanism for different types of words should be different, where the attention distribution of content words tends to be more concentrated. Our method explores the usefulness of modeling information for the SiMT policy, and proposes an unsupervised method to quantify the information of tokens through the attention mechanism, achieving good explainability.
3 Background
Full-sentence MT
For a translation task, we denote the source sentence as $x = (x_1, \cdots, x_n)$ with source length $n$ and the target sentence as $y = (y_1, \cdots, y_m)$ with target length $m$. Transformer (Vaswani et al., 2017) is the most widely used architecture for full-sentence MT, consisting of an encoder and a decoder. The encoder maps $x$ to source hidden states $z = (z_1, \cdots, z_n)$. The decoder maps $y$ to target hidden states $s = (s_1, \cdots, s_m)$, and then performs translating. Specifically, each encoder layer contains two sub-layers: self-attention and a feed-forward network (FFN), while each decoder layer contains three sub-layers: self-attention, cross-attention and FFN. Both self-attention and cross-attention are implemented through the dot-product attention between query $Q$ and key $K$, calculated as:

$$e_{ij} = \frac{(Q_i W^Q)(K_j W^K)^{\top}}{\sqrt{d_k}}, \qquad (1)$$
$$\alpha_{ij} = \mathrm{softmax}(e_{ij}), \qquad (2)$$
where $e_{ij}$ is the similarity score between $Q_i$ and $K_j$, and $\alpha_{ij}$ is the normalized attention weight. $d_k$ is the input dimension, and $W^Q$ and $W^K$ are projection parameters. More specifically, self-attention extracts the monolingual representation of source or target tokens, so the query and key both come from the source hidden states $z$ or the target hidden states $s$. Cross-attention extracts the cross-lingual representation by measuring the correlation between target and source tokens, so the query comes from the target hidden states $s$ and the key comes from the source hidden states $z$.
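As a reference point, here is a minimal PyTorch sketch of this dot-product attention (single head, no masking; the tensor shapes and variable names are assumptions made for illustration, not the authors' code):

import torch
import torch.nn.functional as F

def dot_product_attention(Q, K, W_Q, W_K):
    """Attention weights of Eq.(1-2): e_ij = (Q_i W^Q)(K_j W^K)^T / sqrt(d_k),
    alpha_ij = softmax(e_ij)."""
    d_k = Q.size(-1)
    e = (Q @ W_Q) @ (K @ W_K).transpose(-2, -1) / d_k ** 0.5  # similarity scores e_ij
    return F.softmax(e, dim=-1)                               # attention weights alpha_ij

# Hypothetical shapes: 5 tokens, hidden size 8.
Q = K = torch.randn(5, 8)
W_Q, W_K = torch.randn(8, 8), torch.randn(8, 8)
alpha = dot_product_attention(Q, K, W_Q, W_K)  # (5, 5); each row sums to 1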
Wait-k Policy
Simultaneous machine translation (SiMT) determines when to start translating each target token through a policy. The wait-k policy (Ma et al., 2019) is the most widely used policy for SiMT, which first waits for $k$ source tokens and then alternates between translating one token and waiting for one token, i.e., the target outputs always lag $k$ tokens behind the source inputs. Formally, when translating $y_i$, the wait-k policy forces the SiMT model to wait for $g_k(i)$ source tokens, where $g_k(i)$ is calculated as:

$$g_k(i) = \min\{k + i - 1,\; n\}. \qquad (3)$$
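A minimal sketch of this schedule (illustrative only; target positions are 1-indexed as in Eq.(3)):

def g_wait_k(i, k, n):
    """Number of source tokens read before emitting target token y_i
    under the wait-k policy (Eq. 3): g_k(i) = min(k + i - 1, n)."""
    return min(k + i - 1, n)

# Hypothetical example with k=3 and source length n=6:
# the first target token waits for 3 source tokens, the second for 4, ..., capped at n.
print([g_wait_k(i, k=3, n=6) for i in range(1, 6)])  # [3, 4, 5, 6, 6]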
4 Method
To differentiate various tokens when determining READ/WRITE, we quantify the amount of information contained in each source and target token, named info. As shown in Figure 2, we propose an info-aware Transformer to jointly learn the quantified info with the attention mechanism in an unsupervised manner. Then, based on the quantified info, we propose the wait-info policy to balance the received source information and translated target information. The details are as follows.
Figure 2: Architecture of the proposed info-aware Transformer, where we omit residual connection and layer normalization in the figure for clarity. (The figure shows an Info Quantizer before the encoder and decoder; each of the N encoder layers contains info-aware self-attention and feed-forward sub-layers, and each of the N decoder layers contains info-aware self-attention, info-consistent cross-attention and feed-forward sub-layers, followed by a linear layer and softmax over the output probabilities.)
4.1 Info Quantification
To quantify the amount of information in each token, we use a scalar to represent how much information each token contains, named info. We denote the info of the source tokens and the target tokens as $I^{src} \in \mathbb{R}^{n \times 1}$ and $I^{tgt} \in \mathbb{R}^{m \times 1}$, respectively, where $I^{src}_j$ and $I^{tgt}_i$ represent the info of $x_j$ and $y_i$, and higher info means that the token carries more information.

To predict $I^{src}$ and $I^{tgt}$, we introduce two Info Quantizers before the encoder and decoder to respectively quantify the information of each source and target token, as shown in Figure 2. Specifically, the info quantizer is implemented by a 3-layer feed-forward network (FFN):

$$I^{src} = 2 \times \mathrm{sigmoid}(\mathrm{FFN}(x)), \qquad (4)$$
$$I^{tgt} = 2 \times \mathrm{sigmoid}(\mathrm{FFN}(y)). \qquad (5)$$

For the formulation of the following wait-info policy, $2 \times \mathrm{sigmoid}(\cdot)$ is used to restrict the quantified info to $I^{src}_j, I^{tgt}_i \in (0, 2)$.
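A possible PyTorch sketch of such an info quantizer (the hidden size and the embedding interface are assumptions; the released code may differ):

import torch
import torch.nn as nn

class InfoQuantizer(nn.Module):
    """3-layer FFN mapping each token embedding to a scalar info in (0, 2),
    following Eq.(4-5): I = 2 * sigmoid(FFN(x))."""
    def __init__(self, d_model, d_hidden=512):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, emb):  # emb: (batch, seq_len, d_model) token embeddings
        # One info scalar per token, restricted to (0, 2) by 2 * sigmoid(.)
        return 2 * torch.sigmoid(self.ffn(emb)).squeeze(-1)

# Hypothetical usage on 6 source-token embeddings with d_model = 512.
quantizer = InfoQuantizer(d_model=512)
I_src = quantizer(torch.randn(1, 6, 512))  # shape (1, 6), values in (0, 2)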
Further, in a translation task, the source sentence and target sentence should be semantically equivalent (Finch et al., 2005; Guo et al., 2022), so the total information of the source tokens should be equal to that of the target tokens. To this end, we introduce an info-sum loss $\mathcal{L}_{sum}$ to constrain the total info of the source tokens and target tokens, calculated as:

$$\mathcal{L}_{sum} = \left( \sum_{j=1}^{n} I^{src}_j - \zeta \right)^2 + \left( \sum_{i=1}^{m} I^{tgt}_i - \zeta \right)^2, \qquad (6)$$

where $\zeta$ is a hyperparameter to represent the total info, and we set $\zeta = \frac{m+n}{2}$ (i.e., the average length of source and target) to control the average info to be around 1. Therefore, the final loss $\mathcal{L}$ is:

$$\mathcal{L} = \mathcal{L}_{ce} + \lambda \mathcal{L}_{sum}, \qquad (7)$$

where $\mathcal{L}_{ce}$ is the original cross-entropy loss for the translation (Vaswani et al., 2017). $\lambda$ is a hyperparameter and we set $\lambda = 0.3$ in our experiments.
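A minimal sketch of this objective (assuming unpadded single sequences; batching and masking are omitted):

import torch

def info_sum_loss(I_src, I_tgt):
    """Eq.(6): pull the total source info and total target info towards
    zeta = (m + n) / 2, so the average info per token is around 1."""
    n, m = I_src.size(0), I_tgt.size(0)
    zeta = (m + n) / 2
    return (I_src.sum() - zeta) ** 2 + (I_tgt.sum() - zeta) ** 2

def total_loss(ce_loss, I_src, I_tgt, lam=0.3):
    """Eq.(7): cross-entropy loss plus the weighted info-sum loss."""
    return ce_loss + lam * info_sum_loss(I_src, I_tgt)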
4.2 Learning of Quantified Info
The form of the quantified info $I^{src}$ and $I^{tgt}$ has been constrained through Eq.(4-7); the key challenge is then how to encourage the quantified info to accurately reflect the amount of information each token contains. Since tokens with different amounts of information often show different preferences in the attention distribution (Lin et al., 2018), we propose an unsupervised method to learn the quantified info through the attention mechanism. As shown in Figure 2, we introduce an info-aware Transformer, consisting of info-aware self-attention and info-consistent cross-attention.
Info-aware Self-attention
Self-attention in both the encoder and decoder is used to extract monolingual representations of tokens, where tokens with different amounts of information tend to exhibit different attention distributions (Lin et al., 2018; Zhang and Feng, 2021b). Specifically, tokens with much information, such as content words, tend to pay more attention to themselves. Tokens with less information, since they have less meaning in themselves, need more context information and thereby pay less attention to themselves. Therefore, we use the quantified info to bias the tokens' attention to themselves, thereby encouraging those tokens that tend to focus more on themselves to get higher info. Specifically, based on the original self-attention in Eq.(1, 2), we add the quantified info $I^{\tau}_i,\ \tau \in \{src, tgt\}$ (respectively used for encoder and decoder self-attention) to the token's similarity to itself $e_{ii}$ (Lin et al., 2018), and then normalize with $\mathrm{softmax}(\cdot)$ to get the info-aware self-attention $\beta_{ij}$, calculated as:

$$\tilde{e}_{ij} = \begin{cases} e_{ij} + (I^{\tau}_i - 1), & \text{if } i = j \\ e_{ij}, & \text{otherwise} \end{cases} \qquad (8)$$
$$\beta_{ij} = \mathrm{softmax}(\tilde{e}_{ij}). \qquad (9)$$

If $I^{\tau}_i > 1$ (i.e., the token contains more information), the token will pay more attention to itself; otherwise the token will focus more on other tokens to extract context information. Therefore, the info can be learned from the attention distribution.
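A minimal sketch of this diagonal bias (single attention head, no masking; variable names are assumptions):

import torch
import torch.nn.functional as F

def info_aware_self_attention(e, info):
    """Eq.(8-9): add (I_i - 1) to each token's similarity with itself,
    then renormalize with softmax.

    e    : (seq_len, seq_len) raw similarity scores e_ij
    info : (seq_len,) quantified info I_i of each token, in (0, 2)
    """
    e_tilde = e + torch.diag(info - 1)  # bias only the diagonal entries e_ii
    return F.softmax(e_tilde, dim=-1)   # info-aware attention weights beta_ij

# Hypothetical example with 4 tokens: a token with info > 1 attends more to itself.
e = torch.zeros(4, 4)
beta = info_aware_self_attention(e, torch.tensor([0.5, 1.7, 1.2, 1.0]))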
Info-consistent Cross-attention
In addition to modeling the token info in a monolingual context, the consistency of the token info between target and source is also crucial for the SiMT policy, which ensures that the received source information and the target information can be accurately balanced under the same criterion. For consistency, the target and source tokens with high similarity (i.e., those with high cross-attention scores) should have similar info. Therefore, we scale the cross-attention with the info consistency between target and source, where the info consistency is measured by the $L_1$ distance between target and source info. The info-consistent cross-attention $\gamma_{ij}$ is calculated as:

$$\tilde{\gamma}_{ij} = \alpha_{ij} \times \left( 2 - \left| I^{tgt}_i - I^{src}_j \right| \right), \qquad (10)$$
$$\gamma_{ij} = \tilde{\gamma}_{ij} \Big/ \sum\nolimits_{j} \tilde{\gamma}_{ij}, \qquad (11)$$

where $2 - |I^{tgt}_i - I^{src}_j| \in (0, 2]$ measures the info consistency between $y_i$ and $x_j$.
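A minimal sketch of this rescaling (shapes and names are assumptions; padding and multiple heads are omitted):

import torch

def info_consistent_cross_attention(alpha, I_tgt, I_src):
    """Eq.(10-11): scale the cross-attention alpha_ij by the info consistency
    2 - |I_tgt_i - I_src_j|, then renormalize over the source positions.

    alpha : (m, n) original cross-attention weights
    I_tgt : (m,)   target token info
    I_src : (n,)   source token info
    """
    consistency = 2 - (I_tgt.unsqueeze(1) - I_src.unsqueeze(0)).abs()  # (m, n), in (0, 2]
    gamma_tilde = alpha * consistency
    return gamma_tilde / gamma_tilde.sum(dim=-1, keepdim=True)         # gamma_ij

# Hypothetical example: 2 target tokens attending over 3 source tokens.
alpha = torch.full((2, 3), 1 / 3)
gamma = info_consistent_cross_attention(alpha, torch.tensor([0.2, 1.5]),
                                        torch.tensor([0.5, 1.7, 1.2]))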
Overall, we apply the proposed info-aware self-attention $\beta_{ij}$ and info-consistent cross-attention $\gamma_{ij}$ to replace the original attention for the learning of the quantified info.
4.3 Wait-info Policy
Owing to the quantification and learning of info, we obtain $I^{src}$ and $I^{tgt}$ to reflect how much information the source and target tokens contain. Then, we develop the wait-info policy for SiMT to balance source and target at the information level.

Borrowing the idea from the wait-k policy that requires the target outputs to lag behind the source inputs by $k$ tokens (Ma et al., 2019), the wait-info policy keeps the target information always less than the received source information by $K$ info, where $K$ is the lagging info, a hyperparameter to control the latency. Formally, we denote the number of