policy. Many studies have shown that different words serve significantly different functions in translation (Lin et al., 2018; Moradi et al., 2019; Chen et al., 2020) and are often divided into content words (e.g., nouns and verbs) and function words (e.g., conjunctions and prepositions), where the former express more important meaning and the latter are less informative. Accordingly, tokens with different amounts
of information should also play different roles in
the SiMT policy, where more informative tokens
should play a more dominant role because they
bring more information to the SiMT model (Zhang and Feng, 2022a,b). Therefore, explicitly differentiating various tokens, rather than treating them equally when determining READ/WRITE, is beneficial to developing a more precise SiMT policy.
In this paper, we differentiate various source and
target tokens based on the amount of information
they contain, aiming to balance received source
information and translated target information at
the information level. To this end, we propose the wait-info policy, a simple yet effective policy for SiMT. As shown in Figure 1(b), we first quantify the amount of information contained in each token through a scalar, named info, which is jointly learned with the attention mechanism in an unsupervised manner. During simultaneous translation, READ/WRITE decisions are made by balancing the total info of the translated target tokens against that of the received source tokens. If the received source information exceeds the translated target information by $K$ info or more, the model outputs a translation; otherwise, it waits for the next input.
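To make the decision rule concrete, the following is a minimal sketch of the READ/WRITE test described above; the per-token info values and the threshold $K$ follow this section, while the function and variable names are illustrative, and the handling of an exhausted source is an assumption rather than part of the description above.

```python
def wait_info_action(src_infos, tgt_infos, K, src_finished):
    """READ/WRITE decision by balancing total info, as described above.

    src_infos: info of each source token received so far
    tgt_infos: info of each target token already translated
    K: required lead of source info over target info
    src_finished: True once the whole source has been read (assumption)
    """
    received = sum(src_infos)     # total info of received source tokens
    translated = sum(tgt_infos)   # total info of translated target tokens

    # WRITE when the received source info leads by at least K
    # (or nothing is left to read); otherwise READ the next source token.
    if src_finished or received - translated >= K:
        return "WRITE"
    return "READ"
```

In a full decoding loop, this test would be re-evaluated after every READ or WRITE step until the translation is complete.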
Experiments and analyses show that our method
outperforms strong baselines and effectively quan-
tifies the information contained in each token.
2 Related Work
SiMT Policy
Recent policies fall into two categories: fixed and adaptive. For the fixed policy, Ma et al. (2019) proposed the wait-k policy, which first READs $k$ source tokens and then READs/WRITEs one token alternately. Elbayad et al. (2020) proposed an efficient multi-path training for the wait-k policy that randomly samples $k$ during training. Zhang et al. (2021) proposed future-guided training for the wait-k policy, which introduces a full-sentence MT model to guide training. Zhang
and Feng (2021a) proposed a char-level wait-k pol-
icy. Zhang and Feng (2021c) proposed a mixture-
of-experts wait-k policy to develop a universal
SiMT model. For the adaptive policy, Gu et al. (2017)
trained an agent to decide READ/WRITE via rein-
forcement learning. Arivazhagan et al. (2019) pro-
posed MILk, which predicts a Bernoulli variable
to determine READ/WRITE. Ma et al. (2020) pro-
posed MMA to implement MILk on Transformer.
Zhang and Feng (2022c) proposed dual-path SiMT
to enhance MMA with dual learning. Zheng et al. (2020) developed an adaptive wait-k policy through a heuristic ensemble of multiple wait-k models. Miao et al.
(2021) proposed a generative framework to gen-
erate READ/WRITE decisions. Zhang and Feng
(2022a) proposed Gaussian multi-head attention to
decide READ/WRITE based on alignments.
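For reference, the fixed wait-k schedule discussed above admits a one-line formulation, $g(t) = \min(k + t - 1, n)$: the number of source tokens read before writing the $t$-th target token. The sketch below simply restates this standard rule; the function name is illustrative and not taken from the cited work.

```python
def wait_k_read_count(t, k, src_len):
    """Source tokens read before writing the t-th target token (1-indexed)
    under wait-k: read k tokens first, then alternate WRITE/READ,
    never exceeding the source length."""
    return min(k + t - 1, src_len)

# e.g., with k=3 and a 10-token source: 3 tokens are read before the
# first WRITE, 4 before the second, and so on.
```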
Previous policies always treat each token equally
when determining READ/WRITE, ignoring the
fact that tokens with different amounts of infor-
mation often play different roles in the SiMT policy.
Our method aims to develop a more precise SiMT
policy by differentiating the importance of various
tokens when determining READ/WRITE.
Information Modeling in NMT
Linguistics divides words into content words and function
words according to their information and functions
in the sentence. Therefore, modeling the information contained in each word is often used to improve NMT performance. Moradi et al. (2019) and Chen et al. (2020) used word frequency to indicate how much information each word contains, where words with lower frequencies carry more information. Liu et al. (2020) and Kobayashi et al.
(2020) found that the norm of word embedding is
related to the token information in NMT. Lin et al.
(2018) and Zhang and Feng (2021b) argued that the attention mechanism should differ across word types, where the attention distribution of content words tends to be more concentrated.
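As a concrete illustration of the frequency-based view above, a word's information can be approximated by its surprisal, $-\log p(w)$, estimated from corpus counts. The sketch below only illustrates that prior line of work; it is not the info quantification proposed in this paper, and the function name is ours.

```python
import math
from collections import Counter

def surprisal_scores(corpus_tokens):
    """Approximate per-word information as -log relative frequency:
    rarer (often content) words receive higher scores."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: -math.log(c / total) for w, c in counts.items()}

# Frequent function words like "the" score low, while rarer content
# words score high.
scores = surprisal_scores("the cat sat on the mat near the window".split())
```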
Our method explores the usefulness of modeling
information for the SiMT policy and proposes an un-
supervised method to quantify the information of
tokens through the attention mechanism, achieving
good explainability.
3 Background
Full-sentence MT
For a translation task, we denote the source sentence as $\mathbf{x} = (x_1, \cdots, x_n)$ with source length $n$, and the target sentence as $\mathbf{y} = (y_1, \cdots, y_m)$ with target length $m$. Transformer (Vaswani et al., 2017) is the most widely used architecture for full-sentence MT, consisting of an encoder and a decoder. Encoder maps $\mathbf{x}$ to source hidden states $\mathbf{z} = (z_1, \cdots, z_n)$. Decoder