Wait-info Policy: Balancing Source and Target at Information Level for Simultaneous Machine Translation

Shaolei Zhang 1,2, Shoutao Guo 1,2, Yang Feng 1,2
1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
2 University of Chinese Academy of Sciences, Beijing, China
{zhangshaolei20z,guoshoutao22z,fengyang}@ict.ac.cn

arXiv:2210.11220v1 [cs.CL] 20 Oct 2022
Abstract
Simultaneous machine translation (SiMT) outputs the translation while receiving the source inputs, and hence needs to balance the received source information and translated target information to make a reasonable decision between waiting for inputs or outputting translation. Previous methods always balance source and target information at the token level, either directly waiting for a fixed number of tokens or adjusting the waiting based on the current token. In this paper, we propose a Wait-info Policy to balance source and target at the information level. We first quantify the amount of information contained in each token, named info. Then during simultaneous translation, the decision of waiting or outputting is made based on the comparison between the total info of previous target outputs and that of the received source inputs. Experiments show that our method outperforms strong baselines and achieves a better balance via the proposed info.
1 Introduction
Simultaneous machine translation (SiMT) (Cho and Esipova, 2016; Gu et al., 2017; Ma et al., 2019) outputs the translation while receiving the source sentence, aiming at the trade-off between translation quality and latency. Therefore, a policy is required for SiMT to decide between waiting for the source inputs (i.e., READ) or outputting translations (i.e., WRITE), the core of which is to wisely balance the received source information and the translated target information. When the source information is less, the model should wait for more inputs for a high-quality translation; conversely, when the translated target information is less, the model should output translations for a low latency.
Corresponding author: Yang Feng.
Code is available at https://github.com/ictnlp/Wait-info.

Figure 1: Schematic diagram of Wait-info vs. Wait-k. (a) Wait-k policy: treats each token equally, and lags k tokens. (b) Wait-info policy: quantifies the information in each token, named info (e.g., 0.5, 1.7, ...), and keeps the target information always less than the received source information by K info.
Existing SiMT policies, involving fixed and adaptive, always balance source and target at the token level, i.e., treating each source and target token equally when determining READ/WRITE. Fixed policies decide READ/WRITE based on the number of received source tokens (Ma et al., 2019; Zhang and Feng, 2021c); for example, the wait-k policy (Ma et al., 2019) simply considers each source token to be equivalent and lets the target outputs always lag the source inputs by k tokens, as shown in Figure 1(a). Fixed policies are always limited by the fact that the policy cannot be adjusted according to complex inputs, making it difficult for them to get the best trade-off. Adaptive policies predict READ/WRITE according to the current source and target tokens (Arivazhagan et al., 2019; Ma et al., 2020) and thereby get a better trade-off, but they often ignore and under-utilize the difference between tokens when deciding READ/WRITE. Besides, existing adaptive policies always rely on complicated training (Ma et al., 2020; Miao et al., 2021) or additional labeled data (Zheng et al., 2019; Zhang et al., 2020; Alinejad et al., 2021), making them more computationally expensive than fixed policies.
Treating each token equally when balancing source and target is not the optimal choice for the SiMT policy. Many studies have shown that different words have significantly different functions in translation (Lin et al., 2018; Moradi et al., 2019; Chen et al., 2020), and words are often divided into content words (i.e., noun, verb, ...) and function words (i.e., conjunction, preposition, ...), where the former express more important meaning and the latter are less informative. Accordingly, tokens with different amounts of information should also play different roles in the SiMT policy, where more informative tokens should play a more dominant role because they bring more information to the SiMT model (Zhang and Feng, 2022a,b). Therefore, explicitly differentiating various tokens rather than treating them equally when determining READ/WRITE will be beneficial to developing a more precise SiMT policy.
In this paper, we differentiate various source and target tokens based on the amount of information they contain, aiming to balance received source information and translated target information at the information level. To this end, we propose the wait-info policy, a simple yet effective policy for SiMT. As shown in Figure 1(b), we first quantify the amount of information contained in each token through a scalar, named info, which is jointly learned with the attention mechanism in an unsupervised manner. During simultaneous translation, READ/WRITE decisions are made by balancing the total info of the translated target outputs and the received source inputs. If the received source information exceeds the translated target information by K info or more, the model outputs translation; otherwise the model waits for the next input. Experiments and analyses show that our method outperforms strong baselines and effectively quantifies the information contained in each token.
2 Related Work
SiMT Policy
Recent policies fall into fixed and adaptive. For fixed policies, Ma et al. (2019) proposed the wait-k policy, which first READs k source tokens and then alternates READ and WRITE one token at a time. Elbayad et al. (2020) proposed an efficient multi-path training for the wait-k policy that randomly samples k during training. Zhang et al. (2021) proposed future-guided training for the wait-k policy, which introduces a full-sentence MT model to guide training. Zhang and Feng (2021a) proposed a char-level wait-k policy. Zhang and Feng (2021c) proposed a mixture-of-experts wait-k policy to develop a universal SiMT model. For adaptive policies, Gu et al. (2017) trained an agent to decide READ/WRITE via reinforcement learning. Arivazhagan et al. (2019) proposed MILk, which predicts a Bernoulli variable to determine READ/WRITE. Ma et al. (2020) proposed MMA to implement MILk on Transformer. Zhang and Feng (2022c) proposed dual-path SiMT to enhance MMA with dual learning. Zheng et al. (2020) developed adaptive wait-k through a heuristic ensemble of multiple wait-k models. Miao et al. (2021) proposed a generative framework to generate READ/WRITE decisions. Zhang and Feng (2022a) proposed Gaussian multi-head attention to decide READ/WRITE based on alignments.
Previous policies always treat each token equally when determining READ/WRITE, ignoring the fact that tokens with different amounts of information often play different roles in the SiMT policy. Our method aims to develop a more precise SiMT policy by differentiating the importance of various tokens when determining READ/WRITE.
Information Modeling in NMT
Linguistics divides words into content words and function words according to their information and functions in the sentence. Therefore, modeling the information contained in each word is often used to improve NMT performance. Moradi et al. (2019) and Chen et al. (2020) used word frequency to indicate how much information each word contains, where words with lower frequencies contain more information. Liu et al. (2020) and Kobayashi et al. (2020) found that the norm of the word embedding is related to the token information in NMT. Lin et al. (2018) and Zhang and Feng (2021b) argued that the attention mechanism for different types of words should be different, where the attention distribution of content words tends to be more concentrated. Our method explores the usefulness of modeling information for the SiMT policy, and proposes an unsupervised method to quantify the information of tokens through the attention mechanism, achieving good explainability.
3 Background
Full-sentence MT
For a translation task, we denote the source sentence as $x = (x_1, \cdots, x_n)$ with source length $n$ and the target sentence as $y = (y_1, \cdots, y_m)$ with target length $m$. Transformer (Vaswani et al., 2017) is the most widely used architecture for full-sentence MT, consisting of an encoder and a decoder. The encoder maps $x$ to source hidden states $z = (z_1, \cdots, z_n)$. The decoder maps $y$ to target hidden states $s = (s_1, \cdots, s_m)$, and then performs translating. Specifically, each encoder layer contains two sub-layers: self-attention and a feed-forward network (FFN), while each decoder layer contains three sub-layers: self-attention, cross-attention and FFN. Both self-attention and cross-attention are implemented through the dot-product attention between query $Q$ and key $K$, calculated as:

$$e_{ij} = \frac{(Q_i W^Q)(K_j W^K)^{\top}}{\sqrt{d_k}}, \qquad (1)$$
$$\alpha_{ij} = \mathrm{softmax}(e_{ij}), \qquad (2)$$
where $e_{ij}$ is the similarity score between $Q_i$ and $K_j$, and $\alpha_{ij}$ is the normalized attention weight. $d_k$ is the input dimension, and $W^Q$ and $W^K$ are projection parameters. More specifically, self-attention extracts the monolingual representation of source or target tokens, so the query and key both come from the source hidden states $z$ or the target hidden states $s$. Cross-attention extracts the cross-lingual representation by measuring the correlation between target and source tokens, so the query comes from the target hidden states $s$ and the key comes from the source hidden states $z$.
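As a reference point, here is a minimal PyTorch sketch of this dot-product attention (single head, no masking; the tensor shapes and variable names are assumptions made for illustration, not the authors' code):

import torch
import torch.nn.functional as F

def dot_product_attention(Q, K, W_Q, W_K):
    """Attention weights of Eq.(1-2): e_ij = (Q_i W^Q)(K_j W^K)^T / sqrt(d_k),
    alpha_ij = softmax(e_ij)."""
    d_k = Q.size(-1)
    e = (Q @ W_Q) @ (K @ W_K).transpose(-2, -1) / d_k ** 0.5  # similarity scores e_ij
    return F.softmax(e, dim=-1)                               # attention weights alpha_ij

# Hypothetical shapes: 5 tokens, hidden size 8.
Q = K = torch.randn(5, 8)
W_Q, W_K = torch.randn(8, 8), torch.randn(8, 8)
alpha = dot_product_attention(Q, K, W_Q, W_K)  # (5, 5); each row sums to 1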
Wait-k Policy
Simultaneous machine translation (SiMT) determines when to start translating each target token through a policy. The wait-k policy (Ma et al., 2019) is the most widely used policy for SiMT, which first waits for $k$ source tokens and then alternates between translating one token and waiting for one token, i.e., the target outputs always lag $k$ tokens behind the source inputs. Formally, when translating $y_i$, the wait-k policy forces the SiMT model to wait for $g_k(i)$ source tokens, where $g_k(i)$ is calculated as:

$$g_k(i) = \min\{k + i - 1,\; n\}. \qquad (3)$$
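A minimal sketch of this schedule (illustrative only; target positions are 1-indexed as in Eq.(3)):

def g_wait_k(i, k, n):
    """Number of source tokens read before emitting target token y_i
    under the wait-k policy (Eq. 3): g_k(i) = min(k + i - 1, n)."""
    return min(k + i - 1, n)

# Hypothetical example with k=3 and source length n=6:
# the first target token waits for 3 source tokens, the second for 4, ..., capped at n.
print([g_wait_k(i, k=3, n=6) for i in range(1, 6)])  # [3, 4, 5, 6, 6]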
4 Method
To differentiate various tokens when determining READ/WRITE, we quantify the amount of information contained in each source and target token, named info. As shown in Figure 2, we propose an info-aware Transformer to jointly learn the quantified info with the attention mechanism in an unsupervised manner. Then, based on the quantified info, we propose the wait-info policy to balance the received source information and translated target information. The details are as follows.
Figure 2: Architecture of the proposed info-aware Transformer, where we omit residual connection and layer normalization in the figure for clarity. (The figure shows an Info Quantizer before the encoder and decoder; each of the N encoder layers contains info-aware self-attention and feed-forward sub-layers, and each of the N decoder layers contains info-aware self-attention, info-consistent cross-attention and feed-forward sub-layers, followed by a linear layer and softmax over the output probabilities.)
4.1 Info Quantification
To quantify the amount of information in each token, we use a scalar to represent how much information each token contains, named info. We denote the info of the source tokens and the target tokens as $I^{src} \in \mathbb{R}^{n \times 1}$ and $I^{tgt} \in \mathbb{R}^{m \times 1}$, respectively, where $I^{src}_j$ and $I^{tgt}_i$ represent the info of $x_j$ and $y_i$, and higher info means that the token carries more information.

To predict $I^{src}$ and $I^{tgt}$, we introduce two Info Quantizers before the encoder and decoder to respectively quantify the information of each source and target token, as shown in Figure 2. Specifically, the info quantizer is implemented by a 3-layer feed-forward network (FFN):

$$I^{src} = 2 \times \mathrm{sigmoid}(\mathrm{FFN}(x)), \qquad (4)$$
$$I^{tgt} = 2 \times \mathrm{sigmoid}(\mathrm{FFN}(y)). \qquad (5)$$

For the formulation of the following wait-info policy, $2 \times \mathrm{sigmoid}(\cdot)$ is used to restrict the quantified info to $I^{src}_j, I^{tgt}_i \in (0, 2)$.
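A possible PyTorch sketch of such an info quantizer (the hidden size and the embedding interface are assumptions; the released code may differ):

import torch
import torch.nn as nn

class InfoQuantizer(nn.Module):
    """3-layer FFN mapping each token embedding to a scalar info in (0, 2),
    following Eq.(4-5): I = 2 * sigmoid(FFN(x))."""
    def __init__(self, d_model, d_hidden=512):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, emb):  # emb: (batch, seq_len, d_model) token embeddings
        # One info scalar per token, restricted to (0, 2) by 2 * sigmoid(.)
        return 2 * torch.sigmoid(self.ffn(emb)).squeeze(-1)

# Hypothetical usage on 6 source-token embeddings with d_model = 512.
quantizer = InfoQuantizer(d_model=512)
I_src = quantizer(torch.randn(1, 6, 512))  # shape (1, 6), values in (0, 2)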
Further, in a translation task, the source sentence and target sentence should be semantically equivalent (Finch et al., 2005; Guo et al., 2022), so the total information of the source tokens should be equal to that of the target tokens. To this end, we introduce an info-sum loss $\mathcal{L}_{sum}$ to constrain the total info of the source tokens and target tokens, calculated as:

$$\mathcal{L}_{sum} = \left( \sum_{j=1}^{n} I^{src}_j - \zeta \right)^2 + \left( \sum_{i=1}^{m} I^{tgt}_i - \zeta \right)^2, \qquad (6)$$

where $\zeta$ is a hyperparameter to represent the total info, and we set $\zeta = \frac{m+n}{2}$ (i.e., the average length of source and target) to control the average info to be around 1. Therefore, the final loss $\mathcal{L}$ is:

$$\mathcal{L} = \mathcal{L}_{ce} + \lambda \mathcal{L}_{sum}, \qquad (7)$$

where $\mathcal{L}_{ce}$ is the original cross-entropy loss for the translation (Vaswani et al., 2017). $\lambda$ is a hyperparameter and we set $\lambda = 0.3$ in our experiments.
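A minimal sketch of this objective (assuming unpadded single sequences; batching and masking are omitted):

import torch

def info_sum_loss(I_src, I_tgt):
    """Eq.(6): pull the total source info and total target info towards
    zeta = (m + n) / 2, so the average info per token is around 1."""
    n, m = I_src.size(0), I_tgt.size(0)
    zeta = (m + n) / 2
    return (I_src.sum() - zeta) ** 2 + (I_tgt.sum() - zeta) ** 2

def total_loss(ce_loss, I_src, I_tgt, lam=0.3):
    """Eq.(7): cross-entropy loss plus the weighted info-sum loss."""
    return ce_loss + lam * info_sum_loss(I_src, I_tgt)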
4.2 Learning of Quantified Info
The form of the quantified info $I^{src}$ and $I^{tgt}$ has been constrained through Eq.(4-7); the key challenge is then how to encourage the quantified info to accurately reflect the amount of information each token contains. Since tokens with different amounts of information often show different preferences in the attention distribution (Lin et al., 2018), we propose an unsupervised method to learn the quantified info through the attention mechanism. As shown in Figure 2, we introduce an info-aware Transformer, consisting of info-aware self-attention and info-consistent cross-attention.
Info-aware Self-attention
Self-attention in both the encoder and decoder is used to extract monolingual representations of tokens, where tokens with different amounts of information tend to exhibit different attention distributions (Lin et al., 2018; Zhang and Feng, 2021b). Specifically, tokens with much information, such as content words, tend to pay more attention to themselves. Tokens with less information, since they have less meaning in themselves, need more context information and thereby pay less attention to themselves. Therefore, we use the quantified info to bias the tokens' attention to themselves, thereby encouraging those tokens that tend to focus more on themselves to get higher info. Specifically, based on the original self-attention in Eq.(1, 2), we add the quantified info $I^{\tau}_i,\ \tau \in \{src, tgt\}$ (respectively used for encoder and decoder self-attention) to the token's similarity to itself $e_{ii}$ (Lin et al., 2018), and then normalize with $\mathrm{softmax}(\cdot)$ to get the info-aware self-attention $\beta_{ij}$, calculated as:

$$\tilde{e}_{ij} = \begin{cases} e_{ij} + (I^{\tau}_i - 1), & \text{if } i = j \\ e_{ij}, & \text{otherwise} \end{cases} \qquad (8)$$
$$\beta_{ij} = \mathrm{softmax}(\tilde{e}_{ij}). \qquad (9)$$

If $I^{\tau}_i > 1$ (i.e., the token contains more information), the token will pay more attention to itself; otherwise the token will focus more on other tokens to extract context information. Therefore, the info can be learned from the attention distribution.
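A minimal sketch of this diagonal bias (single attention head, no masking; variable names are assumptions):

import torch
import torch.nn.functional as F

def info_aware_self_attention(e, info):
    """Eq.(8-9): add (I_i - 1) to each token's similarity with itself,
    then renormalize with softmax.

    e    : (seq_len, seq_len) raw similarity scores e_ij
    info : (seq_len,) quantified info I_i of each token, in (0, 2)
    """
    e_tilde = e + torch.diag(info - 1)  # bias only the diagonal entries e_ii
    return F.softmax(e_tilde, dim=-1)   # info-aware attention weights beta_ij

# Hypothetical example with 4 tokens: a token with info > 1 attends more to itself.
e = torch.zeros(4, 4)
beta = info_aware_self_attention(e, torch.tensor([0.5, 1.7, 1.2, 1.0]))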
Info-consistent Cross-attention
In addition to modeling the token info in a monolingual context, the consistency of the token info between target and source is also crucial for the SiMT policy, which ensures that the received source information and the target information can be accurately balanced under the same criterion. For consistency, the target and source tokens with high similarity (i.e., those with high cross-attention scores) should have similar info. Therefore, we scale the cross-attention with the info consistency between target and source, where the info consistency is measured by the $L_1$ distance between target and source info. The info-consistent cross-attention $\gamma_{ij}$ is calculated as:

$$\tilde{\gamma}_{ij} = \alpha_{ij} \times \left( 2 - \left| I^{tgt}_i - I^{src}_j \right| \right), \qquad (10)$$
$$\gamma_{ij} = \tilde{\gamma}_{ij} \Big/ \sum\nolimits_{j} \tilde{\gamma}_{ij}, \qquad (11)$$

where $2 - |I^{tgt}_i - I^{src}_j| \in (0, 2]$ measures the info consistency between $y_i$ and $x_j$.
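A minimal sketch of this rescaling (shapes and names are assumptions; padding and multiple heads are omitted):

import torch

def info_consistent_cross_attention(alpha, I_tgt, I_src):
    """Eq.(10-11): scale the cross-attention alpha_ij by the info consistency
    2 - |I_tgt_i - I_src_j|, then renormalize over the source positions.

    alpha : (m, n) original cross-attention weights
    I_tgt : (m,)   target token info
    I_src : (n,)   source token info
    """
    consistency = 2 - (I_tgt.unsqueeze(1) - I_src.unsqueeze(0)).abs()  # (m, n), in (0, 2]
    gamma_tilde = alpha * consistency
    return gamma_tilde / gamma_tilde.sum(dim=-1, keepdim=True)         # gamma_ij

# Hypothetical example: 2 target tokens attending over 3 source tokens.
alpha = torch.full((2, 3), 1 / 3)
gamma = info_consistent_cross_attention(alpha, torch.tensor([0.2, 1.5]),
                                        torch.tensor([0.5, 1.7, 1.2]))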
Overall, we apply the proposed info-aware self-attention $\beta_{ij}$ and info-consistent cross-attention $\gamma_{ij}$ to replace the original attention for the learning of the quantified info.
4.3 Wait-info Policy
Owing to the quantification and learning of info, we obtain $I^{src}$ and $I^{tgt}$ to reflect how much information the source and target tokens contain. Then, we develop the wait-info policy for SiMT to balance source and target at the information level.

Borrowing the idea from the wait-k policy that requires the target outputs to lag behind the source inputs by $k$ tokens (Ma et al., 2019), the wait-info policy keeps the target information always less than the received source information by $K$ info, where $K$ is the lagging info, a hyperparameter to control the latency. Formally, we denote the number of