
4 Experiments
4.1 Datasets
We conduct experiments on both text-to-text ST
and speech-to-text ST tasks.
• Text-to-text ST (T2T-ST)
IWSLT15 English→Vietnamese (En→Vi) (133K pairs) (Cettolo et al., 2015; data from nlp.stanford.edu/projects/nmt/). We use TED tst2012 as the validation set (1553 pairs) and TED tst2013 as the test set (1268 pairs). Following the previous setting (Raffel et al., 2017; Ma et al., 2020c), we replace tokens whose frequency is less than 5 with ⟨unk⟩; the vocabulary sizes are 17K and 7.7K for English and Vietnamese, respectively.
WMT15 German→English (De→En) (4.5M pairs) (data from www.statmt.org/wmt15/). We use newstest2013 as the validation set (3000 pairs) and newstest2015 as the test set (2169 pairs). 32K BPE (Sennrich et al., 2016) is applied, and the vocabulary is shared across languages.
• Speech-to-text ST (S2T-ST)
MuST-C English→German (En→De) (234K pairs) and English→Spanish (En→Es) (270K pairs) (Di Gangi et al., 2019; data from https://ict.fbk.eu/must-c). We use dev as the validation set (1423 pairs for En→De, 1316 pairs for En→Es) and tst-COMMON as the test set (2641 pairs for En→De, 2502 pairs for En→Es), respectively. Following Ma et al. (2020b), we use Kaldi (Povey et al., 2011) to extract 80-dimensional log-mel filter bank features for speech, computed with a 25 ms window size and a 10 ms window shift, and we use SentencePiece (Kudo and Richardson, 2018) to build a unigram vocabulary of size 10,000 separately for the source and target text.
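For illustration, this preprocessing could be approximated by the sketch below, which uses torchaudio's Kaldi-compatible filter bank (instead of Kaldi itself) and the SentencePiece Python API; the file paths are placeholders rather than the actual MuST-C files.

import torchaudio
import torchaudio.compliance.kaldi as kaldi
import sentencepiece as spm

# 80-dimensional log-mel filter bank, 25 ms window, 10 ms shift
waveform, sample_rate = torchaudio.load("sample.wav")  # placeholder audio path
features = kaldi.fbank(
    waveform,
    num_mel_bins=80,
    frame_length=25.0,
    frame_shift=10.0,
    sample_frequency=sample_rate,
)

# unigram vocabulary of size 10,000 (built analogously for the target-side text)
spm.SentencePieceTrainer.train(
    input="train.en",        # placeholder path to source-side training text
    model_prefix="spm_en",
    vocab_size=10000,
    model_type="unigram",
)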
4.2 Experimental Settings
We conduct experiments on the following systems.
All implementations are based on Transformer (Vaswani et al., 2017) and adapted from the Fairseq library (Ott et al., 2019).
Offline
Full-sentence MT (Vaswani et al., 2017), which waits for the complete source input and then starts translating.
Wait-k
Wait-k policy (Ma et al., 2019), the most widely used fixed policy, which first READs k source tokens and then alternately READs one token and WRITEs one token.
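For intuition, the resulting READ/WRITE schedule can be summarized by the minimal sketch below; it is an illustration of the policy, not the training or inference code used in this paper.

def wait_k_read_schedule(k, src_len, tgt_len):
    """For each target position t, the number of source tokens READ before the t-th WRITE."""
    schedule = []
    for t in range(1, tgt_len + 1):
        # READ until k + (t - 1) source tokens are available (capped at the source length),
        # then WRITE the t-th target token.
        schedule.append(min(k + t - 1, src_len))
    return schedule

# e.g., wait_k_read_schedule(k=3, src_len=6, tgt_len=7) -> [3, 4, 5, 6, 6, 6, 6]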
Multipath Wait-k
An efficient training method for wait-k (Elbayad et al., 2020), which randomly samples a different k between batches during training.
Adaptive Wait-k
A heuristic composition of multiple wait-k models (k = 1, ..., 13) (Zheng et al., 2020), which decides whether to translate according to the output probabilities of the wait-k models.
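A rough sketch of such a decision rule is given below; the thresholds rho, the lag bookkeeping, and the top1_prob helper are assumptions made for illustration and do not correspond to the exact heuristic of Zheng et al. (2020).

def adaptive_wait_k_action(models, rho, num_read, num_written, source_finished,
                           k_min=1, k_max=13):
    """Return "READ" or "WRITE" for the next step (illustrative only)."""
    lag = num_read - num_written      # effective k of the current state
    if source_finished or lag >= k_max:
        return "WRITE"                # no source left to read, or the lag is already maximal
    if lag < k_min:
        return "READ"                 # not enough source context for any wait-k expert
    # hypothetical helper: top-1 probability of the next token under the wait-`lag` model
    confidence = models[lag].top1_prob(num_read, num_written)
    return "WRITE" if confidence >= rho[lag] else "READ"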
MoE Wait-k
Mixture-of-experts wait-k policy (Zhang and Feng, 2021c), the SOTA fixed policy, which applies multiple experts to learn multiple wait-k policies during training (code: github.com/ictnlp/MoE-Waitk).
MMA
Monotonic multi-head attention (Ma et al., 2020c), which predicts a Bernoulli variable to decide READ/WRITE; the Bernoulli variable is jointly learned with multi-head attention (code: github.com/pytorch/fairseq/tree/master/examples/simultaneous_translation).
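As a loose illustration of this kind of decision, the single-head sketch below thresholds a Bernoulli WRITE probability obtained from an attention energy; the energy function is a placeholder, and MMA's actual multi-head formulation and joint training are omitted.

import torch

def monotonic_read_write(query, key, energy_fn, threshold=0.5):
    """Single-head sketch: threshold a Bernoulli WRITE probability derived from an energy."""
    p_write = torch.sigmoid(energy_fn(query, key))  # probability of choosing WRITE
    return "WRITE" if p_write.item() >= threshold else "READ"

# toy usage with a dot-product energy over random vectors
q, k = torch.randn(8), torch.randn(8)
action = monotonic_read_write(q, k, energy_fn=lambda a, b: a @ b)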
GSiMT
Generative ST (Miao et al., 2021), which also predicts a Bernoulli variable to decide READ/WRITE, where the variable is trained within a generative framework via dynamic programming.
RealTranS
End-to-end simultaneous speech translation with the Wait-K-Stride-N strategy (Zeng et al., 2021), which waits for N frames at each step.
MoSST
Monotonic-segmented streaming speech translation (Dong et al., 2022), which uses an integrate-and-fire method to segment the speech.
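For intuition, a minimal sketch of integrate-and-fire segmentation is given below; the per-frame weights and the firing threshold of 1.0 follow the general continuous integrate-and-fire idea and are assumptions for illustration, not MoSST's exact formulation.

def integrate_and_fire(weights, threshold=1.0):
    """Return the frame indices where a segment boundary fires (illustrative only)."""
    boundaries, acc = [], 0.0
    for i, w in enumerate(weights):
        acc += w                   # accumulate per-frame weights in (0, 1)
        if acc >= threshold:       # enough evidence accumulated: fire a boundary here
            boundaries.append(i)
            acc -= threshold       # carry the remainder over to the next segment
    return boundaries

# e.g., integrate_and_fire([0.2, 0.5, 0.4, 0.6, 0.9]) -> [2, 4]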
ITST
The proposed method in Sec. 3.
T2T-ST Settings
We apply Transformer-Small (4 heads) for En→Vi and Transformer-Base/Big (8/16 heads) for De→En. Note that we apply a unidirectional encoder in the Transformer to enable simultaneous decoding. Since GSiMT involves dynamic programming, which makes its training expensive, we report GSiMT on WMT15 De→En (Base) (Miao et al., 2021). For T2T-ST evaluation, we report BLEU (Papineni et al., 2002) for translation quality and Average Lagging (AL, in tokens) (Ma et al., 2019) for latency. We also give the results with SacreBLEU in Appendix B.
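For reference, AL for a policy with read schedule g(t) (the number of source tokens read before writing the t-th target token) is defined by Ma et al. (2019) as

AL = \frac{1}{\tau} \sum_{t=1}^{\tau} \left( g(t) - \frac{t-1}{|\mathbf{y}|/|\mathbf{x}|} \right), \qquad \tau = \min\{ t \mid g(t) = |\mathbf{x}| \},

where |\mathbf{x}| and |\mathbf{y}| denote the source and target lengths; the notation here follows Ma et al. (2019).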
S2T-ST Settings
The proposed ITST can perform end-to-end speech-to-text ST in two manners: fixed pre-decision and flexible pre-decision (Ma et al., 2020b). For fixed pre-decision, following Ma et al. (2020b), we apply ConvTransformer-Espnet (4 heads) (Inaguma et al., 2020) for both En→De and En→Es, which adds a 3-layer convolutional network before the encoder to capture speech features. Note that the encoder is also unidirectional for simultaneous decoding. The convolutional layers and the encoder are initialized from a pre-trained ASR model. All systems make a fixed pre-decision of READ/WRITE every 7 source tokens (i.e., every 280 ms). For flexible pre-decision,