Multi-Granularity Optimization for Non-Autoregressive Translation

Yafu Li, Leyang Cui, Yongjing Yin, Yue Zhang
Zhejiang University
School of Engineering, Westlake University
Tencent AI Lab
Institute of Advanced Technology, Westlake Institute for Advanced Study
yafuly@gmail.com, leyangcui@tencent.com,
yinyongjing@westlake.edu.cn, yue.zhang@wias.org.cn
Abstract

Despite low latency, non-autoregressive machine translation (NAT) suffers severe performance deterioration due to the naive independence assumption. This assumption is further strengthened by cross-entropy loss, which encourages a strict match between the hypothesis and the reference token by token. To alleviate this issue, we propose multi-granularity optimization for NAT, which collects model behaviors on translation segments of various granularities and integrates feedback for backpropagation. Experiments on four WMT benchmarks show that the proposed method significantly outperforms the baseline models trained with cross-entropy loss, and achieves the best performance on WMT'16 En↔Ro and highly competitive results on WMT'14 En↔De for fully non-autoregressive translation.
1 Introduction

Neural machine translation (NMT) systems have shown superior performance on various benchmark datasets (Vaswani et al., 2017; Edunov et al., 2018a). In the training stage, NMT systems minimize the token-level cross-entropy loss between the reference sequence and the model hypothesis. During inference, NMT models adopt autoregressive decoding, where the decoder generates the target sentence token by token ($O(N)$). To reduce the latency of NMT systems, Gu et al. (2018) propose non-autoregressive neural machine translation (NAT), which improves the decoding speed by generating the entire target sequence in parallel ($O(1)$).

Despite low latency, without modeling the target sequence history, NAT models tend to generate translations of low quality (Gu et al., 2018; Sun et al., 2019; Ghazvininejad et al., 2019). NAT ignores inter-token dependency and naively factorizes the sequence-level probability as a product of independent token probabilities.
Figure 1: An illustration of modeling the multi-granularity token dependency beyond cross-entropy.
However, vanilla NAT adopts the same training optimization method as autoregressive (AT) models, i.e., cross-entropy loss (XE loss), which forces the model to learn a strict position-to-position mapping, heavily penalizing hypotheses that suffer from position shifts but share high similarity with the references. Given a reference "she left her keys yesterday .", an inappropriate hypothesis "she left her her her ." can yield a lower cross-entropy loss than a reasonable hypothesis "yesterday she left her keys .". Autoregressive models suffer less from this issue by considering previously generated tokens during inference, which is however infeasible for parallel decoding under the independence assumption. As a result, NAT models trained using cross-entropy loss are weak at handling multi-modality issues and prone to token repetition mistakes (Sun et al., 2019; Qian et al., 2021; Ghazvininejad et al., 2020).
Intuitively, generating adequate and fluent translations involves resolving dependencies of various ranges (Yule, 2006). For example, to generate a translation "Each of the students has a book", the model needs to consider the local n-gram pattern "a - book", the subject-verb agreement across the non-continuous span "each - has", and the global context. To capture the token dependency without
the language model, feedback on the model's behavior on text spans of multiple granularities can be incorporated. To this end, we propose a multi-granularity optimization method to provide NAT models with rich feedback on various text spans involving multi-level dependencies. As shown in Figure 1, instead of exerting strict token-level supervision, we evaluate model behavior on various granularities before integrating the scores of each granularity to optimize the model. In this way, for each sample we highlight different parts of the translation, e.g., "a book" or "each of the students has".
During training, instead of searching for a single output for each source sequence, we explore the search space by sampling a set of hypotheses. For each hypothesis, we jointly mask part of its tokens and those of the gold reference at the same positions. To directly evaluate each partially masked hypothesis, we adopt metric-based optimization (Ranzato et al., 2016; Shen et al., 2016), which rewards the model with a metric function measuring hypothesis-reference text similarity. Since both the hypothesis and the reference share the same masked positions, the metric score of each sample is mainly determined by the exposed segments. Finally, we weigh each sample score by the model confidence to integrate the metric feedback on segments of various granularities. An illustrative representation is shown in Figure 2, where a set of masked hypothesis-reference pairs are sampled and scored respectively before being merged by segment probabilities. In this way, the model is optimized based on its behavior on text spans of multiple granularities for each training instance, within a single forward-backward pass.
We evaluate the proposed method across four machine translation benchmarks: WMT14 En↔De and WMT16 En↔Ro. Results show that the proposed method outperforms baseline NAT models trained with XE loss by a large margin, while maintaining the same inference latency. The proposed method achieves the two best performances among the four benchmarks for fully non-autoregressive models, and obtains highly competitive results compared with the AT model. To the best of our knowledge, we are the first to leverage multi-granularity metric feedback for training NAT models. Our code is released at https://github.com/yafuly/MGMO-NAT.
2 Method

We first briefly introduce some preliminaries, including non-autoregressive machine translation (Section 2.1) and cross-entropy (Section 2.2), and then we elaborate our proposed method, where the model learns segments of different granularities for each instance (Section 2.3).
2.1 Non-autoregressive Machine Translation (NAT)

The machine translation task can be formally defined as a sequence-to-sequence generation problem, where the model generates the target-language sequence $\mathbf{y} = \{y_1, y_2, \ldots, y_T\}$ given the source-language sequence $\mathbf{x} = \{x_1, x_2, \ldots, x_S\}$ based on the conditional probability $p_\theta(\mathbf{y} \mid \mathbf{x})$ ($\theta$ denotes the model parameters). Autoregressive neural machine translation factorizes the conditional probability as $\prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1}, \mathbf{x})$. In contrast, non-autoregressive machine translation (Gu et al., 2018) ignores the dependency between target tokens and factorizes the probability as $\prod_{t=1}^{T} p(y_t \mid \mathbf{x})$, where the tokens at each time step are predicted independently.
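To make the independence assumption concrete, here is a minimal sketch (ours, not the authors' implementation; tensor shapes and names are illustrative) of the NAT sequence log-probability computed from the per-position outputs of a single parallel decoder pass:

```python
import torch

def nat_sequence_logprob(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Log p(y|x) under the NAT factorization: a sum of independent per-position terms.

    logits: (T, V) decoder outputs produced in one parallel pass.
    target: (T,)  reference token ids.
    """
    log_probs = torch.log_softmax(logits, dim=-1)                            # (T, V)
    token_logprobs = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)  # (T,)
    # Independence assumption: log p(y|x) = sum_t log p(y_t|x); no term conditions
    # on y_1..y_{t-1}, unlike the autoregressive factorization above.
    return token_logprobs.sum()
```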
2.2 Cross Entropy (XE)

Similar to AT models, vanilla NAT models are typically trained using the cross-entropy loss:

$\mathcal{L}_{XE} = -\log p_\theta(\mathbf{y} \mid \mathbf{x}) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid \mathbf{x})$  (1)

In addition, a loss for length prediction during inference is introduced:

$\mathcal{L}_{length} = -\log p_\theta(T \mid \mathbf{x})$  (2)

Ghazvininejad et al. (2019) adopt the masking scheme of masked language models and train NAT models as a conditional masked language model (CMLM):

$\mathcal{L}_{CMLM} = -\sum_{y_t \in \mathcal{Y}(\mathbf{y})} \log p_\theta(y_t \mid \Omega(\mathbf{y}, \mathcal{Y}(\mathbf{y})), \mathbf{x})$  (3)

where $\mathcal{Y}(\mathbf{y})$ is a randomly selected subset of target tokens and $\Omega$ denotes a function that masks the selected set of tokens in $\mathcal{Y}(\mathbf{y})$. During decoding, CMLM models can generate target-language sequences via iteratively refining translations from previous iterations.
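As a rough illustration of Eq. (3) (a simplified sketch, not the CMLM reference implementation; the model interface and the mask id are assumptions), the loss is computed only at the positions whose reference tokens were replaced by a mask symbol in the decoder input:

```python
import torch
import torch.nn.functional as F

MASK_ID = 3  # hypothetical id of the <mask> token in the target vocabulary

def cmlm_loss(model, src, tgt, mask_ratio=0.5):
    """Conditional masked language model loss (Eq. 3), simplified to one sentence pair.

    model: callable mapping (src, decoder_input) -> logits of shape (T, V); illustrative interface.
    src:   (S,) source token ids.
    tgt:   (T,) reference token ids.
    """
    T = tgt.size(0)
    num_masked = max(int(mask_ratio * T), 1)
    masked_pos = torch.randperm(T)[:num_masked]   # Y(y): a random subset of target positions
    decoder_input = tgt.clone()
    decoder_input[masked_pos] = MASK_ID           # Omega(y, Y(y)): mask the selected tokens
    logits = model(src, decoder_input)            # one parallel decoder pass
    # Cross-entropy only on the masked positions, conditioned on the observed ones.
    return F.cross_entropy(logits[masked_pos], tgt[masked_pos])
```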
Figure 2: Method illustration of multi-granularity optimization for NAT. During training, our method (MgMO) samples K hypotheses for each source sequence, and focuses on different parts of each one by applying the random masking strategy. For example, MgMO collects the model's performance on the partially exposed segments "way we pay these" and "change" for the first hypothesis ($\mathbf{h}^1$), while paying more attention to the phrase "pay these taxes" for the $K$-th one ($\mathbf{h}^K$). (The figure also shows, for each masked hypothesis-reference pair, the metric reward $R^k = \mathrm{Metric}(\mathbf{h}^k, \mathbf{y}^k)$, the segment probability $\hat{p}_k = p_\theta(T_k \mid \mathbf{x}) \prod_{t \notin \mathcal{M}^k} p_\theta(h^k_t \mid \mathbf{x})$, and the resulting loss $\mathcal{L}_{\mathrm{MgMO}}(\mathbf{x}, \mathbf{y}) = -\sum_{k} R^k \hat{p}_k / \sum_{k'} \hat{p}_{k'}$.)
2.3 Multi-granularity Optimization for NAT

We propose multi-granularity optimization, which integrates feedback on various types of granularities for each training instance. The overall method illustration is presented in Figure 2.
Sequence Decomposition
In order to obtain output spans of multiple granularities, we sample $K$ output sequences from the model following a two-step sampling process. In particular, we first sample a hypothesis length and then sample the output token at each time step independently given the sequence length. The probability of the $k$-th hypothesis $\mathbf{h}^k$ is calculated as:

$p_\theta(\mathbf{h}^k \mid \mathbf{x}) = p_\theta(T_k \mid \mathbf{x}) \prod_{t=1}^{T_k} p_\theta(h_t^k \mid \mathbf{x})$  (4)
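A minimal sketch of this two-step sampling, assuming the model exposes a length distribution and per-position token logits from one parallel decoder pass (the interface below is ours, purely for illustration):

```python
import torch

def sample_hypothesis(length_logits, token_logits_fn):
    """Two-step sampling behind Eq. (4): draw a length T_k, then draw each token independently.

    length_logits:   (L_max,) unnormalized scores over candidate target lengths;
                     the sampled index is treated directly as the length for simplicity.
    token_logits_fn: callable T_k -> (T_k, V) per-position token logits for that length
                     (an illustrative interface; a real NAT decoder also consumes the source).
    Returns the sampled token ids and log p_theta(h^k | x).
    """
    length_dist = torch.distributions.Categorical(logits=length_logits)
    T_k = length_dist.sample()                                   # sample the hypothesis length
    token_dist = torch.distributions.Categorical(logits=token_logits_fn(int(T_k)))
    tokens = token_dist.sample()                                 # one independent draw per position
    log_prob = length_dist.log_prob(T_k) + token_dist.log_prob(tokens).sum()
    return tokens, log_prob
```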
To highlight different segments of multiple granularities for each sample, we apply a masking strategy that randomly masks a subset of the tokens of both the hypothesis and the reference at the same positions. We denote the masked hypothesis and reference as $\mathbf{h}^k = \{h_1^k, \ldots, h_{T_k}^k\}$ and $\mathbf{y}^k = \{y_1^k, \ldots, y_T^k\}$, respectively, and denote the set of masked positions as $\mathcal{M}^k$. Note that the reference length $T$ may be different from the hypothesis length $T_k$.
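The joint masking step amounts to applying the same position set $\mathcal{M}^k$ to both sequences, as in the small sketch below (ours; the mask symbol and plain-list representation are only for illustration):

```python
MASK = "<m>"  # illustrative mask symbol

def joint_mask(hyp_tokens, ref_tokens, masked_positions):
    """Mask the hypothesis and the reference at the same positions M^k.

    The two sequences may have different lengths (T_k vs. T); a position in M^k
    simply has no effect on the shorter sequence if it lies beyond its end.
    """
    masked_positions = set(masked_positions)
    hyp_masked = [MASK if i in masked_positions else tok for i, tok in enumerate(hyp_tokens)]
    ref_masked = [MASK if i in masked_positions else tok for i, tok in enumerate(ref_tokens)]
    return hyp_masked, ref_masked
```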
For the first hypothesis output ($k = 1$) in Figure 2, given the randomly generated masked position set $\mathcal{M}^1 = \{1, 6, 7\}$, the masked hypothesis and reference are $\mathbf{h}^1 = \{\langle m \rangle, h_2^1, h_3^1, \ldots, \langle m \rangle, \langle m \rangle, h_8^1\}$ and $\mathbf{y}^1 = \{\langle m \rangle, y_2^1, y_3^1, \ldots, \langle m \rangle, \langle m \rangle, y_8^1, y_9^1\}$, respectively, where $\langle m \rangle$ represents the masked token.

To determine the number of masked tokens $|\mathcal{M}^k|$ for each training instance, we first sample a threshold $\tau$ from a uniform distribution $U(0, \gamma)$, and compute $|\mathcal{M}^k|$ as follows:

$|\mathcal{M}^k| = \max(\lfloor T_k - \tau T_k \rfloor, 0)$  (5)

where $\gamma$ is a scaling ratio that controls the likelihood of each token being masked. Note that the value of $|\mathcal{M}^k|$ lies within the range $[0, T_k - 1]$, meaning that at least one token is kept.
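Read this way (the minus sign in Eq. (5) is reconstructed from the garbled source, so treat the exact form as an assumption), the mask size and positions could be drawn as follows:

```python
import math
import random

def sample_mask_size(T_k: int, gamma: float) -> int:
    """Sample |M^k| for a hypothesis of length T_k, following Eq. (5) as reconstructed above.

    tau ~ U(0, gamma) plays the role of the exposed fraction, so for tau > 0 the
    result stays within [0, T_k - 1], i.e. at least one token remains visible.
    """
    tau = random.uniform(0.0, gamma)
    return max(math.floor(T_k - tau * T_k), 0)

def sample_mask_positions(T_k: int, gamma: float) -> set:
    """Pick which positions to mask; the same set is applied to hypothesis and reference."""
    size = min(sample_mask_size(T_k, gamma), T_k - 1)  # defensive clamp for the tau == 0 edge case
    return set(random.sample(range(T_k), size))
```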
In this way, we decompose each training instance into $K$ pairs of masked hypotheses and references with different granularities exposed. For example, in the last sample ($k = K$) in Figure 2, only the verb phrase "pay these taxes" and the period (".") are learned by the model, whereas the sample ($k = 1$) reveals more informative segments.
Metric-based Optimization (MO)
To avoid the strict mapping of the XE loss (Ghazvininejad et al., 2020; Du et al., 2021), we incorporate metric-based optimization, which rewards the model with a metric function measuring the text similarity between each masked hypothesis and its masked reference (Figure 2).
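As a non-authoritative sketch of how the pieces fit together, the snippet below combines per-sample metric rewards with normalized segment probabilities, following the loss form $\mathcal{L}_{\mathrm{MgMO}}(\mathbf{x}, \mathbf{y}) = -\sum_k R^k \hat{p}_k / \sum_{k'} \hat{p}_{k'}$ shown in Figure 2 (the loss form is reconstructed from the figure, and the function interface is ours):

```python
import torch

def mgmo_loss(rewards: torch.Tensor, seg_logprobs: torch.Tensor) -> torch.Tensor:
    """Multi-granularity optimization loss, as reconstructed from Figure 2.

    rewards:      (K,) metric scores R^k = Metric(h^k, y^k) of the masked hypothesis-reference
                  pairs, treated as constants (no gradient flows through the metric).
    seg_logprobs: (K,) log p_hat_k = log p_theta(T_k|x) plus the sum over exposed positions
                  (t not in M^k) of log p_theta(h^k_t|x).
    """
    # softmax over log-probabilities equals p_hat_k / sum_k' p_hat_k', and is numerically stable.
    weights = torch.softmax(seg_logprobs, dim=0)
    return -(rewards.detach() * weights).sum()
```

Gradients then flow only through the segment probabilities, so exposed segments that earn higher metric rewards are reinforced within a single forward-backward pass.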