
with the highest probabilities. Given a target sentence of length $I$, there are $2^I$ ngram segmentations (i.e., each token can be labeled as the end of an ngram or not). For each ngram segmentation with expected length $I/2$, the time complexity is $O((I/2)^3)$ using the efficient Hungarian algorithm. In this way, the total computational complexity of the original two conditions is $O(2^I I^3)$.
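To make this exponential growth concrete, the following toy sketch (purely illustrative; the helper name and example sentence are ours, not from the released code) enumerates the end-of-ngram labelings of a 5-token sentence:

from itertools import product

def segmentations(tokens):
    """Enumerate ngram segmentations by labeling each token as the end of
    an ngram or not, i.e. 2^I labelings for a sentence of length I."""
    for labels in product([0, 1], repeat=len(tokens)):
        ngrams, start = [], 0
        for i, is_end in enumerate(labels):
            if is_end:                                # close the current ngram here
                ngrams.append(tuple(tokens[start:i + 1]))
                start = i + 1
        yield ngrams                                  # tokens after the last marked end are left unassigned

tokens = "I ate pizza this afternoon".split()
print(sum(1 for _ in segmentations(tokens)))          # 2^5 = 32 candidate labelings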
For computational tractability, we loosen the two conditions as follows:
1. We consider all ngrams in the target sentence to avoid searching over ngram segmentations. In other words, each word is allowed to occur in multiple ngrams in one ordering $O$.
2. We only consider ngrams with a fixed size $N$ (e.g., only bigrams), which enables us to cast this problem as Maximum Bipartite Matching and leverage the efficient Hungarian algorithm, as done by Du et al. (2021).
By loosening the conditions, there are $(I-N+1)$ ngrams of size $N$ in the sentence, and the computational complexity is $O(I^3)$. Accordingly, the loss of the ordering $O_j$ is computed as:

$$P_G(O_j \mid X) = \prod_{y_{i:i+N-1} \in O_j} P_G(y_{i:i+N-1} \mid X). \tag{5}$$
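To make Eq. (5) concrete, the following toy sketch computes the bigram-OAXE probability of one candidate ordering for $N = 2$. The variable names and the random token distribution are ours, and the bigram probability is assumed to factorize over its two token positions, consistent with Algorithm 1 below.

import numpy as np

# Toy illustration of Eq. (5) with N = 2; `token_logprobs` is a random
# stand-in for a real NAT output distribution over 5 positions.
vocab = {"I": 0, "ate": 1, "pizza": 2, "this": 3, "afternoon": 4}
rng = np.random.default_rng(0)
token_logprobs = np.log(rng.dirichlet(np.ones(len(vocab)), size=5))  # [position, word]

def bigram_logprob(pos, bigram):
    # log P_G(y_{i:i+1} | X): factorizes over the two positions (cf. Algorithm 1).
    first, second = bigram
    return token_logprobs[pos, vocab[first]] + token_logprobs[pos + 1, vocab[second]]

# One candidate ordering O_j assigns each considered bigram to a pair of positions.
ordering = [(0, ("I", "ate")), (1, ("ate", "pizza")),
            (2, ("pizza", "this")), (3, ("this", "afternoon"))]
log_p_ordering = sum(bigram_logprob(pos, bg) for pos, bg in ordering)  # log of Eq. (5)
loss = -log_p_ordering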
Figure 1 shows the calculation of the bigram-OAXE loss for the target sentence "I ate pizza this afternoon". We consider all bigrams in the sentence (see "Bigram List"), and obtain the probability distribution of the considered bigrams. We construct
the bipartite graph $G = (U, V, E)$, where each vertex in the first part $U$ is a tuple of $N$ neighbouring positions (e.g., the first two positions "Pos:1,2"), and each vertex in the second part $V$ is a target bigram. Each edge in $E$ carries the predicted log-probability of the corresponding bigram at the corresponding positions. We follow Du et al. (2021) and leverage the efficient Hungarian algorithm (Kuhn, 1955) for fast calculation of ngram-OAXE (see the assigned probabilities for the considered bigrams).
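The matching itself can be sketched with an off-the-shelf Hungarian solver such as scipy.optimize.linear_sum_assignment (the released OAXE code may use a different implementation); the toy token distribution below plays the role of the NAT output in Figure 1.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Sketch of the bipartite matching in Figure 1 with toy values.
target = "I ate pizza this afternoon".split()
vocab = {w: i for i, w in enumerate(target)}
rng = np.random.default_rng(0)
token_logprobs = np.log(rng.dirichlet(np.ones(len(vocab)), size=len(target)))

bigrams = [tuple(target[i:i + 2]) for i in range(len(target) - 1)]   # "Bigram List"
pos_pairs = [(i, i + 1) for i in range(len(target) - 1)]             # Pos:1,2 / Pos:2,3 / ...

# cost[u, v] = -log P(bigram v at position pair u); minimizing the total cost
# with the Hungarian algorithm maximizes the log-probability of the ordering.
cost = np.empty((len(pos_pairs), len(bigrams)))
for u, (i, j) in enumerate(pos_pairs):
    for v, (w1, w2) in enumerate(bigrams):
        cost[u, v] = -(token_logprobs[i, vocab[w1]] + token_logprobs[j, vocab[w2]])

row_ind, col_ind = linear_sum_assignment(cost)   # best ordering O_j
bigram_oaxe_loss = cost[row_ind, col_ind].sum()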
Implementation
Algorithm 1 shows the pseudo-code of ngram-OAXE with $N = 2$. The implementation of ngram-OAXE is almost the same as that of OAXE, except that we add one more line (the line in Algorithm 1 that sums adjacent token costs into bigram costs) for constructing the probability distribution of ngrams. We implement ngram-OAXE on top of the source code of OAXE, and leverage the same recipes (i.e., loss truncation and XE pretraining) to effectively restrict the free-order nature of OAXE.
Algorithm 1 Bigram-OAXE Loss
Input: Ground truth Y, NAT output logP
  bs, len = Y.size()
  Y = Y.repeat(1, len).view(bs, len, len)
  costM = -logP.gather(index=Y, dim=2)
  costM = costM[:, :-1, :-1] + costM[:, 1:, 1:]
  for i = 0 to bs do
    bestMatch[i] = HungarianMatch(costM[i])
  end for
Return: costM.gather(index=bestMatch)
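For readers who prefer runnable code, the following is a hedged re-implementation of Algorithm 1 in PyTorch: the function name is ours, scipy's linear_sum_assignment stands in for HungarianMatch, and XE pretraining and loss truncation are omitted for brevity.

import torch
from scipy.optimize import linear_sum_assignment

def bigram_oaxe_loss(Y, logP):
    # Y:    (bs, len)        ground-truth token ids
    # logP: (bs, len, vocab) NAT output log-probabilities
    bs, length = Y.size()
    # costM[b, i, j] = -log P(target token y_j at position i)
    index = Y.unsqueeze(1).repeat(1, length, 1)          # (bs, len, len)
    costM = -logP.gather(dim=2, index=index)
    # Bigram cost = sum of the two adjacent token costs (the extra line vs. OAXE).
    costM = costM[:, :-1, :-1] + costM[:, 1:, 1:]
    loss = 0.0
    for b in range(bs):                                  # Hungarian matching per sample
        rows, cols = linear_sum_assignment(costM[b].detach().cpu().numpy())
        rows = torch.as_tensor(rows, device=costM.device)
        cols = torch.as_tensor(cols, device=costM.device)
        loss = loss + costM[b, rows, cols].sum()         # matched bigram costs, graph kept
    return loss / bs

This sketch returns the average matched bigram cost over the batch; in practice the OAXE recipes above (XE pretraining and loss truncation) are applied on top of it.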
Since both ngram-OAXE and OAXE only modify the training of NAT models, their inference latency is the same as that of the CMLM baseline (e.g., a 15.3x speedup over the AT model). Concerning the training latency, OAXE takes 36% more training time than the CMLM baseline, and our ngram-OAXE takes 40% more, which is almost the same as OAXE since we only add one more line of code.
Discussion
Some researchers may argue that the ngram-OAXE loss is not an intuitively understandable "global" loss, since some words are counted multiple times. We use the example in Figure 1 to dispel this doubt. Firstly, except for the first and last words (i.e., "I" and "afternoon"), the ngram-OAXE loss counts every other word exactly twice, which does not introduce a counting bias.
Secondly, we follow Du et al. (2021) and start from an initialization pre-trained with the XE loss, which ensures that the NAT models can produce reliable token probabilities for computing ngram probabilities. We also use the loss truncation technique (Kang and Hashimoto, 2020) to drop invalid ngrams with low probabilities (e.g., "pizza this" | Pos:2,3) in the selected ordering $O_j$.
Thirdly, the overlapping ngrams can help to produce more fluent translations by modeling global context in the manner of an ngram LM. For example, the high-probability overlapping token at position 4, "ate" (i.e., P(ate | Pos:4) = 0.4), will guide NAT models to assign high probabilities to the neighbouring bigrams ("I ate" | Pos:3,4) and ("ate pizza" | Pos:4,5), which form a consistent clause ("I ate pizza" | Pos:3,4,5). In contrast, ngram-OAXE would not simultaneously assign high probabilities to the phrases ("this afternoon" | Pos:1,2) and ("pizza this" | Pos:2,3), since the two phrases require NAT models to assign high probabilities to