ngram-OAXE: Phrase-Based Order-Agnostic Cross Entropy for
Non-Autoregressive Machine Translation
Cunxiao Du
Singapore Management University
cnsdunm@gmail.com
Zhaopeng Tu
Tencent AI Lab
zptu@tencent.com
Longyue Wang
Tencent AI Lab
vinnylywang@tencent.com
Jing Jiang
Singapore Management University
jingjiang@smu.edu.sg
Abstract

Recently, the OAXE training loss (Du et al., 2021), which removes the penalty on word order errors in the standard cross-entropy loss, has proven effective at ameliorating the effect of multimodality in non-autoregressive translation (NAT). Starting from the intuition that reordering generally occurs between phrases, we extend OAXE by only allowing reordering between ngram phrases and still requiring a strict match of word order within each phrase. Extensive experiments on NAT benchmarks across language pairs and data scales demonstrate the effectiveness and universality of our approach. Further analyses show that ngram-OAXE indeed improves the translation of ngram phrases, and produces more fluent translations with better modeling of sentence structure.1
1 Introduction
Fully non-autoregressive translation (NAT) has received increasing attention for its efficient decoding, which predicts every target token in parallel (Gu et al., 2018; Ghazvininejad et al., 2019). However, this advantage comes at the cost of translation quality due to the multimodality problem: there exist many possible translations of the same sentence, and vanilla NAT models may consider several of them at the same time due to their independent predictions, which leads to multi-modal outputs in the form of token repetitions (Gu et al., 2018).
Recent works have improved the standard cross-entropy (XE) loss to ameliorate the effect of multimodality. The motivation for these works is that modeling word order is difficult for NAT, since the model cannot condition on its previous predictions like its autoregressive counterpart. Starting from this intuition, one thread of research relaxes the word order restriction based on the monotonic alignment assumption (Libovický and Helcl, 2018; Ghazvininejad et al., 2020; Saharia et al., 2020). Du et al. (2021) take a further step by removing the penalty on word order errors with a novel order-agnostic cross entropy (OAXE) loss, which enables NAT models to handle word reordering, a common source of the multimodality problem. Accordingly, OAXE achieves the best performance among these model variants.

* Zhaopeng Tu is the corresponding author.
1 The code and models are available at https://github.com/tencent-ailab/machine-translation/COLING22_ngram-OAXE/.
However, OAXE allows reordering between every two words, which is not always valid in practice. For example, reordering the two words within "this afternoon" (i.e., "afternoon this") is ungrammatical; reordering generally occurs between ngram phrases, such as "I ate pizza" and "this afternoon". Starting from this intuition, we extend OAXE by constraining reordering to ngrams and requiring a strict match of word order within each ngram (i.e., ngram-OAXE). To this end, we first build the probability distributions of ngrams in the target sentence using the word probabilities produced by NAT models. Then we find the best ordering of target ngrams to minimize the cross entropy loss. We implement the ngram-OAXE loss efficiently, adding only one more line of code on top of the source code of OAXE. Accordingly, ngram-OAXE only marginally increases training time (e.g., 3% more) over OAXE.
Experimental results on widely-used NAT benchmarks show that ngram-OAXE improves translation performance over OAXE in all cases. Encouragingly, ngram-OAXE outperforms OAXE by up to +3.8 BLEU points on raw data (without knowledge distillation) for WMT14 En-De translation (Table 1), and narrows the performance gap between training on raw data and on distilled data. Further analyses show that ngram-OAXE improves over OAXE in the generation accuracy of ngram phrases and in modeling reordering between ngram phrases, which makes ngram-OAXE handle long sentences better, especially on raw data.
Figure 1: Illustration of the proposed ngram-OAXE loss with N = 2 (i.e., bigram-OAXE). We only show the probabilities of the target words and bigrams for better illustration. Firstly, ngram-OAXE transforms the word probability distributions into bigram distributions by multiplying the word probabilities at the corresponding positions. For example, P("I ate" | Pos:1,2) = P("I" | Pos:1) * P("ate" | Pos:2) = 0.2 * 0.1 = 0.02. Then, we select the ngrams (highlighted in bold) for each pair of neighbouring positions using the efficient Hungarian algorithm.
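To make the transformation concrete, the following NumPy sketch reproduces the arithmetic of Figure 1 using the toy probabilities shown there; the function name bigram_prob is ours, purely for illustration.

import numpy as np

# Per-position word probabilities from the toy example in Figure 1:
# row t is the NAT output distribution at position t+1 over the five target words.
vocab = ["I", "ate", "pizza", "this", "afternoon"]
word_probs = np.array([
    [0.2, 0.1, 0.1, 0.5, 0.1],  # Pos:1
    [0.1, 0.1, 0.1, 0.1, 0.6],  # Pos:2
    [0.4, 0.1, 0.3, 0.1, 0.1],  # Pos:3
    [0.1, 0.4, 0.1, 0.3, 0.1],  # Pos:4
    [0.1, 0.1, 0.5, 0.1, 0.3],  # Pos:5
])

def bigram_prob(first, second, pos):
    """P(first second | Pos:pos,pos+1) = P(first | Pos:pos) * P(second | Pos:pos+1)."""
    return word_probs[pos - 1, vocab.index(first)] * word_probs[pos, vocab.index(second)]

print(bigram_prob("I", "ate", 1))           # 0.02, as in the caption
print(bigram_prob("this", "afternoon", 1))  # 0.30, the highest-scoring bigram at Pos:1,2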
The strength of ngram-OAXE in directly learning from complex raw data indicates the potential to train NAT models without knowledge distillation.
2 Methodology
2.1 Preliminaries: NAT
Cross Entropy (XE)
Standard NAT models (Gu et al., 2018) are trained with the cross entropy loss:

\mathcal{L}_{\mathrm{XE}} = -\log P(Y \mid X) = -\sum_{i=1}^{I} \log P(y_i \mid X),    (1)
where (X, Y) with Y = {y_1, ..., y_I} is a bilingual training example, and P(y_i|X) is calculated independently by the NAT model. XE requires a strict match of word order between target tokens and model predictions, and thus heavily penalizes hypotheses that are semantically equivalent to the target but differ in word order.
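As a point of reference, here is a minimal PyTorch sketch of this position-wise cross entropy; the tensor names, shapes, and batch-mean reduction are illustrative assumptions rather than details from the paper.

import torch
import torch.nn.functional as F

def xe_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Standard NAT cross entropy: each target position is scored independently.

    logits: (batch, length, vocab) unnormalized scores from the NAT decoder.
    target: (batch, length) gold token ids, assumed to match the output length.
    """
    log_probs = F.log_softmax(logits, dim=-1)                            # (batch, length, vocab)
    gold_logp = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)   # (batch, length)
    return -gold_logp.sum(dim=-1).mean()                                 # Eq. (1), averaged over the batch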
Order-Agnostic Cross Entropy (OAXE)
Du et al. (2021) remove the word order restriction of XE, and assign loss based on the best alignment between target tokens and model predictions. They define the ordering space O = {O^1, ..., O^J} for Y, where O^j is an ordering of the set of target tokens (y_1, ..., y_I). The OAXE objective is defined as finding the best ordering O^j to minimize the cross entropy loss:

\mathcal{L}_{\mathrm{OAXE}} = \min_{O^j \in \mathbf{O}} \bigl( -\log P(O^j \mid X) \bigr),    (2)

where -log P(O^j|X) is the cross entropy loss of ordering O^j, calculated by Equation 1.
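The minimization in Equation 2 is an assignment problem, so it can be sketched with SciPy's Hungarian solver (scipy.optimize.linear_sum_assignment). The single-sentence version below assumes the predicted length equals the target length and is a simplified illustration, not the authors' released implementation.

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def oaxe_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """OAXE for one sentence: find the ordering of target tokens that minimizes
    cross entropy, i.e. the cheapest one-to-one assignment of tokens to positions.

    logits: (length, vocab) NAT decoder scores for one sentence.
    target: (length,) gold token ids.
    """
    log_probs = F.log_softmax(logits, dim=-1)    # (length, vocab)
    # cost[i, j] = -log P(target token j at output position i)
    cost = -log_probs[:, target]                 # (length, length)
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    # Sum the cost of the best assignment, keeping gradients through log_probs.
    return cost[torch.as_tensor(rows), torch.as_tensor(cols)].sum()

Because the cost matrix contains -log P(token j at position i) for every token/position pair, the minimum-cost assignment corresponds exactly to the best ordering O^j in Equation 2.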
2.2 ngram-OAXE Loss
Figure 1 illustrates the two-phase calculation of ngram-OAXE: 1) constructing the probability distributions of the ngrams in the target sentence; 2) searching for the best ordering of the considered ngrams to minimize the cross entropy loss.
Formulation
Given the target Y = {y_1, ..., y_I}, we define the target ngrams G^N of size N as all spans of N continuous tokens in Y: {y_{1:N}, ..., y_{I-N+1:I}}. The output ngram distribution P_G is defined as:

P_G(y_{i:i+N-1} \mid X) = \prod_{t=i}^{i+N-1} P(y_t \mid X),    (3)

where P(y_t|X) is the prediction probability of the NAT model for token y_t at position t of the target sentence, and N is the size of the ngrams.
The ngram-OAXE objective is defined as finding the best ordering O^j to minimize the cross entropy loss of the considered ngrams in the target sentence Y:

\mathcal{L}_{\mathrm{ngram\text{-}OAXE}} = \min_{O^j \in \mathbf{O}} \bigl( -\log P_G(O^j \mid X) \bigr).    (4)
Ideally, the best ordering O^j should meet the following conditions:

1. The ngrams in O^j should not overlap (e.g., "I ate" and "ate pizza" should not occur simultaneously in one ordering O).

2. O^j is a mixture of ngrams of different sizes (e.g., "I ate pizza" and "this afternoon").

However, it is computationally infeasible to search for the ngram segmentation of the target sentence with the highest probability. Given a target sentence of length I, there are 2^I possible ngram segmentations (i.e., each token can be labeled as the end of an ngram or not). For each ngram segmentation with expected length I/2, the time complexity is O((I/2)^3) using the efficient Hungarian algorithm. In this way, the total computational complexity under the original two conditions is O(2^I I^3).
For computational tractability, we loosen the conditions as follows:

1. We consider all ngrams in the target sentence, which avoids searching over ngram segmentations. In other words, each word is allowed to occur in multiple ngrams within one ordering O.

2. We only consider ngrams of a fixed size N (e.g., only bigrams), which enables us to cast the problem as Maximum Bipartite Matching and leverage the efficient Hungarian algorithm, as done by Du et al. (2021).

With the loosened conditions, there are I-N+1 ngrams of size N in the sentence, and the computational complexity is O(I^3). Accordingly, the loss of an ordering O^j is computed as:

P_G(O^j \mid X) = \prod_{y_{i:i+N-1} \in O^j} P_G(y_{i:i+N-1} \mid X).    (5)
Figure 1 shows the calculation of the bigram-OAXE loss for the target sentence "I ate pizza this afternoon". We consider all bigrams in the sentence (see "Bigram List") and obtain the probability distribution of the considered bigrams. We construct the bipartite graph G = (U, V, E), where the first set of vertices U contains the I-N+1 groups of N neighbouring positions (e.g., the first two positions "Pos:1,2") and the second set of vertices V is the list of I-N+1 target ngrams (the bigrams in Figure 1). Each edge in E is weighted by the prediction log probability of the corresponding bigram at the corresponding positions. We follow Du et al. (2021) and leverage the efficient Hungarian algorithm (Kuhn, 1955) for fast calculation of ngram-OAXE (see the assigned probabilities for the considered bigrams in Figure 1).
Implementation
Algorithm 1 shows the pseudo-code of ngram-OAXE with N = 2. The implementation of ngram-OAXE is almost the same as that of OAXE, except that we add one more line (marked in Algorithm 1) for constructing the probability distribution of ngrams. We implement ngram-OAXE on top of the source code of OAXE, and leverage the same recipes (i.e., loss truncation and XE pretraining) to effectively restrict the free-order nature of OAXE.
Algorithm 1 Bigram-OAXE Loss
Input: ground truth Y, NAT output log-probabilities logP
  bs, len = Y.size()
  Y = Y.repeat(1, len).view(bs, len, len)
  costM = -logP.gather(index=Y, dim=2)
  costM = costM[:, :-1, :-1] + costM[:, 1:, 1:]   # word costs -> bigram costs (the extra line over OAXE)
  for i = 0 to bs do
      bestMatch[i] = HungarianMatch(costM[i])
  end for
Return: costM.gather(index=bestMatch)
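For readers who prefer runnable code, the following PyTorch rendering of Algorithm 1 is a sketch under our own assumptions: it uses SciPy's linear_sum_assignment in place of HungarianMatch, loops over the batch in Python, and averages per-sentence losses; the released code may differ in these details.

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def bigram_oaxe_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Bigram-OAXE (Algorithm 1): match target bigrams to pairs of adjacent
    output positions with minimum total cross entropy.

    logits: (bs, length, vocab) NAT decoder scores.
    target: (bs, length) gold token ids (same length as the output).
    """
    bs, length = target.size()
    log_p = F.log_softmax(logits, dim=-1)                    # (bs, length, vocab)

    # cost[b, i, j] = -log P(target token j at output position i)
    idx = target.unsqueeze(1).expand(bs, length, length)     # same role as Y.repeat(...).view(...)
    cost = -log_p.gather(dim=2, index=idx)                   # (bs, length, length)

    # The one extra line over OAXE: turn word costs into bigram costs, i.e.
    # cost[b, i, j] = -log P(bigram y_{j:j+1} at positions (i, i+1)).
    cost = cost[:, :-1, :-1] + cost[:, 1:, 1:]               # (bs, length-1, length-1)

    losses = []
    for b in range(bs):
        rows, cols = linear_sum_assignment(cost[b].detach().cpu().numpy())
        losses.append(cost[b][torch.as_tensor(rows), torch.as_tensor(cols)].sum())
    return torch.stack(losses).mean()

The XE pretraining and loss truncation recipes mentioned above are omitted here for brevity.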
Since both ngram-OAXE and OAXE only modify the training of NAT models, their inference latency is the same as the CMLM baseline (e.g., a 15.3x speedup over the AT model). Concerning training latency, OAXE takes 36% more training time than the CMLM baseline, and our ngram-OAXE takes 40% more training time, which is almost the same as OAXE since we only add one more line of code.
Discussion
Some researchers may be concerned that the ngram-OAXE loss is not an intuitively understandable "global" loss, since some words are counted multiple times. We use the example in Figure 1 to address this concern. Firstly, except for the first and last words (i.e., "I" and "afternoon"), the ngram-OAXE loss counts every other word exactly twice, which does not introduce a count bias.
Secondly, we follow Du et al. (2021) and start from an initialization pre-trained with the XE loss, which ensures that the NAT model can produce reliable token probabilities for computing ngram probabilities. We also use the loss truncation technique (Kang and Hashimoto, 2020) to drop invalid ngrams with low probabilities (e.g., "pizza this" | Pos:2,3) from the selected ordering O^j (a sketch of this step is given at the end of this section).
Thirdly, the overlapping ngrams can help produce more fluent translations by modeling global context in the manner of an ngram LM. For example, the high-probability overlapping token "ate" at position 4 (i.e., P(ate | Pos:4) = 0.4) will guide the NAT model to assign high probabilities to the neighbouring ngrams ("I ate" | Pos:3,4) and ("ate pizza" | Pos:4,5), which form a consistent clause ("I ate pizza" | Pos:3,4,5). In contrast, ngram-OAXE would not simultaneously assign high probabilities to the phrases ("this afternoon" | Pos:1,2) and ("pizza this" | Pos:2,3), since the two phrases require the NAT model to assign high probabilities to two different tokens ("afternoon" and "pizza") at the same position Pos:2, which a single output distribution cannot do.
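As mentioned above, low-probability ngrams in the selected ordering are dropped via loss truncation. The following is a minimal sketch of one way to do this on top of the matched bigram costs from Algorithm 1; the drop fraction and the per-ngram truncation are our assumptions, not necessarily the exact recipe of Kang and Hashimoto (2020) or the released code.

import torch

def truncate_ngram_losses(matched_costs: torch.Tensor, drop_fraction: float = 0.2) -> torch.Tensor:
    """Drop the highest-cost (i.e., lowest-probability) matched ngrams before summing.

    matched_costs: (num_ngrams,) per-ngram negative log probabilities for the
        ordering selected by the Hungarian matching.
    drop_fraction: fraction of matched ngrams with the largest cost to ignore.
    """
    num_keep = max(1, int(round(matched_costs.numel() * (1.0 - drop_fraction))))
    kept, _ = torch.topk(matched_costs, k=num_keep, largest=False)  # keep the cheapest ngrams
    return kept.sum()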