
with the highest probabilities. Given a target sentence of length $I$, there are $2^I$ ngram segmentations (i.e., each token can be labeled as the end of an ngram or not). For each ngram segmentation with expected length $I/2$, the time complexity is $O((I/2)^3)$ using the efficient Hungarian algorithm. In this way, the total computational complexity of the original two conditions is $O(2^I I^3)$.
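To make this exponential growth concrete, the following toy sketch (purely illustrative; the helper name and example sentence are ours, not from the released code) enumerates the end-of-ngram labelings of a 5-token sentence:

from itertools import product

def segmentations(tokens):
    """Enumerate ngram segmentations by labeling each token as the end of
    an ngram or not, i.e. 2^I labelings for a sentence of length I."""
    for labels in product([0, 1], repeat=len(tokens)):
        ngrams, start = [], 0
        for i, is_end in enumerate(labels):
            if is_end:                                # close the current ngram here
                ngrams.append(tuple(tokens[start:i + 1]))
                start = i + 1
        yield ngrams                                  # tokens after the last marked end are left unassigned

tokens = "I ate pizza this afternoon".split()
print(sum(1 for _ in segmentations(tokens)))          # 2^5 = 32 candidate labelings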
For computational tractability, we loosen the two conditions as follows:
1. We consider all ngrams in the target sentence to avoid searching over ngram segmentations. In other words, each word is allowed to occur in multiple ngrams in one ordering $O$.
2. We only consider ngrams with a fixed size $N$ (e.g., only bigrams), which enables us to cast this problem as Maximum Bipartite Matching and leverage the efficient Hungarian algorithm, as done by Du et al. (2021).
By loosening the conditions, there are $(I-N+1)$ ngrams of size $N$ in the sentence, and the computational complexity is $O(I^3)$. Accordingly, the loss of the ordering $O_j$ is computed as:

$$P_G(O_j \mid X) = \prod_{y_{i:i+N-1} \in O_j} P_G(y_{i:i+N-1} \mid X). \tag{5}$$
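To make Eq. (5) concrete, the following toy sketch computes the bigram-OAXE probability of one candidate ordering for $N = 2$. The variable names and the random token distribution are ours, and the bigram probability is assumed to factorize over its two token positions, consistent with Algorithm 1 below.

import numpy as np

# Toy illustration of Eq. (5) with N = 2; `token_logprobs` is a random
# stand-in for a real NAT output distribution over 5 positions.
vocab = {"I": 0, "ate": 1, "pizza": 2, "this": 3, "afternoon": 4}
rng = np.random.default_rng(0)
token_logprobs = np.log(rng.dirichlet(np.ones(len(vocab)), size=5))  # [position, word]

def bigram_logprob(pos, bigram):
    # log P_G(y_{i:i+1} | X): factorizes over the two positions (cf. Algorithm 1).
    first, second = bigram
    return token_logprobs[pos, vocab[first]] + token_logprobs[pos + 1, vocab[second]]

# One candidate ordering O_j assigns each considered bigram to a pair of positions.
ordering = [(0, ("I", "ate")), (1, ("ate", "pizza")),
            (2, ("pizza", "this")), (3, ("this", "afternoon"))]
log_p_ordering = sum(bigram_logprob(pos, bg) for pos, bg in ordering)  # log of Eq. (5)
loss = -log_p_ordering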
Figure 1 shows the calculation of the bigram-OAXE loss for the target sentence "I ate pizza this afternoon". We consider all bigrams in the sentence (see "Bigram List"), and obtain the probability distribution of the considered bigrams. We construct
the bipartite graph $G = (U, V, E)$, where each vertex in the first part $U$ is a tuple of $N$ neighbouring positions (e.g., the first two positions "Pos:1,2"), and each vertex in the second part $V$ is a target bigram. Each edge in $E$ carries the predicted log-probability of the corresponding bigram at the corresponding positions. We follow Du et al. (2021) and leverage the efficient Hungarian algorithm (Kuhn, 1955) for fast calculation of ngram-OAXE (see the assigned probabilities for the considered bigrams).
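The matching itself can be sketched with an off-the-shelf Hungarian solver such as scipy.optimize.linear_sum_assignment (the released OAXE code may use a different implementation); the toy token distribution below plays the role of the NAT output in Figure 1.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Sketch of the bipartite matching in Figure 1 with toy values.
target = "I ate pizza this afternoon".split()
vocab = {w: i for i, w in enumerate(target)}
rng = np.random.default_rng(0)
token_logprobs = np.log(rng.dirichlet(np.ones(len(vocab)), size=len(target)))

bigrams = [tuple(target[i:i + 2]) for i in range(len(target) - 1)]   # "Bigram List"
pos_pairs = [(i, i + 1) for i in range(len(target) - 1)]             # Pos:1,2 / Pos:2,3 / ...

# cost[u, v] = -log P(bigram v at position pair u); minimizing the total cost
# with the Hungarian algorithm maximizes the log-probability of the ordering.
cost = np.empty((len(pos_pairs), len(bigrams)))
for u, (i, j) in enumerate(pos_pairs):
    for v, (w1, w2) in enumerate(bigrams):
        cost[u, v] = -(token_logprobs[i, vocab[w1]] + token_logprobs[j, vocab[w2]])

row_ind, col_ind = linear_sum_assignment(cost)   # best ordering O_j
bigram_oaxe_loss = cost[row_ind, col_ind].sum()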
Implementation
Algorithm 1 shows the pseudo-code of ngram-OAXE with $N = 2$. The implementation of ngram-OAXE is almost the same as that of OAXE, except that we add one more line (the line in Algorithm 1 that sums adjacent token costs into bigram costs) for constructing the probability distribution of ngrams. We implement ngram-OAXE on top of the source code of OAXE, and leverage the same recipes (i.e., loss truncation and XE pretraining) to effectively restrict the free-order nature of OAXE.
Algorithm 1 Bigram-OAXE Loss
Input: Ground truth Y, NAT output logP
  bs, len = Y.size()
  Y = Y.repeat(1, len).view(bs, len, len)
  costM = -logP.gather(index=Y, dim=2)
  costM = costM[:, :-1, :-1] + costM[:, 1:, 1:]
  for i = 0 to bs do
    bestMatch[i] = HungarianMatch(costM[i])
  end for
Return: costM.gather(index=bestMatch)
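For readers who prefer runnable code, the following is a hedged re-implementation of Algorithm 1 in PyTorch: the function name is ours, scipy's linear_sum_assignment stands in for HungarianMatch, and XE pretraining and loss truncation are omitted for brevity.

import torch
from scipy.optimize import linear_sum_assignment

def bigram_oaxe_loss(Y, logP):
    # Y:    (bs, len)        ground-truth token ids
    # logP: (bs, len, vocab) NAT output log-probabilities
    bs, length = Y.size()
    # costM[b, i, j] = -log P(target token y_j at position i)
    index = Y.unsqueeze(1).repeat(1, length, 1)          # (bs, len, len)
    costM = -logP.gather(dim=2, index=index)
    # Bigram cost = sum of the two adjacent token costs (the extra line vs. OAXE).
    costM = costM[:, :-1, :-1] + costM[:, 1:, 1:]
    loss = 0.0
    for b in range(bs):                                  # Hungarian matching per sample
        rows, cols = linear_sum_assignment(costM[b].detach().cpu().numpy())
        rows = torch.as_tensor(rows, device=costM.device)
        cols = torch.as_tensor(cols, device=costM.device)
        loss = loss + costM[b, rows, cols].sum()         # matched bigram costs, graph kept
    return loss / bs

This sketch returns the average matched bigram cost over the batch; in practice the OAXE recipes above (XE pretraining and loss truncation) are applied on top of it.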
Since both ngram-OAXE and OAXE only modify the training of NAT models, their inference latency is the same as that of the CMLM baseline (e.g., a 15.3x speedup over the AT model). Concerning the training latency, OAXE takes 36% more training time than the CMLM baseline, and our ngram-OAXE takes 40% more, which is almost the same as OAXE since we only add one more line of code.
Discussion
Some researchers may argue that the ngram-OAXE loss is not an intuitively understandable "global" loss, since some words are counted multiple times. We use the example in Figure 1 to dispel this doubt. Firstly, except for the first and last words (i.e., "I" and "afternoon"), the ngram-OAXE loss counts every other word exactly twice, which does not introduce a counting bias.
Secondly, we follow Du et al. (2021) and start from an initialization pre-trained with the XE loss, which ensures that the NAT models can produce reliable token probabilities for computing ngram probabilities. We also use the loss truncation technique (Kang and Hashimoto, 2020) to drop invalid ngrams with low probabilities (e.g., "pizza this" | Pos:2,3) in the selected ordering $O_j$.
Thirdly, the overlapping ngrams can help to produce more fluent translations by modeling global context in the manner of an ngram LM. For example, the high-probability overlapping token at position 4, "ate" (i.e., P(ate | Pos:4) = 0.4), will guide NAT models to assign high probabilities to the neighbouring bigrams ("I ate" | Pos:3,4) and ("ate pizza" | Pos:4,5), which form a consistent clause ("I ate pizza" | Pos:3,4,5). In contrast, ngram-OAXE would not simultaneously assign high probabilities to the phrases ("this afternoon" | Pos:1,2) and ("pizza this" | Pos:2,3), since the two phrases require NAT models to assign high probabilities to