
Multi-Granularity Optimization for Non-Autoregressive Translation
Yafu Li♠♥, Leyang Cui♣†, Yongjing Yin♠♥, Yue Zhang♥♦†
♠Zhejiang University
♥School of Engineering, Westlake University
♣Tencent AI Lab
♦Institute of Advanced Technology, Westlake Institute for Advanced Study
yafuly@gmail.com leyangcui@tencent.com
yinyongjing@westlake.edu.cn yue.zhang@wias.org.cn
†Corresponding authors.
Abstract
Despite low latency, non-autoregressive machine translation (NAT) suffers severe performance deterioration due to the naive independence assumption. This assumption is further strengthened by the cross-entropy loss, which encourages a strict token-by-token match between the hypothesis and the reference. To alleviate this issue, we propose multi-granularity optimization for NAT, which collects model behaviors on translation segments of various granularities and integrates feedback for backpropagation. Experiments on four WMT benchmarks show that the proposed method significantly outperforms the baseline models trained with cross-entropy loss, and achieves the best performance on WMT'16 En⇔Ro and highly competitive results on WMT'14 En⇔De for fully non-autoregressive translation.
1 Introduction
Neural machine translation (NMT) systems have shown superior performance on various benchmark datasets (Vaswani et al., 2017; Edunov et al., 2018a). In the training stage, NMT systems minimize the token-level cross-entropy loss between the reference sequence and the model hypothesis. During inference, NMT models adopt autoregressive decoding, where the decoder generates the target sentence token by token (O(N)). To reduce the latency of NMT systems, Gu et al. (2018) propose non-autoregressive neural machine translation (NAT), which improves the decoding speed by generating the entire target sequence in parallel (O(1)).
Figure 1: An illustration of modeling the multi-granularity token dependency beyond cross-entropy.

Despite low latency, without modeling the target-sequence history, NAT models tend to generate translations of low quality (Gu et al., 2018; Sun et al., 2019; Ghazvininejad et al., 2019). NAT ignores inter-token dependency and naively factorizes the sequence-level probability as a product of independent token probabilities.
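For concreteness, with notation introduced here ($X$ the source sentence, $Y = (y_1, \dots, y_T)$ the target, and $y_{<t}$ the preceding target tokens), the two factorizations can be written as
\[
P_{\mathrm{AT}}(Y \mid X) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, X),
\qquad
P_{\mathrm{NAT}}(Y \mid X) = \prod_{t=1}^{T} p(y_t \mid X).
\]
The NAT factorization drops the dependence on $y_{<t}$, which enables parallel decoding but also removes any modeling of inter-token dependency.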
However, vanilla NAT adopts the same training optimization method as autoregressive (AT) models, i.e., the cross-entropy loss (XE loss), which forces the model to learn a strict position-to-position mapping and heavily penalizes hypotheses that suffer position shifts but share large similarity with the references. Given the reference "she left her keys yesterday .", an inappropriate hypothesis "she left her her her ." can yield a lower cross-entropy loss than the reasonable hypothesis "yesterday she left her keys .". Autoregressive models suffer less from this issue by conditioning on previously generated tokens during inference, which is however infeasible for parallel decoding under the independence assumption. As a result, NAT models trained with cross-entropy loss are weak at handling multi-modality issues and prone to token repetition mistakes (Sun et al., 2019; Qian et al., 2021; Ghazvininejad et al., 2020).
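To spell out the penalty behind this example (a toy position-wise comparison based only on the sentences above), the XE loss sums the negative log-probability of each reference token at its own position,
\[
\mathcal{L}_{\mathrm{XE}} = -\sum_{t=1}^{T} \log p(y_t \mid X).
\]
Measured position by position against the reference "she left her keys yesterday .", the repetitive hypothesis "she left her her her ." agrees with the reference at four of six positions (1, 2, 3, and 6), whereas the shifted but fluent hypothesis "yesterday she left her keys ." agrees at only one position (6). A model whose position-wise distributions concentrate on the shifted hypothesis therefore assigns low probability to the reference token at five positions, and can incur a larger XE loss than a model producing the repetitive output.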
Intuitively, generating adequate and fluent translations involves resolving dependencies of various ranges (Yule, 2006). For example, to generate the translation "Each of the students has a book", the model needs to consider the local n-gram pattern "a - book", the subject-verb agreement across the non-contiguous span "each - has", and the global context. To capture the token dependency without