Multi-Granularity Optimization for Non-Autoregressive Translation

Yafu Li, Leyang Cui, Yongjing Yin, Yue Zhang
Zhejiang University
School of Engineering, Westlake University
Tencent AI Lab
Institute of Advanced Technology, Westlake Institute for Advanced Study
yafuly@gmail.com, leyangcui@tencent.com,
yinyongjing@westlake.edu.cn, yue.zhang@wias.org.cn
Abstract

Despite low latency, non-autoregressive machine translation (NAT) suffers severe performance deterioration due to the naive independence assumption. This assumption is further strengthened by cross-entropy loss, which encourages a strict match between the hypothesis and the reference token by token. To alleviate this issue, we propose multi-granularity optimization for NAT, which collects model behaviors on translation segments of various granularities and integrates feedback for backpropagation. Experiments on four WMT benchmarks show that the proposed method significantly outperforms the baseline models trained with cross-entropy loss, and achieves the best performance on WMT'16 En↔Ro and highly competitive results on WMT'14 En↔De for fully non-autoregressive translation.
1 Introduction

Neural machine translation (NMT) systems have shown superior performance on various benchmark datasets (Vaswani et al., 2017; Edunov et al., 2018a). In the training stage, NMT systems minimize the token-level cross-entropy loss between the reference sequence and the model hypothesis. During inference, NMT models adopt autoregressive decoding, where the decoder generates the target sentence token by token ($O(N)$). To reduce the latency of NMT systems, Gu et al. (2018) propose non-autoregressive neural machine translation (NAT), which improves the decoding speed by generating the entire target sequence in parallel ($O(1)$).

Despite low latency, without modeling the target sequence history, NAT models tend to generate translations of low quality (Gu et al., 2018; Sun et al., 2019; Ghazvininejad et al., 2019). NAT ignores inter-token dependency and naively factorizes the sequence-level probability as a product of independent token probabilities.
Figure 1: An illustration of modeling the multi-granularity token dependency beyond cross-entropy.
However, vanilla NAT adopts the same training optimization method as autoregressive (AT) models, i.e., cross-entropy loss (XE loss), which forces the model to learn a strict position-to-position mapping, heavily penalizing hypotheses that suffer from position shifts but share high similarity with the references. Given a reference "she left her keys yesterday .", an inappropriate hypothesis "she left her her her ." can yield a lower cross-entropy loss than a reasonable hypothesis "yesterday she left her keys .". Autoregressive models suffer less from this issue by considering previously generated tokens during inference, which is however infeasible for parallel decoding under the independence assumption. As a result, NAT models trained using cross-entropy loss are weak at handling multi-modality issues and prone to token repetition mistakes (Sun et al., 2019; Qian et al., 2021; Ghazvininejad et al., 2020).
Intuitively, generating adequate and fluent translations involves resolving dependencies of various ranges (Yule, 2006). For example, to generate a translation "Each of the students has a book", the model needs to consider the local n-gram pattern "a - book", the subject-verb agreement across the non-continuous span "each - has", and the global context. To capture the token dependency without
the language model, feedback on the model's behavior on text spans of multiple granularities can be incorporated. To this end, we propose a multi-granularity optimization method to provide NAT models with rich feedback on various text spans involving multi-level dependencies. As shown in Figure 1, instead of exerting strict token-level supervision, we evaluate model behavior on various granularities before integrating the scores of each granularity to optimize the model. In this way, for each sample we highlight different parts of the translation, e.g., "a book" or "each of the students has".
During training, instead of searching for a single output for each source sequence, we explore the search space by sampling a set of hypotheses. For each hypothesis, we jointly mask part of its tokens and those of the gold reference at the same positions. To directly evaluate each partially masked hypothesis, we adopt metric-based optimization (Ranzato et al., 2016; Shen et al., 2016), which rewards the model with a metric function measuring hypothesis-reference text similarity. Since both the hypothesis and the reference share the same masked positions, the metric score of each sample is mainly determined by the exposed segments. Finally, we weigh each sample score by the model confidence to integrate the metric feedback on segments of various granularities. An illustrative representation is shown in Figure 2, where a set of masked hypothesis-reference pairs are sampled and scored respectively before being merged by segment probabilities. In this way, the model is optimized based on its behavior on text spans of multiple granularities for each training instance, within a single forward-backward pass.
We evaluate the proposed method across four machine translation benchmarks: WMT14 En↔De and WMT16 En↔Ro. Results show that the proposed method outperforms baseline NAT models trained with XE loss by a large margin, while maintaining the same inference latency. The proposed method achieves the two best performances among the four benchmarks for fully non-autoregressive models, and obtains highly competitive results compared with the AT model. To the best of our knowledge, we are the first to leverage multi-granularity metric feedback for training NAT models. Our code is released at https://github.com/yafuly/MGMO-NAT.
2 Method

We first briefly introduce some preliminaries, including non-autoregressive machine translation (Section 2.1) and cross-entropy (Section 2.2), and then we elaborate our proposed method, where the model learns segments of different granularities for each instance (Section 2.3).
2.1 Non-autoregressive Machine Translation (NAT)

The machine translation task can be formally defined as a sequence-to-sequence generation problem, where the model generates the target-language sequence $\mathbf{y} = \{y_1, y_2, \ldots, y_T\}$ given the source-language sequence $\mathbf{x} = \{x_1, x_2, \ldots, x_S\}$ based on the conditional probability $p_\theta(\mathbf{y} \mid \mathbf{x})$ ($\theta$ denotes the model parameters). Autoregressive neural machine translation factorizes the conditional probability as $\prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1}, \mathbf{x})$. In contrast, non-autoregressive machine translation (Gu et al., 2018) ignores the dependency between target tokens and factorizes the probability as $\prod_{t=1}^{T} p(y_t \mid \mathbf{x})$, where the tokens at each time step are predicted independently.
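To make the independence assumption concrete, here is a minimal sketch (ours, not the authors' implementation; tensor shapes and names are illustrative) of the NAT sequence log-probability computed from the per-position outputs of a single parallel decoder pass:

```python
import torch

def nat_sequence_logprob(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Log p(y|x) under the NAT factorization: a sum of independent per-position terms.

    logits: (T, V) decoder outputs produced in one parallel pass.
    target: (T,)  reference token ids.
    """
    log_probs = torch.log_softmax(logits, dim=-1)                            # (T, V)
    token_logprobs = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)  # (T,)
    # Independence assumption: log p(y|x) = sum_t log p(y_t|x); no term conditions
    # on y_1..y_{t-1}, unlike the autoregressive factorization above.
    return token_logprobs.sum()
```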
2.2 Cross Entropy (XE)

Similar to AT models, vanilla NAT models are typically trained using the cross-entropy loss:

$\mathcal{L}_{XE} = -\log p_\theta(\mathbf{y} \mid \mathbf{x}) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid \mathbf{x})$  (1)

In addition, a loss for length prediction during inference is introduced:

$\mathcal{L}_{length} = -\log p_\theta(T \mid \mathbf{x})$  (2)

Ghazvininejad et al. (2019) adopt the masking scheme of masked language models and train NAT models as a conditional masked language model (CMLM):

$\mathcal{L}_{CMLM} = -\sum_{y_t \in \mathcal{Y}(\mathbf{y})} \log p_\theta(y_t \mid \Omega(\mathbf{y}, \mathcal{Y}(\mathbf{y})), \mathbf{x})$  (3)

where $\mathcal{Y}(\mathbf{y})$ is a randomly selected subset of target tokens and $\Omega$ denotes a function that masks the selected set of tokens in $\mathcal{Y}(\mathbf{y})$. During decoding, CMLM models can generate target-language sequences via iteratively refining translations from previous iterations.
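As a rough illustration of Eq. (3) (a simplified sketch, not the CMLM reference implementation; the model interface and the mask id are assumptions), the loss is computed only at the positions whose reference tokens were replaced by a mask symbol in the decoder input:

```python
import torch
import torch.nn.functional as F

MASK_ID = 3  # hypothetical id of the <mask> token in the target vocabulary

def cmlm_loss(model, src, tgt, mask_ratio=0.5):
    """Conditional masked language model loss (Eq. 3), simplified to one sentence pair.

    model: callable mapping (src, decoder_input) -> logits of shape (T, V); illustrative interface.
    src:   (S,) source token ids.
    tgt:   (T,) reference token ids.
    """
    T = tgt.size(0)
    num_masked = max(int(mask_ratio * T), 1)
    masked_pos = torch.randperm(T)[:num_masked]   # Y(y): a random subset of target positions
    decoder_input = tgt.clone()
    decoder_input[masked_pos] = MASK_ID           # Omega(y, Y(y)): mask the selected tokens
    logits = model(src, decoder_input)            # one parallel decoder pass
    # Cross-entropy only on the masked positions, conditioned on the observed ones.
    return F.cross_entropy(logits[masked_pos], tgt[masked_pos])
```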
Figure 2: Method illustration of multi-granularity optimization for NAT. During training, our method (MgMO) samples K hypotheses for each source sequence, and focuses on different parts of each one by applying the random masking strategy. For example, MgMO collects the model's performance on the partially exposed segments "way we pay these" and "change" for the first hypothesis ($\mathbf{h}^1$), while paying more attention to the phrase "pay these taxes" for the $K$-th one ($\mathbf{h}^K$). (The figure also shows, for each masked hypothesis-reference pair, the metric reward $R^k = \mathrm{Metric}(\mathbf{h}^k, \mathbf{y}^k)$, the segment probability $\hat{p}_k = p_\theta(T_k \mid \mathbf{x}) \prod_{t \notin \mathcal{M}^k} p_\theta(h^k_t \mid \mathbf{x})$, and the resulting loss $\mathcal{L}_{\mathrm{MgMO}}(\mathbf{x}, \mathbf{y}) = -\sum_{k} R^k \hat{p}_k / \sum_{k'} \hat{p}_{k'}$.)
2.3 Multi-granularity Optimization for NAT

We propose multi-granularity optimization, which integrates feedback on various types of granularities for each training instance. The overall method illustration is presented in Figure 2.
Sequence Decomposition
In order to obtain output spans of multiple granularities, we sample $K$ output sequences from the model following a two-step sampling process. In particular, we first sample a hypothesis length and then sample the output token at each time step independently given the sequence length. The probability of the $k$-th hypothesis $\mathbf{h}^k$ is calculated as:

$p_\theta(\mathbf{h}^k \mid \mathbf{x}) = p_\theta(T_k \mid \mathbf{x}) \prod_{t=1}^{T_k} p_\theta(h_t^k \mid \mathbf{x})$  (4)
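A minimal sketch of this two-step sampling, assuming the model exposes a length distribution and per-position token logits from one parallel decoder pass (the interface below is ours, purely for illustration):

```python
import torch

def sample_hypothesis(length_logits, token_logits_fn):
    """Two-step sampling behind Eq. (4): draw a length T_k, then draw each token independently.

    length_logits:   (L_max,) unnormalized scores over candidate target lengths;
                     the sampled index is treated directly as the length for simplicity.
    token_logits_fn: callable T_k -> (T_k, V) per-position token logits for that length
                     (an illustrative interface; a real NAT decoder also consumes the source).
    Returns the sampled token ids and log p_theta(h^k | x).
    """
    length_dist = torch.distributions.Categorical(logits=length_logits)
    T_k = length_dist.sample()                                   # sample the hypothesis length
    token_dist = torch.distributions.Categorical(logits=token_logits_fn(int(T_k)))
    tokens = token_dist.sample()                                 # one independent draw per position
    log_prob = length_dist.log_prob(T_k) + token_dist.log_prob(tokens).sum()
    return tokens, log_prob
```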
To highlight different segments of multiple granularities for each sample, we apply a masking strategy that randomly masks a subset of the tokens of both the hypothesis and the reference at the same positions. We denote the masked hypothesis and reference as $\mathbf{h}^k = \{h_1^k, \ldots, h_{T_k}^k\}$ and $\mathbf{y}^k = \{y_1^k, \ldots, y_T^k\}$, respectively, and denote the set of masked positions as $\mathcal{M}^k$. Note that the reference length $T$ may be different from the hypothesis length $T_k$.
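The joint masking step amounts to applying the same position set $\mathcal{M}^k$ to both sequences, as in the small sketch below (ours; the mask symbol and plain-list representation are only for illustration):

```python
MASK = "<m>"  # illustrative mask symbol

def joint_mask(hyp_tokens, ref_tokens, masked_positions):
    """Mask the hypothesis and the reference at the same positions M^k.

    The two sequences may have different lengths (T_k vs. T); a position in M^k
    simply has no effect on the shorter sequence if it lies beyond its end.
    """
    masked_positions = set(masked_positions)
    hyp_masked = [MASK if i in masked_positions else tok for i, tok in enumerate(hyp_tokens)]
    ref_masked = [MASK if i in masked_positions else tok for i, tok in enumerate(ref_tokens)]
    return hyp_masked, ref_masked
```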
For the first hypothesis output ($k = 1$) in Figure 2, given the randomly generated masked position set $\mathcal{M}^1 = \{1, 6, 7\}$, the masked hypothesis and reference are $\mathbf{h}^1 = \{\langle m \rangle, h_2^1, h_3^1, \ldots, \langle m \rangle, \langle m \rangle, h_8^1\}$ and $\mathbf{y}^1 = \{\langle m \rangle, y_2^1, y_3^1, \ldots, \langle m \rangle, \langle m \rangle, y_8^1, y_9^1\}$, respectively, where $\langle m \rangle$ represents the masked token.

To determine the number of masked tokens $|\mathcal{M}^k|$ for each training instance, we first sample a threshold $\tau$ from a uniform distribution $U(0, \gamma)$, and compute $|\mathcal{M}^k|$ as follows:

$|\mathcal{M}^k| = \max(\lfloor T_k - \tau T_k \rfloor, 0)$  (5)

where $\gamma$ is a scaling ratio that controls the likelihood of each token being masked. Note that the value of $|\mathcal{M}^k|$ lies within the range $[0, T_k - 1]$, meaning that at least one token is kept.
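Read this way (the minus sign in Eq. (5) is reconstructed from the garbled source, so treat the exact form as an assumption), the mask size and positions could be drawn as follows:

```python
import math
import random

def sample_mask_size(T_k: int, gamma: float) -> int:
    """Sample |M^k| for a hypothesis of length T_k, following Eq. (5) as reconstructed above.

    tau ~ U(0, gamma) plays the role of the exposed fraction, so for tau > 0 the
    result stays within [0, T_k - 1], i.e. at least one token remains visible.
    """
    tau = random.uniform(0.0, gamma)
    return max(math.floor(T_k - tau * T_k), 0)

def sample_mask_positions(T_k: int, gamma: float) -> set:
    """Pick which positions to mask; the same set is applied to hypothesis and reference."""
    size = min(sample_mask_size(T_k, gamma), T_k - 1)  # defensive clamp for the tau == 0 edge case
    return set(random.sample(range(T_k), size))
```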
In this way, we decompose each training instance into $K$ pairs of masked hypotheses and references with different granularities exposed. For example, in the last sample ($k = K$) in Figure 2, only the verb phrase "pay these taxes" and the period (".") are learned by the model, whereas the sample ($k = 1$) reveals more informative segments.
Metric-based Optimization (MO)
To avoid the strict mapping of the XE loss (Ghazvininejad et al., 2020; Du et al., 2021), we incorporate metric-based optimization, which rewards the model with a metric function measuring the text similarity between each masked hypothesis and its masked reference (Figure 2).
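As a non-authoritative sketch of how the pieces fit together, the snippet below combines per-sample metric rewards with normalized segment probabilities, following the loss form $\mathcal{L}_{\mathrm{MgMO}}(\mathbf{x}, \mathbf{y}) = -\sum_k R^k \hat{p}_k / \sum_{k'} \hat{p}_{k'}$ shown in Figure 2 (the loss form is reconstructed from the figure, and the function interface is ours):

```python
import torch

def mgmo_loss(rewards: torch.Tensor, seg_logprobs: torch.Tensor) -> torch.Tensor:
    """Multi-granularity optimization loss, as reconstructed from Figure 2.

    rewards:      (K,) metric scores R^k = Metric(h^k, y^k) of the masked hypothesis-reference
                  pairs, treated as constants (no gradient flows through the metric).
    seg_logprobs: (K,) log p_hat_k = log p_theta(T_k|x) plus the sum over exposed positions
                  (t not in M^k) of log p_theta(h^k_t|x).
    """
    # softmax over log-probabilities equals p_hat_k / sum_k' p_hat_k', and is numerically stable.
    weights = torch.softmax(seg_logprobs, dim=0)
    return -(rewards.detach() * weights).sum()
```

Gradients then flow only through the segment probabilities, so exposed segments that earn higher metric rewards are reinforced within a single forward-backward pass.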