Non-Monotonic Latent Alignments for CTC-Based
Non-Autoregressive Machine Translation
Chenze Shao1,2, Yang Feng1,2
1Key Laboratory of Intelligent Information Processing
Institute of Computing Technology, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
shaochenze18z@ict.ac.cn,fengyang@ict.ac.cn
Abstract
Non-autoregressive translation (NAT) models are typically trained with the cross-entropy loss, which forces the model outputs to be aligned verbatim with the target sentence and heavily penalizes small shifts in word positions. Latent alignment models relax the explicit alignment by marginalizing out all monotonic latent alignments with the CTC loss. However, they cannot handle non-monotonic alignments, which are non-negligible as there is typically global word reordering in machine translation. In this work, we explore non-monotonic latent alignments
for NAT. We extend the alignment space to non-monotonic alignments to allow
for the global word reordering and further consider all alignments that overlap
with the target sentence. We non-monotonically match the alignments to the target
sentence and train the latent alignment model to maximize the F1 score of non-
monotonic matching. Extensive experiments on major WMT benchmarks show
that our method substantially improves the translation performance of CTC-based
models. Our best model achieves 30.06 BLEU on WMT14 En-De with only one-
iteration decoding, closing the gap between non-autoregressive and autoregressive
models.2
1 Introduction
Non-autoregressive translation (NAT) models achieve significant decoding speedup in neural machine translation [NMT, 1, 47] by generating target words simultaneously [17]. This advantage usually comes at the cost of translation quality due to the mismatch of training objectives. NAT models are typically trained with the cross-entropy loss, which forces the model outputs to be aligned verbatim with the target sentence and heavily penalizes small shifts in word positions. The explicit alignment required by the cross-entropy loss cannot be guaranteed due to the multi-modality problem, i.e., there exist many possible translations for the same sentence [17], making the cross-entropy loss intrinsically mismatched with NAT.
As the cross-entropy loss cannot evaluate NAT outputs properly, many efforts have been devoted to designing better training objectives for NAT [29, 40–42, 38, 13, 36, 9, 43]. Among them, latent alignment models [29, 36] relax the alignment restriction by marginalizing out all monotonic latent alignments with the connectionist temporal classification loss [CTC, 15], and have received much attention for their ability to generate variable-length translations.
Latent alignment models generally make a strong monotonic assumption on the mapping between model output and the target sentence. As illustrated in Figure 1a, the monotonic assumption holds in classic application scenarios of CTC like automatic speech recognition (ASR), as there is a natural monotonic mapping between the speech input and the ASR target. However, non-monotonic alignments are non-negligible in machine translation, as there is typically global word reordering, which is a common source of the multi-modality problem. As Figure 1b shows, when the target sentence is “I ate pizza this afternoon” but the model produces a translation with a different but correct word ordering, “this afternoon I ate pizza”, the CTC loss cannot handle this non-monotonic alignment between output and target and wrongly penalizes the model.

Corresponding author: Yang Feng
2Source code: https://github.com/ictnlp/NMLA-NAT.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Figure 1: Illustration of the monotonic alignment assumption of CTC: (a) CTC for ASR, where there is a natural monotonic mapping between the speech input and the ASR target; (b) CTC for NAT, where there exists global word reordering that induces non-monotonic alignment. ε is the blank token that means ‘output nothing’.
In this paper, we propose to model non-monotonic latent alignments for non-autoregressive machine
translation. We first extend the alignment space from monotonic alignments to non-monotonic
alignments to allow for the global word reordering in machine translation. Without the monotonic
structure, we have to optimize the best alignment found by the Hungarian algorithm [46, 28, 7, 9], since
it becomes difficult to marginalize out all non-monotonic alignments with dynamic programming.
This difficulty can be overcome by not requiring an exact match between alignments and the target
sentence. In practice, it is not necessary to have the translation include exact words as contained in
the target sentence, but it would be favorable to have a large overlap between them. Therefore, we
further extend the alignment space by considering all alignments that overlap with the target sentence.
Specifically, we are interested in the overlap of n-grams, which is the core of some evaluation
metrics (e.g., BLEU). We accumulate n-grams from all alignments regardless of their positions and
non-monotonically match them to target n-grams. The latent alignment model is trained to maximize
the F1 score of n-gram matching, which reflects the translation quality to a certain extent [32].
We conduct experiments on major WMT benchmarks for NAT (WMT14 En↔De, WMT16 En↔Ro), which show that our method substantially improves the translation performance and achieves performance comparable to the autoregressive Transformer with only one-iteration parallel decoding.
2 Background
2.1 Non-Autoregressive Machine Translation
[17] proposes non-autoregressive neural machine translation, which achieves significant decoding speedup by generating target words simultaneously. NAT breaks the dependency among target tokens and factorizes the joint probability of target words in the following form:

p(y|x, θ) = ∏_{t=1}^{T} p_t(y_t|x, θ),  (1)

where x is the source sentence belonging to the input space X, y = {y_1, ..., y_T} is the target sentence belonging to the output space Y, and p_t(y_t|x, θ) indicates the translation probability at position t.
The vanilla-NAT has a length predictor that takes encoder states as input to predict the target length.
During the training, the target length is set to the golden length, and the vanilla-NAT is trained with
the cross-entropy loss, which explicitly aligns model outputs to target words:
L_CE(θ) = −∑_{t=1}^{T} log p_t(y_t|x, θ).  (2)
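For concreteness, the following is a minimal Python sketch (our own illustration, not the paper's released code) of the vanilla-NAT factorization and cross-entropy loss in Equations 1–2, assuming the decoder has already produced per-position logits over the vocabulary; the function name and tensor shapes are illustrative assumptions.

```python
# A minimal sketch of the vanilla-NAT cross-entropy loss (Equation 2), assuming
# the decoder outputs one distribution per target position independently.
import torch
import torch.nn.functional as F

def nat_cross_entropy(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """logits: [T, V] unnormalized scores for T positions over a vocabulary of size V.
    target: [T] gold token ids. Returns L_CE = -sum_t log p_t(y_t | x, theta)."""
    log_probs = F.log_softmax(logits, dim=-1)                 # log p_t(.|x, theta)
    token_log_probs = log_probs.gather(1, target.unsqueeze(1)).squeeze(1)
    return -token_log_probs.sum()                             # verbatim position-wise alignment to y
```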
a = (ε, A, A, ε, A, B, B, C, ε, D)
CTC: β^{-1}(a) = (A, A, B, C, D)
SCTC: β_s^{-1}(a) = (A, A, A, B, B, C, D)

Figure 2: An example of target sentences obtained from the collapsing functions of CTC and SCTC. The collapsing function of SCTC is β_s^{-1}, which only removes blanks in the alignment a, whereas the collapsing function of CTC removes repetitions first and then removes blanks.
2.2 NAT with Latent Alignments
The vanilla-NAT suffers from two major limitations. The first limitation is the explicit alignment
required by the cross-entropy loss, which cannot be guaranteed due to the multi-modality problem
and therefore leads to the inaccuracy of the loss. The second limitation is the requirement of target
length prediction. The predicted length may not be optimal and cannot be changed dynamically, so it
is often required to use multiple length candidates and re-score them to produce the final translation.
The two limitations can be addressed by using CTC-based latent alignment models [15, 29, 36], which extend the output space Y with a blank token ε that means ‘output nothing’. We define the extended output space as Ȳ. Following prior works [14, 36], we refer to the elements a ∈ Ȳ as alignments, since the location of the blank tokens determines an alignment between the extended output and the target sentence. Assuming the generated alignments have length T_a, we define a function β(y) that returns the subset of Ȳ containing all possible alignments of length T_a for y, where β^{-1}: Ȳ ↦ Y is the collapsing function that first collapses all consecutive repeated words in a and then removes all blank tokens to obtain the target sentence. As illustrated in Figure 2, the alignment a = (ε, A, A, ε, A, B, B, C, ε, D) is collapsed to the target sentence y = (A, A, B, C, D) with a monotonic mapping from target positions to alignment positions. During training, latent alignment models marginalize over all alignments with the CTC loss [15]:

log p(y|x, θ) = log ∑_{a∈β(y)} p(a|x, θ),  (3)

where the alignment probability p(a|x, θ) is modeled by a non-autoregressive Transformer:

p(a|x, θ) = ∏_{t=1}^{T_a} p_t(a_t|x, θ).  (4)
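As a concrete illustration of the collapsing function, the short sketch below (our own, not part of the released implementation) reproduces the Figure 2 example; it also includes the blank-removal-only variant that Section 3.1 introduces as SCTC. Token sequences are assumed to be plain Python tuples and the blank is the string "ε".

```python
# A small sketch of the two collapsing functions from Figure 2.
BLANK = "ε"

def ctc_collapse(a):
    """β^{-1}: collapse consecutive repeated words, then remove blanks (CTC)."""
    deduped = [tok for i, tok in enumerate(a) if i == 0 or tok != a[i - 1]]
    return tuple(tok for tok in deduped if tok != BLANK)

def sctc_collapse(a):
    """β_s^{-1}: only remove blanks (SCTC, introduced in Section 3.1)."""
    return tuple(tok for tok in a if tok != BLANK)

a = ("ε", "A", "A", "ε", "A", "B", "B", "C", "ε", "D")
assert ctc_collapse(a) == ("A", "A", "B", "C", "D")             # Figure 2, CTC
assert sctc_collapse(a) == ("A", "A", "A", "B", "B", "C", "D")  # Figure 2, SCTC
```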
NAT models with latent alignments achieve superior performance by overcoming the two major limitations of NAT [29, 36, 16, 25, 33, 49]. The weakness in handling non-monotonic latent alignments has been noticed in previous work [36, 16] but remains unsolved.
3 Approach
In this section, we explore non-monotonic latent alignments for NAT. We introduce a simplified CTC
loss in section 3.1, which has a simpler structure that helps further analysis and derivation under the
regular CTC loss. We explore non-monotonic alignments under the simplified CTC loss in section
3.2, and utilize the results to derive non-monotonic alignments under the regular CTC loss in section
3.3. Finally, we introduce the training strategy to combine monotonic and non-monotonic latent
alignments in section 3.4.
3.1 Simplified Connectionist Temporal Classification
In CTC, the collapsing function β^{-1} maps an alignment to a target sentence in two steps: (1) it collapses all consecutive repeated words, and (2) it removes all blank tokens. The two operations make the alignment space β(y) complex. Therefore, we first introduce the simplified connectionist temporal classification loss (SCTC), which has a simpler structure that helps the derivation of non-monotonic alignments, and the results on SCTC are helpful for further analysis under the regular CTC loss.
To simplify the alignment structure, we consider using only one operation in the collapsing function. One option is to only collapse all consecutive repeated words in the alignment. The concern is that it will limit the expressive power of the model. For example, the target sentence y = (A, A, B, C, D) would have probability 0 since it has repeated words that cannot be mapped from any alignment. Therefore, we favor the other option, which only removes all blank tokens. In this way, the expressive power stays the same and the alignment space becomes much simpler, where alignments of y simply contain the T target words and T_a − T blank tokens.
We denote this simplified loss as SCTC, and illustrate the difference between CTC and SCTC in Figure 2. The alignment space for the target sentence y is defined as β_s(y), and the collapsing function is defined as β_s^{-1}. We can still obtain the translation probability p(y|x, θ) with dynamic programming. We define the forward variable α_t(s) to be the total probability of y_{1:s} considering all possible alignments a_{1:t}, which can be calculated recursively from α_{t−1}(s) and α_{t−1}(s−1):

α_t(s) = α_{t−1}(s−1) · p_t(y_s|x, θ) + α_{t−1}(s) · p_t(ε|x, θ).  (5)

Finally, the total translation probability p(y|x, θ) is given by the forward variable α_{T_a}(T), so we can train the model with the cross-entropy loss L_sctc(θ) = −log α_{T_a}(T).
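The recursion in Equation 5 amounts to a short dynamic program. The sketch below is our own illustration (not the paper's released code), assuming per-step distributions are given as Python dictionaries keyed by token; a practical implementation would operate on batched tensors in log-space.

```python
# A minimal sketch of the SCTC forward recursion (Equation 5), assuming
# probs[t] maps tokens (including the blank "ε") to p_t(.|x, θ) at step t+1.
def sctc_probability(probs, y):
    """Return p(y|x) = α_{T_a}(T); the SCTC loss is -log of this value."""
    T_a, T = len(probs), len(y)
    # alpha[s] = α_t(s): total probability of emitting the prefix y_{1:s} in t steps.
    alpha = [0.0] * (T + 1)
    alpha[0] = 1.0
    for t in range(1, T_a + 1):
        new_alpha = [0.0] * (T + 1)
        for s in range(T + 1):
            emit = alpha[s - 1] * probs[t - 1][y[s - 1]] if s > 0 else 0.0  # α_{t-1}(s-1)·p_t(y_s)
            blank = alpha[s] * probs[t - 1]["ε"]                            # α_{t-1}(s)·p_t(ε)
            new_alpha[s] = emit + blank
        alpha = new_alpha
    return alpha[T]
```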
3.2 Non-Monotonic Alignments under SCTC
3.2.1 Bipartite Matching
In this section, we explore non-monotonic alignments under the SCTC loss, which are helpful for
further analysis under the regular CTC loss. We first extend the alignment space from monotonic
alignments β_s(y) to non-monotonic alignments γ_s(y) to allow for the global word reordering in machine translation, where γ_s(y) is defined as:

γ_s(y) = ⋃_{y′∈P(y)} β_s(y′).  (6)
In the above definition, P(y) represents all permutations of y. For example, P((A, B)) = {(A, B), (B, A)}. By enumerating P(y), we consider all possible word reorderings of the target sentence. Ideally, we want to traverse all alignments in γ_s(y) to calculate the log-likelihood loss:
L_sum(θ) = −log ∑_{a∈γ_s(y)} p(a|x, θ),  (7)
However, without the monotonic structure, it becomes difficult to marginalize out all latent alignments
with dynamic programming. Alternatively, we can minimize the loss of the best alignment, which is
an upper bound of Equation 7:
L_max(θ) = −log max_{a∈γ_s(y)} p(a|x, θ).  (8)
Following prior work [46, 7, 9], we formulate finding the best alignment as a maximum bipartite matching problem and solve it with the Hungarian algorithm [28]. Specifically, we observe that the alignment space γ_s(y) simply consists of permutations of the T target words and T_a − T blank tokens. Therefore, finding the best alignment is equivalent to finding the best bipartite matching between the T_a model predictions and the T target words plus T_a − T blank tokens, where the two sets of nodes are connected by edges with the prediction log-probabilities as weights.
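A minimal sketch of this matching step using scipy's implementation of the Hungarian algorithm is given below; the function name and array shapes are our own assumptions rather than the paper's released code. The cost of the optimal assignment equals L_max(θ) for the given sentence pair.

```python
# A sketch of the best non-monotonic alignment (Equation 8) via bipartite matching,
# assuming log_probs is a [T_a, V] array of per-position log-probabilities and
# target_ids holds the T target word ids.
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_alignment_loss(log_probs: np.ndarray, target_ids: list, blank_id: int) -> float:
    T_a, T = log_probs.shape[0], len(target_ids)
    # Right-hand nodes: the T target words plus (T_a - T) blank tokens.
    columns = list(target_ids) + [blank_id] * (T_a - T)
    # cost[t, j] = -log p_t(columns[j] | x): edge weight between prediction t and node j.
    cost = -log_probs[:, columns]
    rows, cols = linear_sum_assignment(cost)        # Hungarian algorithm
    return cost[rows, cols].sum()                   # L_max = -log max_a p(a|x)
```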
3.2.2 N-Gram Matching
Without the monotonic structure desired for dynamic programming, calculating Equation 7 becomes
difficult. This difficulty is also caused by the strict requirement of exact match for alignments, which
makes it intractable to simplify the summation of probabilities. In practice, it is not necessary to force
the exact match between alignments and the target sentence since a large overlap is also favorable.
Therefore, we further extend the alignment space by considering all alignments that overlap with the
target sentence. Specifically, we are interested in the overlap of n-grams, which is the core of some
evaluation metrics (e.g., BLEU).
We propose a non-monotonic and non-exclusive n-gram matching objective based on SCTC to encourage the overlap of n-grams. Following the underlying idea of probabilistic matching [39], we introduce the probabilistic variant of the n-gram count to make the objective differentiable.
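The differentiable, probabilistic n-gram count is developed in the remainder of this section. As a non-differentiable reference point for what it approximates, the discrete n-gram F1 between a collapsed model output and the target can be computed as below; this sketch is our own illustration, not the training objective itself.

```python
# A non-differentiable reference for n-gram matching: clipped n-gram counts and
# their F1 between a (collapsed) output sequence and the target sentence.
from collections import Counter

def ngram_counts(seq, n):
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def ngram_f1(output, target, n=2):
    out_counts, tgt_counts = ngram_counts(output, n), ngram_counts(target, n)
    match = sum(min(c, tgt_counts[g]) for g, c in out_counts.items())  # clipped matches
    if match == 0:
        return 0.0
    precision = match / max(sum(out_counts.values()), 1)
    recall = match / max(sum(tgt_counts.values()), 1)
    return 2 * precision * recall / (precision + recall)
```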