To simplify the alignment structure, we consider using only one operation in the collapsing function. One option is to collapse only consecutive repeated words in the alignment. The concern is that this would limit the expressive power of the model. For example, the target sentence $y = (A, A, B, C, D)$ would have probability 0, since its consecutive repeated words cannot be produced from any alignment. Therefore, we favor another option, which only removes all blank tokens. In this way, the expressive power stays the same and the alignment space becomes much simpler: alignments of $y$ simply contain the $T$ target words and $T_a - T$ blank tokens.
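To make the contrast concrete, below is a minimal sketch of the two candidate collapsing operations (the function names and the blank marker are ours, purely for illustration): collapsing repeats can never recover a target with consecutive repeated words, while removing blanks can.

```python
BLANK = "<blank>"  # illustrative blank marker

def collapse_repeats(alignment):
    """Merge consecutive repeated tokens (the first candidate operation)."""
    out = []
    for tok in alignment:
        if not out or tok != out[-1]:
            out.append(tok)
    return out

def remove_blanks(alignment):
    """Drop all blank tokens (the collapsing operation favored here)."""
    return [tok for tok in alignment if tok != BLANK]

# No alignment maps to (A, A, B, C, D) under repeat-collapsing alone:
print(collapse_repeats(["A", "A", "B", "C", "D"]))             # ['A', 'B', 'C', 'D']
# Under blank removal, interleaving the target words with blanks works:
print(remove_blanks(["A", BLANK, "A", "B", BLANK, "C", "D"]))  # ['A', 'A', 'B', 'C', 'D']
```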
We denote this simplified loss as SCTC, and illustrate the difference between CTC and SCTC in Figure 2. The alignment space for the target sentence $y$ is defined as $\beta_s(y)$, and the collapsing function is defined as $\beta_s^{-1}$. We can still obtain the translation probability $p(y \mid x, \theta)$ with dynamic programming. We define the forward variable $\alpha_t(s)$ to be the total probability of $y_{1:s}$ considering all possible alignments $a_{1:t}$, which can be calculated recursively from $\alpha_{t-1}(s)$ and $\alpha_{t-1}(s-1)$:
$$\alpha_t(s) = \alpha_{t-1}(s-1)\, p_t(y_s \mid x, \theta) + \alpha_{t-1}(s)\, p_t(\epsilon \mid x, \theta), \tag{5}$$
where $\epsilon$ denotes the blank token. Finally, the total translation probability $p(y \mid x, \theta)$ is given by the forward variable $\alpha_{T_a}(T)$, so we can train the model with the cross-entropy loss $\mathcal{L}_{\mathrm{SCTC}}(\theta) = -\log \alpha_{T_a}(T)$.
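For concreteness, the following sketch implements the forward recursion of Equation 5 in log space; the interface (per-position log-distributions as a $T_a \times V$ matrix and an explicit blank index) and the function name are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sctc_loss(log_probs, target, blank_id):
    """Negative log-likelihood under SCTC via the forward recursion (Eq. 5).

    log_probs: array of shape (T_a, V) holding log p_t(.|x, theta).
    target:    list of T token ids (the target sentence y).
    blank_id:  index of the blank token (hypothetical; vocabulary-dependent).
    """
    T_a = log_probs.shape[0]
    T = len(target)
    # alpha[t, s] = log total probability of y_{1:s} over alignments a_{1:t}
    alpha = np.full((T_a + 1, T + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(1, T_a + 1):
        for s in range(0, min(t, T) + 1):
            # emit a blank at position t: alpha_{t-1}(s) * p_t(blank)
            stay = alpha[t - 1, s] + log_probs[t - 1, blank_id]
            # emit the next target word y_s: alpha_{t-1}(s-1) * p_t(y_s)
            move = alpha[t - 1, s - 1] + log_probs[t - 1, target[s - 1]] if s > 0 else -np.inf
            alpha[t, s] = np.logaddexp(stay, move)
    return -alpha[T_a, T]
```

For example, with $T_a = 4$ positions and a two-word target, the recursion sums over the $\binom{4}{2} = 6$ monotonic alignments that interleave the two words with two blanks.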
3.2 Non-Monotonic Alignments under SCTC
3.2.1 Bipartite Matching
In this section, we explore non-monotonic alignments under the SCTC loss, which are helpful for further analysis under the regular CTC loss. We first extend the alignment space from monotonic alignments $\beta_s(y)$ to non-monotonic alignments $\gamma_s(y)$ to allow for global word reordering in machine translation, where $\gamma_s(y)$ is defined as:
$$\gamma_s(y) = \bigcup_{y' \in P(y)} \beta_s(y'). \tag{6}$$
In the above definition, $P(y)$ represents all permutations of $y$. For example, $P((A, B)) = \{(A, B), (B, A)\}$. By enumerating $P(y)$, we consider all possible word reorderings of the target sentence. Ideally, we want to traverse all alignments in $\gamma_s(y)$ to calculate the log-likelihood loss:
$$\mathcal{L}_{\mathrm{sum}}(\theta) = -\log \sum_{a \in \gamma_s(y)} p(a \mid x, \theta). \tag{7}$$
However, without the monotonic structure, it becomes difficult to marginalize out all latent alignments with dynamic programming. Alternatively, we can minimize the loss of the best alignment, which is an upper bound of Equation 7 since the probability of the best alignment is at most the sum over all alignments:
$$\mathcal{L}_{\max}(\theta) = -\log \max_{a \in \gamma_s(y)} p(a \mid x, \theta). \tag{8}$$
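For toy-sized inputs, both objectives can be evaluated by brute force, using the observation (made explicit in the next paragraph) that $\gamma_s(y)$ consists of permutations of the $T$ target words and $T_a - T$ blank tokens. The sketch below is our own illustration and is exponential in $T_a$, which is precisely why Equation 7 is intractable in general.

```python
import itertools
import numpy as np

def nonmonotonic_losses(log_probs, target, blank_id):
    """Brute-force L_sum (Eq. 7) and L_max (Eq. 8) for toy-sized inputs only.

    gamma_s(y) is enumerated directly as the distinct permutations of the
    T target words plus T_a - T blank tokens.
    """
    T_a = log_probs.shape[0]
    multiset = list(target) + [blank_id] * (T_a - len(target))
    log_p_sum, log_p_max = -np.inf, -np.inf
    for a in set(itertools.permutations(multiset)):  # dedupe repeated tokens
        log_p_a = sum(log_probs[t, tok] for t, tok in enumerate(a))
        log_p_sum = np.logaddexp(log_p_sum, log_p_a)
        log_p_max = max(log_p_max, log_p_a)
    return -log_p_sum, -log_p_max
```

Since the maximum term is at most the full sum, the returned $\mathcal{L}_{\max}$ is always at least $\mathcal{L}_{\mathrm{sum}}$.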
Following prior work [46, 7, 9], we formulate finding the best alignment as a maximum bipartite matching problem and solve it with the Hungarian algorithm [28]. Specifically, we observe that the alignment space $\gamma_s(y)$ simply consists of permutations of the $T$ target words and $T_a - T$ blank tokens. Therefore, finding the best alignment is equivalent to finding the best bipartite matching between the $T_a$ model predictions and the $T$ target words plus $T_a - T$ blank tokens, where the two sets of nodes are connected by edges with the prediction log-probabilities as weights.
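This reduction can be sketched with SciPy's Hungarian solver, `scipy.optimize.linear_sum_assignment`; the cost-matrix construction below follows the description above, but the function and its interface are our own illustration rather than the paper's implementation (the solver minimizes cost, so the log-probabilities are negated).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_alignment(log_probs, target, blank_id):
    """Find the best alignment in gamma_s(y) via maximum bipartite matching.

    Rows are the T_a model predictions; columns are the T target words
    followed by T_a - T blank tokens; edge weights are log-probabilities.
    """
    T_a = log_probs.shape[0]
    columns = list(target) + [blank_id] * (T_a - len(target))
    # cost[t, j] = -log p_t(columns[j] | x, theta); Hungarian minimizes cost.
    cost = -log_probs[:, columns]
    rows, cols = linear_sum_assignment(cost)
    alignment = [columns[j] for j in cols]  # row indices are returned in order
    log_prob = -cost[rows, cols].sum()
    return alignment, log_prob
```

Because the $T_a - T$ blank columns are identical, many matchings share the optimal score; the solver returns one of these equivalent optima.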
3.2.2 N-Gram Matching
Without the monotonic structure required by dynamic programming, calculating Equation 7 becomes difficult. The difficulty also stems from the strict requirement of an exact match between alignments and the target, which makes it intractable to simplify the summation of probabilities. In practice, it is not necessary to force an exact match between alignments and the target sentence, since a large overlap is also favorable. Therefore, we further extend the alignment space by considering all alignments that overlap with the target sentence. Specifically, we are interested in the overlap of n-grams, which is at the core of some evaluation metrics (e.g., BLEU).
We propose a non-monotonic and non-exclusive n-gram matching objective based on SCTC to encourage the overlap of n-grams. Following the underlying idea of probabilistic matching [39], we introduce a probabilistic variant of the n-gram count to make the objective differentiable.
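To convey the expected-count idea only, here is a deliberately simplified sketch of our own (it treats positions independently and ignores blank tokens, so it does not constitute the exact SCTC-based objective proposed here): the hard count of an n-gram is replaced by its expectation under the per-position distributions, which becomes differentiable when implemented with autodiff tensors.

```python
import numpy as np

def expected_ngram_count(probs, ngram):
    """Expected (probabilistic) count of one n-gram under per-position
    distributions p_t(.|x, theta), assuming independent positions and no blanks.

    probs: array of shape (T_a, V) with p_t(w | x, theta).
    ngram: tuple of n token ids.
    """
    T_a = probs.shape[0]
    n = len(ngram)
    count = 0.0
    for t in range(T_a - n + 1):
        # probability that positions t .. t+n-1 emit exactly this n-gram
        count += np.prod([probs[t + i, w] for i, w in enumerate(ngram)])
    return count
```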