
Viterbi Decoding of Directed Acyclic Transformer for Non-Autoregressive Machine Translation
Chenze Shao1,2, Zhengrui Ma1,2, Yang Feng1,2∗
1Key Laboratory of Intelligent Information Processing
Institute of Computing Technology, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
{shaochenze18z,mazhengrui21b,fengyang}@ict.ac.cn
Abstract
Non-autoregressive models achieve significant decoding speedup in neural machine translation but lack the ability to capture sequential dependency. Directed Acyclic Transformer (DA-Transformer) was recently proposed to model sequential dependency with a directed acyclic graph. Consequently, it has to apply a sequential decision process at inference time, which harms the global translation accuracy. In this paper, we present a Viterbi decoding framework for DA-Transformer, which guarantees to find the joint optimal solution for the translation and decoding path under any length constraint. Experimental results demonstrate that our approach consistently improves the performance of DA-Transformer while maintaining a similar decoding speedup.1
1 Introduction
Non-autoregressive translation (Gu et al., 2018) models achieve a significant decoding speedup but suffer from performance degradation, which is mainly attributed to the multi-modality problem. Multi-modality refers to the scenario where the same source sentence may have multiple translations with a strong cross-correlation between target words. However, non-autoregressive models generally hold the conditional independence assumption on target words, which prevents them from capturing the multi-modal target distribution.
Recently, Directed Acyclic Transformer (Huang et al., 2022) was proposed to model sequential dependency with a directed acyclic graph consisting of different decoding paths that enable the model to capture multiple translation modalities. Although it has been proven effective, it cannot directly find the most probable translation with the argmax operation. Therefore, DA-Transformer has to apply a sequential decision process at inference time, which harms the global translation accuracy.
∗Corresponding author: Yang Feng
1We implement our method on https://github.com/thu-coai/DA-Transformer.
In this paper, we propose a Viterbi decoding (Viterbi, 1967) framework for DA-Transformer to improve the decoding accuracy. Using the Markov property of the decoding path, we can apply Viterbi decoding to find the most probable path, conditioned on which we can generate the translation with argmax decoding. Then, we further improve this decoding algorithm to perform a simultaneous search for decoding paths and translations, which guarantees to find the joint optimal solution under any length constraint. After Viterbi decoding, we obtain a set of translations with different lengths and rerank them to obtain the final translation. We apply a length penalty term in the reranking process, which prevents the generation of empty translations (Stahlberg and Byrne, 2019) and enables us to control the translation length flexibly.
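To make this procedure concrete, the sketch below runs a Viterbi search over the directed acyclic graph for a path of a fixed length, scoring each vertex with its best token log-probability, and then reads off the translation by per-position argmax. It is a minimal single-length illustration rather than the paper's full joint algorithm: the inputs `trans` (an L×L matrix of transition log-probabilities, with entries for non-successor vertices assumed to be -inf) and `emit` (an L×V matrix of token log-probabilities) are taken as given, and batching and the length-penalty reranking are omitted.

```python
import numpy as np

def viterbi_path(trans, emit, path_len):
    """Most probable decoding path of a given length through the DAG.

    trans: (L, L) transition log-probs; trans[i, j] scores the edge i -> j,
           and entries with j <= i are assumed to be -inf (the graph is acyclic).
    emit:  (L, V) token log-probs at every DAG vertex.
    path_len: number of vertices on the path (the length constraint).
    """
    L = trans.shape[0]
    tok_score = emit.max(axis=-1)                 # best token log-prob per vertex
    alpha = np.full((path_len, L), -np.inf)       # alpha[t, j]: best length-(t+1) path ending at j
    back = np.zeros((path_len, L), dtype=int)
    alpha[0, 0] = tok_score[0]                    # every path starts at vertex 0
    for t in range(1, path_len):
        cand = alpha[t - 1][:, None] + trans      # extend every predecessor i with edge i -> j
        back[t] = cand.argmax(axis=0)
        alpha[t] = cand.max(axis=0) + tok_score
    path = [L - 1]                                # every path ends at the final vertex
    for t in range(path_len - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

def viterbi_decode(trans, emit, path_len):
    """Translation = argmax token at each vertex of the best path."""
    path = viterbi_path(trans, emit, path_len)
    return [int(emit[v].argmax()) for v in path]
```

Running this search once for every admissible path length and reranking the resulting candidates with a length penalty corresponds to the final selection step described above.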
Experimental results on several machine translation benchmark tasks (WMT14 En↔De, WMT17 Zh↔En) show that our approach consistently improves the performance of DA-Transformer while maintaining a similar decoding speedup.
2 Preliminaries: DA-Transformer
2.1 Model Architecture
DA-Transformer is composed of a Transformer encoder and a directed acyclic decoder. The encoder and the decoder layers are the same as in the vanilla Transformer (Vaswani et al., 2017). On top of the decoder, the hidden states are organized as a directed acyclic graph, whose edges represent transition probabilities between hidden states.
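As a rough illustration of how such a graph can sit on top of the decoder (a plausible parameterization for this sketch, not necessarily the authors' exact one), the snippet below turns the top-layer hidden states into an L×L matrix of transition probabilities using learned projections and an upper-triangular mask, so each vertex can only transition to a strictly later vertex; the projection parameters `w_q` and `w_k` are hypothetical names introduced here.

```python
import torch
import torch.nn.functional as F

def transition_matrix(hidden: torch.Tensor, w_q: torch.Tensor, w_k: torch.Tensor) -> torch.Tensor:
    """DAG edge probabilities from decoder hidden states.

    hidden: (L, d) top-layer decoder states, one per DAG vertex.
    w_q, w_k: (d, d) learned projections (hypothetical names for this sketch).
    Returns an (L, L) matrix whose row i is a distribution over successors j > i.
    """
    q, k = hidden @ w_q, hidden @ w_k                    # (L, d) each
    scores = q @ k.t() / hidden.size(-1) ** 0.5          # (L, L) pairwise scores
    # Acyclicity: only edges i -> j with j > i are allowed.
    allowed = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    probs = F.softmax(scores.masked_fill(~allowed, float("-inf")), dim=-1)
    return probs.nan_to_num(0.0)                         # the final vertex has no outgoing edges
```

The number of vertices L is tied to the source length as described next, so longer source sentences get proportionally larger graphs.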
Given a source sentence $X = \{x_1, \cdots, x_N\}$ and a target sentence $Y = \{y_1, \cdots, y_M\}$, the decoder length $L$ is set to $\lambda \cdot N$, where $\lambda$ is a hyperparameter. The translation probability from $X$ to