Viterbi Decoding of Directed Acyclic Transformer for Non-Autoregressive
Machine Translation
Chenze Shao1,2, Zhengrui Ma1,2, Yang Feng1,2
1Key Laboratory of Intelligent Information Processing
Institute of Computing Technology, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
{shaochenze18z,mazhengrui21b,fengyang}@ict.ac.cn
Abstract
Non-autoregressive models achieve significant decoding speedup in neural machine translation but lack the ability to capture sequential dependency. Directed Acyclic Transformer (DA-Transformer) was recently proposed to model sequential dependency with a directed acyclic graph. Consequently, it has to apply a sequential decision process at inference time, which harms the global translation accuracy. In this paper, we present a Viterbi decoding framework for DA-Transformer, which guarantees to find the joint optimal solution for the translation and decoding path under any length constraint. Experimental results demonstrate that our approach consistently improves the performance of DA-Transformer while maintaining a similar decoding speedup.¹
1 Introduction
Non-autoregressive translation (Gu et al., 2018) models achieve a significant decoding speedup but suffer from performance degradation, which is mainly attributed to the multi-modality problem. Multi-modality refers to the scenario where the same source sentence may have multiple translations with a strong cross-correlation between target words. However, non-autoregressive models generally hold the conditional independence assumption on target words, which prevents them from capturing the multimodal target distribution.
Recently, Directed Acyclic Transformer (Huang et al., 2022) was proposed to model sequential dependency with a directed acyclic graph consisting of different decoding paths that enable the model to capture multiple translation modalities. Although it has been proven effective, it cannot directly find the most probable translation with the argmax operation. Therefore, DA-Transformer has to apply a sequential decision process at inference time, which harms the global translation accuracy.

*Corresponding author: Yang Feng.
¹We implement our method on https://github.com/thu-coai/DA-Transformer.
In this paper, we propose a Viterbi decoding (Viterbi, 1967) framework for DA-Transformer to improve the decoding accuracy. Using the Markov property of the decoding path, we can apply Viterbi decoding to find the most probable path, conditioned on which we can generate the translation with argmax decoding. Then, we further improve this decoding algorithm to perform a simultaneous search for decoding paths and translations, which guarantees to find the joint optimal solution under any length constraint. After Viterbi decoding, we obtain a set of translations with different lengths and rerank them to obtain the final translation. We apply a length penalty term in the reranking process, which prevents the generation of empty translations (Stahlberg and Byrne, 2019) and enables us to control the translation length flexibly.

Experimental results on several machine translation benchmark tasks (WMT14 En↔De, WMT17 Zh↔En) show that our approach consistently improves the performance of DA-Transformer while maintaining a similar decoding speedup.
2 Preliminaries: DA-Transformer
2.1 Model Architecture
DA-Transformer is formed by a Transformer encoder and a directed acyclic decoder. The encoder and the decoder layers are the same as in the vanilla Transformer (Vaswani et al., 2017). On top of the decoder, the hidden states are organized as a directed acyclic graph, whose edges represent transition probabilities between hidden states.
Given a source sentence $X = \{x_1, \cdots, x_N\}$ and a target sentence $Y = \{y_1, \cdots, y_M\}$, the decoder length $L$ is set to $\lambda \cdot N$, where $\lambda$ is a hyper-parameter. The translation probability from $X$ to $Y$ is formulated as:
$$P_\theta(Y|X) = \sum_{A \in \Gamma} P_\theta(A|X)\, P_\theta(Y|X, A), \qquad (1)$$
where $A = \{a_1, \cdots, a_M\}$ is a translation path for the target sentence $Y$ and $a_i$ represents the position of word $y_i$ in the decoder. $\Gamma$ contains all possible translation paths with $1 = a_1 < \cdots < a_M = L$.
The probability of translation path $A$ is formulated based on the Markov hypothesis:

$$P_\theta(A|X) = \prod_{i=1}^{M-1} P_\theta(a_{i+1}|a_i, X) = \prod_{i=1}^{M-1} E_{a_i, a_{i+1}}, \qquad (2)$$
where $E \in \mathbb{R}^{L \times L}$ is the transition matrix obtained by self-attention, and $E_{a_i, a_{i+1}}$ represents the transition probability from position $a_i$ to position $a_{i+1}$. $E$ is masked by a lower triangular matrix to ensure that the translation path is acyclic.
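For concreteness, the sketch below shows one way such a transition matrix can be formed from decoder self-attention scores. We read the masking as forbidding transitions to earlier positions, consistent with $1 = a_1 < \cdots < a_M = L$; the function and variable names are ours rather than those of the released implementation.

```python
import numpy as np

def transition_matrix(attn_scores):
    """Sketch: turn decoder self-attention scores into the transition matrix E.

    Only transitions to strictly later positions are kept (so every path satisfies
    a_1 < ... < a_M), and each row is normalized with a softmax. Interface is ours,
    not the released DA-Transformer code.
    """
    L = attn_scores.shape[0]
    allowed = np.triu(np.ones((L, L), dtype=bool), k=1)    # only t -> t' with t' > t
    scores = np.where(allowed, attn_scores, -1e9)           # mask out backward/self transitions
    scores = scores - scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    probs = np.exp(scores) * allowed
    return probs / np.maximum(probs.sum(axis=-1, keepdims=True), 1e-9)
```

Each row of the resulting matrix sums to one (except the final position, which has no outgoing edge), matching the role of $E$ in Equation 2.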
Conditioned on $X$ and the translation path $A$, the translation probability of $Y$ is formulated as:

$$P_\theta(Y|A, X) = \prod_{i=1}^{M} P_\theta(y_i|a_i, X), \qquad (3)$$

where $P_\theta(y_i|a_i, X)$ represents the translation probability of word $y_i$ on position $a_i$ of the decoder.
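To make the factorization in Equations 1-3 concrete, the brute-force sketch below enumerates every path in $\Gamma$ for a toy decoder length and sums the products of transition and prediction probabilities. The tensor names (`E`, `token_probs`) are assumptions of ours; the actual model never enumerates paths but marginalizes over them with dynamic programming, as described next.

```python
import itertools
import numpy as np

def translation_prob(E, token_probs, target_ids):
    """Brute-force evaluation of Eq. (1): sum over all paths 1 = a_1 < ... < a_M = L.

    E:           [L, L] transition matrix (Eq. 2).
    token_probs: [L, V] per-position prediction distributions P(y | a, X) (Eq. 3).
    target_ids:  the M target token ids y_1 ... y_M.
    """
    L, M = E.shape[0], len(target_ids)
    total = 0.0
    # Endpoints are fixed (a_1 = 1, a_M = L); enumerate the M-2 interior positions.
    for mid in itertools.combinations(range(1, L - 1), M - 2):
        path = (0, *mid, L - 1)
        p_path = np.prod([E[path[i], path[i + 1]] for i in range(M - 1)])           # Eq. (2)
        p_words = np.prod([token_probs[path[i], target_ids[i]] for i in range(M)])  # Eq. (3)
        total += p_path * p_words
    return total
```

With a small decoder length this enumeration is feasible; for the real decoder length $L = \lambda \cdot N$ it is not, which is why training relies on dynamic programming.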
2.2 Training and Inference
The training objective of DA-Transformer is to maximize the log-likelihood $\log P_\theta(Y|X)$, which requires marginalizing over all paths $A$. Using the Markov property of the translation path, DA-Transformer employs dynamic programming to calculate the translation probability. Besides, it applies glancing training (Qian et al., 2021) with a hyper-parameter $\tau$ to promote the learning.
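As a rough illustration of this marginalization, the sketch below computes $P_\theta(Y|X)$ in $O(ML^2)$ with the forward recursion implied by the Markov property. The released implementation works in log space with batched PyTorch; this NumPy version, with our own variable names, only shows the recursion itself.

```python
import numpy as np

def log_prob(E, token_probs, target_ids):
    """Dynamic-programming marginalization over paths for log P(Y|X).

    f[i, t] = P(y_1 .. y_{i+1}, a_{i+1} = t | X), built left to right instead of
    enumerating every path in Gamma.
    """
    L, M = E.shape[0], len(target_ids)
    f = np.zeros((M, L))
    f[0, 0] = token_probs[0, target_ids[0]]          # a_1 is fixed to the first position
    for i in range(1, M):
        # move from every previous position t' to t, then predict y_{i+1} at t
        f[i] = (f[i - 1] @ E) * token_probs[:, target_ids[i]]
    return np.log(f[M - 1, L - 1])                   # a_M is fixed to the last position
```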
During inference, the objective is to find the most probable translation $\operatorname{argmax}_Y P_\theta(Y|X)$. However, there is no known tractable decoding algorithm for this problem. Huang et al. (2022) proposed three approximate decoding strategies to find high-probability translations. The intuitive strategy is greedy decoding, which sequentially takes the most probable transition as the decoding path and generates a translation according to the conditional probabilities. Lookahead decoding improves greedy decoding by taking the most probable combination of transition and prediction as follows:

$$y_i^*, a_i^* = \operatorname*{argmax}_{y_i, a_i} P_\theta(y_i|a_i, X)\, P_\theta(a_i|a_{i-1}, X). \qquad (4)$$

Beam search decoding is a more accurate method that merges the paths of the same prefix, which approximates the real translation probability and better represents the model's preference. Beam search can optionally be combined with an n-gram language model to improve the performance further. However, the speed of beam search is much lower than greedy and lookahead decoding.
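The lookahead rule in Equation 4 can be sketched as follows; `E`, `token_probs`, and the stopping criterion (reaching the final position) are simplifications of ours, not the exact released decoding code.

```python
import numpy as np

def lookahead_decode(E, token_probs, max_steps=200):
    """Sketch of lookahead decoding (Eq. 4): jointly pick the next position and token."""
    L, V = token_probs.shape
    pos = 0                                           # a_1 = 1 (index 0 here)
    tokens = [int(token_probs[0].argmax())]           # best word at the first position
    for _ in range(max_steps):
        # scores[t, y] = P(a = t | a_prev = pos, X) * P(y | t, X)
        scores = E[pos][:, None] * token_probs
        pos, tok = divmod(int(scores.argmax()), V)
        tokens.append(tok)
        if pos == L - 1:                              # stop once the last position is reached
            break
    return tokens
```

Greedy decoding corresponds to dropping the prediction term from the joint score and taking only the most probable transition at each step.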
3 Methodology
This section presents a Viterbi decoding framework for DA-Transformer to improve decoding accuracy. We first develop a basic algorithm to find the optimal decoding path and then improve it to find the joint optimal solution of translations and decoding paths. Finally, we introduce the technique to rerank the Viterbi decoding outputs.
3.1 Optimal Decoding Path
Recall that the greedy decoding strategy sequentially takes the most probable transition as the decoding path, which may not be optimal since the greedy strategy does not consider long-term profits. In response to this problem, we propose a Viterbi decoding framework for DA-Transformer that guarantees to find the optimal decoding path $\operatorname{argmax}_A P_\theta(A|X)$ under any length constraint.
Specifically, we consider decoding paths of length $i$ that end in position $a_i = t$, and use $\alpha(i, t)$ to represent the maximum probability of these paths. By definition, we set the initial state $\alpha(1, 1) = 1$ and $\alpha(1, t > 1) = 0$. The Markov property of decoding paths enables us to sequentially calculate $\alpha(i, \cdot)$ from its previous step $\alpha(i-1, \cdot)$:

$$\alpha(i, t) = \max_{t'} \alpha(i-1, t') \cdot E_{t', t},$$
$$\psi(i, t) = \operatorname*{argmax}_{t'} \alpha(i-1, t') \cdot E_{t', t}, \qquad (5)$$

where $E$ is the transition matrix defined in Equation 2 and $\psi(i, t)$ is the backtracking index pointing to the previous position. After $L$ iterations, we obtain the score for every possible length, and then we can find the optimal length with the argmax function:

$$M = \operatorname*{argmax}_{i} \alpha(i, L). \qquad (6)$$

After determining the length $M$, we can trace the best decoding path along the backtracking index starting from $a_M = L$:

$$a_i = \psi(i+1, a_{i+1}). \qquad (7)$$
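A compact sketch of this search (Equations 5-7) in NumPy, under our own interface rather than the authors' released code:

```python
import numpy as np

def viterbi_path(E):
    """Optimal decoding path per Eqs. (5)-(7).

    alpha[i, t]: maximum probability of a path with i+1 positions ending at t.
    psi[i, t]:   backpointer to the best previous position.
    """
    L = E.shape[0]
    alpha = np.zeros((L, L))
    psi = np.zeros((L, L), dtype=int)
    alpha[0, 0] = 1.0                                 # alpha(1, 1) = 1, alpha(1, t > 1) = 0
    for i in range(1, L):                             # Eq. (5)
        scores = alpha[i - 1][:, None] * E            # scores[t', t] = alpha(i-1, t') * E[t', t]
        alpha[i] = scores.max(axis=0)
        psi[i] = scores.argmax(axis=0)
    M = int(alpha[:, L - 1].argmax()) + 1             # Eq. (6): best path length ending at a_M = L
    path = [L - 1]
    for i in range(M - 1, 0, -1):                     # Eq. (7): backtrack along psi
        path.append(int(psi[i, path[-1]]))
    return list(reversed(path))
```

Conditioned on the returned positions, the translation can then be generated with argmax decoding over $P_\theta(y_i \mid a_i, X)$, as described above.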