
Viterbi Decoding of Directed Acyclic Transformer for Non-Autoregressive Machine Translation
Chenze Shao1,2, Zhengrui Ma1,2, Yang Feng1,2∗
1Key Laboratory of Intelligent Information Processing
Institute of Computing Technology, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
{shaochenze18z,mazhengrui21b,fengyang}@ict.ac.cn
Abstract
Non-autoregressive models achieve significant decoding speedup in neural machine translation but lack the ability to capture sequential dependency. Directed Acyclic Transformer (DA-Transformer) was recently proposed to model sequential dependency with a directed acyclic graph. Consequently, it has to apply a sequential decision process at inference time, which harms the global translation accuracy. In this paper, we present a Viterbi decoding framework for DA-Transformer, which guarantees to find the joint optimal solution for the translation and decoding path under any length constraint. Experimental results demonstrate that our approach consistently improves the performance of DA-Transformer while maintaining a similar decoding speedup.1
1 Introduction
Non-autoregressive translation (Gu et al., 2018) models achieve a significant decoding speedup but suffer from performance degradation, which is mainly attributed to the multi-modality problem. Multi-modality refers to the scenario where the same source sentence may have multiple translations with a strong cross-correlation between target words. However, non-autoregressive models generally hold the conditional independence assumption on target words, which prevents them from capturing the multi-modal target distribution.
Recently, Directed Acyclic Transformer (Huang et al., 2022) was proposed to model sequential dependency with a directed acyclic graph consisting of different decoding paths that enable the model to capture multiple translation modalities. Although it has been proven effective, it cannot directly find the most probable translation with the argmax operation. Therefore, DA-Transformer has to apply a sequential decision process at inference time, which harms the global translation accuracy.
∗Corresponding author: Yang Feng
1We implement our method on https://github.com/thu-coai/DA-Transformer.
In this paper, we propose a Viterbi decoding (Viterbi, 1967) framework for DA-Transformer to improve the decoding accuracy. Using the Markov property of the decoding path, we can apply Viterbi decoding to find the most probable path, conditioned on which we can generate the translation with argmax decoding. Then, we further improve this decoding algorithm to perform a simultaneous search for decoding paths and translations, which guarantees to find the joint optimal solution under any length constraint. After Viterbi decoding, we obtain a set of translations with different lengths and rerank them to obtain the final translation. We apply a length penalty term in the reranking process, which prevents the generation of empty translations (Stahlberg and Byrne, 2019) and enables us to control the translation length flexibly.
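To make this procedure concrete, the sketch below runs a Viterbi search over the directed acyclic graph for a path of a fixed length, scoring each vertex with its best token log-probability, and then reads off the translation by per-position argmax. It is a minimal single-length illustration rather than the paper's full joint algorithm: the inputs `trans` (an L×L matrix of transition log-probabilities, with entries for non-successor vertices assumed to be -inf) and `emit` (an L×V matrix of token log-probabilities) are taken as given, and batching and the length-penalty reranking are omitted.

```python
import numpy as np

def viterbi_path(trans, emit, path_len):
    """Most probable decoding path of a given length through the DAG.

    trans: (L, L) transition log-probs; trans[i, j] scores the edge i -> j,
           and entries with j <= i are assumed to be -inf (the graph is acyclic).
    emit:  (L, V) token log-probs at every DAG vertex.
    path_len: number of vertices on the path (the length constraint).
    """
    L = trans.shape[0]
    tok_score = emit.max(axis=-1)                 # best token log-prob per vertex
    alpha = np.full((path_len, L), -np.inf)       # alpha[t, j]: best length-(t+1) path ending at j
    back = np.zeros((path_len, L), dtype=int)
    alpha[0, 0] = tok_score[0]                    # every path starts at vertex 0
    for t in range(1, path_len):
        cand = alpha[t - 1][:, None] + trans      # extend every predecessor i with edge i -> j
        back[t] = cand.argmax(axis=0)
        alpha[t] = cand.max(axis=0) + tok_score
    path = [L - 1]                                # every path ends at the final vertex
    for t in range(path_len - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

def viterbi_decode(trans, emit, path_len):
    """Translation = argmax token at each vertex of the best path."""
    path = viterbi_path(trans, emit, path_len)
    return [int(emit[v].argmax()) for v in path]
```

Running this search once for every admissible path length and reranking the resulting candidates with a length penalty corresponds to the final selection step described above.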
Experimental results on several machine translation benchmark tasks (WMT14 En↔De, WMT17 Zh↔En) show that our approach consistently improves the performance of DA-Transformer while maintaining a similar decoding speedup.
2 Preliminaries: DA-Transformer
2.1 Model Architecture
DA-Transformer is composed of a Transformer encoder and a directed acyclic decoder. The encoder and the decoder layers are the same as in the vanilla Transformer (Vaswani et al., 2017). On top of the decoder, the hidden states are organized as a directed acyclic graph, whose edges represent transition probabilities between hidden states.
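As a rough illustration of how such a graph can sit on top of the decoder (a plausible parameterization for this sketch, not necessarily the authors' exact one), the snippet below turns the top-layer hidden states into an L×L matrix of transition probabilities using learned projections and an upper-triangular mask, so each vertex can only transition to a strictly later vertex; the projection parameters `w_q` and `w_k` are hypothetical names introduced here.

```python
import torch
import torch.nn.functional as F

def transition_matrix(hidden: torch.Tensor, w_q: torch.Tensor, w_k: torch.Tensor) -> torch.Tensor:
    """DAG edge probabilities from decoder hidden states.

    hidden: (L, d) top-layer decoder states, one per DAG vertex.
    w_q, w_k: (d, d) learned projections (hypothetical names for this sketch).
    Returns an (L, L) matrix whose row i is a distribution over successors j > i.
    """
    q, k = hidden @ w_q, hidden @ w_k                    # (L, d) each
    scores = q @ k.t() / hidden.size(-1) ** 0.5          # (L, L) pairwise scores
    # Acyclicity: only edges i -> j with j > i are allowed.
    allowed = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    probs = F.softmax(scores.masked_fill(~allowed, float("-inf")), dim=-1)
    return probs.nan_to_num(0.0)                         # the final vertex has no outgoing edges
```

The number of vertices L is tied to the source length as described next, so longer source sentences get proportionally larger graphs.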
Given a source sentence $X = \{x_1, \cdots, x_N\}$ and a target sentence $Y = \{y_1, \cdots, y_M\}$, the decoder length $L$ is set to $\lambda \cdot N$, where $\lambda$ is a hyperparameter. The translation probability from $X$ to