other positions. To address this issue, we propose to generate tokens at different layers, so that token generation at upper layers can depend on tokens already generated at lower layers, from both left and right positions. In this way, our model can explicitly learn the dependencies between tokens across different layers while enjoying full parallelism in NAR decoding, as shown in
Figure 1. To this end, we propose to extend the
early exit technique (Li et al.,2021c) to NAR text
generation: if there is sufficient confidence to gen-
erate a token at a lower layer, the model is allowed
to exit at this layer and make the prediction without
passing through the upper layers.
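To make this idea concrete, the following is a minimal PyTorch-style sketch of confidence-based, token-level early exit. The names (`layers`, `lm_heads`, `threshold`) and the fallback to the top layer are illustrative assumptions rather than ELMER's actual implementation, and the sketch omits how exited hidden states are propagated to upper layers.

```python
import torch

def early_exit_decode(hidden, layers, lm_heads, threshold=0.9):
    """Token-level early exit (sketch): a position exits at the first layer whose
    prediction confidence exceeds `threshold`; positions that never reach the
    threshold take the top layer's prediction. Illustrative only."""
    batch, seq_len, _ = hidden.shape
    exited = torch.zeros(batch, seq_len, dtype=torch.bool, device=hidden.device)
    tokens = torch.zeros(batch, seq_len, dtype=torch.long, device=hidden.device)

    for layer, head in zip(layers, lm_heads):
        hidden = layer(hidden)                   # one bidirectional Transformer layer
        probs = head(hidden).softmax(dim=-1)     # per-position distribution over the vocabulary
        conf, pred = probs.max(dim=-1)           # confidence and argmax token per position
        exit_now = (conf > threshold) & ~exited  # confident positions that have not exited yet
        tokens[exit_now] = pred[exit_now]        # commit their predictions at this layer
        exited |= exit_now

    tokens[~exited] = pred[~exited]              # remaining positions exit at the last layer
    return tokens
```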
Furthermore, instead of exiting at a fixed layer
for a token, we aim to predict each token at differ-
ent layers for learning diverse token dependencies
in NAR text generation. Thus, inspired by XLNet
(Yang et al.,2019), we further propose a novel pre-
training objective based on early exit, i.e., Layer
Permutation Language Modeling (LPLM), to help
ELMER learn complex token dependencies. Given
a sequence, LPLM will permute the exit layer (from
1 to the maximum layer) for each token and maxi-
mize the NAR text generation probability w.r.t. all
possible exit layer permutations of the sequence.
Through LPLM, each token is able to exit at dif-
ferent layers and attend to all other tokens from
both forward and backward positions. In this way,
LPLM could effectively capture diverse token de-
pendencies from large-scale corpora. Pre-trained
with the general LPLM, ELMER can adapt to down-
stream text generation tasks and datasets by using
specific early exit strategies.
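A minimal sketch of the LPLM idea is given below, assuming per-layer language-modeling heads. It samples one exit layer per token as a stand-in for enumerating all exit-layer permutations, so the function names and the sampling strategy are illustrative rather than ELMER's exact pre-training objective.

```python
import torch
import torch.nn.functional as F

def lplm_loss(hidden, target_ids, layers, lm_heads):
    """Layer Permutation Language Modeling (sketch): assign every target token a
    randomly permuted exit layer and accumulate the NAR prediction loss at that
    layer; random sampling stands in for the expectation over all permutations."""
    batch, seq_len, _ = hidden.shape
    num_layers = len(layers)
    # Sample one exit layer per token position (uniform over the layers).
    exit_layers = torch.randint(0, num_layers, (batch, seq_len), device=hidden.device)

    loss = hidden.new_zeros(())
    for l, (layer, head) in enumerate(zip(layers, lm_heads)):
        hidden = layer(hidden)       # bidirectional layer: each token attends left and right
        logits = head(hidden)        # (batch, seq_len, vocab_size)
        mask = exit_layers == l      # tokens whose sampled exit layer is this one
        if mask.any():
            loss = loss + F.cross_entropy(logits[mask], target_ids[mask])
    return loss
```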
To the best of our knowledge, we are the first
to introduce the idea of early exit to NAR text
generation. We fine-tune ELMER on three popu-
lar text generation tasks. Experiments show that ELMER significantly outperforms the best NAR models, by +4.71 ROUGE-1 on XSUM, +1.79 METEOR on SQuAD v1.1, and +2.26 Distinct-2 on PersonaChat, and narrows the performance gap with autoregressive PLMs (e.g., ELMER 29.92 vs. BART 30.61 ROUGE-L on XSUM) while achieving over 10x faster inference.
2 Related Work
Pre-trained Language Models. Recent years have witnessed remarkable achievements of PLMs in text generation tasks (Li et al., 2021b). Most
PLMs adopt an AR paradigm to generate texts dur-
ing pre-training and fine-tuning. The work based
on GPT (Radford et al.,2019;Brown et al.,2020)
converts different tasks into language modeling by
sequentially predicting tokens. BART (Lewis et al.,
2020) employs an auto-regressive decoder to re-
cover the corrupted text in pre-training. T5 (Raffel
et al.,2020) masks word spans from input texts and
then sequentially predicts masked tokens. Tang et al. (2022) pre-train a text generation model on labeled datasets with multi-task learning. Li et al. (2022b) leverage prompts to effectively fine-tune text generation models. In contrast, our PLM, ELMER, adopts a NAR paradigm to generate text, which leads to much lower inference latency.
Non-autoregressive Text Generation. Recently, there has been a wide range of studies on NAR text generation (Gu et al., 2018; Ghazvininejad et al., 2019; Qi et al., 2021). Among them, Gu et al. (2018) were the first to propose the NAR paradigm to reduce inference latency. Ghazvininejad et al. (2019) iteratively mask and predict a fraction of the tokens that the model is least confident about. Several studies aim
to learn accurate input-output mapping. For exam-
ple, Saharia et al. (2020) and Libovický and Helcl
(2018) use connectionist temporal classification to
perform latent alignment in NAR models. Our work is closely related to BANG (Qi et al., 2021), a PLM bridging NAR and AR generation. We differ in that we use early exit to predict tokens at different layers, which helps NAR models learn forward and backward token dependencies.
Moreover, we propose a novel pre-training objec-
tive based on early exit, LPLM, for learning diverse
token dependencies by permuting the exit layer for
each token.
3 Preliminaries
Generally, the goal of text generation is to model the conditional probability $\Pr(\mathcal{Y} \mid \mathcal{X})$, where $\mathcal{X} = \langle x_1, \ldots, x_n \rangle$ and $\mathcal{Y} = \langle y_1, \ldots, y_m \rangle$ denote the input text and output text respectively, each consisting of a sequence of tokens from a vocabulary $\mathcal{V}$. There are three common generation paradigms to model the conditional probability $\Pr(\mathcal{Y} \mid \mathcal{X})$, i.e., autoregressive (AR), non-autoregressive (NAR), and semi-non-autoregressive (Semi-NAR) generation.
AR Generation. AR generation models predict the output text based on a left-to-right factorization as:
$$\Pr(\mathcal{Y} \mid \mathcal{X}) = \prod_{t=1}^{m} \Pr(y_t \mid y_{<t}, \mathcal{X}), \tag{1}$$
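To make the factorization in Eq. (1) concrete, the sketch below accumulates the AR log-likelihood token by token; the `model(x_ids, y_prefix)` interface returning next-token logits is a hypothetical assumption for illustration, not a specific model's API.

```python
import torch.nn.functional as F

def ar_log_likelihood(model, x_ids, y_ids):
    """Chain-rule factorization of Eq. (1): log Pr(Y|X) = sum_t log Pr(y_t | y_<t, X).
    `model(x_ids, y_prefix)` is assumed to return next-token logits (hypothetical)."""
    log_prob = 0.0
    for t in range(y_ids.size(0)):
        logits = model(x_ids, y_ids[:t])                           # condition on X and the prefix y_<t
        log_prob = log_prob + F.log_softmax(logits, dim=-1)[y_ids[t]]
    return log_prob
```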