ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and Effective Text Generation
Junyi Li1,3,4, Tianyi Tang1, Wayne Xin Zhao1,4, Jian-Yun Nie3 and Ji-Rong Wen1,2,4
1Gaoling School of Artificial Intelligence, Renmin University of China
2School of Information, Renmin University of China
3DIRO, Université de Montréal
4Beijing Key Laboratory of Big Data Management and Analysis Methods
{lijunyi,steven_tang}@ruc.edu.cn batmanfly@gmail.com
Abstract
We study the text generation task under the approach of pre-trained language models (PLMs). Typically, an auto-regressive (AR) method is adopted for generating texts in a token-by-token manner. Despite many advantages of AR generation, it usually suffers from inefficient inference. Therefore, non-autoregressive (NAR) models are proposed to generate all target tokens simultaneously. However, NAR models usually generate texts of lower quality due to the absence of token dependency in the output text. In this paper, we propose ELMER: an Efficient and effective PLM for NAR tExt geneRation, to explicitly model the token dependency during NAR generation. By leveraging the early exit technique, ELMER enables token generation at different layers according to their prediction confidence (a more confident token will exit at a lower layer). Besides, we propose a novel pre-training objective, Layer Permutation Language Modeling, to pre-train ELMER by permuting the exit layer for each token in a sequence. Experiments on three text generation tasks show that ELMER significantly outperforms NAR models and further narrows the performance gap with AR PLMs (e.g., ELMER (29.92) vs. BART (30.61) ROUGE-L on XSUM) while achieving over 10 times inference speedup.
1 Introduction
Since the advent of GPT-2 (Radford et al., 2019), pre-trained language models (PLMs) have achieved state-of-the-art performance across text generation tasks, which aim to generate human-like texts on demand (Brown et al., 2020; Li et al., 2022c). These PLMs usually adopt an auto-regressive (AR) fashion to generate texts token by token: the next token is predicted based on all previously generated tokens. A major limitation of this approach is that the inference process is hard to parallelize, leading to relatively high inference latency (Gu et al., 2018). This limitation prevents AR models from wide deployment in online real-time applications, such as query rewriting in search engines and online chatbots. Moreover, AR models are prone to the exposure bias problem, since there is a gap between AR training and inference (Zeng and Nie, 2021). These concerns have sparked extensive interest in non-autoregressive (NAR) models for text generation (Gu et al., 2018).
Compared to AR models, NAR models predict target tokens in all positions simultaneously and independently (Gu et al., 2018). This full parallelism leads to an efficient and low-latency inference process. However, the independence assumption prevents NAR models from learning the dependency among target tokens, resulting in accuracy degradation (Zhan et al., 2022). One widely used solution to improve NAR generation quality is to iteratively refine outputs (Gu et al., 2019; Ghazvininejad et al., 2019), which, however, comes at the cost of the speed-up advantage. In addition, many studies aim to learn the input-output mapping for more accurate generation via embedding mapping (Guo et al., 2019), latent alignment (Libovický and Helcl, 2018), and discrete variables (Ma et al., 2019). While easing the difficulty of NAR generation to some extent, these methods still struggle to generate complex sentences. Therefore, inspired by Zhan et al. (2022), we argue that the key to NAR text generation is to enhance the learning of token dependency: each token should be generated depending on both forward and backward generated tokens.
In this paper, we propose ELMER: an Efficient and effective PLM for NAR tExt geneRation, to explicitly learn the bi-directional token dependency. Typically, most NAR models predict tokens simultaneously only at the last layer, which makes each token prediction unaware of the tokens generated in other positions. To address this issue, we propose to generate tokens at different layers, so that upper-layer token generation can depend on lower-layer generated tokens from both left and right. In this way, our model can explicitly learn the dependency between tokens from different layers while enjoying full parallelism in NAR decoding, as shown in Figure 1. To this end, we propose to extend the early exit technique (Li et al., 2021c) to NAR text generation: if there is sufficient confidence to generate a token at a lower layer, the model is allowed to exit at this layer and make the prediction without passing through the upper layers.
Furthermore, instead of exiting at a fixed layer for each token, we aim to predict each token at different layers to learn diverse token dependencies in NAR text generation. Thus, inspired by XLNet (Yang et al., 2019), we further propose a novel pre-training objective based on early exit, i.e., Layer Permutation Language Modeling (LPLM), to help ELMER learn complex token dependencies. Given a sequence, LPLM permutes the exit layer (from 1 to the maximum layer) for each token and maximizes the NAR text generation probability w.r.t. all possible exit layer permutations of the sequence. Through LPLM, each token is able to exit at different layers and attend to all other tokens in both forward and backward positions. In this way, LPLM can effectively capture diverse token dependencies from large-scale corpora. Pre-trained with the general LPLM objective, ELMER can adapt to downstream text generation tasks and datasets by using specific early exit strategies.
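To make this idea concrete, below is a minimal Python sketch, not taken from the paper, of how an exit-layer assignment could be sampled for a target sequence during LPLM-style pre-training; the helper name sample_exit_layers and the uniform sampling over layers are illustrative assumptions.

```python
import random

def sample_exit_layers(seq_len: int, num_layers: int, rng: random.Random) -> list:
    """Hypothetical helper: assign each target token an exit layer in [1, L].

    Under LPLM every token may exit at any decoder layer; averaging the NAR
    loss over sampled assignments approximates optimizing over all exit-layer
    permutations. Uniform sampling here is an assumption for illustration.
    """
    return [rng.randint(1, num_layers) for _ in range(seq_len)]

rng = random.Random(0)
print(sample_exit_layers(seq_len=5, num_layers=4, rng=rng))
# e.g. [4, 2, 1, 3, 3]: token 1 exits at layer 4, token 2 at layer 2, and so on.
```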
To the best of our knowledge, we are the first to introduce the idea of early exit to NAR text generation. We fine-tune ELMER on three popular text generation tasks. Experiments show that ELMER significantly improves over the best NAR models by +4.71 ROUGE-1 on XSUM, +1.79 METEOR on SQuAD v1.1, and +2.26 Distinct-2 on PersonaChat, and narrows the performance gap with auto-regressive PLMs (e.g., ELMER (29.92) vs. BART (30.61) ROUGE-L on XSUM) while achieving over 10x faster inference.
2 Related Work
Pre-trained Language Models.
Recent years have witnessed remarkable achievements of PLMs in text generation tasks (Li et al., 2021b). Most PLMs adopt an AR paradigm to generate texts during pre-training and fine-tuning. The work based on GPT (Radford et al., 2019; Brown et al., 2020) converts different tasks into language modeling by sequentially predicting tokens. BART (Lewis et al., 2020) employs an auto-regressive decoder to recover the corrupted text in pre-training. T5 (Raffel et al., 2020) masks word spans from input texts and then sequentially predicts the masked tokens. Tang et al. (2022) pre-train a text generation model on labeled datasets with multi-task learning. Li et al. (2022b) leverage prompts to effectively fine-tune text generation models. In contrast, our PLM, ELMER, adopts a NAR schema to generate texts, which leads to very low inference latency.
Non-autoregressive Text Generation.
Recently, there has been a wide range of studies on NAR text generation (Gu et al., 2018; Ghazvininejad et al., 2019; Qi et al., 2021). Among them, Gu et al. (2018) were the first to propose the NAR paradigm to reduce inference latency. Ghazvininejad et al. (2019) iteratively mask and predict a fraction of tokens that the model is least confident about. Several groups aim to learn an accurate input-output mapping. For example, Saharia et al. (2020) and Libovický and Helcl (2018) use connectionist temporal classification to perform latent alignment in NAR models. Our work is closely related to BANG (Qi et al., 2021), a PLM bridging NAR and AR generation. We differ in that we use early exit to predict tokens at different layers, which helps NAR models learn the forward and backward token dependency. Moreover, we propose a novel pre-training objective based on early exit, LPLM, which learns diverse token dependencies by permuting the exit layer for each token.
3 Preliminaries
Generally, the goal of text generation is to model the conditional probability $\Pr(\mathcal{Y} \mid \mathcal{X})$, where $\mathcal{X} = \langle x_1, \ldots, x_n \rangle$ and $\mathcal{Y} = \langle y_1, \ldots, y_m \rangle$ denote the input text and output text respectively, and each consists of a sequence of tokens from a vocabulary $\mathcal{V}$. There are three common generation paradigms to model the conditional probability $\Pr(\mathcal{Y} \mid \mathcal{X})$, i.e., autoregressive (AR), non-autoregressive (NAR), and semi-nonautoregressive (Semi-NAR) generation.
AR Generation.
AR generation models predict the output text based on a left-to-right factorization:
$$\Pr(\mathcal{Y} \mid \mathcal{X}) = \prod_{t=1}^{m} \Pr(y_t \mid y_{<t}, \mathcal{X}), \qquad (1)$$
where each token $y_t$ is generated based on the input text $\mathcal{X}$ and the previous tokens $y_{<t}$. Note that AR generation only models the forward token dependency. The token-by-token fashion makes the AR generation process hard to parallelize. Most existing text generation PLMs adopt the AR approach (Radford et al., 2019; Lewis et al., 2020; Raffel et al., 2020).
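As an illustration of Eq. (1), the following minimal Python sketch performs greedy left-to-right decoding; next_token_scores is a hypothetical model interface, not part of the paper, used only to show why AR decoding requires one sequential model call per output token.

```python
def greedy_ar_decode(next_token_scores, x_tokens, max_len, eos="</s>"):
    """Greedy decoding under the AR factorization of Eq. (1).

    `next_token_scores(x_tokens, y_prefix)` is a hypothetical callable
    returning a {token: score} dict for the next position. Producing m
    tokens costs m sequential calls, which is the latency bottleneck
    discussed above.
    """
    y = []
    for _ in range(max_len):
        scores = next_token_scores(x_tokens, y)  # conditions on y_<t and X
        y_t = max(scores, key=scores.get)        # pick the most likely next token
        if y_t == eos:
            break
        y.append(y_t)
    return y
```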
NAR Generation.
In contrast to AR models, NAR text generation models predict every token of the output text simultaneously, without modeling the forward or backward token dependency:
$$\Pr(\mathcal{Y} \mid \mathcal{X}) = \prod_{t=1}^{m} \Pr(y_t \mid \mathcal{X}), \qquad (2)$$
where each token $y_t$ is predicted based only on the input text $\mathcal{X}$. The independence assumption makes the NAR generation process parallelizable, thus significantly accelerating inference (Gu et al., 2018). However, in the absence of token dependency, the generation quality of NAR models is lower than that of their AR counterparts (Wang et al., 2019).
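For contrast with Eq. (2), here is a matching minimal sketch; all_position_scores is again a hypothetical model interface that returns scores for every output position in a single forward pass.

```python
def nar_decode(all_position_scores, x_tokens, m):
    """Parallel decoding under the NAR factorization of Eq. (2).

    `all_position_scores(x_tokens, m)` is a hypothetical callable returning,
    in one parallel call, a list of m {token: score} dicts. Each position is
    argmax-ed independently, which is exactly the independence assumption
    that removes token dependency.
    """
    per_position = all_position_scores(x_tokens, m)
    return [max(scores, key=scores.get) for scores in per_position]
```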
Semi-NAR Generation.
Semi-NAR generation lies between AR and NAR generation and is formalized as:
$$\Pr(\mathcal{Y} \mid \mathcal{X}) = \prod_{t=1}^{m} \Pr(y_t \mid \widehat{\mathcal{Y}}_t, \mathcal{X}), \qquad (3)$$
where each token $y_t$ is conditioned on the input text $\mathcal{X}$ and a visible part $\widehat{\mathcal{Y}}_t$ of the output text $\mathcal{Y}$. The visible part $\widehat{\mathcal{Y}}_t$ is designed differently across methods to balance inference latency and accuracy (Stern et al., 2019; Lee et al., 2018). Note that the lower-layer generated tokens in our model play a role similar to the visible part $\widehat{\mathcal{Y}}_t$. However, our model retains the advantage of full parallelism, in contrast to iterative Semi-NAR methods.
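As one concrete (and hypothetical) instantiation of Eq. (3), in the mask-predict style of Ghazvininejad et al. (2019), the sketch below iteratively keeps the most confident predictions visible and re-predicts the rest; the fill_masks interface and the linear re-masking schedule are assumptions for illustration.

```python
def mask_predict_decode(fill_masks, x_tokens, m, num_iters=4, mask="[MASK]"):
    """A mask-predict style Semi-NAR loop, one possible instance of Eq. (3).

    `fill_masks(x_tokens, y_tokens)` is a hypothetical callable that returns
    (tokens, confidences) for all m positions given a partially visible
    output. Extra iterations trade inference speed for output quality.
    """
    y = [mask] * m
    for it in range(num_iters):
        tokens, conf = fill_masks(x_tokens, y)
        # Number of least-confident positions to re-mask, shrinking to 0.
        k = int(m * (num_iters - it - 1) / num_iters)
        remask = set(sorted(range(m), key=lambda i: conf[i])[:k])
        y = [mask if i in remask else tokens[i] for i in range(m)]
    return y
```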
In this paper, we mainly focus on the NAR approach, considering both the effectiveness and efficiency of text generation models.
4 Approach
Our proposed NAR text generation PLM, ELMER, is depicted in Figure 1. ELMER aims to enhance the modeling of token dependency for NAR models. With early exit, tokens exiting at different layers can build the bi-directional token dependency with each other. Moreover, we design Layer Permutation Language Modeling (LPLM) to pre-train ELMER by permuting the exit layer for each token. Next, we will describe each part in detail.
Figure 1: Overview of our proposed model ELMER. (The figure depicts an L-layer NAR decoder fed with "[MASK]" tokens; tokens exit early at different layers, the hidden states of exited tokens are copied to upper layers, and Layer Permutation Language Modeling samples exit-layer permutations such as 2, 3, 1, 4, 3 and 4, 3, 2, 1, 2.)
4.1 Early Exit for Dependency Modeling
Most NAR models simultaneously predict tokens only at the last layer (Jiang et al., 2021; Zhan et al., 2022), which makes the current token generation unaware of the tokens generated in other positions. Thus, to model the bi-directional token dependency (both forward and backward), we propose to predict tokens at different layers by leveraging the early exit technique (Li et al., 2021c). In this way, the upper-layer token generation can depend on tokens generated at lower layers from both left and right.
NAR Transformer.
ELMER is built on the Transformer encoder-decoder architecture (Vaswani et al., 2017). Both the encoder and the decoder consist of $L$ stacked layers, where each layer contains several sub-layers (e.g., multi-head self-attention and a feed-forward network). Unlike the original Transformer decoder, which generates text auto-regressively, our model generates tokens simultaneously in a NAR fashion. Given a pair of input-output texts $\langle \mathcal{X}, \mathcal{Y} \rangle$, $\mathcal{X}$ is fed into the encoder and processed into hidden states $\mathbf{S} = \langle \mathbf{s}_1, \ldots, \mathbf{s}_n \rangle$. We then feed a sequence of "[MASK]" tokens into the NAR decoder to generate every token of the output text $\mathcal{Y}$ in parallel.

Specifically, we first replace the original masked multi-head attention in the decoder with bi-directional multi-head attention, akin to the encoder. For the "[MASK]" token at the $t$-th position, the $L$ decoder layers process it into hidden states $\{\mathbf{h}^l_t\}_{1 \leq l \leq L}$ as:
$$\mathbf{h}^l_t = \mathrm{Layer}_l(\mathbf{h}^{l-1}_{1 \leq t \leq T}, \mathbf{S}), \qquad (4)$$
$$\mathbf{h}^0_t = \mathrm{Embed}(\texttt{[MASK]}), \qquad (5)$$
where $\mathrm{Layer}_l(\cdot)$ denotes the $l$-th layer, $\mathrm{Embed}(\cdot)$ is the sum of the word embedding and the position embedding, and $T$ is the maximum length of the decoder.
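To tie Eqs. (4)-(5) to the early-exit mechanism of Section 4.1, the following PyTorch-style sketch runs the NAR decoder layer by layer, copies the hidden states of already-exited positions upward, and freezes a position's prediction once its confidence passes a threshold. The single shared output head, the fixed confidence threshold, and the use of the standard nn.TransformerDecoderLayer are our illustrative assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class EarlyExitNARDecoder(nn.Module):
    """Sketch of an L-layer NAR decoder with confidence-based early exit."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6, threshold=0.9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.lm_head = nn.Linear(d_model, vocab_size)  # assumed shared across layers
        self.threshold = threshold

    def forward(self, mask_ids, enc_states):
        # Eq. (5): every target position starts from the [MASK] embedding.
        h = self.embed(mask_ids)
        exited = torch.zeros(mask_ids.shape, dtype=torch.bool, device=mask_ids.device)
        tokens = torch.zeros_like(mask_ids)
        for layer in self.layers:
            # Eq. (4): bi-directional self-attention over all positions (no causal
            # mask) plus cross-attention to the encoder states S.
            new_h = layer(h, enc_states)
            # Copy hidden states of already-exited positions instead of updating them.
            h = torch.where(exited.unsqueeze(-1), h, new_h)
            probs = self.lm_head(h).softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            # Early exit: freeze the prediction of sufficiently confident positions.
            newly = (conf >= self.threshold) & ~exited
            tokens[newly] = pred[newly]
            exited |= newly
        tokens[~exited] = pred[~exited]  # remaining positions exit at the last layer
        return tokens
```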