other positions. To address this issue, we propose to generate tokens at different layers, so that token generation at upper layers can depend on tokens already generated at lower layers, from both left and right positions. In this way, our model can explicitly learn the dependencies between tokens across different layers while enjoying full parallelism in NAR decoding, as shown in
Figure 1. To this end, we propose to extend the
early exit technique (Li et al.,2021c) to NAR text
generation: if there is sufficient confidence to gen-
erate a token at a lower layer, the model is allowed
to exit at this layer and make the prediction without
passing through the upper layers.
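To make this idea concrete, the following is a minimal PyTorch-style sketch of confidence-based, token-level early exit. The names (`layers`, `lm_heads`, `threshold`) and the fallback to the top layer are illustrative assumptions rather than ELMER's actual implementation, and the sketch omits how exited hidden states are propagated to upper layers.

```python
import torch

def early_exit_decode(hidden, layers, lm_heads, threshold=0.9):
    """Token-level early exit (sketch): a position exits at the first layer whose
    prediction confidence exceeds `threshold`; positions that never reach the
    threshold take the top layer's prediction. Illustrative only."""
    batch, seq_len, _ = hidden.shape
    exited = torch.zeros(batch, seq_len, dtype=torch.bool, device=hidden.device)
    tokens = torch.zeros(batch, seq_len, dtype=torch.long, device=hidden.device)

    for layer, head in zip(layers, lm_heads):
        hidden = layer(hidden)                   # one bidirectional Transformer layer
        probs = head(hidden).softmax(dim=-1)     # per-position distribution over the vocabulary
        conf, pred = probs.max(dim=-1)           # confidence and argmax token per position
        exit_now = (conf > threshold) & ~exited  # confident positions that have not exited yet
        tokens[exit_now] = pred[exit_now]        # commit their predictions at this layer
        exited |= exit_now

    tokens[~exited] = pred[~exited]              # remaining positions exit at the last layer
    return tokens
```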
Furthermore, instead of exiting at a fixed layer
for a token, we aim to predict each token at differ-
ent layers for learning diverse token dependencies
in NAR text generation. Thus, inspired by XLNet
(Yang et al.,2019), we further propose a novel pre-
training objective based on early exit, i.e., Layer
Permutation Language Modeling (LPLM), to help
ELMER learn complex token dependencies. Given
a sequence, LPLM will permute the exit layer (from
1 to the maximum layer) for each token and maxi-
mize the NAR text generation probability w.r.t. all
possible exit layer permutations of the sequence.
Through LPLM, each token is able to exit at dif-
ferent layers and attend to all other tokens from
both forward and backward positions. In this way,
LPLM could effectively capture diverse token de-
pendencies from large-scale corpora. Pre-trained
with the general LPLM, ELMER can adapt to down-
stream text generation tasks and datasets by using
specific early exit strategies.
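A minimal sketch of the LPLM idea is given below, assuming per-layer language-modeling heads. It samples one exit layer per token as a stand-in for enumerating all exit-layer permutations, so the function names and the sampling strategy are illustrative rather than ELMER's exact pre-training objective.

```python
import torch
import torch.nn.functional as F

def lplm_loss(hidden, target_ids, layers, lm_heads):
    """Layer Permutation Language Modeling (sketch): assign every target token a
    randomly permuted exit layer and accumulate the NAR prediction loss at that
    layer; random sampling stands in for the expectation over all permutations."""
    batch, seq_len, _ = hidden.shape
    num_layers = len(layers)
    # Sample one exit layer per token position (uniform over the layers).
    exit_layers = torch.randint(0, num_layers, (batch, seq_len), device=hidden.device)

    loss = hidden.new_zeros(())
    for l, (layer, head) in enumerate(zip(layers, lm_heads)):
        hidden = layer(hidden)       # bidirectional layer: each token attends left and right
        logits = head(hidden)        # (batch, seq_len, vocab_size)
        mask = exit_layers == l      # tokens whose sampled exit layer is this one
        if mask.any():
            loss = loss + F.cross_entropy(logits[mask], target_ids[mask])
    return loss
```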
To the best of our knowledge, we are the first
to introduce the idea of early exit to NAR text
generation. We fine-tune ELMER on three popu-
lar text generation tasks. Experiments show that ELMER significantly outperforms the best NAR models, by +4.71 ROUGE-1 on XSUM, +1.79 METEOR on SQuAD v1.1, and +2.26 Distinct-2 on PersonaChat, and narrows the performance gap with autoregressive PLMs (e.g., ELMER 29.92 vs. BART 30.61 ROUGE-L on XSUM) while achieving over 10x faster inference.
2 Related Work
Pre-trained Language Models. Recent years have witnessed remarkable achievements of PLMs in text generation tasks (Li et al., 2021b). Most
PLMs adopt an AR paradigm to generate texts dur-
ing pre-training and fine-tuning. The work based
on GPT (Radford et al.,2019;Brown et al.,2020)
converts different tasks into language modeling by
sequentially predicting tokens. BART (Lewis et al.,
2020) employs an auto-regressive decoder to re-
cover the corrupted text in pre-training. T5 (Raffel
et al.,2020) masks word spans from input texts and
then sequentially predicts masked tokens. Tang et al. (2022) pre-train a text generation model on labeled datasets with multi-task learning. Li et al. (2022b) leverage prompts to effectively fine-tune text generation models. In contrast, our PLM, ELMER, adopts a NAR paradigm to generate text, which leads to much lower inference latency.
Non-autoregressive Text Generation. Recently, there has been a wide range of studies on NAR text generation (Gu et al., 2018; Ghazvininejad et al., 2019; Qi et al., 2021). Among them, Gu et al. (2018) were the first to propose the NAR paradigm to reduce inference latency. Ghazvininejad et al. (2019) iteratively mask and predict a fraction of the tokens that the model is least confident about. Several studies aim
to learn accurate input-output mapping. For exam-
ple, Saharia et al. (2020) and Libovický and Helcl
(2018) use connectionist temporal classification to
perform latent alignment in NAR models. Our work is closely related to BANG (Qi et al., 2021), a PLM bridging NAR and AR generation. We differ in that we use early exit to predict tokens at different layers, which helps NAR models learn forward and backward token dependencies.
Moreover, we propose a novel pre-training objec-
tive based on early exit, LPLM, for learning diverse
token dependencies by permuting the exit layer for
each token.
3 Preliminaries
Generally, the goal of text generation is to model the conditional probability $\Pr(\mathcal{Y} \mid \mathcal{X})$, where $\mathcal{X} = \langle x_1, \ldots, x_n \rangle$ and $\mathcal{Y} = \langle y_1, \ldots, y_m \rangle$ denote the input text and output text respectively, each consisting of a sequence of tokens from a vocabulary $\mathcal{V}$. There are three common generation paradigms to model the conditional probability $\Pr(\mathcal{Y} \mid \mathcal{X})$, i.e., autoregressive (AR), non-autoregressive (NAR), and semi-non-autoregressive (Semi-NAR) generation.
AR Generation. AR generation models predict the output text based on a left-to-right factorization as:
$$\Pr(\mathcal{Y} \mid \mathcal{X}) = \prod_{t=1}^{m} \Pr(y_t \mid y_{<t}, \mathcal{X}), \tag{1}$$
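To make the factorization in Eq. (1) concrete, the sketch below accumulates the AR log-likelihood token by token; the `model(x_ids, y_prefix)` interface returning next-token logits is a hypothetical assumption for illustration, not a specific model's API.

```python
import torch.nn.functional as F

def ar_log_likelihood(model, x_ids, y_ids):
    """Chain-rule factorization of Eq. (1): log Pr(Y|X) = sum_t log Pr(y_t | y_<t, X).
    `model(x_ids, y_prefix)` is assumed to return next-token logits (hypothetical)."""
    log_prob = 0.0
    for t in range(y_ids.size(0)):
        logits = model(x_ids, y_ids[:t])                           # condition on X and the prefix y_<t
        log_prob = log_prob + F.log_softmax(logits, dim=-1)[y_ids[t]]
    return log_prob
```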