Effective Pre-Training Objectives for Transformer-based Autoencoders
Luca Di Liello1, Matteo Gabburo1, Alessandro Moschitti2
1University of Trento, 2Amazon Alexa AI
{luca.diliello,matteo.gabburo}@unitn.it
{amosch}@amazon.com
Abstract
In this paper, we study trade-offs between efficiency, cost, and accuracy when pre-training Transformer encoders with different pre-training objectives. For this purpose, we analyze features of common objectives and combine them to create new effective pre-training approaches. Specifically, we design light token generators based on a straightforward statistical approach, which can replace ELECTRA's computationally heavy generators, thus greatly reducing cost. Our experiments also show that (i) there are more efficient alternatives to BERT's MLM, and (ii) it is possible to efficiently pre-train Transformer-based models using lighter generators without a significant drop in performance.
1 Introduction
Transformer-based models (Vaswani et al., 2017) require expensive hardware to be pre-trained (Strubell et al., 2019; Brown et al., 2020). Recently, many works have focused on reducing pre-training cost (Lan et al., 2020; Sanh et al., 2020; Turc et al., 2019). ELECTRA, for example, proposes to train BERT as a discriminator rather than a generator (Clark et al., 2020). It replaces the Masked Language Modeling (MLM) objective (Devlin et al., 2019) with Token Detection (TD): the discriminator detects whether input tokens are original or fakes created by a small generator network.

On the one hand, the TD objective is much more efficient than MLM. On the other hand, the use of a generator requires the pre-training of a second transformer, increasing the pre-training cost. ELECTRA has been shown to be more accurate than BERT. However, it is not clear whether this superior performance is due to its innovative architecture or to the long and extensive training, which greatly increases the computation cost of obtaining the final language model.
* Equal contribution.
In this paper, we study pre-training strategies with respect to the trade-off between efficiency, cost, and accuracy. Theoretical efficiency and computational cost do not always align well, because the latter is influenced by the underlying infrastructure and by hardware acceleration technologies (e.g., NVIDIA Tensor Cores). For this purpose, we analyze the most important components of pre-training, i.e., pre-training objectives and the algorithms with which they are applied. For example, we note that MLM needs a large classification head that spans the whole vocabulary (which usually contains several tens of thousands of tokens), while TD requires a smaller head, which is much more efficient and uses fewer computational resources.
We summarize our contributions as follows. First, we propose Random Token Substitution (RTS) and Cluster-based Random Token Substitution (C-RTS), two fast alternatives to ELECTRA's generator, which allow us to set a middle ground in the trade-off between efficiency and accuracy. Indeed, RTS consists in simply detecting tokens that have been randomly changed into others, and thus has very low cost, while C-RTS, which is slightly more expensive than RTS, exploits knowledge of predictions in previous iterations to select more challenging replacements. Both our objectives increase efficiency (by 20%-45%) thanks to a much smaller binary classification head on top, and are as accurate as MLM on most tasks from a statistical-significance viewpoint. We also demonstrate that, if trained for a longer time, C-RTS outperforms RTS on many benchmarks because it is a more challenging pre-training task.
Second, we propose Swapped Language Modeling (SLM), a variant of BERT's MLM that only replaces tokens with others, thus removing the special MASK token, which is responsible for BERT's pre-training/fine-tuning discrepancy (Clark et al., 2020). We show that this objective increases cost with respect to RTS and C-RTS, but outperforms MLM in almost every task using precisely the same computational budget.
Finally, we empirically study the trade-offs mentioned above by pre-training standard BERT models with the proposed objectives, also comparing them with state-of-the-art architectures trained and tested on the same data. We perform an accurate comparison by evaluating our models on several natural language understanding benchmarks: all tasks in the GLUE benchmark suite, ASNQ, WikiQA, and TREC-QA, reporting accuracy as well as efficiency and cost. To better assess the latter two, we also test the impact of the objectives on smaller architectures (e.g., BERT-small), showing that our approaches have a broader impact on those classes of models.
2 Related Work
Many different objectives for self-supervised learning have been proposed in recent works, such as Causal Language Modeling (CLM) (Radford, 2018; Radford et al., 2019; Brown et al., 2020), Masked Language Modeling (MLM) (Devlin et al., 2019; Liu et al., 2019b), and Token Detection (TD) (Clark et al., 2020), the latter used by ELECTRA, which is composed of a generator and a discriminator. While the generator is trained with MLM to find suitable candidates to replace the special MASK tokens, the discriminator must recognize which tokens in the resulting text have been replaced. After pre-training, the generator is discarded, and the discriminator is used as the pre-trained language model.
ELECTRA introduces several innovations: (i) the exploitation of the whole output of the discriminator to compute the loss function, thus providing a stronger signal for back-propagation; (ii) the use of a generator network to find suitable replacements; and (iii) the fact that the discriminator never sees spurious tokens such as the MASK token. The latter addresses a main drawback of the original BERT: the MASK token creates an input discrepancy between pre-training and fine-tuning, since the CLS representation depends on all input tokens through the self-attention mechanism.
Other research directions to reduce training time address the architecture instead of the learning objective. In ALBERT (Lan et al., 2020), the authors tie the weights of every Transformer layer to save GPU memory, thus enabling bigger batch sizes. However, since the expressive power of their models is reduced when layers are tied, they must train for much longer. Sanh et al. (2020) and Turc et al. (2019) instead use distillation to reduce the model size, but the pre-training is still expensive because it requires a large teacher architecture.
Although pre-training is performed only once, it usually requires weeks and costly machines (Liu et al., 2019b; Brown et al., 2020), so it is important to find alternative ways to pre-train transformers. Tay et al. (2020) provide an overview of many recent advancements in transformer efficiency.
Another successful improvement of MLM on the objective side is SpanBERT (Joshi et al., 2020), which proposes two new objectives: Span-Masking and the Span-Boundary Objective (SBO). Specifically, Span-Masking is a refined version of MLM that masks contiguous spans of text instead of single tokens, while with SBO the model predicts the span content by considering only the output representations corresponding to the tokens on the span boundaries. Furthermore, Zhang et al. (2020) propose a technique that improves downstream performance by adapting the model to the final task while pre-training. Similarly, Di Liello et al. (2022) perform continued pre-training with custom objectives to better adapt the model to Answer Sentence Selection (AS2).
In T5 (Raffel et al., 2020), the authors propose to use deshuffling (Liu et al., 2019a) to pre-train an autoregressive model. They shuffle random spans of text and ask the model to output the tokens in the original order. This technique provides good results on an extensive collection of benchmarks. However, we cannot compare with them because we focus on autoencoder architectures only.
Finally, we mention the work by Izsak et al. (2021), in which the authors list many optimizations that can be applied to transformers for faster pre-training. They also claim that using larger models with the same runtime leads to better results. We focus instead on the efficiency of the pre-training objective and on the classification head size. These techniques are thus orthogonal to our work and could be applied along with our alternative pre-training objectives.
3 Background on Pre-training Objectives
Before describing our models, we provide a detailed description of the most common token-level pre-training objectives used in the literature.
Masked Language Model (MLM)
was proposed by Devlin et al. (2019). In MLM, 15% of the input tokens are replaced with a special mask token, and the model has to predict the original value. An improvement of MLM is whole-word masking, in which masking is applied to every token belonging to a word and not just to independent sub-word tokens.

[Figure 1 diagram: the MLM, RTS, and SLM pipelines shown side by side. Each takes the input sequence "<s> The fox is sat on the table </s>", corrupts it (to "<s> The <mask> is sat on the table </s>" for MLM and "<s> The bottle is sat on the table </s>" for RTS and SLM), and feeds it through embeddings and Transformer layers to a classification head. The MLM and SLM heads output over the whole vocabulary, while the RTS head always has only 2 outputs.]
Figure 1: MLM, RTS and SLM architectures (from left to right). Notice that the classification head used by RTS is several times smaller than those used by MLM and SLM (see Appendix G).
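To make the corruption step concrete, the following is a minimal sketch of MLM input masking in PyTorch, using the 15% rate described above. The function name, the special_tokens_mask argument, and the use of -100 as the ignored-label value are illustrative assumptions, not a reference implementation.

```python
import torch

def mlm_corrupt(input_ids: torch.Tensor, mask_token_id: int,
                special_tokens_mask: torch.Tensor, mlm_prob: float = 0.15):
    """Replace ~15% of the (non-special) tokens with MASK and build MLM labels.

    Positions that are not masked get label -100 so that a standard
    cross-entropy loss ignores them; the vocabulary-sized head therefore
    only receives a training signal at the masked positions.
    """
    labels = input_ids.clone()
    # Sample the positions to mask, excluding special tokens (CLS, SEP, PAD).
    probs = torch.full(input_ids.shape, mlm_prob)
    probs.masked_fill_(special_tokens_mask.bool(), 0.0)
    masked = torch.bernoulli(probs).bool()

    labels[~masked] = -100             # loss is computed on masked positions only
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id  # the model sees the special MASK token here
    # (BERT additionally keeps or randomizes a fraction of the selected tokens;
    #  that 80/10/10 split is omitted here for brevity.)
    return corrupted, labels
```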
The model needs a large classification head for pre-training, as shown in Figure 1. Its dimension is proportional to the vocabulary size, and (especially for small models) this constitutes a significant fraction of the entire architecture's parameters. In the base architectures, the MLM head accounts for about 20% of the model parameters, while for small models the fraction increases to 30% (very often, the language-modeling head parameters are shared with the embedding layer). The memory footprint of the LM head during training is about 47% for base and 64% for small models, since gradients have to be computed for every token in the vocabulary because of the final softmax layer. See Appendix G for more details. For this reason, a binary classification head (as in TD) can provide significant efficiency improvements. Moreover, sharing the parameters of the MLM classification head with the embeddings does not reduce the computational cost; it only leads to marginally lower memory requirements. In the embedding layer, only the few row vectors corresponding to the actual sentence tokens are updated at every step, whereas MLM's softmax continuously computes gradients for the whole output linear transformation.
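As a rough sanity check of the percentages above, here is a back-of-the-envelope estimate with assumed numbers (BERT's 30,522-token WordPiece vocabulary, hidden size 768, roughly 110M total parameters for a base model; the head's bias and transform layer are ignored):

```python
# Illustrative estimate of the MLM head size for a base-sized encoder.
vocab_size, hidden_size = 30_522, 768       # assumed BERT-base values
total_params = 110_000_000                  # approximate BERT-base parameter count

mlm_head_params = vocab_size * hidden_size  # output projection onto the vocabulary
print(f"{mlm_head_params / 1e6:.1f}M parameters, "
      f"{mlm_head_params / total_params:.0%} of the whole model")
# -> 23.4M parameters, 21% of the whole model
```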
Causal Language Model (CLM)
is used to train autoregressive models by predicting the next token in a sequence (Radford et al., 2019; Radford, 2018). Similarly to MLM, it requires a large classification head to output predictions over the whole vocabulary.
Permutation Language Modeling (PLM)
was proposed by Yang et al. (2020) to combine the generative power of autoregressive models with the bidirectional context of autoencoders. This is accomplished by permuting the input tokens and letting the model use only the left context for the next-token prediction. In this way, the model keeps the strengths of autoregressive models while exploiting the whole input sequence for better-contextualized output embeddings.
Token Detection (TD)
was introduced by ELECTRA (Clark et al., 2020), which is an architecture composed of a discriminator and a smaller generator network. First, the generator is trained with MLM and finds suitable replacements for the masked tokens, as in BERT. Then, those candidates are inserted into the original sentence, and the resulting sequence is fed to the discriminator, which classifies whether each token is original or replaced. TD has the advantage of computing the loss over the whole discriminator output while having a minimal memory footprint. However, the whole system is inefficient because of the presence of the generator, which is still MLM-based.
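The sketch below shows how a TD training batch can be assembled, assuming a small generator with a Hugging Face-style masked-LM interface (returning .logits); the function name, sampling strategy, and tensor layout are illustrative choices rather than ELECTRA's exact implementation.

```python
import torch

@torch.no_grad()  # the generator is trained with its own MLM loss, not shown here
def build_td_batch(input_ids, masked_ids, masked_positions, generator):
    """Fill the masked positions with tokens sampled from the generator and
    derive binary labels for the discriminator (1 = token was replaced)."""
    logits = generator(input_ids=masked_ids).logits  # (batch, seq_len, vocab)
    samples = torch.distributions.Categorical(logits=logits).sample()

    corrupted = torch.where(masked_positions, samples, input_ids)
    # If the generator happens to sample the original token, the label stays 0.
    td_labels = (corrupted != input_ids).long()
    return corrupted, td_labels
```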
4 Effective Pre-training Objectives
This section presents our alternative pre-training
objectives, which can potentially be applied to a
wide range of Transformer-based models.
Random Token Substitution (RTS)
Like ELECTRA, RTS trains a model that discriminates between original and substituted tokens. The main difference is that RTS replaces 15% of the tokens with random alternatives, thus avoiding using computational resources to train a separate and expensive generator network. Besides, unlike MLM, this approach relies on a smaller classification head whose size is not proportional to the vocabulary size (see Figure 1).
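A minimal sketch of the RTS corruption step in PyTorch follows, under the simplifying assumptions that replacements are drawn uniformly from the vocabulary and that special tokens are excluded via a special_tokens_mask; names and details are purely illustrative.

```python
import torch

def rts_corrupt(input_ids, vocab_size, special_tokens_mask, rts_prob=0.15):
    """Replace ~15% of the (non-special) tokens with random vocabulary entries
    and return binary labels for the token-level discriminator."""
    probs = torch.full(input_ids.shape, rts_prob)
    probs.masked_fill_(special_tokens_mask.bool(), 0.0)
    selected = torch.bernoulli(probs).bool()

    random_tokens = torch.randint(0, vocab_size, input_ids.shape)
    corrupted = torch.where(selected, random_tokens, input_ids)
    # Label 1 only where the token actually changed (a random draw may hit the original).
    labels = (corrupted != input_ids).long()
    return corrupted, labels
```

No generator is involved, and the binary head makes the output layer independent of the vocabulary size.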