Effective Pre-Training Objectives for Transformer-based Autoencoders
Luca Di Liello1, Matteo Gabburo1, Alessandro Moschitti2
1University of Trento, 2Amazon Alexa AI
{luca.diliello,matteo.gabburo}@unitn.it
{amosch}@amazon.com
Abstract
In this paper, we study trade-offs between efficiency, cost, and accuracy when pre-training Transformer encoders with different pre-training objectives. For this purpose, we analyze features of common objectives and combine them to create new effective pre-training approaches. Specifically, we design light token generators based on a straightforward statistical approach, which can replace ELECTRA's computationally heavy generators, thus greatly reducing cost. Our experiments also show that (i) there are more efficient alternatives to BERT's MLM, and (ii) it is possible to efficiently pre-train Transformer-based models using lighter generators without a significant drop in performance.
1 Introduction
Transformer-based models (Vaswani et al., 2017) require expensive hardware to be pre-trained (Strubell et al., 2019; Brown et al., 2020). Recently, many works have focused on reducing pre-training cost (Lan et al., 2020; Sanh et al., 2020; Turc et al., 2019). ELECTRA, for example, proposes to train BERT as a discriminator rather than a generator (Clark et al., 2020). It replaces the Masked Language Modeling (MLM) objective (Devlin et al., 2019) with Token Detection (TD): the discriminator detects whether input tokens are original or fakes created by a small generator network.

On the one hand, the TD objective is much more efficient than MLM. On the other hand, the use of a generator requires the pre-training of a second transformer, increasing the pre-training cost. ELECTRA has been shown to be more accurate than BERT. However, it is not clear whether this superior performance is due to its innovative architecture or to the long and extensive training, which greatly increases the computation cost of obtaining the final language model.
* Equal contribution.
In this paper, we study pre-training strategies with respect to the trade-off between efficiency, cost, and accuracy. Theoretical efficiency and computational cost do not always align well, because the latter is influenced by the underlying infrastructure and by hardware acceleration technologies (e.g., NVIDIA Tensor Cores). For this purpose, we analyze the most important components of pre-training, i.e., pre-training objectives and the algorithms with which they are applied. For example, we note that MLM needs a large classification head that spans the whole vocabulary (which usually contains several tens of thousands of tokens), while TD requires a smaller head, which is much more efficient and uses fewer computational resources.
We summarize our contributions as follows. First, we propose Random Token Substitution (RTS) and Cluster-based Random Token Substitution (C-RTS), two fast alternatives to ELECTRA's generator, which allow us to set a middle ground in the trade-off between efficiency and accuracy. Indeed, RTS consists in simply detecting tokens that have been randomly changed into others, and thus has very low cost, while C-RTS, which is slightly more expensive than RTS, exploits knowledge of predictions in previous iterations to select more challenging replacements. Both our objectives increase efficiency (by 20%-45%) thanks to a much smaller binary classification head on top, and are as accurate as MLM on most tasks from a statistical-significance viewpoint. We also demonstrate that, if trained for a longer time, C-RTS outperforms RTS on many benchmarks because it is a more challenging pre-training task.
Second, we propose Swapped Language Modeling (SLM), a variant of BERT's MLM that only replaces tokens with others, thus removing the special MASK token, which is responsible for BERT's pre-training/fine-tuning discrepancy (Clark et al., 2020). We show that this objective increases cost with respect to RTS and C-RTS, but outperforms MLM in almost every task using precisely the same computational budget.
Finally, we empirically study the trade-offs mentioned above by pre-training standard BERT models with the proposed objectives, also comparing them with state-of-the-art architectures trained and tested on the same data. We perform an accurate comparison by evaluating our models on several natural language understanding benchmarks: all tasks in the GLUE benchmark suite, ASNQ, WikiQA, and TREC-QA, reporting accuracy as well as efficiency and cost. To better assess the latter two, we also test the impact of the objectives on smaller architectures (e.g., BERT-small), showing that our approaches have a broader impact on those classes of models.
2 Related Work
Many different objectives for self-supervised learning have been proposed in recent works, such as Causal Language Modeling (CLM) (Radford, 2018; Radford et al., 2019; Brown et al., 2020), Masked Language Modeling (MLM) (Devlin et al., 2019; Liu et al., 2019b), and Token Detection (TD) (Clark et al., 2020), the latter used by ELECTRA, which is composed of a generator and a discriminator. While the generator is trained with MLM to find suitable candidates to replace the special MASK tokens, the discriminator must recognize which tokens in the resulting text have been replaced. After pre-training, the generator is discarded, and the discriminator is used as the pre-trained language model.
ELECTRA introduces several innovations: (i) the exploitation of the whole output of the discriminator to compute the loss function, thus providing a stronger signal for back-propagation; (ii) the use of a generator network to find suitable replacements; and (iii) the fact that the discriminator never sees spurious tokens such as the MASK token. The latter addresses a main drawback of the original BERT: the MASK token creates an input discrepancy between pre-training and fine-tuning, since the CLS representation depends on all input tokens through the self-attention mechanism.
Other research directions to reduce training time address the architecture instead of the learning objective. In ALBERT (Lan et al., 2020), the authors tie the weights of every Transformer layer to save GPU memory, thus enabling bigger batch sizes. However, since the expressive power of their models is reduced when layers are tied, they must train for much longer. Sanh et al. (2020) and Turc et al. (2019) instead use distillation to reduce the model size, but the pre-training is still expensive because it requires a large teacher architecture.
Although pre-training is performed only once, it usually requires weeks and costly machines (Liu et al., 2019b; Brown et al., 2020), so it is important to find alternative ways to pre-train transformers. Tay et al. (2020) provide an overview of many recent advancements in transformer efficiency.
Another successful improvement of MLM on the objective side is SpanBERT (Joshi et al., 2020), which proposes two new objectives: Span-Masking and the Span-Boundary Objective (SBO). Specifically, Span-Masking is a refined version of MLM that masks contiguous spans of text instead of single tokens, while with SBO the model predicts the span content by considering only the output representations corresponding to the tokens on the span boundaries. Furthermore, Zhang et al. (2020) propose a technique that improves downstream performance by adapting the model to the final task while pre-training. Similarly, Di Liello et al. (2022) perform continued pre-training with custom objectives to better adapt the model to Answer Sentence Selection (AS2).
In T5 (Raffel et al., 2020), the authors propose to use deshuffling (Liu et al., 2019a) to pre-train an autoregressive model. They shuffle random spans of text and ask the model to output the tokens in the original order. This technique provides good results on an extensive collection of benchmarks. However, we cannot compare with them because we focus on autoencoder architectures only.
Finally, we mention the work by Izsak et al. (2021), in which the authors list many optimizations that can be applied to transformers for faster pre-training. They also claim that using larger models with the same runtime leads to better results. We focus instead on the efficiency of the pre-training objective and on the classification head size. These techniques are thus orthogonal to our work and could be applied along with our alternative pre-training objectives.
3 Background on Pre-training Objectives
Before describing our models, we provide a detailed description of the most common token-level pre-training objectives used in the literature.
Masked Language Model (MLM)
was proposed by Devlin et al. (2019). In MLM, 15% of the input tokens are replaced with a special mask token, and the model has to predict the original value. An improvement of MLM is whole-word masking, in which masking is applied to every token belonging to a word and not just to independent sub-word tokens.

[Figure 1 diagram: the MLM, RTS, and SLM pipelines shown side by side. Each takes the input sequence "<s> The fox is sat on the table </s>", corrupts it (to "<s> The <mask> is sat on the table </s>" for MLM and "<s> The bottle is sat on the table </s>" for RTS and SLM), and feeds it through embeddings and Transformer layers to a classification head. The MLM and SLM heads output over the whole vocabulary, while the RTS head always has only 2 outputs.]
Figure 1: MLM, RTS and SLM architectures (from left to right). Notice that the classification head used by RTS is several times smaller than those used by MLM and SLM (see Appendix G).
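To make the corruption step concrete, the following is a minimal sketch of MLM input masking in PyTorch, using the 15% rate described above. The function name, the special_tokens_mask argument, and the use of -100 as the ignored-label value are illustrative assumptions, not a reference implementation.

```python
import torch

def mlm_corrupt(input_ids: torch.Tensor, mask_token_id: int,
                special_tokens_mask: torch.Tensor, mlm_prob: float = 0.15):
    """Replace ~15% of the (non-special) tokens with MASK and build MLM labels.

    Positions that are not masked get label -100 so that a standard
    cross-entropy loss ignores them; the vocabulary-sized head therefore
    only receives a training signal at the masked positions.
    """
    labels = input_ids.clone()
    # Sample the positions to mask, excluding special tokens (CLS, SEP, PAD).
    probs = torch.full(input_ids.shape, mlm_prob)
    probs.masked_fill_(special_tokens_mask.bool(), 0.0)
    masked = torch.bernoulli(probs).bool()

    labels[~masked] = -100             # loss is computed on masked positions only
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id  # the model sees the special MASK token here
    # (BERT additionally keeps or randomizes a fraction of the selected tokens;
    #  that 80/10/10 split is omitted here for brevity.)
    return corrupted, labels
```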
The model needs a large classification head for pre-training, as shown in Figure 1. Its dimension is proportional to the vocabulary size, and (especially for small models) this constitutes a significant fraction of the entire architecture's parameters. In the base architectures, the MLM head accounts for about 20% of the model parameters, while for small models the fraction increases to 30% (very often, the language-modeling head parameters are shared with the embedding layer). The memory footprint of the LM head during training is about 47% for base and 64% for small models, since gradients have to be computed for every token in the vocabulary because of the final softmax layer. See Appendix G for more details. For this reason, a binary classification head (as in TD) can provide significant efficiency improvements. Moreover, sharing the parameters of the MLM classification head with the embeddings does not reduce the computational cost; it only leads to marginally lower memory requirements. In the embedding layer, only the few row vectors corresponding to the actual sentence tokens are updated at every step, whereas MLM's softmax continuously computes gradients for the whole output linear transformation.
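As a rough sanity check of the percentages above, here is a back-of-the-envelope estimate with assumed numbers (BERT's 30,522-token WordPiece vocabulary, hidden size 768, roughly 110M total parameters for a base model; the head's bias and transform layer are ignored):

```python
# Illustrative estimate of the MLM head size for a base-sized encoder.
vocab_size, hidden_size = 30_522, 768       # assumed BERT-base values
total_params = 110_000_000                  # approximate BERT-base parameter count

mlm_head_params = vocab_size * hidden_size  # output projection onto the vocabulary
print(f"{mlm_head_params / 1e6:.1f}M parameters, "
      f"{mlm_head_params / total_params:.0%} of the whole model")
# -> 23.4M parameters, 21% of the whole model
```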
Causal Language Model (CLM)
is used to train autoregressive models by predicting the next token in a sequence (Radford et al., 2019; Radford, 2018). Similarly to MLM, it requires a large classification head to output predictions over the whole vocabulary.
Permutation Language Modeling (PLM)
was proposed by Yang et al. (2020) to combine the generative power of autoregressive models with the bidirectional context of autoencoders. This is accomplished by permuting the input tokens and letting the model use only the left context for the next-token prediction. In this way, the model keeps the strengths of autoregressive models while exploiting the whole input sequence for better-contextualized output embeddings.
Token Detection (TD)
was introduced by ELECTRA (Clark et al., 2020), which is an architecture composed of a discriminator and a smaller generator network. First, the generator is trained with MLM and finds suitable replacements for the masked tokens, as in BERT. Then, those candidates are inserted into the original sentence, and the resulting sequence is fed to the discriminator, which classifies whether each token is original or replaced. TD has the advantage of computing the loss over the whole discriminator output while having a minimal memory footprint. However, the whole system is inefficient because of the presence of the generator, which is still MLM-based.
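The sketch below shows how a TD training batch can be assembled, assuming a small generator with a Hugging Face-style masked-LM interface (returning .logits); the function name, sampling strategy, and tensor layout are illustrative choices rather than ELECTRA's exact implementation.

```python
import torch

@torch.no_grad()  # the generator is trained with its own MLM loss, not shown here
def build_td_batch(input_ids, masked_ids, masked_positions, generator):
    """Fill the masked positions with tokens sampled from the generator and
    derive binary labels for the discriminator (1 = token was replaced)."""
    logits = generator(input_ids=masked_ids).logits  # (batch, seq_len, vocab)
    samples = torch.distributions.Categorical(logits=logits).sample()

    corrupted = torch.where(masked_positions, samples, input_ids)
    # If the generator happens to sample the original token, the label stays 0.
    td_labels = (corrupted != input_ids).long()
    return corrupted, td_labels
```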
4 Effective Pre-training Objectives
This section presents our alternative pre-training
objectives, which can potentially be applied to a
wide range of Transformer-based models.
Random Token Substitution (RTS)
Like ELECTRA, RTS trains a model that discriminates between original and substituted tokens. The main difference is that RTS replaces 15% of the tokens with random alternatives, thus avoiding using computational resources to train a separate and expensive generator network. Besides, unlike MLM, this approach relies on a smaller classification head whose size is not proportional to the vocabulary size (see Figure 1).
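A minimal sketch of the RTS corruption step in PyTorch follows, under the simplifying assumptions that replacements are drawn uniformly from the vocabulary and that special tokens are excluded via a special_tokens_mask; names and details are purely illustrative.

```python
import torch

def rts_corrupt(input_ids, vocab_size, special_tokens_mask, rts_prob=0.15):
    """Replace ~15% of the (non-special) tokens with random vocabulary entries
    and return binary labels for the token-level discriminator."""
    probs = torch.full(input_ids.shape, rts_prob)
    probs.masked_fill_(special_tokens_mask.bool(), 0.0)
    selected = torch.bernoulli(probs).bool()

    random_tokens = torch.randint(0, vocab_size, input_ids.shape)
    corrupted = torch.where(selected, random_tokens, input_ids)
    # Label 1 only where the token actually changed (a random draw may hit the original).
    labels = (corrupted != input_ids).long()
    return corrupted, labels
```

No generator is involved, and the binary head makes the output layer independent of the vocabulary size.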