MLM in almost every task using precisely the same
computational budget.
Finally, we empirically study the trade-offs men-
tioned above by pre-training standard BERT mod-
els with the proposed objectives, also comparing
them with state-of-the-art architectures trained and
tested on the same data. We perform a fair comparison by evaluating our models on several natural language understanding benchmarks: all tasks in the GLUE benchmark suite, as well as ASNQ, WikiQA, and TREC-QA, reporting accuracy along with efficiency and cost. To better assess the latter, we also test the impact of our objectives on smaller architectures (e.g., BERT-small), showing that our approaches have an even larger impact on this class of models.
2 Related Work
Many different objectives for self-supervised learning have been proposed in recent works, such as Causal Language Modeling (CLM) (Radford, 2018; Radford et al., 2019; Brown et al., 2020), Masked Language Modeling (MLM) (Devlin et al., 2019; Liu et al., 2019b) and Token Detection (TD) (Clark et al., 2020). The latter is used by ELECTRA, which is composed of a generator and a discriminator. While the generator is trained with MLM to find suitable candidates to replace the special MASK tokens, the discriminator is trained to recognize which tokens in the resulting text have been replaced. After pre-training, the generator is discarded and the discriminator is used as the pre-trained language model.
ELECTRA introduces several innovations: (i) the loss is computed over the whole output of the discriminator, i.e., over every input position, which provides a stronger signal for back-propagation; (ii) a generator network is used to find suitable replacements; and (iii) the discriminator never sees spurious tokens such as MASK. The latter is a main drawback of the original BERT: it creates an input discrepancy between pre-training and fine-tuning, because the CLS representation depends on every input token, including the artificial MASK tokens, through the self-attention mechanism.
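To make the Token Detection objective concrete, the following is a minimal PyTorch-style sketch of the discriminator loss, assuming hypothetical tensor names (disc_logits, original_ids, corrupted_ids, attention_mask); it illustrates the idea of computing the loss over every position rather than reproducing ELECTRA's actual implementation.

```python
import torch
import torch.nn.functional as F

def token_detection_loss(disc_logits, original_ids, corrupted_ids, attention_mask):
    """Token Detection (TD) loss computed over every non-padding position."""
    # A position is labeled 1 if the generator replaced its token, 0 otherwise.
    labels = (corrupted_ids != original_ids).float()
    # Binary cross-entropy on all positions, not only the masked ones,
    # which is what yields the denser training signal mentioned above.
    per_token = F.binary_cross_entropy_with_logits(
        disc_logits, labels, reduction="none"
    )
    mask = attention_mask.float()
    return (per_token * mask).sum() / mask.sum()

# Toy usage with random tensors (batch of 2, sequence length 8).
logits = torch.randn(2, 8)
orig = torch.randint(0, 100, (2, 8))
corr = orig.clone()
corr[:, :2] = torch.randint(100, 200, (2, 2))  # pretend two tokens were replaced
loss = token_detection_loss(logits, orig, corr, torch.ones(2, 8))
```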
Other research directions to reduce training time
address the architecture instead of the learning ob-
jective. In ALBERT (Lan et al., 2020), the authors tie the weights across all Transformer layers to save GPU memory, thus enabling larger batch sizes.
However, since the expressive power of their mod-
els is reduced when layers are tied, they must train
for much longer. Sanh et al. (2020) and Turc et al.
(2019) instead use distillation to reduce the model
size, but the pre-training is still expensive because
it requires a large teacher architecture.
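As an illustration of cross-layer parameter sharing, the following PyTorch-style sketch instantiates a single Transformer layer and reuses it at every depth step; the class name and hyper-parameters are our own illustrative choices, not ALBERT's actual implementation.

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Cross-layer parameter sharing (sketch): one Transformer layer is
    created and applied num_layers times, so the parameter count does not
    grow with depth."""

    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, hidden_states, key_padding_mask=None):
        # The same layer (same weights) is applied at every depth step.
        for _ in range(self.num_layers):
            hidden_states = self.layer(
                hidden_states, src_key_padding_mask=key_padding_mask
            )
        return hidden_states
```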
Although pre-training is performed only once, it usually requires weeks of computation on costly machines (Liu et al., 2019b; Brown et al., 2020), so it is important to find alternative, more efficient ways to pre-train transformers. Tay et al. (2020) provide an overview of many recent advancements in transformer efficiency.
Another successful objective-level improvement over MLM is SpanBERT (Joshi et al., 2020), which proposes two new objectives: Span-Masking and the Span Boundary Objective (SBO). Span-Masking is a refined version of MLM that masks contiguous spans of text instead of individual tokens, while SBO predicts the content of a masked span using only the output representations of the tokens at its boundaries. Furthermore, Zhang et al. (2020) propose a technique to improve downstream performance by adapting the model to the final task while pre-training. Similarly, Di Liello et al. (2022) perform continual pre-training with custom objectives to better adapt the model to Answer Sentence Selection (AS2).
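To illustrate the Span-Masking procedure described above, the following simplified sketch selects contiguous spans to mask using a clipped geometric distribution over span lengths, as in Joshi et al. (2020); the function name and default values are our own illustrative choices, and SBO itself is not shown.

```python
import numpy as np

def sample_span_mask(seq_len, mask_ratio=0.15, max_span_len=10, p=0.2):
    """Pick contiguous spans until roughly mask_ratio of the positions
    are covered; returns the sorted list of positions to mask."""
    budget = max(1, int(seq_len * mask_ratio))
    masked = set()
    while len(masked) < budget:
        # Span length from a geometric distribution, clipped to max_span_len.
        span_len = min(int(np.random.geometric(p)), max_span_len, seq_len)
        start = int(np.random.randint(0, seq_len - span_len + 1))
        masked.update(range(start, start + span_len))
    return sorted(masked)

# Example: positions to mask in a 20-token sequence.
print(sample_span_mask(seq_len=20))
```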
In T5 (Raffel et al., 2020), the authors propose to use deshuffling (Liu et al., 2019a) to pre-train an autoregressive model: random spans of text are shuffled, and the model is asked to output the tokens in the original order. This technique provides good results on an extensive collection of benchmarks. However, we cannot compare directly with this approach because we focus on autoencoder architectures only.
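As a toy illustration of deshuffling, the sketch below splits a token sequence into contiguous spans, shuffles them to form the model input, and keeps the original order as the target; the function is purely illustrative and does not reflect the exact span sampling of Raffel et al. (2020).

```python
import random

def deshuffling_example(tokens, num_spans=4, seed=0):
    """Shuffle contiguous spans of the input; the target is the original order."""
    rng = random.Random(seed)
    size = max(1, len(tokens) // num_spans)
    spans = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    rng.shuffle(spans)
    shuffled = [tok for span in spans for tok in span]
    return shuffled, list(tokens)  # (model input, autoregressive target)

src, tgt = deshuffling_example("the cat sat on the mat all day".split())
print(src, "->", tgt)
```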
Finally, we mention the work by Izsak et al. (2021), in which the authors list many optimizations that can be applied to transformers for faster pre-training. They also claim that training larger models with the same runtime leads to better results. We focus instead on the efficiency of the pre-training objective and on the size of the classification head; those techniques are therefore orthogonal to our work and could be applied along with our alternative pre-training objectives.
3 Background on Pre-training Objectives
Before describing our models, we provide a de-
tailed description of the most common token-level
pre-training objectives used in the literature.
Masked Language Model (MLM) was proposed by Devlin et al. (2019). In MLM, 15% of the input tokens are replaced with a special mask,