Xu and Zhao, 2021; Wang et al., 2020; Guu et al., 2020). For example, masked language modeling (MLM) (Devlin et al., 2019) replaces some input tokens in a sentence with a special symbol. BART uses token deletion, text infilling, and sentence permutation for corruption (Lewis et al., 2020a).
2) Denoising enables a language model to
predict missing or otherwise corrupted tokens in
the input sequences. Recent studies focus on
designing improved language modeling functions
to mitigate discrepancies between the pre-training
phase and the fine-tuning phase. Yang et al. (2019) reformulate MLM in XLNet by restoring the permuted tokens in factorization order, such that the input sequence is generated autoregressively after permutation. In addition, masking with synonyms (Cui et al., 2020) and simple pre-training objectives based on token-level classification tasks (Yamaguchi et al., 2021) have also proved to be effective alternatives to MLM.
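To make the ennoising/denoising split concrete, the sketch below shows the ennoising side of MLM as random token masking; the denoising side then trains the model to recover the tokens at the recorded positions. The 15% masking rate, the [MASK] symbol, and the helper name are illustrative choices, not the exact recipe of any particular model.

```python
import random

MASK = "[MASK]"

def mlm_ennoise(tokens, mask_prob=0.15, seed=None):
    """Corrupt a token sequence by replacing a random subset with [MASK].

    Returns the corrupted sequence together with (position, original token)
    pairs, which serve as the denoising targets.
    """
    rng = random.Random(seed)
    corrupted, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted[i] = MASK
            targets.append((i, tok))
    return corrupted, targets

# Different random draws corrupt the same sentence to different degrees:
sentence = "the movie is not worth watching at all".split()
print(mlm_ennoise(sentence, seed=0))
print(mlm_ennoise(sentence, seed=3))
```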
Most existing studies of PrLMs focus either on better ennoising operations or on more effective denoising strategies. They treat training instances equally throughout the training process, and little attention is paid to the individual contribution of those instances. In standard MLM ennoising, randomly masking different tokens leads to different degrees of corruption, which in turn cause different levels of difficulty in sentence restoration during denoising (as shown in Figure 1) and increase the uncertainty of recovering the original sentence structure. For example, if “not” is masked, the corrupted sentence tends to take on the opposite meaning.
In this work, we are motivated to estimate the complexity of restoring the original sentences from corrupted ones in language model pre-training, so as to provide explicit regularization signals that encourage more effective and robust pre-training. Our approach includes two penalty terms: 1) an ennoising corruption penalty, which measures the distribution disparity between the corrupted sentence and the original sentence and thus captures the degree of corruption introduced in the ennoising process; 2) a denoising prediction penalty, which measures the distribution difference between the restored sequence and the original sentence and thus captures the sentence-level prediction confidence on the denoising side. Experiments show that language models trained with our regularization terms yield better performance and become more robust against adversarial attacks.
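The precise formulation of these penalties is given later; purely as a hypothetical sketch of the idea, both terms can be viewed as divergences between sentence-level distributions. The bag-of-words representation, the KL divergence, the smoothing constant, and the helper names below are illustrative assumptions rather than the formulation used in this work.

```python
import torch

def bow_distribution(token_ids, vocab_size, eps=1e-8):
    """Smoothed bag-of-words distribution of a token sequence over the vocabulary."""
    counts = torch.bincount(token_ids, minlength=vocab_size).float() + eps
    return counts / counts.sum()

def kl_divergence(p, q):
    """KL(p || q) for two dense, already-smoothed probability vectors."""
    return (p * (p / q).log()).sum()

def corruption_penalty(original_ids, corrupted_ids, vocab_size):
    # Ennoising side: how far the corrupted sentence drifts from the original.
    return kl_divergence(bow_distribution(original_ids, vocab_size),
                         bow_distribution(corrupted_ids, vocab_size))

def prediction_penalty(pred_probs, original_ids, vocab_size):
    # Denoising side: how far the restored sequence (here, the model's averaged
    # predictive distribution of shape [seq_len, vocab_size]) is from the original.
    q = pred_probs.mean(dim=0)
    q = (q + 1e-8) / (q + 1e-8).sum()
    return kl_divergence(bow_distribution(original_ids, vocab_size), q)
```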
2 Related Work
Training powerful large-scale language models on large unlabeled corpora with self-supervised objectives has attracted much attention; such pre-training commonly works in two procedures, ennoising and denoising. The most representative pre-training task is MLM, which was introduced by Devlin et al. (2019) to pre-train the bidirectional BERT. A spectrum of ennoising extensions has been proposed to further enhance MLM and alleviate its potential drawbacks; they fall into two categories: 1) mask units and 2) noising schemes. Mask units correspond to the language
modeling units that serve as knowledge carriers
in different granularity. The variants focusing on mask units include the standard subword masking (Devlin et al., 2019), span masking (Joshi et al., 2020), and n-gram masking (Levine et al., 2021; Li and Zhao, 2021). For the noising scheme, BART (Lewis et al., 2020a) corrupts text with arbitrary noising functions, including token deletion, text infilling, and sentence permutation, in conjunction with
MLM. UniLM (Dong et al., 2019) extends mask prediction to generation tasks by adding auto-regressive objectives. XLNet (Yang et al., 2019) proposes permuted language modeling to learn the dependencies among the masked tokens. MacBERT (Cui et al., 2020) suggests using similar words for masking. Yamaguchi et al. (2021) also investigate simple pre-training objectives based on token-level classification tasks as replacements for MLM, which are often computationally cheaper and yield performance comparable to MLM. In addition,
ELECTRA (Clark et al., 2020) proposes a novel training objective called replaced token detection, which is defined over all input tokens.
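As a rough sketch of how such a loss is shaped (omitting ELECTRA's generator that produces the replacements, and using illustrative names), replaced token detection reduces to a per-position binary classification:

```python
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits, input_ids, original_ids):
    """Replaced token detection: one binary decision per input position.

    disc_logits: per-token logits from a discriminator head, shape [seq_len].
    input_ids / original_ids: the (possibly replaced) input tokens and the
    original tokens; a position is labeled 1 if its token was replaced, else 0.
    """
    labels = (input_ids != original_ids).float()
    return F.binary_cross_entropy_with_logits(disc_logits, labels)
```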
Although the above studies have thoroughly investigated how to reduce the mismatch between pre-training and fine-tuning tasks, an essential problem of the common denoising mechanism has received little attention. Constructing training examples through ennoising operations breaks the sentence structure, whether the noising functions are replacement-, addition-, or deletion-based. In extreme cases, the destruction can lead to completely different sentences, making it difficult for the model to predict the corrupted