
InforMask: Unsupervised Informative Masking for
Language Model Pretraining
Nafis Sadeq∗, Canwen Xu∗, Julian McAuley
University of California, San Diego
{nsadeq,cxu,jmcauley}@ucsd.edu
∗Equal contribution.
Abstract
Masked language modeling is widely used for pretraining large language models for natural language understanding (NLU). However, random masking is suboptimal, allocating an equal masking rate to all tokens. In this paper, we propose InforMask, a new unsupervised masking strategy for training masked language models. InforMask exploits Pointwise Mutual Information (PMI) to select the most informative tokens to mask. We further propose two optimizations for InforMask to improve its efficiency. With a one-off preprocessing step, InforMask outperforms random masking and previously proposed masking strategies on the factual recall benchmark LAMA and the question answering benchmarks SQuAD v1 and v2.¹

¹The code and model checkpoints are available at https://github.com/NafisSadeq/InforMask.
1 Introduction
Masked Language Modeling (MLM) is widely used for training language models (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020; Raffel et al., 2020). MLM randomly selects a portion of the tokens in a text sample and replaces them with a special mask token (e.g., [MASK]). However, random masking has a few drawbacks: it sometimes produces masks that are too easy to guess, yielding a small loss and an inefficient training signal; some randomly masked tokens can be guessed from local cues alone (Joshi et al., 2020); and all tokens have an identical probability of being masked, although some tokens (e.g., named entities) are more informative and deserve special attention (Sun et al., 2019; Levine et al., 2021).
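As a point of reference, the random masking baseline is simple to implement. The sketch below follows BERT's published recipe (a 15% selection rate with the 80/10/10 replacement scheme); the function and variable names are illustrative and not taken from any particular codebase.

import random

MASK_TOKEN = "[MASK]"

def random_mask(tokens, mask_rate=0.15, vocab=None, seed=None):
    """BERT-style random masking: each position is selected with
    probability mask_rate; a selected token is replaced by [MASK]
    80% of the time, by a random vocabulary token 10% of the time,
    and kept unchanged 10% of the time."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # prediction targets at selected positions
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK_TOKEN
            elif r < 0.9 and vocab is not None:
                masked[i] = rng.choice(vocab)
            # otherwise keep the original token
    return masked, labels

Every position is treated identically here, which is exactly the limitation described above.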
In this paper, we propose a new strategy for choosing which tokens to mask in text samples. We aim to select the words that carry the most information and can benefit the language model, especially for knowledge-intensive tasks. To tackle this challenge, we propose InforMask, an unsupervised informative masking strategy for language model pretraining. First, we introduce Informative Relevance, a metric based on Pointwise Mutual Information (PMI, Fano, 1961) that measures the quality of a masking choice. Optimizing this metric ensures that the masked tokens are informative while remaining moderately difficult for the model to predict. The metric relies on a statistical analysis of the corpus and requires no supervision or external resources.
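For reference, the PMI of a token pair (x, y) is

PMI(x, y) = log [ P(x, y) / ( P(x) P(y) ) ],

where P(x, y) is the probability that x and y co-occur (e.g., within the same sentence or a fixed-size window; the choice of context window here is an illustrative assumption) and P(x) and P(y) are their marginal probabilities, all estimated from corpus counts. How these pairwise scores are aggregated into the Informative Relevance of a complete masking choice is defined later in the paper.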
However, maximizing the total Informative Relevance of a text sample with multiple masks can be computationally challenging. Thus, we propose a sample-and-score algorithm to reduce the time complexity of masking and diversify the patterns in the output. An example is shown in Figure 1.
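The sketch below illustrates the general shape of such a sample-and-score procedure: draw several random candidate mask sets and keep the one whose masked tokens have the highest total PMI with the remaining context. The number of candidates, the masking budget, and the exact scoring function shown here are illustrative assumptions, not the configuration used by InforMask.

import random

def sample_and_score(tokens, pmi, num_candidates=10, mask_budget=0.15, seed=0):
    """Sample several candidate mask sets and keep the highest-scoring one.
    `pmi` maps a token pair to its corpus-level PMI score."""
    if not tokens:
        return set()
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_budget))
    best_set, best_score = set(), float("-inf")
    for _ in range(num_candidates):
        cand = set(rng.sample(range(len(tokens)), n_mask))
        context = [tokens[j] for j in range(len(tokens)) if j not in cand]
        # Score a candidate by the total PMI between its masked tokens
        # and the tokens that remain visible.
        score = sum(pmi.get((tokens[i], c), 0.0) for i in cand for c in context)
        if score > best_score:
            best_set, best_score = cand, score
    return best_set

Scoring a handful of sampled candidates avoids searching over all possible mask sets, and the randomness in the sampling step is what diversifies the resulting masking patterns.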
When training a language model for many epochs, we can further accelerate masking by running the algorithm only once, as a preprocessing step, and assigning each token a token-specific masking rate according to its masking frequency in the corpus, thereby approximating the masking decisions of the sample-and-score algorithm. After this one-off preprocessing step, masking is as fast as the original random masking, with no further overhead, which is desirable for large-scale distributed language model training over many epochs.
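A minimal sketch of this preprocessing step, reusing the sample_and_score sketch above, is shown below: run the algorithm once over the corpus, count how often each vocabulary item is masked, and convert those counts into per-token masking rates. The normalization used here (rescaling so that the expected corpus-level rate matches a global budget) is an illustrative assumption.

from collections import Counter

def token_masking_rates(corpus, pmi, global_rate=0.15):
    """One-off preprocessing: estimate a masking rate per token type from
    how often sample-and-score masks it, so that later epochs can mask
    each token independently at random-masking speed."""
    masked_counts, total_counts = Counter(), Counter()
    for tokens in corpus:
        chosen = sample_and_score(tokens, pmi)
        total_counts.update(tokens)
        masked_counts.update(tokens[i] for i in chosen)
    freq = {t: masked_counts[t] / total_counts[t] for t in total_counts}
    # Rescale so the expected overall masking rate equals global_rate.
    mean = sum(freq[t] * total_counts[t] for t in freq) / sum(total_counts.values())
    return {t: min(1.0, freq[t] * global_rate / mean) for t in freq}

At training time, each occurrence of a token is then masked independently with its estimated rate, so the per-step cost matches that of random masking.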
To verify the effectiveness of our proposed method, we conduct extensive experiments on two knowledge-intensive tasks: factual recall and question answering. On the factual recall benchmark LAMA (Petroni et al., 2019), InforMask outperforms other masking strategies by a large margin. In addition, our base-size model, InformBERT, trained on the same corpus for the same number of epochs as BERT (Devlin et al., 2019), outperforms BERT-base on the question answering benchmark SQuAD (Rajpurkar et al., 2016, 2018). Notably, on the LAMA