InforMask: Unsupervised Informative Masking for Language Model Pretraining
Nafis Sadeq, Canwen Xu, Julian McAuley
University of California, San Diego
{nsadeq,cxu,jmcauley}@ucsd.edu
arXiv:2210.11771v1 [cs.CL] 21 Oct 2022
Abstract
Masked language modeling is widely used for pretraining large language models for natural language understanding (NLU). However, random masking is suboptimal, allocating an equal masking rate to all tokens. In this paper, we propose InforMask, a new unsupervised masking strategy for training masked language models. InforMask exploits Pointwise Mutual Information (PMI) to select the most informative tokens to mask. We further propose two optimizations for InforMask to improve its efficiency. With a one-off preprocessing step, InforMask outperforms random masking and previously proposed masking strategies on the factual recall benchmark LAMA and the question answering benchmarks SQuAD v1 and v2.
1 Introduction
Masked Language Modeling (MLM) is widely used for training language models (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020; Raffel et al., 2020). MLM randomly selects a portion of tokens from a text sample and replaces them with a special mask token (e.g., [MASK]). However, random masking has a few drawbacks: it sometimes produces masks that are too easy to guess, providing a small loss that is inefficient for training; some randomly masked tokens can be guessed with only local cues (Joshi et al., 2020); and all tokens have an identical probability of being masked, even though (e.g.) named entities are more important and need special attention (Sun et al., 2019; Levine et al., 2021).
In this paper, we propose a new strategy for choosing tokens to mask in text samples. We aim to select the words carrying the most information that can benefit the language model, especially for knowledge-intense tasks. To tackle this challenge, we propose InforMask, an unsupervised informative masking strategy for language model pretraining. First, we introduce Informative Relevance, a metric based on Pointwise Mutual Information (PMI, Fano, 1961) to measure the quality of a masking choice. Optimizing this measure ensures the informativeness of the masked tokens while maintaining a moderate difficulty for the model to predict them. The metric is based on a statistical analysis of the corpus and does not require any supervision or external resources.

Equal contribution.
1 The code and model checkpoints are available at https://github.com/NafisSadeq/InforMask.
However, maximizing the total Informative Relevance of a text sample with multiple masks can be computationally challenging. Thus, we propose a sample-and-score algorithm that reduces the time complexity of masking and diversifies the patterns in the output. An example is shown in Figure 1. When training a language model for more epochs, we can further accelerate the masking process by running the algorithm only once as a preprocessing step and assigning each token a token-specific masking rate according to its masking frequency in the corpus, approximating the masking decisions of the sample-and-score algorithm. After this one-off preprocessing step, masking is as fast as the original random masking, with no further overhead, which is desirable for large-scale distributed language model training over many epochs.
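The two optimizations can be sketched roughly as follows. This is a minimal illustration under our own assumptions (sentences are given as plain token lists, and score_fn implements the Informative Relevance metric introduced in Section 3.1); the function names are ours and do not mirror the released code.

```python
import random
from collections import Counter

def sample_and_score(tokens, score_fn, mask_rate=0.15, s=4, rng=random):
    """Draw s random masking candidates for one sentence and keep the candidate
    with the highest Informative Relevance according to score_fn."""
    n_mask = max(1, round(mask_rate * len(tokens)))
    candidates = [set(rng.sample(range(len(tokens)), n_mask)) for _ in range(s)]
    return max(candidates, key=lambda cand: score_fn(cand, tokens))

def token_masking_rates(corpus, score_fn, mask_rate=0.15, s=4):
    """One-off preprocessing: run sample-and-score over the corpus once, then turn
    each token's masking frequency into a token-specific masking rate that can be
    applied during training as cheaply as random masking."""
    masked, seen = Counter(), Counter()
    for tokens in corpus:  # corpus: iterable of token lists
        seen.update(tokens)
        chosen = sample_and_score(tokens, score_fn, mask_rate, s)
        masked.update(tokens[i] for i in chosen)
    return {w: masked[w] / seen[w] for w in seen}
```

During pretraining, each token would then be masked independently with its stored rate (rescaled if needed so the expected overall masking rate stays near the target), approximating the per-sentence sample-and-score decisions at no extra cost.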
To verify the effectiveness of our proposed method, we conduct extensive experiments on two knowledge-intense tasks: factual recall and question answering. On the factual recall benchmark LAMA (Petroni et al., 2019), InforMask outperforms other masking strategies by a large margin. Also, our base-size model, InformBERT, trained on the same corpus and for the same number of epochs as BERT (Devlin et al., 2019), outperforms BERT-base on the question answering benchmark SQuAD (Rajpurkar et al., 2016, 2018). Notably, on the LAMA benchmark, InformBERT outperforms BERT and RoBERTa (Liu et al., 2019) models that have 3× the parameters and 10× the corpus size.

Figure 1: The informative scores of randomly sampled masking candidates (s = 4) for the example sentence "Thomas Edison was an inventor and businessman." [M] denotes the masked tokens. The pretraining objective of the masked language model (MLM) is to predict the masked tokens based on the context.
To summarize, our contributions are as follows:
• We propose InforMask, an informative masking strategy for language model pretraining that does not require extra supervision or external resources.
• We pretrain and release InformBERT, a base-size English BERT model that substantially outperforms BERT and RoBERTa on the factual recall benchmark LAMA despite having far fewer parameters and less training data. InformBERT also achieves competitive results on the question answering datasets SQuAD v1 and v2.
2 Related Work
Random Masking
For pretraining Transformer-based (Vaswani et al., 2017) language models such as BERT (Devlin et al., 2019), a portion of the tokens is randomly chosen to be masked, setting up the masked language modeling (MLM) objective. Prior studies have commonly used a masking rate of 15% (Devlin et al., 2019; Joshi et al., 2020; Levine et al., 2021; Sun et al., 2019; Lan et al., 2020; He et al., 2021), while some recent studies argue that a 15% masking rate may be a limitation (Clark et al., 2020) and that pretraining may benefit from increasing the rate to 40% (Wettig et al., 2022). However, random masking is not an ideal choice for learning factual and commonsense knowledge: words with high informative value may be masked less frequently than (e.g.) stop words, given their relative frequencies in the corpus.
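For reference, BERT-style random masking can be sketched as below; the 80/10/10 replacement scheme follows Devlin et al. (2019), the label value -100 is the usual PyTorch ignore index, and everything else (flat token id lists, no special-token handling) is a simplifying assumption.

```python
import random

def random_mask(token_ids, mask_id, vocab_size, mask_rate=0.15, rng=random):
    """Select each position with probability mask_rate; replace it with [MASK] 80%
    of the time, a random token 10% of the time, and leave it unchanged 10% of
    the time. Unselected positions get label -100 so the loss ignores them."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_rate:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)
    return inputs, labels
```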
Span Masking
Although random masking is effective for pretraining a language model, some prior works have attempted to optimize the masking procedure. Joshi et al. (2020) propose SpanBERT and show improved performance on downstream NLP tasks by masking spans of words instead of individual tokens. They randomly select the starting point of a span, sample a span length from a geometric distribution, and mask the selected span, repeating until the target masking rate is met. The authors argue that masking spans instead of single words prevents the model from predicting masked words by only looking at local cues. However, this masking strategy inevitably weakens the modeling of dependencies between the words within a span (e.g., 'Mount Fuji', 'Mona Lisa'), which may hinder its performance on knowledge-intense tasks.
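A simplified sketch of this span-selection procedure follows; the geometric parameter p = 0.2 and the clip at ten tokens follow the SpanBERT paper, while the rest (no subword-boundary handling, set-based bookkeeping) is our simplification.

```python
import math
import random

def sample_geometric(p, rng=random):
    """Inverse-CDF sample from a geometric distribution with support 1, 2, ..."""
    return int(math.log(1.0 - rng.random()) / math.log(1.0 - p)) + 1

def span_mask_positions(seq_len, mask_rate=0.15, p=0.2, max_span=10, rng=random):
    """Pick random start positions and geometric span lengths until the masking
    budget is met; returns the set of masked token positions."""
    budget = max(1, int(mask_rate * seq_len))
    masked = set()
    while len(masked) < budget:
        length = min(sample_geometric(p, rng), max_span)
        start = rng.randrange(seq_len)
        masked.update(range(start, min(start + length, seq_len)))
    return masked
```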
Entity-based Masking
Baidu-ERNIE (Sun et al., 2019) introduces an informed masking strategy where a span containing named entities will be masked. This approach shows improvement compared to random masking but requires prior knowledge regarding named entities. Similarly, Guu et al. (2020) propose Salient Span Masking, where a span corresponding to a unique entity will be masked. They rely on an off-the-shelf named entity recognition (NER) system to identify entity names. LUKE (Yamada et al., 2020) exploits an annotated entity corpus to explicitly mark out the named entities in the pretraining corpus, and masks non-entity words and named entities separately.
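As a rough approximation in the spirit of these methods (not their exact procedures), entity spans can be identified with an off-the-shelf NER system and preferred for masking; here we assume spaCy's small English model and fall back to random tokens when entity spans alone do not fill the budget.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # off-the-shelf NER model (an assumption)

def entity_mask_positions(text, mask_rate=0.15, rng=random):
    """Mask token positions covered by named entities first, then pad with random
    positions until the masking budget is reached."""
    doc = nlp(text)
    if len(doc) == 0:
        return set()
    budget = max(1, int(mask_rate * len(doc)))
    masked = set()
    for ent in doc.ents:  # ent.start / ent.end are token indices of the entity span
        masked.update(range(ent.start, ent.end))
        if len(masked) >= budget:
            break
    while len(masked) < budget:
        masked.add(rng.randrange(len(doc)))
    return masked
```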
PMI Masking
Levine et al. (2021) propose a masking strategy based on Pointwise Mutual Information (PMI, Fano, 1961), where spans of up to five words are masked based on the joint PMI of the words in the span. PMI-Masking is an adaptation of SpanBERT (Joshi et al., 2020) in which meaningful spans are masked instead of random ones. However, PMI-Masking only considers correlated spans and fails to focus on unigram named entities, which may lead to suboptimal performance on knowledge-intense tasks (details in Section 4.2). In our proposed method, we exploit PMI to determine the informative value of individual tokens, encouraging more efficient training and improving performance on knowledge-intense tasks.
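As a much-simplified illustration (not the actual PMI-Masking measure, which handles n-grams of up to five words), adjacent bigrams can be ranked by their joint PMI and the top-scoring spans proposed as masking candidates:

```python
def top_pmi_bigrams(tokens, pmi, k=3):
    """Score each adjacent bigram by its PMI (from a precomputed lookup table,
    here assumed to be a dict keyed by word pairs) and return the k best spans
    as (start, end) index pairs."""
    scored = [((i, i + 2), pmi.get((tokens[i], tokens[i + 1]), float("-inf")))
              for i in range(len(tokens) - 1)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [span for span, _ in scored[:k]]
```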
Knowledge-Enhanced LMs
KnowBERT (Peters et al., 2019) shows that the factual recall performance of BERT can be improved significantly by embedding knowledge bases into additional layers of the model. Tsinghua-ERNIE (Zhang et al., 2019) takes a similar approach, injecting knowledge graphs into the language model during pretraining. KEPLER (Wang et al., 2021) uses a knowledge base to jointly optimize a knowledge embedding loss and the MLM loss on a general corpus, improving the knowledge capacity of the language model. Similar ideas are explored in K-BERT (Liu et al., 2020) and CoLAKE (Sun et al., 2020). CokeBERT (Su et al., 2021) demonstrates that incorporating embeddings of dynamic knowledge contexts can be more effective than incorporating static knowledge graphs. Other works incorporate knowledge in the form of lexical relations (Lauscher et al., 2020), word senses (Levine et al., 2020), syntax (Bai et al., 2021), and part-of-speech (POS) tags (Ke et al., 2020). However, a high-quality knowledge base is expensive to construct and unavailable for many languages. Different from these methods, our method is fully unsupervised and does not rely on any external resources.
3 Methodology
InforMask aims to make masking decisions more 'informative'. Since not all words are equally rich in information (Levine et al., 2021), we aim to automatically identify the more important tokens (e.g., named entities) and increase their probability of being masked, while preserving the factual hints needed to recover them. On the other hand, we would like to reduce the frequency of masking stop words. Stop words are naturally common in the corpus, and they can be important for learning the syntax and structure of a sentence. However, masked stop words can be too easy for a language model to predict, especially in the later stages of pretraining. Thus, properly reducing the masking frequency of stop words can improve both the efficiency and the performance of the model.
Figure 2: The PMI matrix of the words in the sentence ‘The duel is between Harry Potter and Lord Voldemort.’
3.1 Informative Relevance
To generate highly informative masking decisions for a sentence, we introduce a new concept, Informative Relevance. Informative Relevance measures how relevant a masked word is to the unmasked words, so that the mask is both meaningful and predictable. The Informative Relevance of a word is calculated by summing the Pointwise Mutual Information (PMI, Fano, 1961) between the masked word and all unmasked words in the sentence. The PMI between two words w_1 and w_2 represents how 'surprising' their co-occurrence is, accounting for their individual probabilities. Formally, the PMI of the combination w_1 w_2 is defined as:

pmi(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}    (1)
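For intuition, a small numeric example with made-up probabilities: if p(w_1) = 0.01, p(w_2) = 0.002, and p(w_1, w_2) = 0.0005, then pmi(w_1, w_2) = log(0.0005 / (0.01 × 0.002)) = log 25 ≈ 3.2 (natural log). A pair that co-occurs exactly as often as chance would predict scores 0, and a pair that co-occurs less often than chance scores negative.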
The PMI matrix is calculated over the entire corpus. Note that instead of using bigrams (i.e., requiring two words to be adjacent), we consider skip-gram co-occurrence within a window. The window size is selected so that sentence-level co-occurrence is captured in addition to local co-occurrence.
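A minimal sketch of building such a PMI table from windowed co-occurrence counts, directly implementing Eq. (1); the window size of 10 and the simple symmetric counting scheme are our assumptions rather than the paper's exact setting.

```python
import math
from collections import Counter

def build_pmi(corpus, window=10):
    """Estimate pmi(w1, w2) = log(p(w1, w2) / (p(w1) p(w2))) from skip-gram
    co-occurrences within a fixed window over the whole corpus."""
    word_counts, pair_counts = Counter(), Counter()
    n_words, n_pairs = 0, 0
    for tokens in corpus:  # corpus: iterable of token lists
        word_counts.update(tokens)
        n_words += len(tokens)
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                pair_counts[(w, v)] += 1  # count both orders so the table is symmetric
                pair_counts[(v, w)] += 1
                n_pairs += 2
    pmi = {}
    for (w, v), c in pair_counts.items():
        p_pair = c / n_pairs
        p_w, p_v = word_counts[w] / n_words, word_counts[v] / n_words
        pmi[(w, v)] = math.log(p_pair / (p_w * p_v))
    return pmi
```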
Maximizing the Informative Relevance enables the model to better memorize knowledge and focus on more informative words. Since Informative Relevance is calculated between a masked word and the unmasked words, it also encourages hints to be preserved so that the model can reasonably predict the masked words.
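A minimal sketch of the metric as described above, assuming a precomputed PMI table such as the one built earlier; defaulting unseen pairs to 0 is our simplification.

```python
def informative_relevance(masked_idx, tokens, pmi):
    """Sum of PMI between every masked token and every unmasked token in the
    sentence; pairs missing from the PMI table contribute 0."""
    unmasked = [tok for i, tok in enumerate(tokens) if i not in masked_idx]
    return sum(pmi.get((tokens[i], u), 0.0) for i in masked_idx for u in unmasked)
```

Passed to sample_and_score above as score_fn=lambda cand, toks: informative_relevance(cand, toks, pmi), this corresponds to the scoring illustrated in Figure 1.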