
InforMask: Unsupervised Informative Masking for
Language Model Pretraining
Nafis Sadeq∗, Canwen Xu∗, Julian McAuley
University of California, San Diego
{nsadeq,cxu,jmcauley}@ucsd.edu
∗Equal contribution.
Abstract
Masked language modeling is widely used for pretraining large language models for natural language understanding (NLU). However, random masking is suboptimal, allocating an equal masking rate to all tokens. In this paper, we propose InforMask, a new unsupervised masking strategy for training masked language models. InforMask exploits Pointwise Mutual Information (PMI) to select the most informative tokens to mask. We further propose two optimizations for InforMask to improve its efficiency. With a one-off preprocessing step, InforMask outperforms random masking and previously proposed masking strategies on the factual recall benchmark LAMA and the question answering benchmarks SQuAD v1 and v2.¹

¹The code and model checkpoints are available at https://github.com/NafisSadeq/InforMask.
1 Introduction
Masked Language Modeling (MLM) is widely used for training language models (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020; Raffel et al., 2020). MLM randomly selects a portion of the tokens in a text sample and replaces them with a special mask token (e.g., [MASK]). However, random masking has a few drawbacks: it sometimes produces masks that are too easy to guess, yielding a small loss and an inefficient training signal; some randomly masked tokens can be guessed from local cues alone (Joshi et al., 2020); and all tokens have an identical probability of being masked, although some tokens (e.g., named entities) are more informative and deserve special attention (Sun et al., 2019; Levine et al., 2021).
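As a point of reference, the random masking baseline is simple to implement. The sketch below follows BERT's published recipe (a 15% selection rate with the 80/10/10 replacement scheme); the function and variable names are illustrative and not taken from any particular codebase.

import random

MASK_TOKEN = "[MASK]"

def random_mask(tokens, mask_rate=0.15, vocab=None, seed=None):
    """BERT-style random masking: each position is selected with
    probability mask_rate; a selected token is replaced by [MASK]
    80% of the time, by a random vocabulary token 10% of the time,
    and kept unchanged 10% of the time."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # prediction targets at selected positions
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK_TOKEN
            elif r < 0.9 and vocab is not None:
                masked[i] = rng.choice(vocab)
            # otherwise keep the original token
    return masked, labels

Every position is treated identically here, which is exactly the limitation described above.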
In this paper, we propose a new strategy for choosing which tokens to mask in text samples. We aim to select the words that carry the most information and can benefit the language model, especially for knowledge-intensive tasks. To tackle this challenge, we propose InforMask, an unsupervised informative masking strategy for language model pretraining. First, we introduce Informative Relevance, a metric based on Pointwise Mutual Information (PMI, Fano, 1961) that measures the quality of a masking choice. Optimizing this metric ensures that the masked tokens are informative while remaining moderately difficult for the model to predict. The metric relies on a statistical analysis of the corpus and requires no supervision or external resources.
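For reference, the PMI of a token pair (x, y) is

PMI(x, y) = log [ P(x, y) / ( P(x) P(y) ) ],

where P(x, y) is the probability that x and y co-occur (e.g., within the same sentence or a fixed-size window; the choice of context window here is an illustrative assumption) and P(x) and P(y) are their marginal probabilities, all estimated from corpus counts. How these pairwise scores are aggregated into the Informative Relevance of a complete masking choice is defined later in the paper.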
However, maximizing the total Informative Relevance of a text sample with multiple masks can be computationally challenging. Thus, we propose a sample-and-score algorithm to reduce the time complexity of masking and diversify the patterns in the output. An example is shown in Figure 1.
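The sketch below illustrates the general shape of such a sample-and-score procedure: draw several random candidate mask sets and keep the one whose masked tokens have the highest total PMI with the remaining context. The number of candidates, the masking budget, and the exact scoring function shown here are illustrative assumptions, not the configuration used by InforMask.

import random

def sample_and_score(tokens, pmi, num_candidates=10, mask_budget=0.15, seed=0):
    """Sample several candidate mask sets and keep the highest-scoring one.
    `pmi` maps a token pair to its corpus-level PMI score."""
    if not tokens:
        return set()
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_budget))
    best_set, best_score = set(), float("-inf")
    for _ in range(num_candidates):
        cand = set(rng.sample(range(len(tokens)), n_mask))
        context = [tokens[j] for j in range(len(tokens)) if j not in cand]
        # Score a candidate by the total PMI between its masked tokens
        # and the tokens that remain visible.
        score = sum(pmi.get((tokens[i], c), 0.0) for i in cand for c in context)
        if score > best_score:
            best_set, best_score = cand, score
    return best_set

Scoring a handful of sampled candidates avoids searching over all possible mask sets, and the randomness in the sampling step is what diversifies the resulting masking patterns.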
When training a language model for many epochs, we can further accelerate masking by running the algorithm only once, as a preprocessing step, and assigning each token a token-specific masking rate according to its masking frequency in the corpus, thereby approximating the masking decisions of the sample-and-score algorithm. After this one-off preprocessing step, masking is as fast as the original random masking, with no further overhead, which is desirable for large-scale distributed language model training over many epochs.
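A minimal sketch of this preprocessing step, reusing the sample_and_score sketch above, is shown below: run the algorithm once over the corpus, count how often each vocabulary item is masked, and convert those counts into per-token masking rates. The normalization used here (rescaling so that the expected corpus-level rate matches a global budget) is an illustrative assumption.

from collections import Counter

def token_masking_rates(corpus, pmi, global_rate=0.15):
    """One-off preprocessing: estimate a masking rate per token type from
    how often sample-and-score masks it, so that later epochs can mask
    each token independently at random-masking speed."""
    masked_counts, total_counts = Counter(), Counter()
    for tokens in corpus:
        chosen = sample_and_score(tokens, pmi)
        total_counts.update(tokens)
        masked_counts.update(tokens[i] for i in chosen)
    freq = {t: masked_counts[t] / total_counts[t] for t in total_counts}
    # Rescale so the expected overall masking rate equals global_rate.
    mean = sum(freq[t] * total_counts[t] for t in freq) / sum(total_counts.values())
    return {t: min(1.0, freq[t] * global_rate / mean) for t in freq}

At training time, each occurrence of a token is then masked independently with its estimated rate, so the per-step cost matches that of random masking.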
To verify the effectiveness of our proposed method, we conduct extensive experiments on two knowledge-intensive tasks: factual recall and question answering. On the factual recall benchmark LAMA (Petroni et al., 2019), InforMask outperforms other masking strategies by a large margin. In addition, our base-size model, InformBERT, trained on the same corpus for the same number of epochs as BERT (Devlin et al., 2019), outperforms BERT-base on the question answering benchmark SQuAD (Rajpurkar et al., 2016, 2018). Notably, on the LAMA