
Retrieval Oriented Masking Pre-training
Language Model for Dense Passage Retrieval
Dingkun Long, Yanzhao Zhang, Guangwei Xu, Pengjun Xie
Alibaba Group
dingkun.ldk,zhangyanzhao.zyz@alibaba-inc.com
kunka.xgw,pengjun.xpj@alibaba-inc.com
Abstract
Pre-trained language models (PTMs) have been shown to yield powerful text representations for the dense passage retrieval task. Masked Language Modeling (MLM) is a major sub-task of the pre-training process. However, we find that the conventional random masking strategy tends to select a large number of tokens that have limited effect on the passage retrieval task (e.g., stop-words and punctuation). Noticing that term importance weights can provide valuable information for passage retrieval, we propose an alternative Retrieval-Oriented Masking (dubbed ROM) strategy in which more important tokens have a higher probability of being masked out, so that this straightforward yet essential information is captured to facilitate the language model pre-training process. Notably, the proposed token masking method does not change the architecture or learning objective of the original PTM. Our experiments verify that the proposed ROM enables term importance information to aid language model pre-training, thus achieving better performance on multiple passage retrieval benchmarks.
1 Introduction
Dense passage retrieval has drawn much attention recently due to its benefits to a wide range of downstream applications, such as open-domain question answering (Karpukhin et al., 2020; Qu et al., 2021; Zhu et al., 2021), conversational systems (Yu et al., 2021) and web search (Lin et al., 2021; Fan et al., 2021; Long et al., 2022). To balance efficiency and effectiveness, existing dense passage retrieval methods usually leverage a dual-encoder architecture. Specifically, the query and passage are encoded into continuous vector representations by language models (LMs) respectively; then, a score function is applied to estimate the semantic similarity between the query-passage pair.
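To make the dual-encoder setup concrete, the following is a minimal sketch (not the authors' code): query and passage are encoded separately and scored with a dot product. The encoder checkpoint and the [CLS]-pooling choice here are illustrative assumptions, not details taken from the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Use the [CLS] hidden state as the text representation (a common choice).
        return encoder(**batch).last_hidden_state[:, 0]

query_vec = encode(["what is dense passage retrieval"])
passage_vecs = encode([
    "Dense retrieval encodes queries and passages into vectors.",
    "Stop-words carry little retrieval signal.",
])
scores = query_vec @ passage_vecs.T  # dot-product similarity as the score function
print(scores)
```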
Based on the dual-encoder architecture, various optimization methods have been proposed recently, including hard negative training example mining (Xiong et al., 2021), optimized PTMs specially designed for dense retrieval (Gao and Callan, 2021, 2022; Ma et al., 2022), and alternative text representation methods or fine-tuning strategies (Karpukhin et al., 2020; Zhang et al., 2022a, 2021). In this paper, we focus on the pre-trained language model component. We observe that the widely adopted random token masking MLM pre-training objective is sub-optimal for the dense passage retrieval task. Previous studies show that introducing the weight of each term (or token) to assist in estimating query-passage relevance is effective in both the passage retrieval and ranking stages (Dai and Callan, 2020; Ma et al., 2021; Wu et al., 2022). However, the random masking strategy does not distinguish the importance of tokens. Further, we find that about 40% of the masked tokens produced by the 15% random masking method are stop-words or punctuation¹, even though the effect of these tokens on passage retrieval is extremely limited (Fawcett et al., 2020). Therefore, we infer that LMs pre-trained with the random token masking MLM objective are sub-optimal for dense passage retrieval due to their shortcoming in distinguishing token importance.
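A rough way to check this kind of statistic (a sketch, not the authors' measurement script) is to sample 15% of tokens uniformly and count how many are stop-words or punctuation. The corpus, the whitespace tokenization, and the use of the nltk stop-word list (the footnote mentions nltk and gensim lists) are assumptions for illustration.

```python
import random
import string
import nltk

nltk.download("stopwords", quiet=True)
stop_set = set(nltk.corpus.stopwords.words("english")) | set(string.punctuation)

def masked_stopword_fraction(passages, mask_rate=0.15, seed=0):
    """Fraction of uniformly masked tokens that are stop-words or punctuation."""
    rng = random.Random(seed)
    masked = stop_hits = 0
    for passage in passages:
        for token in passage.lower().split():      # naive whitespace tokenization
            if rng.random() < mask_rate:           # 15% uniform random masking
                masked += 1
                if token in stop_set or token.strip(string.punctuation) in stop_set:
                    stop_hits += 1
    return stop_hits / max(masked, 1)

passages = [
    "The quick brown fox jumps over the lazy dog.",
    "Dense retrieval depends on informative content words.",
]
print(masked_stopword_fraction(passages))
```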
To address the limitation above, we propose an alternative Retrieval-Oriented Masking (ROM) strategy that aims to mask the tokens that matter for passage retrieval. Specifically, during the pre-training of the LM, the probability of each token being masked is not uniform but is superimposed with the importance weight of the corresponding token. Here, the importance weight is represented as a float number between 0 and 1. In this way, we greatly increase the probability of higher-weight tokens being masked out. Consequently, the pre-trained language model pays more attention to higher-weight words, making it more suitable for downstream dense passage retrieval applications.
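The sketch below illustrates the ROM idea, assuming per-token importance weights in [0, 1] are already available (e.g., produced by a separate term-weighting model): the base masking rate is boosted by each token's weight, so high-weight tokens are masked more often. The function names, example weights, and the exact combination rule are assumptions, not the paper's specification.

```python
import random

def rom_mask(tokens, weights, base_rate=0.15, mask_token="[MASK]", seed=0):
    """Mask tokens with a probability superimposed with their importance weight."""
    rng = random.Random(seed)
    masked = []
    for token, weight in zip(tokens, weights):
        # Hypothetical combination rule: boost the base rate by the token weight.
        p_mask = min(base_rate + weight * base_rate, 1.0)
        masked.append(mask_token if rng.random() < p_mask else token)
    return masked

tokens = ["dense", "retrieval", "of", "passages", "."]
weights = [0.9, 0.8, 0.05, 0.7, 0.0]   # hypothetical importance weights
print(rom_mask(tokens, weights))
```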
¹We used the nltk and gensim stop-word lists.