Retrieval Oriented Masking Pre-training
Language Model for Dense Passage Retrieval
Dingkun Long, Yanzhao Zhang, Guangwei Xu, Pengjun Xie
Alibaba Group
dingkun.ldk,zhangyanzhao.zyz@alibaba-inc.com
kunka.xgw,pengjun.xpj@alibaba-inc.com
Abstract
Pre-trained language models (PTMs) have been shown to yield powerful text representations for the dense passage retrieval task. Masked Language Modeling (MLM) is a major sub-task of the pre-training process. However, we found that the conventional random masking strategy tends to select a large number of tokens that have limited effect on the passage retrieval task (e.g., stop-words and punctuation). Noticing that term importance weights can provide valuable information for passage retrieval, we propose an alternative retrieval oriented masking (dubbed ROM) strategy in which more important tokens have a higher probability of being masked out, so that this straightforward yet essential information can facilitate the language model pre-training process. Notably, the proposed token masking method does not change the architecture or learning objective of the original PTM. Our experiments verify that the proposed ROM enables term importance information to help language model pre-training, thus achieving better performance on multiple passage retrieval benchmarks.
1 Introduction
Dense passage retrieval has drawn much attention recently due to its benefits to a wide range of downstream applications, such as open-domain question answering (Karpukhin et al., 2020; Qu et al., 2021; Zhu et al., 2021), conversational systems (Yu et al., 2021) and web search (Lin et al., 2021; Fan et al., 2021; Long et al., 2022). To balance efficiency and effectiveness, existing dense passage retrieval methods usually leverage a dual-encoder architecture. Specifically, the query and the passage are separately encoded into continuous vector representations by language models (LMs); a scoring function is then applied to estimate the semantic similarity of the query-passage pair.
Based on the dual-encoder architecture, various optimization methods have been proposed recently, including mining hard negative training examples (Xiong et al., 2021), optimized PTMs specially designed for dense retrieval (Gao and Callan, 2021, 2022; Ma et al., 2022) and alternative text representation methods or fine-tuning strategies (Karpukhin et al., 2020; Zhang et al., 2022a, 2021). In this paper, we focus on the pre-trained language model component. We observe that the widely adopted random-token-masking MLM pre-training objective is sub-optimal for the dense passage retrieval task. According to previous studies, introducing the weight of each term (or token) to assist in estimating query-passage relevance is effective in both the passage retrieval and ranking stages (Dai and Callan, 2020; Ma et al., 2021; Wu et al., 2022). However, the random masking strategy does not distinguish the importance of tokens. Further, we find that about 40% of the tokens masked by the 15% random masking method are stop-words or punctuation¹, even though the effect of these tokens on passage retrieval is extremely limited (Fawcett et al., 2020). We therefore infer that LMs pre-trained with the random-token-masking MLM objective are sub-optimal for dense passage retrieval because they fall short in distinguishing token importance.
To address this limitation, we propose an alternative retrieval oriented masking (ROM) strategy that aims to mask the tokens that matter for passage retrieval. Specifically, during LM pre-training, the probability of each token being masked is not purely random but is augmented by the importance weight of the corresponding token. Here, the importance weight is represented as a floating-point number between 0 and 1. In this way, we greatly increase the probability of higher-weight tokens being masked out. As a result, the pre-trained language model pays more attention to higher-weight words, making it more suitable for downstream dense passage retrieval applications.
¹We used the nltk and gensim stop-word lists.
To verify the effectiveness and robustness of our proposed retrieval oriented masking method, we conduct experiments on two commonly used passage retrieval benchmarks: the MS MARCO passage ranking and Natural Questions (NQ) datasets. Empirical results demonstrate that our method remarkably improves passage retrieval performance.
2 Related Work
Existing dense passage retrieval methods usually adopt a dual-encoder architecture. DPR (Karpukhin et al., 2020) first showed that a dense dual-encoder framework can remarkably outperform traditional term-matching methods such as BM25 in passage retrieval. Based on the dual-encoder framework, studies have explored various strategies to enhance dense retrieval models, including mining hard negatives in the fine-tuning stage (Xiong et al., 2021; Zhan et al., 2021), knowledge distillation from more powerful cross-encoder models (Ren et al., 2021; Zhang et al., 2021; Lu et al., 2022), data augmentation (Qu et al., 2021) and tailored PTMs (Chang et al., 2020; Gao and Callan, 2021, 2022; Ma et al., 2022; Liu and Shao, 2022; Wu et al., 2022).
For language model pre-training, previous research designs additional pre-training objectives tailored for dense passage retrieval (Lee et al., 2019; Chang et al., 2020) or adjusts the Transformer encoder architecture (Gao and Callan, 2021, 2022) to obtain more practicable language models. In this paper, we instead make a simple modification to the original MLM learning objective that improves model performance while reducing the complexity of the pre-training process.
3 Methodology
In this section, we describe our proposed pre-training method for the dense passage retrieval task. We first give a brief overview of the conventional BERT pre-training model with the MLM loss. We then introduce how to extend it to retrieval oriented masking pre-training.
3.1 BERT Pre-trained Model
MLM Pre-training
Many popular Transformer encoder language models (e.g., BERT, RoBERTa) adopt the MLM objective in the pre-training phase. MLM masks out a subset of input tokens and requires the model to predict them. Specifically, the MLM loss can be formulated as:
$$\mathcal{L}_{\text{mlm}} = \sum_{i \in \text{masked}} \mathrm{CrossEntropy}(W h_i^L, x_i),$$
where $h_i^L$ is the final representation of the masked token $x_i$ and $L$ is the number of Transformer layers.
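As a minimal sketch (not the authors' implementation), the loss above can be computed as follows; the names `hidden`, `W`, and `mask_positions` are assumptions standing in for the final-layer hidden states $h^L$, the LM-head vocabulary projection, and the boolean mask of selected positions.

```python
import torch
import torch.nn.functional as F

def mlm_loss(hidden: torch.Tensor,         # [batch, seq_len, d]  final-layer states h^L
             labels: torch.Tensor,         # [batch, seq_len]     original token ids x_i
             W: torch.Tensor,              # [vocab, d]           LM-head projection
             mask_positions: torch.Tensor  # [batch, seq_len]     bool, True at masked tokens
             ) -> torch.Tensor:
    h = hidden[mask_positions]             # representations of masked tokens only
    logits = h @ W.t()                     # [num_masked, vocab]
    targets = labels[mask_positions]       # the tokens the model must recover
    return F.cross_entropy(logits, targets)  # mean cross-entropy over masked positions
```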
Random Masking
In general, the selection of masked-out tokens is random, and the masking proportion in a sentence is set to 15%. Mathematically, for each token $x_i \in x$, the probability of $x_i$ being masked out, $p(x_i)$, is sampled from a uniform distribution between 0 and 1. If the value of $p(x_i)$ is in the top 15% of the entire input sequence, then $x_i$ is masked out.
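The following sketch restates this selection rule in code (an illustration, not the original implementation): a uniform score is drawn per token and the tokens in the top 15% of scores are masked.

```python
import torch

def random_mask(num_tokens: int, mask_ratio: float = 0.15) -> torch.Tensor:
    """Boolean mask over num_tokens positions: sample p(x_i) ~ U(0, 1)
    and mask the tokens whose scores fall in the top mask_ratio fraction."""
    p = torch.rand(num_tokens)                       # p(x_i) for every token
    k = max(1, int(round(mask_ratio * num_tokens)))  # number of tokens to mask
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[torch.topk(p, k).indices] = True
    return mask
```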
3.2 Disadvantages of Random Masking
The significant issue with the random masking method is that it does not distinguish the importance weight of each token. Statistical analysis shows that about 40% of the tokens masked by the random masking strategy are stop-words or punctuation. As shown in previous studies, it is valuable to distinguish the weights of different terms for passage retrieval. For both the query and the passage, terms with higher importance weights should contribute more to the query-passage relevance estimation process. Although the pre-trained language model itself is context-aware, we still want the language model to be better at distinguishing term importance for the retrieval task. However, a language model trained with the random masking strategy falls short in this respect.
3.3 Retrieval Oriented Masking
As mentioned above, term importance is instructive for passage retrieval. Here, we explore introducing term importance into MLM training. More specifically, we incorporate the term importance information into token masking. Different from the random masking strategy, whether a token $x_i$ is masked is determined not only by the random probability $p_r(x_i)$ but also by its term weight $p_w(x_i)$. Here, $p_w(x_i)$ is normalized to a value between 0 and 1. The final probability of token $x_i$ being masked out is $p_r(x_i) + p_w(x_i)$.
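A minimal sketch of this combined rule, assuming the term weights $p_w(x_i)$ are already available as a tensor (how they are obtained is discussed next): the random score and the term weight are summed, and the top 15% of tokens by combined score are masked.

```python
import torch

def rom_mask(term_weights: torch.Tensor, mask_ratio: float = 0.15) -> torch.Tensor:
    """term_weights: p_w(x_i) in [0, 1], one value per token.
    Tokens are ranked by p_r(x_i) + p_w(x_i), so higher-weight tokens
    are more likely to land in the masked top mask_ratio fraction."""
    pr = torch.rand_like(term_weights)       # random component p_r(x_i)
    score = pr + term_weights                # combined masking score
    k = max(1, int(round(mask_ratio * term_weights.numel())))
    mask = torch.zeros_like(term_weights, dtype=torch.bool)
    mask[torch.topk(score, k).indices] = True
    return mask
```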
The problem then becomes how to calculate the term weight of each token. Previous studies have proposed different methods to calculate word weights (Mallia et al., 2021; Ma et al., 2021), which can be roughly divided into unsupervised and supervised categories. To maintain the unsupervised
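As one hedged illustration of an unsupervised weighting (assumed here for concreteness, not necessarily the weighting adopted in this work), a corpus-level inverse document frequency can be min-max normalized to [0, 1] and used as $p_w(x_i)$:

```python
import math
from collections import Counter

def idf_weights(corpus: list[list[str]], passage: list[str]) -> list[float]:
    """Illustrative p_w(x_i): smoothed IDF over the corpus, min-max
    normalized to [0, 1] within the given passage."""
    n_docs = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))   # document frequency
    idf = [math.log((n_docs + 1) / (df.get(t, 0) + 1)) + 1.0 for t in passage]
    lo, hi = min(idf), max(idf)
    return [(v - lo) / (hi - lo) if hi > lo else 0.5 for v in idf]
```

Under such a weighting, stop-words and punctuation receive low IDF and therefore a small masking boost, which matches the behavior the ROM strategy targets.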