
Retrieval Oriented Masking Pre-training
Language Model for Dense Passage Retrieval
Dingkun Long, Yanzhao Zhang, Guangwei Xu, Pengjun Xie
Alibaba Group
dingkun.ldk,zhangyanzhao.zyz@alibaba-inc.com
kunka.xgw,pengjun.xpj@alibaba-inc.com
Abstract
Pre-trained language models (PTMs) have been shown to yield powerful text representations for the dense passage retrieval task. Masked Language Modeling (MLM) is a major sub-task of the pre-training process. However, we find that the conventional random masking strategy tends to select a large number of tokens that have limited effect on the passage retrieval task (e.g., stop-words and punctuation). Noticing that term importance weights can provide valuable information for passage retrieval, we propose an alternative Retrieval-Oriented Masking (dubbed ROM) strategy in which more important tokens have a higher probability of being masked out, so that this straightforward yet essential information is captured to facilitate the language model pre-training process. Notably, the proposed token masking method does not change the architecture or learning objective of the original PTM. Our experiments verify that the proposed ROM enables term importance information to aid language model pre-training, thus achieving better performance on multiple passage retrieval benchmarks.
1 Introduction
Dense passage retrieval has drawn much attention recently due to its benefits to a wide range of downstream applications, such as open-domain question answering (Karpukhin et al., 2020; Qu et al., 2021; Zhu et al., 2021), conversational systems (Yu et al., 2021) and web search (Lin et al., 2021; Fan et al., 2021; Long et al., 2022). To balance efficiency and effectiveness, existing dense passage retrieval methods usually leverage a dual-encoder architecture. Specifically, the query and passage are encoded into continuous vector representations by language models (LMs) respectively; then, a score function is applied to estimate the semantic similarity between the query-passage pair.
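To make the dual-encoder setup concrete, the following is a minimal sketch (not the authors' code): query and passage are encoded separately and scored with a dot product. The encoder checkpoint and the [CLS]-pooling choice here are illustrative assumptions, not details taken from the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Use the [CLS] hidden state as the text representation (a common choice).
        return encoder(**batch).last_hidden_state[:, 0]

query_vec = encode(["what is dense passage retrieval"])
passage_vecs = encode([
    "Dense retrieval encodes queries and passages into vectors.",
    "Stop-words carry little retrieval signal.",
])
scores = query_vec @ passage_vecs.T  # dot-product similarity as the score function
print(scores)
```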
Based on the dual-encoder architecture, various optimization methods have been proposed recently, including hard negative training example mining (Xiong et al., 2021), optimized PTMs specially designed for dense retrieval (Gao and Callan, 2021, 2022; Ma et al., 2022), and alternative text representation methods or fine-tuning strategies (Karpukhin et al., 2020; Zhang et al., 2022a, 2021). In this paper, we focus on the pre-trained language model component. We observe that the widely adopted random token masking MLM pre-training objective is sub-optimal for the dense passage retrieval task. Previous studies show that introducing the weight of each term (or token) to assist in estimating query-passage relevance is effective in both the passage retrieval and ranking stages (Dai and Callan, 2020; Ma et al., 2021; Wu et al., 2022). However, the random masking strategy does not distinguish the importance of tokens. Further, we find that about 40% of the masked tokens produced by the 15% random masking method are stop-words or punctuation¹, even though the effect of these tokens on passage retrieval is extremely limited (Fawcett et al., 2020). Therefore, we infer that LMs pre-trained with the random token masking MLM objective are sub-optimal for dense passage retrieval due to their shortcoming in distinguishing token importance.
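A rough way to check this kind of statistic (a sketch, not the authors' measurement script) is to sample 15% of tokens uniformly and count how many are stop-words or punctuation. The corpus, the whitespace tokenization, and the use of the nltk stop-word list (the footnote mentions nltk and gensim lists) are assumptions for illustration.

```python
import random
import string
import nltk

nltk.download("stopwords", quiet=True)
stop_set = set(nltk.corpus.stopwords.words("english")) | set(string.punctuation)

def masked_stopword_fraction(passages, mask_rate=0.15, seed=0):
    """Fraction of uniformly masked tokens that are stop-words or punctuation."""
    rng = random.Random(seed)
    masked = stop_hits = 0
    for passage in passages:
        for token in passage.lower().split():      # naive whitespace tokenization
            if rng.random() < mask_rate:           # 15% uniform random masking
                masked += 1
                if token in stop_set or token.strip(string.punctuation) in stop_set:
                    stop_hits += 1
    return stop_hits / max(masked, 1)

passages = [
    "The quick brown fox jumps over the lazy dog.",
    "Dense retrieval depends on informative content words.",
]
print(masked_stopword_fraction(passages))
```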
To address the limitation above, we propose an alternative Retrieval-Oriented Masking (ROM) strategy that aims to mask the tokens that matter for passage retrieval. Specifically, during the pre-training of the LM, the probability of each token being masked is not uniform but is superimposed with the importance weight of the corresponding token. Here, the importance weight is represented as a float number between 0 and 1. In this way, we greatly increase the probability of higher-weight tokens being masked out. Consequently, the pre-trained language model pays more attention to higher-weight words, making it more suitable for downstream dense passage retrieval applications.
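The sketch below illustrates the ROM idea, assuming per-token importance weights in [0, 1] are already available (e.g., produced by a separate term-weighting model): the base masking rate is boosted by each token's weight, so high-weight tokens are masked more often. The function names, example weights, and the exact combination rule are assumptions, not the paper's specification.

```python
import random

def rom_mask(tokens, weights, base_rate=0.15, mask_token="[MASK]", seed=0):
    """Mask tokens with a probability superimposed with their importance weight."""
    rng = random.Random(seed)
    masked = []
    for token, weight in zip(tokens, weights):
        # Hypothetical combination rule: boost the base rate by the token weight.
        p_mask = min(base_rate + weight * base_rate, 1.0)
        masked.append(mask_token if rng.random() < p_mask else token)
    return masked

tokens = ["dense", "retrieval", "of", "passages", "."]
weights = [0.9, 0.8, 0.05, 0.7, 0.0]   # hypothetical importance weights
print(rom_mask(tokens, weights))
```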
¹We used the nltk and gensim stop-word lists.