SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval
Kun Zhou1,3†, Yeyun Gong4, Xiao Liu4, Wayne Xin Zhao2,3, Yelong Shen5, Anlei Dong5, Jingwen Lu5, Rangan Majumder5, Ji-Rong Wen2,3, Nan Duan4, Weizhu Chen5
1School of Information, Renmin University of China
2Gaoling School of Artificial Intelligence, Renmin University of China
3Beijing Key Laboratory of Big Data Management and Analysis Methods
4Microsoft Research, 5Microsoft
†This work was done during an internship at MSRA.
Corresponding author, email: batmanfly@gmail.com.
Abstract
Sampling proper negatives from a large document pool is vital to effectively train a dense retrieval model. However, existing negative sampling strategies suffer from the uninformative or false negative problem. In this work, we empirically show that, according to the measured relevance scores, the negatives ranked around the positives are generally more informative and less likely to be false negatives. Intuitively, these negatives are neither too hard (possibly false negatives) nor too easy (uninformative). They are the ambiguous negatives and need more attention during training. Thus, we propose a simple ambiguous negatives sampling method, SimANS, which incorporates a new sampling probability distribution to sample more ambiguous negatives. Extensive experiments on four public datasets and one industry dataset show the effectiveness of our approach. We have made the code and models publicly available at https://github.com/microsoft/SimXNS.
1 Introduction
Dense text retrieval, which uses low-dimensional vectors to represent queries and documents and measure their relevance, has become a popular topic (Karpukhin et al., 2020; Luan et al., 2021) for both researchers and practitioners. It can improve various downstream applications, e.g., web search (Brickley et al., 2019; Qiu et al., 2022) and question answering (Izacard and Grave, 2021). A key challenge for training a dense text retrieval model is how to select appropriate negatives from a large document pool (i.e., negative sampling), as most existing methods use a contrastive loss (Karpukhin et al., 2020; Xiong et al., 2021) to encourage the model to rank positive documents higher than negatives. However, the commonly used negative sampling strategies, namely random negative sampling (Luan et al., 2021; Karpukhin et al., 2020) (using random documents in the same batch) and top-k hard negatives sampling (Xiong et al., 2021; Zhan et al., 2021) (using an auxiliary retriever to obtain the top-k documents), have their limitations. Random negative sampling tends to select uninformative negatives that are rather easy to distinguish from positives and fail to provide useful information (Xiong et al., 2021), while top-k hard negatives sampling may include false negatives (Qu et al., 2021), degrading the model performance.
Motivated by these problems, we propose to sample the ambiguous negatives1 that are neither too easy (uninformative) nor too hard (potential false negatives). Our approach is inspired by an empirical observation from experiments (in §3) using gradients to assess the impact of data instances on deep models (Koh and Liang, 2017; Pruthi et al., 2020): according to the relevance scores measured by the dense retrieval model, negatives that rank lower are mostly uninformative, as their gradient means are close to zero; negatives that rank higher are likely to be false negatives, as their gradient variances are significantly higher than expected. Both types of negatives are detrimental to the convergence of deep matching models (Xiong et al., 2021; Qu et al., 2021). Interestingly, we find that the negatives ranked around positive examples tend to have relatively larger gradient means and smaller variances, indicating that they are informative and have a lower risk of being false negatives, thus probably being high-quality ambiguous negatives.

1We call them ambiguous negatives following the definition of ambiguous examples (Swayamdipta et al., 2020; Meissner et al., 2021), referring to the instances that are neither too hard nor too easy to learn.
Based on these insights, we propose a Simple Ambiguous Negative Sampling method, namely SimANS, for improving dense text retrieval. Our main idea is to design a sampling probability distribution that assigns higher probabilities to the ambiguous negatives and lower probabilities to the possible false and uninformative negatives, based on the differences of the relevance scores between positives and candidate negatives. We also incorporate two hyper-parameters to better adjust the peak and density of the sampling probability distribution. Our approach is simple and flexible: it can be easily applied to various dense retrieval models and combined with other effective techniques, e.g., knowledge distillation (Qu et al., 2021) and adversarial training (Zhang et al., 2021).
To validate the effectiveness of SimANS, we conduct extensive experiments on four public datasets and one industrial dataset collected from Bing search logs. Experimental results show that SimANS can improve the performance of competitive baselines, including state-of-the-art methods.
2 Preliminary
Dense Text Retrieval. Given a query $q$, the dense text retrieval task aims to retrieve the most relevant top-$k$ documents $\{d_i\}_{i=1}^{k}$ from a large candidate pool $\mathcal{D}$. To achieve this, the dual-encoder architecture is widely used due to its efficiency (Reimers and Gurevych, 2019; Karpukhin et al., 2020). It consists of a query encoder $E_q$ and a document encoder $E_d$ that map the query $q$ and document $d$ into $k$-dimensional dense vectors $\mathbf{h}_q$ and $\mathbf{h}_d$, respectively. Then, the semantic relevance score of $q$ and $d$ can be computed using the dot product as
$$s(q, d) = \mathbf{h}_q \cdot \mathbf{h}_d. \quad (1)$$
Recent works mostly adopt pre-trained language models (PLMs) (Devlin et al., 2019) as the two encoders, and utilize the representations of the [CLS] token as the dense vectors.
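For concreteness, here is a minimal sketch of such a dual-encoder scorer in PyTorch. The checkpoint choice (bert-base-uncased) and the helper names are our own illustrative assumptions; the paper only specifies that PLM [CLS] representations serve as the dense vectors and are scored with a dot product (Eq. 1).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative backbone; the paper uses PLM encoders but this checkpoint is an assumption.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
query_encoder = AutoModel.from_pretrained("bert-base-uncased")
doc_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    # Map a list of texts to dense vectors using the [CLS] representation.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]

def relevance(queries, docs):
    # s(q, d) = h_q · h_d for every query-document pair (Eq. 1).
    h_q = encode(query_encoder, queries)   # (num_queries, hidden_size)
    h_d = encode(doc_encoder, docs)        # (num_docs, hidden_size)
    return h_q @ h_d.T                     # (num_queries, num_docs)
```

At retrieval time the document vectors are typically pre-computed and indexed, so only the query needs to be encoded online.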
Training with Negative Sampling. The training objective of the dense text retrieval task is to pull the representations of the query $q$ and the relevant documents $\mathcal{D}^{+}$ together (as positives), while pushing apart the irrelevant ones $\mathcal{D}^{-} = \mathcal{D} \setminus \mathcal{D}^{+}$ (as negatives). However, the irrelevant documents come from a large document pool, which would lead to millions of negatives. To reduce the prohibitive training cost, negative sampling has been widely used. Previous works either randomly sample negatives (Karpukhin et al., 2020), or select the top-$k$ hard negatives ranked by BM25 or the dense retrieval model itself (Xiong et al., 2021; Qu et al., 2021), denoted as $\widetilde{\mathcal{D}}^{-}$. Then, the optimization objective can be formulated as
$$\theta^{*} = \arg\min_{\theta} \sum_{q} \sum_{d^{+} \in \mathcal{D}^{+}} \sum_{d^{-} \in \widetilde{\mathcal{D}}^{-}} \mathcal{L}\big(s(q, d^{+}), s(q, d^{-})\big), \quad (2)$$
where $\mathcal{L}(\cdot)$ is the loss function.
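A common instantiation of $\mathcal{L}$ is the softmax-normalized cross-entropy over each positive and its sampled negatives, which is the normalization assumed in §3.1. The sketch below is ours (function name and tensor shapes are assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pos_scores, neg_scores):
    # pos_scores: (batch,) score s(q, d+) for each query's positive document
    # neg_scores: (batch, num_negatives) scores s(q, d-) for the sampled negatives
    logits = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    # Cross-entropy over softmax-normalized scores: the positive (index 0) should outrank the negatives.
    return F.cross_entropy(logits, labels)
```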
3 Motivation Study
We first analyze the uninformative and false negative problems from the perspective of gradients. Then, we perform an empirical study to test how the gradients of negatives change w.r.t. their ranks according to the relevance scores measured by a dense retrieval model, and find that the gradients of negatives ranked near positives have relatively larger means and smaller variances.
3.1 Analysis for Gradients of Negatives
Existing dense retrieval methods (Karpukhin et al., 2020; Xiong et al., 2021) commonly incorporate the binary cross-entropy (BCE) loss to compute gradients2, where the relevance scores of a positive and the sampled negatives are usually normalized by the softmax function. In this way, the gradients of the model parameters $\theta$ are computed as
$$\nabla_{\theta} l(q, d) = \begin{cases} \big(s_n(q, d) - 1\big) \, \nabla_{\theta} s_n(q, d), & \text{if } d \in \mathcal{D}^{+} \\ s_n(q, d) \, \nabla_{\theta} s_n(q, d), & \text{if } d \in \mathcal{D}^{-} \end{cases}$$
where $s_n(q, d)$ is the normalized value of $s(q, d)$ and lies within $[0, 1]$. Based on this, we review the gradients of uninformative and false negatives. Uninformative negatives can be easily distinguished by dense retrieval models, and are more likely to be selected by random sampling (Xiong et al., 2021). As their normalized relevance scores are usually rather small, i.e., $s_n(q, d) \approx 0$, their gradient means will be bounded to near-zero values, i.e., $\nabla_{\theta} l(q, d) \approx 0$. Such near-zero gradients are also uninformative and contribute little to model convergence. False negatives are usually semantically similar to positives, and are more likely to be selected by top-$k$ hard negatives sampling (Qu et al., 2021). Therefore, for the gradients of false negatives and positives, the right terms $\nabla_{\theta} s_n(q, d)$ may be similar, while the left terms are greater than zero and less than zero, respectively. As a result, the variance of the gradients will be larger, which may make the optimization of the parameters unstable.
Furthermore, existing works (Katharopoulos and Fleuret, 2018; Johnson and Guestrin, 2018) have theoretically proved that a larger gradient variance is detrimental to model convergence.

2In this work, we perform the analysis using the BCE loss; the analysis can also be extended to other loss functions.

Figure 1: The mean and variance of gradients w.r.t. the ranks of negatives on the MS-MARCO Passage Ranking dataset, using AR2 (Zhang et al., 2021). (The plot shows the normalized gradient mean and normalized gradient variance against the rank of negatives, with the mean rank of positives marked.)
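The per-score factor in the gradient above ($s_n - 1$ for the positive, $s_n$ for a negative) can be checked numerically with autograd. The snippet below is an illustrative sketch, not code from the paper; it differentiates the softmax cross-entropy loss with respect to the raw scores rather than the full model parameters.

```python
import torch
import torch.nn.functional as F

# One query with its positive at index 0 and three sampled negatives.
scores = torch.tensor([2.0, 1.5, 0.2, -1.0], requires_grad=True)
loss = F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))
loss.backward()

s_n = F.softmax(scores.detach(), dim=0)
expected = s_n.clone()
expected[0] -= 1.0  # s_n - 1 for the positive, s_n for each negative
print(scores.grad)   # matches `expected`: near-zero for easy negatives, large magnitude for hard ones
```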
3.2 Empirical Study on Gradients of Negatives w.r.t. Relevance Scores
Although the above analysis shows that the harmful influence of uninformative and false negatives derives from their small gradient means and large gradient variances, respectively, it is time-consuming to compute the gradients of all candidate negatives in order to identify and remove them. Here, we empirically study whether the query-document relevance scores can be leveraged to avoid sampling these harmful negatives.
Experimental Setup. We use AR2 (Zhang et al., 2021) as the retrieval model and investigate its gradients on the development set of the MS-MARCO Passage Ranking dataset (Nguyen et al., 2016). Concretely, for each query, we rank all negatives according to their relevance scores, and compute the means and variances of the gradients of all negatives at the same rank3. To better show the tendency w.r.t. the ranks of relevance scores, we normalize the means and variances of the gradients by dividing by the maximum values, and only report the results for the top 200 ranked negatives.
Results and Findings. As shown in Figure 1, the mean and variance of the gradients gradually decrease as the rank increases. Despite that, the gradient means of the top 200 negatives remain within the same order of magnitude (roughly 1.0 down to 0.25 after normalization), while the gradient variances of the top 10 ranked negatives are significantly larger than the others. The reason is that higher-ranking negatives are more likely to be false negatives. Besides, a surprising finding is that the mean rank of positives is approximately the boundary point of the high-gradient-variance region, and the negatives near it produce relatively larger gradient means and lower gradient variances. This means that they are high-quality ambiguous negatives that balance informativeness against the risk of being false negatives. Therefore, it is promising to rely on the relevance scores of positives and candidate negatives to devise more effective negative sampling methods for training dense retrieval models.

3As AR2 adopts ERNIE-2.0 (Sun et al., 2020) as the backbone, which has millions of parameters, we only compute gradients on the parameters of its last layer for efficiency.
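As a sketch of this per-rank bookkeeping, the statistics behind Figure 1 amount to the following (synthetic numbers stand in for the last-layer gradient magnitudes measured on MS-MARCO dev queries; the array names are ours):

```python
import numpy as np

num_queries, top_k = 1000, 200
# Placeholder per-negative gradient magnitudes, ordered by relevance-score rank for each query.
grad_by_rank = np.abs(np.random.randn(num_queries, top_k))

per_rank_mean = grad_by_rank.mean(axis=0)   # mean over queries at each rank
per_rank_var = grad_by_rank.var(axis=0)     # variance over queries at each rank

# Normalize by the maximum before plotting against rank, as in the study.
per_rank_mean /= per_rank_mean.max()
per_rank_var /= per_rank_var.max()
```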
4 Approach
Based on the findings in §3, we conjecture that the ambiguous negatives ranked near positives according to relevance scores are high-quality negatives, as they are neither too easy (uninformative) nor too hard (possibly false negatives). Therefore, we propose a simple ambiguous negative sampling method, namely SimANS.
4.1 Ambiguous Negative Sampling
To focus on sampling ambiguous negatives, we design a new sampling probability distribution that estimates the influence of each negative using the dense retrieval model itself. In what follows, we first devise a general sampling distribution and then propose a simple and efficient implementation of it.
General Sampling Distribution. Our results suggest the following principles for choosing a good negative sampling probability distribution: (1) negatives that are clearly irrelevant and have low relevance scores should be sampled less frequently; (2) negatives that are highly relevant and have high relevance scores should also be sampled less frequently, because they are more likely to be positives in disguise; (3) negatives that are uncertain and have relevance scores similar to the positives should be sampled more frequently, because they provide useful information and have a lower chance of being false negatives. We propose a general formula for the negative sampling probability that reflects these principles:
$$p_i \propto f\big(\,\big|\,s(q, d_i) - \bar{s}(q, d^{+}) - b\,\big|\,\big), \quad \forall d_i \in \mathcal{D} \setminus \mathcal{D}^{+}, \quad (3)$$
where $f(\cdot)$ is a function that determines the tendency of the probability distribution, $b$ is a hyper-parameter that controls the peak of the distribution, and $\bar{s}(q, d^{+})$ is the mean relevance score of all positives for the query. $f(\cdot)$ should be a monotonically decreasing function of its argument, so that negatives whose relevance scores lie close to the (shifted) mean positive score receive the highest sampling probabilities.
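The excerpt ends before the concrete choice of $f$; as one hedged illustration of Eq. (3), the sketch below assumes an exponential decay $f(x) = \exp(-a \cdot x)$, with $a$ controlling the density (sharpness) of the distribution and $b$ its peak offset, matching the two hyper-parameters mentioned in §1. The function and argument names are ours, not the paper's.

```python
import numpy as np

def simans_sample(neg_scores, pos_scores, num_samples, a=1.0, b=0.0, seed=0):
    """Sample negative indices following Eq. (3) with an assumed f(x) = exp(-a * x)."""
    rng = np.random.default_rng(seed)
    s_pos = float(np.mean(pos_scores))                             # \bar{s}(q, d+)
    weights = np.exp(-a * np.abs(np.asarray(neg_scores) - s_pos - b))
    probs = weights / weights.sum()
    return rng.choice(len(neg_scores), size=num_samples, replace=False, p=probs)

# Example: negatives scoring far above or far below the positives are rarely chosen.
neg_scores = [9.5, 8.0, 7.1, 5.0, 2.0, 0.5]
pos_scores = [7.0, 7.4]
print(simans_sample(neg_scores, pos_scores, num_samples=3))
```

In practice the candidate negatives would be the top-ranked documents retrieved for the query, and the sampled indices feed the training objective in Eq. (2).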