SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval
Kun Zhou1,3†, Yeyun Gong4, Xiao Liu4, Wayne Xin Zhao2,3, Yelong Shen5, Anlei Dong5, Jingwen Lu5, Rangan Majumder5, Ji-Rong Wen2,3, Nan Duan4, Weizhu Chen5
1School of Information, Renmin University of China
2Gaoling School of Artificial Intelligence, Renmin University of China
3Beijing Key Laboratory of Big Data Management and Analysis Methods
4Microsoft Research, 5Microsoft
†This work was done during an internship at MSRA.
Corresponding author, email: batmanfly@gmail.com.
Abstract
Sampling proper negatives from a large document pool is vital to effectively train a dense retrieval model. However, existing negative sampling strategies suffer from the uninformative or false negative problem. In this work, we empirically show that, according to the measured relevance scores, the negatives ranked around the positives are generally more informative and less likely to be false negatives. Intuitively, these negatives are neither too hard (possibly false negatives) nor too easy (uninformative). They are the ambiguous negatives and need more attention during training. Thus, we propose a simple ambiguous negatives sampling method, SimANS, which incorporates a new sampling probability distribution to sample more ambiguous negatives. Extensive experiments on four public datasets and one industry dataset show the effectiveness of our approach. We have made the code and models publicly available at https://github.com/microsoft/SimXNS.
1 Introduction
Dense text retrieval, which uses low-dimensional vectors to represent queries and documents and measure their relevance, has become a popular topic (Karpukhin et al., 2020; Luan et al., 2021) for both researchers and practitioners. It can improve various downstream applications, e.g., web search (Brickley et al., 2019; Qiu et al., 2022) and question answering (Izacard and Grave, 2021). A key challenge for training a dense text retrieval model is how to select appropriate negatives from a large document pool (i.e., negative sampling), as most existing methods use a contrastive loss (Karpukhin et al., 2020; Xiong et al., 2021) to encourage the model to rank positive documents higher than negatives. However, the commonly used negative sampling strategies, namely random negative sampling (Luan et al., 2021; Karpukhin et al., 2020) (using random documents in the same batch) and top-k hard negatives sampling (Xiong et al., 2021; Zhan et al., 2021) (using an auxiliary retriever to obtain the top-k documents), have their limitations. Random negative sampling tends to select uninformative negatives that are rather easy to distinguish from positives and fail to provide useful information (Xiong et al., 2021), while top-k hard negatives sampling may include false negatives (Qu et al., 2021), degrading the model performance.
Motivated by these problems, we propose to sample the ambiguous negatives1 that are neither too easy (uninformative) nor too hard (potential false negatives). Our approach is inspired by an empirical observation from experiments (in §3) using gradients to assess the impact of data instances on deep models (Koh and Liang, 2017; Pruthi et al., 2020): according to the relevance scores measured by the dense retrieval model, negatives that rank lower are mostly uninformative, as their gradient means are close to zero; negatives that rank higher are likely to be false negatives, as their gradient variances are significantly higher than expected. Both types of negatives are detrimental to the convergence of deep matching models (Xiong et al., 2021; Qu et al., 2021). Interestingly, we find that the negatives ranked around positive examples tend to have relatively larger gradient means and smaller variances, indicating that they are informative and have a lower risk of being false negatives, thus probably being high-quality ambiguous negatives.

1We call them ambiguous negatives following the definition of ambiguous examples (Swayamdipta et al., 2020; Meissner et al., 2021), referring to the instances that are neither too hard nor too easy to learn.
Based on these insights, we propose a Simple Ambiguous Negative Sampling method, namely SimANS, for improving dense text retrieval. Our main idea is to design a sampling probability distribution that assigns higher probabilities to the ambiguous negatives and lower probabilities to the possible false and uninformative negatives, based on the differences of the relevance scores between positives and candidate negatives. We also incorporate two hyper-parameters to better adjust the peak and density of the sampling probability distribution. Our approach is simple and flexible: it can be easily applied to various dense retrieval models and combined with other effective techniques, e.g., knowledge distillation (Qu et al., 2021) and adversarial training (Zhang et al., 2021).
To validate the effectiveness of SimANS, we conduct extensive experiments on four public datasets and one industrial dataset collected from Bing search logs. Experimental results show that SimANS can improve the performance of competitive baselines, including state-of-the-art methods.
2 Preliminary
Dense Text Retrieval. Given a query $q$, the dense text retrieval task aims to retrieve the most relevant top-$k$ documents $\{d_i\}_{i=1}^{k}$ from a large candidate pool $\mathcal{D}$. To achieve this, the dual-encoder architecture is widely used due to its efficiency (Reimers and Gurevych, 2019; Karpukhin et al., 2020). It consists of a query encoder $E_q$ and a document encoder $E_d$ that map the query $q$ and document $d$ into $k$-dimensional dense vectors $\mathbf{h}_q$ and $\mathbf{h}_d$, respectively. Then, the semantic relevance score of $q$ and $d$ can be computed using the dot product as
$$s(q, d) = \mathbf{h}_q \cdot \mathbf{h}_d. \quad (1)$$
Recent works mostly adopt pre-trained language models (PLMs) (Devlin et al., 2019) as the two encoders, and utilize the representations of the [CLS] token as the dense vectors.
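For concreteness, here is a minimal sketch of such a dual-encoder scorer in PyTorch. The checkpoint choice (bert-base-uncased) and the helper names are our own illustrative assumptions; the paper only specifies that PLM [CLS] representations serve as the dense vectors and are scored with a dot product (Eq. 1).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative backbone; the paper uses PLM encoders but this checkpoint is an assumption.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
query_encoder = AutoModel.from_pretrained("bert-base-uncased")
doc_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    # Map a list of texts to dense vectors using the [CLS] representation.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]

def relevance(queries, docs):
    # s(q, d) = h_q · h_d for every query-document pair (Eq. 1).
    h_q = encode(query_encoder, queries)   # (num_queries, hidden_size)
    h_d = encode(doc_encoder, docs)        # (num_docs, hidden_size)
    return h_q @ h_d.T                     # (num_queries, num_docs)
```

At retrieval time the document vectors are typically pre-computed and indexed, so only the query needs to be encoded online.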
Training with Negative Sampling. The training objective of the dense text retrieval task is to pull the representations of the query $q$ and the relevant documents $\mathcal{D}^{+}$ together (as positives), while pushing apart the irrelevant ones $\mathcal{D}^{-} = \mathcal{D} \setminus \mathcal{D}^{+}$ (as negatives). However, the irrelevant documents come from a large document pool, which would lead to millions of negatives. To reduce the prohibitive training cost, negative sampling has been widely used. Previous works either randomly sample negatives (Karpukhin et al., 2020), or select the top-$k$ hard negatives ranked by BM25 or the dense retrieval model itself (Xiong et al., 2021; Qu et al., 2021), denoted as $\widetilde{\mathcal{D}}^{-}$. Then, the optimization objective can be formulated as
$$\theta^{*} = \arg\min_{\theta} \sum_{q} \sum_{d^{+} \in \mathcal{D}^{+}} \sum_{d^{-} \in \widetilde{\mathcal{D}}^{-}} \mathcal{L}\big(s(q, d^{+}), s(q, d^{-})\big), \quad (2)$$
where $\mathcal{L}(\cdot)$ is the loss function.
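A common instantiation of $\mathcal{L}$ is the softmax-normalized cross-entropy over each positive and its sampled negatives, which is the normalization assumed in §3.1. The sketch below is ours (function name and tensor shapes are assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pos_scores, neg_scores):
    # pos_scores: (batch,) score s(q, d+) for each query's positive document
    # neg_scores: (batch, num_negatives) scores s(q, d-) for the sampled negatives
    logits = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    # Cross-entropy over softmax-normalized scores: the positive (index 0) should outrank the negatives.
    return F.cross_entropy(logits, labels)
```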
3 Motivation Study
We first analyze the uninformative and false negative problems from the perspective of gradients. Then, we perform an empirical study to test how the gradients of negatives change w.r.t. their ranks according to the relevance scores measured by a dense retrieval model, and find that the gradients of negatives ranked near positives have relatively larger means and smaller variances.
3.1 Analysis for Gradients of Negatives
Existing dense retrieval methods (Karpukhin et al., 2020; Xiong et al., 2021) commonly incorporate the binary cross-entropy (BCE) loss to compute gradients2, where the relevance scores of a positive and the sampled negatives are usually normalized by the softmax function. In this way, the gradients of the model parameters $\theta$ are computed as
$$\nabla_{\theta} l(q, d) = \begin{cases} \big(s_n(q, d) - 1\big) \, \nabla_{\theta} s_n(q, d), & \text{if } d \in \mathcal{D}^{+} \\ s_n(q, d) \, \nabla_{\theta} s_n(q, d), & \text{if } d \in \mathcal{D}^{-} \end{cases}$$
where $s_n(q, d)$ is the normalized value of $s(q, d)$ and lies within $[0, 1]$. Based on this, we review the gradients of uninformative and false negatives. Uninformative negatives can be easily distinguished by dense retrieval models, and are more likely to be selected by random sampling (Xiong et al., 2021). As their normalized relevance scores are usually rather small, i.e., $s_n(q, d) \approx 0$, their gradient means will be bounded to near-zero values, i.e., $\nabla_{\theta} l(q, d) \approx 0$. Such near-zero gradients are also uninformative and contribute little to model convergence. False negatives are usually semantically similar to positives, and are more likely to be selected by top-$k$ hard negatives sampling (Qu et al., 2021). Therefore, for the gradients of false negatives and positives, the right terms $\nabla_{\theta} s_n(q, d)$ may be similar, while the left terms are greater than zero and less than zero, respectively. As a result, the variance of the gradients will be larger, which may make the optimization of the parameters unstable.
Furthermore, existing works (Katharopoulos and Fleuret, 2018; Johnson and Guestrin, 2018) have theoretically proved that a larger gradient variance is detrimental to model convergence.

2In this work, we perform the analysis using the BCE loss; the analysis can also be extended to other loss functions.

Figure 1: The mean and variance of gradients w.r.t. the ranks of negatives on the MS-MARCO Passage Ranking dataset, using AR2 (Zhang et al., 2021). (The plot shows the normalized gradient mean and normalized gradient variance against the rank of negatives, with the mean rank of positives marked.)
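The per-score factor in the gradient above ($s_n - 1$ for the positive, $s_n$ for a negative) can be checked numerically with autograd. The snippet below is an illustrative sketch, not code from the paper; it differentiates the softmax cross-entropy loss with respect to the raw scores rather than the full model parameters.

```python
import torch
import torch.nn.functional as F

# One query with its positive at index 0 and three sampled negatives.
scores = torch.tensor([2.0, 1.5, 0.2, -1.0], requires_grad=True)
loss = F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))
loss.backward()

s_n = F.softmax(scores.detach(), dim=0)
expected = s_n.clone()
expected[0] -= 1.0  # s_n - 1 for the positive, s_n for each negative
print(scores.grad)   # matches `expected`: near-zero for easy negatives, large magnitude for hard ones
```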
3.2 Empirical Study on Gradients of Negatives w.r.t. Relevance Scores
Although the above analysis shows that the harmful influence of uninformative and false negatives derives from their small gradient means and large gradient variances, respectively, it is time-consuming to compute the gradients of all candidate negatives in order to identify and remove them. Here, we empirically study whether the query-document relevance scores can be leveraged to avoid sampling these harmful negatives.
Experimental Setup. We use AR2 (Zhang et al., 2021) as the retrieval model and investigate its gradients on the development set of the MS-MARCO Passage Ranking dataset (Nguyen et al., 2016). Concretely, for each query, we rank all negatives according to their relevance scores, and compute the means and variances of the gradients of all negatives at the same rank3. To better show the tendency w.r.t. the ranks of relevance scores, we normalize the means and variances of the gradients by dividing by the maximum values, and only report the results for the top 200 ranked negatives.
Results and Findings. As shown in Figure 1, the mean and variance of the gradients gradually decrease as the rank increases. Despite that, the gradient means of the top 200 negatives remain within the same order of magnitude (roughly 1.0 down to 0.25 after normalization), while the gradient variances of the top 10 ranked negatives are significantly larger than the others. The reason is that higher-ranking negatives are more likely to be false negatives. Besides, a surprising finding is that the mean rank of positives is approximately the boundary point of the high-gradient-variance region, and the negatives near it produce relatively larger gradient means and lower gradient variances. This means that they are high-quality ambiguous negatives that balance informativeness against the risk of being false negatives. Therefore, it is promising to rely on the relevance scores of positives and candidate negatives to devise more effective negative sampling methods for training dense retrieval models.

3As AR2 adopts ERNIE-2.0 (Sun et al., 2020) as the backbone, which has millions of parameters, we only compute gradients on the parameters of its last layer for efficiency.
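As a sketch of this per-rank bookkeeping, the statistics behind Figure 1 amount to the following (synthetic numbers stand in for the last-layer gradient magnitudes measured on MS-MARCO dev queries; the array names are ours):

```python
import numpy as np

num_queries, top_k = 1000, 200
# Placeholder per-negative gradient magnitudes, ordered by relevance-score rank for each query.
grad_by_rank = np.abs(np.random.randn(num_queries, top_k))

per_rank_mean = grad_by_rank.mean(axis=0)   # mean over queries at each rank
per_rank_var = grad_by_rank.var(axis=0)     # variance over queries at each rank

# Normalize by the maximum before plotting against rank, as in the study.
per_rank_mean /= per_rank_mean.max()
per_rank_var /= per_rank_var.max()
```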
4 Approach
Based on the findings in §3, we conjecture that the ambiguous negatives ranked near positives according to relevance scores are high-quality negatives, as they are neither too easy (uninformative) nor too hard (possibly false negatives). Therefore, we propose a simple ambiguous negative sampling method, namely SimANS.
4.1 Ambiguous Negative Sampling
To focus on sampling ambiguous negatives, we design a new sampling probability distribution that estimates the influence of each negative using the dense retrieval model itself. In what follows, we first devise a general sampling distribution and then propose a simple and efficient implementation of it.
General Sampling Distribution. Our results suggest the following principles for choosing a good negative sampling probability distribution: (1) negatives that are clearly irrelevant and have low relevance scores should be sampled less frequently; (2) negatives that are highly relevant and have high relevance scores should also be sampled less frequently, because they are more likely to be positives in disguise; (3) negatives that are uncertain and have relevance scores similar to the positives should be sampled more frequently, because they provide useful information and have a lower chance of being false negatives. We propose a general formula for the negative sampling probability that reflects these principles:
$$p_i \propto f\big(\,\big|\,s(q, d_i) - \bar{s}(q, d^{+}) - b\,\big|\,\big), \quad \forall d_i \in \mathcal{D} \setminus \mathcal{D}^{+}, \quad (3)$$
where $f(\cdot)$ is a function that determines the tendency of the probability distribution, $b$ is a hyper-parameter that controls the peak of the distribution, and $\bar{s}(q, d^{+})$ is the mean relevance score of all positives for the query. $f(\cdot)$ should be a monotonically decreasing function of its argument, so that negatives whose relevance scores lie close to the (shifted) mean positive score receive the highest sampling probabilities.
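The excerpt ends before the concrete choice of $f$; as one hedged illustration of Eq. (3), the sketch below assumes an exponential decay $f(x) = \exp(-a \cdot x)$, with $a$ controlling the density (sharpness) of the distribution and $b$ its peak offset, matching the two hyper-parameters mentioned in §1. The function and argument names are ours, not the paper's.

```python
import numpy as np

def simans_sample(neg_scores, pos_scores, num_samples, a=1.0, b=0.0, seed=0):
    """Sample negative indices following Eq. (3) with an assumed f(x) = exp(-a * x)."""
    rng = np.random.default_rng(seed)
    s_pos = float(np.mean(pos_scores))                             # \bar{s}(q, d+)
    weights = np.exp(-a * np.abs(np.asarray(neg_scores) - s_pos - b))
    probs = weights / weights.sum()
    return rng.choice(len(neg_scores), size=num_samples, replace=False, p=probs)

# Example: negatives scoring far above or far below the positives are rarely chosen.
neg_scores = [9.5, 8.0, 7.1, 5.0, 2.0, 0.5]
pos_scores = [7.0, 7.4]
print(simans_sample(neg_scores, pos_scores, num_samples=3))
```

In practice the candidate negatives would be the top-ranked documents retrieved for the query, and the sampled indices feed the training objective in Eq. (2).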