erwise there is nothing to index) (Xin et al., 2022; Wang et al., 2022). We thus design COCO-DR
to perform COntinuous COntrastive pretraining
(COCO) on the target corpora, which treats two
text sequences from the same document as positive
pairs and sequences from different documents as
negative pairs. This enables COCO-DR to miti-
gate document distribution shifts by improving the
alignment and uniformity of sequence representa-
tions for target tasks.
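For concreteness, a minimal sketch of such a continuous contrastive objective is given below. It assumes a Hugging Face-style encoder whose output exposes last_hidden_state and uses the [CLS] vector as the sequence representation; the function name and temperature value are illustrative, not taken from the COCO-DR implementation.

import torch
import torch.nn.functional as F

def coco_contrastive_loss(encoder, span_a, span_b, temperature=0.05):
    # span_a and span_b are tokenized batches of two text sequences drawn from
    # the same target-corpus document; sequence i in span_a pairs with sequence
    # i in span_b, and every other sequence in the batch serves as a negative.
    emb_a = F.normalize(encoder(**span_a).last_hidden_state[:, 0], dim=-1)  # [B, H]
    emb_b = F.normalize(encoder(**span_b).last_hidden_state[:, 0], dim=-1)  # [B, H]
    logits = emb_a @ emb_b.t() / temperature  # in-batch similarity matrix [B, B]
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy pulls the positive pair together (alignment) and pushes
    # the in-batch negatives apart (uniformity).
    return F.cross_entropy(logits, targets)

Minimizing this loss pulls the two spans of the same document together and pushes spans from other documents apart, which is what improves the alignment and uniformity of sequence representations on the target corpora.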
The distribution shift on the query intent, however, is more challenging, as there exist only a few, if any, example queries under ZeroDR scenarios. COCO-DR introduces an implicit distributionally robust optimization (iDRO) method when
fine-tuning on the source retrieval labels. Specifi-
cally, it first clusters the source queries into groups
based on their learned embeddings. Then, it dy-
namically reweights the losses on these query clus-
ters by using the gradient similarity among groups.
This improves model robustness on less-represented query groups in the source, thus implicitly boosting the generalization ability of the DR model on unseen target queries.
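The reweighting step can be illustrated with the following sketch, which clusters the queries of a training batch with k-means, computes one loss per cluster, and upweights clusters whose gradients (taken over a small set of shared parameters) agree with those of the other clusters. The per-batch clustering, the softmax-based weighting, and all function names here are assumptions for illustration rather than the exact rule used in COCO-DR.

import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cluster_queries(query_embs, n_groups=8):
    # Group source queries by their learned embeddings (done per batch here;
    # clustering the full source query set offline is equally possible).
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(
        query_embs.detach().cpu().numpy())
    return torch.as_tensor(labels, device=query_embs.device)

def idro_loss(per_query_loss, group_ids, shared_params, n_groups=8, temp=1.0):
    group_losses, group_grads = [], []
    for g in range(n_groups):
        mask = group_ids == g
        if mask.sum() == 0:
            continue
        loss_g = per_query_loss[mask].mean()
        grads_g = torch.autograd.grad(loss_g, shared_params, retain_graph=True)
        group_grads.append(torch.cat([x.flatten() for x in grads_g]))
        group_losses.append(loss_g)
    grads = torch.stack(group_grads)  # [G, P]
    sims = F.cosine_similarity(grads.unsqueeze(1), grads.unsqueeze(0), dim=-1)
    # Upweight groups whose gradient direction agrees with the other groups,
    # so that no single query cluster dominates or is ignored in fine-tuning.
    weights = torch.softmax(sims.mean(dim=1) / temp, dim=0).detach()
    return (weights * torch.stack(group_losses)).sum()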
COCO-DR is conceptually simple but empiri-
cally powerful. On 18 retrieval tasks included in BEIR, the standard ZeroDR benchmark (Thakur et al., 2021), COCO-DR outperforms state-of-the-art domain adaptation methods (Wang et al., 2022), which leverage per-task generated pseudo labels and cross-encoder teachers. COCO-DR also outperforms large-scale models with orders of magnitude
more parameters. As shown in Figure 1, at only BERT_base scale with 110M parameters, COCO-DR outperforms GTR_XXL (Ni et al., 2021) and CPT_L (Neelakantan et al., 2022), which use ∼50× more parameters. At BERT_Large scale, COCO-DR surpasses CPT_XL (Neelakantan et al., 2022), the largest DR model to date (175B parameters), on its selected tasks while using only 0.17% of its parameters.
Our analysis confirms that the better generalization of COCO-DR comes from its ability to combat these distribution shifts. Continuous contrastive learning helps the pretrained model better capture the sequence representations of target corpora, leading to stronger generalization after fine-tuning. Training with iDRO helps COCO-DR achieve robust performance on source query clusters that share similar search intents with target queries, which in turn leads to better generalization on the corresponding target tasks.
In the rest of this paper, we discuss related work
in Section 2, analyze the distribution shift in Sec-
tion 3, and present COCO-DR in Section 4. Our
experiments are discussed in Section 5and we con-
clude in Section 6.
2 Related Work
Earlier research has explored various ways to learn representations for retrieval (Deerwester et al., 1990; Huang et al., 2013). Recently, with pretrained language models (Lee et al., 2019), hard training negative selection (Karpukhin et al., 2020; Xiong et al., 2021), and retrieval-oriented pretraining (Lu et al., 2021; Gao and Callan, 2022), dense retrieval has shown strong advantages over sparse retrieval methods, although these advantages are more evident in supervised settings than in zero-shot scenarios (Thakur et al., 2021).
One research direction to improve zero-shot dense retrieval is to bring in domain adaptation techniques. Xin et al. (2022) employ domain-invariant learning to narrow the representation gap between
source and target domains. Ma et al. (2021) and
Wang et al. (2022) generate pseudo labels for each
target task to train in-domain DR models. These
techniques employ one specially trained retrieval
model for each target task and improve zero-shot
retrieval accuracy.
Another way to improve ZeroDR is to scale up
model size and source training data. Ni et al. (2021)
and Neelakantan et al. (2022) leverage models with
billions of parameters (T5-XXL and GPT-3) and
large-scale training data to increase the generalization capacity of DR models. Izacard et al. (2022) and Xu et al. (2022) enlarge the size of training data with retrieval-oriented pretraining tasks. As illustrated in Figure 1, the benefit of scale follows the scaling law of language models (Kaplan et al., 2020): a linear gain in zero-shot accuracy requires exponentially more training data and model parameters.
Combining dense models with sparse retrieval yields better zero-shot retrieval performance on BEIR (Formal et al., 2022; Xu et al., 2022). Reranking models with stronger cross-encoders can also serve as teachers to improve the robustness of dense retrieval models (Wang et al., 2022).
More generally speaking, continuous pretraining and distributionally robust optimization (DRO) are two techniques for improving model generalization in other applications. Continuous pre-