COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval
with Contrastive and Distributionally Robust Learning
Yue Yu1, Chenyan Xiong2, Si Sun3, Chao Zhang1, Arnold Overwijk2
1Georgia Institute of Technology  2Microsoft  3Tsinghua University
{yueyu, chaozhang}@gatech.edu, s-sun17@mails.tsinghua.edu.cn
{chenyan.xiong, arnold.overwijk}@microsoft.com
Abstract
We present a new zero-shot dense retrieval (ZeroDR) method, COCO-DR, to improve the generalization ability of dense retrieval by combating the distribution shifts between source training tasks and target scenarios. To mitigate the impact of document differences, COCO-DR continues pretraining the language model on the target corpora to adapt to target distributions via COntinuous COntrastive learning. To prepare for unseen target queries, COCO-DR leverages implicit Distributionally Robust Optimization (iDRO) to reweight samples from different source query clusters, improving model robustness over rare queries during fine-tuning. COCO-DR achieves superior average performance on BEIR, the zero-shot retrieval benchmark. At BERTBase scale, COCO-DRBase outperforms ZeroDR models that are 60× larger. At BERTLarge scale, COCO-DRLarge outperforms the giant GPT-3 embedding model, which has 500× more parameters. Our analysis shows the correlation between COCO-DR's effectiveness in combating distribution shifts and its improved zero-shot accuracy. Our code and model can be found at https://github.com/OpenMatch/COCO-DR.
1 Introduction
Learning to represent and match queries and documents by embeddings, dense retrieval (DR) achieves strong performance in scenarios with sufficient training signals (Bajaj et al., 2016; Kwiatkowski et al., 2019). However, in many real-world scenarios, obtaining relevance labels can be challenging due to the reliance on domain expertise, or even infeasible because of strict privacy constraints. Deploying dense retrieval in these scenarios becomes zero-shot (ZeroDR, Thakur et al. (2021)), which requires first training DR models on source tasks and then generalizing to target tasks with zero in-domain supervision (Izacard et al., 2022; Ni et al., 2021; Neelakantan et al., 2022).

Work partly done during Yue's internship at Microsoft.
Figure 1: The average nDCG@10 of COCO-DR versus large-scale models on the 11 BEIR tasks selected in Neelakantan et al. (2022). X-axis is in log scale.
ZeroDR poses great challenges to the generalization ability of DR models under the distribution shift between source and target data (Gulrajani and Lopez-Paz, 2021; Wiles et al., 2022), as it requires the alignment between queries and their relevant documents in the embedding space. It is much harder to generalize than in standard classification or ranking tasks, where a robust decision boundary is sufficient (Xin et al., 2022).
In this work, we first analyze the distribution shifts in zero-shot dense retrieval. We illustrate the significant distribution shifts in both query intent and document language from the source to target tasks. After that, we show the strong correlation between the distribution shifts and the reduced zero-shot accuracy of dense retrieval models, which confirms the negative impact of distribution shifts on the generalization ability of dense retrieval.
We then present COCO-DR, a ZeroDR model that combats the distribution shifts between source and target tasks. In many ZeroDR scenarios, even though relevance labels or queries are unavailable, the target corpus is often available pre-deployment (otherwise there is nothing to index) (Xin et al., 2022; Wang et al., 2022). We thus design COCO-DR to perform COntinuous COntrastive pretraining (COCO) on the target corpora, which treats two text sequences from the same document as positive pairs and sequences from different documents as negative pairs. This enables COCO-DR to mitigate document distribution shifts by improving the alignment and uniformity of sequence representations for target tasks.
The distribution shift in query intent, however, is more challenging, as only a few example queries, if any, are available under ZeroDR scenarios. COCO-DR introduces an implicit distributionally robust optimization (iDRO) method for fine-tuning on the source retrieval labels. Specifically, it first clusters the source queries into groups based on their learned embeddings. Then, it dynamically reweights the losses on these query clusters using the gradient similarity among groups. This improves model robustness on less represented query groups in the source, thus implicitly boosting the generalization ability of the DR model on unseen target queries.
COCO-DR is conceptually simple but empirically powerful. On the 18 retrieval tasks included in BEIR, the standard ZeroDR benchmark (Thakur et al., 2021), COCO-DR outperforms state-of-the-art domain adaptation methods (Wang et al., 2022) that leverage per-task generated pseudo labels and cross-encoder teachers. COCO-DR also outperforms large-scale models with orders of magnitude more parameters. As shown in Figure 1, at only BERTBase scale with 110M parameters, COCO-DR outperforms GTRXXL (Ni et al., 2021) and CPTL (Neelakantan et al., 2022), which use 50× more parameters. At BERTLarge scale, COCO-DR surpasses CPTXL (Neelakantan et al., 2022), the largest DR model to date (175B parameters), on its selected tasks, using only 0.17% of its parameters.
Our analysis confirms that the better generalization ability of COCO-DR comes from its ability to combat the distribution shifts. Continuous contrastive learning helps the pretrained model better capture the sequence representations of target corpora, leading to better generalization after fine-tuning. Training with iDRO helps COCO-DR achieve robust performance on source query clusters that share similar search intents with target queries, which in turn leads to better generalization on the corresponding target tasks.
In the rest of this paper, we discuss related work in Section 2, analyze the distribution shifts in Section 3, and present COCO-DR in Section 4. Our experiments are discussed in Section 5 and we conclude in Section 6.
2 Related Work
Earlier research has explored various ways to learn representations for retrieval (Deerwester et al., 1990; Huang et al., 2013). Recently, with pretrained language models (Lee et al., 2019), hard training negative selection (Karpukhin et al., 2020; Xiong et al., 2021), and retrieval-oriented pretraining (Lu et al., 2021; Gao and Callan, 2022), dense retrieval has shown strong advantages over sparse retrieval methods, although these advantages are observed more in supervised settings than in zero-shot scenarios (Thakur et al., 2021).
One research direction to improve zero-shot dense retrieval is to bring in domain adaptation techniques. Xin et al. (2022) employ domain-invariant learning to narrow the representation gap between source and target domains. Ma et al. (2021) and Wang et al. (2022) generate pseudo labels for each target task to train in-domain DR models. These techniques employ one specially trained retrieval model for each target task and improve zero-shot retrieval accuracy.
Another way to improve ZeroDR is to scale up model size and source training data. Ni et al. (2021) and Neelakantan et al. (2022) leverage models with billions of parameters (T5-XXL and GPT-3) and large-scale training data to increase the generalization capacity of DR models. Izacard et al. (2022) and Xu et al. (2022) enlarge the size of training data with retrieval-oriented pretraining tasks. As illustrated in Figure 1, the benefit of scale follows the scaling law of language models (Kaplan et al., 2020): a linear increase in zero-shot accuracy requires exponentially more training data and model parameters.
Combining dense models with sparse retrieval yields better zero-shot retrieval performance on BEIR (Formal et al., 2022; Xu et al., 2022). Reranking models, using stronger cross-encoders, can be used as teachers to improve the robustness of dense retrieval models (Wang et al., 2022).
More generally, continuous pretraining and distributionally robust optimization (DRO) are two techniques for improving model generalization in other applications. Continuing BERT's masked language modeling pretraining on target-domain corpora has shown benefits on both language tasks (Gururangan et al., 2020) and the reranking step of search systems (Wang et al., 2021b). The benefits of DRO are more ambivalent (Gulrajani and Lopez-Paz, 2021; Wiles et al., 2022) and are observed mostly when explicit group partitions are available (Oren et al., 2019; Sagawa et al., 2020; Zhou et al., 2021).
3 Distribution Shifts in Dense Retrieval
In this section, we first introduce the preliminaries of dense retrieval. Then we discuss the standard zero-shot dense retrieval settings and study the impact of distribution shifts on ZeroDR accuracy.
3.1 Preliminaries on Dense Retrieval
In dense retrieval, the query $q$ and document $d$ are represented by dense vectors (Huang et al., 2013) and the relevance score $f(q, d; \theta)$ is often calculated by simple similarity metrics, e.g., dot product (Lee et al., 2019):

$$f(q, d; \theta) = \langle g(q; \theta), g(d; \theta) \rangle. \tag{1}$$
Here $g(\cdot; \theta)$ denotes the text encoder and $\theta$ is the collection of parameters of $g$, which is often initialized by BERT (Devlin et al., 2019). The learning objective for dense retrieval can be expressed as

$$\theta^{*} = \arg\min_{\theta} \ell(\theta) = -\mathbb{E}_{q \sim p(\cdot)}\, \mathbb{E}_{d^{+} \sim p_{\text{pos}}(q)}\, \mathbb{E}_{d^{-} \sim p_{\text{neg}}(q)} \log p_{\theta}(d^{+} \mid q, d^{-}), \tag{2}$$
where $p(\cdot)$ is the distribution of queries, and $d^{+}$ and $d^{-}$ are sampled from the distributions of positive and negative documents for $q$ (denoted as $p_{\text{pos}}(q)$ and $p_{\text{neg}}(q)$), respectively. In practice, the negative documents can either be BM25 negatives (Karpukhin et al., 2020) or mined by DR models from the past episode (Xiong et al., 2021).
During training, we aim to maximize the probability of selecting the ground-truth document $d^{+}$ over the negative document $d^{-}$:

$$p_{\theta}(d^{+} \mid q, d^{-}) = \frac{\exp(f(q, d^{+}; \theta))}{\exp(f(q, d^{+}; \theta)) + \exp(f(q, d^{-}; \theta))}. \tag{3}$$
This dense retrieval configuration has shown strong empirical performance in a wide range of supervised scenarios, where the training and testing data are drawn from the same distributions and a large amount of relevance labels are available (Karpukhin et al., 2020; Xiong et al., 2021; Qu et al., 2021).
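To make the objective concrete, below is a minimal PyTorch sketch of Eqs. (1)-(3). It is an illustration under our own assumptions about tensor layout, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dr_loss(q_emb, pos_emb, neg_emb):
    """Pairwise softmax loss of Eqs. (2)-(3) over a batch.

    q_emb, pos_emb, neg_emb: [batch, dim] embeddings produced by the
    text encoder g(.; theta) for queries, positive documents, and
    negative documents, respectively.
    """
    pos_score = (q_emb * pos_emb).sum(-1)  # f(q, d+): dot product, Eq. (1)
    neg_score = (q_emb * neg_emb).sum(-1)  # f(q, d-)
    logits = torch.stack([pos_score, neg_score], dim=-1)  # [batch, 2]
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    # Cross entropy with label 0 equals -log p_theta(d+ | q, d-) of Eq. (3).
    return F.cross_entropy(logits, labels)
```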
3.2 ZeroDR and Distribution Shifts
Unlike supervised settings, the empirical advantages of dense retrieval are more ambivalent in zero-shot scenarios (Thakur et al., 2021). We first discuss the common setups of ZeroDR and then investigate the impact of distribution shifts on the zero-shot performance of dense retrieval models.
ZeroDR Task. A retrieval task is considered zero-shot if no task-specific signal is available. Except in large commercial scenarios like web search, zero-shot is often the norm, e.g., when building search systems for a new application, in domains where annotations require specific expertise, or in personalized scenarios where each user has her own corpus.
Besides relevance labels, in-domain queries are also a rarity; often only a few example queries are available. The most accessible in-domain information is the corpus, which is a prerequisite to build search systems: sparse retrieval needs to pre-build the inverted index before serving any query, and dense retrieval systems have to pre-compute the document embeddings.
These properties of zero-shot retrieval lead to a common ZeroDR setup where models can leverage the target corpus to perform unsupervised domain adaptation, but their supervised training signals only come from the source retrieval task, namely MS MARCO (Xin et al., 2022; Wang et al., 2022).

In this paper, we follow the standard practice in recent ZeroDR research, with MS MARCO passage retrieval (Bajaj et al., 2016) as the source retrieval task, the tasks collected in the BEIR benchmark (Thakur et al., 2021) as the zero-shot targets, and the corpora of BEIR tasks available at training time for unsupervised domain adaptation.
Distribution Shifts. Before discussing our ZeroDR method, we first study the distribution shifts between the source training task (MS MARCO) and the zero-shot target tasks (BEIR).
Following the analysis in Thakur et al. (2021), we use pairwise weighted Jaccard similarity (Ioffe, 2010) to quantify the distribution shifts on both the query side and the document side. The document distribution shift is measured directly at the lexicon level, by the similarity of their unigram word distributions. The query distribution shift is measured on the distribution of query types, using the nine-type categorization from Ren et al. (2022) (more details in Appendix C.1). As shown in Ren et al. (2022), search intent types are more representative than lexicon for short queries.

Figure 2: Distribution shifts and zero-shot retrieval performance of ANCE trained on MS MARCO, for queries (Q) and documents (Doc) with ANCE initialized from BERT and from coCondenser. X-axes are the similarity between MS MARCO and BEIR; Y-axes are NDCG@10 differences on BEIR.
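For concreteness, the weighted Jaccard similarity between two unigram word distributions can be computed as in the sketch below. This assumes simple whitespace tokenization; the paper's exact preprocessing may differ, and the corpus variables (marco_docs, beir_docs) are placeholders.

```python
from collections import Counter

def weighted_jaccard(p, q):
    """Weighted Jaccard similarity (Ioffe, 2010) between two non-negative
    weight vectors given as {token: weight} dicts:
    J(p, q) = sum_i min(p_i, q_i) / sum_i max(p_i, q_i)."""
    keys = set(p) | set(q)
    num = sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)
    den = sum(max(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0

def unigram_dist(texts):
    """Normalized unigram distribution of a corpus (whitespace tokens)."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

# e.g., weighted_jaccard(unigram_dist(marco_docs), unigram_dist(beir_docs))
```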
Figure 2 plots the distribution shifts from MS MARCO to BEIR tasks and the corresponding performance differences between dense retrieval and sparse retrieval. We use BM25 as the sparse retrieval method, and ANCE starting from pretrained BERT (Xiong et al., 2021) and from coCondenser (Gao and Callan, 2022) as representative DR models.
The average similarity between MS MARCO and BEIR tasks is 32.4% for queries and 34.6% for documents, indicating significant distribution shifts from MS MARCO to BEIR. Furthermore, these shifts are correlated with the performance degradation of dense retrieval models: DR models perform much worse than BM25 on BEIR tasks that are less similar to MS MARCO. The contrastive learning on MS MARCO does not address this challenge; ANCE initialized from coCondenser still underperforms BM25 on BEIR tasks where the distribution shifts are severe.
4 COCO-DR Method
To combat the distribution shifts from the training source to zero-shot targets, COCO-DR introduces two training techniques: COntinuous COntrastive pretraining (COCO) and implicit Distributionally Robust Optimization (iDRO). The former continuously pretrains the language model on target corpora to handle document distribution shifts. The latter improves model robustness during fine-tuning, which in turn leads to better generalization to unseen target queries. This section describes these two components in detail.
4.1 Continuous Contrastive Pretraining
Sequence Contrastive Learning (SCL) aims to improve the alignment of similar text sequences in the pretrained representations and the uniformity of unrelated text sequences (Meng et al., 2021), which benefits supervised dense retrieval (Gao and Callan, 2022; Ma et al., 2022). In zero-shot settings, however, SCL-pretrained models still suffer from distribution shifts, as observed in Figure 2.

COCO addresses this challenge by continuously pretraining the language model on the target corpora, using the contrastive learning settings widely adopted in recent research (Ni et al., 2021; Gao and Callan, 2022; Neelakantan et al., 2022).
Specifically, for each document $d_i$ in the target corpora, we randomly extract two disjoint sequences $s_{i,1}$ and $s_{i,2}$ from $d_i$ to form a positive pair in the contrastive loss

$$\mathcal{L}_{\text{co}} = \sum_{i=1}^{n} \ell(s_{i,1}, s_{i,2}) = -\sum_{i=1}^{n} \log \frac{\exp(\langle g(s_{i,1}), g(s_{i,2}) \rangle)}{\sum_{j=1,2} \sum_{s^{-} \in B} \exp(\langle g(s_{i,j}), g(s^{-}) \rangle)}, \tag{4}$$

with sequence representations $g(s)$ and in-batch negatives $s^{-} \in B$.
This contrastive learning is used in combination with language modeling (Gao and Callan, 2022) to continuously pretrain on target corpora (Gururangan et al., 2020). It adapts the language model to the target corpora before fine-tuning on source labels, reducing the impact of document distribution shifts.
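A minimal sketch of the COCO step is given below: sample two disjoint spans per document, then apply an in-batch contrastive loss. The symmetric cross-entropy form is a common approximation of the denominator in Eq. (4), and the span-sampling scheme here is our own assumption, not the authors' exact recipe.

```python
import random
import torch
import torch.nn.functional as F

def sample_positive_pair(tokens, span_len=128):
    """Draw two disjoint token spans from one document as a positive pair.
    (Assumed sampling scheme; requires len(tokens) >= 2 * span_len.)"""
    start1 = random.randrange(0, len(tokens) - 2 * span_len + 1)
    start2 = random.randrange(start1 + span_len, len(tokens) - span_len + 1)
    return tokens[start1:start1 + span_len], tokens[start2:start2 + span_len]

def coco_loss(emb1, emb2):
    """In-batch contrastive loss in the spirit of Eq. (4).

    emb1, emb2: [batch, dim] embeddings g(s_{i,1}) and g(s_{i,2}) of two
    spans from the same document; all other sequences in the batch
    serve as negatives.
    """
    scores = emb1 @ emb2.t()  # [batch, batch] pairwise dot products
    labels = torch.arange(emb1.size(0), device=emb1.device)
    # Symmetric InfoNCE: each view must retrieve its partner among negatives.
    return F.cross_entropy(scores, labels) + F.cross_entropy(scores.t(), labels)
```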
4.2 Distributionally Robust Optimization
The query distribution shifts are more challenging, as target queries are often available only in small amounts, if at all. For example, applying COCO to a handful of queries is unlikely to be useful.

To address this challenge, we exploit the assumption from distributionally robust optimization (DRO): a model trained to be more robust on the less represented groups of the source distribution can generalize better to unseen target distributions.
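As a rough illustration of the two iDRO steps summarized in the introduction (clustering source queries by their learned embeddings, then reweighting cluster losses by inter-group gradient similarity), consider the sketch below. The weighting rule shown is our own simplified stand-in, not the paper's exact formulation, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def idro_group_weights(group_grads, temperature=1.0):
    """Assign larger weights to query clusters whose gradients agree with
    the other clusters (simplified stand-in for iDRO's reweighting rule).

    group_grads: [num_groups, num_params] flattened gradient of the
    retrieval loss in Eq. (2) for each k-means cluster of source queries
    (num_groups >= 2). Weights are treated as constants during the update.
    """
    g = F.normalize(group_grads, dim=-1)
    sim = g @ g.t()  # pairwise cosine similarity between group gradients
    # Mean agreement of each group with the others (exclude self-similarity).
    agreement = (sim.sum(-1) - 1.0) / (g.size(0) - 1)
    return torch.softmax(agreement / temperature, dim=0).detach()

# Training step sketch: total_loss = sum_k weights[k] * group_losses[k].
```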