erwise there is nothing to index) (Xin et al., 2022; Wang et al., 2022). We thus design COCO-DR
to perform COntinuous COntrastive pretraining
(COCO) on the target corpora, which treats two
text sequences from the same document as positive
pairs and sequences from different documents as
negative pairs. This enables COCO-DR to miti-
gate document distribution shifts by improving the
alignment and uniformity of sequence representa-
tions for target tasks.
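For concreteness, a minimal sketch of such a continuous contrastive objective is given below. It assumes a Hugging Face-style encoder whose output exposes last_hidden_state and uses the [CLS] vector as the sequence representation; the function name and temperature value are illustrative, not taken from the COCO-DR implementation.

import torch
import torch.nn.functional as F

def coco_contrastive_loss(encoder, span_a, span_b, temperature=0.05):
    # span_a and span_b are tokenized batches of two text sequences drawn from
    # the same target-corpus document; sequence i in span_a pairs with sequence
    # i in span_b, and every other sequence in the batch serves as a negative.
    emb_a = F.normalize(encoder(**span_a).last_hidden_state[:, 0], dim=-1)  # [B, H]
    emb_b = F.normalize(encoder(**span_b).last_hidden_state[:, 0], dim=-1)  # [B, H]
    logits = emb_a @ emb_b.t() / temperature  # in-batch similarity matrix [B, B]
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy pulls the positive pair together (alignment) and pushes
    # the in-batch negatives apart (uniformity).
    return F.cross_entropy(logits, targets)

Minimizing this loss pulls the two spans of the same document together and pushes spans from other documents apart, which is what improves the alignment and uniformity of sequence representations on the target corpora.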
The distribution shift on the query intent, however, is more challenging, as there exist only a few, if any, example queries under ZeroDR scenarios. COCO-DR introduces an implicit distributionally robust optimization (iDRO) method when
fine-tuning on the source retrieval labels. Specifi-
cally, it first clusters the source queries into groups
based on their learned embeddings. Then, it dy-
namically reweights the losses on these query clus-
ters by using the gradient similarity among groups.
This improves model robustness on less-represented query groups in the source, thus implicitly boosting the generalization ability of the DR model on unseen target queries.
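The reweighting step can be illustrated with the following sketch, which clusters the queries of a training batch with k-means, computes one loss per cluster, and upweights clusters whose gradients (taken over a small set of shared parameters) agree with those of the other clusters. The per-batch clustering, the softmax-based weighting, and all function names here are assumptions for illustration rather than the exact rule used in COCO-DR.

import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cluster_queries(query_embs, n_groups=8):
    # Group source queries by their learned embeddings (done per batch here;
    # clustering the full source query set offline is equally possible).
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(
        query_embs.detach().cpu().numpy())
    return torch.as_tensor(labels, device=query_embs.device)

def idro_loss(per_query_loss, group_ids, shared_params, n_groups=8, temp=1.0):
    group_losses, group_grads = [], []
    for g in range(n_groups):
        mask = group_ids == g
        if mask.sum() == 0:
            continue
        loss_g = per_query_loss[mask].mean()
        grads_g = torch.autograd.grad(loss_g, shared_params, retain_graph=True)
        group_grads.append(torch.cat([x.flatten() for x in grads_g]))
        group_losses.append(loss_g)
    grads = torch.stack(group_grads)  # [G, P]
    sims = F.cosine_similarity(grads.unsqueeze(1), grads.unsqueeze(0), dim=-1)
    # Upweight groups whose gradient direction agrees with the other groups,
    # so that no single query cluster dominates or is ignored in fine-tuning.
    weights = torch.softmax(sims.mean(dim=1) / temp, dim=0).detach()
    return (weights * torch.stack(group_losses)).sum()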
COCO-DR is conceptually simple but empiri-
cally powerful. On 18 retrieval tasks included in BEIR, the standard ZeroDR benchmark (Thakur et al., 2021), COCO-DR outperforms state-of-the-art domain adaptation methods (Wang et al., 2022), which leverage per-task generated pseudo labels and cross-encoder teachers. COCO-DR also outperforms large-scale models with orders of magnitude
more parameters. As shown in Figure 1, at only BERT_base scale with 110M parameters, COCO-DR outperforms GTR_XXL (Ni et al., 2021) and CPT_L (Neelakantan et al., 2022), which use ∼50× more parameters. At BERT_Large scale, COCO-DR surpasses CPT_XL (Neelakantan et al., 2022), the largest DR model to date (175B parameters), on its selected tasks while using only 0.17% of its parameters.
Our analysis confirms that the better generalization of COCO-DR comes from its ability to combat these distribution shifts. Continuous contrastive learning helps the pretrained model better capture the sequence representations of target corpora, leading to stronger generalization after fine-tuning. Training with iDRO helps COCO-DR achieve robust performance on source query clusters that share similar search intents with target queries, which in turn leads to better generalization on the corresponding target tasks.
In the rest of this paper, we discuss related work
in Section 2, analyze the distribution shift in Sec-
tion 3, and present COCO-DR in Section 4. Our
experiments are discussed in Section 5and we con-
clude in Section 6.
2 Related Work
Earlier research has explored various ways to learn representations for retrieval (Deerwester et al., 1990; Huang et al., 2013). Recently, with pretrained language models (Lee et al., 2019), hard training negative selection (Karpukhin et al., 2020; Xiong et al., 2021), and retrieval-oriented pretraining (Lu et al., 2021; Gao and Callan, 2022), dense retrieval has shown strong advantages over sparse retrieval methods, although these advantages are more evident in supervised settings than in zero-shot scenarios (Thakur et al., 2021).
One research direction to improve zero-shot dense retrieval is to bring in domain adaptation techniques. Xin et al. (2022) employ domain-invariant learning to narrow the representation gap between
source and target domains. Ma et al. (2021) and
Wang et al. (2022) generate pseudo labels for each
target task to train in-domain DR models. These
techniques employ one specially trained retrieval
model for each target task and improve zero-shot
retrieval accuracy.
Another way to improve ZeroDR is to scale up
model size and source training data. Ni et al. (2021)
and Neelakantan et al. (2022) leverage models with
billions of parameters (T5-XXL and GPT-3) and
large-scale training data to increase the generalization capacity of DR models. Izacard et al. (2022) and Xu et al. (2022) enlarge the size of training data with retrieval-oriented pretraining tasks. As illustrated in Figure 1, the benefit of scale follows the scaling law of language models (Kaplan et al., 2020): a linear gain in zero-shot accuracy requires exponentially more training data and model parameters.
Combining dense models with sparse retrieval yields better zero-shot retrieval performance on BEIR (Formal et al., 2022; Xu et al., 2022). Reranking models with stronger cross-encoders can also serve as teachers to improve the robustness of dense retrieval models (Wang et al., 2022).
More generally speaking, continuous pretraining and distributionally robust optimization (DRO) are two techniques for improving model generalization in other applications. Continuous pre-