
2 Related Work
Dense retrieval
Retrieving relevant documents for a query (Mitra and Craswell, 2018) is crucial in many NLP applications such as open-domain question answering and knowledge-intensive tasks (Petroni et al., 2021). Dense retrievers typically build a search index over all documents by pre-computing their dense representations with an encoder. Off-the-shelf libraries for maximum inner product search (MIPS) (Johnson et al., 2019; Guo et al., 2020) enable model training and indexing to be developed independently (Lin, 2022). However, both training dense retrievers and building indexes should take the final retrieval accuracy into account. In this respect, we aim to close the gap between training and inference of dense retrievers.
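To make this decoupling concrete, the following is a minimal sketch of the index-then-search workflow using FAISS (Johnson et al., 2019) for exact MIPS; the dimensionality and the randomly generated vectors are placeholders for the output of a trained encoder, not part of any method described above.

```python
# Minimal sketch of the decoupled index-then-search workflow:
# documents are encoded offline, and search reuses the frozen index.
import numpy as np
import faiss

dim = 768            # embedding dimensionality (assumed)
num_docs = 10_000    # toy corpus size

# Pre-computed document representations; in practice these come from
# a trained document encoder run over the whole corpus.
doc_vectors = np.random.rand(num_docs, dim).astype("float32")

# Exact inner-product index; libraries like FAISS also offer
# approximate variants (e.g., IVF, HNSW) for billion-scale corpora.
index = faiss.IndexFlatIP(dim)
index.add(doc_vectors)

# At query time, only the query encoder runs; the index is reused.
query_vector = np.random.rand(1, dim).astype("float32")
scores, doc_ids = index.search(query_vector, 5)
print(doc_ids[0], scores[0])
```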
Phrase retrieval
Phrase retrieval (Seo et al., 2019) directly finds an answer with MIPS over an index of contextualized phrase vectors. This removes the need to run an expensive reader for open-domain QA. As a result, phrase retrieval enables real-time search that is tens of times faster than retriever-reader approaches, offering an alternative for open-domain QA. DensePhrases (Lee et al., 2021a) removes the requirement of sparse features and significantly improves accuracy over previous phrase retrieval methods (Seo et al., 2019; Lee et al., 2020). Lee et al. (2021b) show how retrieving phrases can be translated into retrieving larger units of text such as sentences, passages, or documents, making phrase retrieval a general framework for retrieval. Despite these advantages, phrase retrieval requires building a large index from billions of representations. In this work, we focus on improving phrase retrieval with more efficient validation.
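The translation from phrases to larger retrieval units can be illustrated with a short sketch in the spirit of Lee et al. (2021b): each retrieved phrase carries the id of its source passage, and a passage score is obtained by aggregating (here, taking the maximum of) its phrase scores. The score tuples below are illustrative placeholders, not actual model output.

```python
# Sketch: map phrase-level MIPS hits to passage-level retrieval by
# max-aggregating phrase scores per source passage.
from collections import defaultdict

# (phrase_score, passage_id) pairs as they might come back from a
# phrase index; one passage can contribute several phrases.
phrase_hits = [(21.3, "p1"), (19.8, "p2"), (18.5, "p1"), (17.2, "p3")]

passage_scores = defaultdict(float)
for score, pid in phrase_hits:
    passage_scores[pid] = max(passage_scores[pid], score)

# Rank passages by their best phrase score.
ranked = sorted(passage_scores.items(), key=lambda kv: -kv[1])
print(ranked)  # [('p1', 21.3), ('p2', 19.8), ('p3', 17.2)]
```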
Validation of dense retrieval
Careful validation is essential when developing machine learning models, both to find a better configuration (Melis et al., 2018) and to avoid drawing wrong conclusions. However, many works on dense retrieval do not clearly state their validation strategy, and most of them presumably perform validation on the entire corpus. This is doable but quite expensive¹ for frequent validation and comprehensive tuning, which motivates us to devise efficient validation for dense retrieval. Like ours, Hofstätter et al. (2021) construct a small validation set by sampling queries and using a baseline model for approximate dense passage retrieval, but only for early stopping. Liu et al. (2021) demonstrate that small and synthetic benchmarks can recapitulate the innovation of question answering models on SQuAD (Rajpurkar et al., 2016) by measuring the concurrence of accuracy between benchmarks. We share the intuition that smaller and well-curated datasets may lead to the same (or sometimes better) model development while being faster, but we focus more on the validation process.

¹ For example, dense passage retrieval (DPR) (Karpukhin et al., 2020) takes 8.8 hours on 8 GPUs to compute 21 million passage embeddings and 8.5 hours to build a FAISS index. Also, ColBERT (Khattab and Zaharia, 2020) takes 3 hours to index 9M passages in the MS MARCO dataset (Nguyen et al., 2016) using 4 GPUs.
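As a rough illustration of such a setup, the sketch below builds a small validation pool by sampling queries and collecting a baseline model's top passages. This is our reading of a Hofstätter et al. (2021)-style procedure, and `baseline_search` is a hypothetical stand-in for any already-indexed retriever.

```python
# Sketch: sample validation queries and let a baseline retriever pick
# a small passage pool, so later checkpoints can be validated against
# this pool instead of the full corpus.
import random

def build_small_validation(queries, baseline_search,
                           num_queries=500, top_k=100):
    sampled = random.sample(queries, min(num_queries, len(queries)))
    passage_pool = set()
    for q in sampled:
        # Top-k passage ids from the baseline model for this query.
        passage_pool.update(baseline_search(q, top_k))
    return sampled, passage_pool
```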
Hard examples
Adversarial data collection via an iterative model-in-the-loop (or human-in-the-loop) process aims to evaluate or reinforce models' weaknesses, including robustness to adversarial attacks (Kaushik et al., 2021; Bartolo et al., 2021; Nie et al., 2020; Kiela et al., 2021). In this work, we construct a compact corpus from a pre-trained dense retriever for efficient validation. We also extract hard negatives from the retrieval results of the previous model to learn better dense representations.
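A minimal sketch of this hard-negative extraction follows; `search` and `has_answer` are hypothetical stand-ins for the previous model's retrieval function and an answer-matching check, and the selection rule (top-ranked passages that lack the gold answer) reflects common practice rather than the exact recipe used here.

```python
# Sketch: mine hard negatives from the previous model's top-k results,
# keeping high-scoring passages that do not contain a gold answer.
def mine_hard_negatives(query, answers, search, has_answer,
                        top_k=100, num_negatives=8):
    negatives = []
    for passage in search(query, top_k):  # ranked retrieval results
        if not has_answer(passage, answers):
            negatives.append(passage)
        if len(negatives) == num_negatives:
            break
    return negatives
```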
3 Efficient Validation of Phrase Retrieval
Our goal is to train a dense retriever $M$ that can accurately find a correct answer in the entire corpus $C$ (in our case, Wikipedia). Careful validation is necessary to confirm whether new training methods are truly effective; it also helps find the optimal configurations induced by those techniques. However, building a large-scale index for every model makes the model development process slow and also requires huge memory. Thus, an efficient validation method could expedite modeling innovations in the right direction. It could also allow frequent comparison of different checkpoints when simultaneously updating a full index during training is computationally infeasible.²

² Although some works (Guu et al., 2020; Xiong et al., 2020) perform asynchronous index updates every fixed number of training steps and use the intermediate index for better modeling, this requires a huge amount of computational resources.
Measuring the retrieval accuracy on an index built from a smaller subset of the full corpus (denoted as $C^\star$) would be a practical choice for model validation, hoping that
$$\arg\max_{M \in \Omega} \operatorname{acc}(D \mid M, C^\star) \approx \arg\max_{M \in \Omega} \operatorname{acc}(D \mid M, C),$$
where $\Omega$ is a set of model candidates and $\operatorname{acc}$ denotes the retrieval accuracy on a QA dataset $D$.
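To make this criterion concrete, here is a hedged sketch; `evaluate` is a hypothetical helper returning the retrieval accuracy $\operatorname{acc}(D \mid M, C)$ of a model against an index over a given corpus, and the check is the cheap proxy of asking whether validation on $C^\star$ selects the same model as validation on $C$.

```python
# Sketch of the selection criterion above. `evaluate` is a hypothetical
# helper returning acc(D | M, C) for model M on QA dataset D against an
# index built over the given corpus.
def best_model(models, dataset, corpus, evaluate):
    # argmax over the candidate set (Omega in the text).
    return max(models, key=lambda M: evaluate(M, dataset, corpus))

def selection_agrees(models, dataset, small_corpus, full_corpus, evaluate):
    # Does validating on the subset C* pick the same model as
    # validating on the full corpus C?
    return (best_model(models, dataset, small_corpus)
            == best_model(models, dataset, full_corpus))
```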
We first examine how the relative order of accuracy between modeling approaches