Bridging the Training-Inference Gap for Dense Phrase Retrieval
Gyuwan Kim1 Jinhyuk Lee2 Barlas Oğuz3 Wenhan Xiong3
Yizhe Zhang3† Yashar Mehdad3 William Yang Wang1
1University of California, Santa Barbara 2Korea University 3Meta AI
gyuwankim@ucsb.edu, jinhyuk_lee@korea.ac.kr
{barlaso, xwhan, yizhezhang, mehdad}@fb.com, william@cs.ucsb.edu
Abstract
Building dense retrievers requires a series of
standard procedures, including training and
validating neural models and creating indexes
for efficient search. However, these proce-
dures are often misaligned in that training ob-
jectives do not exactly reflect the retrieval sce-
nario at inference time. In this paper, we ex-
plore how the gap between training and infer-
ence in dense retrieval can be reduced, focus-
ing on dense phrase retrieval (Lee et al.,2021a)
where billions of representations are indexed
at inference. Since validating every dense re-
triever with a large-scale index is practically
infeasible, we propose an efficient way of vali-
dating dense retrievers using a small subset of
the entire corpus. This allows us to validate
various training strategies including unifying
contrastive loss terms and using hard negatives
for phrase retrieval, which largely reduces the
training-inference discrepancy. As a result,
we improve top-1 phrase retrieval accuracy by
2–3 points and top-20 passage retrieval accuracy
by 2–4 points for open-domain question
answering. Our work urges modeling dense re-
trievers with careful consideration of training
and inference via efficient validation while ad-
vancing phrase retrieval as a general solution
for dense retrieval.
1 Introduction
Dense retrieval aims to learn effective representations of queries and documents by making the representations of relevant query-document pairs similar (Chopra et al., 2005; van den Oord et al., 2018).
With the success of dense passage retrieval for
open-domain question answering (QA) (Lee et al.,
2019;Karpukhin et al.,2020), recent studies build
an index for a finer granularity such as dense phrase
retrieval (Lee et al.,2021a), which largely improves
the computational efficiency of open-domain QA
JL currently works at Google Research.
YZ currently works at Apple.
by replacing the retriever-reader model (Chen et al.,
2017) with a retriever-only model (Seo et al.,2019;
Lewis et al.,2021). Also, phrase retrieval provides
a unifying solution for multi-granularity retrieval
ranging from open-domain QA (formulated as re-
trieving phrases) to document retrieval (Lee et al.,
2021b), which makes it particularly attractive.
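As a toy illustration of this similarity-based retrieval, the sketch below scores candidate "phrase" vectors against a query vector by inner product and returns the top matches. The random vectors are stand-ins for learned encoder outputs, not the actual DensePhrases representations:

```python
import numpy as np

def retrieve_top_k(query_vec, index_vecs, k=1):
    """Return indices of the k index vectors with the largest inner
    product with the query (brute-force maximum inner product search)."""
    scores = index_vecs @ query_vec
    return np.argsort(-scores)[:k]

# Four toy "phrase" vectors; the query is a noisy copy of phrase 2,
# so phrase 2 should be retrieved first.
rng = np.random.default_rng(0)
index_vecs = rng.normal(size=(4, 8))
query_vec = index_vecs[2] + 0.01 * rng.normal(size=8)
print(retrieve_top_k(query_vec, index_vecs, k=2))
```

At inference, the same search runs over billions of pre-computed phrase vectors via an approximate MIPS library rather than this brute-force loop.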
Building a dense retrieval system involves multiple steps (Figure 1), including training a dual encoder (§4), selecting the best model with validation (§3), and constructing an index (often with filtering) for efficient search (§5). However, these
components are somewhat loosely connected to each other. For example, model training does not directly optimize retrieval performance on the full corpus on which models are eventually evaluated. In this paper, we aim to minimize the gap
between training and inference of dense retrievers
to achieve better retrieval performance.
However, developing a better dense retriever requires validation, and validation in turn requires building a large index from the full corpus (e.g., the entire Wikipedia for open-domain QA), which takes a huge amount of computational resources and time. To
tackle this problem, we first propose an efficient
way of validating dense retrievers without building
large-scale indexes. An analysis using smaller random corpora of different sizes for validation reveals that accuracy on small indexes does not necessarily correlate well with retrieval accuracy on the full index. As an alternative, we construct a compact corpus using a pre-trained dense retriever so that validation on this corpus correlates better with retrieval at the full scale, while
keeping the size of the corpus as small as possible
to perform efficient validation.
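One way such a compact corpus can be assembled (a minimal sketch of the idea, not the paper's exact recipe; `overlap_retriever` is a toy stand-in for a pre-trained dense retriever) is to pool the top-ranked passages the retriever returns for each development question, so that hard distractors remain in the small corpus:

```python
def build_pseudo_corpus(dev_queries, full_corpus, retriever, k=2):
    """Union of the top-k passages retrieved for each dev query.
    `retriever(query, corpus, k)` is assumed to return indices of the
    k best passages; the real system would use a pre-trained dense
    retriever over the full index."""
    selected = set()
    for q in dev_queries:
        selected.update(retriever(q, full_corpus, k))
    return [full_corpus[i] for i in sorted(selected)]

def overlap_retriever(query, corpus, k):
    """Toy retriever: rank passages by word overlap with the query."""
    qw = set(query.split())
    return sorted(range(len(corpus)),
                  key=lambda i: -len(qw & set(corpus[i].split())))[:k]

corpus = ["cats chase mice", "dogs chase cats",
          "paris is in france", "france borders spain"]
queries = ["who chase mice", "where is paris"]
print(build_pseudo_corpus(queries, corpus, overlap_retriever, k=1))
# ['cats chase mice', 'paris is in france']
```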
With our efficient validation, we revisit the train-
ing method of dense phrase retrieval (Lee et al.,
2021a,b), a general framework for retrieving differ-
ent granularities of texts such as phrases, passages,
and documents. We reduce the training-inference
arXiv:2210.13678v1 [cs.CL] 25 Oct 2022
Figure 1: Comparison of the (a) original (Lee et al., 2021a) and (b) proposed procedure for DensePhrases training (top) and validation (bottom). In (a), validation builds an index over the full corpus (n = 3B); in (b), it indexes a retriever-built pseudo corpus (n = 6M). We unify the training loss terms L_inp and L_inb, which enforce that the representation of a question (q) is similar to the representation of a positive phrase (p+) while contrasting it against in-passage negative phrases (p−_inp) and in-batch negative phrases (p−_inb) respectively, into a single term L_train, and we expand the negatives in number and difficulty with hard negatives (p−_hard). Also, we use the retrieval accuracy on the development set Q_dev over a smaller corpus, instead of the full corpus, as an efficient validation metric for selecting the best checkpoint. Query-side fine-tuning and token filtering are not described in this overview figure.
discrepancy by unifying previous loss terms to dis-
criminate a gold answer phrase from other negative
phrases altogether instead of applying in-passage
negatives (Lee et al.,2021b) and in-batch negatives
separately. To better approximate the retrieval at in-
ference where the number of negatives is extremely
large, we use all available negative phrases from the training passages to increase the number of negatives and to put more weight on them. We also leverage model-based hard negatives (Xiong et al., 2020) for phrase retrieval, which have not been explored in previous studies. This enables our dense retrievers to correct mistakes made at inference time.
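A minimal sketch of this unified objective: the gold phrase competes against all negatives (in-passage, in-batch, and hard, concatenated) in a single softmax, rather than appearing in separate loss terms. This is an illustrative NumPy version, not the actual training code:

```python
import numpy as np

def unified_contrastive_loss(q, p_pos, negatives):
    """Cross-entropy of the gold phrase against ALL negatives at once:
    in-passage, in-batch, and hard negatives share one partition
    function instead of appearing in separate loss terms."""
    cand = np.vstack([p_pos[None, :], np.asarray(negatives)])
    logits = cand @ q                       # inner-product similarities
    m = logits.max()                        # stabilized log-sum-exp
    log_z = m + np.log(np.exp(logits - m).sum())
    return log_z - logits[0]                # -log P(gold | all candidates)

q = np.array([1.0, 0.0])                    # toy question vector
good = np.array([5.0, 0.0])                 # gold phrase aligned with q
bad = np.array([-5.0, 0.0])                 # gold phrase misaligned with q
negs = np.array([[0.0, 1.0], [0.0, -1.0]])  # pooled negatives
print(unified_contrastive_loss(q, good, negs) <
      unified_contrastive_loss(q, bad, negs))  # True
```

Adding more negatives to `negs` grows the partition function, which brings the training-time softmax closer to the inference-time search over billions of candidates.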
Lastly, we study the effect of a representation
filter (Seo et al.,2018), an essential component for
efficient search. We separate the training and vali-
dation of a phrase filtering module to disentangle
the effect of contrastive learning and representation
filtering. This allows us to do careful validation of
the representation filter and achieve a better preci-
sion/recall trade-off. Interestingly, we find that a
representation filter has a dual role of reducing the
index size and also improving retrieval accuracy,
meaning smaller indexes are often better than larger
ones in terms of accuracy. This gives a different
view of other filtering methods that have been ap-
plied in previous studies for efficient open-domain
QA (Min et al.,2021;Izacard et al.,2020;Fajcik
et al.,2021;Yang and Seo,2021).
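The filter's dual role can be illustrated with a toy thresholding step; `filter_scores` here are hypothetical learned "answerability" scores, not the actual filter module described in the paper:

```python
def filter_index(phrase_vecs, filter_scores, threshold):
    """Keep only phrase vectors whose filter score exceeds the
    threshold; raising the threshold shrinks the index and can also
    remove noisy candidates that hurt accuracy."""
    return [(i, v) for i, (v, s) in enumerate(zip(phrase_vecs, filter_scores))
            if s > threshold]

vecs = [[0.1, 0.2], [0.9, 0.1], [0.3, 0.3], [0.0, 0.8]]
scores = [0.2, 0.9, 0.4, 0.7]   # hypothetical learned filter scores
kept = filter_index(vecs, scores, threshold=0.5)
print([i for i, _ in kept])     # [1, 3]
```

Sweeping `threshold` on a validation set traces out the precision/recall trade-off between index size and retrieval accuracy.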
We reemphasize that phrase retrieval is an attrac-
tive solution for open-domain question answering
compared to other retriever-reader models, consid-
ering both accuracy and efficiency. Our contribu-
tions are summarized as follows:
• We introduce an efficient method of validating dense retrievers to confirm and accelerate better modeling of dense retrievers.
• Based on our efficient validation, we improve dense phrase retrieval models with modified training objectives and hard negatives.
• Consequently, we achieve state-of-the-art phrase retrieval accuracy for open-domain QA and also largely improve passage retrieval accuracy on Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017).
2 Related Work
Dense retrieval
Retrieving relevant documents
for a query (Mitra and Craswell,2018) is crucial in
many NLP applications like open-domain question
answering and knowledge-intensive tasks (Petroni
et al.,2021). Dense retrievers typically build a
search index for all documents by pre-computing
the dense representations of documents using an en-
coder. Off-the-shelf libraries for a maximum inner
product search (MIPS) (Johnson et al.,2019;Guo
et al.,2020) enable model training and indexing to
be developed independently (Lin,2022). However,
both training dense retrievers and building indexes
should take into account the final retrieval accuracy.
In this respect, we aim to close the gap between
training and inference of dense retrievers.
Phrase retrieval
Phrase retrieval (Seo et al.,
2019) directly finds an answer with MIPS from
an index of contextualized phrase vectors. This
removes the need to run an expensive reader for
open-domain QA. As a result, phrase retrieval enables real-time search that is tens of times faster than retriever-reader approaches. DensePhrases (Lee et al., 2021a)
removes the requirement of sparse features and
significantly improves the accuracy from previous
phrase retrieval methods (Seo et al.,2019;Lee et al.,
2020). Lee et al. (2021b) show how retrieving
phrases could be translated into retrieving larger
units of texts like a sentence, passage, or document,
making phrase retrieval a general framework for
retrieval. Despite these advantages, phrase retrieval
requires building a large index from billions of rep-
resentations. In this work, we focus on improving
phrase retrieval with more efficient validation.
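The translation from phrase retrieval to coarser granularities can be sketched as a simple score aggregation, e.g., scoring each passage by the maximum score among its phrases (a simplified stand-in for the aggregation in Lee et al. (2021b)):

```python
def passage_scores_from_phrases(phrase_scores, phrase_to_passage):
    """Score each passage by the maximum score among its phrases,
    turning a phrase-level ranking into a passage-level one."""
    agg = {}
    for score, pid in zip(phrase_scores, phrase_to_passage):
        agg[pid] = max(agg.get(pid, float("-inf")), score)
    return agg

phrase_scores = [0.9, 0.2, 0.7, 0.4]          # scores of four phrases
phrase_to_passage = ["p1", "p1", "p2", "p2"]  # source passage of each
print(passage_scores_from_phrases(phrase_scores, phrase_to_passage))
# {'p1': 0.9, 'p2': 0.7}
```

The same mapping from phrases to sentences or documents yields retrieval at those granularities from a single phrase index.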
Validation of dense retrieval
Careful validation
is essential for developing machine learning mod-
els to find a better configuration (Melis et al.,2018)
or to avoid drawing wrong conclusions. However,
many works on dense retrieval do not clearly state
the validation strategy, and most of them presumably perform validation on the entire corpus. This is doable but quite expensive¹ when performing frequent validation and comprehensive tuning, which motivates us to devise an efficient validation scheme for dense retrieval.

¹For example, dense passage retrieval (DPR) (Karpukhin et al., 2020) takes 8.8 hours on 8 GPUs to compute 21 million passage embeddings and 8.5 hours to build a FAISS index. Also, ColBERT (Khattab and Zaharia, 2020) takes 3 hours to index 9M passages from the MS MARCO dataset (Nguyen et al., 2016) using 4 GPUs.

Like ours, Hofstätter et al. (2021) construct a small validation set by sampling queries and using a baseline model for approximate dense passage retrieval, but only for early stopping. Liu
et al. (2021) demonstrate that small and synthetic
benchmarks can recapitulate innovation of ques-
tion answering models on SQuAD (Rajpurkar et al.,
2016) by measuring the concurrence of accuracy
between benchmarks. We share the intuition that smaller and well-curated datasets may lead to the same (or sometimes better) model development at a faster pace, but we put more focus on the validation process.
Hard examples
Adversarial data collection by
an iterative model (or human) in the loop pro-
cess aims to evaluate or reinforce models’ weak-
nesses, including the robustness to adversarial at-
tacks (Kaushik et al.,2021;Bartolo et al.,2021;
Nie et al.,2020;Kiela et al.,2021). In this work,
we construct a compact corpus from a pre-trained
dense retriever for efficient validation. Also, we
extract hard negatives from retrieval results of the
previous model for better dense representations.
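A common recipe for such model-based hard negatives, sketched here under the assumption that top-ranked non-gold retrievals make good negatives (`toy_retriever` is a hypothetical stand-in for the previous model checkpoint):

```python
def mine_hard_negatives(query, gold_answer, retriever, corpus, k=5):
    """Top-ranked retrievals that are not the gold answer become hard
    negatives for the next training round."""
    top = retriever(query, corpus, k)
    return [corpus[i] for i in top if corpus[i] != gold_answer]

def toy_retriever(query, corpus, k):
    """Stand-in for the previous model: rank by word overlap."""
    qw = set(query.split())
    return sorted(range(len(corpus)),
                  key=lambda i: -len(qw & set(corpus[i].split())))[:k]

corpus = ["barack obama", "michelle obama", "george bush"]
negs = mine_hard_negatives("who is obama", "barack obama",
                           toy_retriever, corpus, k=2)
print(negs)  # ['michelle obama']
```

Because these negatives are exactly the model's near-misses, training on them targets the mistakes it would otherwise make at inference time.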
3 Efficient Validation of Phrase Retrieval
Our goal is to train a dense retriever M that can accurately find a correct answer in the entire corpus C (in our case, Wikipedia). Careful validation is necessary to confirm whether new training methods are truly effective. It also helps find the optimal configurations induced by those techniques. However, building a large-scale index for every model makes the model development process slow and also requires huge memory. Thus, an efficient validation method could expedite modeling innovations in the right direction. It could also allow frequent comparison of different checkpoints when updating a full index simultaneously during training is computationally infeasible.²
Measuring the retrieval accuracy on an index built from a smaller subset of the full corpus (denoted as C*) for model validation would be a practical choice, hoping that argmax_{M ∈ ℳ} acc(D | M, C*) = argmax_{M ∈ ℳ} acc(D | M, C), where ℳ is the set of model candidates and acc denotes the retrieval accuracy on a QA dataset D. We first examine how the relative order of accuracy between modeling approaches

²Although some works (Guu et al., 2020; Xiong et al., 2020) asynchronously update the index every certain number of training steps and use the intermediate index for better modeling, this requires a huge amount of computational resources.
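This selection criterion can be sketched as follows; `model_good` and `model_bad` are hypothetical checkpoints, and in practice the accuracy would be computed on the small corpus as a proxy for the full one:

```python
def retrieval_accuracy(model, dataset, corpus):
    """Fraction of questions whose retrieved passage is the gold answer.
    `model(query, corpus)` returns the index of its top retrieval."""
    hits = sum(corpus[model(q, corpus)] == a for q, a in dataset)
    return hits / len(dataset)

def select_best_model(models, dataset, corpus):
    """argmax over candidate checkpoints of retrieval accuracy on
    `corpus`; run on a small corpus as a stand-in for the full one."""
    return max(models, key=lambda m: retrieval_accuracy(m, dataset, corpus))

corpus = ["paris", "london", "rome"]
dataset = [("capital of france", "paris"), ("capital of italy", "rome")]
model_good = lambda q, c: {"france": 0, "italy": 2}[q.split()[-1]]
model_bad = lambda q, c: 1
print(select_best_model([model_bad, model_good], dataset, corpus) is model_good)
# True
```

The question the section studies is whether this argmax, computed on the small corpus, picks the same checkpoint as the argmax on the full corpus.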