
2 Related Work
Dense retrieval
Retrieving relevant documents for a query (Mitra and Craswell, 2018) is crucial in many NLP applications such as open-domain question answering and knowledge-intensive tasks (Petroni et al., 2021). Dense retrievers typically build a search index over all documents by pre-computing their dense representations with an encoder. Off-the-shelf libraries for maximum inner product search (MIPS) (Johnson et al., 2019; Guo et al., 2020) enable model training and indexing to be developed independently (Lin, 2022). However, both training dense retrievers and building indexes should take the final retrieval accuracy into account. In this respect, we aim to close the gap between training and inference of dense retrievers.
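To make this decoupling concrete, the following is a minimal sketch of the index-then-search workflow using FAISS (Johnson et al., 2019) for exact MIPS; the dimensionality and the randomly generated vectors are placeholders for the output of a trained encoder, not part of any method described above.

```python
# Minimal sketch of the decoupled index-then-search workflow:
# documents are encoded offline, and search reuses the frozen index.
import numpy as np
import faiss

dim = 768            # embedding dimensionality (assumed)
num_docs = 10_000    # toy corpus size

# Pre-computed document representations; in practice these come from
# a trained document encoder run over the whole corpus.
doc_vectors = np.random.rand(num_docs, dim).astype("float32")

# Exact inner-product index; libraries like FAISS also offer
# approximate variants (e.g., IVF, HNSW) for billion-scale corpora.
index = faiss.IndexFlatIP(dim)
index.add(doc_vectors)

# At query time, only the query encoder runs; the index is reused.
query_vector = np.random.rand(1, dim).astype("float32")
scores, doc_ids = index.search(query_vector, 5)
print(doc_ids[0], scores[0])
```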
Phrase retrieval
Phrase retrieval (Seo et al., 2019) directly finds an answer with MIPS over an index of contextualized phrase vectors. This removes the need to run an expensive reader for open-domain QA. As a result, phrase retrieval enables real-time search that is tens of times faster than retriever-reader approaches, offering an alternative for open-domain QA. DensePhrases (Lee et al., 2021a) removes the requirement of sparse features and significantly improves accuracy over previous phrase retrieval methods (Seo et al., 2019; Lee et al., 2020). Lee et al. (2021b) show how retrieving phrases can be translated into retrieving larger units of text such as sentences, passages, or documents, making phrase retrieval a general framework for retrieval. Despite these advantages, phrase retrieval requires building a large index from billions of representations. In this work, we focus on improving phrase retrieval with more efficient validation.
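The translation from phrases to larger retrieval units can be illustrated with a short sketch in the spirit of Lee et al. (2021b): each retrieved phrase carries the id of its source passage, and a passage score is obtained by aggregating (here, taking the maximum of) its phrase scores. The score tuples below are illustrative placeholders, not actual model output.

```python
# Sketch: map phrase-level MIPS hits to passage-level retrieval by
# max-aggregating phrase scores per source passage.
from collections import defaultdict

# (phrase_score, passage_id) pairs as they might come back from a
# phrase index; one passage can contribute several phrases.
phrase_hits = [(21.3, "p1"), (19.8, "p2"), (18.5, "p1"), (17.2, "p3")]

passage_scores = defaultdict(float)
for score, pid in phrase_hits:
    passage_scores[pid] = max(passage_scores[pid], score)

# Rank passages by their best phrase score.
ranked = sorted(passage_scores.items(), key=lambda kv: -kv[1])
print(ranked)  # [('p1', 21.3), ('p2', 19.8), ('p3', 17.2)]
```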
Validation of dense retrieval
Careful validation is essential when developing machine learning models, both to find a better configuration (Melis et al., 2018) and to avoid drawing wrong conclusions. However, many works on dense retrieval do not clearly state their validation strategy, and most of them presumably perform validation on the entire corpus. This is doable but quite expensive¹ for frequent validation and comprehensive tuning, which motivates us to devise efficient validation for dense retrieval. Like ours, Hofstätter et al. (2021) construct a small validation set by sampling queries and using a baseline model for approximate dense passage retrieval, but only for early stopping. Liu et al. (2021) demonstrate that small and synthetic benchmarks can recapitulate the innovation of question answering models on SQuAD (Rajpurkar et al., 2016) by measuring the concurrence of accuracy between benchmarks. We share the intuition that smaller and well-curated datasets may lead to the same (or sometimes better) model development while being faster, but we focus more on the validation process.

¹ For example, dense passage retrieval (DPR) (Karpukhin et al., 2020) takes 8.8 hours on 8 GPUs to compute 21 million passage embeddings and 8.5 hours to build a FAISS index. Also, ColBERT (Khattab and Zaharia, 2020) takes 3 hours to index 9M passages in the MS MARCO dataset (Nguyen et al., 2016) using 4 GPUs.
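As a rough illustration of such a setup, the sketch below builds a small validation pool by sampling queries and collecting a baseline model's top passages. This is our reading of a Hofstätter et al. (2021)-style procedure, and `baseline_search` is a hypothetical stand-in for any already-indexed retriever.

```python
# Sketch: sample validation queries and let a baseline retriever pick
# a small passage pool, so later checkpoints can be validated against
# this pool instead of the full corpus.
import random

def build_small_validation(queries, baseline_search,
                           num_queries=500, top_k=100):
    sampled = random.sample(queries, min(num_queries, len(queries)))
    passage_pool = set()
    for q in sampled:
        # Top-k passage ids from the baseline model for this query.
        passage_pool.update(baseline_search(q, top_k))
    return sampled, passage_pool
```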
Hard examples
Adversarial data collection via an iterative model-in-the-loop (or human-in-the-loop) process aims to evaluate or reinforce models' weaknesses, including robustness to adversarial attacks (Kaushik et al., 2021; Bartolo et al., 2021; Nie et al., 2020; Kiela et al., 2021). In this work, we construct a compact corpus from a pre-trained dense retriever for efficient validation. We also extract hard negatives from the retrieval results of the previous model to learn better dense representations.
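A minimal sketch of this hard-negative extraction follows; `search` and `has_answer` are hypothetical stand-ins for the previous model's retrieval function and an answer-matching check, and the selection rule (top-ranked passages that lack the gold answer) reflects common practice rather than the exact recipe used here.

```python
# Sketch: mine hard negatives from the previous model's top-k results,
# keeping high-scoring passages that do not contain a gold answer.
def mine_hard_negatives(query, answers, search, has_answer,
                        top_k=100, num_negatives=8):
    negatives = []
    for passage in search(query, top_k):  # ranked retrieval results
        if not has_answer(passage, answers):
            negatives.append(passage)
        if len(negatives) == num_negatives:
            break
    return negatives
```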
3 Efficient Validation of Phrase Retrieval
Our goal is to train a dense retriever $M$ that can accurately find a correct answer in the entire corpus $C$ (in our case, Wikipedia). Careful validation is necessary to confirm whether new training methods are truly effective; it also helps find the optimal configurations induced by those techniques. However, building a large-scale index for every model makes the model development process slow and also requires huge memory. Thus, an efficient validation method could expedite modeling innovations in the right direction. It could also allow frequent comparison of different checkpoints when simultaneously updating a full index during training is computationally infeasible.²

² Although some works (Guu et al., 2020; Xiong et al., 2020) perform asynchronous index updates every fixed number of training steps and use the intermediate index for better modeling, this requires a huge amount of computational resources.
Measuring the retrieval accuracy on an index built from a smaller subset of the full corpus (denoted as $C^\star$) would be a practical choice for model validation, hoping that
$$\arg\max_{M \in \Omega} \operatorname{acc}(D \mid M, C^\star) \approx \arg\max_{M \in \Omega} \operatorname{acc}(D \mid M, C),$$
where $\Omega$ is a set of model candidates and $\operatorname{acc}$ denotes the retrieval accuracy on a QA dataset $D$.
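To make this criterion concrete, here is a hedged sketch; `evaluate` is a hypothetical helper returning the retrieval accuracy $\operatorname{acc}(D \mid M, C)$ of a model against an index over a given corpus, and the check is the cheap proxy of asking whether validation on $C^\star$ selects the same model as validation on $C$.

```python
# Sketch of the selection criterion above. `evaluate` is a hypothetical
# helper returning acc(D | M, C) for model M on QA dataset D against an
# index built over the given corpus.
def best_model(models, dataset, corpus, evaluate):
    # argmax over the candidate set (Omega in the text).
    return max(models, key=lambda M: evaluate(M, dataset, corpus))

def selection_agrees(models, dataset, small_corpus, full_corpus, evaluate):
    # Does validating on the subset C* pick the same model as
    # validating on the full corpus C?
    return (best_model(models, dataset, small_corpus)
            == best_model(models, dataset, full_corpus))
```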
We first examine how the relative order of accuracy between modeling approaches