Nonparametric Decoding for Generative Retrieval

Hyunji Lee1, Jaeyoung Kim3*, Hoyeon Chang1, Hanseok Oh1, Sohee Yang1, Vlad Karpukhin2, Yi Lu2, Minjoon Seo1
1KAIST AI  2Forethought.AI  3Kakao
{hyunji.amy.lee, hanseok, sohee.yang, retapurayo, minjoon}@kaist.ac.kr
{vlad.karpukhin, yi.lu}@forethought.ai  jay.eong@kakaocorp.com
Abstract
Since the generative retrieval model depends solely on the information encoded in its model parameters without external memory, its information capacity is limited and fixed. To overcome this limitation, we propose Nonparametric Decoding (Np Decoding), which can be applied to existing generative retrieval models. Np Decoding uses nonparametric contextualized vocab embeddings (external memory) rather than vanilla vocab embeddings as decoder vocab embeddings. By leveraging the contextualized vocab embeddings, the generative retrieval model is able to utilize both the parametric and the nonparametric space. Evaluation over 9 datasets (8 single-hop and 1 multi-hop) in the document retrieval task shows that applying Np Decoding to generative retrieval models significantly improves the performance. We also show that Np Decoding is data- and parameter-efficient, and shows high performance in the zero-shot setting.1
1 Introduction
Text retrieval is often formulated as finding the most relevant items from a large corpus given an input query. The bi-encoder approach, which uses an encoder to map the documents and the query to a common vector space and performs a nearest neighbor search, has been a common practice in text retrieval tasks (Karpukhin et al., 2020; Wu et al., 2020; Ni et al., 2021). Despite its high performance and popularity, it suffers from an embedding space bottleneck (Luan et al., 2021; Lee et al., 2022): expressiveness is limited by the fixed-size embeddings, and fine-grained interaction between embeddings is lost because they interact only in L2 or inner product space. Moreover, the bi-encoder approach requires large storage space to save all document embeddings.
* Work done during internship at KAIST AI.
1 The code and datasets used in our work are at https://github.com/amy-hyunji/Contextualized-Generative-Retrieval.
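For context on the embedding space bottleneck described above, the following is a minimal sketch of bi-encoder retrieval with inner product scoring; the random document and query vectors are stand-ins, not the encoders or corpora used in this paper.

```python
import torch

# Hypothetical setup: doc_embs is a matrix of fixed-size document embeddings
# produced offline by a document encoder; query_emb comes from a query encoder.
num_docs, dim = 1000, 768
doc_embs = torch.randn(num_docs, dim)   # stand-in for precomputed document vectors
query_emb = torch.randn(dim)            # stand-in for the encoded query

# Bi-encoder retrieval: score every document by inner product with the query and
# return the top-k nearest neighbors. All query-document interaction happens through
# these fixed-size vectors, which is the source of the embedding space bottleneck.
scores = doc_embs @ query_emb           # (num_docs,)
topk = torch.topk(scores, k=5)
print(topk.indices.tolist())            # indices of the 5 highest-scoring documents
```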
A recently proposed alternative to the bi-encoder approach is the generative retrieval model (Cao et al., 2021; Tay et al., 2022; Bevilacqua et al., 2022; Lee et al., 2022; Wang et al., 2022; Lafferty and Zhai, 2003; Croft and Lafferty, 2010). It is an autoregressive model that retrieves the most relevant sequence by generating the target sequence (e.g., title, passage, document ID) token-by-token. It overcomes the embedding space bottleneck by interacting in the parametric space, and it is storage-efficient since it keeps no external memory. However, the information capacity of such fully parametric models tends to be bounded by their size, as they have to encode all information in their parameters (Tay et al., 2022; Roberts et al., 2020).
To this end, we propose Nonparametric Decoding (Np Decoding), a decoding method for generative retrieval models. It uses nonparametric contextualized vocab embeddings rather than vanilla vocab embeddings as decoder vocab embeddings. The contextualized vocab embeddings are output embeddings of an encoder that constructs a nonparametric dense vector space and are frozen during the training step, whereas the vanilla vocab embeddings are trainable model vocab embeddings that construct the parametric space of the model. Therefore, by using Np Decoding, the generative retrieval model does not have to rely solely on its own parameters but can utilize the surrounding information encoded in the contextualized vocab embeddings (external memory). Note that while it utilizes a dense vector space as in the bi-encoder approach, it does not have the embedding space bottleneck since it is a variant of the generative retrieval model, and it saves storage space by storing only clustering centroid embeddings (Section 3.5).
Figure 1: Np Decoding can be applied to any generative retrieval model by replacing the decoder vocab embeddings from the vanilla embedding matrix with the contextualized embedding matrix (CE). CE is composed of the output embeddings of the language model encoder (CE Encoder). Only the retrieval target sequences are added to CE; in this figure, the title (Cape Town) is used as the target sequence. Unlike vanilla vocab embeddings, the contextualized vocab embeddings that make up CE contain context information, and a single token can have multiple token embeddings. This creates a more expressive and fine-grained contextualized embedding space compared to the vanilla embedding space, as shown on the right side of the figure.

As shown in Figure 1, any generative retrieval model can incorporate Np Decoding by replacing the decoder vocab embeddings from the vanilla embedding matrix with the contextualized embedding matrix (CE) for both the training and the inference steps. With this replacement, Np Decoding has two key benefits over vanilla decoding. First, the generative retrieval model can utilize not only its parametric space but also a nonparametric space. The nonparametric space is constructed with the decoder vocab embeddings of Np Decoding (CE), nonparametric and context-aware embeddings that capture surrounding information. Second, CE allows a token to have multiple token embeddings, unlike vanilla vocab embeddings where a token has a unique embedding. Therefore, the decoder vocab embedding space of CE becomes more expressive and fine-grained (right side of Figure 1). Since having a well-constructed CE is important for achieving high performance, we propose three different encoders (CE Encoder) used to output the contextualized vocab embeddings added to CE (Section 3). We demonstrate that a CE Encoder trained with contrastive learning results in a significant increase in performance.
The main contributions of our paper are as follows:

- We propose Nonparametric Decoding (Np Decoding), a simple and novel decoding method that can be applied to all existing generative retrieval models. Experimental results over 9 datasets show that Np Decoding can significantly improve the performance of existing generative retrieval models by leveraging both the parametric and the nonparametric space: 4.4% R-precision improvement for single-hop and 5.4% Recall@2 improvement for multi-hop datasets.
- We present various CE Encoders and show that training the CE Encoder with contrastive learning further increases the performance by a large margin.
- We show that generative retrieval models with Np Decoding are data- and parameter-efficient, and show higher performance in a zero-shot setting.
2 Related Work
Generative Retrieval  Generative retrieval models retrieve relevant items by generating either the identifiers or the entire sequences of the items. GENRE (Cao et al., 2021) retrieves a document by generating its title with a constrained beam search. DSI (Tay et al., 2022) assigns a unique ID to each item in the corpus and retrieves the item by generating the ID of the most relevant document. SEAL (Bevilacqua et al., 2022) retrieves any span from any position in the corpus by using an FM-Index. GMR (Lee et al., 2022) retrieves the most relevant item by generating its whole sequence. Despite their high performance, since generative retrieval models rely solely on the information stored in their parameters, their information capacity is limited and fixed. To overcome this limitation, we propose Nonparametric Decoding (Np Decoding) for generative retrieval models. By replacing the decoder vocab embeddings with nonparametric contextualized vocab embeddings, the model is able to utilize not only the parametric space but also the nonparametric space of contextualized embeddings.
Memory Augmented Models  KNN-LM (Khandelwal et al., 2020), TRIME (Zhong et al., 2022), RAG (Lewis et al., 2020), and RETRO (Borgeaud et al., 2022) are memory-augmented models which use both the parametric space of the model and the nonparametric space of the external memory. KNN-LM improves LM performance by generating the next token through interpolation between the nearest neighbor distribution (based on distance in the contextualized embedding space) and the model vocab distribution, only during the inference step. TRIME extends this approach by also using the objective during the training step. RAG and RETRO first retrieve relevant texts from the external memory with a retriever and then generate the output based on the retrieved texts. Moreover, concurrent work NPM (Min et al., 2022) proposes a nonparametric masked language model which operates over the nonparametric distribution of the external memory. Generative retrieval models with Nonparametric Decoding also utilize the external memory, but rather than treating it as an external source, the memory is incorporated into the model by using it as the decoder vocab embeddings.
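As a rough illustration of the interpolation KNN-LM performs at inference (a sketch, not code from any of the cited works), the next-token distribution can be written as a mixture of a nearest-neighbor distribution and the model's vocab distribution; the tensors and the interpolation weight lam below are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def knnlm_next_token_probs(lm_logits, knn_distances, knn_token_ids, vocab_size, lam=0.25):
    """Sketch of KNN-LM style interpolation: p = lam * p_knn + (1 - lam) * p_lm."""
    p_lm = F.softmax(lm_logits, dim=-1)                      # (vocab_size,)

    # Turn retrieved-neighbor distances into a distribution over their tokens,
    # then scatter that probability mass onto the full vocabulary.
    neighbor_weights = F.softmax(-knn_distances, dim=-1)     # closer neighbors get more mass
    p_knn = torch.zeros(vocab_size)
    p_knn.index_add_(0, knn_token_ids, neighbor_weights)

    return lam * p_knn + (1 - lam) * p_lm

# Hypothetical usage with random stand-ins for the LM output and retrieved neighbors.
vocab_size = 100
probs = knnlm_next_token_probs(
    lm_logits=torch.randn(vocab_size),
    knn_distances=torch.rand(8),                   # distances of 8 retrieved neighbors
    knn_token_ids=torch.randint(0, vocab_size, (8,)),
    vocab_size=vocab_size,
)
assert torch.isclose(probs.sum(), torch.tensor(1.0))
```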
3 Nonparametric Decoding
Generative retrieval is the task of retrieving the most relevant retrieval target (e.g., title, passage, document identifier) by generating the target token-by-token when given an input query. The training objective of the generative retrieval model is to maximize

P((t_1, \cdots, t_n) \mid q) = \prod_{i=1}^{n} P(t_i \mid q, t_{<i})    (1)

where t denotes the tokens of the retrieval target and q is the input query. Such an approach has shown high performance while using a low storage footprint (Cao et al., 2021; Tay et al., 2022; Bevilacqua et al., 2022; Lee et al., 2022). However, it has a limitation in that the model depends solely on the information encoded in its own parameters. Thus, the performance is likely to be bounded by how much information can be stored in the model parameters (Tay et al., 2022; Roberts et al., 2020).
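As a minimal sketch of the objective in Equation 1, the sequence-level loss under teacher forcing is the sum of per-token negative log-likelihoods; the toy logits and token ids below are placeholders rather than the models or data used in the paper.

```python
import torch
import torch.nn.functional as F

def generative_retrieval_loss(step_logits, target_ids):
    """Negative log-likelihood of the target sequence, i.e. -log P(t_1..t_n | q),
    factorized as the sum of -log P(t_i | q, t_<i) over decoding steps."""
    log_probs = F.log_softmax(step_logits, dim=-1)                        # (n, vocab)
    token_log_probs = log_probs[torch.arange(len(target_ids)), target_ids]
    return -token_log_probs.sum()

# Hypothetical example: a 4-token target over a 50-token vocabulary.
logits = torch.randn(4, 50)           # stand-in for decoder outputs given query q and t_<i
target = torch.tensor([7, 3, 42, 9])  # stand-in for the tokenized retrieval target
print(generative_retrieval_loss(logits, target))
```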
To address this limitation, we propose a new decoding method called Nonparametric Decoding (Np Decoding) for generative retrieval. To incorporate Np Decoding into an existing generative retrieval model, the only amendment is to use the frozen contextualized vocab embeddings (external memory) rather than the vanilla vocab embeddings as the decoder vocab embeddings during each generation step (Figure 1). The embeddings are the output embeddings of an encoder when given a target sequence as input. Note that existing generative retrieval models such as GENRE and DSI utilize the pre-trained language model architecture as-is: vanilla vocab embeddings as the decoder vocab embeddings.
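To make the amendment concrete, here is a minimal sketch of what scoring against CE during one generation step could look like: the decoder hidden state produces logits over the frozen contextualized embedding matrix instead of the trainable vanilla vocab embeddings. The shapes and names below are illustrative assumptions, not the paper's exact implementation.

```python
import torch

hidden_dim = 768
num_contextualized_entries = 5000   # entries in CE; a token can appear several times

# Frozen external memory: contextualized token embeddings produced by the CE Encoder.
contextualized_embedding_matrix = torch.randn(num_contextualized_entries, hidden_dim)
contextualized_embedding_matrix.requires_grad_(False)

# One decoding step: the decoder hidden state scores every entry of CE,
# instead of scoring the vanilla (trainable) vocab embedding matrix.
decoder_hidden = torch.randn(1, hidden_dim)                       # stand-in decoder output
logits = decoder_hidden @ contextualized_embedding_matrix.T       # (1, num_entries)
next_entry = logits.argmax(dim=-1)                                # index into CE
print(next_entry.item())
```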
In Section 3.1, we show the key benefits of using Np Decoding over vanilla decoding. In Sections 3.2 to 3.4, we describe the details of base Np Decoding (BASE) and two variants (ASYNC, CONTRA). In Section 3.5, we describe how we reduce the number of contextualized token embeddings.
3.1 Key Benefits
Using Np Decoding has two key benefits over vanilla decoding. First, the generative retrieval model with Np Decoding can utilize not only the information encoded in its own parameters (parametric space) but also the surrounding information encoded in the contextualized vocab embeddings (nonparametric space) during each decoding step. Second, the generative retrieval model with Np Decoding has a more expressive and fine-grained decoder vocab embedding space than that of the model with vanilla decoding. As in Figure 1, Np Decoding allows a single token to have multiple contextualized token embeddings in the decoder vocab embeddings (e.g., the same token "Cape" has two different contextualized embeddings) depending on the surrounding information of the token, whereas vanilla decoding allows only a single token embedding per token. Note that we do not save all possible token embeddings, but reduce the number of embeddings to save, without performance degradation, by practical tactics (Section 3.5).
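One such tactic, mentioned in the introduction, is storing only clustering centroid embeddings (Section 3.5). Below is a hedged sketch of that idea, assuming k-means over the contextualized embeddings of each token; Section 3.5's exact procedure and hyperparameters may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def reduce_token_embeddings(token_embs, k=5):
    """Replace all contextualized embeddings of one token with k cluster centroids.
    token_embs: (num_occurrences, dim) array of that token's embeddings in CE."""
    if len(token_embs) <= k:              # few occurrences: keep them as-is
        return token_embs
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(token_embs)
    return km.cluster_centers_            # (k, dim) centroids stored instead of all embeddings

# Hypothetical example: the token "Cape" appears 200 times across target sequences.
cape_embs = np.random.randn(200, 768)
print(reduce_token_embeddings(cape_embs).shape)   # (5, 768)
```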
3.2 BASE Nonparametric Decoding
Figure 2: Token-level contrastive learning of CONTRA Np Decoding. Given a query ("how many players on a box lacrosse team") and a target sequence ("Box lacrosse"), we train T5 with token-level contrastive learning where all tokens of the target sequence are positive pairs and the rest of the tokens in CE are negative pairs.

In this work, we propose three different Np Decoding methods (BASE Nonparametric Decoding and two variants), which we name based on the characteristics of their Contextualized Embedding Encoders (CE Encoder). The CE Encoder is an encoder that outputs contextualized token embeddings when given a target sequence (e.g., title, document ID, passage) as input. The contextualized token embeddings are added to CE2, the decoder vocab embedding matrix of the generative retriever with Np Decoding. BASE Nonparametric Decoding (BASE) uses the most basic CE Encoder, the pre-trained T5 encoder as-is. CE is constructed once with the output embeddings of the CE Encoder before the generative retrieval training step. Note that during the training step of generative retrieval, the CE Encoder is frozen (Figure 1).

2 Details of how we construct CE for different target sequences are in Section 4.3.
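A minimal sketch of the BASE construction step described above, assuming a Hugging Face T5 encoder serves as the CE Encoder; the model name, the two toy target sequences, and the dictionary layout of CE are illustrative assumptions rather than the paper's exact pipeline.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Assumption: a plain pre-trained T5 encoder serves as the CE Encoder (BASE).
tokenizer = AutoTokenizer.from_pretrained("t5-base")
ce_encoder = T5EncoderModel.from_pretrained("t5-base").eval()

# CE maps a vocab token id to a list of contextualized embeddings (one per occurrence),
# so the same token can end up with multiple entries.
contextualized_embedding_matrix = {}

target_sequences = ["Cape Town", "Climate of South Africa"]   # hypothetical retrieval targets
with torch.no_grad():                                         # the CE Encoder stays frozen
    for target in target_sequences:
        inputs = tokenizer(target, return_tensors="pt")
        token_embs = ce_encoder(**inputs).last_hidden_state[0]        # (seq_len, hidden)
        for token_id, emb in zip(inputs["input_ids"][0].tolist(), token_embs):
            contextualized_embedding_matrix.setdefault(token_id, []).append(emb)
```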
3.3 ASYNC Nonparametric Decoding
Asynchronous Nonparametric Decoding (ASYNC) uses a CE Encoder that is asynchronously replaced every N epochs by the encoder of the generative retriever during the generative retrieval training step. By replacing the CE Encoder periodically, ASYNC has more coherency between the CE Encoder and the generative retriever than BASE. After every replacement (N epochs), we construct a new CE with the output embeddings of the replaced CE Encoder and resume training the generative retriever. Note that during the generative retrieval training step, the CE Encoder is frozen but simply replaced, and only the generative retriever is trainable. We keep N = 20 for all experiments. See Appendix C.3 for details on how N affects the performance.
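A hedged sketch of the ASYNC schedule, assuming the retriever exposes its encoder as retriever.encoder and that the hypothetical helpers build_ce and train_one_epoch stand in for CE construction and one epoch of generative retrieval training:

```python
import copy

N = 20  # the paper keeps N = 20 for all experiments

def async_np_decoding_training(retriever, ce_encoder, num_epochs, build_ce, train_one_epoch):
    """ASYNC: every N epochs, replace the frozen CE Encoder with a copy of the
    retriever's current encoder and rebuild CE from its output embeddings."""
    ce = build_ce(ce_encoder)                    # initial CE from the starting CE Encoder
    for epoch in range(num_epochs):
        if epoch > 0 and epoch % N == 0:
            ce_encoder = copy.deepcopy(retriever.encoder)   # replace, but keep it frozen
            for p in ce_encoder.parameters():
                p.requires_grad_(False)
            ce = build_ce(ce_encoder)            # construct a new CE, then resume training
        train_one_epoch(retriever, ce)           # only the generative retriever is updated
    return retriever
```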
3.4 CONTRASTIVE Nonparametric Decoding
CONTRASTIVE Nonparametric Decoding (CONTRA) uses a CE Encoder trained with token-level contrastive learning. The CE Encoder constructs CE, the nonparametric decoder vocab space of the generative retrieval model with Np Decoding. The token-level contrastive learning (Equation 2) is performed as an intermediate step before training T5 on the generative retrieval task (Equation 1). Bi-encoder retrieval models with a contrastive loss have shown high performance, as the model learns to construct a well-structured global embedding space and regularizes the space to be uniform (Ni et al., 2021; Gao et al., 2021b; Gao and Callan, 2022; Izacard et al., 2022). In a similar way, the CE Encoder trained with contrastive learning constructs a more meaningful dense vector space (the nonparametric space of the generative retriever) than the CE Encoder of BASE.

As in Figure 2, given a query, we train the first output embedding of the T5 decoder3 with all tokens of the target sequence as positive pairs and the rest of the tokens in CE4 as negative pairs. After training T5 with token-level contrastive learning, we construct CE with its encoder as the CE Encoder, and then further train the model on the generative retrieval task.
Step 1. Token-level Contrastive Learning
Given a training dataset of pairs {(q, t)}, where q is the query text and t is the retrieval target (e.g., the title of the document to retrieve) composed of multiple tokens t_i (1 ≤ i ≤ k, where k is the length of the target), we split the training dataset into k separate pairs {(q, t_i)} to construct a training dataset of query-token pairs. With the query-token dataset, we train the first output token embedding from the T5 decoder to be close to all token embeddings in T+ when given the query q as an input to the generative retriever (Figure 2). T+ is the set of positive token embeddings5 (tokens that make up one retrieval target), and T− is the set of negative token embeddings6 (all other token embeddings in CE). The objective is to minimize the contrastive loss:
\mathcal{L}(q, t^+_1, \cdots, t^+_{|T^+|}, t^-_1, \cdots, t^-_{|T^-|}) = -\log \frac{\sum_{t^+ \in T^+} e^{\langle q, t^+ \rangle}}{\sum_{t^+ \in T^+} e^{\langle q, t^+ \rangle} + \sum_{t^- \in T^-} e^{\langle q, t^- \rangle}}    (2)
where ⟨·, ·⟩ is the inner product between the two embeddings. We also experiment with a contrastive loss that uses a single token per target as the positive and an in-batch negatives loss (Appendix A.1), where the contrastive loss with multiple tokens
3 We use the embedding of the decoder (Ni et al., 2021), not the encoder, to initialize the generative retriever with both the encoder and the decoder trained on contrastive learning.
4 As we freeze the token embeddings (CE) and only train the T5, calculating over the entire embedding space is possible. The CE used in this step is constructed with the output embeddings of the pre-trained T5 encoder model.
5 T+ = {t+_1, ..., t+_k} (k = |T+|)
6 T− = {t−_1, ..., t−_{|T−|}}
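A minimal sketch of the token-level contrastive loss in Equation 2, written in log-sum-exp form for numerical stability; the query embedding is assumed to be the first T5 decoder output embedding, CE is assumed small enough to score in full, and the random tensors and indices are placeholders.

```python
import torch

def token_level_contrastive_loss(query_emb, ce, positive_ids):
    """Equation 2 in log-sum-exp form:
    -log( sum_{t+ in T+} e^<q,t+> / sum_{t in CE} e^<q,t> ),
    where the denominator is the positive sum plus the sum over all negatives in CE."""
    scores = ce @ query_emb                          # inner products <q, t> for every CE entry
    pos_lse = torch.logsumexp(scores[positive_ids], dim=0)
    all_lse = torch.logsumexp(scores, dim=0)         # positives and negatives together
    return all_lse - pos_lse                         # equals -log(pos / (pos + neg))

# Hypothetical example: a CE with 1000 frozen entries; the tokens of one target
# ("Box", "lacrosse") sit at indices 10 and 11 and form the positive set T+.
ce = torch.randn(1000, 64)
query_emb = torch.randn(64)   # stand-in for the first T5 decoder output embedding for the query
print(token_level_contrastive_loss(query_emb, ce, positive_ids=torch.tensor([10, 11])).item())
```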