
by these successes, our work investigates retrieval
augmentation for re-ranking using a fixed retrieval
component.
Query expansion and pseudo-relevance feedback (PRF).
In early work, Diaz and Metzler (2006) showed it is effective to incorporate information from an external corpus into a non-neural language modeling framework. We exploit such information when using a pre-trained language model for re-ranking by directly augmenting the original query with the top-ranked results from an external corpus. An orthogonal research direction is to improve re-ranking models by incorporating pseudo-relevance feedback (PRF) signals, as in (Li et al., 2018; Padaki et al., 2020; Zheng et al., 2020; Yu et al., 2021; Naseri et al., 2021). An essential component of these methods is identifying the relevant information in the pseudo-relevance feedback while avoiding topic shift. Moreover, these methods involve multiple expensive iterations to collect the PRF signals and then use them for re-ranking. In contrast, our model consumes high-quality external augmentation text and requires only a single iteration.
3 Method
We adopt Nogueira et al.’s method for re-ranking
with LLMs (Nogueira et al., 2019). Let $q$ be the query string, $d$ be the document string, and $y$ be a string that represents the binary relevance of a document, e.g., "True" or "False". We construct a (string) instance $x$ as,

$$ x = \text{``Query: } q \text{ Document: } d \text{ Relevant: } y \text{''} \qquad (1) $$

The model is trained to generate the final token (i.e., $y$) based on the ground-truth relevance of the query-document pair. To score a new query-document pair, the normalised score of the final token is used for re-ranking.
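As a concrete illustration of this scoring step, the following is a minimal sketch in Python, assuming a T5-style sequence-to-sequence model loaded through the Hugging Face transformers library; the checkpoint name, the lowercase "true"/"false" label tokens, and the single-token label assumption are illustrative choices and not necessarily the exact configuration used in this work.

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Illustrative checkpoint; in practice the re-ranker is fine-tuned on the
# (query, document, relevance) instances described above.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

def relevance_score(query: str, document: str) -> float:
    """Score a query-document pair by the normalised probability of the
    positive relevance token at the final (label) position."""
    x = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(x, return_tensors="pt", truncation=True, max_length=512)
    # Assumed label strings; take the first sub-token of each label.
    pos_id = tokenizer("true", add_special_tokens=False).input_ids[0]
    neg_id = tokenizer("false", add_special_tokens=False).input_ids[0]
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    # Normalise over the two label tokens; the positive probability is the score.
    probs = torch.softmax(logits[[pos_id, neg_id]], dim=0)
    return probs[0].item()

Documents in the target corpus are then re-ranked for a given query by sorting them in decreasing order of this score.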
We are interested in augmenting $x$ with information from an external corpus. We assume that access to the external corpus is mediated through a retrieval service $f$ such that $f(q) = [\sigma_1, \ldots, \sigma_m]$, where $\sigma_i$ is a retrieved passage (e.g., a web search snippet or an indexed passage). It is important to note that the retrieval service can only retrieve items from a given external corpus and cannot re-rank or re-score documents in the target corpus.
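For concreteness, the retrieval service can be treated as a black-box function from a query string to a ranked list of passages; the sketch below only captures this interface assumption, and the external_search stub is hypothetical, standing in for, e.g., a web search API or a passage index.

from typing import Callable, List

# The method only assumes a fixed service f with f(q) = [sigma_1, ..., sigma_m];
# any black-box retriever over the external corpus fits this signature.
RetrievalService = Callable[[str], List[str]]

def external_search(q: str, m: int = 10) -> List[str]:
    """Hypothetical stand-in for the external retrieval service. Returns the
    top-m passages (snippets or indexed passages) for the query."""
    raise NotImplementedError("plug in an actual retrieval backend here")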
We represent the information $f(q)$ as an augmenting string $\tilde{q}$. We can directly concatenate the $m$ passages to construct $\tilde{q}$; we refer to this as natural language expansion. Although we expect the natural language expansion to be more compatible with LLMs, the fixed capacity of LLM modeling can result in situations where informative text is obscured by 'linguistic glue' often discarded as stop words (Tay et al., 2020). Alternatively, we can extract the most salient topical terms from $f(q)$, as in (Dang and Croft, 2013). Specifically, we select terms using the KL2 method (Carpineto et al., 2001; Amati, 2003). In this method, we select $k$ terms from all of the terms in $f(q)$ based on each individual word's contribution to the KL divergence between the language model of $f(q)$ (denoted as $A$) and the corpus (denoted as $C$):
$$ w(t, A) = P(t \mid A) \log_2 \frac{P(t \mid A)}{P(t \mid C)} \qquad (2) $$
We estimate the corpus language model using the
target retrieval dataset. We refer to this as topical
term expansion. In both expansion methods, we
truncate the concatenated snippets, paragraphs, or
ordered set of topical words (according to Eq. 2) to
a maximum sequence length.
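A hedged sketch of the two expansion variants follows, assuming whitespace tokenisation, a maximum-likelihood unigram estimate for P(t|A), and a precomputed unigram model corpus_lm for P(t|C) estimated from the target retrieval dataset; the function names, truncation budget, and handling of unseen terms are illustrative rather than the paper's exact implementation.

import math
from collections import Counter
from typing import Dict, List

def natural_language_expansion(passages: List[str], max_tokens: int = 256) -> str:
    """Concatenate the m retrieved passages into the augmenting string q~,
    truncated to a maximum sequence length."""
    return " ".join(" ".join(passages).split()[:max_tokens])

def topical_term_expansion(passages: List[str], corpus_lm: Dict[str, float], k: int = 20) -> str:
    """Select the k terms of f(q) with the highest KL2 weight
    w(t, A) = P(t|A) * log2(P(t|A) / P(t|C))  (Eq. 2), ordered by weight."""
    tokens = " ".join(passages).lower().split()
    counts = Counter(tokens)
    total = sum(counts.values())
    weights = {}
    for t, n in counts.items():
        p_a = n / total
        p_c = corpus_lm.get(t, 0.0)
        if p_c <= 0.0:
            continue  # smoothing of terms unseen in the corpus model is left unspecified here
        weights[t] = p_a * math.log2(p_a / p_c)
    top = sorted(weights, key=weights.get, reverse=True)[:k]
    return " ".join(top)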
To incorporate retrieved information, represented as $\tilde{q}$ (the expansion terms), we add the text as a new subsequence ("Description") in $x$,

$$ x = \text{``Query: } q \text{ Description: } \tilde{q} \text{ Document: } d \text{ Relevant: } y \text{''} $$
Because we are representing instances as strings
with a terminal relevance label, we can easily adopt
the same re-ranking method as Nogueira et al.
(2019).
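As a small usage sketch, reusing the hypothetical helpers above, an augmented instance could be assembled as follows; the label string is only appended at training time.

from typing import Optional

def build_instance(q: str, q_tilde: str, d: str, y: Optional[str] = None) -> str:
    """Serialise a (query, expansion, document[, label]) tuple into the
    augmented string template used for training and scoring."""
    x = f"Query: {q} Description: {q_tilde} Document: {d} Relevant:"
    return f"{x} {y}" if y is not None else x

# e.g., with the natural language expansion of the external results:
# x = build_instance(q, natural_language_expansion(external_search(q)), d, y="true")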
4 Experiments
Training data.
We use two training datasets, namely, Natural Questions (NQ), originally proposed in (Kwiatkowski et al., 2019), and the MS MARCO passage re-ranking dataset (Nguyen et al., 2016). The NQ dataset includes 79k user queries from the Google search engine. We use the subset of NQ derived in (Karpukhin et al., 2020). The data has the form (question, passage, label), where only queries with short answers are included. The task is to retrieve and re-rank Wikipedia paragraphs, chunked into passages of at most 100 words, for the queries. In addition, we use the MS MARCO triplet training dataset (Nguyen et al., 2016), which includes 550k positive query-passage pairs. For validation purposes, we measure Success@20 (also called Hits@20) on the 8757 questions in the NQ