Decoding a Neural Retriever’s Latent Space for Query Suggestion
Leonard Adolphs† Michelle Chen Huebscher‡ Christian Buck‡
Sertan Girgin‡ Olivier Bachem‡ Massimiliano Ciaramita‡ Thomas Hofmann†
†ETH Zürich
ladolphs@inf.ethz.ch
‡Google Research
Abstract
Neural retrieval models have superseded classic bag-of-words methods such as BM25 as the retrieval framework of choice. However, neural systems lack the interpretability of bag-of-words models; it is not trivial to connect a query change to a change in the latent space that ultimately determines the retrieval results. To shed light on this embedding space, we learn a "query decoder" that, given a latent representation of a neural search engine, generates the corresponding query. We show that it is possible to decode a meaningful query from its latent representation and, when moving in the right direction in latent space, to decode a query that retrieves the relevant paragraph. In particular, the query decoder can be useful to understand "what should have been asked" to retrieve a particular paragraph from the collection. We employ the query decoder to generate a large synthetic dataset of query reformulations for MSMarco, leading to improved retrieval performance. On this data, we train a pseudo-relevance feedback (PRF) T5 model for the application of query suggestion that outperforms both query reformulation and PRF information retrieval baselines.
1 Introduction
Neural encoder models (Karpukhin et al., 2020; Ni et al., 2021; Izacard et al., 2021) have improved document retrieval in various settings. They have become an essential building block for applications in open-domain question answering (Karpukhin et al., 2020; Lewis et al., 2020b; Izacard and Grave, 2021), open-domain conversational agents (Shuster et al., 2021; Adolphs et al., 2021), and, recently, language modeling (Shuster et al., 2022). Neural encoders embed documents and queries in a shared (or joint) latent space, so that paragraphs can be ranked and retrieved based on their vector similarity with a given query. This constitutes a conceptually powerful approach to discovering semantic similarities between queries and documents, often more nuanced than the simple term-frequency statistics typical of classic sparse representations. However, such encoders may come with shortcomings in practice. First, they are prone to domain overfitting, failing to consistently outperform bag-of-words approaches on out-of-domain queries (Thakur et al., 2021). Second, they are notoriously hard to interpret, as relevance is no longer controlled by word overlap but by latent semantic similarity that lacks explainability. Third, they may be non-robust, as small changes in the query can lead to inexplicably different retrieval results.
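To make the retrieval setup concrete, the following is a minimal sketch of dual-encoder scoring over precomputed embeddings; it is illustrative only, not the exact GTR implementation, and the toy dimensions and random vectors are placeholders:

```python
import numpy as np

def rank_passages(query_vec: np.ndarray, passage_vecs: np.ndarray, top_k: int = 5):
    """Rank passages by dot-product similarity with the query embedding.

    query_vec: (d,) embedding of the query.
    passage_vecs: (n, d) embeddings of the passage collection,
        produced offline by the same (frozen) dual encoder.
    """
    scores = passage_vecs @ query_vec        # (n,) similarity scores
    order = np.argsort(-scores)[:top_k]      # indices of the top-k passages
    return order, scores[order]

# Toy example: 4 passages embedded in a 3-dimensional latent space.
rng = np.random.default_rng(0)
passages = rng.normal(size=(4, 3))
query = rng.normal(size=3)
print(rank_passages(query, passages, top_k=2))
```

Because ranking depends only on these latent vectors, any explanation of a retrieval result has to pass through the embedding space itself, which motivates the decoder introduced below.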
In bag-of-words models, it can be straightforward to modify a query to retrieve a given document: e.g., following insights from relevance feedback (Rocchio, 1971), by increasing the weight of terms contained in the target document (Adolphs et al., 2022; Huebscher et al., 2022). This approach is not trivially applicable to neural retrieval models, as it is unclear how an added term might change the latent code of a query.
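For intuition, here is a minimal sketch of such term reweighting over sparse bag-of-words representations, loosely following the Rocchio update; the weights `alpha` and `beta` and the toy token lists are illustrative assumptions:

```python
from collections import Counter

def rocchio_update(query_terms, relevant_doc_terms, alpha=1.0, beta=0.75):
    """Move a sparse query vector toward a relevant document (Rocchio-style).

    Both inputs are lists of tokens; the output maps each term to its new weight.
    """
    q = Counter(query_terms)
    d = Counter(relevant_doc_terms)
    terms = set(q) | set(d)
    return {t: alpha * q[t] + beta * d[t] for t in terms}

# Terms from the target document ("latent", "space") gain weight,
# which directly pushes the sparse query toward that document.
print(rocchio_update(["neural", "retrieval"],
                     ["neural", "latent", "space", "retrieval"]))
```

No such direct, term-level handle exists for a dense query embedding, which is exactly the gap addressed next.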
In this paper, we look into the missing link connecting latent codes back to actual queries. We thus propose to train a "query decoder", which maps embeddings in the shared query-document space to query strings, inverting the fixed encoder of the neural retriever (cf. Figure 1a). As we will show, such a decoder lets us find queries that are optimized to retrieve a given target document. It deciphers what information is in the latent code of a document and how to phrase a query to retrieve it.
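As a rough illustration of the idea, and not the exact architecture or training recipe used in this paper, a query decoder can be trained to reconstruct query text from frozen retriever embeddings; the toy GRU decoder, vocabulary size, and random stand-in data below are assumptions of the sketch:

```python
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    """Autoregressive decoder mapping a (frozen) retriever embedding
    back to query token ids. Sizes are toy values for the sketch."""

    def __init__(self, vocab_size=1000, embed_dim=768, hidden_dim=768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        self.init_proj = nn.Linear(embed_dim, hidden_dim)  # latent code -> h0
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, latent, target_ids):
        # Teacher forcing: condition on the latent code via the initial state.
        h0 = torch.tanh(self.init_proj(latent)).unsqueeze(0)  # (1, B, H)
        x = self.token_emb(target_ids[:, :-1])                # shifted inputs
        hidden, _ = self.rnn(x, h0)
        logits = self.out(hidden)                             # (B, T-1, V)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target_ids[:, 1:].reshape(-1))

# One toy training step: latent codes from the frozen retriever,
# paired with the token ids of the queries that produced them.
decoder = QueryDecoder()
latent = torch.randn(8, 768)               # stand-in for encoder outputs
queries = torch.randint(0, 1000, (8, 12))  # tokenized queries (BOS ... EOS)
loss = decoder(latent, queries)
loss.backward()
```

At inference time, any point in the shared space, whether it comes from a query, a document, or somewhere in between, can be fed to the decoder to obtain a query string.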
We use this model to explore the latent space of a state-of-the-art neural retrieval model, GTR (Ni et al., 2021). In particular, we leverage the structure of the latent space by traversing from the embedding of a specific query to its human-labeled gold paragraph and use our query decoder to generate reformulation examples from intermediate points along the path, as shown in Figure 1b.
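As an illustration of this traversal, the following minimal sketch decodes queries from points interpolated between a query embedding and its gold-paragraph embedding; `decode_query` is a hypothetical handle to the trained query decoder:

```python
import numpy as np

def traversal_points(query_vec, gold_passage_vec, steps=4):
    """Points on the straight line from the query embedding to the
    embedding of its gold paragraph (endpoints included)."""
    alphas = np.linspace(0.0, 1.0, steps + 1)
    return [(1 - a) * query_vec + a * gold_passage_vec for a in alphas]

# Hypothetical usage: `decode_query` maps a latent vector back to a query string.
# for z in traversal_points(q_vec, p_vec):
#     print(decode_query(z))
```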
We find that using this approach, we can generate a large synthetic dataset of query reformulations.