
$P$ documents, where $K < P \ll N$. Here, $K$ refers to the number of sampled documents, while $P$ represents the size of the pool of documents from which the top-$K$ documents are selected. This truncation provides two key advantages: i) it enables efficient caching or retention of document scores, as only $P$ documents need to be stored in memory, and ii) the value of $P$ serves as an exploration-exploitation threshold: a higher value of $P$ yields greater diversity in document sampling, promoting exploration, while a smaller value of $P$ ensures that, during training, all documents in the set $\mathcal{T}_\phi$ are more likely to be visited, facilitating exploitation of the available information.
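To make the truncation concrete, the sketch below (illustrative PyTorch, not the paper's implementation; the function name `sample_from_top_p_pool` and the plain multinomial sampling are assumptions) caches only the top-$P$ scores and samples $K$ documents from the resulting pool, with $P$ acting as the exploration-exploitation knob described above.

```python
import torch

def sample_from_top_p_pool(scores: torch.Tensor, p: int, k: int) -> torch.Tensor:
    """Sample K document indices from the pool of the top-P scored documents.

    scores: (N,) retrieval scores over the full corpus (assumed precomputed).
    p:      pool size P, with K < P << N.
    k:      number of documents K sampled per training step.
    """
    # Keep only the top-P documents: only these P scores need to be cached.
    top_p_scores, top_p_ids = scores.topk(p)

    # Renormalise over the pool and sample K documents without replacement.
    # A larger P spreads probability mass over more documents (exploration);
    # a smaller P concentrates it on a few documents (exploitation).
    probs = torch.softmax(top_p_scores, dim=-1)
    sampled = torch.multinomial(probs, num_samples=k, replacement=False)
    return top_p_ids[sampled]

# Example: a corpus of N = 10,000 documents, a pool of P = 100, K = 8 samples.
scores = torch.randn(10_000)
doc_ids = sample_from_top_p_pool(scores, p=100, k=8)
```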
Assuming the retrieval distributions are described by score functions $f_\theta : \Omega^2 \to \mathbb{R}$ and $f_\phi : \Omega^3 \to \mathbb{R}$, we define the truncated retrievers as:⁶

$$p_\theta(d \mid q) := \frac{\mathbb{1}[d \in \mathcal{T}_\phi]\,\exp f_\theta(d, q)}{\sum_{d' \in \mathcal{T}_\phi} \exp f_\theta(d', q)} \tag{8a}$$

$$r_\phi(d \mid a, q) := \frac{\mathbb{1}[d \in \mathcal{T}_\phi]\,\exp f_\phi(a, q, d)}{\sum_{d' \in \mathcal{T}_\phi} \exp f_\phi(a, q, d')} \tag{8b}$$
where $\mathcal{T}_\phi$ is the set of the top $P \le N$ documents ranked by the score $f_\phi(a, q, d)$. The score functions $f_\theta$ and $f_\phi$ can be implemented using BM25 and/or contextual vector representations extracted with pretrained language models such as DPR or ColBERT (Karpukhin et al., 2020; Khattab & Zaharia, 2020). For instance, using a dual-encoder model, $f_\theta(d, q) = \mathrm{BERT}_\theta(d)^\top \mathrm{BERT}_\theta(q)$ and $f_\phi(a, q, d) = \mathrm{BERT}_\phi([q; a])^\top \mathrm{BERT}_\phi(d)$, where $\mathrm{BERT}$ is the function that returns the output of a BERT model at the [CLS] token and $[\cdot\,;\cdot]$ is the concatenation operator.
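As a concrete illustration of eqs. (8a)-(8b) and the dual-encoder scores above, the following sketch uses Hugging Face `transformers` (an assumption, not the paper's code); the helper `encode_cls`, the `bert-base-uncased` checkpoint, and the plain string concatenation standing in for $[q; a]$ are all hypothetical choices.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert_theta = BertModel.from_pretrained("bert-base-uncased")  # parameterises f_theta
bert_phi = BertModel.from_pretrained("bert-base-uncased")    # parameterises f_phi

def encode_cls(model: BertModel, texts: list[str]) -> torch.Tensor:
    """Return the output of the BERT model at the [CLS] token for each text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0]  # shape (batch, hidden)

def truncated_log_softmax(scores: torch.Tensor) -> torch.Tensor:
    """Eqs. (8a)/(8b): normalise scores over the top-P pool T_phi only."""
    return scores - torch.logsumexp(scores, dim=-1, keepdim=True)

question = "What causes tides?"
answer = "The gravitational pull of the Moon."
pool_docs = ["Tides are caused by ...", "The Moon orbits the Earth ..."]  # top-P pool T_phi

# f_theta(d, q) = BERT_theta(d)^T BERT_theta(q)
f_theta = (encode_cls(bert_theta, pool_docs) @ encode_cls(bert_theta, [question]).T).squeeze(-1)

# f_phi(a, q, d) = BERT_phi([q; a])^T BERT_phi(d); [q; a] approximated by string concatenation
f_phi = (encode_cls(bert_phi, pool_docs) @ encode_cls(bert_phi, [question + " " + answer]).T).squeeze(-1)

log_p_theta = truncated_log_softmax(f_theta)  # log p_theta(d | q) over T_phi
log_r_phi = truncated_log_softmax(f_phi)      # log r_phi(d | a, q) over T_phi
```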
Retrieving the top $P$ documents is efficient when using Elasticsearch⁷ and/or Faiss (Johnson et al., 2021).
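For example, a minimal Faiss sketch of top-$P$ maximum-inner-product retrieval over precomputed document embeddings might look as follows (illustrative only; the exact-search index, corpus size, and variable names are assumptions):

```python
import faiss
import numpy as np

hidden = 768
# Precomputed BERT_theta(d) embeddings for all N corpus documents (random here).
doc_embeddings = np.random.randn(100_000, hidden).astype("float32")
query_embedding = np.random.randn(1, hidden).astype("float32")  # BERT_theta(q)

# Exact maximum-inner-product search; approximate indexes (IVF, HNSW, ...)
# can be substituted for larger corpora.
index = faiss.IndexFlatIP(hidden)
index.add(doc_embeddings)

P = 100
scores, doc_ids = index.search(query_embedding, P)  # top-P pool and its scores
```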
2.4. Applying VOD
In this paper, we show how to apply the VOD framework to multiple-choice ODQA. Nevertheless, VOD is general-purpose and designed for latent variable models defined on a discrete and finite space. In NLP, it applies to a wide range of settings, such as generative, extractive, and multiple-choice ODQA, as well as retrieval-augmented language modelling. A non-exhaustive list of examples can be found in Appendix E.
3. Related work
VOD aids the development of retrieval-augmented models for language modeling (LM) tasks. In this section, we review previous work on retrieval for LM and compare it to VOD (summarized with references in Table 1).
⁶ When $P > K$, evaluating the retriever density in eq. (8a) is generally intractable due to the sum over $P$ documents.
⁷ http://www.elastic.co/
Table 1. Deep retrievers in the literature, detailing whether training was end-to-end and posterior-guided (variational), as well as the size of the support during training.

| Method | Retriever training | End-to-end learning | Posterior-guided retriever | Support |
|---|---|---|---|---|
| DPR¹ | Supervised | ✗ | ✗ | – |
| ColBERT² | Supervised | ✗ | ✗ | – |
| Contriever³ | Self-supervised | ✗ | ✗ | – |
| FiD⁴ | Frozen DPR dual-encoder | ✗ | ✗ | – |
| RETRO⁵ | Frozen BERT dual-encoder | ✗ | ✗ | – |
| ORQA⁶ | Self-supervised + MLL* | (✓) | ✗ | top-K doc. |
| RAG⁷ | MLL* + frozen DPR doc. encoder | (✓) | ✗ | top-K doc. |
| REALM⁸ | Self-supervised + MLL* | ✓ | ✗ | top-K doc. |
| EMDR-2⁹ | Self-supervised + Expect.-Max. | ✓ | ✓ | top-K doc. |
| Hindsight¹⁰ | ColBERT init. + ELBO + MLL* | ✓ | ✓ | top-K doc. |
| VOD | Rényi variational bound | ✓ | ✓ | top-P doc.† |

¹ Karpukhin et al. (2020), ² Khattab et al. (2021), ³ Izacard et al. (2021), ⁴ Izacard & Grave (2020), ⁵ Borgeaud et al. (2021), ⁶ Lee et al. (2019), ⁷ Lewis et al. (2020), ⁸ Guu et al. (2020), ⁹ Sachan et al. (2021), ¹⁰ Paranjape et al. (2021). * MLL: marginal log-likelihood. † K ≤ P ≤ N (K: # of documents in a batch, N: corpus size, P: chosen).
Learning to search Retrieval-based training has gained much attention for improving pre-trained LMs. ORQA and Contriever proposed a self-supervised approach using contrastive learning to match a text passage with its context, which is widely adopted in pre-training to enable zero-shot retrieval (Inverse Cloze Task; Lee et al. (2019)). In contrast, DPR and ColBERT use supervised contrastive learning with questions paired to annotated documents. This method has sparked many retrieval-augmented attempts, such as FiD, RETRO, and RAG, to enhance auto-regressive LMs conditioned on a frozen retriever. ORQA and REALM, later followed by RAG, EMDR, Hindsight, and VOD, proposed optimizing both a retrieval component and a reader or language modelling component end-to-end by maximizing the marginal log-likelihood (MLL).
Posterior guided supervision Many efforts have been devoted to leveraging external knowledge with posterior-guided supervision. EMDR learns a retriever end-to-end with an Expectation-Maximization objective evaluated under the posterior distribution $p_\theta(d \mid a, q) \propto p_\theta(d \mid q)\, p_\theta(a \mid d, q)$, while Hindsight optimizes the variational lower bound (ELBO) evaluated under a target-aware approximate posterior $r_\phi(d \mid a, q)$. Among previous methods, Hindsight is the most akin to VOD, as both methods rely on maximizing a variational bound. Nonetheless, VOD introduces the more general Rényi variational bound, which allows modelling the sampling distribution explicitly. Ultimately, this more principled approach makes VOD more versatile and capable of handling a wider range of problems.
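For reference, a generic form of the Rényi variational bound, written in the notation of this section with Rényi order $\alpha \ge 0$, is sketched below; this is the standard form of such bounds, not necessarily the paper's exact objective or estimator:

$$\mathcal{L}_\alpha(a, q) \;=\; \frac{1}{1-\alpha} \log \mathbb{E}_{d \sim r_\phi(d \mid a, q)}\!\left[\left(\frac{p_\theta(a \mid d, q)\, p_\theta(d \mid q)}{r_\phi(d \mid a, q)}\right)^{1-\alpha}\right].$$

Setting $\alpha = 0$ recovers the marginal log-likelihood $\log p_\theta(a \mid q)$, while $\alpha \to 1$ recovers the ELBO, so the bound interpolates between the two objectives.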
Navigating large knowledge bases The large size of knowledge bases such as Wikipedia makes it computationally intractable to consider all $N$ documents when computing the MLL. To address this, all related methods rely on a strict truncation of the retriever to the top-$K$ cached documents. In contrast to these aforementioned approaches, which limit themselves to a fixed set of $K$ documents, we propose a truncated