
learning objective, which we describe below.
Fusion-in-Decoder (FiD)
The reader model is based on a pretrained language model (specifically, T5-large (Raffel et al., 2020)). Each retrieved passage $p_i$ ($i \in \{1, \ldots, N\}$) is concatenated with the question $q$ before being encoded by T5 to generate representations $[h^i_1, \ldots, h^i_m]$, where $m$ is the length of the $i$-th passage prepended with the question. All $N$ passages are then concatenated to form a single sequence, $[h^1_1, \ldots, h^1_m, \ldots, h^N_1, \ldots, h^N_m]$, with which the decoder interacts through cross-attention to generate the answer.2
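To make the encoding scheme concrete, the sketch below encodes each (question, passage) pair independently, concatenates the resulting hidden states into one long sequence, and lets the T5 decoder cross-attend over it. This is a minimal sketch assuming Hugging Face Transformers, not the authors' released implementation; the helper name fid_generate and the input formatting are ours.

```python
# Minimal FiD-style sketch (illustrative; not the released FiD code).
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

def fid_generate(question, passages, passage_len=200, answer_len=32):
    # 1) Encode each (question, passage) pair independently.
    texts = [f"question: {question} context: {p}" for p in passages]
    enc = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=passage_len, return_tensors="pt")
    with torch.no_grad():
        encoder_out = model.encoder(input_ids=enc.input_ids,
                                    attention_mask=enc.attention_mask)
    # encoder_out.last_hidden_state has shape (N, m, d): one row per passage.

    # 2) Concatenate the N passage representations into a single sequence
    #    [h^1_1, ..., h^1_m, ..., h^N_1, ..., h^N_m] of shape (1, N*m, d).
    hidden = encoder_out.last_hidden_state.reshape(1, -1, model.config.d_model)
    mask = enc.attention_mask.reshape(1, -1)

    # 3) The decoder cross-attends over the concatenated sequence.
    generated = model.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=hidden),
        attention_mask=mask,
        max_length=answer_len,
    )
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```

Because each passage is encoded separately, the encoder cost grows linearly in $N$ while the decoder still sees all passages jointly.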
We use the trained FiD (large) checkpoint provided by the authors for most of our analysis.3 When evaluating models with access to different numbers of passages, we re-train the FiD model (with pretrained weights loaded from T5-large) using 1, 5, 20, and 50 passages retrieved by DPR. Refer to Appendix A.2 for full model and training details.
Retrieval Augmented Generation (RAG)
RAG conditions on each retrieved evidence document individually to produce an answer, marginalizing the probability of producing an answer over all retrieved evidence documents.4 By applying this constraint, RAG is able to jointly train the reader and retriever, at the cost of ignoring interactions between evidence documents. FiD, in contrast, is able to model such interactions during decoding, while its reader and retriever are trained completely disjointly.
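Concretely, in the RAG-Sequence formulation the answer probability is a mixture over the retrieved documents, weighted by the retriever's document scores (notation below is ours, sketching the standard formulation):

$$ p(y \mid q) \approx \sum_{z \in \text{top-}N(q)} p_\eta(z \mid q) \prod_{t=1}^{T} p_\theta(y_t \mid q, z, y_{1:t-1}) $$

Here $p_\eta(z \mid q)$ is the retriever's score for document $z$ and $p_\theta$ is the generator. Each document contributes an independent term, which is what allows joint training of retriever and reader but rules out cross-document interactions.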
Recent work explored jointly training the reader and retriever in FiD (Izacard and Grave, 2020; Sachan et al., 2021; Yang and Seo, 2020), showing small gains. Table 1 summarizes different architectures, including two open book approaches (Karpukhin et al., 2020; Guu et al., 2020), one closed book approach (Roberts et al., 2020), and two retrieval-based generation approaches. As FiD is efficient and effective, we focus most of our analysis (Section 4, Appendix B) on it. We only report RAG results on a few of our main analyses to verify that the general trends of the FiD model hold for RAG (which they typically do).
2 We use the version proposed in Izacard and Grave (2020) with knowledge distillation from the reader.
3 https://github.com/facebookresearch/FiD
4 RAG also presents a model variant that marginalizes over the retrieved documents at each generated token, but it shows worse performance. We use the version at https://huggingface.co/facebook/rag-sequence-nq
2.2 Model Confidence Study
We analyze the model's confidence score, asking a more nuanced question: does the model's confidence in the gold answer decrease after we perturb the knowledge source? We compare the model's confidence on the same example before and after perturbation. We determine the confidence of the model using either (1) the generation probability of the answer (i.e., the product of the probabilities of generating each token conditioned on all previously generated tokens) or (2) the confidence score of a separately trained answer calibrator, which provides a score indicating the probability that the model correctly predicts the answer for each example. We train a binary calibrator following prior work (Kamath et al., 2020; Zhang et al., 2021), using the gradient boosting library XGBoost (Chen and Guestrin, 2016). The goal of the calibrator is to enable selective question answering, i.e., equipping models to decide when to abstain from answering. Given an input question $q$ and a learned model $M_\theta$, the calibrator predicts whether the predicted answer $\hat{y} = M_\theta(q)$ will match the annotated answer $y^*$. We follow the calibrator settings of prior work (Zhang et al., 2021); details can be found in Appendix A.1.
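The sketch below illustrates the two confidence measures: the answer's sequence-level generation probability computed from token log-probabilities, and a binary XGBoost calibrator trained to predict correctness. It is a minimal illustration under our assumptions, not the code used in our experiments; the feature set and helper names (answer_generation_probability, train_calibrator) are ours.

```python
# Sketch of the two confidence scores (illustrative only).
import numpy as np
import torch
import xgboost as xgb

def answer_generation_probability(model, tokenizer, input_text, answer_text):
    """Product of per-token probabilities of the answer under a seq2seq model."""
    inputs = tokenizer(input_text, return_tensors="pt")
    labels = tokenizer(answer_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(**inputs, labels=labels).logits        # (1, T, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs[0, torch.arange(labels.size(1)), labels[0]]
    return token_log_probs.sum().exp().item()                 # p(answer | input)

def train_calibrator(train_features, train_is_correct):
    """Binary calibrator predicting whether the predicted answer matches the gold answer.

    train_features: (n_examples, n_features) array, e.g. answer log-probability,
    retrieval scores, answer length (feature set here is illustrative).
    train_is_correct: 0/1 labels for exact-match correctness.
    """
    calibrator = xgb.XGBClassifier(n_estimators=200, max_depth=4,
                                   learning_rate=0.1, eval_metric="logloss")
    calibrator.fit(np.asarray(train_features), np.asarray(train_is_correct))
    return calibrator

# Confidence of a new example = predicted probability of being correct:
# confidence = calibrator.predict_proba(features)[:, 1]
```

Comparing either score on the same example before and after perturbing the evidence tells us whether the model's belief in the gold answer actually shifts.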
3 When do retrieval-based generation
models rely on parametric knowledge?
As an initial step in investigating whether retrieval-based generation models ground their answers in the retrieval corpus or in the pretrained language model's parametric knowledge, we evaluate whether models generate novel answers that are not present in the set of evidence documents. Unlike extractive QA models (Seo et al., 2017), generation-based approaches (Roberts et al., 2020; Izacard and Grave, 2021) do not require the evidence documents to contain the gold answer span. Thus, we first analyze whether they actually generate novel answer spans not found in the retrieved passages.
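The measurement itself is simple string containment over the retrieved passages; a minimal sketch follows (the normalization choices are ours and not necessarily those used in our evaluation scripts):

```python
# Sketch: fraction of predictions that are "novel" spans, i.e. not found
# verbatim in any retrieved passage (normalization choices are illustrative).
import string

def normalize(text):
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation)).strip()

def novel_span_rate(predictions, retrieved_passages):
    """predictions: list of answer strings; retrieved_passages: list of passage lists."""
    novel = 0
    for pred, passages in zip(predictions, retrieved_passages):
        if not any(normalize(pred) in normalize(p) for p in passages):
            novel += 1
    return novel / len(predictions)
```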
Table 2 reports how often models generate a span not found in the evidence passages, split by retrieval performance, on the NQ-Open (Kwiatkowski et al., 2019; Lee et al., 2019) and TriviaQA (Joshi et al., 2017) development sets. We observe that models typically copy a span from the evidence passages, generating novel spans for only 3.4%/6.2% of examples on NQ/TriviaQA for FiD and 20.2% for RAG on NQ. Even for the small subset of examples where the retrieved documents do not contain the answer string, FiD remains extractive