contains a given answer. However, these are only a weak signal of potential relevance and may encourage DPR to retrieve misleading documents. Secondly, the document retriever and answer generator are trained separately. To ensure that the answer generator sees relevant documents in training, systems can retrieve large numbers of documents (∼50+) (Gao et al., 2022; Gui et al., 2021), but at the cost of slower training and greater GPU usage, and with the risk of presenting misleading material to the answer generator.
Joint training of the retriever and answer generator offers a solution to these problems. The aim is twofold: (1) to improve the retrieval of documents truly relevant to providing a given answer; and (2) to reject documents with pseudo relevance but not actual relevance.
Retrieval Augmented Generation (RAG) (Lewis et al., 2020) has shown that end-to-end joint training of a DPR-based QA system can outperform baseline two-step systems. A notable feature of RAG is a loss function that incorporates marginalized likelihoods over retrieved documents, such that the training score of a document is increased whenever it improves prediction.
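Roughly, the RAG-Sequence objective of Lewis et al. (2020) marginalizes the generator likelihood over the top-k retrieved documents; the following is a sketch in generic notation, with z a retrieved document, p_η the retriever and p_θ the answer generator:

p(y \mid x) \;\approx\; \sum_{z \,\in\, \mathrm{top}\text{-}k\,(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\; p_\theta(y \mid x, z)

Maximizing log p(y|x) therefore raises the retrieval score p_η(z|x) of any document under which the generator assigns high likelihood to the gold answer y.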
However, in preliminary OK-VQA experiments we found that RAG did not perform well. Our investigation showed that a good portion of OK-VQA training questions are answerable in closed-book form (i.e. using pre-trained models such as T5 (Raffel et al., 2020)) with only the information extracted from the image, with the unintended consequence that the RAG loss function awards credit to documents that did not actually contribute to answering a question. We also found that difficult questions that are unanswerable with the knowledge available to retrieval were more prevalent in OK-VQA than in the Open QA datasets (e.g. Natural Questions (Kwiatkowski et al., 2019)) on which RAG was developed. In both scenarios, the RAG loss function makes counter-intuitive adjustments to the document scores used in training the retrieval model, which degrades VQA performance.
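To see how this can happen, consider the gradient of the marginalized log-likelihood sketched above (a standard decomposition, not the exact RAG implementation):

\nabla \log p(y \mid x) \;=\; \sum_{z} q(z)\,\big[\, \nabla \log p_\eta(z \mid x) + \nabla \log p_\theta(y \mid x, z) \,\big],
\qquad q(z) \;\propto\; p_\eta(z \mid x)\, p_\theta(y \mid x, z)

If the answer y is recoverable from the image and question alone, p_θ(y|x,z) is high for every retrieved document, so all of them receive retrieval credit irrespective of their relevance; if y is unanswerable from the retrieved knowledge, the weights q(z) are essentially noise. In neither case does the retriever receive a useful training signal.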
Motivated by these findings, we propose a novel neural-retrieval-in-the-loop framework for joint training of the retriever and the answer generator. We formulate a loss function that avoids sending misleading signals to the retrieval model in the presence of irrelevant documents. This formalism combines both pseudo relevance labels and model predictions to refine document scores in training. We find significantly better performance on OK-VQA compared to RAG. In this paper:
• We present a novel joint training framework, Retrieval Augmented Visual Question Answering (RA-VQA), for Knowledge Retrieval and Answer Generation that improves over RAG and two-step baseline systems based on DPR (Karpukhin et al., 2020).
• We investigate visually grounded features transformed into ‘language space’ and assess their contribution to OK-VQA performance.
• We study the role of document retrieval in KB-VQA and evaluate its interaction with retrieval-augmented generation. We also show that joint training makes retrieval more efficient, requiring relatively few (∼5) retrieved documents in training.
2 Related Work
Open-domain QA systems. These QA systems are designed to answer questions from datasets such as Natural Questions (Kwiatkowski et al., 2019). The knowledge needed to answer questions can be held in pre-trained models (Roberts et al., 2020), knowledge graphs (KGs) (Lin et al., 2019; Feng et al., 2020; Lv et al., 2020; Saffari et al., 2021), or document collections (Chen et al., 2017; Izacard and Grave, 2021; Guu et al., 2020; Lee et al., 2019; Lewis et al., 2020). In retrieval-based systems, differentiable retrieval can be combined with extractive question answering, as in REALM (Guu et al., 2020) and ORQA (Lee et al., 2019), as well as with generative answer generation, as in RAG (Lewis et al., 2020).
VQA Systems. Modelling vision and language is central to VQA. Models can aggregate visual and textual features via cross-modality fusion (Yu et al., 2018; Singh et al., 2019; Yu et al., 2019; Jiang et al., 2020; Guo et al., 2021). Systems can also be pre-trained on large vision-and-language collections (Jia et al., 2021) and then fine-tuned for VQA tasks (Tan and Bansal, 2019; Chen et al., 2020; Gan et al., 2020; Li et al., 2020b; Wang et al., 2022; Zhang et al., 2021; Li et al., 2021) with VQA datasets such as VQA 2.0 (Antol et al., 2015).
Knowledge-based VQA Systems. KB-VQA systems can access both structured data, such as ConceptNet and other KGs (Narasimhan et al., 2018a; Garderes et al., 2020; Li et al., 2020a; Wu et al., 2022; Marino et al., 2021), and unstructured data, such as Wikipedia passages (Wu et al., 2022; Gao