Retrieval Augmented Visual Question Answering with Outside Knowledge
Weizhe Lin
Department of Engineering
University of Cambridge
United Kingdom
wl356@cam.ac.uk
Bill Byrne
Department of Engineering
University of Cambridge
United Kingdom
bill.byrne@eng.cam.ac.uk
Abstract
Outside-Knowledge Visual Question Answering (OK-VQA) is a challenging VQA task that requires retrieval of external knowledge to answer questions about images. Recent OK-VQA systems use Dense Passage Retrieval (DPR) to retrieve documents from external knowledge bases, such as Wikipedia, but with DPR trained separately from answer generation, introducing a potential limit on the overall system performance. Instead, we propose a joint training scheme which includes differentiable DPR integrated with answer generation so that the system can be trained in an end-to-end fashion. Our experiments show that our scheme outperforms recent OK-VQA systems with strong DPR for retrieval. We also introduce new diagnostic metrics to analyze how retrieval and generation interact. The strong retrieval ability of our model significantly reduces the number of retrieved documents needed in training, yielding significant benefits in answer quality and computation required for training.
1 Introduction
Visual Question Answering (VQA) is a challenging problem that lies at the intersection of Computer Vision, Natural Language Processing, and Information Retrieval. The objective in VQA is to read an image and provide an answer to an accompanying question about the image content. Current approaches to VQA employ deep-learning-based systems to jointly understand images and text.
VQA is particularly challenging when the answer to the question is not directly available in the image. In Knowledge-based VQA (KB-VQA), the VQA system must access external knowledge sources to find a correct and complete answer. The Outside-Knowledge VQA task (OK-VQA) (Marino et al., 2019) consists of questions that require general knowledge and simple inference to answer (Fig. 1); such questions can be hard even for humans. Unlike other KB-VQA datasets (e.g. FVQA (Wang et al., 2017)) which provide an associated knowledge base, OK-VQA encourages using any outside knowledge in answering questions.
Figure 1: OK-VQA contains questions whose answer cannot be found within the image. (Example question: "How many teeth does this animal use to have?" Answer: 26.)
The need to adapt and refresh knowledge sources motivates the study of KB-VQA systems that can extract knowledge from both structured (e.g. ConceptNet (Speer et al., 2017)) and unstructured knowledge representations (e.g. Wikipedia passages). Recent designs (Luo et al., 2021; Gao et al., 2022) approach VQA in two distinct steps: (1) Knowledge Retrieval extracts documents from a large knowledge base; (2) Answer Generation produces an answer from these documents. Knowledge Retrieval can be done via Dense Passage Retrieval (DPR) (Karpukhin et al., 2020), which consists of a question encoder and a document encoder (both Transformer-based) that encode questions and documents into separate dense representations. The DPR system is trained to assign higher scores to documents intended to be helpful in answering questions, so that document sets can be retrieved and passed to Answer Generation.
Knowledge Retrieval based on DPR is powerful but has some readily observed limitations, particularly in model training. Firstly, whether a retrieved document is useful in answering a question cannot be easily determined, even if an answer is provided. Prior work (Qu et al., 2021; Luo et al., 2021) has addressed this problem using "Pseudo Relevance Labels" which are based on whether a document contains a given answer.
However, these are only a weak signal of potential relevance and may encourage DPR to retrieve misleading documents. Secondly, the document retriever and answer generator are trained separately. To ensure that the answer generator sees relevant documents in training, systems can retrieve large numbers of documents (50+) (Gao et al., 2022; Gui et al., 2021), but at the cost of slower training and more GPU usage, and also possibly presenting misleading material to the answer generator.
Joint training of the retriever and answer generator offers a solution to these problems. The aim is twofold: (1) to improve the retrieval of documents truly relevant to providing a given answer; and (2) to reject documents with pseudo relevance but not actual relevance.
Retrieval Augmented Generation (RAG) (Lewis et al., 2020) has shown that end-to-end joint training of a DPR-based QA system can outperform baseline two-step systems. A notable feature of RAG is a loss function that incorporates marginalized likelihoods over retrieved documents such that the training score of a document is increased whenever it improves prediction.
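For reference, the marginalized likelihood used by RAG (the RAG-Sequence variant of Lewis et al. (2020)) can be written as below; this is our paraphrase of their objective in this paper's notation ($p_\theta$ for the retriever, $p_\phi$ for the generator), not an equation from this work:

$$p_{RAG}(y \mid x) \approx \sum_{z \in \text{top-}K(p_\theta(\cdot \mid x))} p_\theta(z \mid x)\, p_\phi(y \mid x, z)$$

Maximizing $\log p_{RAG}(y \mid x)$ therefore raises the retrieval score of any document under which the generator assigns high probability to the reference answer, whether or not that document actually supplied the knowledge needed to produce it.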
However, in preliminary OK-VQA experiments we found that RAG did not perform well. Our investigations found that a good portion of OK-VQA training questions are answerable in closed-book form (i.e. using pre-trained models such as T5 (Raffel et al., 2020)) with information extracted only from the image, with the unintended consequence that the RAG loss function awards credit to documents that did not actually contribute to answering a question. We also found that difficult questions that are unanswerable with the knowledge available to retrieval were more prevalent in OK-VQA than in the Open QA datasets (e.g. Natural Questions (Kwiatkowski et al., 2019)) on which RAG was developed. In both of these scenarios, the RAG loss function leads to counter-intuitive adjustments to the document scores used in training the retrieval model, leading to decreased VQA performance.
Motivated by these findings, we propose a novel neural-retrieval-in-the-loop framework for joint training of the retriever and the answer generator. We formulate a loss function that avoids sending misleading signals to the retrieval model in the presence of irrelevant documents. This formalism combines both pseudo relevance labels and model predictions to refine document scores in training. We find significantly better performance on OK-VQA compared to RAG. In this paper:
We present a novel joint training framework, Retrieval Augmented Visual Question Answering (RA-VQA), for Knowledge Retrieval and Answer Generation that improves over RAG and two-step baseline systems based on DPR (Karpukhin et al., 2020).
We investigate visually grounded features transformed into 'language space' and assess their contribution to OK-VQA performance.
We study the role of document retrieval in KB-VQA and evaluate its interaction with retrieval-augmented generation. We also show that retrieval becomes more efficient in joint training, requiring retrieval of relatively few (5) documents in training.
2 Related Work
Open-domain QA systems. These QA systems are designed to answer questions from datasets such as Natural Questions (Kwiatkowski et al., 2019). The knowledge needed to answer questions can be in pre-trained models (Roberts et al., 2020), knowledge graphs (KGs) (Lin et al., 2019; Feng et al., 2020; Lv et al., 2020; Saffari et al., 2021) or document collections (Chen et al., 2017; Izacard and Grave, 2021; Guu et al., 2020; Lee et al., 2019; Lewis et al., 2020). In retrieval-based systems, differentiable retrieval can be combined with extractive question answering, as in REALM (Guu et al., 2020) and ORQA (Lee et al., 2019), as well as with generative answer generation, as in RAG (Lewis et al., 2020).
VQA Systems. Modelling vision and language is central to VQA. Models can aggregate visual and textual features via cross-modality fusion (Yu et al., 2018; Singh et al., 2019; Yu et al., 2019; Jiang et al., 2020; Guo et al., 2021). Systems can also be pre-trained on large vision-and-language collections (Jia et al., 2021) and then fine-tuned for VQA tasks (Tan and Bansal, 2019; Chen et al., 2020; Gan et al., 2020; Li et al., 2020b; Wang et al., 2022; Zhang et al., 2021; Li et al., 2021) with VQA datasets such as VQA 2.0 (Antol et al., 2015).
Knowledge-based VQA Systems. KB-VQA systems can access structured data, such as ConceptNet and other KGs (Narasimhan et al., 2018a; Garderes et al., 2020; Li et al., 2020a; Wu et al., 2022; Marino et al., 2021), as well as unstructured data such as Wikipedia passages (Wu et al., 2022; Gao et al., 2022; Gui et al., 2021). A variety of multi-modal approaches have been explored to access external knowledge. ConceptBERT (Garderes et al., 2020) uses attention to aggregate graph node embeddings from ConceptNet. KRISP (Marino et al., 2021) uses a "symbolic knowledge module" to match ConceptNet KG entities with language/visual elements in questions. MAVEx (Wu et al., 2022) uses multiple information sources (Google Images, Wikipedia sentences, and ConceptNet) to validate promising answer candidates. VRR (Luo et al., 2021) uses Google Search in a retriever-reader pipeline to perform open-ended answer generation.
We also note unpublished contemporaneous work on OK-VQA at the time of submission. TRiG (Gao et al., 2022) shows that it is feasible to transform images into textual features for VQA. The features used are similar to those presented here, although without an emphasis on the role of knowledge retrieval. PICa (Yang et al., 2022) 'prompts' GPT-3 with descriptive captions generated from images, and KAT (Gui et al., 2021) exploits an ensemble of DPR, T5, and GPT-3 to improve OK-VQA performance.
3 Methodology
We present our RA-VQA framework that consists of: (1) Vision-to-Language Transformation (Sec. 3.1); (2) Weakly-supervised Dense Passage Retrieval (Sec. 3.2); (3) Joint Training of Retrieval and Answer Generation (Sec. 3.3).
3.1 Vision-to-Language Transformation
Prior work has established that images can be transformed into text such that large pre-trained language-based Transformers (e.g. BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and T5) can be applied to VQA tasks (Luo et al., 2021; Yang et al., 2022). Systems can be based on straightforward image captions, but we have found improvements by introducing additional visually grounded features. In RA-VQA, each image is represented by visual objects and their attributes, an image caption, and any text strings detected within the image. We use the object detection model VinVL (Zhang et al., 2021), pre-trained on large object detection datasets, to extract visual elements and their attributes (e.g. color and material).
Formally, for an image $I$ we use VinVL to extract a set of visual objects $\{o_i\}$, along with a set of text attributes $\{a_{i,j}\}$ for each visual object. Visual objects and their attributes are extracted by VinVL at confidence thresholds of 0.8 and 0.6, respectively.
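The thresholding step is simple; a minimal sketch is shown below, assuming (hypothetically) that VinVL detections are available as (label, confidence) pairs with per-object attribute lists. The actual VinVL output format may differ.

OBJ_THRESHOLD = 0.8   # confidence threshold for visual objects
ATTR_THRESHOLD = 0.6  # confidence threshold for object attributes

def filter_detections(objects, attributes):
    """Keep objects detected with confidence >= 0.8 and, for each kept object,
    attributes with confidence >= 0.6.

    objects:    list of (label, confidence) tuples, e.g. [("bench", 0.93), ...]
    attributes: list, aligned with objects, of lists of (attribute, confidence).
    """
    kept = []
    for (label, obj_conf), attrs in zip(objects, attributes):
        if obj_conf < OBJ_THRESHOLD:
            continue
        kept_attrs = [a for a, conf in attrs if conf >= ATTR_THRESHOLD]
        kept.append((label, kept_attrs))
    return kept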
Image captioning is performed to extract relationships and interactions among visual elements, such as "a woman holding a knife cuts a cake". The pre-trained captioning model Oscar+ (Zhang et al., 2021) is applied to the visual features extracted by the VinVL model to generate a caption for the image. To answer questions related to text strings in images (e.g. "which language is the book written in?"), Google OCR (Optical Character Recognition) APIs are used to extract text strings from each image.
Hence, a VQA training set $\{(I, q, S)\}$, where $S$ is a set of answers to a question $q$ about $I$, can be transformed into a text-only training set $T = \{(x, S)\}$ that we use for RA-VQA. The string $x$ contains all the text features extracted from the image (the question, the textual attributes for each identified visual object, the generated caption, and any OCR'd text), with special tokens marking the start and end of each type of feature (Fig. 2).
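As an illustration, the query string $x$ can be assembled as in the sketch below. The special tokens (<BOQ>/<EOQ> for the question, <BOC>/<EOC> for the caption, and <BOV>/<SOV>/<EOV> around object attributes and OCR text) follow the example shown in Fig. 2; the exact token inventory and feature ordering here are assumptions of this sketch rather than a specification from the paper.

def build_query_string(question, caption, objects_with_attrs, ocr_texts):
    """Concatenate all textual image features into a single query string x.

    objects_with_attrs: list of (label, [attributes]) tuples from object detection.
    ocr_texts:          list of OCR'd strings (possibly empty).
    """
    parts = [f"<BOQ> {question} <EOQ>",
             f"<BOC> {caption} <EOC>"]
    object_strings = [" ".join(attrs + [label]) for label, attrs in objects_with_attrs]
    parts.append("<BOV> " + " <SOV> ".join(object_strings + ocr_texts) + " <EOV>")
    return " ".join(parts)

# Example built from the image shown in Fig. 2:
x = build_query_string(
    question="What weather phenomenon most likely happened?",
    caption="a man sitting on a bench in a flooded park.",
    objects_with_attrs=[("bench", ["wood", "brown", "red"]),
                        ("tree", ["large", "tall", "green"])],
    ocr_texts=[])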
3.2 Weakly-supervised Dense Passage Retrieval
Dense Passage Retrieval in RA-VQA consists of a query encoder $F_q$ and a document encoder $F_d$, both Transformer-like encoders. The aim is to retrieve $K$ documents from an external knowledge database $Z = \{z_i\}_{i=1}^{N_d}$ (e.g. Wikipedia passages) that are expected to be useful for answering a question. DPR encodes questions and documents separately into dense feature vectors $F_q(x) \in \mathbb{R}^h$ and $F_d(z) \in \mathbb{R}^h$. Documents are retrieved for each question using a scoring function, the inner product between the representations of $x$ and $z$:

$$r(x, z) = F_q(x)^\top F_d(z) \qquad (1)$$
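A minimal PyTorch sketch of this bi-encoder scoring (Eq. 1), assuming each encoder maps its input to a single h-dimensional vector (e.g. the [CLS] embedding of a BERT-style model); the encoders themselves are omitted:

import torch

def retrieval_scores(query_emb, doc_embs):
    """Inner-product scores r(x, z) between one query and a set of documents.

    query_emb: tensor of shape (h,)      -- F_q(x)
    doc_embs:  tensor of shape (N_d, h)  -- F_d(z) for every document z in Z
    returns:   tensor of shape (N_d,) holding r(x, z_i) for each document
    """
    return doc_embs @ query_emb

# Retrieve the K highest-scoring documents (random embeddings used for illustration).
K = 5
scores = retrieval_scores(torch.randn(768), torch.randn(1000, 768))
topk_scores, topk_ids = scores.topk(K)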
RA-VQA training aims to maximize $r(x, z)$ when document $z$ is relevant to answering the question. As discussed in Sec. 1, the relevance between $q$ and $z$ cannot be easily obtained, and "pseudo relevance labels" serve as a proxy. We use a pseudo relevance function $H(z, S)$ which is 1 if $z$ contains an answer in $S$ (by string match), and 0 otherwise.
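A sketch of this pseudo relevance function as described (simple substring matching; the case normalization here is an assumption, since the paper does not spell out matching details):

def pseudo_relevance(document, answers):
    """H(z, S): 1 if the document text contains any answer string in S, else 0."""
    doc = document.lower()
    return int(any(ans.lower() in doc for ans in answers))

# Example with the answer set from Fig. 2:
pseudo_relevance("the most common cause of flooding is water due to rain or snowmelt ...",
                 ["flood", "hurricane", "rain"])  # -> 1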
For each question-answer pair $(x, S)$ one positive document $z^+(x)$ is extracted for training. In-batch negative sampling is used: all documents in a training batch other than $z^+(x)$ are considered to be negative for $(x, S)$ (Karpukhin et al., 2020).
Figure 2: Model overview. (1) Image-to-Text Transform: object detection, image captioning, and Optical Character Recognition transform visual signals into language space. (2) Dense Passage Retrieval retrieves documents that are expected to be helpful from the knowledge database. (3) Joint training: the retriever $p_\theta$ and the answer generator $p_\phi$ are trained together using our proposed RA-VQA loss. (4) Prediction: the answer with the highest joint probability $p_\theta(z_i|x)p_\phi(y_i|x, z_i)$ is selected.
Denoting the negative documents as $N(x, S)$ and the score of the positive document as $r^+(x)$ leads to the DPR loss $L_{DPR}$:

$$L_{DPR} = -\sum_{(x,S) \in T} \log \frac{\exp(r^+(x))}{\exp(r^+(x)) + \sum_{z \in N(x,S)} \exp(r(x, z))} \qquad (2)$$
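A sketch of this contrastive objective with in-batch negatives, assuming one positive document per query and a batch of query/positive-document embedding pairs; this mirrors the standard DPR recipe of Karpukhin et al. (2020) rather than code released with this paper:

import torch
import torch.nn.functional as F

def dpr_loss(query_embs, pos_doc_embs):
    """In-batch-negative DPR loss (Eq. 2).

    query_embs:   (B, h) embeddings F_q(x) for a batch of B queries
    pos_doc_embs: (B, h) embeddings F_d(z+(x)) of each query's positive document
    Every other positive document in the batch serves as a negative for a query.
    """
    scores = query_embs @ pos_doc_embs.T                          # (B, B): r(x_i, z_j)
    targets = torch.arange(scores.size(0), device=scores.device)  # diagonal holds r+(x_i)
    return F.cross_entropy(scores, targets)                       # mean of -log softmax rows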
3.3 RA-VQA: Joint Training of Document Retrieval and Answer Generation
Given a full query string $x$ extracted from the image-question pair $(I, q)$, DPR returns the $K$ highest scoring documents $\{z_k\}_{k=1}^{K}$. The score assigned by the document retriever $p_\theta(\cdot|x)$ to a retrieved document is

$$p_\theta(z_k|x) = \frac{\exp(r(x, z_k))}{\sum_{j=1}^{K} \exp(r(x, z_j))} \qquad (3)$$
Open-ended answer generation for each retrieved document $z_k$ is performed with a generative model, such as T5, with parameters $\phi$:

$$y_k = \operatorname{argmax}_{y}\, p_\phi(y \mid x, z_k) \qquad (4)$$
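A sketch of these two steps at inference time, assuming a Hugging Face-style tokenizer and seq2seq generator (e.g. T5); the toolkit and decoding settings are illustrative assumptions, not a description of the authors' implementation:

import torch

def document_posteriors(topk_scores):
    """p_theta(z_k | x): softmax over the K retrieval scores (Eq. 3)."""
    return torch.softmax(topk_scores, dim=-1)

def answers_per_document(generator, tokenizer, x, documents):
    """Greedy answer y_k for each retrieved document z_k (Eq. 4),
    conditioning the generator on the concatenation of x and z_k."""
    answers = []
    for z in documents:
        inputs = tokenizer(x + " " + z, return_tensors="pt", truncation=True)
        output_ids = generator.generate(**inputs, max_new_tokens=10)
        answers.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    return answers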
For each document $z_k$ retrieved for a training item $(x, S)$, we train the answer generator to produce the answer string $s^\star_k$ from the concatenation of $x$ and $z_k$ (as shown in Fig. 2). We select the most popular[1] human response $s^\star_k$ from $S$ such that $s^\star_k$ is contained in $z_k$; in the case that $z_k$ does not contain any answer, the most popular answer $s^\star \in S$ is selected: $s^\star_k = s^\star$. Through this design, we customize the generation target $s^\star_k$ for each retrieved document instead of training all $(x, z_k)$ pairs towards the most popular human response $s^\star$. This has been shown to improve system performance (Appendix B.1).

[1] There are 5 annotators for each OK-VQA question. The popularity of an answer is measured by the number of annotators who voted for it.
We identify two subsets of the retrieved documents $\{z_k\}_{k=1}^{K}$ based on pseudo relevance labels and model predictions:

$$P^+(x, S) = \{k : y_k = s^\star_k \lor H(z_k, S) = 1\}; \quad P^-(x, S) = \{k : y_k \neq s^\star_k \land H(z_k, S) = 0\} \qquad (5)$$

$P^+$ are indices of pseudo relevant documents that also help the model generate popular answers, whereas $P^-$ identifies documents not expected to benefit answer generation. In joint training, we intend to increase the scores of documents in $P^+$ while decreasing the scores for those in $P^-$. $z_k$ will be put into the negative set if it does not contain any answer ($H(z_k, S) = 0$) and the generation is incorrect ($y_k \neq s^\star_k$).[2] This is motivated by our intention to reduce scores for those documents that contain no answers and fail to answer questions.

Formally, joint training of retrieval and answer generation is achieved with a loss $L_{RA\text{-}VQA}$ that reflects both model predictions and pseudo relevance:

$$L_{RA\text{-}VQA} = -\sum_{(x,S) \in T} \Big[ \sum_{k=1}^{K} \log p_\phi(s^\star_k \mid x, z_k) + \sum_{k \in P^+(x,S)} \log p_\theta(z_k \mid x) - \sum_{k \in P^-(x,S)} \log p_\theta(z_k \mid x) \Big] \qquad (6)$$

[2] Note that in this case $H(z_k, S) = 0$ already implies that $z_k$ does not contain any answer and thus $s^\star_k = s^\star$.
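A sketch of the per-example RA-VQA loss (Eq. 6) for one training item, assuming the per-document quantities have already been computed; variable names are illustrative and the P+/P- membership follows Eq. 5 as reconstructed above:

import torch

def ra_vqa_loss(gen_logprobs, doc_logprobs, answer_correct, pseudo_relevant):
    """Per-example RA-VQA loss (Eq. 6); sum over the training set for the full loss.

    gen_logprobs:    (K,) log p_phi(s*_k | x, z_k) for each retrieved document
    doc_logprobs:    (K,) log p_theta(z_k | x) from Eq. 3
    answer_correct:  (K,) bool tensor, y_k == s*_k
    pseudo_relevant: (K,) bool tensor, H(z_k, S) == 1
    """
    in_pos = answer_correct | pseudo_relevant          # indices in P+
    in_neg = (~answer_correct) & (~pseudo_relevant)    # indices in P-
    retrieval_term = doc_logprobs[in_pos].sum() - doc_logprobs[in_neg].sum()
    return -(gen_logprobs.sum() + retrieval_term)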