contains a given answer. However, these are only a weak signal of potential relevance and may encourage DPR to retrieve misleading documents. Secondly, the document retriever and answer generator are trained separately. To ensure that the answer generator sees relevant documents in training, systems can retrieve large numbers of documents (∼50+) (Gao et al., 2022; Gui et al., 2021), but at the cost of slower training and greater GPU usage, and with the risk of presenting misleading material to the answer generator.
Joint training of the retriever and answer generator offers a solution to these problems. The aim is twofold: (1) to improve the retrieval of documents truly relevant to providing a given answer; and (2) to reject documents with pseudo relevance but not actual relevance.
Retrieval Augmented Generation (RAG) (Lewis et al., 2020) has shown that end-to-end joint training of a DPR-based QA system can outperform baseline two-step systems. A notable feature of RAG is a loss function that incorporates marginalized likelihoods over retrieved documents, such that the training score of a document is increased whenever it improves prediction.
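Roughly, the RAG-Sequence objective of Lewis et al. (2020) marginalizes the generator likelihood over the top-k retrieved documents; the following is a sketch in generic notation, with z a retrieved document, p_η the retriever and p_θ the answer generator:

p(y \mid x) \;\approx\; \sum_{z \,\in\, \mathrm{top}\text{-}k\,(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\; p_\theta(y \mid x, z)

Maximizing log p(y|x) therefore raises the retrieval score p_η(z|x) of any document under which the generator assigns high likelihood to the gold answer y.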
However, in preliminary OK-VQA experiments we found that RAG did not perform well. Our investigation showed that a good portion of OK-VQA training questions are answerable in closed-book form (i.e. using pre-trained models such as T5 (Raffel et al., 2020)) with only the information extracted from the image, with the unintended consequence that the RAG loss function awards credit to documents that did not actually contribute to answering a question. We also found that difficult questions that are unanswerable with the knowledge available to retrieval were more prevalent in OK-VQA than in the Open QA datasets (e.g. Natural Questions (Kwiatkowski et al., 2019)) on which RAG was developed. In both scenarios, the RAG loss function makes counter-intuitive adjustments to the document scores used in training the retrieval model, which degrades VQA performance.
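To see how this can happen, consider the gradient of the marginalized log-likelihood sketched above (a standard decomposition, not the exact RAG implementation):

\nabla \log p(y \mid x) \;=\; \sum_{z} q(z)\,\big[\, \nabla \log p_\eta(z \mid x) + \nabla \log p_\theta(y \mid x, z) \,\big],
\qquad q(z) \;\propto\; p_\eta(z \mid x)\, p_\theta(y \mid x, z)

If the answer y is recoverable from the image and question alone, p_θ(y|x,z) is high for every retrieved document, so all of them receive retrieval credit irrespective of their relevance; if y is unanswerable from the retrieved knowledge, the weights q(z) are essentially noise. In neither case does the retriever receive a useful training signal.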
Motivated by these findings, we propose a novel neural-retrieval-in-the-loop framework for joint training of the retriever and the answer generator. We formulate a loss function that avoids sending misleading signals to the retrieval model in the presence of irrelevant documents. This formalism combines both pseudo relevance labels and model predictions to refine document scores in training. We find significantly better performance on OK-VQA compared to RAG. In this paper:
• We present a novel joint training framework, Retrieval Augmented Visual Question Answering (RA-VQA), for Knowledge Retrieval and Answer Generation that improves over RAG and two-step baseline systems based on DPR (Karpukhin et al., 2020).
• We investigate visually grounded features transformed into ‘language space’ and assess their contribution to OK-VQA performance.
• We study the role of document retrieval in KB-VQA and evaluate its interaction with retrieval-augmented generation. We also show that joint training makes retrieval more efficient, requiring relatively few (∼5) retrieved documents in training.
2 Related Work
Open-domain QA systems. These QA systems are designed to answer questions from datasets such as Natural Questions (Kwiatkowski et al., 2019). The knowledge needed to answer questions can be held in pre-trained models (Roberts et al., 2020), knowledge graphs (KGs) (Lin et al., 2019; Feng et al., 2020; Lv et al., 2020; Saffari et al., 2021), or document collections (Chen et al., 2017; Izacard and Grave, 2021; Guu et al., 2020; Lee et al., 2019; Lewis et al., 2020). In retrieval-based systems, differentiable retrieval can be combined with extractive question answering, as in REALM (Guu et al., 2020) and ORQA (Lee et al., 2019), as well as with generative answer generation, as in RAG (Lewis et al., 2020).
VQA Systems. Modelling vision and language is central to VQA. Models can aggregate visual and textual features via cross-modality fusion (Yu et al., 2018; Singh et al., 2019; Yu et al., 2019; Jiang et al., 2020; Guo et al., 2021). Systems can also be pre-trained on large vision-and-language collections (Jia et al., 2021) and then fine-tuned for VQA tasks (Tan and Bansal, 2019; Chen et al., 2020; Gan et al., 2020; Li et al., 2020b; Wang et al., 2022; Zhang et al., 2021; Li et al., 2021) with VQA datasets such as VQA 2.0 (Antol et al., 2015).
Knowledge-based VQA Systems. KB-VQA systems can access both structured data, such as ConceptNet and other KGs (Narasimhan et al., 2018a; Garderes et al., 2020; Li et al., 2020a; Wu et al., 2022; Marino et al., 2021), and unstructured data, such as Wikipedia passages (Wu et al., 2022; Gao