age. For example, as shown in the bottom of Figure
1, both passages mention the correct answer “vegetable,” but only the second one, which focuses on the critical entity “bell pepper,” is question-relevant.
In order to address these shortcomings, we propose an Entity-Focused Retrieval (EnFoRe) model
that improves the quality of the positive passages
for stronger supervision. EnFoRe automatically
identifies critical entities for the question and then
retrieves knowledge focused on them. We define critical entities as those that improve a sparse retriever’s performance when emphasized during retrieval. We use the top-ranked passages containing both the critical entities and the correct answer as positive supervision. Our EnFoRe model then learns two scores: (1) the importance of each entity given the question and the image, and (2) how well each entity fits the context of each candidate passage.
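As an illustration, these two scores can be combined to rank candidate passages. The sketch below uses hypothetical names and a simple weighted sum; it is not the paper’s actual implementation, only a minimal picture of how an entity-importance score can weight an entity–passage fit score:

```python
def score_passage(entity_importance, entity_fit):
    """Illustrative passage score: each critical entity's fit to the
    passage, weighted by its importance to the question and image.

    entity_importance: {entity: importance given question + image}
    entity_fit: {entity: fit of the entity to this passage}
    Entities absent from the passage contribute zero.
    """
    return sum(imp * entity_fit.get(entity, 0.0)
               for entity, imp in entity_importance.items())

# A passage focused on the critical entity "bell pepper" outranks a
# generic passage that only mentions the answer "vegetable".
importance = {"bell pepper": 0.9, "vegetable": 0.4}
focused_passage = {"bell pepper": 0.8, "vegetable": 0.7}
generic_passage = {"vegetable": 0.7}
```

Under this toy scoring, the focused passage scores 0.9 × 0.8 + 0.4 × 0.7 = 1.0, while the generic one scores only 0.28, mirroring the “bell pepper” example above.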
We evaluate EnFoRe on OK-VQA (Marino et al.,
2019), currently the largest knowledge-based VQA
dataset. Our approach achieves state-of-the-art
(SOTA) knowledge retrieval results, indicating the
effectiveness of explicitly recognizing key entities during retrieval. We also combine this retrieved knowledge with SOTA OK-VQA models and achieve new SOTA OK-VQA performance. Our code is available at https://github.com/
jialinwu17/EnFoRe.git.
2 Background and Related Work
2.1 OK-VQA
Visual Question Answering (VQA) has witnessed
remarkable progress over the past few years, in
terms of both the scope of the questions (Antol et al., 2015; Hudson and Manning, 2019; Wang et al., 2018; Gurari et al., 2018; Singh et al., 2019) and the sophistication of the model design (Antol et al., 2015; Lu et al., 2016; Anderson et al., 2018; Kim et al., 2018, 2020; Wu et al., 2019; Wu and Mooney, 2019; Jiang et al., 2018; Lu et al., 2019; Nguyen et al., 2021). There is a recent trend
towards outside-knowledge visual question answering (OK-VQA) (Marino et al., 2019), where open-domain external knowledge beyond the image is necessary. Most OK-VQA models (Marino et al., 2019; Gardères et al., 2020; Zhu et al., 2020; Li et al., 2020; Narasimhan et al., 2018; Marino et al., 2021; Wu et al., 2022; Gui et al., 2021) incorporate a retriever-reader framework that first retrieves textual knowledge relevant to the question and image and then “reads” this text to predict the answer. As a free online encyclopedia, Wikipedia is often used as the knowledge source for OK-VQA. While most previous work has focused on the answer-prediction stage, performance remains limited by the imperfect quality of the retrieved knowledge. This work focuses on
knowledge retrieval and aims at retrieving question-
relevant knowledge that focuses explicitly on the
critical entities for the visual question.
2.2 Passage Retrieval
Sparse Retrieval:
Before the recent proliferation
of transformer-based dense passage retrieval models (Karpukhin et al., 2020), previous work mainly explored sparse retrievers, such as TF-IDF and BM25 (Robertson and Zaragoza, 2009), that measure the similarity between the search query and candidate passage using weighted term matching.
These sparse retrievers require no training signal on passage relevance and provide solid baseline performance. However, exact term matching
prevents them from capturing synonyms and para-
phrases and understanding the semantic meanings
of the query and the passages.
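For concreteness, the weighted term matching behind BM25 can be sketched in a few lines of Python. This is a simplified single-document scorer; the `k1` and `b` defaults are the conventional values from the literature, not parameters used in this paper:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with BM25.

    `corpus` is a list of tokenized documents, used only for the
    IDF statistics and the average document length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # Okapi IDF
        f = tf[term]                                      # term frequency in doc
        norm = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / norm
    return score
```

Because the score is built entirely from exact term overlap, a passage phrased with synonyms of the query terms receives no credit, which is precisely the limitation noted above.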
Dense Retrieval:
To better represent semantics,
dense retrievers (Karpukhin et al., 2020; Chen et al., 2021; Lewis et al., 2022; Lee et al., 2021) extract
deep representations for the query and the candi-
date passages using large pretrained transformer
models. Most dense retrievers are trained using a
contrastive objective that encourages the representation of the query to be more similar to relevant passages than to irrelevant ones. During training, a passage with a high sparse-retrieval score that contains the answer is often regarded as a positive sample for the question-answering task. However, these positive passages may not fit the question’s context and thus serve only as very weak supervision. Moreover, the query and passages are often encoded as single vectors. Therefore, most
dense retrievers fail to explicitly discover and uti-
lize critical entities for the question (Chen et al.,
2021). This often leads to overly general knowl-
edge without a specific focus.
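The contrastive objective can be illustrated with in-batch negatives, as popularized by DPR (Karpukhin et al., 2020). The NumPy sketch below is illustrative only: it treats the i-th passage as the positive for the i-th query and every other passage in the batch as a negative:

```python
import numpy as np

def in_batch_contrastive_loss(q_vecs, p_vecs):
    """Mean negative log-likelihood of the positive passages under a
    softmax over query-passage dot products.

    q_vecs, p_vecs: (B, d) arrays of query and passage embeddings;
    row i of p_vecs is the positive passage for row i of q_vecs.
    """
    sims = q_vecs @ p_vecs.T                       # (B, B) similarity matrix
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # NLL of the diagonal (positives)
```

Because each query and passage enters this loss as a single vector, the objective rewards overall similarity but has no mechanism to single out the critical entities discussed above.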
2.3 Dense Passage Retrieval for VQA
Motivated by the trend toward dense retrievers, pre-
vious work has also applied them to OK-VQA.
Qu et al. (2021) utilize Wikipedia as the knowledge source, while Luo et al. (2021) crawl Google search results on the training set as their knowledge source.