Entity-Focused Dense Passage Retrieval for
Outside-Knowledge Visual Question Answering
Jialin Wu
Department of Computer Science
The University of Texas at Austin
jialinwu@utexas.edu
Raymond J. Mooney
Department of Computer Science
The University of Texas at Austin
mooney@cs.utexas.edu
Abstract
Most Outside-Knowledge Visual Question Answering (OK-VQA) systems employ a two-stage framework that first retrieves external knowledge given the visual question and then predicts the answer based on the retrieved content. However, the retrieved knowledge is often inadequate. Retrievals are frequently too general and fail to cover specific knowledge needed to answer the question. Also, the naturally available supervision (whether the passage contains the correct answer) is weak and does not guarantee question relevancy. To address these issues, we propose an Entity-Focused Retrieval (EnFoRe) model that provides stronger supervision during training and recognizes question-relevant entities to help retrieve more specific knowledge. Experiments show that our EnFoRe model achieves superior retrieval performance on OK-VQA, currently the largest outside-knowledge VQA dataset. We also combine the retrieved knowledge with state-of-the-art VQA models and achieve new state-of-the-art performance on OK-VQA.
1 Introduction
Passage retrieval under a multi-modal setting is a critical prerequisite for applications such as outside-knowledge visual question answering (OK-VQA) (Marino et al., 2019), which requires effectively utilizing knowledge external to the image. Recently, dense passage retrievers with deep semantic representations powered by large transformer models have shown superior performance to traditional sparse retrievers such as BM25 (Robertson and Zaragoza, 2009) and TF-IDF under both textual (Karpukhin et al., 2020; Chen et al., 2021; Lewis et al., 2022) and multi-modal settings (Luo et al., 2021; Qu et al., 2021; Gui et al., 2021).
In this work, we investigate two main drawbacks of recent dense retrievers (Karpukhin et al., 2020; Chen et al., 2021; Lewis et al., 2022; Luo et al., 2021; Qu et al., 2021; Gui et al., 2021), which are typically trained to produce similar representations for input queries and passages containing ground-truth answers.

[Figure 1: Top: Examples of critical entities upon which retrieval models should focus (e.g., "turkey" for "What holiday is this?" and "teddy bear" for "This plush toy was named after what US president?"); Bottom: Example of improved passage retrieval using critical entities (for "Is the large yellow object a fruit or a vegetable?", the critical entity is "bell pepper").]
First, as most retrieval models encode the query and passages as a whole, they fail to explicitly discover entities critical to answering the question (Chen et al., 2021). This frequently leads to retrieving overly general knowledge lacking a specific focus. Ideally, a retrieval model should identify the critical entities for the query and then retrieve question-relevant knowledge specifically about them. For example, as shown in the top half of Figure 1, retrieval models should realize that the entities "turkey" and "teddy bear" are critical.
Second, on the supervision side, the positive signals are often passages containing the right answers with top sparse-retrieval scores such as BM25 (Robertson and Zaragoza, 2009) or TF-IDF. However, this criterion is inadequate to guarantee question relevancy, since good positive passages should reveal facts that actually support the correct answer using the critical entities depicted in the image. For example, as shown in the bottom of Figure 1, both passages mention the correct answer "vegetable," but only the second one, which focuses on the critical entity "bell pepper," is question-relevant.
In order to address these shortcomings, we propose an Entity-Focused Retrieval (EnFoRe) model that improves the quality of the positive passages for stronger supervision. EnFoRe automatically identifies critical entities for the question and then retrieves knowledge focused on them. We treat entities that improve a sparse retriever's performance when emphasized during retrieval as critical entities, and we use the top passages containing both critical entities and the correct answer as positive supervision. Our EnFoRe model then learns two scores: (1) the importance of each entity given the question and the image, and (2) how well each entity fits the context of each candidate passage.
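To make the two-score design concrete, the following is a minimal sketch of one plausible way such scores could be combined into a query-passage relevance score. The function name, the dot-product holistic similarity, and the importance-weighted aggregation are our illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch only: the similarity function and the
# importance-weighted aggregation below are assumptions, not the exact
# scoring function used by EnFoRe.
import torch


def relevance_score(query_vec, passage_vec, entity_importance, entity_fit):
    """Combine a holistic query-passage score with entity-level scores.

    query_vec:          (d,) dense query representation
    passage_vec:        (d,) dense passage representation
    entity_importance:  (n,) score (1): importance of each entity given
                             the question and image
    entity_fit:         (n,) score (2): how well each entity fits the
                             candidate passage's context
    """
    # Holistic similarity, as in standard dense passage retrieval.
    holistic = query_vec @ passage_vec

    # Entity-focused term: weight each entity's passage fit by its
    # question/image importance, so critical entities dominate.
    weights = torch.softmax(entity_importance, dim=0)
    entity_term = (weights * entity_fit).sum()

    return holistic + entity_term
```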
We evaluate EnFoRe on OK-VQA (Marino et al., 2019), currently the largest knowledge-based VQA dataset. Our approach achieves state-of-the-art (SOTA) knowledge retrieval results, indicating the effectiveness of explicitly recognizing key entities during retrieval. We also combine this retrieved knowledge with SOTA OK-VQA models and achieve a new SOTA OK-VQA performance. Our code is available at https://github.com/jialinwu17/EnFoRe.git.
2 Background and Related Work
2.1 OK-VQA
Visual Question Answering (VQA) has witnessed remarkable progress over the past few years, in terms of both the scope of the questions (Antol et al., 2015; Hudson and Manning, 2019; Wang et al., 2018; Gurari et al., 2018; Singh et al., 2019) and the sophistication of the model design (Antol et al., 2015; Lu et al., 2016; Anderson et al., 2018; Kim et al., 2018, 2020; Wu et al., 2019; Wu and Mooney, 2019; Jiang et al., 2018; Lu et al., 2019; Nguyen et al., 2021). There is a recent trend towards outside-knowledge visual question answering (OK-VQA) (Marino et al., 2019), where open-domain external knowledge outside the image is necessary. Most OK-VQA models (Marino et al., 2019; Gardères et al., 2020; Zhu et al., 2020; Li et al., 2020; Narasimhan et al., 2018; Marino et al., 2021; Wu et al., 2022; Gui et al., 2021) incorporate a retriever-reader framework that first retrieves textual knowledge relevant to the question and image and then "reads" this text to predict the answer. As a free online encyclopedia, Wikipedia is often used as the knowledge source for OK-VQA. While most previous work has focused on the answer-prediction stage, performance is still limited by the imperfect quality of the retrieved knowledge. This work focuses on knowledge retrieval and aims at retrieving question-relevant knowledge that focuses explicitly on the critical entities for the visual question.
2.2 Passage Retrieval
Sparse Retrieval:
Before the recent proliferation of transformer-based dense passage retrieval models (Karpukhin et al., 2020), previous work mainly explored sparse retrievers, such as TF-IDF and BM25 (Robertson and Zaragoza, 2009), that measure the similarity between the search query and a candidate passage using weighted term matching. These sparse retrievers require no training signal for passage relevancy and provide solid baseline performance. However, exact term matching prevents them from capturing synonyms and paraphrases and from understanding the semantic meaning of the query and the passages.
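For reference, below is a minimal sketch of BM25-style weighted term matching; the k1 and b values and the smoothed IDF follow common textbook defaults (Robertson and Zaragoza, 2009), and the helper function itself is ours, not from the paper.

```python
# Minimal BM25 sketch for illustration; k1, b, and the IDF variant follow
# common textbook defaults, not any implementation used in the paper.
import math
from collections import Counter


def bm25_score(query_terms, passage_terms, doc_freq, num_docs, avg_len,
               k1=1.5, b=0.75):
    """Score one passage against a query with BM25 weighted term matching."""
    tf = Counter(passage_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue  # exact term matching: no credit for synonyms
        # Smoothed inverse document frequency of the query term.
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5)
                       / (doc_freq[term] + 0.5))
        # Saturated, length-normalized term frequency in the passage.
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(passage_terms) / avg_len))
        score += idf * norm
    return score
```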
Dense Retrieval:
To better represent semantics, dense retrievers (Karpukhin et al., 2020; Chen et al., 2021; Lewis et al., 2022; Lee et al., 2021) extract deep representations for the query and the candidate passages using large pretrained transformer models. Most dense retrievers are trained using a contrastive objective that encourages the representation of the query to be more similar to relevant passages than to irrelevant ones. During training, a passage with a high sparse-retrieval score that contains the answer is often regarded as a positive sample for the question-answering task. However, these positive passages may not fit the question's context and thus provide only very weak supervision. Moreover, the query and passages are often encoded as single vectors, so most dense retrievers fail to explicitly discover and utilize critical entities for the question (Chen et al., 2021). This often leads to overly general knowledge without a specific focus.
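As a concrete reference, here is a minimal sketch of the contrastive objective with in-batch negatives commonly used to train dense retrievers in the style of DPR (Karpukhin et al., 2020); this is the standard formulation, not EnFoRe's specific training loss.

```python
# Standard in-batch-negative contrastive loss for dense retrieval, in the
# style of DPR (Karpukhin et al., 2020); shown for reference, not EnFoRe's
# specific objective.
import torch
import torch.nn.functional as F


def contrastive_loss(query_embs, passage_embs):
    """query_embs: (B, d); passage_embs: (B, d), row i positive for query i.

    Every other passage in the batch serves as an in-batch negative.
    """
    # Similarity matrix: entry (i, j) scores query i against passage j.
    sims = query_embs @ passage_embs.T          # (B, B)
    labels = torch.arange(query_embs.size(0))   # positives on the diagonal
    # Cross-entropy pushes each query toward its positive passage and away
    # from all in-batch negatives.
    return F.cross_entropy(sims, labels)
```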
2.3 Dense Passage Retrieval for VQA
Motivated by the trend toward dense retrievers, previous work has also applied them to OK-VQA. Qu et al. (2021) utilize Wikipedia as a knowledge source, while Luo et al. (2021) crawl Google search results on the training set as a knowledge source. However, the weak training signals for passage retrieval become more problematic for VQA because the visual context of the question makes it more complex; a "positive passage" is therefore less likely to fit the visual context and actually provide suitable supervision. In order to better incorporate visual content, Gui et al. (2021) adopt an image-based knowledge retriever that employs the CLIP model (Radford et al., 2021), pretrained on large-scale multi-modal pairs, as its backbone. However, question relevancy is not considered, so the retriever has to retrieve knowledge on every aspect of the image to cover different possible questions.

This work proposes an Entity-Focused Retrieval (EnFoRe) model that recognizes key entities for the visual question and retrieves question-relevant knowledge specifically focused on them. Our approach also benefits from stronger passage-retrieval supervision with the help of those key entities.
2.4 Phrase-Based Dense Passage Retrieval
The most relevant work to ours is phrase-based dense passage retrieval. Chen et al. (2021) employ a separate lexical model trained to mimic the performance of a sparse retriever that is better at matching phrases. Lee et al. (2021) propose the DensePhrases model, which extracts a feature for each possible phrase in the passage and uses only the most relevant phrase to measure the similarity between the query and the passage. However, the training signals still come from exactly matching ground-truth answers, and the phrases are parsed from the candidate passage, limiting the scope of the search. In contrast, our approach collects entities from many aspects of the question and image, including object recognition, attribute detection, OCR, brand tagging, and captioning, building a rich unified intermediate representation.
3 Entity Set Construction
Our EnFoRe model is empowered by a comprehensive set of extracted entities, which are not limited to phrases from the question and passages as in Lee et al. (2021). We collect entities from the sources below. Most entity-extraction steps are independent and can execute in parallel; the exception is answering sub-questions, which first requires parsing the questions. Parallelizing these steps can significantly reduce run time, as illustrated by the sketch below.
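The following is a rough orchestration sketch of such a pipeline; the extractor callables are hypothetical placeholders for the sources described in Sections 3.1 and 3.2, not functions from the EnFoRe codebase.

```python
# Hypothetical orchestration sketch: the extractor callables are
# placeholders for the entity sources described in Sections 3.1 and 3.2,
# not functions from the EnFoRe codebase.
from concurrent.futures import ThreadPoolExecutor


def collect_entities(question, image, parse_question, extractors,
                     answer_sub_questions):
    """Run independent extractors in parallel; sub-question answering
    depends on the question parse, so it is chained after parsing."""
    entities = set()
    with ThreadPoolExecutor() as pool:
        # Independent sources (question noun phrases, answer candidates,
        # Azure tags, Wikidata entities) can run concurrently.
        futures = [pool.submit(fn, question, image) for fn in extractors]

        # Sub-questions need the parse tree first, then a VQA model.
        def parse_then_answer():
            tree = parse_question(question)
            return answer_sub_questions(tree, image)

        futures.append(pool.submit(parse_then_answer))

        for future in futures:
            entities.update(future.result())
    return entities
```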
3.1 Question-Based Entities
Entities from Questions:
First, the noun phrases in questions usually reveal critical entities. Following Wu et al. (2022), we parse the question using a constituency parser (Gardner et al., 2018) and extract the noun phrases at the leaves of the parse tree. Then, we link each phrase to the image and extract the referred object with its attributes. We use a pretrained ViLBERT model (Lu et al., 2020) as the object linker.
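Below is a minimal sketch of extracting leaf-level noun phrases from a constituency parse, assuming the parse is already available as an nltk.Tree; the paper obtains parses from an AllenNLP constituency parser (Gardner et al., 2018), and this small traversal is our illustration.

```python
# Illustrative only: assumes a constituency parse is already available as
# an nltk.Tree; the paper uses an AllenNLP constituency parser
# (Gardner et al., 2018) to produce it.
from nltk import Tree


def leaf_noun_phrases(tree):
    """Collect NP subtrees that contain no nested NP (leaf-level NPs)."""
    phrases = []
    for subtree in tree.subtrees(lambda t: t.label() == "NP"):
        has_nested_np = any(child.label() == "NP"
                            for child in subtree
                            if isinstance(child, Tree))
        if not has_nested_np:
            phrases.append(" ".join(subtree.leaves()))
    return phrases


# Example: "This plush toy was named after what US president?"
parse = Tree.fromstring(
    "(S (NP (DT This) (JJ plush) (NN toy)) (VP (VBD was) (VP (VBN named)"
    " (PP (IN after) (NP (WDT what) (NNP US) (NN president))))) (. ?))")
print(leaf_noun_phrases(parse))  # ['This plush toy', 'what US president']
```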
Entities from Sub-Questions:
OK-VQA often requires systems to solve visual reference problems as well as comprehend relevant outside knowledge. Therefore, we employ a general VQA model to find answers to the visual aspects of the question. In particular, we collect a set of sub-questions by appending each noun phrase in the parse tree to the common question phrases "What is ..." and "How is ...". When the confidence for an answer from a pretrained ViLBERT model (Lu et al., 2020) exceeds 0.5, the answer is added to the entity set. For the example in Fig. 2, the noun phrases "plush toy" and "president" generate the sub-questions "What is plush toy?", "How is plush toy?", "What is president?", and "How is president?". The answer confidence for "teddy bear" exceeds 0.5 for the first question, so we include it in the entity set.
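A minimal sketch of this sub-question step follows; the templates and the 0.5 threshold come from the text, while the vqa_model interface (a callable returning an answer and its confidence) is an assumed placeholder for the pretrained ViLBERT model.

```python
# Sketch of sub-question generation and confidence filtering; the
# vqa_model interface (returning an answer and its confidence) is an
# assumed placeholder for the pretrained ViLBERT model used in the paper.
TEMPLATES = ["What is {}?", "How is {}?"]
CONFIDENCE_THRESHOLD = 0.5  # threshold stated in the paper


def entities_from_sub_questions(noun_phrases, image, vqa_model):
    entities = set()
    for phrase in noun_phrases:
        for template in TEMPLATES:
            sub_question = template.format(phrase)  # e.g. "What is plush toy?"
            answer, confidence = vqa_model(sub_question, image)
            # Only keep answers the VQA model is confident about.
            if confidence > CONFIDENCE_THRESHOLD:
                entities.add(answer)
    return entities
```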
Entities from Answer Candidates:
Standard state-of-the-art VQA models are surprisingly effective at generating a small set of promising answer candidates for OK-VQA (Wu et al., 2022, 2020). Therefore, we finetune a ViLBERT model (Lu et al., 2019) on the OK-VQA dataset, extract the top 5 answer candidates, and add them to the entity set.
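Extracting these candidates amounts to a top-k over the model's answer distribution; a small sketch is below, where the logits tensor and vocabulary list are assumed inputs rather than artifacts of the paper's code.

```python
# Sketch of taking the top-5 answer candidates from a finetuned VQA model;
# `answer_logits` and `answer_vocab` are assumed inputs.
import torch


def top_answer_candidates(answer_logits, answer_vocab, k=5):
    """answer_logits: (V,) scores over the answer vocabulary."""
    probs = torch.softmax(answer_logits, dim=0)
    _, top_idx = probs.topk(k)
    return [answer_vocab[i] for i in top_idx.tolist()]
```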
3.2 Image-Based Entities
Question-based entities have high precision and narrow down the search space for knowledge retrievers. To complement them, we also collect image-based entities to help achieve higher recall.
Entities from Azure tagging:
Following Yang et al. (2022), we use Azure OCR and brand tagging to annotate the detected objects in the images, using a Mask R-CNN detector (He et al., 2017).
Entities from Wikidata:
As suggested by Gui et al. (2021), common image and object tags can be generic, with a limited vocabulary, leading to noisy or irrelevant knowledge. Therefore, we also leverage recent advanced visual-semantic matching approaches, i.e., CLIP (Radford et al., 2021), to extract image-relevant entities from Wikidata.
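A minimal sketch of CLIP-based entity matching is shown below under stated assumptions: it uses the public openai/CLIP package interface, and the candidate entity names and similarity threshold are illustrative choices, not values from the paper.

```python
# Illustrative CLIP matching sketch using the public openai/CLIP package;
# the candidate list and score threshold are our assumptions, not values
# from the paper.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def wikidata_entities_for_image(image_path, candidate_names, threshold=0.25):
    """Keep Wikidata entity names whose CLIP similarity to the image is high."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(candidate_names).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text)
        # Cosine similarity between the image and each entity name.
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        sims = (image_feat @ text_feat.T).squeeze(0)
    return [name for name, s in zip(candidate_names, sims.tolist())
            if s > threshold]
```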