age. For example, as shown in the bottom of Figure
1, both passages mention the correct answer “vegetable,” but only the second one, which focuses on the critical entity “bell pepper,” is question-relevant.
In order to address these shortcomings, we propose an Entity-Focused Retrieval (EnFoRe) model
that improves the quality of the positive passages
for stronger supervision. EnFoRe automatically
identifies critical entities for the question and then
retrieves knowledge focused on them. We define critical entities as those that improve a sparse retriever’s performance when emphasized during retrieval. We use the top-ranked passages containing both the critical entities and the correct answer as positive supervision. Our EnFoRe model then learns two scores: (1) the importance of each entity given the question and the image, and (2) how well each entity fits the context of each candidate passage.
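As an illustration, these two scores can be combined to rank candidate passages. The sketch below uses hypothetical names and a simple weighted sum; it is not the paper’s actual implementation, only a minimal picture of how an entity-importance score can weight an entity–passage fit score:

```python
def score_passage(entity_importance, entity_fit):
    """Illustrative passage score: each critical entity's fit to the
    passage, weighted by its importance to the question and image.

    entity_importance: {entity: importance given question + image}
    entity_fit: {entity: fit of the entity to this passage}
    Entities absent from the passage contribute zero.
    """
    return sum(imp * entity_fit.get(entity, 0.0)
               for entity, imp in entity_importance.items())

# A passage focused on the critical entity "bell pepper" outranks a
# generic passage that only mentions the answer "vegetable".
importance = {"bell pepper": 0.9, "vegetable": 0.4}
focused_passage = {"bell pepper": 0.8, "vegetable": 0.7}
generic_passage = {"vegetable": 0.7}
```

Under this toy scoring, the focused passage scores 0.9 × 0.8 + 0.4 × 0.7 = 1.0, while the generic one scores only 0.28, mirroring the “bell pepper” example above.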
We evaluate EnFoRe on OK-VQA (Marino et al.,
2019), currently the largest knowledge-based VQA
dataset. Our approach achieves state-of-the-art
(SOTA) knowledge retrieval results, indicating the
effectiveness of explicitly recognizing key entities during retrieval. We also combine this retrieved knowledge with SOTA OK-VQA models and achieve new SOTA OK-VQA performance. Our code is available at https://github.com/
jialinwu17/EnFoRe.git.
2 Background and Related Work
2.1 OK-VQA
Visual Question Answering (VQA) has witnessed
remarkable progress over the past few years, in
terms of both the scope of the questions (Antol et al., 2015; Hudson and Manning, 2019; Wang et al., 2018; Gurari et al., 2018; Singh et al., 2019) and the sophistication of the model design (Antol et al., 2015; Lu et al., 2016; Anderson et al., 2018; Kim et al., 2018, 2020; Wu et al., 2019; Wu and Mooney, 2019; Jiang et al., 2018; Lu et al., 2019; Nguyen et al., 2021). There is a recent trend
towards outside-knowledge visual question answering (OK-VQA) (Marino et al., 2019), where open-domain external knowledge beyond the image is necessary. Most OK-VQA models (Marino et al., 2019; Gardères et al., 2020; Zhu et al., 2020; Li et al., 2020; Narasimhan et al., 2018; Marino et al., 2021; Wu et al., 2022; Gui et al., 2021) incorporate a retriever-reader framework that first retrieves textual knowledge relevant to the question and image and then “reads” this text to predict the answer. As a free online encyclopedia, Wikipedia is often used as the knowledge source for OK-VQA. While most previous work has focused on the answer-prediction stage, performance remains limited by the imperfect quality of the retrieved knowledge. This work focuses on
knowledge retrieval and aims at retrieving question-
relevant knowledge that focuses explicitly on the
critical entities for the visual question.
2.2 Passage Retrieval
Sparse Retrieval:
Before the recent proliferation
of transformer-based dense passage retrieval models (Karpukhin et al., 2020), previous work mainly explored sparse retrievers, such as TF-IDF and BM25 (Robertson and Zaragoza, 2009), that measure the similarity between the search query and candidate passage using weighted term matching.
These sparse retrievers require no training signal on passage relevance and provide solid baseline performance. However, exact term matching
prevents them from capturing synonyms and para-
phrases and understanding the semantic meanings
of the query and the passages.
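For concreteness, the weighted term matching behind BM25 can be sketched in a few lines of Python. This is a simplified single-document scorer; the `k1` and `b` defaults are the conventional values from the literature, not parameters used in this paper:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with BM25.

    `corpus` is a list of tokenized documents, used only for the
    IDF statistics and the average document length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # Okapi IDF
        f = tf[term]                                      # term frequency in doc
        norm = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / norm
    return score
```

Because the score is built entirely from exact term overlap, a passage phrased with synonyms of the query terms receives no credit, which is precisely the limitation noted above.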
Dense Retrieval:
To better represent semantics,
dense retrievers (Karpukhin et al., 2020; Chen et al., 2021; Lewis et al., 2022; Lee et al., 2021) extract
deep representations for the query and the candi-
date passages using large pretrained transformer
models. Most dense retrievers are trained using a
contrastive objective that encourages the representation of the query to be more similar to relevant passages than to irrelevant ones. During training, a passage with a high sparse-retrieval score that contains the answer is often regarded as a positive sample for the question-answering task. However, these positive passages may not fit the question’s context and thus serve only as very weak supervision. Moreover, the query and passages are often encoded as single vectors. Therefore, most
dense retrievers fail to explicitly discover and uti-
lize critical entities for the question (Chen et al.,
2021). This often leads to overly general knowl-
edge without a specific focus.
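The contrastive objective can be illustrated with in-batch negatives, as popularized by DPR (Karpukhin et al., 2020). The NumPy sketch below is illustrative only: it treats the i-th passage as the positive for the i-th query and every other passage in the batch as a negative:

```python
import numpy as np

def in_batch_contrastive_loss(q_vecs, p_vecs):
    """Mean negative log-likelihood of the positive passages under a
    softmax over query-passage dot products.

    q_vecs, p_vecs: (B, d) arrays of query and passage embeddings;
    row i of p_vecs is the positive passage for row i of q_vecs.
    """
    sims = q_vecs @ p_vecs.T                       # (B, B) similarity matrix
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # NLL of the diagonal (positives)
```

Because each query and passage enters this loss as a single vector, the objective rewards overall similarity but has no mechanism to single out the critical entities discussed above.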
2.3 Dense Passage Retrieval for VQA
Motivated by the trend toward dense retrievers, pre-
vious work has also applied them to OK-VQA.
Qu et al. (2021) utilize Wikipedia as the knowledge source, while Luo et al. (2021) crawl Google search results on the training set as their knowledge source.