trained on commonsense knowledge graphs. We augment
sentence transformers [31] to rank, filter and embed the
commonsense inferences. We incorporate the filtered in-
ferences into VLC-BERT using an attention-driven fusion
mechanism that learns to focus on the most important infer-
ences for each question. Commonsense knowledge may not be necessary for answering every question, as some questions are purely visual, factual, or straightforward. To avoid injecting noisy knowledge in such cases, we employ weak supervision to help discriminate between situations in which commonsense knowledge may or may not be valuable.
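To make this ranking and filtering step concrete, the snippet below is a minimal sketch of scoring generated inferences against the question with Sentence-BERT [31]; the checkpoint name and top-k cutoff are illustrative placeholders, not the exact configuration used in VLC-BERT.

```python
# Minimal sketch: rank commonsense inferences by semantic similarity to the
# question using Sentence-BERT. Checkpoint name and top-k cutoff are assumed.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint

def rank_inferences(question, inferences, top_k=5):
    """Return the top_k inferences most semantically similar to the question."""
    q_emb = encoder.encode(question, convert_to_tensor=True)
    inf_emb = encoder.encode(inferences, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, inf_emb)[0]           # one score per inference
    top = scores.argsort(descending=True)[:top_k]
    return [(inferences[int(i)], float(scores[i])) for i in top]
```

Inferences scoring below a similarity threshold can then be discarded before the remaining ones are embedded and fused (the threshold itself is an implementation choice).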
Our evaluations on the challenging OK-VQA [29] and
A-OKVQA [36] datasets confirm that leveraging common-
sense is consistently useful for knowledge-intensive visual
question answering tasks. We analyze the successful predictions and show how the commonsense inferences help answer difficult questions.
2. Related Work
2.1. Vision-Language Transformer Models
Pre-trained Vision-Language models based on BERT [8]
have shown impressive performance on downstream mul-
timodal tasks such as Visual Question Answering. ViL-
BERT [25] and LXMERT [42] use a two-stream architec-
ture to first encode language and vision modalities indepen-
dently, and then apply a cross-modality encoder to align
textual and visual tokens. VL-BERT [41], OSCAR [22]
and OSCAR+ [50] use a single-stream architecture to di-
rectly learn inter-modality interactions. Large-scale pre-
training is commonly done using the Conceptual Captions
[38] dataset, with objectives that are designed to encourage
interaction between modalities, such as predicting masked
tokens or image regions [22, 25, 41, 42], and using con-
trastive loss between modalities [22]. As a result, such
models inherently capture some commonsense knowledge
through their pre-training regime. While these models per-
form impressively on downstream tasks such as VQA [1],
they typically perform worse on questions requiring rea-
soning about knowledge beyond the image content or in-
volving multiple reasoning hops. In our work, we introduce
VLC-BERT, a multimodal transformer model based on VL-
BERT that explicitly incorporates external knowledge to al-
leviate this issue.
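For concreteness, the masked-token objective mentioned above can be sketched as follows; the `vl_model` interface, tensor shapes, and names are hypothetical simplifications (real models additionally mask image regions and apply image-text matching or contrastive losses).

```python
# Conceptual sketch (not taken from any specific model) of the masked-token
# objective in single-stream VL pre-training: text tokens and visual region
# features are encoded jointly, and the original ids of masked text tokens
# are predicted. `vl_model` and its methods are hypothetical placeholders.
import torch.nn.functional as F

def masked_token_loss(vl_model, text_ids, region_feats, mask_positions, labels):
    """
    text_ids:       (B, T) token ids with masked positions set to [MASK]
    region_feats:   (B, R, D) features of detected image regions
    mask_positions: (B, T) boolean mask over the text positions
    labels:         (num_masked,) original ids of the masked tokens
    """
    hidden = vl_model.encode(text_ids, region_feats)   # (B, T + R, D) joint encoding
    text_hidden = hidden[:, : text_ids.size(1)]        # keep only text positions
    logits = vl_model.lm_head(text_hidden)             # (B, T, vocab_size)
    return F.cross_entropy(logits[mask_positions], labels)
```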
2.2. Knowledge-based Visual Question Answering
In recent years, several VQA datasets were designed
specifically to require reasoning about external knowledge
beyond the image, whether using factual and web infor-
mation (FVQA [45], WebQA [5]), a provided text pas-
sage (VLQA [34]), commonsense-driven reasoning (VCR
[49]), or external commonsense knowledge (OK-VQA [29],
A-OKVQA [36]). This motivated a line of work on
knowledge-enhanced VL transformer models. External
knowledge is typically retrieved from a structured knowl-
edge base like ConceptNet [40], in the form of a subgraph,
and integrated into the VL transformer as an additional in-
put [9, 20, 28, 47]. Alternative sources of knowledge include
image captions [33], Google Search results [26], and tex-
tual and visual knowledge from Wikipedia and Google Im-
ages [47]. In contrast to most of the preceding work, PICa
[48] and Knowledge Augmented Transformer (KAT) [13]
attempt to use GPT-3 [3] in a few-shot setting for the VQA task by building prompts that contain a caption and object tags generated from the image, followed by the question, and asking the model to produce an answer.
In our proposed model, we focus on a specific subset of
the knowledge-intensive datasets that require commonsense
knowledge. Our approach to incorporating commonsense knowledge, which uses COMET [15], is distinctly different, far simpler, and more cost-effective.
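As an illustration of the PICa/KAT-style prompting described above, a simplified prompt builder might look as follows; the template wording is an assumption, and the in-context examples and the actual GPT-3 API call are omitted.

```python
# Hedged sketch of PICa/KAT-style prompting: the image is verbalized as a
# caption plus object tags, then combined with the question into a text prompt
# for GPT-3. The exact template used by PICa/KAT may differ.
def build_vqa_prompt(caption, object_tags, question):
    context = f"Context: {caption}. Objects: {', '.join(object_tags)}."
    return f"{context}\nQ: {question}\nA:"

# Example:
# build_vqa_prompt("a man riding a wave on a surfboard",
#                  ["person", "surfboard", "wave"],
#                  "What sport is being performed?")
```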
2.3. Knowledge Incorporation in NLP
Structured large-scale knowledge bases (KBs) like Con-
ceptNet [40] and ATOMIC [35] are widely used in NLP
tasks to provide additional commonsense knowledge to
models. ConceptNet contains 3.4M assertions focusing on
concept and entity relations (such as RelatedTo, Synonym,
IsA, MadeOf). ATOMIC contains 1.33M triplets focusing
on event-centric social commonsense about the causes, effects, and mental states of event participants. Several approaches
were proposed for incorporating symbolic knowledge from
these KBs into downstream NLP tasks, such as encoding
subgraphs of relevant knowledge [9, 23] and pre-training
on commonsense knowledge bases or tasks [51]. Despite
the performance improvements, incorporating knowledge
directly from KBs suffers from two limitations: lack of
coverage and lack of consideration for context. Com-
monsense Transformer, COMET [15], attempts to allevi-
ate these issues by fine-tuning pre-trained language models
on KBs. COMET can dynamically generate inferences for the various KB relations given new, unseen inputs. It has been suc-
cessfully used for generating knowledge in language tasks
[4, 27, 39, 43]. Inspired by the success of these models, we
chose to use COMET [15] to generate relevant contextual
expansions rather than directly retrieving knowledge from
KBs. To the best of our knowledge, we are the first to in-
corporate commonsense knowledge using COMET in VQA
tasks. Newer COMET variants [30, 46] are less applicable
to OK-VQA and A-OKVQA, as they focus more on event commonsense than on entities.
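For reference, querying a COMET model for inferences can be sketched as follows; the checkpoint path is a placeholder, and the `{head} {relation} [GEN]` query format follows the released COMET-ATOMIC-2020 code and should be verified against the specific checkpoint used.

```python
# Hedged sketch of generating commonsense inferences with a COMET seq2seq
# model via Hugging Face transformers. The checkpoint path is a placeholder.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "path/to/comet-atomic-2020-bart"  # placeholder, not an official hub id
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def comet_inferences(head, relation, num_beams=5):
    """Generate tail phrases for (head, relation),
    e.g. comet_inferences("man holding a surfboard", "xIntent")."""
    query = f"{head} {relation} [GEN]"
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(**inputs, num_beams=num_beams,
                             num_return_sequences=num_beams, max_length=32)
    return [tokenizer.decode(o, skip_special_tokens=True).strip() for o in outputs]
```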
3. Method
We briefly outline the overall architecture of our model
and then delve deeper into its individual components. Fig-