VLC-BERT: Visual Question Answering
with Contextualized Commonsense Knowledge
Sahithya Ravi1,2* Aditya Chinchure1,2* Leonid Sigal1,2 Renjie Liao1 Vered Shwartz1,2
1University of British Columbia 2Vector Institute for AI
{sahiravi, aditya10, lsigal, vshwartz}@cs.ubc.ca, rjliao@ece.ubc.ca
Abstract
There has been a growing interest in solving Visual
Question Answering (VQA) tasks that require the model
to reason beyond the content present in the image. In
this work, we focus on questions that require common-
sense reasoning. In contrast to previous methods which
inject knowledge from static knowledge bases, we in-
vestigate the incorporation of contextualized knowledge
using Commonsense Transformer (COMET), an existing
knowledge model trained on human-curated knowledge
bases. We propose a method to generate, select, and en-
code external commonsense knowledge alongside visual
and textual cues in a new pre-trained Vision-Language-
Commonsense transformer model, VLC-BERT. Through
our evaluation on the knowledge-intensive OK-VQA and A-
OKVQA datasets, we show that VLC-BERT is capable of
outperforming existing models that utilize static knowledge
bases. Furthermore, through a detailed analysis, we ex-
plain which questions benefit, and which don’t, from con-
textualized commonsense knowledge from COMET. Code:
https://github.com/aditya10/VLC-BERT
1. Introduction
Recent progress in multimodal vision-language learning
has been fueled by large-scale annotated datasets for Visual
Question Answering (VQA) [1,6,12,37,49], in which mod-
els are presented with questions about an image. To answer
questions correctly, models are required to perform scene
understanding and learn meaningful connections between
the two modalities. In recent years, transformer-based vi-
sion and language (VL) models [8, 21, 44], pre-trained on
large-scale multimodal corpora, have reached impressive
accuracies on standard VQA datasets.
*Denotes equal contribution

Figure 1: OK-VQA [29]: Where might one buy this?

VQA often necessitates not only visual comprehension of the scene depicted by the image (e.g., “A plate with meat, potatoes and bread”) but also making inferences about plausible stories behind the image (e.g., “The plate is likely
found at a restaurant”). Humans make such inferences
based on prior experience and commonsense knowledge
(e.g., “This is likely a lunch or dinner at a restaurant, peo-
ple may be enjoying themselves...”). Most existing meth-
ods rely on world knowledge implicitly encoded by lan-
guage models, which often lacks both accuracy and coverage [32]. This is primarily because commonsense knowledge is extremely broad and is frequently assumed rather than stated explicitly. Commonsense knowledge learned from text suffers
from reporting bias [11]: over-representation of exceptional
facts (e.g., “people die in accidents”) in text corpora, at the
expense of rarely discussed trivial facts known to everyone
(e.g., “people eat”).
Several visual question answering benchmarks were pro-
posed, in which the questions require either factual [29, 45]
or commonsense knowledge [36, 49] beyond visual scene comprehension. This prompted the development of
neurosymbolic methods combining transformer-based rep-
resentations with knowledge bases (KBs) [9, 28, 47]. How-
ever, retrieving relevant facts directly from a KB is chal-
lenging due to lack of coverage, and because KB facts are
only appropriate in certain contexts.
In this work, we propose VLC-BERT (Vision-Language-
Commonsense BERT), a model designed to incorporate
contextualized commonsense knowledge into a Vision-
Language transformer built on VL-BERT [41]. As an al-
ternative to the retrieval paradigm often used in knowledge-
based VQA, our model generates contextualized common-
sense inferences on the question phrase combined with im-
age object tags using COMET [2, 15], a language model
trained on commonsense knowledge graphs. We augment
sentence transformers [31] to rank, filter and embed the
commonsense inferences. We incorporate the filtered in-
ferences into VLC-BERT using an attention-driven fusion
mechanism that learns to focus on the most important infer-
ences for each question. Commonsense knowledge may not
be necessary for answering every question, as some ques-
tions are either purely visual, factual, or straightforward.
To avoid injecting noisy knowledge in such cases, we
employ weak supervision to help us discriminate between
situations when commonsense knowledge may or may not
be valuable.
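One way to picture this attention-driven fusion together with the weak-supervision gate is the sketch below. It is only an illustrative reading under our own assumptions (embedding size, a single attention head, a sigmoid gate); it is not necessarily VLC-BERT's exact design, which is described in Section 3.

import torch
import torch.nn as nn

class GatedInferenceFusion(nn.Module):
    """Illustrative sketch: attention pooling over K inference embeddings, gated by a
    scalar that weak supervision could train to indicate whether knowledge is useful."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=1, batch_first=True)
        self.gate = nn.Linear(dim, 1)  # 1 = commonsense likely helps, 0 = it does not

    def forward(self, question_vec, inference_vecs):
        # question_vec: (B, dim); inference_vecs: (B, K, dim), e.g. K = 5 SBERT embeddings
        query = question_vec.unsqueeze(1)                            # question attends over inferences
        fused, weights = self.attn(query, inference_vecs, inference_vecs)
        gate = torch.sigmoid(self.gate(question_vec)).unsqueeze(-1)  # (B, 1, 1) relevance gate
        return gate * fused, weights                                 # gated commonsense summary

fusion = GatedInferenceFusion()
summary, attn = fusion(torch.randn(2, 768), torch.randn(2, 5, 768))
print(summary.shape)  # torch.Size([2, 1, 768])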
Our evaluations on the challenging OK-VQA [29] and
A-OKVQA [36] datasets confirm that leveraging common-
sense is consistently useful for knowledge-intensive visual
question answering tasks. We analyze the successful pre-
dictions and show how the commonsense inferences help
answer difficult questions.
2. Related Work
2.1. Vision-Language Transformer Models
Pre-trained Vision-Language models based on BERT [8]
have shown impressive performance on downstream mul-
timodal tasks such as Visual Question Answering. ViL-
BERT [25] and LXMERT [42] use a two-stream architec-
ture to first encode language and vision modalities indepen-
dently, and then apply a cross-modality encoder to align
textual and visual tokens. VL-BERT [41], OSCAR [22]
and OSCAR+ [50] use a single-stream architecture to di-
rectly learn inter-modality interactions. Large-scale pre-
training is commonly done using the Conceptual Captions
[38] dataset, with objectives that are designed to encourage
interaction between modalities, such as predicting masked
tokens or image regions [22, 25, 41, 42], and using con-
trastive loss between modalities [22]. As a result, such
models inherently capture some commonsense knowledge
through their pre-training regime. While these models per-
form impressively on downstream tasks such as VQA [1],
they typically perform worse on questions requiring rea-
soning about knowledge beyond the image content or in-
volving multiple reasoning hops. In our work, we introduce
VLC-BERT, a multimodal transformer model based on VL-
BERT that explicitly incorporates external knowledge to al-
leviate this issue.
2.2. Knowledge-based Visual Question Answering
In recent years, several VQA datasets were designed
specifically to require reasoning about external knowledge
beyond the image, whether using factual and web infor-
mation (FVQA [45], WebQA [5]), a provided text pas-
sage (VLQA [34]), commonsense-driven reasoning (VCR
[49]), or external commonsense knowledge (OK-VQA [29],
A-OKVQA [36]). This motivated a line of work on
knowledge-enhanced VL transformer models. External
knowledge is typically retrieved from a structured knowl-
edge base like ConceptNet [40], in the form of a subgraph,
and integrated into the VL transformer as an additional in-
put [9,20,28,47]. Alternative sources of knowledge include
image captions [33], Google Search results [26], and tex-
tual and visual knowledge from Wikipedia and Google Images [47]. In contrast to most of the preceding work, PICa
[48] and Knowledge Augmented Transformer (KAT) [13]
attempt to use GPT-3 [3] in a few-shot setting on the VQA
task, by building prompts containing the caption and ob-
ject tags generated from the image, followed by the question, and asking the model to produce an answer.
In our proposed model, we focus on a specific subset of
the knowledge-intensive datasets that require commonsense
knowledge. Our approach, which uses COMET [15] to incorporate commonsense knowledge, is distinctly different, far simpler, and more cost-effective.
2.3. Knowledge Incorporation in NLP
Structured large-scale knowledge bases (KBs) like Con-
ceptNet [40] and ATOMIC [35] are widely used in NLP
tasks to provide additional commonsense knowledge to
models. ConceptNet contains 3.4M assertions focusing on
concept and entity relations (such as RelatedTo, Synonym,
IsA, MadeOf). ATOMIC contains 1.33M triplets focusing
on event-centric social commonsense about causes, effects, and mental states of event participants. Several approaches
were proposed for incorporating symbolic knowledge from
these KBs into downstream NLP tasks such as encoding
subgraphs of relevant knowledge [9, 23] and pre-training
on commonsense knowledge bases or tasks [51]. Despite
the performance improvements, incorporating knowledge
directly from KBs suffers from two limitations: lack of
coverage and lack of consideration for context. Com-
monsense Transformer, COMET [15], attempts to allevi-
ate these issues by fine-tuning pre-trained language models
on KBs. COMET can generate inferences for the various
KB relations dynamically for new inputs. It has been suc-
cessfully used for generating knowledge in language tasks
[4, 27, 39, 43]. Inspired by the success of these models, we
chose to use COMET [15] to generate relevant contextual
expansions rather than directly retrieving knowledge from
KBs. To the best of our knowledge, we are the first to in-
corporate commonsense knowledge using COMET in VQA
tasks. Newer COMET variants [30, 46] are less applicable
to OK-VQA and A-OKVQA as they focus more on event commonsense than on entities.
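To make the contrast with static KB retrieval concrete, the sketch below queries a COMET-style generator for a head phrase that need not appear in any KB. The checkpoint path and the "{head} {relation} [GEN]" prompt format are assumptions about how a BART-based COMET checkpoint is typically packaged; they are not specified in this paper.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_PATH = "path/to/comet-bart-checkpoint"  # hypothetical local COMET-BART checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_PATH)

def query_comet(head: str, relation: str, num_return: int = 5) -> list:
    """Generate tail phrases for <head, relation, ?> with beam search."""
    prompt = f"{head} {relation} [GEN]"  # assumed input format for a COMET-style checkpoint
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=num_return,
        num_return_sequences=num_return,
        max_new_tokens=24,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Unlike a static KB lookup, this works for head phrases that never appear in ConceptNet or ATOMIC:
print(query_comet("The purpose of the umbrellas", "UsedFor"))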
3. Method
We briefly outline the overall architecture of our model
and then delve deeper into its individual components.
Figure 2: Architecture of VLC-BERT: Given an image, VLC-BERT generates commonsense inferences for the question-object phrase using COMET. These inferences are relevance-ranked, and top ones are selected and fed along with image regions into a VL-Transformer in order to produce an answer. We utilize semantic similarity between Q and C to select the final K inferences that go into VLC-BERT. (Panels: (a) Overall architecture; (b) Knowledge generation and selection.)
Figure 2a illustrates the VLC-BERT pipeline. Given an image with corresponding image regions I precomputed using Fast R-CNN [10] and a question Q related to the image, we generate commonsense inferences C on the events and entities in the question phrase and two object tags O, and select the set of commonsense inferences that is most useful for answering the question, C = {C1, C2, ..., Ck} (§3.1). Finally, we embed Q, I and C as input to VLC-BERT and train it to predict an answer A to Q (§3.2).
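The overall flow can be summarized as the pseudocode-style sketch below, where every function is a hypothetical stub standing in for one of the components in Figure 2 and detailed in the following subsections.

from typing import List

def detect_regions(image) -> List:                  # stand-in for Fast R-CNN region features (I)
    return []

def detect_tags(image, top: int = 2) -> List[str]:  # stand-in for YOLOv5 object tags (O)
    return ["dog", "chair"][:top]

def generate_inferences(question: str, tags: List[str]) -> List[str]:  # Sec. 3.1.1 (COMET)
    return [f"{question} with {t}" for t in tags]   # placeholder, not real COMET output

def select_top_k(question: str, candidates: List[str], k: int) -> List[str]:  # Sec. 3.1.2 (SBERT)
    return candidates[:k]

def answer_question(image, question: str, k: int = 5) -> str:
    I = detect_regions(image)
    O = detect_tags(image, top=2)
    C = select_top_k(question, generate_inferences(question, O), k)
    # VLC-BERT (Sec. 3.2) embeds Q, I and C jointly to predict the answer A; stubbed here.
    return f"answer from {len(I)} regions and {len(C)} inferences"

print(answer_question(image=None, question="Why do they have umbrellas?"))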
3.1. Structured Knowledge Generation and Selection
3.1.1 Knowledge Generation
To generate commonsense knowledge, we employ the most
recent version of COMET [15] initialized using BART [19]
in a zero-shot setting. COMET is trained to complete 50
relation types from both ConceptNet [40] (such as AtLocation, MadeOf) and ATOMIC [35] (such as xNeed, xWant), thus capturing concept- as well as event-oriented knowledge. We generate inferences based on the 30 relation types most relevant to our work and supported by COMET.1 Consider the
example shown in Figure 2b. For the given question, “What
is the purpose of the umbrella?” we first process each ques-
tion using AllenNLP’s constituency parser [17] and convert
it into a declarative sentence, since COMET was mainly
trained on declarative sentences. In the example shown,
“What is the purpose of the umbrella?” is rephrased as
“The purpose of the umbrellas is”. We then adopt a state-
of-the-art object detection model, YOLOv5 [16], to trans-
late the corresponding image into object tags that COMET
can understand. We select the two most confident object tags and combine them with the question phrase to obtain a question-object (QO) phrase, “The purpose of the umbrella is, with dog and chair”. We restrict the number of object tags in COMET's input to two because adding more tags makes the inferences more conflated and noisy. In this manner, we obtain inferences that provide additional knowledge about both the visual and language inputs to VLC-BERT.

1 We include the full list of relation types in the supplementary material.
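For illustration, the snippet below builds such a question-object phrase. The rephrasing rule is a simplified stand-in for the constituency-parser-based conversion the paper uses (AllenNLP [17]); it only handles the example pattern.

def to_declarative(question: str) -> str:
    # e.g. "What is the purpose of the umbrella?" -> "The purpose of the umbrella is"
    # Toy rule only; the paper uses a constituency parser for this conversion.
    q = question.rstrip("?").strip()
    if q.lower().startswith("what is "):
        return q[len("what is "):].capitalize() + " is"
    return q  # fall back to the raw question text

def build_qo_phrase(question: str, object_tags: list) -> str:
    # Keep only the two most confident tags; more tags made the inferences noisier.
    tags = object_tags[:2]
    return f"{to_declarative(question)}, with {' and '.join(tags)}"

print(build_qo_phrase("What is the purpose of the umbrella?", ["dog", "chair", "person"]))
# -> "The purpose of the umbrella is, with dog and chair"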
We use beam search to decode the top 5 inferences for
each relation type, ranked according to the model’s con-
fidence. Overall, we get 30 × 5 = 150 inferences for
each input phrase. Finally, we convert each inference to
a sentence in natural language using relation-specific tem-
plates as defined in [7]. In the shown example, the assertion ⟨umbrella, AtLocation, store⟩ is expressed as “You are likely to find umbrella at store”. To remove redundant sentences of the same relation type, we measure lexical overlap as the percentage of words shared between two sentences, and exclude any sentence with more than 70% overlap with a previously constructed sentence of the same relation.
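A compact sketch of this verbalization and de-duplication step is shown below. The two templates and the exact overlap measure (fraction of shared words) are illustrative assumptions rather than the paper's full template set.

# Turn decoded COMET tail phrases (assumed top-5 beams per relation) into natural-language
# sentences and drop near-duplicates within each relation type.
TEMPLATES = {
    "AtLocation": "You are likely to find {head} at {tail}",
    "MadeOf": "{head} is made of {tail}",
    # ... one template per supported relation, following [7]
}

def word_overlap(a: str, b: str) -> float:
    """Fraction of words in `a` that also appear in `b` (one way to measure overlap)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa), 1)

def verbalize(head: str, tails_by_relation: dict, max_overlap: float = 0.7) -> list:
    sentences = []
    for rel, tails in tails_by_relation.items():  # up to 30 relations x 5 beams = 150 candidates
        kept = []
        for tail in tails:
            sent = TEMPLATES[rel].format(head=head, tail=tail)
            # Drop sentences sharing >70% of their words with an earlier one of the same relation.
            if all(word_overlap(sent, prev) <= max_overlap for prev in kept):
                kept.append(sent)
        sentences.extend(kept)
    return sentences

print(verbalize("umbrella", {"AtLocation": ["store", "umbrella stand"],
                             "MadeOf": ["umbrella head", "umbrella handle"]}))
# Near-duplicate tails (e.g. "umbrella stand" vs. "store") are filtered out by the overlap check.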
3.1.2 Knowledge Selection
Due to the computational cost and the noise associated with such a large number of text tokens, feeding up to 150 COMET inferences into the VL transformer model is impractical. In order to rank and select the
inferences, we employ semantic search based on sentence
transformers (SBERT) [31], which are pre-trained on tasks
that retrieve candidate answers to a search query. In this
method, the question and the inferences are embedded
into the same vector space using SBERT [31] and cosine
similarity between the question and the inference embeddings is used to rank the inferences. We prune the set of inference sentences C by picking the K = 5 inferences ranked highest by this similarity score.
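A minimal sketch of this selection step with the sentence-transformers library is shown below. The specific checkpoint name is our assumption (any semantic-search SBERT model plays the same role), and K = 5 follows the paper.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # assumed checkpoint, not necessarily the paper's

def select_top_k(question: str, inferences: list, k: int = 5) -> list:
    """Embed question and candidate inferences, rank by cosine similarity, keep the top K."""
    q_emb = model.encode(question, convert_to_tensor=True)
    c_emb = model.encode(inferences, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]                 # cosine similarity to the question
    top = scores.topk(k=min(k, len(inferences)))
    return [inferences[i] for i in top.indices.tolist()]

print(select_top_k("Why do they have umbrellas?",
                   ["umbrella protects from rain",
                    "umbrella protects from sun",
                    "umbrella is made of umbrella head"], k=2))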