trained on commonsense knowledge graphs. We augment
sentence transformers [31] to rank, filter and embed the
commonsense inferences. We incorporate the filtered in-
ferences into VLC-BERT using an attention-driven fusion
mechanism that learns to focus on the most important infer-
ences for each question. Commonsense knowledge may not be necessary for answering every question, as some questions are purely visual, factual, or straightforward. To avoid injecting noisy knowledge in such cases, we employ weak supervision to help discriminate between situations in which commonsense knowledge may or may not be valuable.
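To make this ranking and filtering step concrete, the snippet below is a minimal sketch of scoring generated inferences against the question with Sentence-BERT [31]; the checkpoint name and top-k cutoff are illustrative placeholders, not the exact configuration used in VLC-BERT.

```python
# Minimal sketch: rank commonsense inferences by semantic similarity to the
# question using Sentence-BERT. Checkpoint name and top-k cutoff are assumed.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint

def rank_inferences(question, inferences, top_k=5):
    """Return the top_k inferences most semantically similar to the question."""
    q_emb = encoder.encode(question, convert_to_tensor=True)
    inf_emb = encoder.encode(inferences, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, inf_emb)[0]           # one score per inference
    top = scores.argsort(descending=True)[:top_k]
    return [(inferences[int(i)], float(scores[i])) for i in top]
```

Inferences scoring below a similarity threshold can then be discarded before the remaining ones are embedded and fused (the threshold itself is an implementation choice).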
Our evaluations on the challenging OK-VQA [29] and
A-OKVQA [36] datasets confirm that leveraging common-
sense is consistently useful for knowledge-intensive visual
question answering tasks. We analyze the successful predictions and show how the commonsense inferences help answer difficult questions.
2. Related Work
2.1. Vision-Language Transformer Models
Pre-trained Vision-Language models based on BERT [8]
have shown impressive performance on downstream mul-
timodal tasks such as Visual Question Answering. ViL-
BERT [25] and LXMERT [42] use a two-stream architec-
ture to first encode language and vision modalities indepen-
dently, and then apply a cross-modality encoder to align
textual and visual tokens. VL-BERT [41], OSCAR [22]
and OSCAR+ [50] use a single-stream architecture to di-
rectly learn inter-modality interactions. Large-scale pre-
training is commonly done using the Conceptual Captions
[38] dataset, with objectives that are designed to encourage
interaction between modalities, such as predicting masked
tokens or image regions [22, 25, 41, 42], and using con-
trastive loss between modalities [22]. As a result, such
models inherently capture some commonsense knowledge
through their pre-training regime. While these models per-
form impressively on downstream tasks such as VQA [1],
they typically perform worse on questions requiring rea-
soning about knowledge beyond the image content or in-
volving multiple reasoning hops. In our work, we introduce
VLC-BERT, a multimodal transformer model based on VL-
BERT that explicitly incorporates external knowledge to al-
leviate this issue.
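For concreteness, the masked-token objective mentioned above can be sketched as follows; the `vl_model` interface, tensor shapes, and names are hypothetical simplifications (real models additionally mask image regions and apply image-text matching or contrastive losses).

```python
# Conceptual sketch (not taken from any specific model) of the masked-token
# objective in single-stream VL pre-training: text tokens and visual region
# features are encoded jointly, and the original ids of masked text tokens
# are predicted. `vl_model` and its methods are hypothetical placeholders.
import torch.nn.functional as F

def masked_token_loss(vl_model, text_ids, region_feats, mask_positions, labels):
    """
    text_ids:       (B, T) token ids with masked positions set to [MASK]
    region_feats:   (B, R, D) features of detected image regions
    mask_positions: (B, T) boolean mask over the text positions
    labels:         (num_masked,) original ids of the masked tokens
    """
    hidden = vl_model.encode(text_ids, region_feats)   # (B, T + R, D) joint encoding
    text_hidden = hidden[:, : text_ids.size(1)]        # keep only text positions
    logits = vl_model.lm_head(text_hidden)             # (B, T, vocab_size)
    return F.cross_entropy(logits[mask_positions], labels)
```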
2.2. Knowledge-based Visual Question Answering
In recent years, several VQA datasets were designed
specifically to require reasoning about external knowledge
beyond the image, whether using factual and web infor-
mation (FVQA [45], WebQA [5]), a provided text pas-
sage (VLQA [34]), commonsense-driven reasoning (VCR
[49]), or external commonsense knowledge (OK-VQA [29],
A-OKVQA [36]). This motivated a line of work on
knowledge-enhanced VL transformer models. External
knowledge is typically retrieved from a structured knowl-
edge base like ConceptNet [40], in the form of a subgraph,
and integrated into the VL transformer as an additional in-
put [9, 20, 28, 47]. Alternative sources of knowledge include
image captions [33], Google Search results [26], and tex-
tual and visual knowledge from Wikipedia and Google Im-
ages [47]. In contrast to most of the preceding work, PICa
[48] and Knowledge Augmented Transformer (KAT) [13]
attempt to use GPT-3 [3] in a few-shot setting for the VQA task by building prompts that contain a caption and object tags generated from the image, followed by the question, and asking the model to produce an answer.
In our proposed model, we focus on a specific subset of
the knowledge-intensive datasets that require commonsense
knowledge. Our approach to incorporating commonsense knowledge, which uses COMET [15], is distinctly different, far simpler, and more cost-effective.
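As an illustration of the PICa/KAT-style prompting described above, a simplified prompt builder might look as follows; the template wording is an assumption, and the in-context examples and the actual GPT-3 API call are omitted.

```python
# Hedged sketch of PICa/KAT-style prompting: the image is verbalized as a
# caption plus object tags, then combined with the question into a text prompt
# for GPT-3. The exact template used by PICa/KAT may differ.
def build_vqa_prompt(caption, object_tags, question):
    context = f"Context: {caption}. Objects: {', '.join(object_tags)}."
    return f"{context}\nQ: {question}\nA:"

# Example:
# build_vqa_prompt("a man riding a wave on a surfboard",
#                  ["person", "surfboard", "wave"],
#                  "What sport is being performed?")
```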
2.3. Knowledge Incorporation in NLP
Structured large-scale knowledge bases (KBs) like Con-
ceptNet [40] and ATOMIC [35] are widely used in NLP
tasks to provide additional commonsense knowledge to
models. ConceptNet contains 3.4M assertions focusing on
concept and entity relations (such as RelatedTo, Synonym,
IsA, MadeOf). ATOMIC contains 1.33M triplets focusing
on event-centric social commonsense about the causes, effects, and mental states of event participants. Several approaches
were proposed for incorporating symbolic knowledge from
these KBs into downstream NLP tasks, such as encoding
subgraphs of relevant knowledge [9, 23] and pre-training
on commonsense knowledge bases or tasks [51]. Despite
the performance improvements, incorporating knowledge
directly from KBs suffers from two limitations: lack of
coverage and lack of consideration for context. Com-
monsense Transformer, COMET [15], attempts to allevi-
ate these issues by fine-tuning pre-trained language models
on KBs. COMET can dynamically generate inferences for the various KB relations given new, unseen inputs. It has been suc-
cessfully used for generating knowledge in language tasks
[4, 27, 39, 43]. Inspired by the success of these models, we
chose to use COMET [15] to generate relevant contextual
expansions rather than directly retrieving knowledge from
KBs. To the best of our knowledge, we are the first to in-
corporate commonsense knowledge using COMET in VQA
tasks. Newer COMET variants [30, 46] are less applicable
to OK-VQA and A-OKVQA, as they focus more on event commonsense than on entities.
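For reference, querying a COMET model for inferences can be sketched as follows; the checkpoint path is a placeholder, and the `{head} {relation} [GEN]` query format follows the released COMET-ATOMIC-2020 code and should be verified against the specific checkpoint used.

```python
# Hedged sketch of generating commonsense inferences with a COMET seq2seq
# model via Hugging Face transformers. The checkpoint path is a placeholder.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "path/to/comet-atomic-2020-bart"  # placeholder, not an official hub id
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def comet_inferences(head, relation, num_beams=5):
    """Generate tail phrases for (head, relation),
    e.g. comet_inferences("man holding a surfboard", "xIntent")."""
    query = f"{head} {relation} [GEN]"
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(**inputs, num_beams=num_beams,
                             num_return_sequences=num_beams, max_length=32)
    return [tokenizer.decode(o, skip_special_tokens=True).strip() for o in outputs]
```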
3. Method
We briefly outline the overall architecture of our model
and then delve deeper into its individual components. Fig-