Towards Robust Visual Question Answering: Making the Most of Biased
Samples via Contrastive Learning
Qingyi Si1,2, Yuanxin Liu1,4, Fandong Meng3, Zheng Lin1,2
Peng Fu1, Yanan Cao1,2, Weiping Wang1, Jie Zhou3
1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3Pattern Recognition Center, WeChat AI, Tencent Inc, China 4Peking University
{siqingyi,linzheng,fupeng,caoyanan,wangweiping}@iie.ac.cn,
liuyuanxin@stu.pku.edu.cn,{fandongmeng,withtomzhou}@tencent.com
Abstract
Models for Visual Question Answering (VQA) often rely on spurious correlations, i.e., the language priors, that appear in the biased samples of the training set, which makes them brittle against out-of-distribution (OOD) test data. Recent methods have achieved promising progress in overcoming this problem by reducing the impact of biased samples on model training. However, these models reveal a trade-off: the improvements on OOD data come at the cost of severely sacrificing performance on the in-distribution (ID) data (which is dominated by the biased samples). Therefore, we propose a novel contrastive learning approach, MMBS1, for building robust VQA models by Making the Most of Biased Samples. Specifically, we construct positive samples for contrastive learning by eliminating the information related to spurious correlations from the original training samples, and explore several strategies to use the constructed positive samples for training. Instead of undermining the importance of biased samples in model training, our approach precisely exploits the biased samples for unbiased information that contributes to reasoning. The proposed method is compatible with various VQA backbones. We validate our contributions by achieving competitive performance on the OOD dataset VQA-CP v2 while preserving robust performance on the ID dataset VQA v2.
1 Introduction
Visual Question Answering (VQA), aiming to answer a question about a given image, is a multi-modal task that lies at the intersection of vision and language. Despite the remarkable performance on many VQA datasets such as VQA v2 (Goyal et al., 2017), recent studies (Antol et al., 2015; Kafle and Kanan, 2017; Agrawal et al., 2016) find that VQA systems rely heavily on language priors.

Corresponding author: Zheng Lin.
1Joint work with Pattern Recognition Center, WeChat AI, Tencent Inc, China. The code is available at https://github.com/PhoebusSi/MMBS.
Figure 1: Qualitative comparison of our method LMH+MMBS against the plain method UpDn and the debiasing method LMH. In VQA-CP v2 (upper), the question types ('Does the' and 'How many') bias UpDn toward the most common answers (see Fig. 5 for the answer distribution). LMH alleviates the language priors for yesno questions (upper left), while it fails on the more difficult non-yesno questions (upper right). Besides, LMH damages the ID performance, giving an uncommon answer to the common sample from VQA v2 (lower left). MMBS improves the OOD performance while maintaining the ID performance (lower right).
These priors are caused by the strong spurious correlations between certain question categories and answers, e.g., the frequent co-occurrence of the question category 'what sport' and the answer 'tennis' (Selvaraju et al., 2019). As a result, VQA models that are over-reliant on the language priors of the training set fail to generalize to the OOD dataset VQA-CP v2 (Agrawal et al., 2018).
Recently, several methods have achieved remarkable progress in overcoming this language prior problem. They assign less importance to the biased samples that can be correctly classified through the spurious correlations. However, most of them achieve gains on VQA-CP v2 at the cost of degrading the model's ID performance on the VQA v2 dataset (see Tab. 2). This trade-off suggests that the success of these methods merely comes from biasing
the models in other directions, rather than endowing them with the reasoning capability and robustness to language priors. Ideally, a robust VQA system should maintain its performance on the ID dataset while overcoming the language priors, as shown in Fig. 1.
We argue that the essence of both the language-prior and trade-off problems lies in the learning of biased samples. The former is caused by over-reliance on biased information from biased samples, while the latter is caused by undermining the importance of biased samples. Therefore, if a model can precisely exploit the biased samples for the intrinsic information of the given task, both problems can be alleviated simultaneously.
Motivated by this, we propose a self-supervised contrastive learning method (MMBS) for building robust VQA systems by Making the Most of Biased Samples. Firstly, in view of the characteristics of the spurious correlations, we construct two kinds of positive samples for the questions of training samples to exploit the unbiased information, and then design four strategies to use the constructed positive samples. Next, we propose a novel algorithm to distinguish between biased and unbiased samples, so as to treat them differently. On this basis, we introduce an auxiliary contrastive training objective, which helps the model learn a more general representation with ameliorated language priors by narrowing the distance between original samples and positive samples in the cross-modality joint embedding space.
To summarize, our contributions are as follows: i) We propose a novel contrastive learning method, which effectively addresses the language prior problem and the ID-OOD performance trade-off in VQA by making the most of biased samples. ii) We propose an algorithm to distinguish between biased and unbiased samples and treat them differently in contrastive learning. iii) Experimental results demonstrate that our method is compatible with various VQA backbones and achieves competitive performance on the language-bias-sensitive VQA-CP v2 dataset while preserving the original accuracy on the in-distribution VQA v2 dataset.
2 Related Work
Overcoming Language Priors in VQA. Recently, the language biases in VQA datasets have attracted the attention of many researchers (Goyal et al., 2017; Antol et al., 2015; Agrawal et al., 2016; Kervadec et al., 2021). In response to this problem, numerous methods have been proposed to debias VQA models. The most effective of them can be roughly divided into two categories. Ensemble-based methods (Grand and Belinkov, 2019; Belinkov et al., 2019; Cadene et al., 2019; Clark et al., 2019; Mahabadi and Henderson, 2019; Niu et al., 2021) introduce a biased model, which is designed to focus on the spurious features, to assist the training of the main model. For example, the recent method LPF (Liang et al., 2021) leverages the output distribution of the bias model to down-weight the biased samples when computing the VQA loss. However, these methods neglect the useful information in biased samples that helps reasoning. Data-balancing methods (Zhu et al., 2020; Liang et al., 2020) balance the training priors. For example, CSS and Mutant (Chen et al., 2020; Gokhale et al., 2020) generate samples by masking the critical object in images and word in questions, and by semantic image mutations, respectively. These methods usually outperform other debiasing methods by a large margin on VQA-CP v2, because they bypass the challenge of the imbalanced setting (Liang et al., 2021; Niu et al., 2021) by explicitly balancing the answer distribution at the training stage. Though our method constructs positive questions, it does not change the training answer distribution. We also extend our method to the data-balancing method SAR (Si et al., 2021).
Contrastive Learning in VQA. Recently, contrastive learning has been well developed in unsupervised learning (Oord et al., 2018; He et al., 2020), while its application in VQA is still at an early stage. CL (Liang et al., 2020) is the first work to employ contrastive learning to improve VQA models' robustness. Its motivation is to learn a better relationship among the input sample and the factual and counterfactual samples generated by CSS. However, CL brings only a weak OOD performance gain and an ID performance drop on top of CSS. In contrast, our method attributes the key to solving language bias to the positive-sample designs that exclude the spurious correlations. It is model-agnostic and can boost models' OOD performance significantly while retaining the ID performance.
3 Method
Fig. 2 shows an overview of MMBS, which includes: 1) a backbone VQA model; 2) a positive sample construction module; 3) an unbiased sample selection module; and 4) a contrastive learning objective.

Figure 2: Overview of our method. The question category words are highlighted in yellow. The orange circle and blue triangle denote the cross-modality representations of the original sample and the positive sample. The other samples in the same batch are the negative samples, which are denoted by the gray circles.
3.1 Backbone VQA Model
The backbone VQA model is a free choice in MMBS. The widely-used backbone models (Anderson et al., 2018; Mahabadi and Henderson, 2019) treat VQA as a multi-class multi-label classification task. Concretely, we are given a VQA dataset $D=\{I_i, Q_i, A_i\}_{i=1}^{N}$ with $N$ samples, where $I_i \in \mathcal{I}$ and $Q_i \in \mathcal{Q}$ are the image and question of the $i$-th sample, $A_i \in \mathcal{A}$ is the ground-truth answer, which is usually in multi-label form, and $tgt_i$ is the corresponding target score of each label. Most existing VQA models consist of four parts: the question encoder $e_q(\cdot)$, the image encoder $e_v(\cdot)$, the fusion function $F(\cdot)$ and the classifier $clf(\cdot)$. For example, LXMERT (Tan and Bansal, 2019) encodes the image and the question text separately in two streams to extract visual features $V_i=e_v(I_i)$ and textual features $T_i=e_q(Q_i)$. Next, the higher co-attentional transformer layers fuse the two features and project them into the cross-modality joint embedding space, i.e., $F(V_i, T_i)$. Finally, the classifier outputs the answer prediction:

$$P(A \mid I_i, Q_i) = clf(F(V_i, T_i)) \quad (1)$$

The training objective minimizes the multi-label soft loss $\mathcal{L}_{vqa}$, which can be formalized as follows:

$$\mathcal{L}_{vqa} = -\frac{1}{N}\sum_{i=1}^{N}\big[\, tgt_i \cdot \log(\delta(F(V_i, T_i))) + (1 - tgt_i) \cdot \log(1 - \delta(F(V_i, T_i)))\,\big] \quad (2)$$

where $\delta$ denotes the sigmoid function.
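To make the interfaces above concrete, here is a minimal PyTorch sketch of a backbone forward pass and the multi-label soft loss. The class and function names (ToyVQABackbone, joint_embedding, multi_label_soft_loss) and the single-linear-layer fusion are illustrative assumptions, not the released MMBS code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQABackbone(nn.Module):
    """Stand-in for a two-stream VQA backbone (e.g., UpDn or LXMERT).

    Real backbones use attention-based encoders and co-attentional fusion;
    here a single linear projection plays the role of F(.) so that the
    structure of Eq. (1) stays visible.
    """
    def __init__(self, hidden_dim: int, num_answers: int):
        super().__init__()
        self.fusion = nn.Linear(hidden_dim * 2, hidden_dim)  # toy fusion F(.)
        self.clf = nn.Linear(hidden_dim, num_answers)        # classifier clf(.)

    def joint_embedding(self, v_feat: torch.Tensor, t_feat: torch.Tensor) -> torch.Tensor:
        # Cross-modality joint embedding F(V_i, T_i).
        return torch.relu(self.fusion(torch.cat([v_feat, t_feat], dim=-1)))

    def forward(self, v_feat: torch.Tensor, t_feat: torch.Tensor) -> torch.Tensor:
        # Eq. (1): answer logits clf(F(V_i, T_i)); the sigmoid is applied in the loss.
        return self.clf(self.joint_embedding(v_feat, t_feat))

def multi_label_soft_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Eq. (2): binary cross-entropy against soft target scores tgt_i in [0, 1],
    # summed over answer candidates and averaged over the batch.
    per_sample = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none").sum(dim=-1)
    return per_sample.mean()
```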
3.2 Positive Sample Construction
To make the most of the unbiased information contained in the biased samples, we first construct positive samples that exclude the biased information. According to the construction of VQA-CP v2, there is a shift between the training and test sets in terms of the answer distribution under the same question category (Teney et al., 2020; Agrawal et al., 2018). As a result, the frequent co-occurrence of certain answers and question categories in the training set is a major source of bias. Therefore, we construct two kinds of positive questions ($Q_i^+$) by corrupting the question category information of each input question ($Q_i$):

Shuffling: We randomly shuffle the words in the question sentence so that the question category words are mixed with the other words. This increases the difficulty of building the correlations between question category and answer.

Removal: We remove the question category words from the question sentence. This eliminates the co-occurrence of answer and question category words completely.

We notice that the construction process could induce some unexpected noise in the positive samples. To address this concern, we present more positive samples in Appendix A.1 and discuss their quality and potential impact on our method.
We also propose four strategies for using the constructed positive questions during training:

S: Use the Shuffling positive questions.
R: Use the Removal positive questions.
B: Use both positive questions.
SR: Use the Shuffling positive questions for non-yesno (i.e., 'Num' and 'Other') questions and use the Removal ones for yesno (i.e., 'Y/N') questions.

The SR strategy deals with yesno and non-yesno questions in different ways based on their characteristics. Intuitively, the question categories of yesno questions usually contain little information, as they are mostly comprised of 'is', 'do', etc. By contrast, the question categories of non-yesno questions tend to contain more information which is important for answering correctly. Therefore, Removal is not applied to non-yesno questions.
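The following sketch illustrates the two constructions and the strategy selection. It assumes the question category is given as the leading words of the question (as in the VQA-CP v2 annotation) and that the answer type is one of "yes/no", "number" and "other"; the helper names and the simple prefix matching are illustrative, not the paper's exact implementation.

```python
import random
from typing import List

def shuffle_question(question: str, rng: random.Random) -> str:
    # Shuffling: randomly permute all words so the question category words
    # are mixed with the rest of the question.
    words = question.split()
    rng.shuffle(words)
    return " ".join(words)

def remove_category(question: str, category: str) -> str:
    # Removal: drop the question category words, assumed here to be the
    # question prefix (e.g. "what sport" or "does the").
    words = question.split()
    cat_len = len(category.split())
    if " ".join(words[:cat_len]).lower() == category.lower():
        words = words[cat_len:]
    return " ".join(words)

def build_positive(question: str, category: str, answer_type: str,
                   strategy: str = "SR", seed: int = 0) -> List[str]:
    """Return the positive question(s) for one training sample under S/R/B/SR."""
    rng = random.Random(seed)
    if strategy == "S":
        return [shuffle_question(question, rng)]
    if strategy == "R":
        return [remove_category(question, category)]
    if strategy == "B":
        return [shuffle_question(question, rng), remove_category(question, category)]
    # SR: Removal for yesno questions, Shuffling for non-yesno questions.
    if answer_type == "yes/no":
        return [remove_category(question, category)]
    return [shuffle_question(question, rng)]
```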
Adopting any strategy above, we can obtain the positive samples $\{I_i, Q_i^+\}_{i=1}^{B}$ for the input samples $\{I_i, Q_i\}_{i=1}^{B}$. The negative samples $\{I_b, Q_b\}_{b=1}^{B}$, where $b \neq i$, are the other samples in the same batch.
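As a rough illustration of how a contrastive objective over these positives and in-batch negatives can be written, below is an InfoNCE-style loss over the cross-modality representations. The cosine similarity, the temperature value, and the function name contrastive_loss are assumptions of this sketch, not necessarily the exact objective used in MMBS.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(orig_repr: torch.Tensor,
                     pos_repr: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over cross-modality representations.

    orig_repr: (B, d) joint embeddings F(V_i, T_i) of the original samples.
    pos_repr:  (B, d) joint embeddings F(V_i, T_i^+) of the positive samples.
    The other samples in the batch serve as negatives.
    """
    orig = F.normalize(orig_repr, dim=-1)
    pos = F.normalize(pos_repr, dim=-1)
    logits = orig @ pos.t() / temperature           # (B, B) similarity matrix
    labels = torch.arange(orig.size(0), device=orig.device)
    return F.cross_entropy(logits, labels)          # pull the i-th positive closer, push the rest away
```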