Towards Robust Visual Question Answering: Making the Most of Biased
Samples via Contrastive Learning
Qingyi Si1,2, Yuanxin Liu1,4, Fandong Meng3, Zheng Lin1,2
Peng Fu1, Yanan Cao1,2, Weiping Wang1, Jie Zhou3
1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3Pattern Recognition Center, WeChat AI, Tencent Inc, China 4Peking University
{siqingyi,linzheng,fupeng,caoyanan,wangweiping}@iie.ac.cn,
liuyuanxin@stu.pku.edu.cn,{fandongmeng,withtomzhou}@tencent.com
Abstract
Models for Visual Question Answering (VQA) often rely on spurious correlations, i.e., the language priors, that appear in the biased samples of the training set, which makes them brittle against out-of-distribution (OOD) test data. Recent methods have achieved promising progress in overcoming this problem by reducing the impact of biased samples on model training. However, these models reveal a trade-off: the improvements on OOD data come at the cost of severely sacrificing performance on the in-distribution (ID) data (which is dominated by the biased samples). Therefore, we propose a novel contrastive learning approach, MMBS1, for building robust VQA models by Making the Most of Biased Samples. Specifically, we construct positive samples for contrastive learning by eliminating the information related to spurious correlations from the original training samples, and explore several strategies to use the constructed positive samples for training. Instead of undermining the importance of biased samples in model training, our approach precisely exploits the biased samples for unbiased information that contributes to reasoning. The proposed method is compatible with various VQA backbones. We validate our contributions by achieving competitive performance on the OOD dataset VQA-CP v2 while preserving robust performance on the ID dataset VQA v2.
1 Introduction
Visual Question Answering (VQA), aiming to answer a question about a given image, is a multi-modal task that lies at the intersection of vision and language. Despite the remarkable performance on many VQA datasets such as VQA v2 (Goyal et al., 2017), recent studies (Antol et al., 2015; Kafle and Kanan, 2017; Agrawal et al., 2016) find that VQA systems rely heavily on language priors.

Corresponding author: Zheng Lin.
1Joint work with Pattern Recognition Center, WeChat AI, Tencent Inc, China. The code is available at https://github.com/PhoebusSi/MMBS.
Figure 1: Qualitative comparison of our method LMH+MMBS against the plain method UpDn and the debiasing method LMH. In VQA-CP v2 (upper), the question types ('Does the' and 'How many') bias UpDn toward the most common answers (see Fig. 5 for the answer distribution). LMH alleviates the language priors for yesno questions (upper left), while it fails on the more difficult non-yesno questions (upper right). Besides, LMH damages the ID performance, giving an uncommon answer to the common sample from VQA v2 (lower left). MMBS improves the OOD performance while maintaining the ID performance (lower right).
These priors are caused by the strong spurious correlations between certain question categories and answers, e.g., the frequent co-occurrence of the question category 'what sport' and the answer 'tennis' (Selvaraju et al., 2019). As a result, VQA models that are over-reliant on the language priors of the training set fail to generalize to the OOD dataset VQA-CP v2 (Agrawal et al., 2018).
Recently, several methods have achieved remarkable progress in overcoming this language prior problem. They assign less importance to the biased samples that can be correctly classified through the spurious correlations. However, most of them achieve gains on VQA-CP v2 at the cost of degrading the model's ID performance on the VQA v2 dataset (see Tab. 2). This trade-off suggests that the success of these methods merely comes from biasing
the models in other directions, rather than endowing them with the reasoning capability and robustness to language priors. Ideally, a robust VQA system should maintain its performance on the ID dataset while overcoming the language priors, as shown in Fig. 1.
We argue that the essence of both the language-prior and trade-off problems lies in the learning of biased samples. The former is caused by over-reliance on biased information from biased samples, while the latter is caused by undermining the importance of biased samples. Therefore, if a model can precisely exploit the biased samples for the intrinsic information of the given task, both problems can be alleviated simultaneously.
Motivated by this, we propose a self-supervised contrastive learning method (MMBS) for building robust VQA systems by Making the Most of Biased Samples. Firstly, in view of the characteristics of the spurious correlations, we construct two kinds of positive samples for the questions of training samples to exploit the unbiased information, and then design four strategies to use the constructed positive samples. Next, we propose a novel algorithm to distinguish between biased and unbiased samples, so as to treat them differently. On this basis, we introduce an auxiliary contrastive training objective, which helps the model learn a more general representation with ameliorated language priors by narrowing the distance between original samples and positive samples in the cross-modality joint embedding space.
To summarize, our contributions are as follows: i) We propose a novel contrastive learning method, which effectively addresses the language prior problem and the ID-OOD performance trade-off in VQA by making the most of biased samples. ii) We propose an algorithm to distinguish between biased and unbiased samples and treat them differently in contrastive learning. iii) Experimental results demonstrate that our method is compatible with various VQA backbones and achieves competitive performance on the language-bias-sensitive VQA-CP v2 dataset while preserving the original accuracy on the in-distribution VQA v2 dataset.
2 Related Work
Overcoming Language Priors in VQA. Recently, the language biases in VQA datasets have attracted the attention of many researchers (Goyal et al., 2017; Antol et al., 2015; Agrawal et al., 2016; Kervadec et al., 2021). In response to this problem, numerous methods have been proposed to debias VQA models. The most effective of them can be roughly divided into two categories. Ensemble-based methods (Grand and Belinkov, 2019; Belinkov et al., 2019; Cadene et al., 2019; Clark et al., 2019; Mahabadi and Henderson, 2019; Niu et al., 2021) introduce a biased model, which is designed to focus on the spurious features, to assist the training of the main model. For example, the recent method LPF (Liang et al., 2021) leverages the output distribution of the bias model to down-weight the biased samples when computing the VQA loss. However, these methods neglect the useful information in biased samples that helps reasoning. Data-balancing methods (Zhu et al., 2020; Liang et al., 2020) balance the training priors. For example, CSS and Mutant (Chen et al., 2020; Gokhale et al., 2020) generate samples by masking the critical object in images and word in questions, and by semantic image mutations, respectively. These methods usually outperform other debiasing methods by a large margin on VQA-CP v2, because they bypass the challenge of the imbalanced setting (Liang et al., 2021; Niu et al., 2021) by explicitly balancing the answer distribution at the training stage. Though our method constructs positive questions, it does not change the training answer distribution. We also extend our method to the data-balancing method SAR (Si et al., 2021).
Contrastive Learning in VQA. Recently, contrastive learning has been well developed in unsupervised learning (Oord et al., 2018; He et al., 2020), while its application in VQA is still at an early stage. CL (Liang et al., 2020) is the first work to employ contrastive learning to improve VQA models' robustness. Its motivation is to learn a better relationship among the input sample and the factual and counterfactual samples generated by CSS. However, CL brings only a weak OOD performance gain and an ID performance drop on top of CSS. In contrast, our method attributes the key to solving language bias to the positive-sample designs that exclude the spurious correlations. It is model-agnostic and can boost models' OOD performance significantly while retaining the ID performance.
3 Method
Fig. 2 shows an overview of MMBS, which includes: 1) a backbone VQA model; 2) a positive sample construction module; 3) an unbiased sample selection module; and 4) a contrastive learning objective.

Figure 2: Overview of our method. The question category words are highlighted in yellow. The orange circle and blue triangle denote the cross-modality representations of the original sample and the positive sample. The other samples in the same batch are the negative samples, which are denoted by the gray circles.
3.1 Backbone VQA Model
The backbone VQA model is a free choice in MMBS. The widely-used backbone models (Anderson et al., 2018; Mahabadi and Henderson, 2019) treat VQA as a multi-class multi-label classification task. Concretely, we are given a VQA dataset $D=\{I_i, Q_i, A_i\}_{i=1}^{N}$ with $N$ samples, where $I_i \in \mathcal{I}$ and $Q_i \in \mathcal{Q}$ are the image and question of the $i$-th sample, $A_i \in \mathcal{A}$ is the ground-truth answer, which is usually in multi-label form, and $tgt_i$ is the corresponding target score of each label. Most existing VQA models consist of four parts: the question encoder $e_q(\cdot)$, the image encoder $e_v(\cdot)$, the fusion function $F(\cdot)$ and the classifier $clf(\cdot)$. For example, LXMERT (Tan and Bansal, 2019) encodes the image and the question text separately in two streams to extract visual features $V_i=e_v(I_i)$ and textual features $T_i=e_q(Q_i)$. Next, the higher co-attentional transformer layers fuse the two features and project them into the cross-modality joint embedding space, i.e., $F(V_i, T_i)$. Finally, the classifier outputs the answer prediction:

$$P(A \mid I_i, Q_i) = clf(F(V_i, T_i)) \quad (1)$$

The training objective minimizes the multi-label soft loss $\mathcal{L}_{vqa}$, which can be formalized as follows:

$$\mathcal{L}_{vqa} = -\frac{1}{N}\sum_{i=1}^{N}\big[\, tgt_i \cdot \log(\delta(F(V_i, T_i))) + (1 - tgt_i) \cdot \log(1 - \delta(F(V_i, T_i)))\,\big] \quad (2)$$

where $\delta$ denotes the sigmoid function.
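To make the interfaces above concrete, here is a minimal PyTorch sketch of a backbone forward pass and the multi-label soft loss. The class and function names (ToyVQABackbone, joint_embedding, multi_label_soft_loss) and the single-linear-layer fusion are illustrative assumptions, not the released MMBS code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQABackbone(nn.Module):
    """Stand-in for a two-stream VQA backbone (e.g., UpDn or LXMERT).

    Real backbones use attention-based encoders and co-attentional fusion;
    here a single linear projection plays the role of F(.) so that the
    structure of Eq. (1) stays visible.
    """
    def __init__(self, hidden_dim: int, num_answers: int):
        super().__init__()
        self.fusion = nn.Linear(hidden_dim * 2, hidden_dim)  # toy fusion F(.)
        self.clf = nn.Linear(hidden_dim, num_answers)        # classifier clf(.)

    def joint_embedding(self, v_feat: torch.Tensor, t_feat: torch.Tensor) -> torch.Tensor:
        # Cross-modality joint embedding F(V_i, T_i).
        return torch.relu(self.fusion(torch.cat([v_feat, t_feat], dim=-1)))

    def forward(self, v_feat: torch.Tensor, t_feat: torch.Tensor) -> torch.Tensor:
        # Eq. (1): answer logits clf(F(V_i, T_i)); the sigmoid is applied in the loss.
        return self.clf(self.joint_embedding(v_feat, t_feat))

def multi_label_soft_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Eq. (2): binary cross-entropy against soft target scores tgt_i in [0, 1],
    # summed over answer candidates and averaged over the batch.
    per_sample = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none").sum(dim=-1)
    return per_sample.mean()
```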
3.2 Positive Sample Construction
To make the most of the unbiased information contained in the biased samples, we first construct positive samples that exclude the biased information. According to the construction of VQA-CP v2, there is a shift between the training and test sets in terms of the answer distribution under the same question category (Teney et al., 2020; Agrawal et al., 2018). As a result, the frequent co-occurrence of certain answers and question categories in the training set is a major source of bias. Therefore, we construct two kinds of positive questions ($Q_i^+$) by corrupting the question category information of each input question ($Q_i$):

Shuffling: We randomly shuffle the words in the question sentence so that the question category words are mixed with the other words. This increases the difficulty of building the correlations between question category and answer.

Removal: We remove the question category words from the question sentence. This eliminates the co-occurrence of answer and question category words completely.

We notice that the construction process could induce some unexpected noise in the positive samples. To address this concern, we present more positive samples in Appendix A.1 and discuss their quality and potential impact on our method.
We also propose four strategies for using the constructed positive questions during training:

S: Use the Shuffling positive questions.
R: Use the Removal positive questions.
B: Use both positive questions.
SR: Use the Shuffling positive questions for non-yesno (i.e., 'Num' and 'Other') questions and use the Removal ones for yesno (i.e., 'Y/N') questions.

The SR strategy deals with yesno and non-yesno questions in different ways based on their characteristics. Intuitively, the question categories of yesno questions usually contain little information, as they are mostly comprised of 'is', 'do', etc. By contrast, the question categories of non-yesno questions tend to contain more information which is important for answering correctly. Therefore, Removal is not applied to non-yesno questions.
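The following sketch illustrates the two constructions and the strategy selection. It assumes the question category is given as the leading words of the question (as in the VQA-CP v2 annotation) and that the answer type is one of "yes/no", "number" and "other"; the helper names and the simple prefix matching are illustrative, not the paper's exact implementation.

```python
import random
from typing import List

def shuffle_question(question: str, rng: random.Random) -> str:
    # Shuffling: randomly permute all words so the question category words
    # are mixed with the rest of the question.
    words = question.split()
    rng.shuffle(words)
    return " ".join(words)

def remove_category(question: str, category: str) -> str:
    # Removal: drop the question category words, assumed here to be the
    # question prefix (e.g. "what sport" or "does the").
    words = question.split()
    cat_len = len(category.split())
    if " ".join(words[:cat_len]).lower() == category.lower():
        words = words[cat_len:]
    return " ".join(words)

def build_positive(question: str, category: str, answer_type: str,
                   strategy: str = "SR", seed: int = 0) -> List[str]:
    """Return the positive question(s) for one training sample under S/R/B/SR."""
    rng = random.Random(seed)
    if strategy == "S":
        return [shuffle_question(question, rng)]
    if strategy == "R":
        return [remove_category(question, category)]
    if strategy == "B":
        return [shuffle_question(question, rng), remove_category(question, category)]
    # SR: Removal for yesno questions, Shuffling for non-yesno questions.
    if answer_type == "yes/no":
        return [remove_category(question, category)]
    return [shuffle_question(question, rng)]
```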
Adopting any strategy above, we can obtain the positive samples $\{I_i, Q_i^+\}_{i=1}^{B}$ for the input samples $\{I_i, Q_i\}_{i=1}^{B}$. The negative samples $\{I_b, Q_b\}_{b=1}^{B}$, where $b \neq i$, are the other samples in the same batch.
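As a rough illustration of how a contrastive objective over these positives and in-batch negatives can be written, below is an InfoNCE-style loss over the cross-modality representations. The cosine similarity, the temperature value, and the function name contrastive_loss are assumptions of this sketch, not necessarily the exact objective used in MMBS.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(orig_repr: torch.Tensor,
                     pos_repr: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over cross-modality representations.

    orig_repr: (B, d) joint embeddings F(V_i, T_i) of the original samples.
    pos_repr:  (B, d) joint embeddings F(V_i, T_i^+) of the positive samples.
    The other samples in the batch serve as negatives.
    """
    orig = F.normalize(orig_repr, dim=-1)
    pos = F.normalize(pos_repr, dim=-1)
    logits = orig @ pos.t() / temperature           # (B, B) similarity matrix
    labels = torch.arange(orig.size(0), device=orig.device)
    return F.cross_entropy(logits, labels)          # pull the i-th positive closer, push the rest away
```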