Language Prior Is Not the Only Shortcut:
A Benchmark for Shortcut Learning in VQA
Qingyi Si1,2, Fandong Meng3, Mingyu Zheng1,2, Zheng Lin1,2
Yuanxin Liu1,4, Peng Fu1, Yanan Cao1,2, Weiping Wang1, Jie Zhou3
1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3Pattern Recognition Center, WeChat AI, Tencent Inc, China 4Peking University
{siqingyi,zhengmingyu,linzheng,fupeng,caoyanan,wangweiping}@iie.ac.cn,
liuyuanxin@stu.pku.edu.cn,{fandongmeng,withtomzhou}@tencent.com
Abstract
Visual Question Answering (VQA) models are prone to learning the shortcut solution formed by dataset biases rather than the intended solution. To evaluate a VQA model's reasoning ability beyond shortcut learning, the VQA-CP v2 dataset introduces a shift in the answer distribution of each question type between the training and test sets, so that a model cannot use training-set shortcuts to perform well on the test set. However, VQA-CP v2 only considers one type of shortcut (from question type to answer) and thus still cannot guarantee that a model relies on the intended solution rather than on a solution specific to this shortcut. To overcome this limitation, we propose a new dataset that considers varying types of shortcuts by constructing different distribution shifts in multiple OOD test sets. In addition, we overcome three troubling practices in the use of VQA-CP v2, e.g., selecting models using OOD test sets, and further standardize the OOD evaluation procedure. Our benchmark provides a more rigorous and comprehensive testbed for shortcut learning in VQA. We benchmark recent methods and find that methods specifically designed for particular shortcuts fail to simultaneously generalize to our varying OOD test sets. We also systematically study the varying shortcuts and provide several valuable findings, which may promote the exploration of shortcut learning in VQA.¹
1 Introduction
Visual Question Answering (VQA) (Antol et al., 2015) is a multi-modal task involving comprehension of and reasoning over vision and language. Despite the remarkable performance on many VQA datasets such as VQA v2 (Goyal et al., 2017), VQA models have been criticized for their tendency to depend on biases in the training set. That is, they tend to directly output the answer according to the question type (shortcut solution) without actual reasoning (intended solution) (Agrawal et al., 2018, 2016; Kervadec et al., 2021; Manjunatha et al., 2019). This widely studied language-prior problem is a typical symptom of Shortcut Learning (Geirhos et al., 2020) (App. A.1). In spite of such defects, models can still reach artificially excellent performance on VQA v2, whose test distribution is the same as the training distribution, i.e., under the in-distribution (IID) setting.

Corresponding author: Zheng Lin.

¹ Joint work with Pattern Recognition Center, WeChat AI, Tencent Inc, China. The code and data are available at https://github.com/PhoebusSi/VQA-VS.

Figure 1: (a) The accuracy improvement of LMH over its backbone model UpDn on nine OOD test sets. (The acronyms, like QT, are defined in Sec. 3.2.) (b) Solutions possibly learned by models.
To address this problem, VQA-CP v2 (Agrawal et al., 2018) was constructed by re-organizing the VQA v2 dataset so that the answer distributions of the same question type differ between the training set and the test set, i.e., under the OOD setting. VQA-CP v2 makes the shortcut (from question type to answer) in the training set invalid in the test set, and it has become a widely used OOD benchmark in the VQA community. With models' performance on VQA-CP v2 continually improving, it seems that existing methods (Clark et al., 2019; Si et al., 2021; Gokhale et al., 2020; Liang et al., 2021) have been able to overcome the shortcut learning problem. However, through analyzing VQA-CP v2 and existing methods, we point out two aspects that need to be improved:
First, VQA-CP v2 introduces only one specific type of controlled distribution shift, and thus its OOD setting can only evaluate a model's reasoning ability beyond one specific shortcut, rather than the intended solution. As shown in Fig. 1(a), despite performing well on VQA-CP v2, the debiasing method LMH (Clark et al., 2019) can only boost its backbone model UpDn on a few OOD test sets and fails to generalize to the others. This shows that VQA-CP v2 cannot identify whether models rely on other types of shortcuts (e.g., correlations between visual objects and answers). Therefore, as shown in Fig. 1(b), more OOD test sets are needed to measure a model's reliance on different types of shortcuts: the more OOD test sets on which performance improves simultaneously, the more confidently the model can be deemed to have learned the intended solution. Moreover, some studies (Kervadec et al., 2021; Dancette et al., 2021) demonstrate that abundant shortcuts in data are derived from real-world stereotypes, e.g., the sky is blue. Thus, to establish a reliable diagnostic for VQA models, constructing a new benchmark with multiple OOD test sets corresponding to varying types of shortcuts is an urgent need.
Second, three troubling practices (Teney et al., 2020b) exist in the experimental setup and the design of existing methods: 1) the training and test distributions of VQA-CP v2 are almost inverse to each other, and this known construction of the OOD setting can be easily exploited to achieve good performance; 2) the OOD test set is used for model selection; 3) IID performance is evaluated after retraining. These practices do not conform to real-world OOD scenarios and thus make the evaluated OOD performance unreliable.
To alleviate the single-shortcut limitation and overcome the three above-mentioned issues, we construct and publicly release a new VQA benchmark considering Varying Shortcuts (Sec. 3.2), named VQA-VS, and further standardize the OOD testing procedure (Sec. 2.2). In particular, we select varying shortcuts, including language-based, visual-based and multi-modality ones, aiming to cover different types of superficial correlations. For each selected shortcut, we propose a method based on mutual information to select the shortcut-specific concepts. We then utilize these concepts to group samples, and further introduce nine distribution shifts, based on a Shannon-entropy method, to construct nine OOD test sets according to the varying shortcuts. Besides, our benchmark also provides an IID validation set and an IID test set for evaluating IID performance.
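As a rough illustration of the concept-selection step, the sketch below scores candidate concepts (e.g., question words or detected visual objects) by their mutual information with the answers; the sample format, concept extraction and top-k cutoff are our own illustrative assumptions, not the exact procedure of Sec. 3.2.

```python
# A minimal sketch of mutual-information-based concept selection.
# `samples` (a list of dicts with "concepts" and "answer" keys) is a
# hypothetical format; the paper's actual scoring may differ.
from sklearn.metrics import mutual_info_score

def select_shortcut_concepts(samples, k=100):
    """Rank candidate concepts by their mutual information with answers."""
    vocab = {c for s in samples for c in s["concepts"]}
    answers = [s["answer"] for s in samples]
    scores = {}
    for concept in vocab:
        # Binary indicator: does the sample contain this concept?
        presence = [int(concept in s["concepts"]) for s in samples]
        scores[concept] = mutual_info_score(presence, answers)
    # The k concepts most predictive of the answer act as the strongest shortcuts.
    return sorted(scores, key=scores.get, reverse=True)[:k]
```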
We benchmark a series of state-of-the-art models on VQA-VS and find that it may provide a more reliable evaluation of reasoning ability compared to existing benchmarks. Moreover, adequate experiments are conducted to present the first systematic study of multiple shortcuts, which may promote the development of shortcut-learning research in VQA.
2 Motivation
2.1 A Causal Perspective
Schölkopf et al. (2021) describe the notion of OOD generalization as empirical risk minimization under different interventions on one or several causal variables. Inspired by them, we formalize the OOD testing task for VQA and explain the motivation of the proposed benchmark from a causal perspective.

We have access to training data from a distribution $P((V, Q), A)$, and the VQA task is to minimize the empirical risk:

$$\hat{R}_{P((V,Q),A)}(f) = \mathbb{E}_{P((V,Q),A)}\big[\operatorname{loss}(f(V,Q), A)\big] \quad (1)$$
where $f$ is the predictor, i.e., the model predicting answers $A$ from the given image-question pairs $(V, Q)$, and $\operatorname{loss}$ is the loss function for model training. $\mathbb{E}_{P((V,Q),A)}$ denotes the empirical mean obtained from the samples drawn from the training distribution $P((V, Q), A)$. We aim at finding the optimal predictor $f^{*}$ in a hypothesis space $\mathcal{H}$:

$$f^{*} = \arg\min_{f \in \mathcal{H}} \hat{R}_{P((V,Q),A)}(f) \quad (2)$$
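To make Eqs. (1)-(2) concrete, here is a minimal toy sketch in which the empirical risk is the mean loss over training triples and ERM selects the minimizer over a finite hypothesis space; all names are illustrative placeholders, not the paper's training code.

```python
# A toy instantiation of Eqs. (1)-(2). `hypothesis_space`, `loss`, and the
# (image, question, answer) sample format are illustrative assumptions.
def empirical_risk(f, samples, loss):
    """Eq. (1): mean loss of predictor f over (V, Q, A) training triples."""
    return sum(loss(f(v, q), a) for v, q, a in samples) / len(samples)

def erm(hypothesis_space, train_samples, loss):
    """Eq. (2): select the predictor with minimal empirical training risk."""
    return min(hypothesis_space, key=lambda f: empirical_risk(f, train_samples, loss))
```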
The existing OOD benchmarks² for VQA, e.g., VQA-CP v2, evaluate the robustness of models by the expected risk under a single different distribution $P_0((V, Q), A)$:

$$R^{\mathrm{OOD}}_{P_0((V,Q),A)}(f) = \mathbb{E}_{P_0((V,Q),A)}\big[\operatorname{loss}(f(V,Q), A)\big] \quad (3)$$

How different the test distribution $P_0$ is from the training distribution $P$ determines the gap between $\hat{R}_{P((V,Q),A)}(f)$ and $R^{\mathrm{OOD}}_{P_0((V,Q),A)}(f)$. Under the IID setting, the test distribution is the same as the training distribution, i.e., $P = P_0$.

² Related works are discussed in App. A.2.
The novel test distribution $P_0$ can be restricted to the result of a collection of distribution shifts, introduced by interventions on one or several causal variables in the causal graph $g$ of VQA. We denote by $\mathcal{P}_g$ all the possible interventional distributions over the whole causal graph $g$, including both unknown and observed causal variables. To stay robust against distribution shifts on all possible causal variables, we focus on the overall OOD risk (instead of the worst-case OOD risk):

$$R^{\mathrm{OOD}}_{\mathcal{P}_g}(f) = \sum_{P_0 \in \mathcal{P}_g} \mathbb{E}_{P_0((V,Q),A)}\big[\operatorname{loss}(f(V,Q), A)\big] \quad (4)$$
In practice, to achieve a better estimation of the true OOD risk, we should specify an available (observed) subset of interventional distributions $\varepsilon \subseteq \mathcal{P}_g$, where $\varepsilon$ should coincide with $\mathcal{P}_g$ (Arjovsky et al., 2019; David et al., 2010), and obtain a robust predictor by solving:

$$f^{*} = \arg\min_{f \in \mathcal{H}} \sum_{P_0 \in \varepsilon} \mathbb{E}_{P_0((V,Q),A)}\big[\operatorname{loss}(f(V,Q), A)\big] \quad (5)$$
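A small sketch of how Eqs. (4)-(5) could be estimated in practice, with each observed OOD distribution represented by a held-out sample set; the function names and data format are our own assumptions, not the paper's implementation.

```python
# Sketch of Eqs. (4)-(5): sum the empirically estimated expected risk over
# an observed collection of OOD test sets (one per interventional
# distribution in the observed subset epsilon). Names are illustrative.
def mean_loss(f, samples, loss):
    return sum(loss(f(v, q), a) for v, q, a in samples) / len(samples)

def overall_ood_risk(f, ood_test_sets, loss):
    """Eq. (4), restricted to the observed subset of distributions."""
    return sum(mean_loss(f, test_set, loss) for test_set in ood_test_sets)

def robust_predictor(hypothesis_space, ood_test_sets, loss):
    """Eq. (5): choose the predictor minimizing the summed OOD risk."""
    return min(hypothesis_space, key=lambda f: overall_ood_risk(f, ood_test_sets, loss))
```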
2.2 Overcoming Current OOD Testing Issues
VQA-VS aims to further correct three troubling issues (Teney et al., 2020b) in the use of VQA-CP v2³ and in the design of debiasing methods (see App. A.3 for details), and to standardize the OOD testing paradigm.
Issue 1: In VQA-CP v2, the answer distributions under the same question type are almost inverse between training and testing. This known construction of the OOD splits in VQA-CP v2 is easily exploited by existing debiasing methods; for example, they mostly answer "yes" when the frequent training answer is "no". To achieve this, the debiasing SoTAs (Liang et al., 2021; Cadene et al., 2019) prevent models from learning from the frequent training samples, and some (Clark et al., 2019; Si et al., 2021) even purposely use the question-type annotation. These dataset-specific solutions are unlikely to generalize to other datasets that do not have this character. Unlike the handcrafted inverse training/test distributions in VQA-CP v2, we follow Kervadec et al. and select rare VQA samples as OOD samples. This construction procedure does not artificially change the training distribution, so that the training set retains its natural tendencies and is hard for debiasing methods to exploit.
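To illustrate the "rare samples as OOD" idea, the sketch below marks samples whose answer is infrequent within its concept group as OOD. The grouping key and quantile are illustrative placeholders; the paper's actual procedure uses a Shannon-entropy-based method (Sec. 3.2).

```python
# A hedged sketch, not the paper's algorithm: within each group of samples
# sharing a concept (e.g., a question type), treat samples whose answer is
# rare in that group as OOD test samples.
from collections import Counter, defaultdict

def split_rare_ood(samples, group_key, rare_quantile=0.1):
    groups = defaultdict(list)
    for s in samples:
        groups[group_key(s)].append(s)
    ood = []
    for group in groups.values():
        freq = Counter(s["answer"] for s in group)
        # Frequency threshold at roughly the rare_quantile of per-group answers.
        cutoff = sorted(freq.values())[max(0, int(len(freq) * rare_quantile) - 1)]
        ood.extend(s for s in group if freq[s["answer"]] <= cutoff)
    return ood
```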
Issue 2: Nearly all methods directly use the test set for model selection due to the lack of validation sets, which does not accord with the best practice of machine learning. Kervadec et al. noticed this issue and presented a dataset, GQA-OOD, with an OOD validation set. However, in real-world applications, information about the OOD distribution should be unavailable until we evaluate the model. Therefore, this paper argues that using an IID validation set for model selection is a requirement of the standard OOD testing procedure (discussed in Sec. 5.3).

³ We build up a collection of the debiasing methods designed for VQA-CP v2 and their main issues (see Tab. 5).
Issue 3: Existing works usually retrain their models on the VQA v2 dataset to evaluate IID performance (refer to App. A.4 for reasons). This leads to two problems: (i) training a model for each distribution separately is not in line with realistic scenarios; (ii) ideally, a robust VQA system that learns the intended solution will exhibit only a minor difference between its performance on the IID and OOD test sets, so the difference between IID and OOD accuracy is a suitable metric (Chen et al., 2020; Gokhale et al., 2020; Si et al., 2021). However, directly comparing OOD performance evaluated on VQA-CP v2 with IID performance evaluated on VQA v2 is not fair because the training sets differ, which undermines the reliability of this metric. To alleviate this issue, VQA-VS includes both an IID test set and OOD test sets, which makes it possible to directly compare the same model's IID and OOD performance based on the identical training set.
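The comparison this enables can be stated in a few lines: one model, one training run, and a per-test-set IID-vs-OOD accuracy gap. The `evaluate` function below is a hypothetical accuracy routine, not part of the released code.

```python
# Sketch of the IID-OOD gap metric enabled by a shared training set.
# `evaluate(model, test_set)` is assumed to return accuracy.
def iid_ood_gaps(model, iid_test_set, ood_test_sets, evaluate):
    iid_acc = evaluate(model, iid_test_set)
    # A smaller gap suggests less reliance on the corresponding shortcut.
    return {name: iid_acc - evaluate(model, tests)
            for name, tests in ood_test_sets.items()}
```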
3 Dataset Construction
3.1 Merging and Splitting Data
Figure 2: The splitting of our dataset.

Fig. 2 shows how we split our dataset. First of all, we merge the samples from the train and val sets of the VQA v2 dataset (the largest square in Fig. 2). Formally, the whole VQA dataset can be denoted as $D = \{V, Q, A\}$, where $V$, $Q$ and $A$ are the images, questions and answers. Then, 70% and 5% of the samples are randomly sampled from the merged data $D$ to constitute the training set $D_{tr}$ and validation set $D_{val}$ of our benchmark ($D_{tr} \cap D_{val} = \emptyset$).
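The merge-and-split step is simple enough to sketch directly; the loading of `vqa2_train` and `vqa2_val` is assumed, and the seed is an arbitrary illustrative choice.

```python
# A minimal sketch of the split described above: pool the VQA v2 train and
# val sets, then draw disjoint 70% / 5% random subsets as D_tr and D_val.
import random

def build_iid_splits(vqa2_train, vqa2_val, seed=0):
    merged = list(vqa2_train) + list(vqa2_val)    # the whole dataset D
    random.Random(seed).shuffle(merged)
    n = len(merged)
    d_tr = merged[:int(0.70 * n)]                  # 70% -> training set D_tr
    d_val = merged[int(0.70 * n):int(0.75 * n)]    # next 5% -> validation set D_val
    return d_tr, d_val                             # disjoint by construction
```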