
Language Prior Is Not the Only Shortcut:
A Benchmark for Shortcut Learning in VQA
Qingyi Si1,2, Fandong Meng3, Mingyu Zheng1,2, Zheng Lin1,2∗
Yuanxin Liu1,4, Peng Fu1, Yanan Cao1,2, Weiping Wang1, Jie Zhou3
1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3Pattern Recognition Center, WeChat AI, Tencent Inc, China 4Peking University
{siqingyi,zhengmingyu,linzheng,fupeng,caoyanan,wangweiping}@iie.ac.cn,
liuyuanxin@stu.pku.edu.cn,{fandongmeng,withtomzhou}@tencent.com
Abstract
Visual Question Answering (VQA) models are
prone to learning shortcut solutions formed by
dataset biases rather than the intended solution.
To evaluate the VQA models’ reasoning abil-
ity beyond shortcut learning, the VQA-CP v2
dataset introduces an answer distribution shift
between the training and test set given a ques-
tion type. In this way, the model cannot use the
training set shortcut to perform well on the test
set. However, VQA-CP v2 only considers one
type of shortcut (from question type to answer)
and thus still cannot guarantee that the model
relies on the intended solution rather than a
solution specific to this shortcut. To overcome
this limitation, we propose a new dataset that
considers varying types of shortcuts by
constructing different distribution shifts in
multiple out-of-distribution (OOD) test sets. In
addition, we overcome three troubling practices in
the use of VQA-CP v2, e.g., selecting models using
OOD test sets, and further standardize the OOD
evaluation procedure. Our benchmark provides a more
rigorous and comprehensive testbed for short-
cut learning in VQA. We benchmark recent
methods and find that methods specifically de-
signed for particular shortcuts fail to simulta-
neously generalize to our varying OOD test
sets. We also systematically study the vary-
ing shortcuts and provide several valuable find-
ings, which may promote the exploration of
shortcut learning in VQA.1
∗Corresponding author: Zheng Lin.
1Joint work with Pattern Recognition Center, WeChat AI,
Tencent Inc, China. The code and data are available at
https://github.com/PhoebusSi/VQA-VS.
Figure 1: (a) The accuracy improvement of LMH over its
backbone model UpDn on nine OOD test sets. (The
acronyms, like QT, are defined in Sec. 3.2.) (b) Solutions
possibly learned by models.
1 Introduction
Visual Question Answering (VQA) (Antol et al.,
2015) is a multi-modal task that involves
comprehension of and reasoning over vision and language.
Despite their remarkable performance on many VQA
datasets such as VQA v2 (Goyal et al., 2017), VQA
models have been criticized for their tendency to
depend on the biases in the training set. That is, they
tend to directly output the answer according to the
question type (shortcut solutions) without actual
reasoning (Kervadec et al., 2021) (intended solution)
(Agrawal et al., 2018, 2016; Manjunatha et al.,
2019). This widely studied language priors problem
is a typical symptom of Shortcut Learning
(Geirhos et al., 2020) (App. A.1). Despite this
defect, the models can still reach artificially high
performance on VQA v2, whose test distribution
is the same as the training distribution, i.e.,
under the in-distribution (IID) setting.
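As a rough illustration of the question-type shortcut, the following Python sketch fits a prior-only "model" that ignores the image and returns the most frequent training answer for each question type. The toy question types, answers, and counts, as well as the helper names fit_question_type_prior and accuracy, are hypothetical and are not taken from any VQA dataset; exact-match accuracy is used here as a simplification of the VQA metric. Under these assumptions, the sketch shows such a model scoring well on a test split that shares the training answer distribution and poorly on a split whose per-question-type answer distribution is shifted.

```python
from collections import Counter, defaultdict


def fit_question_type_prior(train_pairs):
    """Map each question type to its most frequent training answer."""
    counts = defaultdict(Counter)
    for question_type, answer in train_pairs:
        counts[question_type][answer] += 1
    return {qt: c.most_common(1)[0][0] for qt, c in counts.items()}


def accuracy(prior, test_pairs):
    """Exact-match accuracy of the prior-only predictor (a simplification)."""
    hits = sum(prior.get(qt) == ans for qt, ans in test_pairs)
    return hits / len(test_pairs)


# Hypothetical toy data: (question type, ground-truth answer) pairs.
train = (
    [("what color", "white")] * 8 + [("what color", "red")] * 2
    + [("how many", "2")] * 7 + [("how many", "3")] * 3
)
# IID test: same per-question-type answer distribution as training.
iid_test = (
    [("what color", "white")] * 4 + [("what color", "red")] * 1
    + [("how many", "2")] * 4 + [("how many", "3")] * 1
)
# Shifted test: the majority answer of each question type is swapped.
ood_test = (
    [("what color", "white")] * 1 + [("what color", "red")] * 4
    + [("how many", "2")] * 1 + [("how many", "3")] * 4
)

prior = fit_question_type_prior(train)
iid_acc = accuracy(prior, iid_test)
ood_acc = accuracy(prior, ood_test)
print(f"IID accuracy: {iid_acc:.2f}, shifted accuracy: {ood_acc:.2f}")
```

In this toy setting the predictor never looks at an image, yet it answers 8 of 10 IID questions correctly and only 2 of 10 under the shift, which is the kind of gap an OOD test set is meant to expose.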
To address this problem, VQA-CP v2 (Agrawal
et al., 2018) was constructed by re-organizing the
VQA v2 dataset such that the answer distributions of
the same question type differ between the training
set and the test set, i.e., under the
out-of-distribution (OOD) setting. VQA-CP v2 makes the
shortcut (from the question type to the answer) learned
on the training set invalid on the test set, and it has
become a widely used OOD benchmark in the VQA
community. With models’ performance on VQA-CP v2
continually improving, it seems that existing methods
(Clark et al., 2019; Si et al., 2021; Gokhale et al.,
2020; Liang et al., 2021) have been able to overcome
the shortcut learning problem. However, through
analyzing VQA-CP v2 and existing methods, we point out
that two aspects need to be improved:
First, VQA-CP v2 introduces only one specific
type of controlled distribution shift, and thus its
arXiv:2210.04692v1 [cs.CV] 10 Oct 2022