Language Prior Is Not the Only Shortcut:
A Benchmark for Shortcut Learning in VQA
Qingyi Si1,2, Fandong Meng3, Mingyu Zheng1,2, Zheng Lin1,2
Yuanxin Liu1,4, Peng Fu1, Yanan Cao1,2, Weiping Wang1, Jie Zhou3
1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3Pattern Recognition Center, WeChat AI, Tencent Inc, China 4Peking University
{siqingyi,zhengmingyu,linzheng,fupeng,caoyanan,wangweiping}@iie.ac.cn,
liuyuanxin@stu.pku.edu.cn,{fandongmeng,withtomzhou}@tencent.com
Abstract
Visual Question Answering (VQA) models are prone to learning the shortcut solution formed by dataset biases rather than the intended solution. To evaluate a VQA model's reasoning ability beyond shortcut learning, the VQA-CP v2 dataset introduces a shift in the answer distribution of each question type between the training and test sets, so that a model cannot use training-set shortcuts to perform well on the test set. However, VQA-CP v2 only considers one type of shortcut (from question type to answer) and thus still cannot guarantee that a model relies on the intended solution rather than on a solution specific to this shortcut. To overcome this limitation, we propose a new dataset that considers varying types of shortcuts by constructing different distribution shifts in multiple OOD test sets. In addition, we overcome three troubling practices in the use of VQA-CP v2, e.g., selecting models using OOD test sets, and further standardize the OOD evaluation procedure. Our benchmark provides a more rigorous and comprehensive testbed for shortcut learning in VQA. We benchmark recent methods and find that methods specifically designed for particular shortcuts fail to simultaneously generalize to our varying OOD test sets. We also systematically study the varying shortcuts and provide several valuable findings, which may promote the exploration of shortcut learning in VQA.¹
1 Introduction
Visual Question Answering (VQA) (Antol et al., 2015) is a multi-modal task involving comprehension of and reasoning over vision and language. Despite the remarkable performance on many VQA datasets such as VQA v2 (Goyal et al., 2017), VQA models have been criticized for their tendency to depend on biases in the training set. That is, they tend to directly output the answer according to the question type (shortcut solution) without actual reasoning (intended solution) (Agrawal et al., 2018, 2016; Kervadec et al., 2021; Manjunatha et al., 2019). This widely studied language-prior problem is a typical symptom of Shortcut Learning (Geirhos et al., 2020) (App. A.1). In spite of such defects, models can still reach artificially excellent performance on VQA v2, whose test distribution is the same as the training distribution, i.e., under the in-distribution (IID) setting.

Corresponding author: Zheng Lin.

¹ Joint work with Pattern Recognition Center, WeChat AI, Tencent Inc, China. The code and data are available at https://github.com/PhoebusSi/VQA-VS.

Figure 1: (a) The accuracy improvement of LMH over its backbone model UpDn on nine OOD test sets. (The acronyms, like QT, are defined in Sec. 3.2.) (b) Solutions possibly learned by models.
To address this problem, VQA-CP v2 (Agrawal et al., 2018) was constructed by re-organizing the VQA v2 dataset so that the answer distributions of the same question type differ between the training set and the test set, i.e., under the OOD setting. VQA-CP v2 makes the shortcut (from question type to answer) in the training set invalid in the test set, and it has become a widely used OOD benchmark in the VQA community. With models' performance on VQA-CP v2 continually improving, it seems that existing methods (Clark et al., 2019; Si et al., 2021; Gokhale et al., 2020; Liang et al., 2021) have been able to overcome the shortcut learning problem. However, through analyzing VQA-CP v2 and existing methods, we point out two aspects that need to be improved:
First, VQA-CP v2 introduces only one specific type of controlled distribution shift, and thus its OOD setting can only evaluate a model's reasoning ability beyond one specific shortcut, rather than the intended solution. As shown in Fig. 1(a), despite performing well on VQA-CP v2, the debiasing method LMH (Clark et al., 2019) can only boost its backbone model UpDn on a few OOD test sets and fails to generalize to the others. This shows that VQA-CP v2 cannot identify whether models rely on other types of shortcuts (e.g., correlations between visual objects and answers). Therefore, as shown in Fig. 1(b), more OOD test sets are needed to measure a model's reliance on different types of shortcuts: the more OOD test sets on which performance improves simultaneously, the more confidently the model can be deemed to have learned the intended solution. Moreover, some studies (Kervadec et al., 2021; Dancette et al., 2021) demonstrate that abundant shortcuts in data are derived from real-world stereotypes, e.g., the sky is blue. Thus, to establish a reliable diagnostic for VQA models, constructing a new benchmark with multiple OOD test sets corresponding to varying types of shortcuts is an urgent need.
Second, three troubling practices (Teney et al., 2020b) exist in the experimental setup and the design of existing methods: 1) the training and test distributions of VQA-CP v2 are almost inverse to each other, and this known construction of the OOD setting can be easily exploited to achieve good performance; 2) the OOD test set is used for model selection; 3) IID performance is evaluated after retraining. These practices do not conform to real-world OOD scenarios and thus make the evaluated OOD performance unreliable.
To alleviate the single-shortcut limitation and overcome the three above-mentioned issues, we construct and publicly release a new VQA benchmark considering Varying Shortcuts (Sec. 3.2), named VQA-VS, and further standardize the OOD testing procedure (Sec. 2.2). In particular, we select varying shortcuts, including language-based, visual-based and multi-modality ones, aiming to cover different types of superficial correlations. For each selected shortcut, we propose a method based on mutual information to select the shortcut-specific concepts. We then utilize these concepts to group samples, and further introduce nine distribution shifts, based on a Shannon-entropy method, to construct nine OOD test sets according to the varying shortcuts. Besides, our benchmark also provides an IID validation set and an IID test set for evaluating IID performance.
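As a rough illustration of the concept-selection step, the sketch below scores candidate concepts (e.g., question words or detected visual objects) by their mutual information with the answers; the sample format, concept extraction and top-k cutoff are our own illustrative assumptions, not the exact procedure of Sec. 3.2.

```python
# A minimal sketch of mutual-information-based concept selection.
# `samples` (a list of dicts with "concepts" and "answer" keys) is a
# hypothetical format; the paper's actual scoring may differ.
from sklearn.metrics import mutual_info_score

def select_shortcut_concepts(samples, k=100):
    """Rank candidate concepts by their mutual information with answers."""
    vocab = {c for s in samples for c in s["concepts"]}
    answers = [s["answer"] for s in samples]
    scores = {}
    for concept in vocab:
        # Binary indicator: does the sample contain this concept?
        presence = [int(concept in s["concepts"]) for s in samples]
        scores[concept] = mutual_info_score(presence, answers)
    # The k concepts most predictive of the answer act as the strongest shortcuts.
    return sorted(scores, key=scores.get, reverse=True)[:k]
```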
We benchmark a series of state-of-the-art models on VQA-VS and find that it may provide a more reliable evaluation of reasoning ability compared to existing benchmarks. Moreover, adequate experiments are conducted to present the first systematic study of multiple shortcuts, which may promote the development of shortcut-learning research in VQA.
2 Motivation
2.1 A Causal Perspective
Schölkopf et al. (2021) describe the notion of OOD generalization as empirical risk minimization under different interventions on one or several causal variables. Inspired by them, we formalize the OOD testing task for VQA and explain the motivation of the proposed benchmark from a causal perspective.

We have access to training data from a distribution $P((V, Q), A)$, and the VQA task is to minimize the empirical risk:

$$\hat{R}_{P((V,Q),A)}(f) = \mathbb{E}_{P((V,Q),A)}\big[\operatorname{loss}(f(V,Q), A)\big] \quad (1)$$
where $f$ is the predictor, i.e., the model predicting answers $A$ from the given image-question pairs $(V, Q)$, and $\operatorname{loss}$ is the loss function for model training. $\mathbb{E}_{P((V,Q),A)}$ denotes the empirical mean obtained from the samples drawn from the training distribution $P((V, Q), A)$. We aim at finding the optimal predictor $f^{*}$ in a hypothesis space $\mathcal{H}$:

$$f^{*} = \arg\min_{f \in \mathcal{H}} \hat{R}_{P((V,Q),A)}(f) \quad (2)$$
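To make Eqs. (1)-(2) concrete, here is a minimal toy sketch in which the empirical risk is the mean loss over training triples and ERM selects the minimizer over a finite hypothesis space; all names are illustrative placeholders, not the paper's training code.

```python
# A toy instantiation of Eqs. (1)-(2). `hypothesis_space`, `loss`, and the
# (image, question, answer) sample format are illustrative assumptions.
def empirical_risk(f, samples, loss):
    """Eq. (1): mean loss of predictor f over (V, Q, A) training triples."""
    return sum(loss(f(v, q), a) for v, q, a in samples) / len(samples)

def erm(hypothesis_space, train_samples, loss):
    """Eq. (2): select the predictor with minimal empirical training risk."""
    return min(hypothesis_space, key=lambda f: empirical_risk(f, train_samples, loss))
```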
The existing OOD benchmarks² for VQA, e.g., VQA-CP v2, evaluate the robustness of models by the expected risk under a single different distribution $P_0((V, Q), A)$:

$$R^{\mathrm{OOD}}_{P_0((V,Q),A)}(f) = \mathbb{E}_{P_0((V,Q),A)}\big[\operatorname{loss}(f(V,Q), A)\big] \quad (3)$$

How different the test distribution $P_0$ is from the training distribution $P$ determines the gap between $\hat{R}_{P((V,Q),A)}(f)$ and $R^{\mathrm{OOD}}_{P_0((V,Q),A)}(f)$. Under the IID setting, the test distribution is the same as the training distribution, i.e., $P = P_0$.

² Related works are discussed in App. A.2.
The novel test distribution $P_0$ can be restricted to the result of a collection of distribution shifts, introduced by interventions on one or several causal variables in the causal graph $g$ of VQA. We denote by $\mathcal{P}_g$ all the possible interventional distributions over the whole causal graph $g$, including both unknown and observed causal variables. To stay robust against distribution shifts on all possible causal variables, we focus on the overall OOD risk (instead of the worst-case OOD risk):

$$R^{\mathrm{OOD}}_{\mathcal{P}_g}(f) = \sum_{P_0 \in \mathcal{P}_g} \mathbb{E}_{P_0((V,Q),A)}\big[\operatorname{loss}(f(V,Q), A)\big] \quad (4)$$
In practice, to achieve a better estimation of the true OOD risk, we should specify an available (observed) subset of interventional distributions $\varepsilon \subseteq \mathcal{P}_g$, where $\varepsilon$ should coincide with $\mathcal{P}_g$ (Arjovsky et al., 2019; David et al., 2010), and obtain a robust predictor by solving:

$$f^{*} = \arg\min_{f \in \mathcal{H}} \sum_{P_0 \in \varepsilon} \mathbb{E}_{P_0((V,Q),A)}\big[\operatorname{loss}(f(V,Q), A)\big] \quad (5)$$
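A small sketch of how Eqs. (4)-(5) could be estimated in practice, with each observed OOD distribution represented by a held-out sample set; the function names and data format are our own assumptions, not the paper's implementation.

```python
# Sketch of Eqs. (4)-(5): sum the empirically estimated expected risk over
# an observed collection of OOD test sets (one per interventional
# distribution in the observed subset epsilon). Names are illustrative.
def mean_loss(f, samples, loss):
    return sum(loss(f(v, q), a) for v, q, a in samples) / len(samples)

def overall_ood_risk(f, ood_test_sets, loss):
    """Eq. (4), restricted to the observed subset of distributions."""
    return sum(mean_loss(f, test_set, loss) for test_set in ood_test_sets)

def robust_predictor(hypothesis_space, ood_test_sets, loss):
    """Eq. (5): choose the predictor minimizing the summed OOD risk."""
    return min(hypothesis_space, key=lambda f: overall_ood_risk(f, ood_test_sets, loss))
```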
2.2 Overcoming Current OOD Testing Issues
VQA-VS aims to further correct three troubling issues (Teney et al., 2020b) in the use of VQA-CP v2³ and in the design of debiasing methods (see App. A.3 for details), and to standardize the OOD testing paradigm.
Issue 1: In VQA-CP v2, the answer distributions under the same question type are almost inverse between training and testing. This known construction of the OOD splits in VQA-CP v2 is easily exploited by existing debiasing methods; for example, they mostly answer "yes" when the frequent training answer is "no". To achieve this, the debiasing SoTAs (Liang et al., 2021; Cadene et al., 2019) prevent models from learning from the frequent training samples, and some (Clark et al., 2019; Si et al., 2021) even purposely use the question-type annotation. These dataset-specific solutions are unlikely to generalize to other datasets that do not have this character. Unlike the handcrafted inverse training/test distributions in VQA-CP v2, we follow Kervadec et al. and select rare VQA samples as OOD samples. This construction procedure does not artificially change the training distribution, so that the training set retains its natural tendencies and is hard for debiasing methods to exploit.
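To illustrate the "rare samples as OOD" idea, the sketch below marks samples whose answer is infrequent within its concept group as OOD. The grouping key and quantile are illustrative placeholders; the paper's actual procedure uses a Shannon-entropy-based method (Sec. 3.2).

```python
# A hedged sketch, not the paper's algorithm: within each group of samples
# sharing a concept (e.g., a question type), treat samples whose answer is
# rare in that group as OOD test samples.
from collections import Counter, defaultdict

def split_rare_ood(samples, group_key, rare_quantile=0.1):
    groups = defaultdict(list)
    for s in samples:
        groups[group_key(s)].append(s)
    ood = []
    for group in groups.values():
        freq = Counter(s["answer"] for s in group)
        # Frequency threshold at roughly the rare_quantile of per-group answers.
        cutoff = sorted(freq.values())[max(0, int(len(freq) * rare_quantile) - 1)]
        ood.extend(s for s in group if freq[s["answer"]] <= cutoff)
    return ood
```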
Issue 2: Nearly all methods directly use the test set for model selection due to the lack of validation sets, which does not accord with the best practice of machine learning. Kervadec et al. noticed this issue and presented a dataset, GQA-OOD, with an OOD validation set. However, in real-world applications, information about the OOD distribution should be unavailable until we evaluate the model. Therefore, this paper argues that using an IID validation set for model selection is a requirement of the standard OOD testing procedure (discussed in Sec. 5.3).

³ We build up a collection of the debiasing methods designed for VQA-CP v2 and their main issues (see Tab. 5).
Issue 3: Existing works usually retrain their models on the VQA v2 dataset to evaluate IID performance (refer to App. A.4 for reasons). This leads to two problems: (i) training a model for each distribution separately is not in line with realistic scenarios; (ii) ideally, a robust VQA system that learns the intended solution will exhibit only a minor difference between its performance on the IID and OOD test sets, so the difference between IID and OOD accuracy is a suitable metric (Chen et al., 2020; Gokhale et al., 2020; Si et al., 2021). However, directly comparing OOD performance evaluated on VQA-CP v2 with IID performance evaluated on VQA v2 is not fair because the training sets differ, which undermines the reliability of this metric. To alleviate this issue, VQA-VS includes both an IID test set and OOD test sets, which makes it possible to directly compare the same model's IID and OOD performance based on the identical training set.
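The comparison this enables can be stated in a few lines: one model, one training run, and a per-test-set IID-vs-OOD accuracy gap. The `evaluate` function below is a hypothetical accuracy routine, not part of the released code.

```python
# Sketch of the IID-OOD gap metric enabled by a shared training set.
# `evaluate(model, test_set)` is assumed to return accuracy.
def iid_ood_gaps(model, iid_test_set, ood_test_sets, evaluate):
    iid_acc = evaluate(model, iid_test_set)
    # A smaller gap suggests less reliance on the corresponding shortcut.
    return {name: iid_acc - evaluate(model, tests)
            for name, tests in ood_test_sets.items()}
```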
3 Dataset Construction
3.1 Merging and Splitting Data
Figure 2: The splitting of our dataset.

Fig. 2 shows how we split our dataset. First of all, we merge the samples from the train and val sets of the VQA v2 dataset (the largest square in Fig. 2). Formally, the whole VQA dataset can be denoted as $D = \{V, Q, A\}$, where $V$, $Q$ and $A$ are the images, questions and answers. Then, 70% and 5% of the samples are randomly sampled from the merged data $D$ to constitute the training set $D_{tr}$ and validation set $D_{val}$ of our benchmark ($D_{tr} \cap D_{val} = \emptyset$).
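The merge-and-split step is simple enough to sketch directly; the loading of `vqa2_train` and `vqa2_val` is assumed, and the seed is an arbitrary illustrative choice.

```python
# A minimal sketch of the split described above: pool the VQA v2 train and
# val sets, then draw disjoint 70% / 5% random subsets as D_tr and D_val.
import random

def build_iid_splits(vqa2_train, vqa2_val, seed=0):
    merged = list(vqa2_train) + list(vqa2_val)    # the whole dataset D
    random.Random(seed).shuffle(merged)
    n = len(merged)
    d_tr = merged[:int(0.70 * n)]                  # 70% -> training set D_tr
    d_val = merged[int(0.70 * n):int(0.75 * n)]    # next 5% -> validation set D_val
    return d_tr, d_val                             # disjoint by construction
```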