the models to other directions, rather than endow-
ing them with the reasoning capability and robust-
ness to language priors. Ideally, a robust VQA
system should maintain its performance on the ID
dataset while overcoming the language priors, as
shown in Fig. 1.
We argue that the essence of both the language-prior and trade-off problems lies in how biased samples are learned. The former is caused by over-reliance on the biased information in biased samples, while the latter is caused by undermining the importance of biased samples. Therefore, if a model can precisely exploit biased samples for the intrinsic information of the given task, both problems can be alleviated simultaneously.
Motivated by this, we propose a self-supervised contrastive learning method (MMBS) for building robust VQA systems by Making the Most of Biased Samples. Firstly, in view of the characteristics of
the spurious correlations, we construct two kinds
of positive samples for the questions of training
samples to exploit the unbiased information, and
then design four strategies to use the constructed
positive samples. Next, we propose a novel algo-
rithm to distinguish between biased and unbiased
samples, so as to treat them differently. On this
basis, we introduce an auxiliary contrastive train-
ing objective, which helps the model learn a more
general representation with ameliorated language
priors by narrowing the distance between original
samples and positive samples in the cross-modality
joint embedding space.
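To make the contrastive objective concrete, the sketch below shows a generic InfoNCE-style loss that pulls each original sample's joint embedding toward the embedding of its constructed positive sample while treating the other samples in the batch as negatives. This is an illustrative PyTorch sketch under our own naming assumptions (in-batch negatives, a temperature hyperparameter), not the exact MMBS objective.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, temperature=0.1):
    """InfoNCE-style sketch: pull each anchor joint embedding toward its
    positive counterpart and push it away from other samples in the batch.
    `anchor` and `positive` are (batch, dim) cross-modality embeddings."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)                 # diagonal entries are the positives
```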
To summarize, our contributions are as follows: i) We propose a novel contrastive learning method that effectively addresses the language-prior problem and the ID-OOD performance trade-off in VQA by making the most of biased samples. ii) We propose an algorithm to distinguish between biased and unbiased samples and treat them differently in contrastive learning. iii) Experimental results demonstrate that our method is compatible with various VQA backbones and achieves competitive performance on the language-bias-sensitive VQA-CP v2 dataset while preserving the original accuracy on the in-distribution VQA v2 dataset.
2 Related Work
Overcoming Language Priors in VQA.
Recently, the language biases in VQA datasets have drawn the attention of many researchers (Goyal et al., 2017; Antol et al., 2015; Agrawal et al., 2016; Kervadec et al., 2021). In response to this problem, numerous methods have been proposed to debias VQA models. The most effective of these can be roughly divided into two categories:
Ensemble-based methods (Grand and Belinkov, 2019; Belinkov et al., 2019; Cadene et al., 2019; Clark et al., 2019; Mahabadi and Henderson, 2019; Niu et al., 2021) introduce a biased model, which is designed to focus on spurious features, to assist the training of the main model. For example, the recent method LPF (Liang et al., 2021) leverages the output distribution of the bias model to down-weight biased samples when computing the VQA loss. However, these methods neglect the useful information in biased samples that helps reasoning.
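As an illustration of this re-weighting idea, the sketch below scales each sample's VQA loss by a factor that shrinks as a bias-only model grows confident in the ground-truth answer. The function and variable names are our own, and the exact LPF formulation differs in its details; this is only a hedged sketch of the general mechanism.

```python
import torch
import torch.nn.functional as F

def reweighted_vqa_loss(main_logits, bias_probs, answer_targets, gamma=2.0):
    """Down-weight biased samples: the more confident the bias-only model is in
    the ground-truth answer, the less that sample contributes to training.
    `bias_probs` are the bias model's softmax probabilities over answers;
    `answer_targets` are ground-truth answer indices."""
    per_sample_loss = F.cross_entropy(main_logits, answer_targets, reduction="none")
    p_bias = bias_probs.gather(1, answer_targets.unsqueeze(1)).squeeze(1)
    weights = (1.0 - p_bias) ** gamma        # high bias confidence -> small weight
    return (weights.detach() * per_sample_loss).mean()
```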
Data-balancing methods (Zhu et al., 2020; Liang et al., 2020) balance the training priors. For example, CSS and Mutant (Chen et al., 2020; Gokhale et al., 2020) generate samples by masking the critical objects in images and words in questions, and by semantic image mutations, respectively. These methods usually outperform other debiasing methods by a large margin on VQA-CP v2, because they bypass the challenge of the imbalanced setting (Liang et al., 2021; Niu et al., 2021) by explicitly balancing the answer distribution at the training stage. Although our method constructs positive questions, it does not change the training answer distribution. We also extend our method to the data-balancing method SAR (Si et al., 2021).
Contrastive Learning in VQA.
Recently, contrastive learning has been well developed in unsupervised learning (Oord et al., 2018; He et al., 2020), while its application in VQA is still at an initial stage. CL (Liang et al., 2020) is the first work to employ contrastive learning to improve VQA models' robustness. Its motivation is to learn a better relationship among the input sample and the factual and counterfactual samples generated by CSS. However, CL brings only a weak OOD performance gain and an ID performance drop on top of CSS. In contrast, our method attributes the key to solving language bias to the positive-sample designs that exclude spurious correlations. It is model-agnostic and can boost models' OOD performance significantly while retaining their ID performance.
3 Method
Fig. 2 shows an overview of MMBS, which includes:
1) A backbone VQA model; 2) A positive sample
construction module; 3) An unbiased sample selec-