
reasoning tasks. However, we still lack a comprehensive understanding of the generalization differences between these two paradigms under various setups. Given recent work suggesting that OOD accuracy often strongly correlates with in-distribution accuracy (Miller et al., 2020, 2021), we might expect VLE2E and NS systems to often have similar generalization abilities. But do they?
In this work, we conduct the first comprehensive
comparison of generalization behavior between
VLE2E and NS systems for VL reasoning tasks.
Our study spans single-image and multi-image settings with natural images and includes four distinct types of generalization tests, three of which are shown in Figure 1. We introduce a novel segment-combine test for multi-image settings that requires models to make consistent predictions when some input images are replaced with irrelevant ones. We evaluate on contrast sets (Gardner et al., 2020), including new contrast sets we construct for COVR that test understanding of quantifiers. We also measure compositional generalization, as defined by the compositional splits of COVR (Bogin et al., 2021), and cross-benchmark transfer between VQA and GQA. Finally, we develop improved NS systems for GQA by handling mismatches between program and scene graph object descriptors, and for COVR by refining the original logical language.
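To make the segment-combine criterion concrete, the consistency check it relies on can be sketched as follows. This is a minimal illustrative sketch rather than the released test-generation code; model_predict and the image lists are hypothetical placeholders, and the perturbed image set is assumed to be constructed so that the gold answer is unchanged.

```python
from typing import Callable, List, Sequence

def segment_combine_consistent(
    model_predict: Callable[[str, Sequence[str]], str],
    question: str,
    original_images: List[str],
    perturbed_images: List[str],
) -> bool:
    """Return True if the model answers a multi-image question identically
    on the original image set and on a perturbed set in which some images
    have been replaced with irrelevant ones (gold answer unchanged).

    All arguments are hypothetical placeholders for whatever system
    (VLE2E or NS) and image representation are being evaluated.
    """
    answer_original = model_predict(question, original_images)
    answer_perturbed = model_predict(question, perturbed_images)
    # A model that reasons precisely over the referenced images should be
    # unaffected by the irrelevant replacements.
    return answer_original == answer_perturbed
```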
Overall, we find that VLE2E and NS systems
exhibit distinct and complementary generalization
patterns. The NS systems are more robust than the VLE2E systems in the first three testing situations. The VLE2E systems exhibit overstability to meaning-altering perturbations, suggesting they overfit to spurious correlations in the training data and do not learn precise reasoning skills. We
further find that the semantic parsing module of
NS systems can quickly improve on generalization
tests given a few training examples, whereas VL
models do not adapt as quickly. On the other hand, while VLE2E systems lose more than 10% accuracy when transferring between VQA and GQA, the NS methods perform even worse. Taken together, our findings underscore the need for a diverse suite of generalization tests to fully compare different modeling paradigms. The distinct behaviors of these two types of systems can guide the community in designing more robust VL reasoning systems. We release our code for generating test data, and we encourage future VL models to be evaluated on these tests.1
2 Related Work
We first survey related work on vision-language
reasoning models and OOD evaluation tests.
VL OOD Generalization.
Many efforts have been made to evaluate the generalization ability of VLE2E systems and task-specific methods on compositionality (Johnson et al., 2017; Thrush et al., 2022a), language perturbations (Ribeiro et al., 2019), and visual perturbations (Jimenez et al., 2022). Li et al. (2020) showed that VLE2E systems exhibit better robustness than task-specific methods. We are the first to comprehensively compare the generalization behavior of VLE2E and NS systems across a diverse set of OOD tests.
VL Pretrained Models.
Large-scale, VL pre-
trained models for question-answering can be
single-stream—encoding vision and language fea-
tures together with a single transformer—such as
VisualBERT (Li et al.,2019) and VinVL (Zhang
et al.,2021), or dual-stream—encoding vision and
language with separate transformers and apply-
ing cross-modal transformers later—such as ViL-
BERT (Lu et al.,2019) and LXMERT (Tan and
Bansal,2019). We evaluate on both single- and
dual-stream VL pretrained models.
Neuro-Symbolic Methods.
NS-VQA (Wu et al., 2017) disentangled vision and language processing for VL reasoning tasks on simulated images. However, it requires datasets to include logical-form annotations that describe the language. To reduce the supervision needed from program annotations, NS-CL (Mao et al., 2019) jointly learned concept embeddings and latent programs, and extended the approach to natural images. NSM (Hudson and Manning, 2019b) learned graph-level reasoning and showcased the compositional reasoning abilities of NS methods. To remain applicable to both single- and multi-image setups, we adopt the same pipeline as the original NS-VQA: we use the scene graph as the structural representation and test multiple language models for semantic parsing.
Single- and Multi-Image VL Reasoning Tasks.
For VL reasoning, there are many datasets that focus on single images, such as CLEVR (Johnson et al., 2017), VQA, and GQA, as well as many
1 We release our code and test data at https://github.com/Bill1235813/gendiff_vlsys