Generalization Differences between End-to-End and Neuro-Symbolic
Vision-Language Reasoning Systems
Wang Zhu Jesse Thomason Robin Jia
University of Southern California, Los Angeles, CA, USA
{wangzhu, jessetho, robinjia}@usc.edu
Abstract

For vision-and-language (VL) reasoning tasks, both fully connectionist, end-to-end methods and hybrid, neuro-symbolic methods have achieved high in-distribution performance. In which out-of-distribution settings does each paradigm excel? We investigate this question on both single-image and multi-image visual question-answering through four types of generalization tests: a novel segment-combine test for multi-image queries, contrast sets, compositional generalization, and cross-benchmark transfer. Vision-and-language end-to-end (VLE2E) trained systems exhibit sizeable performance drops across all these tests. Neuro-symbolic (NS) methods suffer even more on cross-benchmark transfer from GQA to VQA, but they show smaller accuracy drops on the other generalization tests, and their performance quickly improves with few-shot training. Overall, our results demonstrate the complementary benefits of these two paradigms and emphasize the importance of using a diverse suite of generalization tests to fully characterize model robustness to distribution shift.
1 Introduction
Widely used multi-modal pretrained models (Chen et al., 2020; Lu et al., 2019; Li et al., 2019) have exhibited great performance when fine-tuned on downstream vision-and-language tasks like VQA (Antol et al., 2015) and GQA (Hudson and Manning, 2019a). However, these models often generalize poorly to out-of-distribution (OOD) data, suggesting shortcomings in the VLE2E pipeline. Neuro-symbolic methods (Wu et al., 2017; Yi et al., 2018) try to address this issue by disentangling the grounding and reasoning mechanisms in multi-modal systems. NS methods generate grounded visual representations, parse the language into executable programs for reasoning, and execute the programs on the visual representations.
[Figure 1: Train A: "There is at least 1 image with 2 bottles." Train B: "Is the dark bottle on the table or not?" Evaluated query: "There is at least 1 image with exactly 2 dark bottles on the table." (label: Yes). Contrast query: "There is less than 1 image with exactly 2 dark bottles on the table." (label: No). Segment-combine test: per-image predictions are aggregated with an OR.]

Figure 1: We build segment-combine tests, contrast sets, and compositional generalization splits for multi-image question answering in the COVR dataset. The question above requires counting both within and across images. Segment-Combine Test: the multi-image query allows considering each image in isolation, pairing it with random unrelated images, feeding each resulting set to the model, and combining the per-image answers with an OR operation. Contrast Set: language perturbation by replacing phrases in the query with synonyms or antonyms. Compositional Generalization: the evaluated query is a compositional variant of training questions Train A and Train B, involving reasoning over both counting and relations.
Previous work (Hudson and Manning, 2018; Mao et al., 2019) has shown the effectiveness of neuro-symbolic methods for OOD compositional generalization on single-image VL reasoning tasks. However, we still lack a comprehensive understanding of the generalization differences between these two paradigms under various setups. Given recent work suggesting that OOD accuracy often strongly correlates with in-distribution accuracy (Miller et al., 2020, 2021), we might expect VLE2E and NS systems to often have similar generalization abilities. But do they?
In this work, we conduct the first comprehensive comparison of generalization behavior between VLE2E and NS systems for VL reasoning tasks. Our study spans single-image and multi-image settings with natural images and includes four distinct types of generalization tests, three of which are shown in Figure 1. We introduce a novel segment-combine test for multi-image settings that requires models to make consistent predictions when some input images are replaced with irrelevant ones. We evaluate on contrast sets (Gardner et al., 2020), including new contrast sets we construct for COVR that test understanding of quantifiers. We also measure compositional generalization as defined by compositional splits from COVR (Bogin et al., 2021) and cross-benchmark transfer between VQA and GQA. We also develop improved NS systems for GQA by handling mismatches between program and scene graph object descriptors, and for COVR by refining the original logical language.
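To make the segment-combine test concrete, below is a minimal sketch of the aggregation step for an existentially quantified multi-image query ("there is at least 1 image with ..."). It assumes a trained VL system exposed through a hypothetical predict(query, images) function that returns "yes" or "no"; the distractor sampling and OR aggregation follow the procedure described in Figure 1, while all function and variable names here are ours.

```python
import random

def segment_combine_test(predict, query, images, distractor_pool, seed=0):
    """Score each original image in isolation and OR the per-image answers.

    `predict(query, images) -> "yes" | "no"` is assumed to be the trained
    VL system; `distractor_pool` holds unrelated images used as padding so
    that every call still sees a full-size image set.
    """
    rng = random.Random(seed)
    n = len(images)
    per_image_answers = []
    for img in images:
        # Keep one original image; pad the set with random unrelated images.
        distractors = rng.sample(distractor_pool, n - 1)
        answer = predict(query, [img] + distractors)
        per_image_answers.append(answer == "yes")
    # For an "at least 1 image ..." query, the aggregated answer is the OR of
    # the per-image answers; a consistent model should match its prediction
    # on the original, unmodified image set.
    aggregated = "yes" if any(per_image_answers) else "no"
    return aggregated, per_image_answers
```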
Overall, we find that VLE2E and NS systems exhibit distinct and complementary generalization patterns. The NS systems are more robust than the VLE2E systems in the first three testing situations. The VLE2E systems exhibit overstability to meaning-altering perturbations, suggesting that they overfit to spurious correlations in the training data and do not learn precise reasoning skills. We further find that the semantic parsing module of NS systems can quickly improve on generalization tests given a few training examples, whereas VL models do not adapt as quickly. On the other hand, while VLE2E systems lose more than 10% in accuracy on transfer between VQA and GQA, the NS methods perform even worse. Taken together, our findings underscore the need for a diverse suite of generalization tests to fully compare different modeling paradigms. The different behavior of these two systems could guide the community in designing more robust VL reasoning systems. We release our code for generating test data, and we encourage future VL models to be evaluated on these tests.1

1We release our code and test data at https://github.com/Bill1235813/gendiff_vlsys
2 Related Work
We first survey related work on vision-language
reasoning models and OOD evaluation tests.
VL OOD Generalization. Many efforts have been made to evaluate the generalization ability of VLE2E systems and task-specific methods on compositionality (Johnson et al., 2017; Thrush et al., 2022a), language perturbations (Ribeiro et al., 2019), and visual perturbations (Jimenez et al., 2022). Li et al. (2020) showed VLE2E systems exhibit better robustness than task-specific methods. We are the first to comprehensively compare the generalization differences between VLE2E and NS systems across different OOD tests.
VL Pretrained Models. Large-scale VL pretrained models for question-answering can be single-stream, encoding vision and language features together with a single transformer, such as VisualBERT (Li et al., 2019) and VinVL (Zhang et al., 2021), or dual-stream, encoding vision and language with separate transformers and applying cross-modal transformers later, such as ViLBERT (Lu et al., 2019) and LXMERT (Tan and Bansal, 2019). We evaluate on both single- and dual-stream VL pretrained models.
Neuro-Symbolic Methods. NS-VQA (Yi et al., 2018) disentangled vision and language processing for VL reasoning tasks on simulated images. However, it requires the datasets to include annotations of logical forms describing the language. To reduce the supervision signal from program annotations, NS-CL (Mao et al., 2019) jointly learned concept embeddings and latent programs, and extended the approach to natural images. NSM (Hudson and Manning, 2019b) learned graph-level reasoning and showcased the compositional reasoning abilities of NS methods. To be applicable to both single- and multi-image setups, we adopt the same pipeline as the original NS-VQA: we use the scene graph as the structured representation and test multiple language models for semantic parsing.
Single- and Multi-Image VL Reasoning Tasks. For VL reasoning, there are many datasets that focus on single images, such as CLEVR (Johnson et al., 2017), VQA, and GQA, as well as many other datasets that involve multi-image reasoning, such as NLVR (Suhr et al., 2017), NLVR2 (Suhr et al., 2019), COVR (Bogin et al., 2021), and Winoground (Thrush et al., 2022b). We experiment with two single-image datasets, VQA and GQA, and one multi-image dataset, COVR, all of which use natural images.
3 Models
Next, we formally define the VL reasoning tasks
and VLE2E and NS methods we study. We also
discuss a new NS system for COVR and associated
changes to the original COVR logical forms.
3.1 Vision-Language Reasoning
In a VL reasoning task, each example consists of a triple (q, 𝓘, y), where q is a natural language query, 𝓘 is a set of queried images, and y is the corresponding answer to the query. The number of queried images is |𝓘|; e.g., |𝓘| = 1 for a single-image query. Given query q and image set 𝓘, a VL system f predicts an answer ŷ = f(q, 𝓘). Models are trained on D_train and evaluated on D_test.
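As a concrete (if informal) rendering of this notation, the following Python sketch fixes a data layout for examples and an evaluation loop; the field and function names are ours rather than anything mandated by the datasets.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VLExample:
    query: str          # natural language query q
    images: List[str]   # image set (paths or ids); one element for single-image tasks
    answer: str         # gold answer y

# A VL system f maps (q, image set) to a predicted answer.
VLSystem = Callable[[str, List[str]], str]

def accuracy(f: VLSystem, test_set: List[VLExample]) -> float:
    """Fraction of test examples where f(q, images) matches the gold answer."""
    correct = sum(f(ex.query, ex.images) == ex.answer for ex in test_set)
    return correct / len(test_set)
```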
3.2 Modified VLE2E System
For a VLE2E system, f is a single neural network that is trained end-to-end. Since current VL pretrained models are trained to process single images, we modify the VLE2E pipeline for multi-image settings following Bogin et al. (2021). Given a multi-image query (q, 𝓘) and a pretrained model, for each image I ∈ 𝓘 we feed the pair (q, I) to the pretrained model to obtain an image-text representation. We concatenate these |𝓘| image-text representations and prepend a [CLS] token to construct a sequence of length |𝓘| + 1. We then input this sequence into a two-layer transformer and take the produced embedding of the [CLS] token as the representation of the entire multi-image query. Finally, we feed this representation into an MLP classifier to predict y. All modules, including the pretrained model, are fine-tuned. We experiment with 4 different VL pretrained models: the single-stream VisualBERT and VinVL, and the dual-stream LXMERT and ViLBERT.
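A minimal PyTorch sketch of this aggregation head is shown below. It assumes each (q, I) pair has already been encoded into a single d-dimensional image-text vector by the pretrained backbone; the module and dimension names are ours.

```python
import torch
import torch.nn as nn

class MultiImageAggregator(nn.Module):
    """Aggregates per-image image-text representations into one answer.

    A learned [CLS] embedding is prepended to the |I| image-text vectors,
    the sequence is passed through a two-layer transformer encoder, and the
    output at the [CLS] position feeds an MLP classifier over answers.
    """
    def __init__(self, d_model=768, n_heads=8, num_answers=2):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, num_answers),
        )

    def forward(self, image_text_reps):
        # image_text_reps: (batch, |I|, d_model), one vector per (q, I) pair.
        batch = image_text_reps.size(0)
        cls = self.cls.expand(batch, -1, -1)
        seq = torch.cat([cls, image_text_reps], dim=1)   # length |I| + 1
        encoded = self.encoder(seq)
        return self.classifier(encoded[:, 0])            # logits from [CLS]
```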
3.3 Modified NS System
An NS system separately processes vision and language with two trainable modules φ and ψ. The image set is represented as φ(𝓘), and the query semantics are represented as a functional program ψ(q).
[Figure 2: Question: "There is at least 1 image with exactly 2 dark bottles on the table." Semantic parsing yields a program; OLF: find(table), find(bottle), filter(dark), with_relation(on), …, geq(1); CLF: find(table), find(bottle), filter(attr; dark), filter(rel; on), …, compare(geq; 1). Scene graph generation produces a graph per image (e.g., a dark glass bottle and a cup, each "on" a table). Symbolic execution of the program on the scene graphs yields Pred: Yes.]

Figure 2: Processing a multi-image query with the modified neuro-symbolic method. A language model (blue) maps the question to a functional program in our compositional logical forms (CLF) format; differences from the original logical forms (OLF) are shown in bold. A scene graph generator (purple) processes each image into a separate scene graph, with queried information shown in bold. The program is executed on all the scene graphs together to produce an answer (red).
A pre-defined executor executes ψ(q) on φ(𝓘) to predict the answer ŷ. To apply NS-VQA-like pipelines to real-world images, we use scene graphs as the structured representation φ(𝓘).

We use a pre-trained scene graph generator that can be fine-tuned on task-specific scene graph data, depending on the dataset (see §5.1 for details). We fine-tune large language models to map queries q to functional programs ψ(q) (i.e., semantic parsing). We experiment with 3 language models: (1) T5 (Raffel et al., 2020), (2) BART (Lewis et al., 2020), and (3) GPT-2 (Radford et al., 2019).

Now, we describe the dataset-specific work needed to build a full NS pipeline for GQA and COVR. Both datasets provide logical forms for each question, but these forms require modification to be compatible with NS systems.
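To illustrate how a CLF-style program of the kind shown in Figure 2 can be executed over scene graphs, below is a minimal sketch using a toy dictionary-based scene graph. The operator names follow the figure, but the data format and executor are simplified stand-ins rather than our actual implementation.

```python
# Toy scene graphs: one list of objects per image, each object carrying its
# name, attributes, and outgoing relations (relation, target-object name).
scene_graphs = [
    [  # image 1: two dark bottles and a cup, all on a table
        {"name": "table", "attrs": [], "relations": []},
        {"name": "bottle", "attrs": ["glass", "dark"], "relations": [("on", "table")]},
        {"name": "bottle", "attrs": ["dark"], "relations": [("on", "table")]},
        {"name": "cup", "attrs": [], "relations": [("on", "table")]},
    ],
    [  # image 2: no dark bottles on a table
        {"name": "bottle", "attrs": ["green"], "relations": []},
    ],
]

def find(objects, name):
    return [o for o in objects if o["name"] == name]

def filter_attr(objects, attr):
    return [o for o in objects if attr in o["attrs"]]

def filter_rel(objects, rel, target_name):
    return [o for o in objects if (rel, target_name) in o["relations"]]

def compare_geq(count, threshold):
    return count >= threshold

# "There is at least 1 image with exactly 2 dark bottles on the table."
def execute(scene_graphs):
    images_matching = 0
    for objects in scene_graphs:
        bottles = find(objects, "bottle")
        bottles = filter_attr(bottles, "dark")
        bottles = filter_rel(bottles, "on", "table")
        if len(bottles) == 2:        # "exactly 2" within one image
            images_matching += 1
    return "yes" if compare_geq(images_matching, 1) else "no"

print(execute(scene_graphs))  # -> "yes": image 1 has exactly 2 dark bottles on the table
```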