
reasoning tasks. However, we still lack a comprehensive understanding of the generalization differences between these two paradigms under various setups. Given recent work suggesting that OOD accuracy often strongly correlates with in-distribution accuracy (Miller et al., 2020, 2021), we might expect VLE2E and NS systems to often have similar generalization abilities. But do they?
In this work, we conduct the first comprehensive
comparison of generalization behavior between
VLE2E and NS systems for VL reasoning tasks.
Our study spans single-image and multi-image settings with natural images and includes four distinct types of generalization tests, three of which are shown in Figure 1. We introduce a novel segment-combine test for multi-image settings that requires models to make consistent predictions when some input images are replaced with irrelevant ones. We evaluate on contrast sets (Gardner et al., 2020), including new contrast sets we construct for COVR that test understanding of quantifiers. We also measure compositional generalization, as defined by the compositional splits of COVR (Bogin et al., 2021), and cross-benchmark transfer between VQA and GQA. Finally, we develop improved NS systems for GQA by handling mismatches between program and scene graph object descriptors, and for COVR by refining the original logical language.
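To make the segment-combine criterion concrete, the consistency check it relies on can be sketched as follows. This is a minimal illustrative sketch rather than the released test-generation code; model_predict and the image lists are hypothetical placeholders, and the perturbed image set is assumed to be constructed so that the gold answer is unchanged.

```python
from typing import Callable, List, Sequence

def segment_combine_consistent(
    model_predict: Callable[[str, Sequence[str]], str],
    question: str,
    original_images: List[str],
    perturbed_images: List[str],
) -> bool:
    """Return True if the model answers a multi-image question identically
    on the original image set and on a perturbed set in which some images
    have been replaced with irrelevant ones (gold answer unchanged).

    All arguments are hypothetical placeholders for whatever system
    (VLE2E or NS) and image representation are being evaluated.
    """
    answer_original = model_predict(question, original_images)
    answer_perturbed = model_predict(question, perturbed_images)
    # A model that reasons precisely over the referenced images should be
    # unaffected by the irrelevant replacements.
    return answer_original == answer_perturbed
```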
Overall, we find that VLE2E and NS systems
exhibit distinct and complementary generalization
patterns. The NS systems are more robust than the VLE2E systems in the first three testing situations. The VLE2E systems exhibit overstability to meaning-altering perturbations, suggesting they overfit to spurious correlations in the training data and do not learn precise reasoning skills. We
further find that the semantic parsing module of
NS systems can quickly improve on generalization
tests given a few training examples, whereas VL
models do not adapt as quickly. On the other hand, while VLE2E systems lose more than 10% accuracy when transferring between VQA and GQA, the NS methods perform even worse. Taken together, our findings underscore the need for a diverse suite of generalization tests to fully compare different modeling paradigms. The distinct behaviors of these two types of systems can guide the community in designing more robust VL reasoning systems. We release our code for generating test data, and we encourage future VL models to be evaluated on these tests.1
2 Related Work
We first survey related work on vision-language
reasoning models and OOD evaluation tests.
VL OOD Generalization.
Many efforts have been made to evaluate the generalization ability of VLE2E systems and task-specific methods on compositionality (Johnson et al., 2017; Thrush et al., 2022a), language perturbations (Ribeiro et al., 2019), and visual perturbations (Jimenez et al., 2022). Li et al. (2020) showed that VLE2E systems exhibit better robustness than task-specific methods. We are the first to comprehensively compare the generalization behavior of VLE2E and NS systems across a diverse set of OOD tests.
VL Pretrained Models.
Large-scale, VL pre-
trained models for question-answering can be
single-stream—encoding vision and language fea-
tures together with a single transformer—such as
VisualBERT (Li et al.,2019) and VinVL (Zhang
et al.,2021), or dual-stream—encoding vision and
language with separate transformers and apply-
ing cross-modal transformers later—such as ViL-
BERT (Lu et al.,2019) and LXMERT (Tan and
Bansal,2019). We evaluate on both single- and
dual-stream VL pretrained models.
Neuro-Symbolic Methods.
NS-VQA (Wu et al., 2017) disentangled vision and language processing for VL reasoning tasks on simulated images. However, it requires datasets to include logical-form annotations that describe the language. To reduce the supervision needed from program annotations, NS-CL (Mao et al., 2019) jointly learned concept embeddings and latent programs, and extended the approach to natural images. NSM (Hudson and Manning, 2019b) learned graph-level reasoning and showcased the compositional reasoning abilities of NS methods. To remain applicable to both single- and multi-image setups, we adopt the same pipeline as the original NS-VQA: we use the scene graph as the structural representation and test multiple language models for semantic parsing.
Single- and Multi-Image VL Reasoning Tasks.
For VL reasoning, there are many datasets that focus on single images, such as CLEVR (Johnson et al., 2017), VQA, and GQA, as well as many
1 We release our code and test data at https://github.com/Bill1235813/gendiff_vlsys