MULTI-VIEWPOINT AND MULTI-EVALUATION WITH FELICITOUS INDUCTIVE BIAS BOOST MACHINE ABSTRACT REASONING ABILITY

Qinglai Wei

State Key Laboratory for Management and Control of Complex Systems,

Institute of Automation, Chinese Academy of Sciences

School of Artificial Intelligence, University of Chinese Academy of Sciences

Beijing, China

qinglai.wei@ia.ac.cn

Diancheng Chen

State Key Laboratory for Management and Control of Complex Systems,

Institute of Automation, Chinese Academy of Sciences

School of Artificial Intelligence, University of Chinese Academy of Sciences

Beijing, China

chendiancheng2020@ia.ac.cn

Beiming Yuan

School of Artificial Intelligence, University of Chinese Academy of Sciences

Beijing, China

yuanbeiming20@mails.ucas.ac.cn

March 30, 2023

ABSTRACT

Great effort has been devoted to studying AI's ability in abstract reasoning, and different versions of Raven's Progressive Matrices (RPM) have been proposed as benchmarks along the way. Previous works suggest that, without sophisticated design or extra meta-data containing semantic information, neural networks may still be indecisive when solving RPM problems, even after extensive training. Through thorough experiments, we show that neural networks endowed with a felicitous inductive bias, whether intentionally designed or serendipitously matched, can solve RPM problems efficiently without the aid of any extra meta-data. Our work also reveals that multi-viewpoint with multi-evaluation is a key learning strategy for successful reasoning. Nevertheless, we also point out the unique role of meta-data by showing that a pre-training model supervised by the meta-data leads to an RPM solver with better performance. Source code can be found at https://github.com/QinglaiWeiCASIA/RavenSolver.

Keywords

Abstract Reasoning, Raven's Progressive Matrices, Inductive Bias, Convolutional Neural Network, Transformer, Generalization

1 All authors contributed equally to this work.

2 Corresponding author: Diancheng Chen (chendiancheng2020@ia.ac.cn)

arXiv:2210.14914v2 [cs.LG] 29 Mar 2023

A PREPRINT - MARCH 30, 2023

1 Introduction

From expert systems with elaborately designed rules to the renaissance of neural networks, AI practitioners have never ceased working to make machine intelligence a counterpart of human intelligence. The tremendous success of machine learning in areas such as visual perception [ ], natural language processing [ ], and generative models [ ] has motivated researchers to study the reasoning ability of AI. Representative works include, but are not limited to, visual question answering [ ], flexible application of language models [ ], and abstract reasoning problems [ ]. Here we consider the RPM problem, which was originally developed for IQ testing [ ] and has recently served as a benchmark for evaluating AI's abstract reasoning ability.

Figure 1: Demonstrations of RPM problems. Panels (a) and (b) are snapshots from the I-RAVEN and PGM datasets, respectively.

Fig. 1 shows two RPM problems. Without loss of generality, an RPM problem is constructed in three steps. First, rules that determine how the visual attributes change are sampled from a predefined rule set. Common rules include, but are not limited to, arithmetic, set, and logic operations. Second, given the sampled rules, proper values are assigned to all the visual attributes. Common visual attributes include type, size, and color. Some visual attributes may act as distractors, with their values changing randomly. Finally, images are rendered based on all the visual attribute values. An instantiated RPM problem is composed of a context and an answer pool: the context is a 3×3 image matrix with the image in the lower right corner missing, while the answer pool contains 8 images. Test-takers are expected to select the best-fitting image from the answer pool to complete the matrix so that it is compatible with the underlying rules.
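
The three construction steps above can be sketched at the attribute level. The snippet below is a minimal illustration, not the generator used by RAVEN, I-RAVEN, or PGM: the rule names, the attribute value ranges, the `angle` distractor, and the function `make_problem` are all hypothetical, and rendering is skipped so that each "image" remains a dictionary of attribute values.

```python
import random

# Hypothetical toy rule set: each rule maps (row start value, column index) to a value.
RULES = {
    "constant": lambda start, col: start,
    "progression": lambda start, col: start + col,
}
ATTRIBUTES = ["type", "size", "color"]  # attributes governed by sampled rules
DISTRACTOR = "angle"                    # assumed distractor: changes randomly


def make_problem(rng):
    # Step 1: sample one rule per rule-governed attribute.
    rules = {attr: rng.choice(sorted(RULES)) for attr in ATTRIBUTES}
    # Step 2: assign attribute values row by row according to the sampled rules.
    panels = []
    for _ in range(3):
        starts = {attr: rng.randint(0, 5) for attr in ATTRIBUTES}
        for col in range(3):
            panel = {a: RULES[rules[a]](starts[a], col) for a in ATTRIBUTES}
            panel[DISTRACTOR] = rng.randint(0, 7)
            panels.append(panel)
    # Step 3 (rendering) is skipped: panels stay as attribute dictionaries.
    context, answer = panels[:8], panels[8]
    # Answer pool: the correct panel plus 7 distinct panels that break one rule.
    pool = [answer]
    while len(pool) < 8:
        wrong = dict(answer)
        wrong[rng.choice(ATTRIBUTES)] += rng.randint(1, 3)
        if wrong not in pool:
            pool.append(wrong)
    rng.shuffle(pool)
    return rules, context, pool, pool.index(answer)
```

Calling `make_problem(random.Random(0))` returns the sampled rules, the 8 context panels, the shuffled pool of 8 candidates, and the index of the correct candidate in that pool.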

To achieve satisfactory reasoning accuracy on RPM problems, a model should be able to extract the visual attributes relevant to the downstream task and, at the same time, infer the underlying rules. That is, traditional neural networks consisting of perception modules only are not competent to solve RPM problems [ ].

In this work, we solve RPM problems in an end-to-end manner. We follow several key points when developing the black-box RPM solver: distinct modularization to imitate the complete perception and reasoning processes; encapsulation of two potential RPM characteristics, namely permutation-invariance and transpose-invariance, into the inductive bias design; and the implementation of a multi-viewpoint and multi-evaluation strategy. Specifically, distinct modularization requires both cooperation and a clear boundary between the feature extraction module and the reasoning module. Each module is expected to attend to its own duty properly; otherwise, adding a new module merely extends the depth of the neural network. We address this issue by injecting the available inductive bias into the reasoning module to make it aware of the permutation-invariance and transpose-invariance characteristics of RPM problems. On the other hand, various visual attributes and rules are involved in RPM problems, resulting in abundant attribute-rule combinations. In light of this, we equip the feature extraction module with a multi-viewpoint strategy and the reasoning module with a multi-evaluation strategy, which endows the model with the ability to attend to RPM problems from different perspectives. The aforementioned components suffice to build an RPM solver with very high reasoning accuracy. Beyond this, we train an auxiliary model to predict the natural language descriptions of the rules of RPM problems. Adopting this auxiliary model as a pre-training model, we manage to train an RPM solver with higher reasoning accuracy much faster.

The results of our work are promising and intriguing in several ways. First, they show that models with multi-viewpoint and multi-evaluation strategies, based on either a convolutional neural network (CNN) or a vision transformer (ViT [ ]), produce competitive reasoning accuracies without the aid of any meta-data. Second, it is shown experimentally that the rules captured by the neural network differ from the predefined rules. Third, we find that a model predicting the rules of the RPM problem can serve well as a pre-training model for the RPM solver, which brings higher reasoning accuracy and faster training.

2 Related Work

2.1 RPM Dataset

We study the RAVEN [ ], I-RAVEN [ ], and PGM [ ] datasets in this work. All these datasets follow the general construction guideline described above, but they differ in subtle ways.

RAVEN consists of 7 distinct configurations with different difficulty levels. The easiest configuration is ‘Center’, where each panel of the problem matrix has only one entity, while harder configurations such as ‘3×3 Grid’ have at most nine entities in each panel. Test-takers are required to observe the changing patterns row-wise, extract visual attributes, summarize the rules controlling the row-wise changes of the visual attributes, and then make a choice to complete the problem matrix. The most difficult configuration, ‘O-IG’, as shown in Fig. 1(a), requires test-takers to divide the entities in each panel into two groups, each of which follows one set of rules, and then reason about each group separately. Some studies show that the answer generation process of RAVEN encourages neural network solvers to find shortcut solutions instead of discovering rules [ ], and datasets such as I-RAVEN and RAVEN-Fair, with refined answer generation strategies, have been proposed to address this issue [31, 24].

Fig. 1(b) shows an example from the PGM dataset, where each panel of the problem matrix may have entities in the foreground and lines in the background. Test-takers need to observe the changes of visual attributes row-wise and column-wise simultaneously, summarize the potential rules in the foreground and background respectively, and then complete the reasoning task accordingly.

On average, RAVEN and I-RAVEN contain more rules per question than PGM (6.29 vs. 1.37 [ ]). RAVEN and I-RAVEN have two fixed visual attributes as distractors, while PGM is far more flexible in that any visual attribute can be a distractor. The rules of RAVEN and I-RAVEN are encoded row-wise, while one must check row-wise and column-wise information simultaneously to summarize the rules in PGM.

2.2 RPM Solvers

The literature on RPM solvers has expanded rapidly in recent years. Here we roughly divide it into two categories. The first consists of end-to-end black-box solvers, which account for the majority of previous works. The second leverages symbolic AI to obtain results beyond reasoning accuracy, such as interpretability.

The end-to-end black-box models focus on improving the reasoning accuracy on RPM problems. Early works show that prevalent visual models fail to solve RPM problems, and that adding extra labels containing structure or rule information improves the results to some extent [ ]. In LEN [ ], the researchers argue that the main challenge in solving RPM problems is the elimination of distracting information. CoPINet [ ] and DCNet [ ] leverage contrastive learning in reasoning. MRNet [ ] shows that retrieving features from different serially connected CNN blocks helps the model capture multiple visual attributes simultaneously; it is also the first work to report that extra meta-data jeopardizes network performance. In SCL [ ], tensor scattering is performed so that each scattered part attends to specific visual attributes or rules. SAVIR-T [ ] extracts intra-image information and inter-image relations to facilitate reasoning.

Methods powered by symbolic AI bring higher reasoning accuracy and stronger model interpretability. In PrAE [ ], a neural-symbolic system performs probabilistic abduction and execution to generate an answer image. ALANS [ ] gets rid of the prior knowledge required in PrAE and outperforms monolithic end-to-end models in terms of generalization ability. NVSA [ ] uses holographic vectorized representations and ground-truth attribute values to build a neural-symbolic model.

On the one hand, our methods absorb the successful experience of previous models. Specifically, we fully utilize the inductive bias of the RPM problem as MRNet and SAVIR-T do, and adopt the encoder architecture of MRNet in one of our models. On the other hand, the active expression of inductive bias in our models, together with the unique multi-viewpoint and multi-evaluation strategies, makes our models stand out from previous models in terms of reasoning accuracy.

2.3 CLIP

CLIP is a multi-modal pre-training neural network that jointly trains an image encoder and a natural language encoder. By maximizing the similarity between the visual representation and the natural language embedding of positive sample pairs and minimizing that similarity for negative sample pairs, CLIP learns high-quality visual representations, which enable zero-shot transfer to downstream tasks [30].
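
The objective described above can be written as a symmetric cross-entropy over an image-text similarity matrix. The following is a minimal pure-Python sketch under stated assumptions: embeddings are plain lists of floats, the temperature is fixed at an assumed value of 0.07 rather than learned, and the function names are illustrative. It does not include the mask scheme used later in this paper.

```python
import math


def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def clip_loss(image_embs, text_embs, temperature=0.07):
    # Pair (image i, text i) is positive; every other combination is negative.
    n = len(image_embs)
    sim = [[cosine(img, txt) / temperature for txt in text_embs]
           for img in image_embs]
    # Each image should pick out its own text among all texts ...
    image_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # ... and each text should pick out its own image among all images.
    text_to_image = sum(
        cross_entropy([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)
```

Minimizing this loss pushes the diagonal of the similarity matrix (positive pairs) up and the off-diagonal entries (negative pairs) down, so a batch whose embeddings are correctly aligned yields a lower loss than the same batch with the texts swapped.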

In our study, we show that our model produces unaligned rule representations for RPM problem matrices with the same rule. To guide the behaviour of our model, we train a CLIP model with a specific mask scheme to align the rule representation of each RPM problem matrix with the embedding of the natural language describing the corresponding rule, and then regard the visual end of the trained CLIP as a pre-trained perception module for our model. As a result, we obtain a new model with remarkably high reasoning accuracy and fast convergence compared with our original model without pre-training.

3 Method

Here we give the definition of the RPM problem: $\{X_p^i\}_{i=1}^{8}$ denotes the ordered images in each $3\times3$ problem matrix, with the image in the lower right corner missing. $\{X_{ac}^i\}_{i=1}^{8}$ denotes the unordered answer candidates. The test-taker is expected to select one image from the answer candidates to complete the problem matrix.
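
One way to read this definition in code is sketched below. It assumes a scoring function that rates a completed matrix; the scorer `row_progression_score`, the integer stand-ins for images, and the function `solve` are illustrative assumptions and do not describe the solvers introduced in this paper.

```python
import random


def solve(context, candidates, score):
    # Complete the 3x3 matrix with each candidate in turn; keep the best-scoring one.
    scores = [score(context + [cand]) for cand in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def row_progression_score(matrix):
    # Hypothetical scorer: penalize rows whose values do not change by a constant step.
    rows = [matrix[r * 3:(r + 1) * 3] for r in range(3)]
    return -sum(abs((row[2] - row[1]) - (row[1] - row[0])) for row in rows)


# Ordered context: position in the list is position in the matrix (row-major).
context = [1, 2, 3, 4, 6, 8, 5, 8]
# Unordered candidates: their position carries no information.
candidates = [9, 10, 11, 12, 13, 14, 15, 16]

chosen = candidates[solve(context, candidates, row_progression_score)]

# Shuffling the candidates changes the returned index but not the chosen image.
shuffled = candidates[:]
random.Random(1).shuffle(shuffled)
chosen_after_shuffle = shuffled[solve(context, shuffled, row_progression_score)]
```

Here the third row starts with 5 and 8, so the only candidate that keeps a constant step is 11, regardless of where it sits in the candidate list.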

We first introduce our RPM solvers in two forms, namely RS-CNN and RS-TRAN, which are composed of convolutional neural networks and transformer blocks, respectively. We show that RS-CNN can reason accurately on the RAVEN and I-RAVEN datasets given a proper inductive bias design, while the inductive bias of RS-TRAN naturally lends itself to all the RPM problems without extra design, and that the multi-viewpoint with multi-evaluation mechanism improves the reasoning ability of RS-TRAN remarkably. We then discuss the potential problems of the original meta-data and introduce RS-TRAN-CLIP, a masked CLIP-based pre-training model for RS-TRAN.

3.1 RS-CNN

RS-CNN consists of a perception module and a reasoning module. The perception module is expected to capture various visual attributes simultaneously. We follow the architecture of the multi-scale encoder of MRNet [ ], with different convolutional blocks attending to different visual attributes, as shown in Fig. 2. For the images in a problem matrix $\{X_p^i\}_{i=1}^{8}$ and the corresponding answer candidates $\{X_{ac}^i\}_{i=1}^{8}$, the perception module of RS-CNN produces the representation triplets $\{e_{p,h}^i, e_{p,m}^i, e_{p,l}^i\}_{i=1}^{8}$ and $\{e_{ac,h}^i, e_{ac,m}^i, e_{ac,l}^i\}_{i=1}^{8}$, where $h$, $m$, $l$ refer to the convolutional blocks $E_H$, $E_M$, $E_L$ in Fig. 2, respectively.

Figure 2: Simple illustration of the multi-scale encoder developed in MRNet. $E_H$, $E_M$, and $E_L$ (encoders), serially connected, are residual convolutional blocks with decreasing kernel size; $T$ denotes the downstream task. Not only does the information processed by a former block flow into its successor, but the output of each block also serves individually as a representation for the downstream task.
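
The data flow in Fig. 2 can be mimicked with a toy one-dimensional example. This is a sketch only: the moving-average `conv_block` stands in for a learned residual convolutional block, and the kernel sizes (7, 5, 3) are assumed values chosen simply to decrease, not figures taken from MRNet.

```python
def conv_block(signal, kernel_size):
    # Stand-in for a residual convolutional block: a moving average over a
    # window of `kernel_size` (truncated at the borders), added back onto the
    # input to mimic a residual connection.
    half = kernel_size // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half):i + half + 1]
        out.append(signal[i] + sum(window) / len(window))
    return out


def multi_scale_encode(signal, kernel_sizes=(7, 5, 3)):
    # Blocks are connected serially, and every intermediate output is also
    # kept as its own representation for the downstream task.
    representations = []
    x = signal
    for k in kernel_sizes:
        x = conv_block(x, k)
        representations.append(x)
    return representations  # plays the role of [e_h, e_m, e_l]


e_h, e_m, e_l = multi_scale_encode([0.0, 1.0, 0.0, 2.0, 0.0, 3.0, 0.0, 4.0])
```

Each later representation is computed from the previous block's output rather than from the raw input, while all three are returned, which matches the two properties stated in the caption.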
