MULTI-VIEWPOINT AND MULTI-EVALUATION WITH
FELICITOUS INDUCTIVE BIAS BOOST MACHINE ABSTRACT
REASONING ABILITY
Qinglai Wei
State Key Laboratory for Management and Control of Complex Systems,
Institute of Automation, Chinese Academy of Sciences
School of Artificial Intelligence, University of Chinese Academy of Sciences
Beijing, China
qinglai.wei@ia.ac.cn
Diancheng Chen
State Key Laboratory for Management and Control of Complex Systems,
Institute of Automation, Chinese Academy of Sciences
School of Artificial Intelligence, University of Chinese Academy of Sciences
Beijing, China
chendiancheng2020@ia.ac.cn
Beiming Yuan
School of Artificial Intelligence, University of Chinese Academy of Sciences
Beijing, China
yuanbeiming20@mails.ucas.ac.cn
March 30, 2023
ABSTRACT
Great efforts have been made to study AI’s ability in abstract reasoning, and different versions of Raven’s Progressive Matrices (RPM) have been proposed as benchmarks along the way. Previous works suggest that, without sophisticated design or extra meta-data containing semantic information, neural networks may still be indecisive on RPM problems even after extensive training. Through thorough experiments, we show that neural networks endowed with felicitous inductive bias, whether intentionally designed or serendipitously matched, can solve RPM problems efficiently without the aid of any extra meta-data. Our work also reveals that multi-viewpoint with multi-evaluation is a key learning strategy for successful reasoning. Nevertheless, we also point out the unique role of meta-data by showing that a pre-trained model supervised by the meta-data leads to an RPM solver with better performance. Source code can be found at
https://github.com/QinglaiWeiCASIA/RavenSolver.
Keywords
Abstract Reasoning, Raven’s Progressive Matrices, Inductive Bias, Convolutional Neural Network,
Transformer, Generalization
1 All authors contributed equally to this work.
2 Corresponding author: Diancheng Chen (chendiancheng2020@ia.ac.cn)
arXiv:2210.14914v2 [cs.LG] 29 Mar 2023
A PREPRINT - MARCH 30, 2023
1 Introduction
From expert systems with elaborately designed rules to the renaissance of neural networks, AI practitioners have never ceased working to make machine intelligence a counterpart of human intelligence. The tremendous success of machine learning in areas such as visual perception [1, 2, 3], natural language processing [4, 5, 6], and generative models [7, 8, 9] has prompted researchers to study the reasoning ability of AI. Representative works include, but are not limited to, visual question answering [10, 11], flexible application of language models [12, 13, 14], and abstract reasoning problems [15, 16]. Here we consider the RPM problem, which was originally developed for IQ testing [17] and has recently served as a benchmark for evaluating AI’s abstract reasoning ability.
Figure 1: Demonstrations of RPM problems. The two RPM questions, (a) and (b), are snapshots from the I-RAVEN and PGM datasets, respectively.
Fig. 1 shows two RPM problems. Without loss of generality, an RPM problem is constructed in three steps. First, rules that determine the changing patterns of visual attributes are sampled from a predefined rule set. Common rules include, but are not limited to, arithmetic, set, and logic operations. Second, given the sampled rules, proper values are assigned to all the visual attributes. Common visual attributes include type, size, and color. Some visual attributes may act as distractors, with values that change randomly. Finally, images are rendered based on all the visual attribute values. An instantiated RPM problem is composed of a context and an answer pool. The context is a 3 × 3 image matrix with the image in the lower right corner missing. The answer pool contains 8 images, and test-takers are expected to select the image that best completes the matrix so that it is compatible with the internal rules.
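The three construction steps can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the generator of I-RAVEN or PGM: the two-rule set (`constant`, `progression`), the single distractor attribute, the six discrete values per attribute, and all function names are hypothetical, and rendering is replaced by a tuple of attribute values.

```python
import random

ATTRIBUTES = ("type", "size", "color")  # visual attributes named in the text
NUM_VALUES = 6                          # assumed number of discrete values per attribute


def sample_rules(rng):
    """Step 1: sample one rule per attribute from a small, assumed rule set."""
    rules = {a: rng.choice(["constant", "progression"]) for a in ATTRIBUTES}
    rules[rng.choice(ATTRIBUTES)] = "distractor"  # one attribute changes randomly
    return rules


def assign_values(rules, rng):
    """Step 2: choose attribute values for all 9 panels so each row obeys the rules."""
    matrix = [[{} for _ in range(3)] for _ in range(3)]
    for attr, rule in rules.items():
        for row in matrix:
            start = rng.randrange(NUM_VALUES - 2)  # leaves room for start + 2
            for col, panel in enumerate(row):
                if rule == "constant":
                    panel[attr] = start
                elif rule == "progression":
                    panel[attr] = start + col
                else:  # distractor: value is unrelated to any rule
                    panel[attr] = rng.randrange(NUM_VALUES)
    return matrix


def render(panel):
    """Step 3: stand-in for image rendering; returns a hashable description."""
    return tuple(panel[a] for a in ATTRIBUTES)


def make_problem(seed=0):
    """Build one toy problem: 8 context panels, 8 candidates, answer index, rules."""
    rng = random.Random(seed)
    rules = sample_rules(rng)
    panels = [render(p) for row in assign_values(rules, rng) for p in row]
    context, target = panels[:8], panels[8]
    # Only rule-governed attributes are perturbed, so every wrong candidate breaks a rule.
    governed = [i for i, a in enumerate(ATTRIBUTES) if rules[a] != "distractor"]
    pool = {target}
    while len(pool) < 8:
        wrong = list(target)
        i = rng.choice(governed)
        wrong[i] = (wrong[i] + rng.randrange(1, NUM_VALUES)) % NUM_VALUES
        pool.add(tuple(wrong))
    pool = list(pool)
    rng.shuffle(pool)
    return context, pool, pool.index(target), rules
```

Calling `make_problem(seed=1)` returns the 8 context panels, the 8 candidates, the index of the correct candidate, and the sampled rules. Wrong candidates differ from the target only in rule-governed attributes, because a change in the distractor attribute would not violate any rule.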
To achieve satisfactory reasoning accuracy on RPM problems, a model should be able to extract the visual attributes relevant to the downstream task and, at the same time, infer the underlying rules. Consequently, traditional neural networks consisting of perception modules only are unable to solve RPM problems [18, 19].
In this work, we solve the RPM problems in an end-to-end manner. We follow several key points when developing the black-box RPM solver: distinct modularization that imitates the complete perception and reasoning processes; encapsulation of two potential RPM characteristics, namely permutation-invariance and transpose-invariance, into the inductive bias design; and the implementation of a multi-viewpoint and multi-evaluation strategy. Specifically, distinct modularization requires both cooperation and a clear boundary between the feature extraction module and the reasoning module. Each module is expected to attend to its own duty; otherwise, adding a new module merely extends the depth of the neural network. We address this issue by injecting inductive bias into the reasoning module to make it aware of the permutation-invariance and transpose-invariance characteristics of RPM problems. In addition, the various visual attributes and rules involved in RPM problems result in abundant attribute-rule combinations. In light of this, we equip the feature extraction module with a multi-viewpoint strategy and the reasoning module with a multi-evaluation strategy, which gives the model the ability to attend to RPM problems from different perspectives. The aforementioned components suffice to build an RPM solver with very high reasoning accuracy. In addition, we train an auxiliary model to predict the natural language descriptions of the rules of RPM problems. Adopting this auxiliary model as a pre-trained model, we manage to train an RPM solver with higher reasoning accuracy in much less time.
The results of our work are promising and intriguing in several ways. First, models with multi-viewpoint and multi-evaluation strategies, based on either a convolutional neural network (CNN) or a vision transformer (ViT [20]), achieve competitive reasoning accuracies without the aid of any meta-data. Second, experiments show that the rules captured by the neural network differ from the predefined rules. Third, we find that a model predicting the rules of the RPM problem serves well as a pre-trained model for the RPM solver, which brings higher reasoning accuracy and faster training.
2 Related Work
2.1 RPM Dataset
We study the RAVEN [18], I-RAVEN [31], and PGM [19] datasets in this work. All these datasets follow the general construction guideline described above, but they differ in subtle ways.
RAVEN consists of 7 distinct configurations with different difficulty levels. The easiest configuration is ‘Center’, where each panel of the problem matrix has only one entity, while harder configurations such as ‘3 × 3 Grid’ have at most nine entities in each panel. Test-takers are required to observe the changing patterns row-wise, extract visual attributes, summarize the rules controlling the row-wise changes of visual attributes, and then make a choice to complete the problem matrix. The most difficult configuration, ‘O-IG’, shown in Fig. 1(a), requires test-takers to divide the entities in each panel into two groups, each of which follows its own set of rules, and then reason about each group separately. Some studies show that the answer generation process of RAVEN encourages neural network solvers to find shortcut solutions instead of discovering rules [31, 24], and datasets such as I-RAVEN and RAVEN-Fair, with refined answer generation strategies, have been proposed to address this issue [31, 24].
Fig. 1(b) shows an example of the PGM dataset, where each panel of the PGM problem matrix may have entities in
the foreground and lines in the background. Test-takers need to observe the changes of visual attributes row-wise and
column-wise simultaneously, summarize the potential rules in the foreground and background respectively, and then
complete the reasoning task accordingly.
Statistically, RAVEN and I-RAVEN have more rules per question on average than PGM (6.29 vs. 1.37 [18]). RAVEN and I-RAVEN have two fixed visual attributes as distractors, while PGM is far more flexible in that any visual attribute can be a distractor. The rules of RAVEN and I-RAVEN are encoded row-wise, while in PGM one must check row-wise and column-wise information simultaneously to summarize the rules.
2.2 RPM solvers
The literature on RPM solvers has expanded rapidly in recent years. Here we roughly divide the solvers into two categories. The first consists of end-to-end black-box solvers, which account for the majority of previous works. The second leverages symbolic AI to obtain results beyond reasoning accuracy, such as interpretability.
The end-to-end black-box models focus on improving the reasoning accuracy on RPM problems. Early works show that prevalent visual models fail to solve RPM problems, and that adding extra labels containing structure or rule information improves the results to some extent [18, 19]. In LEN [21], the authors argue that the main challenge in solving RPM problems is the elimination of distracting information. CoPINet [22] and DCNet [23] leverage contrastive learning in reasoning. MRNet [24] shows that retrieving features from different serially connected CNN blocks helps the model capture multiple visual attributes simultaneously; it is also the first work to report that extra meta-data jeopardizes network performance. In SCL [25], tensor scattering is performed so that each scattered part attends to specific visual attributes or rules. SAVIR-T [26] extracts intra-image information and inter-image relations to facilitate reasoning.
Symbolic-AI-powered methods bring higher reasoning accuracy and stronger model interpretability. In PrAE [27], a neural-symbolic system performs probabilistic abduction and execution to generate an answer image. ALANS [28] removes the prior knowledge required in PrAE and outperforms monolithic end-to-end models in terms of generalization ability. NVSA [29] uses holographic vectorized representations and ground-truth attribute values to build a neural-symbolic model.
On the one hand, our methods build on the successful experience of previous models. Specifically, we fully utilize the inductive bias of the RPM problem as MRNet and SAVIR-T do, and adopt the encoder architecture of MRNet in one of our models. On the other hand, the active expression of inductive bias in our models and the unique multi-viewpoint and multi-evaluation strategies make our models stand out from previous models in terms of reasoning accuracy.
2.3 CLIP
CLIP is a multi-modal pre-training method that jointly trains an image encoder and a natural language encoder. By maximizing the similarity between the visual representation and the natural language embedding of positive sample pairs and minimizing the same similarity for negative sample pairs, CLIP learns visual representations of high quality, which enables zero-shot transfer to downstream tasks [30].
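The contrastive objective described above can be written as a symmetric cross-entropy over the image-text similarity matrix. The sketch below is a minimal pure-Python version for illustration only; the function names are hypothetical, the fixed temperature of 0.07 is an assumption (in CLIP the temperature is a learned parameter), and it is not the masked variant trained in this work.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images and texts is the positive pair."""
    n = len(image_embs)
    # sim[r][c]: scaled similarity between image r and text c
    sim = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    # image-to-text: each row should put its mass on the diagonal entry
    loss_i2t = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    # text-to-image: the same requirement for each column
    loss_t2i = sum(cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

With matching pairs on the diagonal the loss is close to zero, while swapping the text embeddings so that every positive pair is mismatched makes it large, which is the behaviour the training objective exploits.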
In our study, we show that our model produces unaligned rule representations for RPM problem matrices with the same rule. To guide the behaviour of our model, we train a CLIP model with a specific mask scheme to align the rule representation of each RPM problem matrix with the embedding of the natural language describing the corresponding rule, and then use the visual branch of the trained CLIP as a pre-trained perception module for our model. As a result, we obtain a new model with remarkably higher reasoning accuracy and faster convergence than our original model without pre-training.
3 Method
Here we give the definition of the RPM problem: $\{X_p^i\}_{i=1}^{8}$ denotes the ordered images in each 3 × 3 problem matrix, with the image in the lower right corner missing. $\{X_{ac}^i\}_{i=1}^{8}$ denotes the unordered answer candidates. The test-taker is expected to select one image from the answer candidates to complete the problem matrix.
We first introduce our RPM solvers in two forms, namely RS-CNN and RS-TRAN, which are composed of convolutional neural networks and transformer blocks, respectively. We show that RS-CNN can perform accurate reasoning on the RAVEN and I-RAVEN datasets with proper inductive bias design, while the inductive bias of RS-TRAN naturally lends itself to all the RPM problems without extra design, and that the multi-viewpoint with multi-evaluation mechanism improves the reasoning ability of RS-TRAN remarkably. We then discuss the potential problems of the original meta-data and introduce RS-TRAN-CLIP, a masked CLIP-based pre-training model for RS-TRAN.
3.1 RS-CNN
RS-CNN consists of a perception module and a reasoning module. The perception module is expected to capture various visual attributes simultaneously. We follow the architecture of the multi-scale encoder of MRNet [24], with different convolutional blocks attending to different visual attributes, as shown in Fig. 2. For the images in a problem matrix $\{X_p^i\}_{i=1}^{8}$ and the corresponding answer candidates $\{X_{ac}^i\}_{i=1}^{8}$, the perception module of RS-CNN produces the representation triplets $\{e_{p,h}^i, e_{p,m}^i, e_{p,l}^i\}_{i=1}^{8}$ and $\{e_{ac,h}^i, e_{ac,m}^i, e_{ac,l}^i\}_{i=1}^{8}$, where $h$, $m$, and $l$ refer to the convolutional blocks $E_H$, $E_M$, and $E_L$ in Fig. 2, respectively.
Figure 2: Simple illustration of the multi-scale encoder developed in MRNet ($E_{H/M/L}$: encoder; $T$: downstream task). $E_H$, $E_M$, and $E_L$, serially connected, are residual convolutional blocks with decreasing kernel size. The information processed by each block flows into its successor, and the output of each block also serves individually as a representation for the downstream tasks.
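The data flow in the caption of Fig. 2 can be illustrated with a toy one-dimensional example. It is a sketch of the serial-with-taps pattern only, not the encoder of MRNet or RS-CNN: the averaging filters, the kernel sizes 7, 5, and 3, and the function names are assumptions chosen for illustration.

```python
def filter_same(x, kernel):
    """Sliding-window filter with zero padding; output length equals len(x).

    Assumes an odd kernel length so the padding is symmetric.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def residual_block(x, kernel_size):
    """Toy residual block: x + relu(filter(x)) with an averaging kernel."""
    kernel = [1.0 / kernel_size] * kernel_size
    filtered = filter_same(x, kernel)
    return [xi + max(0.0, fi) for xi, fi in zip(x, filtered)]


def multi_scale_encode(x, kernel_sizes=(7, 5, 3)):
    """Run three blocks serially and return every block's output as (e_h, e_m, e_l)."""
    outputs = []
    for k in kernel_sizes:        # decreasing kernel size, as in the caption
        x = residual_block(x, k)  # the successor block consumes this output
        outputs.append(x)         # the same output is also kept as a representation
    return tuple(outputs)
```

For an input such as `[1.0, 2.0, 3.0, 4.0]`, `multi_scale_encode` returns three lists of the same length, one per block, which mirrors the representation triplet produced for each image.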