A PREPRINT - MARCH 30, 2023
The results of our work are promising and intriguing in several ways. First, they show that models with multi-viewpoint
and multi-evaluation strategies, based on either a convolutional neural network (CNN) or a vision transformer (ViT [20]),
achieve competitive reasoning accuracy without the aid of any meta-data. Second, we show experimentally that the
rules captured by the neural network differ from the predefined rules. Third, we find that a model that predicts
the rules of an RPM problem serves well as a pre-trained model for the RPM solver, bringing higher
reasoning accuracy and faster training.
2 Related Work
2.1 RPM Dataset
We study the RAVEN [18], I-RAVEN [31], and PGM [19] datasets in this work. All of these datasets follow the general
construction guideline described before, but they differ in subtle ways.
RAVEN consists of 7 distinct configurations with different difficulty levels. The easiest configuration is ‘Center’, where
each panel of the problem matrix has only one entity, while harder configurations such as ‘3×3 Grid’ have up to
nine entities in each panel. Test-takers are required to observe the changing patterns row-wise, extract visual attributes,
summarize the rules controlling the row-wise changes of visual attributes, and then make a choice to complete the problem matrix.
The most difficult configuration, ‘O-IG’, as shown in Fig. 1(a), requires test-takers to divide the entities in each panel
into two groups, each of which follows its own set of rules, and then reason about each group separately. Several studies show
that the answer generation process of RAVEN encourages neural network solvers to find shortcut solutions instead of
discovering rules [31, 24], and datasets such as I-RAVEN and RAVEN-Fair, which refine the answer generation strategy, have been
proposed to address this issue [31, 24].
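The paragraph above does not spell out what such a shortcut looks like. One form discussed in connection with I-RAVEN [31], stated here as an assumption rather than as a description taken from this paper, is that when every distractor differs from the correct answer in a single attribute, the correct answer can be recovered by a per-attribute majority vote over the candidates alone. The sketch below illustrates this on a toy answer set; the attribute names, the values, and the function `pick_by_majority` are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Candidate = Dict[str, int]  # attribute name -> value (toy encoding, assumed)


def pick_by_majority(candidates: List[Candidate]) -> int:
    """Return the index of the candidate that agrees most with the per-attribute modes.

    The context panels are never consulted, which is what makes this a shortcut.
    """
    attributes = list(candidates[0].keys())
    # Most common value of each attribute across the answer set.
    modes = {
        attr: Counter(c[attr] for c in candidates).most_common(1)[0][0]
        for attr in attributes
    }
    # Score each candidate by how many of its attributes equal the mode.
    scores = [sum(c[attr] == modes[attr] for attr in attributes) for c in candidates]
    return scores.index(max(scores))


# Hypothetical answer set: index 2 is the correct answer, and each distractor
# changes exactly one attribute of it.
answer_set = [
    {"type": 3, "size": 2, "color": 5},
    {"type": 1, "size": 4, "color": 5},
    {"type": 1, "size": 2, "color": 5},
    {"type": 1, "size": 2, "color": 7},
]
print(pick_by_majority(answer_set))
```

A solver that learns this voting pattern can score well without discovering any rule, which is the behaviour that the refined answer generation strategies are meant to rule out.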
Fig. 1(b) shows an example of the PGM dataset, where each panel of the PGM problem matrix may have entities in
the foreground and lines in the background. Test-takers need to observe the changes of visual attributes row-wise and
column-wise simultaneously, summarize the potential rules in the foreground and background respectively, and then
complete the reasoning task accordingly.
On average, RAVEN and I-RAVEN have more rules per question than PGM (6.29 vs. 1.37 [18]).
RAVEN and I-RAVEN have two fixed visual attributes as distractors, whereas PGM is more flexible in that
any visual attribute can be a distractor. Rules in RAVEN and I-RAVEN are encoded row-wise, whereas in PGM one must check
row-wise and column-wise information simultaneously to summarize the rules.
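The difference between row-wise encoding and combined row-wise and column-wise checking can be sketched on a toy example. The code below assumes that a single visual attribute is encoded as a 3×3 grid of integers and uses a simple constant-step "progression" rule; the encoding and all function names are illustrative assumptions, not the generation code of either dataset.

```python
from typing import List, Sequence

Matrix = List[List[int]]  # 3x3 grid of values for one visual attribute (toy encoding)


def is_progression(values: Sequence[int]) -> bool:
    """True if consecutive values differ by the same non-zero step."""
    steps = [b - a for a, b in zip(values, values[1:])]
    return len(set(steps)) == 1 and steps[0] != 0


def holds_row_wise(matrix: Matrix) -> bool:
    """RAVEN-style check: the rule must hold along every row."""
    return all(is_progression(row) for row in matrix)


def holds_column_wise(matrix: Matrix) -> bool:
    """Same check applied to columns, obtained by transposing the grid."""
    return all(is_progression(col) for col in zip(*matrix))


def holds_row_or_column_wise(matrix: Matrix) -> bool:
    """PGM-style check: both directions are inspected for the rule."""
    return holds_row_wise(matrix) or holds_column_wise(matrix)


# Hypothetical attribute values (e.g. size levels) for two toy problems.
row_encoded = [[1, 2, 3], [4, 5, 6], [2, 3, 4]]
column_encoded = [[1, 5, 2], [2, 6, 3], [3, 7, 4]]

print(holds_row_wise(row_encoded), holds_row_wise(column_encoded))
print(holds_row_or_column_wise(column_encoded))
```

A solver that inspects rows only would miss the rule in `column_encoded`, which is why both directions have to be considered for PGM.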
2.2 RPM Solvers
The literature on RPM solvers has expanded rapidly in recent years. Here we roughly divide solvers into two categories. The first
comprises end-to-end black-box solvers, which account for the majority of previous work. The second leverages symbolic
AI to obtain results beyond reasoning accuracy, such as interpretability.
The end-to-end black-box models focus on improving the reasoning accuracy on RPM problems. Early works show
that prevalent visual models fail to solve RPM problems, and that adding extra labels containing structure
or rule information improves the results to some extent [18, 19]. In LEN [21], researchers argue that the main challenge in solving
RPM problems is the elimination of distracting information. CoPINet [22] and DCNet [23] leverage
contrastive learning in reasoning. MRNet [24] shows that retrieving features from different, serially connected CNN blocks
helps the model capture multiple visual attributes simultaneously; it is also the first work to report that extra
meta-data jeopardizes network performance. In SCL [25], tensor scattering is performed so that each scattered part
attends to specific visual attributes or rules. SAVIR-T [26] extracts intra-image information and inter-image relations
to facilitate reasoning.
Symbolic-AI-powered methods bring higher reasoning accuracy and stronger model interpretability. In PrAE [27],
a neural-symbolic system performs probabilistic abduction and execution to generate an answer image. ALANS [28]
removes the prior knowledge required in PrAE and outperforms monolithic end-to-end models in terms of
generalization ability. NVSA [29] uses holographic vectorized representations and ground-truth attribute values to build
a neural-symbolic model.
On the one hand, our methods build on the successes of previous models. Specifically, we fully utilize the inductive
bias of the RPM problem, as MRNet and SAVIR-T do, and adopt the encoder architecture of MRNet in one of our
models. On the other hand, the active expression of inductive bias in our models, together with the unique multi-viewpoint
and multi-evaluation strategies, makes our models stand out from previous models in terms of reasoning accuracy.