setting an upper bound for the route of dynamics prediction. In the third experiment, we replace the
ground-truth dynamics with predicted dynamics from an advanced future prediction model [62] to see
how LfD performs in practice. The results from these two experiments reveal that precise dynamics
significantly improve model performance, but the predicted approximations fail to work as expected:
the approximation introduces biases that cause performance to degenerate to that of LfI, or even
worse in certain tasks. In the fourth experiment, through a series of evaluations of various LfI
models, we conclude that LfI could be a simpler and more practical paradigm in physical reasoning.
Nevertheless, breakthroughs in physical prediction remain promising, though challenging. We
hope that our discussions shed light on future studies on physical reasoning.
2 Related Work
Intuitive physics and physical reasoning
Since Battaglia et al. [6], the computational aspects of intuitive physics have attracted research
attention [43, 75]; intuitive physics and stability have since been further incorporated into
object [77, 64, 49, 17, 50, 72], scene [74, 65, 73, 76, 47, 30, 9, 63], and task [26, 27, 32]
understanding. This progress enables machines to learn to judge (i) which object is heavier after
observing two objects collide [21, 54, 59], (ii) whether a stacked block tower will fall
[6, 24, 46], (iii) whether water in two different containers will pour at the same angle if tilted
[41, 57], or, more generally, how liquids behave [5], and (iv) the dynamic behaviors of various
materials [42]. However, this line of work primarily focuses on physical tasks without long-term
dynamics, using either knowledge-based approaches [6, 54, 70] or learning-based approaches [24, 46].
More complex physical reasoning problems [1, 3, 69], including those involving question answering
[10, 11, 14, 29, 71], have also been studied. In particular, Allen et al. [1] propose to use
knowledge-based simulation; Xu et al. [68] adopt a Bayesian symbolic method; Battaglia et al. [7],
Girdhar et al. [23], and Qi et al. [53] employ graph-based interaction networks. However, none of
these methods fully justifies the necessity of dynamics prediction. In this work, we challenge this
fundamental assumption and point out a simpler, effective, yet overlooked solution.
PHYRE and relevant environments
Bakhtin et al. [3] introduce the novel physical reasoning task of PHYRE, wherein an agent is tasked
with finding an action in an initial scene that reaches the goal under physical laws. Current
methods for solving PHYRE include reinforcement learning (e.g., DQN [3]) and forward-prediction
neural networks with pixel-based [23] and object-based [23, 53] representations. Notably, Girdhar
et al. [23] adopt different kinds of forward-prediction architectures to perform PHYRE tasks but
fail to obtain significant performance improvement, whereas Qi et al. [53] design a convolutional
interaction network to learn long-term dynamics, achieving SOTA performance by leveraging
ground-truth information about object states in the physical scenes.
In fact, the physical reasoning task of PHYRE can be regarded either as an image classification
task, judging whether an initial action would lead to a successful outcome, or as a video
classification task, considering the dynamics after the initial action is performed. The success of
the former rested on deep convolutional neural networks [28, 40] and has now shifted to
Transformer-based models [4, 15, 51]. The change from convolutional architectures to attentional
models has also inspired recent advances in video classification. Models in this domain, such as
TimeSformer [8] and ViViT [2], have also been extended to fields such as action recognition [22]
and group activity recognition [20].
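As an illustration of the image classification view, the following is a minimal sketch rather than any cited method: a classifier assigns each candidate action a success score from the initial scene alone, and the agent tries actions in descending score order. The names `rank_actions` and `toy_score`, the `(x, y, radius)` action encoding, and the dictionary scene are hypothetical placeholders; a real system would substitute a trained image classifier for `toy_score`.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical action encoding: (x, y, radius) of an object placed in the scene.
Action = Tuple[float, float, float]
Scene = Dict[str, float]  # placeholder for an initial-scene image


def rank_actions(
    scene: Scene,
    candidates: Sequence[Action],
    score_fn: Callable[[Scene, Action], float],
) -> List[Action]:
    """Sort candidate actions by predicted success score, highest first."""
    scored = [(score_fn(scene, action), action) for action in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [action for _, action in scored]


def toy_score(scene: Scene, action: Action) -> float:
    """Stand-in for a trained classifier (assumption, not a learned model).

    It simply prefers actions whose x coordinate is close to `target_x`.
    """
    x, _, _ = action
    return 1.0 / (1.0 + abs(x - scene["target_x"]))


scene = {"target_x": 0.5}
candidates = [(0.1, 0.9, 0.05), (0.5, 0.9, 0.05), (0.8, 0.9, 0.05)]
ranked = rank_actions(scene, candidates, toy_score)
print(ranked[0])  # the action the agent would try first
```

The video classification view differs only in the input to the scorer: it would receive frames of the (ground-truth or predicted) rollout in addition to the initial scene.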
Unlike PHYRE, which directly focuses on physical reasoning in a simplified virtual environment,
some benchmarks include physical reasoning in their environments more implicitly and take physics
as an aid to finish tasks in real life, such as those in autonomous driving [16] and embodied AI
[39, 67, 19, 55, 48]. Robotic controllers based on physics engines [25, 45, 38, 12, 60], navigation
tasks in 3D physical scenes [66], and, more broadly, task and motion planners [61, 37, 33, 35, 34]
may also need physical understanding modules in the system.
Dynamics prediction
Predicting dynamics into the future is one of the most extensively studied topics in the vision
community. One modern approach is to extract image representations and feed them to an RNN
predictor [58, 62] or a cycle GAN-based model [44]. However, these approaches cannot extract robust
representations from the pixels and incur accumulated errors in long-term prediction. To tackle
this problem, Janner et al. [31] and Qi et al. [53] focus on object-centric representations; these
are task-specific solutions with various inductive biases (e.g., spatial information, the number
of objects), and their performance drops when dealing with multiple objects with occlusions [52].