On the Learning Mechanisms in Physical Reasoning
Shiqian Li⋆,1,2,5, Kewen Wu⋆,3,5, Chi Zhang4,5, Yixin Zhu1,2
1School of Intelligence Science and Technology, Peking University
2Institute for Artificial Intelligence, Peking University
3Department of Automation, Tsinghua University
4Department of Computer Science, University of California, Los Angeles
5Beijing Institute for General Artificial Intelligence (BIGAI)
Project Website https://lishiqianhugh.github.io/LfID_Page
Abstract
Is dynamics prediction indispensable for physical reasoning? If so, what kind of roles do the dynamics prediction modules play during the physical reasoning process? Most studies focus on designing dynamics prediction networks and treating physical reasoning as a downstream task without investigating the questions above, taking for granted that the designed dynamics prediction would undoubtedly help the reasoning process. In this work, we take a closer look at this assumption, exploring this fundamental hypothesis by comparing two learning mechanisms: Learning from Dynamics (LfD) and Learning from Intuition (LfI). In the first experiment, we directly examine and compare these two mechanisms. Results show a surprising finding: simple LfI is better than or on par with state-of-the-art LfD. This observation leads to the second experiment with Ground-truth Dynamics (GD), the ideal case of LfD wherein dynamics are obtained directly from a simulator. Results show that dynamics, if directly given instead of approximated, would achieve much higher performance than LfI alone on physical reasoning; this essentially serves as the performance upper bound. Yet practically, the LfD mechanism can only predict Approximate Dynamics (AD) using dynamics learning modules that mimic the physical laws, making the downstream physical reasoning modules degenerate into the LfI paradigm; see the third experiment. We note that this issue is hard to mitigate, as dynamics prediction errors inevitably accumulate over the long horizon. Finally, in the fourth experiment, we note that LfI, the far simpler strategy when done right, is more effective in learning to solve physical reasoning problems. Taken together, the results on the challenging benchmark of PHYRE [3] show that LfI is, if not better, as good as LfD with bells and whistles for dynamics prediction. However, the potential improvement from LfD, though challenging, remains lucrative.
1 Introduction
Humans possess a distinctive ability of understanding physical concepts and performing complex physical reasoning. The literature on human learning mechanisms for solving physical reasoning problems can be categorized into two schools of thought [43]: (i) physical intuition at a glance without much thinking, such as judging whether a stacked block tower will collapse [6], and (ii) more
⋆ indicates equal contribution. † indicates corresponding authors.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.02075v1 [cs.LG] 5 Oct 2022
Figure 1: Comparison of the two learning mechanisms. Learning from Intuition (LfI) learns from intuition by directly using a classifier (first row in yellow). Based on the source of dynamics, we divide Learning from Dynamics (LfD) into learning with Approximate Dynamics (AD) (second row in green) and Ground-truth Dynamics (GD) (third row in blue). Specifically, learning with GD leverages ground-truth dynamics from the simulator, whereas learning with AD predicts how the objects' positions and poses unfold via a dynamics predictor. In theory, learning with AD should reach the same performance as learning with GD if dynamics prediction were perfect; i.e., learning with GD can be regarded as the ideal case of learning with AD. However, in practice, learning with AD usually degenerates into LfI due to very inaccurate prediction.
extensive unfolding of states under the assumed physical dynamics when facing complex physical tasks [7]. Such a disparity between System-1- and System-2-like problem-solving strategies [36] motivates us to think over the learning mechanisms for physical reasoning in machines:

When performing physical reasoning, is it better for machines to learn from intuition by simply analyzing the static physical structure, or to learn from dynamics by predicting future states?
Recent physical reasoning benchmarks for machine learning are mostly physics engines focusing on evaluating the task-solving abilities of models. For example, PHYRE [3] and Virtual Tools [1] contain physical scenes with long-term dynamics and complex physics interactions. These environments have various tasks with explicit goals, such as making the green object touch the blue object by placing a new red ball into the initial scene. An artificial agent is tasked to predict the final outcome, e.g., whether the placed red ball will successfully solve the given task.
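An agent of this kind can be framed as scoring many candidate actions by their predicted chance of success and committing to the best one. The sketch below illustrates that outer loop only; `toy_score` is a made-up stand-in for a learned success predictor, and the target point is hypothetical:

```python
def choose_action(candidates, score_fn):
    """Rank candidate actions and return the one predicted most likely to
    solve the task (the common outer loop of PHYRE-style agents)."""
    return max(candidates, key=score_fn)

# toy_score is a hypothetical stand-in for a learned success predictor:
# it prefers red-ball placements near a made-up target point (0.5, 0.5).
def toy_score(action):
    x, y, r = action  # ball center (x, y) and radius r
    return -((x - 0.5) ** 2 + (y - 0.5) ** 2)

candidates = [(0.1, 0.9, 0.05), (0.5, 0.4, 0.05), (0.8, 0.2, 0.05)]
best = choose_action(candidates, toy_score)
```

In a real agent, `score_fn` would be the learned model's success probability; everything else about the loop stays the same.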
Existing problem-solving methods approach such physical reasoning problems by designing various future prediction modules [23, 53]. These modules are devised under the assumption that human brains inherently possess a simplified physics engine (called the intuitive physics engine) [18], akin to a computer game simulator, capable of predicting objects' future states and changes.

Although the intuitive theory claims that humans can predict physical outcomes rapidly, it offers little direct guidance at the computational level beyond the hypothesis that we might have a physics-engine-like mechanism in our brain [18, 43]. Critically, although humans can predict dynamics under the intuitive theory, dynamics prediction might not be necessary for all types of physical reasoning tasks. This hypothesis is largely left untouched, especially at the computational level. As noted by Lerer et al. [46], directly learning by intuition without dynamics prediction is sufficient for various physical reasoning tasks. Of note, this hypothesis does not contradict the intuitive theory but rather provides a new perspective at the computational level.
In this paper, we conduct a series of experiments to answer the above questions empirically. To the best of our knowledge, ours is the first to systematically compare the LfI and LfD paradigms. In the first experiment, we verify the simple approach of LfI by training a classifier [15] to predict whether an action would lead to success in problem-solving. Surprisingly, in the preliminary study, such a model already reaches state-of-the-art (SOTA) performance and even outperforms existing LfD methods in unseen scenarios, indicating better generalization. Inspired by this counter-intuitive result, we conduct more experiments on the two learning mechanisms; see Fig. 1 for an illustration. In the second experiment, we first set out to investigate whether LfD could work better than LfI in theory by measuring the performance of a video-based classifier [8] using the ground-truth dynamics, setting an upper bound for the route of dynamics prediction. In the third experiment, we replace the ground-truth dynamics with predicted dynamics from an advanced future prediction model [62] to see how LfD performs in practice. The results from these two experiments reveal that precise dynamics significantly improve model performance, but the predicted approximations fail to work as expected: approximation introduces disturbing biases and causes the performance to degenerate into LfI or even worse in certain tasks. In the fourth experiment, through a series of experiments on various LfI models, we conclude that LfI could be a simpler and more practical paradigm in physical reasoning. However, making breakthroughs in physical prediction is still promising though challenging. We hope that our discussions shed light on future studies on physical reasoning.
2 Related Work
Intuitive physics and physical reasoning
Since Battaglia et al. [6], the computational aspects of intuitive physics have attracted research attention [43, 75]; intuitive physics and stability have since been further incorporated into complex object [77, 64, 49, 17, 50, 72], scene [74, 65, 73, 76, 47, 30, 9, 63], and task [26, 27, 32] understanding tasks. The progress enables machines to learn to judge (i) which object is heavier after observing two objects collide [21, 54, 59], (ii) whether a stacked block tower will fall [6, 24, 46], (iii) whether water in two different containers will pour at the same angle if tilted [41, 57], or the behavior of liquid in general [5], and (iv) the dynamics of various materials [42]. However, this line of work primarily focuses on physical tasks without long-term dynamics, either via knowledge-based approaches [6, 54, 70] or learning-based approaches [24, 46].
More complex physical reasoning problems [1, 3, 69], including those involving question answering [10, 11, 14, 29, 71], have also been studied. In particular, Allen et al. [1] propose to use knowledge-based simulation; Xu et al. [68] adopt a Bayesian symbolic method; Battaglia et al. [7], Girdhar et al. [23], and Qi et al. [53] recruit graph-based interaction networks. However, none of these methods fully justifies the necessity of dynamics prediction. In this work, we challenge this fundamental assumption and point out a simpler, efficacious, but overlooked solution.
PHYRE and relevant environments
Bakhtin et al. [3] introduce the novel physical reasoning task of PHYRE, wherein an agent is tasked with finding an action in an initial scene that reaches the goal under physical laws. Current methods for solving PHYRE include reinforcement learning (e.g., DQN [3]) and forward prediction neural networks with pixel-based representation [23] or object-based representation [23, 53]. Notably, Girdhar et al. [23] adopt different kinds of forward prediction architectures to perform PHYRE tasks but fail to obtain significant performance improvement, whereas Qi et al. [53] design a convolutional interaction network to learn long-term dynamics, achieving SOTA performance by leveraging ground-truth information about object states in the physical scenes.

In fact, the physical reasoning task of PHYRE can be regarded as an image classification task of judging whether an initial action would lead to a successful outcome, or a video classification task by considering the dynamics after the initial action is performed. The success of the former rested on deep convolutional neural networks [28, 40] and has now shifted to Transformer-based models [4, 15, 51]. The shift from convolutional architectures to attentional models also inspires recent advances in video classification. Models such as TimeSformer [8] and ViViT [2] in this domain also expand into fields such as action recognition [22] and group activity recognition [20].
Unlike PHYRE, which directly focuses on physical reasoning in a simplified virtual environment, some benchmarks include physical reasoning more implicitly and use physics as an aid to finish real-life tasks, such as those in autonomous driving [16] and embodied AI [39, 67, 19, 55, 48]. Robotic controllers based on physics engines [25, 45, 38, 12, 60], navigation tasks in 3D physical scenes [66], and, more broadly, task and motion planners [61, 37, 33, 35, 34] may also need physical understanding modules in the system.
Dynamics prediction
Predicting dynamics into the future is one of the most extensively studied topics in the vision community. One modern approach is to extract an image representation and incorporate an RNN predictor [58, 62] or a cycle-GAN-based approach [44]. However, these approaches cannot extract robust representations from the pixels and incur accumulated errors in long-term prediction. To tackle this problem, Janner et al. [31] and Qi et al. [53] focus on object-centric representations; these are task-specific solutions with various inductive biases (e.g., spatial information, the number of objects), and their performance drops when dealing with multiple objects with occlusions [52].
3 The Two Learning Mechanisms
In this section, we define the two learning mechanisms for solving physical reasoning problems. Henceforth, we denote all objects' states at time t as Xt. Given an initial background image I of a physical setup and a random distribution of actions A, the model needs to learn a distribution of the final outcome P(y | X0), where X0 = {A, I} and y denotes the possible outcome.
Mechanism 1. Learning from Intuition (LfI)
In LfI, the outcome y is learned directly from the initial images and actions using a task-solution model f:

P(y | X0) = f(X0; θ), (1)

where θ denotes the parameters of the task-solution model f(·). We call this mechanism LfI because f(·) can be viewed as an intuitive map from the initial conditions to the outcome.
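On this reading, Eq. (1) is a single learned map from X0 to a success probability. A minimal pure-Python sketch follows; the toy logistic scorer and the 4-dimensional feature vector are hypothetical stand-ins for the deep classifier actually used as f:

```python
import math
import random

def lfi_classifier(x0, theta):
    """LfI (Eq. 1): P(y = solved | X0) = f(X0; theta), a direct map from
    the initial state to an outcome probability with no dynamics rollout."""
    logit = sum(w * x for w, x in zip(theta["w"], x0)) + theta["b"]
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid squashes to (0, 1)

# Toy usage: X0 = {A, I} flattened into a 4-dim feature vector.
random.seed(0)
theta = {"w": [random.uniform(-1.0, 1.0) for _ in range(4)], "b": 0.0}
x0 = [0.2, -0.5, 0.1, 0.9]
p_solved = lfi_classifier(x0, theta)
```

The point of the sketch is structural: nothing between input and output models how the scene evolves over time.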
Mechanism 2. Learning from Dynamics (LfD)
The nature of physics is inherently dynamic. As such, in LfD, y is no longer directly learned from the initial scenes; instead, this approach first learns the underlying dynamics D = {Xt | t = 0, 1, . . . , T} within a time window T using a dynamics prediction module g(·), and then predicts the outcome from the predicted dynamics. Formally, the forward process is described as below:

P(y | X0) = f(D; θ), where D = g(X0; φ), (2)

where φ represents the parameters of the dynamics prediction model g(·). Usually, g(·) is implemented as either an auto-regressive module based on pixel representation or a graph-based interaction network based on object representation. In this work, we consider two optimization schedules for g(·): parallel optimization during joint learning of f(·), or serial optimization by learning and fixing g(·) first; please refer to Algs. 1 and 2 for more details.
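Eq. (2)'s factored forward pass, where y is predicted from a rolled-out trajectory rather than from X0 directly, can be sketched as follows. The damped-drift update inside g and the threshold rule inside f are hypothetical toy stand-ins for the learned modules:

```python
def rollout_dynamics(x0, phi, horizon):
    """g(.): unroll approximate dynamics D = {X_t : t = 0..T} from X0.
    A hypothetical damped-drift update stands in for a learned predictor."""
    states = [list(x0)]
    for _ in range(horizon):
        states.append([phi["damping"] * s + phi["drift"] for s in states[-1]])
    return states

def lfd_predict(x0, phi, theta, horizon=10):
    """LfD (Eq. 2): P(y | X0) = f(D; theta) with D = g(X0; phi)."""
    dynamics = rollout_dynamics(x0, phi, horizon)
    # f(.): here a toy threshold on the mean of the final predicted state;
    # the real f would be a classifier over the whole trajectory.
    final = dynamics[-1]
    return 1.0 if sum(final) / len(final) > theta["threshold"] else 0.0

phi = {"damping": 0.9, "drift": 0.05}
theta = {"threshold": 0.3}
y = lfd_predict([1.0, 0.5, 0.0], phi, theta)
```

The structural contrast with LfI is that any error in g's rollout is inherited by f, which is exactly the failure mode studied in the third experiment.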
Algorithm 1: Parallel optimization of LfD
Variables: I is the initial background image. A is the action. f(·) and g(·) are the task-solution model and the dynamics prediction model, respectively. D and y denote the predicted dynamics and outcome, and Dgt and ygt the ground-truth ones. The dynamics loss and cross-entropy loss are denoted as Ld(D, Dgt) and Le(y, ygt), respectively. α and β are hyperparameters to balance the two losses.
1: repeat
2:   Predict the dynamics D from I and A using g(·);
3:   Predict the outcome y from D using f(·);
4:   Compute the total loss Ltotal(D, y, Dgt, ygt) = αLd(D, Dgt) + βLe(y, ygt);
5:   Optimize f(·) and g(·) simultaneously using the gradient of the total loss Ltotal(D, y, Dgt, ygt);
6: until max iteration
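The loop in Algorithm 1 can be sketched on a deliberately tiny problem: a scalar "dynamics" model g and a scalar task-solution model f trained jointly on the weighted loss αLd + βLe. The finite-difference gradients and the squared-error stand-in for the cross-entropy loss are simplifications for illustration only:

```python
def total_loss(params, x0, d_gt, y_gt, alpha=1.0, beta=1.0):
    """L_total = alpha * L_d(D, D_gt) + beta * L_e(y, y_gt) on scalars."""
    d = params["phi"] * x0        # g(.): toy one-parameter dynamics model
    y = params["theta"] * d       # f(.): toy one-parameter outcome model
    return alpha * (d - d_gt) ** 2 + beta * (y - y_gt) ** 2

def parallel_step(params, x0, d_gt, y_gt, lr=0.01, eps=1e-5):
    """One joint update of f and g (Algorithm 1, lines 2-5) using
    finite-difference gradients of the total loss."""
    base = total_loss(params, x0, d_gt, y_gt)
    updated = {}
    for k in params:
        bumped = dict(params)
        bumped[k] += eps
        grad = (total_loss(bumped, x0, d_gt, y_gt) - base) / eps
        updated[k] = params[k] - lr * grad
    return updated

params = {"phi": 0.0, "theta": 0.0}
for _ in range(500):  # "repeat ... until max iteration"
    params = parallel_step(params, x0=1.0, d_gt=2.0, y_gt=1.0)
```

Note that both parameters receive gradient from the same total loss at every step, which is what distinguishes this schedule from the serial one below.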
Algorithm 2: Serial optimization of LfD
Variables: I is the initial background image. A is the action. f(·) and g(·) are the task-solution model and the dynamics prediction model, respectively. D and y denote the predicted dynamics and outcome, and Dgt and ygt the ground-truth ones.
1: repeat
2:   Predict the dynamics D from I and A using g(·);
3:   Compute the dynamics loss Ld(D, Dgt);
4:   Optimize g(·) using the gradient of the dynamics loss Ld(D, Dgt);
5: until max iteration
6: Freeze g(·);
7: repeat
8:   Predict the dynamics D from I and A using the pre-trained g(·);
9:   Predict the final outcome y from D using f(·);
10:  Compute the cross-entropy loss Le(y, ygt);
11:  Optimize f(·) using the gradient of the cross-entropy loss Le(y, ygt);
12: until max iteration
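Algorithm 2's two stages can be sketched on the same toy scalar setup (again a hypothetical stand-in, with squared error replacing the cross-entropy loss): first fit g on the dynamics loss alone, freeze it, then fit f on the outcome loss through the frozen g:

```python
def fit_dynamics(x0, d_gt, steps=200, lr=0.05):
    """Stage 1 (Algorithm 2, lines 1-5): fit g's parameter phi on the
    dynamics loss L_d alone, with an exact gradient on this toy problem."""
    phi = 0.0
    for _ in range(steps):
        phi -= lr * 2 * (phi * x0 - d_gt) * x0   # d/dphi (phi*x0 - d_gt)^2
    return phi

def fit_outcome(phi, x0, y_gt, steps=200, lr=0.05):
    """Stage 2 (lines 6-12): freeze phi, then fit f's parameter theta on
    the outcome loss, reading dynamics only through the frozen g."""
    d = phi * x0                                  # D from the pre-trained g
    theta = 0.0
    for _ in range(steps):
        theta -= lr * 2 * (theta * d - y_gt) * d  # d/dtheta (theta*d - y_gt)^2
    return theta

phi = fit_dynamics(x0=1.0, d_gt=2.0)        # settles near phi = 2
theta = fit_outcome(phi, x0=1.0, y_gt=1.0)  # settles near theta = 0.5
```

Because theta is trained against whatever dynamics the frozen g produces, any systematic error in phi would propagate directly into f, mirroring how approximate dynamics degrade the downstream reasoning module.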
While the dynamics prediction step and the outcome prediction step can be integrated together, it is worth noting that in LfD, additional architectural changes and supervisory signals are necessary to learn the underlying dynamics, without which the paradigm degenerates into LfI. Ideally, a physics engine or a simulator plays the role of future prediction. However, inversely learning the physical laws [56, 13] is very demanding due to the intrinsic challenges of long-term prediction.