

On the Learning Mechanisms in Physical Reasoning

Shiqian Li*,1,2,5, Kewen Wu*,3,5, Chi Zhang4,5, Yixin Zhu1,2

1School of Intelligence Science and Technology, Peking University

2Institute for Artificial Intelligence, Peking University

3Department of Automation, Tsinghua University

4Department of Computer Science, University of California, Los Angeles

5Beijing Institute for General Artificial Intelligence (BIGAI)

Project Website https://lishiqianhugh.github.io/LfID_Page

Abstract

Is dynamics prediction indispensable for physical reasoning? If so, what roles do dynamics prediction modules play during the physical reasoning process? Most studies focus on designing dynamics prediction networks and treat physical reasoning as a downstream task without investigating these questions, taking for granted that the designed dynamics prediction would undoubtedly help the reasoning process. In this work, we take a closer look at this assumption by comparing two learning mechanisms: Learning from Dynamics (LfD) and Learning from Intuition (LfI). In the first experiment, we directly examine and compare these two mechanisms. Results show a surprising finding: simple LfI is better than or on par with state-of-the-art LfD. This observation leads to the second experiment with Ground-truth Dynamics (GD), the ideal case of LfD wherein dynamics are obtained directly from a simulator. Results show that dynamics, if directly given instead of approximated, would achieve much higher performance than LfI alone on physical reasoning; this essentially serves as the performance upper bound. Yet in practice, the LfD mechanism can only predict Approximate Dynamics (AD) using dynamics learning modules that mimic the physical laws, making the downstream physical reasoning modules degenerate into the LfI paradigm; see the third experiment. We note that this issue is hard to mitigate, as dynamics prediction errors inevitably accumulate over long horizons. Finally, in the fourth experiment, we note that LfI, the far simpler strategy, is more effective in learning to solve physical reasoning problems when done right. Taken together, the results on the challenging PHYRE benchmark [3] show that LfI is, if not better, as good as LfD with bells and whistles for dynamics prediction. However, the potential improvement from LfD, though challenging to realize, remains attractive.

1 Introduction

Humans possess a distinctive ability to understand physical concepts and perform complex physical reasoning. The literature on human learning mechanisms for solving physical reasoning problems can be categorized into two schools of thought [ ]: (i) physical intuition at a glance without much thinking, such as judging whether a stacked block tower will collapse [ ], and (ii) more extensive unfolding of states under the assumed physical dynamics when facing complex physical tasks [ ]. Such a disparity between System-1-like and System-2-like problem-solving strategies [ ] motivates us to reconsider the learning mechanisms for physical reasoning in machines:

When performing physical reasoning, is it better for machines to learn from intuition by simply analyzing the static physical structure, or to learn from dynamics by predicting future states?

[Figure 1 diagram: the initial scene is passed to a classifier either directly (LfI), through a predictor that generates dynamic sequences (LfD with AD), or through a simulator that provides dynamic sequences (LfD with GD). Each classifier outputs Solved or Unsolved. AD degenerates into LfI; GD idealizes AD.]

Figure 1: Comparison of the two learning mechanisms. Learning from Intuition (LfI) learns from intuition by directly using a classifier (first row in yellow). Based on the source of dynamics, we divide Learning from Dynamics (LfD) into learning with Approximate Dynamics (AD) (second row in green) and Ground-truth Dynamics (GD) (third row in blue). Specifically, learning with GD leverages ground-truth dynamics from the simulator, whereas learning with AD predicts how the objects' positions and poses unfold via a dynamics predictor. In theory, learning with AD should reach the same performance as learning with GD if dynamics prediction were perfect; i.e., learning with GD can be regarded as the ideal case of learning with AD. However, in practice, learning with AD usually degenerates into LfI due to very inaccurate prediction.

* indicates equal contribution.

indicates corresponding authors.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

arXiv:2210.02075v1 [cs.LG] 5 Oct 2022

Recent physical reasoning benchmarks for machine learning are mostly physics engines that evaluate the task-solving abilities of models. For example, PHYRE [3] and Virtual Tools [ ] contain physical scenes with long-term dynamics and complex physical interactions. These environments offer various tasks with explicit goals, such as making the green object touch the blue object by placing a new red ball into the initial scene. An artificial agent is tasked with predicting the final outcome, e.g., whether the placed red ball will successfully solve the given task.

Existing problem-solving methods approach such physical reasoning problems by designing various future prediction modules [ ]. These modules are devised under the assumption that human brains inherently possess a simplified physics engine (called the intuitive physics engine) [ ], akin to a computer game simulator, capable of predicting objects' future states and changes.

Although the intuitive theory claims that humans can predict physical outcomes rapidly, it offers little guidance at the computational level beyond the hypothesis that we might have a physics-engine-like mechanism in our brain [ ]. Critically, although humans can predict dynamics under the intuitive theory, dynamics prediction might not be necessary for all types of physical reasoning tasks. This hypothesis is largely left untouched, especially at the computational level. As noted by Lerer et al. [46], directly learning by intuition without dynamics prediction is sufficient for various physical reasoning tasks. Of note, this hypothesis does not contradict the intuitive theory but rather provides a new perspective at the computational level.

In this paper, we conduct a series of experiments to answer the above questions empirically. To the best of our knowledge, ours is the first work to systematically compare the LfI and LfD paradigms. In the first experiment, we verify the simple approach of LfI by training a classifier [ ] to predict whether an action would lead to success in problem-solving. Surprisingly, in the preliminary study, such a model already reaches state-of-the-art (SOTA) performance and even outperforms existing LfD methods in unseen scenarios, indicating better generalization. Inspired by this counter-intuitive result, we conduct more experiments on the two learning mechanisms; see Fig. 1 for an illustration.

In the second experiment, we first set out to investigate whether LfD could, in theory, work better than LfI by measuring the performance of a video-based classifier [ ] using the ground-truth dynamics, setting an upper bound for the route of dynamics prediction. In the third experiment, we replace the ground-truth dynamics with predicted dynamics from an advanced future prediction model [ ] to see how LfD performs in practice. The results from these two experiments reveal that precise dynamics significantly improve model performance, but the predicted approximations fail to work as expected: approximation introduces disruptive biases and causes the performance to degenerate to that of LfI, or even worse in certain tasks. In the fourth experiment, through a series of experiments on various LfI models, we conclude that LfI could be a simpler and more practical paradigm in physical reasoning. Nevertheless, breakthroughs in physical prediction remain promising, though challenging. We hope that our discussions shed light on future studies on physical reasoning.

2 Related Work

Intuitive physics and physical reasoning

Since Battaglia et al. [6], the computational aspects of intuitive physics have attracted research attention [ ]; intuitive physics and stability have since been further incorporated in complex object [ ], scene [ ], and task [ ] understanding. This progress enables machines to learn to judge (i) which object is heavier after observing two objects collide [21, 54, 59], (ii) whether a stacked block tower will fall [ ], (iii) whether water in two different containers will pour at the same angle if tilted [ ], or liquid behavior in general [ ], and (iv) behaviors of dynamics with various materials [ ]. However, this line of work primarily focuses on physical tasks without long-term dynamics, using either knowledge-based approaches [6, 54, 70] or learning-based approaches [24, 46].

More complex physical reasoning problems [ ], including those involving question answering [ ], have also been studied. In particular, Allen et al. [1] propose to use knowledge-based simulation; Xu et al. [68] adopt a Bayesian symbolic method; Battaglia et al. [7], Girdhar et al. [23], and Qi et al. [53] employ graph-based interaction networks. However, none of these methods fully justifies the necessity of dynamics prediction. In this work, we challenge this fundamental assumption and point out a simpler, efficacious, but overlooked solution.

PHYRE and relevant environments

Bakhtin et al. [3] introduce the novel physical reasoning task of PHYRE, wherein an agent is tasked with finding an action in an initial scene to reach the goal under physical laws. Current methods for solving PHYRE include reinforcement learning (e.g., DQN [ ]), forward prediction neural networks with pixel-based representation [ ], and object-based representation [ ]. Notably, Girdhar et al. [23] adopt different kinds of forward prediction architectures to perform PHYRE tasks but fail to obtain significant performance improvement, whereas Qi et al. [53] design a convolutional interaction network to learn long-term dynamics, achieving SOTA performance by leveraging ground-truth information about object states in the physical scenes.

In fact, the physical reasoning task of PHYRE can be regarded as an image classification task of judging whether an initial action would lead to a successful outcome, or as a video classification task that considers the dynamics after the initial action is performed. The success of the former rested on deep convolutional neural networks [ ] and has now shifted to Transformer-based models [ ]. The change from convolutional architectures to attentional models has also inspired recent advances in video classification. Models such as TimeSformer [ ] and ViViT [ ] in this domain have also expanded into fields such as action recognition [22] and group activity recognition [20].

Unlike PHYRE, which directly focuses on physical reasoning in a simplified virtual environment, some benchmarks include physical reasoning in their environments more implicitly and take physics as an aid to finish real-life tasks, such as those in autonomous driving [ ] and embodied AI [ ]. Robotic controllers based on physics engines [ ], navigation tasks in 3D physical scenes [ ], and, more broadly, task and motion planners [ ] may also need physical understanding modules in the system.

Dynamics prediction

Predicting dynamics into the future is one of the most extensively studied topics in the vision community. One modern approach is to extract an image representation and incorporate an RNN predictor [ ] or a cycle-GAN-based approach [ ]. However, these approaches cannot extract robust representations from pixels and incur accumulated errors in long-term prediction. To tackle this problem, Janner et al. [31] and Qi et al. [53] focus on object-centric representation; these are task-specific solutions with various inductive biases (e.g., spatial information, the number of objects), and their performance drops when dealing with multiple objects with occlusions [52].

3 The Two Learning Mechanisms

In this section, we define the two learning mechanisms for solving physical reasoning problems. Henceforth, we denote all objects' states at time $t$ as $X_t$. Given an initial background image $I$ of a physical setup and a random distribution of actions $A$, the model needs to learn a distribution of the final outcome $P(y \mid X_0)$, where $X_0 = \{A, I\}$ and $y$ denotes the possible outcome.

Mechanism 1. Learning from Intuition (LfI)

In LfI, the outcome $y$ is learned directly from the initial images and actions using a task-solution model $f$:

$$P(y \mid X_0) = f(X_0; \theta), \tag{1}$$

where $\theta$ denotes the parameters of the task-solution model $f(\cdot)$. We call this mechanism LfI because $f(\cdot)$ can be viewed as an intuitive map from the initial conditions to the outcome.
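As a minimal sketch of Eq. (1), the snippet below maps an encoded initial state $X_0$ straight to a success probability. The logistic form of $f$, the flattening encoder, the 4x4 toy image, and the three-number action are all illustrative assumptions; the experiments in this paper use learned image classifiers rather than this toy model.

```python
import numpy as np

def encode_initial_state(image, action):
    """Build X0 = {A, I} by flattening the image I and appending the action A (assumed encoding)."""
    return np.concatenate([np.asarray(image, dtype=float).ravel(),
                           np.asarray(action, dtype=float).ravel()])

def lfi_predict(x0, theta):
    """Eq. (1): P(y = solved | X0) = f(X0; theta), with f a logistic model (assumption)."""
    w, b = theta
    return 1.0 / (1.0 + np.exp(-(x0 @ w + b)))

rng = np.random.default_rng(0)
image = rng.random((4, 4))           # toy stand-in for the initial scene
action = np.array([0.3, 0.7, 0.1])   # hypothetical action parameters
x0 = encode_initial_state(image, action)

theta = (np.zeros(x0.shape[0]), 0.0)  # untrained parameters
p_solved = lfi_predict(x0, theta)     # no dynamics are predicted at any point
```

Note that no intermediate state $X_t$ for $t > 0$ appears anywhere in the computation, which is the defining property of LfI.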

Mechanism 2. Learning from Dynamics (LfD)

The nature of physics is inherently dynamic. As such, in LfD, $y$ is no longer directly learned from the initial scenes; instead, this approach first learns the underlying dynamics $D = \{X_t \mid t = 0, 1, \ldots, T\}$ within a time window $T$ using a dynamics prediction module $g(\cdot)$, and then predicts the outcome from the predicted dynamics. Formally, the forward process is described as follows:

$$P(y \mid X_0) = f(D; \theta), \quad \text{where } D = g(X_0; \phi), \tag{2}$$

where $\phi$ represents the parameters of the dynamics prediction model $g(\cdot)$. Usually, $g(\cdot)$ is implemented as either an auto-regressive module based on pixel representation or a graph-based interaction network based on object-based representation. In this work, we consider two optimization schedules for $g(\cdot)$: parallel optimization during joint learning with $f(\cdot)$, or serial optimization by learning and fixing $g(\cdot)$ first; please refer to Algs. 1 and 2 for more details.
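The composition in Eq. (2) can be sketched as follows. The linear auto-regressive rollout standing in for $g$, the logistic model standing in for $f$, and the two-dimensional (position, velocity) state are assumptions chosen for brevity, not the architectures evaluated in this paper.

```python
import numpy as np

def g_rollout(x0, phi, horizon):
    """Dynamics model g(X0; phi): auto-regressive rollout X_{t+1} = phi @ X_t (assumed linear)."""
    states = [np.asarray(x0, dtype=float)]
    for _ in range(horizon):
        states.append(phi @ states[-1])
    return np.stack(states)  # D = {X_t | t = 0..T}, shape (T + 1, state_dim)

def f_classify(dynamics, theta):
    """Task-solution model f(D; theta): logistic model on the flattened rollout (assumption)."""
    w, b = theta
    return 1.0 / (1.0 + np.exp(-(dynamics.ravel() @ w + b)))

def lfd_predict(x0, phi, theta, horizon):
    """Eq. (2): P(y | X0) = f(D; theta) with D = g(X0; phi)."""
    return f_classify(g_rollout(x0, phi, horizon), theta)

x0 = np.array([0.0, 1.0])                  # toy state: position 0, velocity 1
phi = np.array([[1.0, 0.1], [0.0, 1.0]])   # position advances by 0.1 * velocity per step
T = 3
D = g_rollout(x0, phi, T)
theta = (np.zeros(D.size), 0.0)            # untrained classifier parameters
p_solved = lfd_predict(x0, phi, theta, T)
```

Compared with LfI, the classifier here never sees $X_0$ alone; everything it knows about the outcome passes through the predicted sequence $D$, so its quality is bounded by the quality of $g$.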

Algorithm 1: Parallel optimization of LfD

Variables: $I$ is the initial background image. $A$ is the action. $f(\cdot)$ and $g(\cdot)$ are the task-solution model and the dynamics prediction model, respectively. $D$ and $y$ denote the predicted dynamics and outcome, and $D_{gt}$ and $y_{gt}$ the ground-truth ones. The dynamics loss and cross-entropy loss are denoted as $L_d(D, D_{gt})$ and $L_e(y, y_{gt})$, respectively. $\alpha$ and $\beta$ are hyperparameters that balance the two losses.

1: repeat
2:   Predict the dynamics $D$ from $I$ and $A$ using $g(\cdot)$;
3:   Predict the outcome $y$ from $D$ using $f(\cdot)$;
4:   Compute the total loss $L_{total}(D, y, D_{gt}, y_{gt}) = \alpha L_d(D, D_{gt}) + \beta L_e(y, y_{gt})$;
5:   Optimize $f(\cdot)$ and $g(\cdot)$ simultaneously using the gradient of the total loss $\nabla L_{total}(D, y, D_{gt}, y_{gt})$.
6: until max iteration
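Step 4 of Algorithm 1 can be illustrated with a small loss function. Algorithm 1 only names a dynamics loss and a cross-entropy loss; the mean-squared-error form of $L_d$, the binary form of $L_e$, and the example values below are assumptions made for this sketch.

```python
import numpy as np

def dynamics_loss(d_pred, d_gt):
    """L_d(D, D_gt): mean squared error between rollouts (assumed form)."""
    return float(np.mean((np.asarray(d_pred, dtype=float) - np.asarray(d_gt, dtype=float)) ** 2))

def cross_entropy_loss(p_pred, y_gt, eps=1e-12):
    """L_e(y, y_gt): binary cross-entropy for the solved/unsolved outcome."""
    p = float(np.clip(p_pred, eps, 1.0 - eps))
    return float(-(y_gt * np.log(p) + (1.0 - y_gt) * np.log(1.0 - p)))

def total_loss(d_pred, p_pred, d_gt, y_gt, alpha=1.0, beta=1.0):
    """L_total = alpha * L_d + beta * L_e, the quantity differentiated in step 5."""
    return alpha * dynamics_loss(d_pred, d_gt) + beta * cross_entropy_loss(p_pred, y_gt)

d_pred = [[0.0, 0.0], [1.0, 1.0]]   # toy predicted rollout
d_gt = [[0.0, 0.0], [1.0, 3.0]]     # toy ground-truth rollout
loss = total_loss(d_pred, 0.5, d_gt, 1.0, alpha=0.5, beta=2.0)
```

Because both terms share one scalar objective, the gradient reaching $g(\cdot)$ mixes the dynamics signal with the classification signal; the ratio of $\alpha$ to $\beta$ controls which of the two dominates.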

Algorithm 2: Serial optimization of LfD

Variables: $I$ is the initial background image. $A$ is the action. $f(\cdot)$ and $g(\cdot)$ are the task-solution model and the dynamics prediction model, respectively. $D$ and $y$ denote the predicted dynamics and outcome, and $D_{gt}$ and $y_{gt}$ the ground-truth ones.

1: repeat
2:   Predict the dynamics $D$ from $I$ and $A$ using $g(\cdot)$;
3:   Compute the dynamics loss $L_d(D, D_{gt})$;
4:   Optimize $g(\cdot)$ using the gradient of the dynamics loss $\nabla L_d(D, D_{gt})$;
5: until max iteration
6: Freeze $g(\cdot)$;
7: repeat
8:   Predict the dynamics $D$ from $I$ and $A$ using the pre-trained $g(\cdot)$;
9:   Predict the final outcome $y$ from $D$ using $f(\cdot)$;
10:   Compute the cross-entropy loss $L_e(y, y_{gt})$;
11:   Optimize $f(\cdot)$ using the gradient of the cross-entropy loss $\nabla L_e(y, y_{gt})$;
12: until max iteration
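The two stages of Algorithm 2 can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the linear "physics", the synthetic data, the goal (final position positive), and the use of closed-form least squares in stage 1 in place of the iterative gradient updates that Algorithm 2 specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
PHI_TRUE = np.array([[1.0, 0.1], [0.0, 1.0]])  # toy physics: position += 0.1 * velocity
HORIZON = 5

def rollout(x0, phi, horizon):
    """Unroll X_{t+1} = phi @ X_t and flatten the trajectory into one feature vector."""
    states = [x0]
    for _ in range(horizon):
        states.append(phi @ states[-1])
    return np.concatenate(states)

# Synthetic data: initial states, one-step transitions, and binary outcomes.
x0s = rng.normal(size=(200, 2))
x1s = x0s @ PHI_TRUE.T
labels = np.array([rollout(x, PHI_TRUE, HORIZON)[-2] > 0 for x in x0s], dtype=float)

# Stage 1 (steps 1-5): fit g on the dynamics loss alone.
# Least squares minimizes the squared-error L_d in closed form (a substitute for gradient steps).
phi_T, *_ = np.linalg.lstsq(x0s, x1s, rcond=None)
phi = phi_T.T  # step 6: phi is frozen and never updated below

# Stage 2 (steps 7-12): train f on the cross-entropy loss using rollouts from the frozen g.
features = np.stack([rollout(x, phi, HORIZON) for x in x0s])
w, b = np.zeros(features.shape[1]), 0.0
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    grad = p - labels                       # gradient of L_e with respect to each logit
    w -= 0.1 * features.T @ grad / len(labels)
    b -= 0.1 * grad.mean()

accuracy = float(np.mean((features @ w + b > 0) == (labels > 0.5)))
```

In this noise-free toy setting the dynamics are recovered exactly, so the classifier effectively trains on ground-truth rollouts. With an imperfect $g(\cdot)$, stage 2 would instead inherit whatever errors stage 1 left behind, since no classification gradient can correct the frozen predictor.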

While the dynamics prediction step and the outcome prediction step can be integrated, it is worth noting that in LfD, additional architecture changes and supervisory signals are necessary to learn the underlying dynamics, without which the paradigm degenerates into LfI. Ideally, a physics engine or a simulator plays the role of future prediction. However, inversely learning the physical laws [56, 13] is very demanding due to the intrinsic challenges in long-term prediction.
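To make the long-horizon difficulty concrete, the sketch below rolls out a toy linear system with a slightly wrong one-step model and tracks the gap to the true trajectory. The system, the size of the one-step error, and the horizon are assumptions chosen for illustration; they are not measurements from PHYRE.

```python
import numpy as np

def rollout(x0, phi, horizon):
    """Auto-regressive rollout X_{t+1} = phi @ X_t, returning all states."""
    states = [np.asarray(x0, dtype=float)]
    for _ in range(horizon):
        states.append(phi @ states[-1])
    return np.stack(states)

phi_true = np.array([[1.0, 0.1], [0.0, 1.0]])                  # true one-step dynamics
phi_approx = phi_true + np.array([[0.0, 0.01], [0.0, 0.0]])    # learned model, slightly off
x0 = np.array([0.0, 1.0])                                       # position 0, velocity 1

true_traj = rollout(x0, phi_true, 50)
pred_traj = rollout(x0, phi_approx, 50)
errors = np.linalg.norm(pred_traj - true_traj, axis=1)  # per-step state error
```

Here a one-step position error of 0.01 grows to 0.5 after 50 steps because each prediction is fed back as the next input. The growth is only linear in this toy case; learned predictors operating on pixels or interacting objects have no such guarantee.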
