
describe the physical properties, since we are thus able to disentangle the effects from numeracy (i.e., the gain in reasoning is not attributable to better memorization of numbers, which has been reported as “shortcuts” used by LMs (Patel et al., 2021)). This setting also differs from mathematical reasoning tasks (e.g., GSM8k (Cobbe et al., 2021)), where the decomposed reasoning path is typically the procedure of plugging different values into equations; the LM might be able to solve these problems by symbolic manipulation (Razeghi et al., 2022) rather than actual reasoning.
Most existing physics alignment datasets use vision as the primary modality, such as images (Zellers et al., 2019), animations (Wu et al., 2017), or videos (Piloto et al., 2022), which makes them inapplicable to LMs that take only text input. PIQA (Bisk et al., 2020) and MMLU-Physics (Hendrycks et al., 2021) are popular physics reasoning datasets used for LM benchmarking; however, their sizes are inherently limited by the required human annotation (e.g., MMLU contains only 206 physics samples, with college- and high-school-level questions combined).
UTOPIA differs from all these datasets as it leverages a physics engine to generate data (in theory we can obtain unlimited samples), and each sample has reliable ground truth supported by actual simulation. Although in the present work we only use the text-form data for LM benchmarking, the corresponding simulation videos have been recorded during data generation for future multi-modality research.
3 MIND’S EYE
As shown in Figure 1, Mind’s Eye comprises three main components: a text-to-code LM as the front-end, a physics simulation engine (i.e., MuJoCo) as the back-end, and a foundation model (Bommasani et al., 2021) for general reasoning. We detail the implementation of Mind’s Eye below:
Text-to-Code Converter.
The objects and dynamics of the simulation are specified by the rendering code fed into MuJoCo. The rendering code is written in a type of XML file named MJCF³, where the physics properties can be easily controlled by changing some key-value pairs. For example, to change the mass of an object to 10, the line of rendering code needed is geom.set('mass', '10'). We use actual values to express the relative relationships in UTOPIA (e.g., “greater” will be translated to 10 and 1 for the values of the properties to be set). We create rendering templates for each sub-task of UTOPIA, and use programs to generate a dataset of 200,000 text-code pairs. In each pair, the question text is added to the top of the XML code as a comment. We then train decoder-only LMs from scratch to auto-regressively generate the rendering code given the question in the comment. We leverage the BPE vocabulary from GPT-2 (Radford et al., 2019) and extend it with several special tokens that represent repeating tabs or spaces. Besides fine-tuning on the dataset of text-code pairs, we also pre-train the model on the C4 dataset (Raffel et al., 2019a) to enhance the model’s understanding of natural language. All training is done on TPU-v3 Pods, and the resulting models have 0.3B and 1.5B parameters (used as default). See §4.1 for training details.
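To make the text-code pair format concrete, here is a minimal sketch of how one such pair could be assembled, assuming a Python pipeline built on the standard xml.etree.ElementTree module; the template, the helper name build_training_pair, and the specific property values are illustrative assumptions rather than the paper's actual data-generation program.

```python
# Illustrative sketch only: the template, helper name, and property values
# are hypothetical and stand in for the paper's data-generation programs.
import xml.etree.ElementTree as ET

def build_training_pair(question: str, template_xml: str,
                        mass_a: str, mass_b: str) -> str:
    """Fill an MJCF template and place the question at the top of the
    rendering code as an XML comment (the text-code pair format)."""
    root = ET.fromstring(template_xml)
    # Relative relations (e.g., "greater") are expressed with concrete values
    # such as 10 vs. 1 by setting key-value pairs on the relevant geoms.
    geoms = root.iter("geom")
    geom_a, geom_b = next(geoms), next(geoms)
    geom_a.set("mass", mass_a)
    geom_b.set("mass", mass_b)
    code = ET.tostring(root, encoding="unicode")
    # The question comment comes first, so a decoder-only LM can learn to
    # generate the rendering code conditioned on it.
    return f"<!-- {question} -->\n{code}"

template = """
<mujoco>
  <worldbody>
    <body name="ball_a" pos="0 0 2"><freejoint/><geom type="sphere" size="0.1"/></body>
    <body name="ball_b" pos="1 0 2"><freejoint/><geom type="sphere" size="0.1"/></body>
  </worldbody>
</mujoco>
"""

pair = build_training_pair(
    "If ball A has greater mass than ball B, which one hits the ground first?",
    template, mass_a="10", mass_b="1",
)
print(pair)
```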
Simulation Augmented Prompting.
Upon receiving the rendering code, the physics engine runs the corresponding simulation to obtain the ground-truth outcome. The program that triggers the simulation also parses the outcome into text-form prompt injections (e.g., “Hints: Two baseballs take the same time to hit the ground.”, as shown in Figure 1). The injection, combined with the question, is fed to the foundation model, which can thus ground its reasoning in the physical world rendered by the physics engine. We present more details of this procedure in §A.1.
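As a rough illustration of this step, the sketch below runs generated MJCF code with the open-source mujoco Python bindings and turns the outcome into a text-form hint; the body names, the landing check, and the hint wording are assumptions standing in for the paper's actual parsing program.

```python
# Illustrative sketch only: the landing check and hint wording are assumptions.
import mujoco

def simulate_and_make_hint(mjcf_code: str, question: str, steps: int = 2000) -> str:
    """Run the generated MJCF code in MuJoCo and turn the simulated outcome
    into a text-form hint injected ahead of the question."""
    model = mujoco.MjModel.from_xml_string(mjcf_code)
    data = mujoco.MjData(model)
    landing_time = {}
    # Step the simulation and record when each body first reaches the ground.
    for _ in range(steps):
        mujoco.mj_step(model, data)
        for i in range(model.nbody):
            name = mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_BODY, i)
            if name and name not in landing_time and data.xpos[i][2] <= 0.1:
                landing_time[name] = data.time
    # Parse the simulated outcome into a natural-language hint.
    if abs(landing_time.get("ball_a", 0.0) - landing_time.get("ball_b", 0.0)) < 1e-3:
        hint = "Hints: The two balls take the same time to hit the ground."
    else:
        first = min(("ball_a", "ball_b"), key=lambda n: landing_time.get(n, float("inf")))
        hint = f"Hints: {first} hits the ground first."
    # The hint is combined with the question before prompting the foundation model.
    return f"{hint}\n{question}"
```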
The intuition behind Mind’s Eye is to imitate the experiment-reasoning paradigm; however, we leverage quick and cheap physics simulation as an alternative to actual experiments in the physical world. The cognitive analog of Mind’s Eye might be the mental visualization process, also known as “the mind’s eye” (Battaglia et al., 2013; Hegarty, 2004), which often relates to motor processes (Wexler et al., 1998) during embodied reasoning (Nathan et al., 2021).
Discussion: Why does Mind’s Eye work?
Table 2 compares Mind’s Eye with two other methods in terms of how the grounding process is formulated during LM inference. Assuming knowledge of the physical world aligns with the distribution p_World, the Zero-shot Reasoner (Kojima et al., 2022), which uses “Let’s think step by step.” in prompts, can be extended to any number of new tasks. However, its reasoning ability will be compromised if the knowledge in LMs is incorrect or outdated. Similarly, incorporating handcrafted reasoning steps rather than a generic phrase, Chain-
³ Docs for MJCF: https://mujoco.readthedocs.io/en/latest/XMLreference.html