October 10, 2022
MIND'S EYE: GROUNDED LANGUAGE MODEL REASONING THROUGH SIMULATION
Ruibo Liu1,2, Jason Wei1, Shixiang Shane Gu1, Te-Yen Wu2, Soroush Vosoughi2
Claire Cui1, Denny Zhou1, Andrew M. Dai1
1Google Research, Brain Team, 2Dartmouth College
ABSTRACT
Successful and effective communication between humans and AI relies on a shared
experience of the world. By training solely on written text, current language models
(LMs) miss the grounded experience of humans in the real world—their failure to relate language to the physical world causes knowledge to be misrepresented and leads to obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then use the simulation results as part of the input, which enables language models to perform reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100× larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.
1 INTRODUCTION
In questions of science, the authority of a thousand
is not worth the humble reasoning of a single individual.
— Galileo Galilei, 1632
“Do objects fall proportionately to their weight?” This famous question was once controversial until Galileo's Leaning Tower of Pisa experiment¹—Galileo dropped two balls of different masses from the same height (i.e., experiment) and concluded that their time of descent was independent of their mass (i.e., inductive reasoning). Such an experiment-reasoning paradigm has been used by humans for centuries to ground reasoning on complicated problems (Newell, 1980) and transfer learned knowledge to unfamiliar domains (Novak & Gowin, 1984).
Current language models (LMs) follow a different path—by training on natural language, they attempt to reverse engineer the physical world so that they are able to reason about it. Large-scale pre-trained LMs have achieved revolutionary performance on many tasks, such as solving math word problems (Roy & Roth, 2015; Ling et al., 2017; Cobbe et al., 2021) and commonsense reasoning (Talmor et al., 2022; Geva et al., 2021). However, these models do not experience firsthand the situations described by the language (McClelland et al., 2020), and lack the ability to find the correct answers by performing experiments as humans do. As a consequence, when asked the same free-fall question, one of the most widely used LMs, GPT-3² (Brown et al., 2020)—though achieving superhuman performance on many reasoning tasks—generates the wrong answer: “The heavier object will fall faster.” (as shown in Figure 1). Due to this lack of grounded reasoning, current LMs also have issues with truthfulness (Lin et al., 2021) and factuality (Petroni et al., 2020).
¹ In Physics, Aristotle (384–322 BC) claims that the speed at which two identically shaped objects fall is directly proportional to their weight, a claim later challenged by the Aristotelian commentator John Philoponus.
² Specifically, we use text-davinci-002, the “most capable GPT-3 model” at the time of writing, from OpenAI: https://beta.openai.com/docs/models/overview.
[Figure 1 (panels): All three pipelines receive the same question: “Two baseballs X and Y are released from rest at the same height. X is heavier than Y. Which baseball will fall to the ground faster?” The vanilla LM (zero-shot reasoning) answers: “The heavier baseball will fall to the ground faster because it has more mass and therefore more gravity.” The Chain-of-Thought LM (few-shot, with chain-of-thought on similar questions) answers: “Since the acceleration of an object can be computed as a = F/m, the heavier one will have more gravity, so a = mg/m > g. The heavier one will fall to the ground faster.” Mind's Eye (simulator-augmented zero/few-shot reasoning) passes the question through a text-to-code LM that produces rendering code, runs the MuJoCo simulation on a physics engine, and injects the simulation results into the prompt; the augmented LM answers: “Hints: X and Y have the same acceleration. So the answer is: they will fall at the same rate. Both baseballs will fall to the ground at the same time.”]
Figure 1: Current language models are still challenged by simple questions that require a good understanding of the physical world. The answer elicited by Chain-of-Thought can still be wrong if the required knowledge is missing or misrepresented in LMs. Mind's Eye, instead, enables grounded LM reasoning by directly simulating the scene in the given question. The LM can then reason over the injected ground-truth rationale to generate the correct answer.
To tackle these problems, existing remedies include improved prompting techniques, such as inserting hand-written decomposed reasoning steps into few-shot demonstrations (Wei et al., 2022; Zhou et al., 2022). These methods are inherently limited, as their reasoning ability relies entirely on the knowledge internalized in the LM—their performance suffers if the knowledge learnt by the LM is incorrect (Petroni et al., 2019) or outdated (Dhingra et al., 2022). To incorporate external knowledge, retrieval-augmented LMs such as REALM (Guu et al., 2020), RAG (Lewis et al., 2020) and RETRO (Borgeaud et al., 2022) retrieve relevant documents as additional evidence for given questions, and may also fine-tune the LM on the question-document-answer triplets. However, knowledge presented in written language is known to suffer from reporting bias (Bisk et al., 2020), whereby everyday unspoken facts or rarely seen (but practically possible) compositions are commonly missing from text (Paik et al., 2021).
Correct and complete understanding of properties and interactions in the physical world is not only essential for achieving human-level reasoning (Lake et al., 2017), but also fundamental to building general-purpose embodied intelligence (Huang et al., 2022). In this work, we investigate to what extent current LMs understand the basic rules and principles of the physical world, and describe how to ground their reasoning with the aid of simulation. Our contributions are three-fold:
• We propose a new multi-task physics alignment dataset, UTOPIA, whose aim is to benchmark how well current LMs can understand and reason over basic laws of physics (§2). The dataset contains 39 sub-tasks covering six common scenes that involve understanding basic principles of physics (e.g., conservation of momentum in elastic collisions), and all the ground-truth answers are automatically generated by a physics engine. We find that current large-scale LMs are still quite limited on many basic physics-related questions (24% accuracy for GPT-3 175B zero-shot, and 38.2% few-shot).
• We explore a paradigm that adds physics simulation to the LM reasoning pipeline (§3) to make the reasoning grounded in the physical world. Specifically, we first use a model to transform the given text-form question into rendering code, and then run the corresponding simulation on a physics engine (i.e., MuJoCo (Todorov et al., 2012)). Finally, we append the simulation results to the input prompts of LMs during inference. Our method can serve as a plug-and-play framework that works with any LM and requires neither handcrafted prompts nor costly fine-tuning.
• We systematically evaluate the performance of popular LMs of different sizes on UTOPIA before and after augmentation with Mind's Eye, and compare the augmented performance with many existing approaches (§4.2). We find that Mind's Eye outperforms other methods by a large margin in both zero-shot and few-shot settings. More importantly, Mind's Eye is also effective for small LMs, whose augmented performance can be on par with, or even outperform, that of vanilla LMs 100× larger.
2 UTOPIA BENCHMARKING
Humans are able to understand their physical environment and intuit rules about the world from embodied experience. The rules and principles behind the real world have been discovered as scientific laws—we humans have ingrained them as knowledge or intuition (Kaiser et al., 1986) to make reliable predictions about how observed events will unfold in day-to-day life (Kubricht et al., 2017). For example, when driving, we can anticipate when to brake as we approach a stop sign, using intuition or knowledge of Newton's second law of motion. We also know it would be a disaster to collide with a heavy truck, not only from our knowledge of the conservation of momentum (i.e., the lighter object will have the greater velocity after the collision), but also from our embodied experience of collisions in everyday life.
We are thus inspired to design a physics alignment dataset that covers this knowledge, aiming to benchmark to what extent current LMs understand basic physical concepts and rules. As shown in Table 1, we choose six representative scenes, mainly from textbooks (e.g., high-school Physics). The sub-tasks are defined based on the composition of observed and queried concepts. For example, one task in a motion scene could be: given the observed accelerations of two objects with the same mass, determine the relationship between the forces applied to them. In total we have 39 sub-tasks across the different scenes, and each sub-task contains various hand-designed questions whose language style likewise resembles that of textbooks.
Table 1: We propose UTOPIA, a multi-task physics alignment dataset, investigating the grounded reasoning ability of LMs on 39 sub-tasks. Unlike many other datasets, UTOPIA deliberately describes the questions with relative relations (e.g., greater than) instead of absolute numbers (e.g., 3.5 m/s), to approximate humans' perceptual sensing ability in the real world. The ground-truth answers to the questions are generated by the physics engine, which makes it easy to scale UTOPIA to larger sizes.
Motion (concepts: mass, force, velocity; 6 tasks)
  Sample question: Amy pulls two sleds X and Y with the same force. X has a greater mass than Y. Friction can be ignored. Which one has a greater acceleration after the same period of time?

Friction (concepts: mass, velocity, friction; 6 tasks)
  Sample question: Two boxes X and Y move at the same velocity. We only consider kinetic friction, and X undergoes a smaller friction than Y. Which one has a greater velocity after the same period of time (before stopping)?

Free fall (concepts: mass, height, energy; 6 tasks)
  Sample question: Two balls are dropped from the same height. Y has a greater mass than X. We ignore the air resistance. Which one will hit the ground earlier?

Projection (concepts: velocity, mass, energy; 6 tasks)
  Sample question: Jason throws two baseballs X and Y horizontally at the same height. They have the same mass, but X has a greater initial horizontal velocity. Which one will hit the ground earlier?

Collision (concepts: velocity, mass, momentum; 6 tasks)
  Sample question: Two marbles X and Y of the same mass move towards each other. X and Y have the same magnitude of velocity, and the collision is elastic. Which one will have a greater velocity after the collision?

Incline (concepts: mass, height, friction; 9 tasks)
  Sample question: Two blocks of metal X and Y are released from a certain height on a slick slope. Y has a greater mass than X, and the friction can be ignored. Which one will have a greater velocity after the same period of time?
Table 1 exemplifies some samples in UTOPIA. We deliberately choose to use relative comparisons (e.g., “greater than”, “smaller than”, “the same as”) rather than actual numbers to describe the physical properties, since this lets us disentangle the effects of numeracy (i.e., any gain in reasoning is not attributable to better memorization of numbers, which has been reported as a “shortcut” used by LMs (Patel et al., 2021)). This setting also differs from that of mathematical reasoning tasks (e.g., GSM8k (Cobbe et al., 2021)), where the decomposed reasoning path is typically the procedure of plugging different values into equations—the LM might be able to solve these problems by symbolic manipulation (Razeghi et al., 2022) rather than actual reasoning.
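To make the relative-relation design concrete, below is a minimal sketch of how such templated questions could be generated; the template text, helper name, and relation list are our own illustration, not the authors' released generation code:

```python
# Hypothetical sketch of UTOPIA-style question templating for one "motion"
# sub-task: the accelerations are observed, the relation of forces is queried.
RELATIONS = ["greater than", "smaller than", "the same as"]

def motion_force_question(relation: str) -> str:
    """Compose a textbook-style question from a relative relation."""
    assert relation in RELATIONS
    return (
        "Two carts X and Y have the same mass. "
        f"The acceleration of X is {relation} that of Y. "
        "What is the relationship between the forces applied on them?"
    )

for relation in RELATIONS:
    print(motion_force_question(relation))
```

Swapping which concepts are observed and which is queried would yield the other sub-tasks of the same scene.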
Most existing physics alignment datasets use vision as the primary modality, such as images (Zellers et al., 2019), animations (Wu et al., 2017), or videos (Piloto et al., 2022), which sacrifices the flexibility to run on LMs that only take text input. PIQA (Bisk et al., 2020) and MMLU-Physics (Hendrycks et al., 2021) are popular physics reasoning datasets used for LM benchmarking; however, their sizes are inherently limited by the required human annotation (e.g., MMLU contains only 206 physics samples, with college- and high-school-level questions combined). UTOPIA differs from all these datasets in that it leverages a physics engine to generate data—in theory we can obtain unlimited samples—and each sample has reliable ground truth supported by actual simulation. Although in the present work we only use the text-form data for LM benchmarking, the corresponding simulation videos produced during data generation have been recorded as data for future multi-modality research.
3 MIND'S EYE
As shown in Figure 1, Mind's Eye comprises three main components: a text-to-code LM as the front-end, a physics simulation engine (i.e., MuJoCo) as the back-end, and a foundation model (Bommasani et al., 2021) for general reasoning. We detail the implementation of Mind's Eye below:
Text-to-Code Converter. The objects and dynamics of the simulation are specified by the rendering code fed into MuJoCo. The rendering code is written in a type of XML file named MJCF³, in which physical properties can be easily controlled by changing key-value pairs. For example, to change the mass of an object to 10, the line of rendering code needed is geom.set('mass', '10'). We use actual values to express the relative relationships in UTOPIA (e.g., “greater” is translated to 10 and 1 for the values of the properties to be set). We create rendering templates for each sub-task of UTOPIA, and use programs to generate a dataset of 200,000 text-code pairs. In each pair, the question text is prepended to the XML code as comments. We then train decoder-only LMs from scratch to auto-regressively generate the rendering code given the question in the comments. We leverage the BPE vocabulary of GPT-2 (Radford et al., 2019) and extend it with several special tokens representing repeating tabs or spaces. Besides fine-tuning on the dataset of text-code pairs, we also pre-train the model on the C4 dataset (Raffel et al., 2019a) to enhance its understanding of natural language. All training runs on TPU-v3 Pods, and the resulting models have 0.3B and 1.5B parameters (the latter used as default). See §4.1 for training details.
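As a concrete illustration of the key-value mechanism, here is a minimal sketch (our own, not the authors' pipeline) that edits a toy MJCF scene with Python's standard ElementTree API; the scene layout is an assumption, while the geom attributes follow the public MJCF schema:

```python
import xml.etree.ElementTree as ET

# A toy MJCF scene: two spheres released from rest at the same height.
MJCF = """
<mujoco>
  <worldbody>
    <body name="ball_x" pos="-0.2 0 1.0">
      <freejoint/>
      <geom name="geom_x" type="sphere" size="0.05" mass="1"/>
    </body>
    <body name="ball_y" pos="0.2 0 1.0">
      <freejoint/>
      <geom name="geom_y" type="sphere" size="0.05" mass="1"/>
    </body>
  </worldbody>
</mujoco>
"""

root = ET.fromstring(MJCF)
# "X is heavier than Y" -> translate the relative relation into actual
# values (e.g., 10 vs. 1), as described above.
root.find(".//geom[@name='geom_x']").set('mass', '10')
print(ET.tostring(root, encoding='unicode'))
```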
Simulation-Augmented Prompting. Upon receiving the rendering code, the physics engine runs the corresponding simulation to obtain the ground-truth outcome. The program that triggers the simulation also parses the outcome into text-form prompt injections (e.g., “Hints: Two baseballs take the same time to hit the ground.”, as shown in Figure 1). The injection, combined with the question, is fed to the foundation model, allowing the LM to ground its reasoning in the physical world rendered by the physics engine. We present more details of this procedure in §A.1.
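A minimal sketch of this back-end step, using the open-source mujoco Python bindings; the helper names, the 5-second horizon, and the hint wording are our assumptions for illustration:

```python
import mujoco  # open-source MuJoCo Python bindings (pip install mujoco)

def simulate_fall_times(xml: str, bodies=("ball_x", "ball_y"), radius=0.05):
    """Step the simulation and record when each body reaches the ground."""
    model = mujoco.MjModel.from_xml_string(xml)
    data = mujoco.MjData(model)
    landed = {}
    while len(landed) < len(bodies) and data.time < 5.0:
        mujoco.mj_step(model, data)
        for name in bodies:
            if name not in landed and data.body(name).xpos[2] <= radius:
                landed[name] = data.time  # time of first ground contact
    return landed

def to_hint(landed: dict) -> str:
    """Parse the simulation outcome into a text-form prompt injection."""
    tx, ty = landed["ball_x"], landed["ball_y"]
    if abs(tx - ty) < 1e-3:
        return "Hints: Two baseballs take the same time to hit the ground."
    faster = "X" if tx < ty else "Y"
    return f"Hints: Baseball {faster} hits the ground first."

# E.g., with the toy MJCF scene sketched earlier:
# prompt = to_hint(simulate_fall_times(MJCF)) + "\n" + question
```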
The intuition behind Mind's Eye is to imitate the experiment-reasoning paradigm, but with quick and cheap physics simulation as an alternative to actual experiments in the physical world. The cognitive analog of Mind's Eye might be the mental visualization process, also known as “the mind's eye” (Battaglia et al., 2013; Hegarty, 2004), which often relates to motor processes (Wexler et al., 1998) during embodied reasoning (Nathan et al., 2021).
Discussion: Why does Mind's Eye work? Table 2 compares Mind's Eye with two other methods in how they formulate the grounding process during LM inference. Assuming knowledge of the physical world follows the distribution p_World, the Zero-shot Reasoner (Kojima et al., 2022), which uses “Let's think step by step.” in prompts, can be extended to any number of new tasks. However, its reasoning ability will be compromised if the knowledge in LMs is incorrect or outdated. Similarly, incorporating handcrafted reasoning steps rather than a generic phrase, Chain-
³ Docs for MJCF: https://mujoco.readthedocs.io/en/latest/XMLreference.html