
describe the physical properties, since we are thus able to disentangle the effects from numeracy (i.e., the gain in reasoning is not attributable to better memorization of numbers, which has been reported as “shortcuts” used by LMs (Patel et al., 2021)). This setting also differs from mathematical reasoning tasks (e.g., GSM8k (Cobbe et al., 2021)), where the decomposed reasoning path is typically the procedure of plugging different values into equations; the LM might be able to solve these problems by symbolic manipulation (Razeghi et al., 2022) rather than actual reasoning.
Most existing physics alignment datasets use vision as the primary modality, such as images (Zellers et al., 2019), animations (Wu et al., 2017), or videos (Piloto et al., 2022), which makes them inapplicable to LMs that take only text input. PIQA (Bisk et al., 2020) and MMLU-Physics (Hendrycks et al., 2021) are popular physics reasoning datasets used for LM benchmarking; however, their sizes are inherently limited by the required human annotation (e.g., MMLU contains only 206 physics samples, with college- and high-school-level questions combined).
UTOPIA differs from all these datasets as it leverages a physics engine to generate data (in theory we can obtain unlimited samples), and each sample has reliable ground truth supported by actual simulation. Although in the present work we only use the text-form data for LM benchmarking, the corresponding simulation videos have been recorded during data generation for future multi-modality research.
3 MIND’S EYE
As shown in Figure 1, Mind’s Eye comprises three main components: a text-to-code LM as the front-end, a physics simulation engine (i.e., MuJoCo) as the back-end, and a foundation model (Bommasani et al., 2021) for general reasoning. We detail the implementation of Mind’s Eye below:
Text-to-Code Converter.
The objects and dynamics of the simulation are specified by the rendering code fed into MuJoCo. The rendering code is written in a type of XML file named MJCF³, where the physics properties can be easily controlled by changing some key-value pairs. For example, to change the mass of an object to 10, the line of rendering code needed is geom.set('mass', '10'). We use actual values to express the relative relationships in UTOPIA (e.g., “greater” will be translated to 10 and 1 for the values of the properties to be set). We create rendering templates for each sub-task of UTOPIA, and use programs to generate a dataset of 200,000 text-code pairs. In each pair, the question text is added to the top of the XML code as a comment. We then train decoder-only LMs from scratch to auto-regressively generate the rendering code given the question in the comment. We leverage the BPE vocabulary from GPT-2 (Radford et al., 2019) and extend it with several special tokens that represent repeating tabs or spaces. Besides fine-tuning on the dataset of text-code pairs, we also pre-train the model on the C4 dataset (Raffel et al., 2019a) to enhance the model’s understanding of natural language. All training is done on TPU-v3 Pods, and the resulting models have 0.3B and 1.5B parameters (used as default). See §4.1 for training details.
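To make the text-code pair format concrete, here is a minimal sketch of how one such pair could be assembled, assuming a Python pipeline built on the standard xml.etree.ElementTree module; the template, the helper name build_training_pair, and the specific property values are illustrative assumptions rather than the paper's actual data-generation program.

```python
# Illustrative sketch only: the template, helper name, and property values
# are hypothetical and stand in for the paper's data-generation programs.
import xml.etree.ElementTree as ET

def build_training_pair(question: str, template_xml: str,
                        mass_a: str, mass_b: str) -> str:
    """Fill an MJCF template and place the question at the top of the
    rendering code as an XML comment (the text-code pair format)."""
    root = ET.fromstring(template_xml)
    # Relative relations (e.g., "greater") are expressed with concrete values
    # such as 10 vs. 1 by setting key-value pairs on the relevant geoms.
    geoms = root.iter("geom")
    geom_a, geom_b = next(geoms), next(geoms)
    geom_a.set("mass", mass_a)
    geom_b.set("mass", mass_b)
    code = ET.tostring(root, encoding="unicode")
    # The question comment comes first, so a decoder-only LM can learn to
    # generate the rendering code conditioned on it.
    return f"<!-- {question} -->\n{code}"

template = """
<mujoco>
  <worldbody>
    <body name="ball_a" pos="0 0 2"><freejoint/><geom type="sphere" size="0.1"/></body>
    <body name="ball_b" pos="1 0 2"><freejoint/><geom type="sphere" size="0.1"/></body>
  </worldbody>
</mujoco>
"""

pair = build_training_pair(
    "If ball A has greater mass than ball B, which one hits the ground first?",
    template, mass_a="10", mass_b="1",
)
print(pair)
```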
Simulation Augmented Prompting.
Upon receiving the rendering code, the physics engine runs the corresponding simulation to obtain the ground-truth outcome. The program that triggers the simulation also parses the outcome into text-form prompt injections (e.g., “Hints: Two baseballs take the same time to hit the ground.”, as shown in Figure 1). The injection, combined with the question, is fed to the foundation model, which can thus ground its reasoning in the physical world rendered by the physics engine. We present more details of this procedure in §A.1.
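As a rough illustration of this step, the sketch below runs generated MJCF code with the open-source mujoco Python bindings and turns the outcome into a text-form hint; the body names, the landing check, and the hint wording are assumptions standing in for the paper's actual parsing program.

```python
# Illustrative sketch only: the landing check and hint wording are assumptions.
import mujoco

def simulate_and_make_hint(mjcf_code: str, question: str, steps: int = 2000) -> str:
    """Run the generated MJCF code in MuJoCo and turn the simulated outcome
    into a text-form hint injected ahead of the question."""
    model = mujoco.MjModel.from_xml_string(mjcf_code)
    data = mujoco.MjData(model)
    landing_time = {}
    # Step the simulation and record when each body first reaches the ground.
    for _ in range(steps):
        mujoco.mj_step(model, data)
        for i in range(model.nbody):
            name = mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_BODY, i)
            if name and name not in landing_time and data.xpos[i][2] <= 0.1:
                landing_time[name] = data.time
    # Parse the simulated outcome into a natural-language hint.
    if abs(landing_time.get("ball_a", 0.0) - landing_time.get("ball_b", 0.0)) < 1e-3:
        hint = "Hints: The two balls take the same time to hit the ground."
    else:
        first = min(("ball_a", "ball_b"), key=lambda n: landing_time.get(n, float("inf")))
        hint = f"Hints: {first} hits the ground first."
    # The hint is combined with the question before prompting the foundation model.
    return f"{hint}\n{question}"
```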
The intuition behind Mind’s Eye is to imitate the experiment-reasoning paradigm; however, we leverage quick and cheap physics simulation as an alternative to actual experiments in the physical world. The cognitive analog of Mind’s Eye might be the mental visualization process, also known as “the mind’s eye” (Battaglia et al., 2013; Hegarty, 2004), which often relates to motor processes (Wexler et al., 1998) during embodied reasoning (Nathan et al., 2021).
Discussion: Why does Mind’s Eye work?
Table 2 compares Mind’s Eye with two other methods in terms of how the grounding process is formulated during LM inference. Assuming knowledge of the physical world aligns with the distribution p_World, the Zero-shot Reasoner (Kojima et al., 2022), which uses “Let’s think step by step.” in prompts, can be extended to any number of new tasks. However, its reasoning ability will be compromised if the knowledge in LMs is incorrect or outdated. Similarly, incorporating handcrafted reasoning steps rather than a generic phrase, Chain-
³ Docs for MJCF: https://mujoco.readthedocs.io/en/latest/XMLreference.html