in Vaswani et al. [2017], Saxton et al. [2019] found that transformer-based models outperformed
other architectures when trained to generate the answer directly from the problem statement. Many
researchers have explored enhancing model performance by fine-tuning to produce intermediate
equations or programs [Shi et al., 2015, Upadhyay and Chang, 2015, Amini et al., 2019, Miao et al.,
2020, Drori et al., 2021]. Recent advances rely on large transformer-based language models [Brown
et al., 2020, Thoppilan et al., 2022, Chowdhery et al., 2022, Lewkowycz et al., 2022] and/or datasets
involving full step-by-step solutions in natural language [Ling et al., 2017, Hendrycks et al., 2021,
Welleck et al., 2021, Cobbe et al., 2021, Drori et al., 2021].
Interestingly, prompting large language models such as GPT-3 to generate chains of thought with
just a few examples at test time can enhance performance considerably [Wei et al., 2022], indicating
that the models may already have the ability to engage in a step-by-step reasoning process, in part
because such a process is exemplified in their training. Many recent works use multiple samples
from a model, either using a verifier trained on model-generated responses to re-rank candidate
sequences [Cobbe et al., 2021] or relying on a majority voting scheme [Wang et al., 2022]. The
strongest results overall to date [Lewkowycz et al., 2022] use a very large transformer-based language
model, fine-tuned on scientific and mathematical text, provided with a chain-of-thought prompt, and
assessed using majority voting. However, these models still only achieve modest scores on harder
problems, consistent with the view of Hendrycks et al. [2021] that simply scaling up the model size is
an intractable strategy for solving mathematics problems of higher difficulty, even with the added
benefit of chain-of-thought prompting, verifiers, or majority voting.
Common across these existing works is the use of human-generated solution sequences. In our
work, we introduce our GSM8K-R dataset to explicitly contrast performance on different types of
solution sequences and explore how explicit focus on generating a structured abstract relational
plan can improve learning, an analysis that would not be possible with existing datasets. We
also introduce the unit conversion (UC) task, a completely synthetic task domain to complement
our exploration of solving problems expressed in natural language. This parallels the approach of
Gontier et al. [2020], with a crucial difference. These authors investigated logical reasoning over
a fixed database of specific relational facts, training models to produce an inferable relation in response
to a probe question, and found only small advantages of a plan sequence compared to generating the
answer directly. In contrast, our UC task affords separating the abstract relational plan from the
specific numerical computations. This allows us to demonstrate a striking advantage from learning
to produce the abstract relational sequence rather than just the necessary numerical expressions.
4 Experiments
We use two tasks to explore the possible benefits of relational abstractions: a set of natural language
math problems from the Grade School Math 8K (GSM8K) dataset [Cobbe et al., 2021], and an
abstract unit conversion task (UC) in which the model must determine how the number of units of
one type corresponds to a specified number of units of another type. Both tasks contain quantities
and relations that can be represented by a graph, and involve formulating and solving a series of
numerical equations. However, the two tasks pose different challenges, allow different approaches to
model training, and afford different comparison conditions and analyses.
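To make the structure of the UC task concrete, the following sketch shows how conversion relations can be represented as a graph and how an answer corresponds to a chain of numerical equations along a path. This is our own illustration, not the task generator used in the paper; the unit names and conversion factors here are made up.

```python
# Illustrative sketch only: hypothetical units and factors, not the actual UC task data.
from collections import deque

# Each entry (src, dst): k encodes the relation "1 src = k dst".
CONVERSIONS = {
    ("gallop", "trot"): 4,
    ("trot", "step"): 6,
    ("step", "hop"): 3,
}

def build_graph(conversions):
    graph = {}
    for (src, dst), factor in conversions.items():
        graph.setdefault(src, []).append((dst, factor))
        graph.setdefault(dst, []).append((src, 1 / factor))
    return graph

def convert(quantity, src, dst, graph):
    """Breadth-first search over the relation graph, multiplying conversion
    factors along the path -- one numerical equation per traversed edge."""
    frontier = deque([(src, quantity)])
    seen = {src}
    while frontier:
        unit, value = frontier.popleft()
        if unit == dst:
            return value
        for nxt, factor in graph.get(unit, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, value * factor))
    raise ValueError(f"no conversion path from {src} to {dst}")

graph = build_graph(CONVERSIONS)
print(convert(2, "gallop", "hop"))  # 2 * 4 * 6 * 3 = 144
```

The relational plan here is the path through the graph (which conversions to apply, in what order); the numerical computation is the multiplication carried out along that path. The UC task lets these two components be separated cleanly.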
The GSM8K dataset consists of realistic word problems requiring a broad understanding of
mathematical concepts and their application at the grade-school level. The dataset includes
human-generated mixed expressions that usually step through the problems in a linear order corre-
sponding to the problem statement in a fairly small number of solution steps. Because these are word
problems, they challenge the model’s natural language understanding and general world knowledge
(such as the fact that a dozen consists of 12 items, or that the number of eggs increases when a chicken
lays an egg but decreases when eggs are used to bake cookies). We present our GSM8K-R dataset
by building on the GSM8K dataset, adding human annotations that extract the core components
of the reasoning process, namely the entities, quantities, and the arithmetic operations that define
the entities’ relations. In this setting we fine-tune pre-trained language models and compare our
proposed conditions to the natural-language-based comparison conditions provided with the dataset.
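As a purely hypothetical illustration of what such an annotation separates out, consider the sketch below; the field names and encoding are ours and are not the actual GSM8K-R schema, which is described later.

```python
# Purely illustrative: a hypothetical encoding of a relational annotation,
# NOT the actual GSM8K-R annotation format.
problem = ("A baker starts with 3 dozen eggs and uses 10 eggs for cookies. "
           "How many eggs are left?")

annotation = {
    "entities":   ["eggs_start", "eggs_used", "eggs_left"],
    "quantities": {"eggs_start": 3 * 12, "eggs_used": 10},
    "relations":  [("eggs_left", "=", "eggs_start", "-", "eggs_used")],
}

# Evaluating the relational plan yields the numerical answer.
q = annotation["quantities"]
eggs_left = q["eggs_start"] - q["eggs_used"]
print(eggs_left)  # 26
```

The point of such an annotation is that the entities and the operations relating them can be stated before any arithmetic is carried out, which is exactly the contrast our conditions are designed to probe.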
The unit conversion task avoids the natural language understanding and world knowledge issues