Learning to Reason With Relational Abstractions
Andrew J. Nam1, Mengye Ren2, Chelsea Finn1, James L. McClelland1
1Stanford University, 2NYU
December 7, 2022
Abstract
Large language models have recently shown promising progress in mathematical reasoning
when fine-tuned on human-generated sequences that walk through the solution steps.
However, the solution sequences are not formally structured and the resulting model-generated
sequences may not reflect the kind of systematic reasoning we might expect an expert human to
produce. In this paper, we study how to build stronger reasoning capability in language models
using the idea of relational abstractions. We introduce new types of sequences that more explic-
itly provide an abstract characterization of the transitions through intermediate solution steps
to the goal state. We find that models that are supplied with such sequences as prompts can
solve tasks with a significantly higher accuracy, and models that are trained to produce such se-
quences solve problems better than those that are trained with previously used human-generated
sequences and other baselines. Our work thus takes several steps toward elucidating and im-
proving how language models perform on tasks requiring multi-step mathematical reasoning.
1 Introduction
Deep learning has had tremendous success in a wide range of domains, such as vision [He et al.,
2016], language [Brown et al., 2020], and playing games at superhuman levels [Mnih et al., 2015,
Silver et al., 2016, Vinyals et al., 2019]. Yet despite these accomplishments, these systems remain
limited in their formal and mathematical reasoning abilities [Saxton et al., 2019, Cobbe et al., 2021,
Hendrycks et al., 2021]. Although there have been impressive recent gains [Lewkowycz et al., 2022],
these models still struggle with harder problems.
Recent work suggests that neural networks, like humans, benefit from relying on a chain of reason-
ing steps rather than attempting to produce the final output as a direct mapping from the problem
prompt [Recchia, 2021, Nye et al., 2021, Hendrycks et al., 2021, Cobbe et al., 2021, Lewkowycz et al.,
2022]. These works rely entirely on naturalistic data and manipulations, in the sense that problems
and their step-wise solutions are taken as they are found in existing sources, or human annotators
are asked to produce a sequence of solution steps using numbers interspersed with natural language.
However, while naturalistic sentences are certainly how we often communicate our solutions to each
other informally, we argue that formal and mathematical reasoning depends on identifying and ex-
ploiting the set of abstract relationships that underlies the details of the problem at hand. Even
in settings where the focus is on the step-wise manipulation of quantities to obtain valid practical
results, a set of abstract relationships underlies the sequence of operations.
We build on this intuition by exploring the possibility that, if a problem-solver can formulate
the problem under consideration at an abstract level, this will be conducive to finding the correct
sequence of more specific arithmetic operations. However, to our knowledge, no math dataset
currently exists that utilizes natural language and also isolates key reasoning components such as
entities and their relations; that is, there is no existing way to train a model to convert natural language
inputs into these core elements. We address this gap by proposing GSM8K-R, a new dataset that expands
on the GSM8K dataset of grade-school math word problems [Cobbe et al., 2021] with human annotations
that highlight the relational abstractions central to mathematical reasoning.
Equal Contribution.
[Figure 1 contents. Panels A-D show the same two example problems solved in four solution-sequence formats:

Math question: Janet's ducks lay 16 eggs per day. She eats 3 for breakfast every morning and bakes muffins for her friends every day with 4. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much does she make every day?
  A (Numeric only): 16-3-4=9; 9*2=18
  B (Relational-First): <relation> eggs laid per day - eggs for breakfast - eggs for baking = remaining eggs; remaining eggs * price per egg = amount earned daily from eggs <equation> 16-3-4=9; 9*2=18
  C (Interleaved): eggs laid per day - eggs for breakfast - eggs for baking = remaining eggs; 16-3-4=9; remaining eggs * price per egg = amount earned daily from eggs; 9*2=18
  D (Multitask): <relation> eggs laid per day - eggs for breakfast - eggs for baking = remaining eggs; remaining eggs * price per egg = amount earned daily from eggs OR <equation> 16-3-4=9; 9*2=18

Unit conversion task: H = 2A; F = 3D; B = 3A; I = 3F; E = 3B; J = 2I; B = 3C; F = 4E; G = 3C; I = 4H; D = 2C; G = 1B; Convert J to G (mod 5)
  A (Numeric only): 1 * 2 = 2; 2 * 3 = 1; 1 * 3 = 3; 3 * 2 = 1; 1 / 3 = 2;
  B (Relational-First): <relation> J -> I; I -> F; F -> D; D -> C; C -> G; <equation> 1 * 2 = 2; 2 * 3 = 1; 1 * 3 = 3; 3 * 2 = 1; 1 / 3 = 2;
  C (Interleaved): J -> I; 1 * 2 = 2; I -> F; 2 * 3 = 1; F -> D; 1 * 3 = 3; D -> C; 3 * 2 = 1; C -> G; 1 / 3 = 2;
  D (Multitask): <relation> J -> I; I -> F; F -> D; D -> C; C -> G; OR <equation> 1 * 2 = 2; 2 * 3 = 1; 1 * 3 = 3; 3 * 2 = 1; 1 / 3 = 2;]
Figure 1: We explore abstract relational reasoning by partitioning the reasoning process into an abstract relational
part and a numeric part, and compare four possibilities. Numeric only (NN): only numeric steps are provided,
without any relational tokens. Relational-first (RRNN): the abstract relational parts are stated before the numeric
ones. Interleaved (RNRN): relational and numeric parts occur in alternating sequence. Multitask (RR|NN): the
network learns to produce either the abstract relational sequence or the numeric sequence in response to a task
prompt, and is prompted for the numeric sequence at test time.
We also introduce a new synthetic task, the unit conversion (UC) task, which reduces the abstract
relational problem to its essence and enables controlled analyses without the complications that
arise from naturalistic datasets.
At their core, both tasks involve reasoning about how different quantities relate to each other,
and formulating appropriate arithmetic equations to perform the corresponding numerical computa-
tions. We can decompose each step of the solution into abstract relational reasoning and arithmetic
expressions, which can then be used to recompose the solution sequence in different forms.
We summarize our main contributions as follows:
• We decompose the problem-solving process into identifying the relevant abstract relationships and
performing the corresponding arithmetic manipulations.
• We present a new dataset, GSM8K-R, that adds relational abstraction annotations to the original
GSM8K dataset [Cobbe et al., 2021] (to be released with the paper).
• We introduce a new synthetic task, the Unit Conversion task, that brings out the importance of
engaging with relational abstractions, even in smaller transformer models.
• We find that teaching models to identify the relevant abstract relationships on training problems
can lead to substantial performance gains at test time, and we identify several factors affecting this
outcome.
• We find that identifying the crucial abstract relationships remains a challenge, and that providing
the relational abstraction at test time can produce drastic gains.
Taken together, we believe these findings highlight the importance of identifying the relevant ab-
stract relations to enable correct formal and mathematical reasoning. In the discussion, we consider
next steps that may allow the development of artificial systems that capture this ability.
2 Incorporating Relational Abstraction
In this section, we describe our framework for incorporating relational abstractions into mathematical
reasoning. We begin with the notion that mathematical problem solving involves determining the
values of unknown quantities from known quantities, where a quantity is a numerical attribute of an
item or set, such as the price of an item or the number of items in the set. Quantities can be derived
from other quantities using rules that apply to quantities of the relevant types. For example, as in
the problem shown in Figure 1, the amount earned from selling some number of items (in this case,
eggs) is equal to the product of the number of items sold and the price per item.
In general, mathematical problem solving requires several operations on given quantities to obtain
a final answer – a specified target or goal quantity. In the problem in Figure 1, we are given the
number of eggs Janet’s ducks lay each day, eggs eaten for breakfast, eggs used in baking, and we
are told that she sells the remainder for a specified price per egg. To solve for how much money
she makes, we must first determine the remainder by subtracting the number of eggs eaten and the
number of eggs used in baking from the number laid, and then determine the amount earned by
multiplying the remaining number of eggs times the price per egg.
This exemplifies what we call the abstract relational plan: a plan outlining the reasoning process
without invoking any numbers. Here, “eggs laid”, “eggs eaten”, “eggs used in baking”, “remaining
eggs” and “price per egg” are quantities needed to reach the target quantity. The abstract relational
plan specifies the steps that must be applied to the given quantities to reach the relevant intermediate
quantities, and then applied to these quantities to reach the final answer. What makes a plan abstract
is that it omits specific information – that is, the specific quantities involved – and connects items
through how they relate to each other at a more general or abstract level. What makes it relational
is that it specifies which entities are relevant to each other in the problem. An abstract relational
plan formulates the problem as a graph of interconnected abstract entities, whose specific values
could be replaced by others without changing the set of relationships.
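As a concrete illustration (our own sketch, not part of the paper's pipeline; the Step class, the execute function, and the quantity names are ours), the abstract relational plan for the egg-selling problem in Figure 1 can be written as a small graph of named quantities that is then applied to whatever given values a particular problem supplies:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One edge of the abstract relational plan: named input quantities -> a named output quantity."""
    inputs: List[str]
    output: str
    op: Callable[..., float]  # the arithmetic rule relating the quantities


# The plan itself mentions no numbers, only quantities and how they relate.
plan = [
    Step(["eggs laid per day", "eggs for breakfast", "eggs for baking"],
         "remaining eggs",
         lambda laid, breakfast, baking: laid - breakfast - baking),
    Step(["remaining eggs", "price per egg"],
         "amount earned daily",
         lambda remaining, price: remaining * price),
]


def execute(plan: List[Step], given: Dict[str, float]) -> Dict[str, float]:
    """Apply the plan to the given quantities, filling in intermediate and target values."""
    values = dict(given)
    for step in plan:
        values[step.output] = step.op(*(values[name] for name in step.inputs))
    return values


values = execute(plan, {"eggs laid per day": 16, "eggs for breakfast": 3,
                        "eggs for baking": 4, "price per egg": 2})
print(values["remaining eggs"], values["amount earned daily"])  # 9 18
```

Changing the given values (say, 20 eggs laid or a $3 price) leaves the plan untouched, which is exactly what makes it abstract.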
The problems found in the GSM8K dataset can all be seen as solvable by extracting the correct
abstract relational plan from the verbal statement of the problem and then applying the plan to
obtain the numeric value of the target quantities given the values of the given quantities. The chal-
lenge here is that GSM8K, and other math datasets like it, consists entirely of natural language data
that makes it difficult to systematically extract the relevant entities and their relations. We address
this issue through our human-annotated dataset GSM8K-R that provides the ground truth labels
to train the model with, and we explore several instructional forms that utilize these annotations.
Figure 1 enumerates a few possibilities for how we can incorporate abstract relational reasoning
into the training and testing of a decoder-only transformer of the kind used in the GPT model series.
We first decompose a solution sequence into an abstract relational plan, consisting of a sequence
of abstract relational expressions as described above, and a sequence of arithmetic expressions
involving only numbers and basic arithmetic operations. We can then train and test the models using
conditions of the following four types. Numeric-only (NN) uses only the arithmetic sequences and
serves as our baseline. In relational-then-numeric (RRNN), the relational expressions are presented
before the numeric ones. This represents the strategy of generating a high-level relational plan first, and
then implementing the plan by performing the relevant arithmetic operations. The interleaved for-
mat (RNRN) alternates between the abstract relational expressions and the arithmetic expressions,
so that each arithmetic expression is accompanied by the relevant abstract relational expression.
Lastly, in the multitask approach (RR|NN), the model is prompted to output the sequence of ei-
ther the relational or the numeric expressions, but not both. This may allow the model to learn to
represent the problem at the abstract level and exploit such representations even when it is only pro-
ducing the numerical expressions. This approach tests the claim that additional auxiliary language
tokens effectively function as regularizers or learning tools that can be discarded at test time and
may even suppress performance if included [Mu et al., 2020, Lampinen et al., 2022, Hendrycks et al.,
2021]. Moreover, learning and generating the two sequences separately has the added advantage of
generating shorter sequences at test time, just like numeric-only. In this paper, we examine which
type of relational abstraction brings the best reasoning capability in each of our two task settings.
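To make the four formats concrete, here is a minimal sketch (ours; the <relation> and <equation> markers follow Figure 1, but the exact tokenization and separators used for training are assumptions) showing how the four target sequences can be assembled from the same list of (relational expression, arithmetic expression) pairs:

```python
from typing import List, Tuple

# One (relational expression, arithmetic expression) pair per solution step,
# taken from the Figure 1 example.
steps: List[Tuple[str, str]] = [
    ("eggs laid per day - eggs for breakfast - eggs for baking = remaining eggs", "16-3-4=9"),
    ("remaining eggs * price per egg = amount earned daily from eggs", "9*2=18"),
]

relations = "; ".join(r for r, _ in steps) + ";"
equations = "; ".join(e for _, e in steps) + ";"

targets = {
    # Numeric only (NN): the arithmetic steps alone.
    "NN": equations,
    # Relational-then-numeric (RRNN): the full abstract plan, then the arithmetic.
    "RRNN": f"<relation> {relations} <equation> {equations}",
    # Interleaved (RNRN): each arithmetic step follows its relational step.
    "RNRN": "; ".join(f"{r}; {e}" for r, e in steps) + ";",
    # Multitask (RR|NN): two separate targets for the same problem prompt.
    "RR|NN": (f"<relation> {relations}", f"<equation> {equations}"),
}
```

Under the multitask format, the prompt additionally indicates which of the two targets is requested, and only the numeric target is requested at test time.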
3 Related Work
Although computational models of mathematical reasoning have been proposed for over half a cen-
tury [Bobrow, 1964], application of neural network models began much more recently using recurrent
networks for sequence-to-sequence prediction [Wang et al., 2017]. Shortly after their introduction
in Vaswani et al. [2017], Saxton et al. [2019] found that transformer-based models outperformed
other architectures when trained to generate the answer directly from the problem statement. Many
researchers have explored enhancing model performance by fine-tuning to produce intermediate
equations or programs [Shi et al., 2015, Upadhyay and Chang, 2015, Amini et al., 2019, Miao et al.,
2020, Drori et al., 2021]. Recent advances rely on large transformer-based language models [Brown
et al., 2020, Thoppilan et al., 2022, Chowdhery et al., 2022, Lewkowycz et al., 2022] and/or datasets
involving full step-by-step solutions in natural language [Ling et al., 2017, Hendrycks et al., 2021,
Welleck et al., 2021, Cobbe et al., 2021, Drori et al., 2021].
Interestingly, prompting large language models such as GPT-3 to generate chains of thought with
just a few examples at test time can enhance performance considerably [Wei et al., 2022], indicating
that the models may already have the ability to engage in a step by step reasoning process, in part
because such a process is exemplified in their training. Many recent works use multiple samples
from a model, either using a verifier trained on model-generated responses to re-rank candidate
sequences [Cobbe et al., 2021] or relying on a majority voting scheme [Wang et al., 2022]. The
strongest results overall to date [Lewkowycz et al., 2022] use a very large transformer-based language
model, fine-tuned on scientific and mathematical text, provided with a chain of thought prompt, and
assessed using majority voting. However, these models still only achieve modest scores on harder
problems, consistent with the view of Hendrycks et al. [2021] that simply scaling up the model size is
an intractable strategy for solving mathematics problems of higher difficulty, even with the added
benefit of chain-of-thought prompting, verifiers, or majority voting.
Common across these existing works is the use of human-generated solution sequences. In our
work, we introduce our GSM8K-R dataset to explicitly contrast performance on different types of
solution sequences and explore how explicit focus on generating a structured abstract relational
plan can improve learning, an analysis that would not be possible with existing datasets. We
also introduce the unit conversion (UC) task, a completely synthetic task domain to complement
our exploration of solving problems expressed in natural language. This parallels the approach of
Gontier et al. [2020], with a crucial difference. These authors investigated logical reasoning over
a fixed database of specific relational facts, training models to produce an inferable relation to a
probe question, and found only small advantages of a plan sequence compared to generating the
answer directly. In contrast, our UC task affords separating the abstract relational plan from the
specific numerical computations. This allows us to demonstrate a striking advantage from learning
to produce the abstract relational sequence rather than just the necessary numerical expressions.
4 Experiments
We use two tasks to explore the possible benefits of relational abstractions: a set of natural language
math problems from the Grade School Math 8K (GSM8K) dataset [Cobbe et al., 2021], and an
abstract unit conversion task (UC) in which the model must determine how the number of units of
one type corresponds to a specified number of units of another type. Both tasks contain quantities
and relations that can be represented by a graph, and involve formulating and solving a series of
numerical equations. However, the two tasks pose different challenges, allow different approaches to
model training, and afford different comparison conditions and analyses.
The GSM8K dataset consists of realistic word problems requiring a broad understanding of
mathematical concepts and their application to grade school math problems. The dataset includes
human-generated mixed expressions that usually step through the problems in a linear order corre-
sponding to the problem statement in a fairly small number of solution steps. Because these are word
problems, they challenge the model’s natural language understanding and general world knowledge
(such as the fact that a dozen consists of 12 items, or that the number of eggs increases when a
chicken lays one but decreases when eggs are used to bake cookies). We present our GSM8K-R dataset
by building on the GSM8K dataset, adding human annotations that extract the core components
of the reasoning process, namely the entities, quantities, and the arithmetic operations that define
the entities’ relations. In this setting we fine-tune pre-trained language models and compare our
proposed conditions to the natural-language-based comparison conditions provided with the dataset.
GPT2-XL, fine-tuned on GSM8K
  Type of steps in training    Problem prompt only    Problem & relational plan prompt
  Answer only baseline         4.93                   -
  Numeric only (NN)            22.97                  -
  Multitask (RR|NN)            28.05                  -
  Relational First (RRNN)      19.48                  64.59
  Interleaved (RNRN)           22.97                  66.26

Simple transformer trained from scratch on unit conversion dataset
  Type of steps in training    Problem prompt only    Problem & relational plan prompt
  Answer only baseline         24.7                   -
  Numeric only (NN)            25.9                   -
  Multitask (RR|NN)            29.8                   -
  Relational First (RRNN)      71.1                   96.7
    (w/ correct plan: 96.8; w/ incorrect plan: 21.0; plan accuracy: 66.2)
  Interleaved (RNRN)           85.8                   99.9
    (w/ correct plan: 99.9; w/ incorrect plan: 20.3; plan accuracy: 82.3)

Table 1: Key results (accuracy, %) from the parallel conditions of our two experiments. Fuller definitions of
the conditions are given in the caption for Figure 1.

The unit conversion task avoids the natural language understanding and world knowledge issues
by presenting conversion rules in a simple symbolic form. This allows us to present problems requiring
the use of a larger number of specified relationships that are presented to the model in a random
order and requiring longer sequences of solution steps. In this setting we use smaller scale models
that we are able to train end-to-end, allowing us to consider several additional variations of the training
regime and to analyze the model’s step-by-step performance more straightforwardly. Together our
two tasks offer both a rich, naturalistic environment with empirical results for broader applicability
and a systematic, synthetic environment that reduces mathematical reasoning to its most abstract
form, bringing out the advantage of relational abstractions more clearly.
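As a worked check of the unit conversion example in Figure 1 (a sketch under our reading of the task: a rule such as J = 2I means one J corresponds to two I, all arithmetic is modulo 5, and division is multiplication by the modular inverse; the function and variable names are ours), walking the relational chain J -> I -> F -> D -> C -> G reproduces the numeric steps 2, 1, 3, 1, 2 shown in the figure:

```python
# Conversion rules from the Figure 1 example, written as (left unit, right unit): factor.
# Only the rules used along the chain are listed here.
rules = {("J", "I"): 2, ("I", "F"): 3, ("F", "D"): 3, ("D", "C"): 2, ("G", "C"): 3}
MOD = 5


def convert(amount: int, src: str, dst: str) -> int:
    """Convert `amount` units of src into dst using a single rule, modulo 5."""
    if (src, dst) in rules:  # e.g. J = 2I: 1 J corresponds to 2 I, so multiply
        return (amount * rules[(src, dst)]) % MOD
    # Otherwise use the rule in reverse, e.g. G = 3C: converting C to G divides by 3,
    # which modulo a prime is multiplication by the modular inverse.
    k = rules[(dst, src)]
    return (amount * pow(k, -1, MOD)) % MOD


# The abstract relational plan: the chain of units from J to G.
chain = ["J", "I", "F", "D", "C", "G"]
amount = 1
for src, dst in zip(chain, chain[1:]):
    amount = convert(amount, src, dst)
    print(f"{src} -> {dst}: {amount}")  # prints 2, 1, 3, 1, 2 in turn
```

The final value (2) corresponds to the last numeric step in the figure; the RRNN and RNRN formats make this relational chain explicit rather than leaving it implicit in the arithmetic.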
Table 1 presents key results from the four conditions illustrated in Figure 1. In both the GSM8K-
R and UC tasks, the models perform very poorly after fine-tuning to generate the answer directly
from a problem statement (25% correct is the chance level on the UC task), and training on nu-
meric sequences produces some improvement for GSM8K-R but only a hint of a gain over chance
level for UC. The multitask condition produces slight gains for both models, but the largest gains
are observed when the models have been trained to produce relational sequences either before or
alternating with the numerical sequences. For GSM8K-R, the benefit only appears when the rela-
tional plan is included in the prompt at test time. In the UC setting, we also see big gains when
the model produces the relational sequence for itself, and we also see that this advantage comes
only on trials where the model produces the relational sequence correctly. Indeed, either when the
model produces the relational sequence correctly itself or when prompted with the correct relational
sequence, performance is at near-ceiling levels. In the next sections we describe the two data sets
and experiments in more detail, along with further findings from many additional comparison
conditions.
4.1 Task 1: Solving Grade School Math Problems
We first evaluate our framework on more realistic problems posed using natural language in the
GSM8K-R dataset, which contains around 7.7K training questions and 1.3K test questions from the
original GSM8K dataset with additional human-annotated solutions, all written in English.
An example problem and its solution can be found in the first two rows of Table 2.
The original dataset contains the following possible solution formats:
The original solution format is the one used in the original GSM8K paper. It provides solution steps in natural
language annotated with executable equations. It is similar to our interleaved approach in that
the target unit of each step often appears at the end of the sentence (e.g. Janet sells 16-3-4 eggs