A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models

Alessandro Stolfo∗ (ETH Zürich) stolfoa@ethz.ch
Zhijing Jin∗ (MPI & ETH Zürich) jinzhi@ethz.ch
Kumar Shridhar (ETH Zürich) shkumar@ethz.ch
Bernhard Schölkopf (MPI & ETH Zürich) bs@tue.mpg.de
Mrinmaya Sachan (ETH Zürich) msachan@ethz.ch
Abstract

We have recently witnessed a number of impressive results on hard mathematical reasoning problems with language models. At the same time, the robustness of these models has also been called into question; recent works have shown that models can rely on shallow patterns in the problem description when generating a solution. Building on the idea of behavioral testing, we propose a novel framework, which pins down the causal effect of various factors in the input, e.g., the surface form of the problem text, the operands, and math operators, on the output solution. By grounding the behavioral analysis in a causal graph describing an intuitive reasoning process, we study the behavior of language models in terms of robustness and sensitivity to direct interventions in the input space. We apply our framework on a test bed of math word problems. Our analysis shows that robustness does not appear to continuously improve as a function of size, but the GPT-3 Davinci models (175B) achieve a dramatic improvement in both robustness and sensitivity compared to all other GPT variants.¹
1 Introduction
Many natural language understanding situations, such as understanding the financial news, require reasoning with text that includes numbers. However, such mathematical reasoning is challenging for NLP models (Cobbe et al., 2021; Mishra et al., 2022b). Mathematical reasoning for text has been an active area of research for a while (Seo et al., 2015; Sachan and Xing, 2017; Sachan et al., 2017, 2018, inter alia), and has also emerged as a key task to track the capabilities of large language models (LLMs) in recent years (Brown et al., 2020; Ouyang et al., 2022; Wei et al., 2022a, inter alia).

∗ Equal contribution.
¹ Our code and data are available at https://github.com/alestolfo/causal-math.
Figure 1: Through our framework, we conduct do-interventions on the input and evaluate the change in the distribution P(R) of the prediction R by LLMs (in this figure, GPT-J). This allows us to measure the causal effect of each factor in the input on the model's response. Example shown in the figure: for the original text "Kyle could fit n1 = 26 drawings on each page. If he has n2 = 11 pages, the number of drawings he can make is ___", the model predicts 286 = g with P(286) = 0.085; after a do-intervention by our framework that keeps the ground truth g but changes the operands to n1 = 2 and n2 = 143, the prediction becomes 143 (incorrect), with P(286) = 0.001.
However, despite the impressive performance of LLMs on various math reasoning benchmarks (e.g., Ouyang et al., 2022; Chowdhery et al., 2022), it remains unclear whether these models have learned mere artifacts in the data or have truly mastered the mathematical concepts needed to consistently solve all variations of the same problem (Patel et al., 2021; Razeghi et al., 2022; Welleck et al., 2022). In sharp contrast with a large number of papers on improving the performance of LLMs on various types of math-based problems, there has been little effort on behavioral analysis of LLMs for these tasks. Existing methods for understanding the robustness of these models (Patel et al., 2021) rely on manually constructing variations of math problems, and we do not yet have a principled, comprehensive framework for quantifying such robustness.

Thus, in this work, we propose a formal framework based on causal inference to quantify the robustness of NLP models' math reasoning abilities. Specifically, we describe a causal graph formulation of math reasoning, where the graph allows us to measure the difference in the structural causal models of human reasoning and model judgment.
Figure 2: Causal graph of model predictions on math questions. Nodes: the math word problem Q, its operands N = (N1, N2, ...), the non-operand parts T, the operations O, the irrelevant surface form S, the correct calculation G, and the model's prediction R; Gh denotes the causal graph of human/ground-truth reasoning. We highlight the difference between a cognitively-inspired correct reasoning path (Gh; blue arrow, the desired effect) and the undesired effects that some factors might have on the model's prediction (red arrows, potential spurious correlations). The annotated quantities are DCE(N → R), DCE(S → R), DCE(O → R), DCE(G → R), TCE(N on R), and TCE(T on R). By performing controlled interventions on the numerical values (N) and on the textual framing of the problem (T, S), we are able to quantify the causal effects of each factor.
We consider various causal factors such as the textual framing of the question, numerical operands, and operation types. Then, we identify a set of interventions in the context of math word problems (an example of which is illustrated in Figure 1), and provide a causal inference framework to obtain causal effects of each factor via direct do-interventions (Pearl, 1995) and causal mediation analysis (Pearl, 2001). While our approach is reminiscent of recent studies using causal analysis for LLMs (Finlayson et al., 2021; Vig et al., 2020; Meng et al., 2022), in this work, we provide a new theoretical analysis framework specifically suitable for math reasoning. Using our framework, we disentangle factors affecting the model's predictions and measure their influences. This way, we are able to provide insights into the model's reasoning in terms of robustness and sensitivity with respect to changes in these factors.
We apply our framework to study a set of thirteen GPT models with various sizes and training procedures (i.e., instruction-tuned and non-instruction-tuned). We observe that, among non-instruction-tuned language models, the larger ones tend to be more sensitive to changes in the ground-truth result of a math word problem, but not necessarily more robust. However, we observe a different behavior in the instruction-tuned GPT-3 models (Ouyang et al., 2022), which show a remarkable improvement in both sensitivity and robustness, although the robustness reduces when problems get more complicated. We additionally investigate the role of size and instruction tuning on the model's performance with three models of the LLaMA family (Touvron et al., 2023) and Stanford Alpaca (Taori et al., 2023).
2 Problem Setup
We consider a dataset $\mathcal{D}$ of math word problems (MWPs), where each MWP is denoted as a question $Q$. $Q$ is a list $(T, N)$ consisting of a question template $T$ and an ordered list of operands $N = (N_1, N_2, \ldots, N_m)$. Each question template $T := (O, S)$ further contains two types of information: a set of arithmetic operations $O$ implicitly expressed in the question, and the text surface form $S$ irrelevant to the arithmetic operations. $O$ incorporates the information relative to the operations as a collection of tuples $\{(O_1, i_1, j_1), (O_2, i_2, j_2), \ldots\}$, where $O_k \in \{+, -, \times, \div\}$ (with $k \in \mathbb{N}$) and $i_k, j_k \in \mathbb{N}$ are the indices of the operands to which operator $O_k$ should be applied.² The ground-truth result $G = f_O(N)$ is calculated by computing the function $f_O$, which represents the application of all the operators in $O$ to the respective operands. We illustrate the factors in $Q$ and their inter-dependency in the causal graph in Figure 2. A two-operand instance $q$ of $Q$ in this form from Patel et al. (2021) is:
Template $t$: Mark has $n_1$ trees in his backyard. If he plants $n_2$ more, how many trees will he have?
Operands $n$: $(n_1 = 12, n_2 = 13)$
Operations $o$: $\{(\text{“+”}, 1, 2)\}$
Result: $g = f_o(n) = n_1 + n_2 = 25$

² The intermediate result of operation $O_l$ is indicated by $i_k = m + l$.
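To make this representation concrete, the following minimal Python sketch (ours, not the authors' released code; the class and field names are illustrative assumptions) encodes a question $Q = (T, N)$ with template $T = (O, S)$ and computes the ground truth $g = f_O(n)$ for the example above.

```python
import operator
from dataclasses import dataclass
from typing import List, Tuple

# Maps an operator symbol O_k to the corresponding arithmetic function.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

@dataclass
class MWP:
    """A math word problem Q = (T, N), where the template T = (O, S)."""
    surface: str                             # S: question text with slots {n1}, {n2}, ...
    operations: List[Tuple[str, int, int]]   # O: tuples (O_k, i_k, j_k), 1-based operand indices
    operands: List[float]                    # N: ordered operands (n1, ..., nm)

    def question(self) -> str:
        """Instantiate the template with the operand values."""
        slots = {f"n{k + 1}": v for k, v in enumerate(self.operands)}
        return self.surface.format(**slots)

    def ground_truth(self) -> float:
        """g = f_O(n): apply the operations in order; intermediate results take indices m+1, m+2, ..."""
        values = list(self.operands)
        for op, i, j in self.operations:
            values.append(OPS[op](values[i - 1], values[j - 1]))
        return values[-1]

# The two-operand instance from Patel et al. (2021):
q = MWP(
    surface="Mark has {n1} trees in his backyard. If he plants {n2} more, how many trees will he have?",
    operations=[("+", 1, 2)],
    operands=[12, 13],
)
print(q.question())      # -> "Mark has 12 trees in his backyard. If he plants 13 more, ..."
print(q.ground_truth())  # -> 25
```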
Our goal is to quantify the robustness of a model $M$ on the set of problems $q \in \mathcal{D}$. Ideally, $\mathcal{D}$ should be a dataset not seen by the model during training. We assume that a model takes $q$ as input and predicts a probability distribution of the result $R$: $P(R \mid t, n)$. Our formulation below is easiest to understand when the result ranges over a finite discrete set of values, and it can be generalized to any kind of data pairing a natural language template with a function that maps a set of operands to a result (e.g., a Python program; Mishra et al. 2022a).
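To make the quantity $P(R \mid t, n)$ concrete, here is one hedged way it could be estimated in practice: score each answer in a small candidate set as a continuation of the question with a causal language model, then normalize over the candidates. This is our illustrative sketch, not the paper's evaluation code; the model name, the candidate set, and the closed-set normalization are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def answer_distribution(question, candidates, model, tokenizer):
    """Estimate P(R | t, n) over a finite candidate set by summing the
    log-probabilities of each candidate's tokens given the question prefix."""
    scores = []
    q_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    for cand in candidates:
        ids = tokenizer(question + " " + str(cand), return_tensors="pt").input_ids  # [1, L]
        with torch.no_grad():
            logits = model(ids).logits                                              # [1, L, V]
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # row i predicts token i+1
        targets = ids[0, 1:]
        # Assumes the question's tokenization is a prefix of the full text's tokenization
        # (typically true for GPT-2-style BPE); sum the log-probs of the answer tokens only.
        answer_rows = torch.arange(q_len - 1, ids.shape[1] - 1)
        scores.append(log_probs[answer_rows, targets[q_len - 1:]].sum())
    probs = torch.softmax(torch.stack(scores), dim=0)
    return dict(zip(candidates, probs.tolist()))

# Illustrative usage; a larger model (e.g., GPT-J) would be closer to the setup of Figure 1.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
question = ("Kyle could fit 26 drawings on each page. "
            "If he has 11 pages, the number of drawings he can make is")
print(answer_distribution(question, [286, 143, 37], lm, tok))
```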
3 A Causal Framework
In this section, we describe our framework in three steps. First, we define the idea of model robustness on MWPs. Then, we identify possible do-interventions (Pearl, 1995) that we can perform. Finally, we describe the causal effects that we measure to quantify the robustness of various models.
3.1 Step 1. Question Reformulation
We address the research question “Is a model reasoning robustly on MWPs?” by comparing the causal mechanisms of the model's decisions to a hypothesized human reasoning mechanism. Note that we do not claim to know how humans reason about these problems. We simply propose a reasonable and intuitive way to judge model robustness, given a reasonable and intuitive human reasoning mechanism inspired by findings regarding the independence of language and mathematical reasoning in humans (Brannon, 2005; Monti et al., 2012).
Human Reasoning Mechanisms. The causal mechanisms of how humans might solve $q$ include

$$o = f_{\mathrm{abstract}}(q), \qquad (1)$$
$$g = f_o(n), \qquad (2)$$

where they first abstract the arithmetic operations $o$ from the problem $q$ by some cognitive process $f_{\mathrm{abstract}}$, and then apply the operations to the operands to obtain the result $g$. We show these mechanisms in the green subgraph $\mathcal{G}_h$ of Figure 2.
Model Reasoning Mechanisms. In contrast, the causal mechanisms of how a model might solve $q$ are as follows:

$$r = f_{\mathrm{blackBox}}(t, n), \qquad (3)$$

where we are unsure about (1) what part(s) of $t$ the model takes into account, and (2) how it operates over the relevant variables. Thus, we draw all possible causal mechanisms that might take place in the black-box model $f_{\mathrm{blackBox}}$ in the complete causal graph in Figure 2.
Some possible fine-grained causal mechanisms are:

1. The model might attend over the question template $t$ in two ways: paying attention to the text surface form $s$ via the causal path $T \to S \to R$, or to the text relevant to the math operations $o$ via the causal path $T \to O \to R$.

2. The model might also attend to the operands $n := (n_1, n_2, \ldots)$ via a causal path $N \to R$.

3. If the model learns the correct causal mechanisms as in the human cognitive process, it should capture how the operator and the operands matter to the ground-truth result $g$ (via $O \to G$ and $N \to G$), and then the model prediction should be sensitive to any changes in the ground truth, namely $G \to R$. No spurious correlations can directly affect $R$ without going through the mediator $G$.
Hence, to answer the question “How robust is the mathematical reasoning of a model on MWPs?”, we can answer the following subquestions:

1. How does $R$ change in response to $G$? By quantifying this, we assess the sensitivity (correct responsiveness) of the model to changes in the problem. In other words, does the model correctly adjust its prediction in response to a change in the correct solution of the problem?

2. What is the (unwanted) direct causal effect size of $S \to R$ and $N \to R$? We see these quantities as a measure of the brittleness (i.e., wrong responsiveness) of the model to result-preserving changes in the input. The lower the direct causal effect of $S$ and $N$, the more robust the model is (see the sketch below).
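To ground these two subquestions, the following standalone sketch (our illustration, not the paper's implementation) instantiates the addition template from Section 2 under two operand do-interventions: one that changes the ground truth $g$ (probing sensitivity via $G \to R$) and one that preserves $g$ (probing brittleness via the direct path $N \to R$).

```python
# The template T and operation set O = {("+", 1, 2)} are held fixed; we intervene only on N = (n1, n2).
TEMPLATE = ("Mark has {n1} trees in his backyard. "
            "If he plants {n2} more, how many trees will he have?")

def instantiate(n1, n2):
    """Return the instantiated question text and its ground-truth result g = n1 + n2."""
    return TEMPLATE.format(n1=n1, n2=n2), n1 + n2

base = instantiate(12, 13)  # original problem, g = 25
sens = instantiate(40, 7)   # do(N): g changes to 47 -> a sensitive model should change its prediction
brit = instantiate(20, 5)   # do(N): g is still 25   -> a robust model should not change its prediction

for text, g in (base, sens, brit):
    print(f"g = {g}: {text}")
```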
3.2 Step 2. Causal Intervention List
After formulating the cognitively-inspired subgraph $\mathcal{G}_h$ and defining the undesired causal paths in Figure 2, we list all feasible limited actions that allow us to perform our causal analysis. In the context of MWPs, we use the following interventions:

1. Direct intervention on all possible $n_1, n_2, \ldots$;

2. Partially controllable interventions on $T$. We can replace the template $T$ in two ways: