
A Causal Framework to Quantify the Robustness of
Mathematical Reasoning with Language Models
Alessandro Stolfo∗
ETH Zürich
stolfoa@ethz.ch
Zhijing Jin∗
MPI & ETH Zürich
jinzhi@ethz.ch
Kumar Shridhar
ETH Zürich
shkumar@ethz.ch
Bernhard Schölkopf
MPI & ETH Zürich
bs@tue.mpg.de
Mrinmaya Sachan
ETH Zürich
msachan@ethz.ch
Abstract
We have recently witnessed a number of impressive results on hard mathematical reasoning problems with language models. At the same time, the robustness of these models has also been called into question; recent works have shown that models can rely on shallow patterns in the problem description when generating a solution. Building on the idea of behavioral testing, we propose a novel framework that pins down the causal effect of various factors in the input (e.g., the surface form of the problem text, the operands, and the math operators) on the output solution. By grounding the behavioral analysis in a causal graph describing an intuitive reasoning process, we study the behavior of language models in terms of robustness and sensitivity to direct interventions in the input space. We apply our framework to a test bed of math word problems. Our analysis shows that robustness does not appear to improve continuously as a function of model size, but the GPT-3 Davinci models (175B) achieve a dramatic improvement in both robustness and sensitivity compared to all other GPT variants.1
1 Introduction
Many natural language understanding situations, such as understanding financial news, require reasoning with text that includes numbers. However, such mathematical reasoning is challenging for NLP models (Cobbe et al., 2021; Mishra et al., 2022b). Mathematical reasoning for text has been an active area of research for a while (Seo et al., 2015; Sachan and Xing, 2017; Sachan et al., 2017, 2018, inter alia), and has also emerged as a key task for tracking the capabilities of large language models (LLMs) in recent years (Brown et al., 2020; Ouyang et al., 2022; Wei et al., 2022a, inter alia).
∗ Equal contribution.
1 Our code and data are available at https://github.com/alestolfo/causal-math.
[Figure 1: Through our framework, we conduct do-interventions on the input and evaluate the change in the distribution P(R) of the prediction R by LLMs (in this figure, GPT-J). This allows us to measure the causal effect of each factor in the input on the model's response. In the example, the original problem "Kyle could fit n1 = 26 drawings on each page. If he has n2 = 11 pages, the number of drawings he can make is ___." yields the correct prediction 286 = g with P(286) = 0.085; after a do-intervention that changes the operands to n1 = 2 and n2 = 143 while keeping the ground truth g = 286, the model predicts 143 (incorrect) and P(286) drops to 0.001.]
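To make the intervention in Figure 1 concrete, the following is a minimal sketch (not the authors' released implementation) of how such a do-intervention can be scored with an off-the-shelf causal language model: the operands n1 and n2 are changed while the ground-truth result g = 286 is kept fixed, and we compare the log-probability the model assigns to g before and after the change. The model choice (gpt2 via HuggingFace transformers), the simplified prompt template, and the answer_logprob helper are illustrative assumptions; the paper itself studies larger models such as GPT-J and GPT-3.

```python
# Minimal sketch of a do-intervention on the operands of a math word problem,
# in the spirit of Figure 1. Assumptions: gpt2 as the scored model and the
# answer_logprob helper; neither is taken from the authors' code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Simplified version of the Figure 1 problem; the blank is left for the model.
TEMPLATE = ("Kyle could fit {n1} drawings on each page. "
            "If he has {n2} pages, the number of drawings he can make is")

def answer_logprob(prompt: str, answer: str) -> float:
    """Log-probability of the answer tokens given the prompt (teacher forcing)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(" " + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # The logit at position t predicts the token at position t + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_positions = range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1)
    return sum(log_probs[pos, input_ids[0, pos + 1]].item()
               for pos in answer_positions)

# Original problem: n1 = 26, n2 = 11, ground truth g = 26 * 11 = 286.
original = TEMPLATE.format(n1=26, n2=11)
# do-intervention: change the operands but keep the same ground truth (2 * 143 = 286).
intervened = TEMPLATE.format(n1=2, n2=143)

for name, prompt in [("original", original), ("intervened", intervened)]:
    print(name, "log P(286) =", round(answer_logprob(prompt, "286"), 3))
```

Scoring the fixed ground-truth answer under both prompts, rather than sampling a completion, makes the comparison deterministic and directly yields the change in P(g) that the framework analyzes.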
However, despite the impressive performance of LLMs on various math reasoning benchmarks (e.g., Ouyang et al., 2022; Chowdhery et al., 2022), it remains unclear whether these models have learned mere artifacts in the data or have truly mastered the mathematical concepts needed to consistently solve all variations of the same problem (Patel et al., 2021; Razeghi et al., 2022; Welleck et al., 2022).
In sharp contrast to the large number of papers on improving the performance of LLMs on various types of math-based problems, there has been little effort on the behavioral analysis of LLMs for these tasks. Existing methods for understanding the robustness of these models (Patel et al., 2021) rely on manually constructing variations of math problems, and we do not yet have a principled, comprehensive framework for quantifying such robustness.
Thus, in this work, we propose a formal framework based on causal inference to quantify the robustness of NLP models' math reasoning abilities. Specifically, we describe a causal graph formulation of math reasoning, where the graph allows us to measure the difference in the structural causal