
A Causal Framework to Quantify the Robustness of
Mathematical Reasoning with Language Models
Alessandro Stolfo∗
ETH Zürich
stolfoa@ethz.ch
Zhijing Jin∗
MPI & ETH Zürich
jinzhi@ethz.ch
Kumar Shridhar
ETH Zürich
shkumar@ethz.ch
Bernhard Schölkopf
MPI & ETH Zürich
bs@tue.mpg.de
Mrinmaya Sachan
ETH Zürich
msachan@ethz.ch
Abstract
We have recently witnessed a number of impressive results on hard mathematical reasoning problems with language models. At the same time, the robustness of these models has also been called into question; recent works have shown that models can rely on shallow patterns in the problem description when generating a solution. Building on the idea of behavioral testing, we propose a novel framework that pins down the causal effect of various factors in the input (e.g., the surface form of the problem text, the operands, and the math operators) on the output solution. By grounding the behavioral analysis in a causal graph describing an intuitive reasoning process, we study the behavior of language models in terms of robustness and sensitivity to direct interventions in the input space. We apply our framework to a test bed of math word problems. Our analysis shows that robustness does not appear to improve continuously as a function of model size, but the GPT-3 Davinci models (175B) achieve a dramatic improvement in both robustness and sensitivity compared to all other GPT variants.1
1 Introduction
Many natural language understanding situations, such as understanding financial news, require reasoning with text that includes numbers. However, such mathematical reasoning is challenging for NLP models (Cobbe et al., 2021; Mishra et al., 2022b). Mathematical reasoning for text has been an active area of research for a while (Seo et al., 2015; Sachan and Xing, 2017; Sachan et al., 2017, 2018, inter alia), and has also emerged as a key task for tracking the capabilities of large language models (LLMs) in recent years (Brown et al., 2020; Ouyang et al., 2022; Wei et al., 2022a, inter alia).
∗ Equal contribution.
1 Our code and data are available at https://github.com/alestolfo/causal-math.
[Figure 1: Through our framework, we conduct do-interventions on the input and evaluate the change in the distribution P(R) of the prediction R by LLMs (in this figure, GPT-J). This allows us to measure the causal effect of each factor in the input on the model's response. In the example, the original problem "Kyle could fit n1 = 26 drawings on each page. If he has n2 = 11 pages, the number of drawings he can make is ___." yields the correct prediction 286 = g with P(286) = 0.085; after a do-intervention that changes the operands to n1 = 2 and n2 = 143 while keeping the ground truth g = 286, the model predicts 143 (incorrect) and P(286) drops to 0.001.]
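To make the intervention in Figure 1 concrete, the following is a minimal sketch (not the authors' released implementation) of how such a do-intervention can be scored with an off-the-shelf causal language model: the operands n1 and n2 are changed while the ground-truth result g = 286 is kept fixed, and we compare the log-probability the model assigns to g before and after the change. The model choice (gpt2 via HuggingFace transformers), the simplified prompt template, and the answer_logprob helper are illustrative assumptions; the paper itself studies larger models such as GPT-J and GPT-3.

```python
# Minimal sketch of a do-intervention on the operands of a math word problem,
# in the spirit of Figure 1. Assumptions: gpt2 as the scored model and the
# answer_logprob helper; neither is taken from the authors' code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Simplified version of the Figure 1 problem; the blank is left for the model.
TEMPLATE = ("Kyle could fit {n1} drawings on each page. "
            "If he has {n2} pages, the number of drawings he can make is")

def answer_logprob(prompt: str, answer: str) -> float:
    """Log-probability of the answer tokens given the prompt (teacher forcing)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(" " + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # The logit at position t predicts the token at position t + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_positions = range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1)
    return sum(log_probs[pos, input_ids[0, pos + 1]].item()
               for pos in answer_positions)

# Original problem: n1 = 26, n2 = 11, ground truth g = 26 * 11 = 286.
original = TEMPLATE.format(n1=26, n2=11)
# do-intervention: change the operands but keep the same ground truth (2 * 143 = 286).
intervened = TEMPLATE.format(n1=2, n2=143)

for name, prompt in [("original", original), ("intervened", intervened)]:
    print(name, "log P(286) =", round(answer_logprob(prompt, "286"), 3))
```

Scoring the fixed ground-truth answer under both prompts, rather than sampling a completion, makes the comparison deterministic and directly yields the change in P(g) that the framework analyzes.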
However, despite the impressive performance of LLMs on various math reasoning benchmarks (e.g., Ouyang et al., 2022; Chowdhery et al., 2022), it remains unclear whether these models have learned mere artifacts in the data or have truly mastered the mathematical concepts needed to consistently solve all variations of the same problem (Patel et al., 2021; Razeghi et al., 2022; Welleck et al., 2022).
In sharp contrast to the large number of papers on improving the performance of LLMs on various types of math-based problems, there has been little effort on the behavioral analysis of LLMs for these tasks. Existing methods for understanding the robustness of these models (Patel et al., 2021) rely on manually constructing variations of math problems, and we do not yet have a principled, comprehensive framework for quantifying such robustness.
Thus, in this work, we propose a formal framework based on causal inference to quantify the robustness of NLP models' math reasoning abilities. Specifically, we describe a causal graph formulation of math reasoning, where the graph allows us to measure the difference in the structural causal