SELF-CONSISTENT REASONING FOR SOLVING MATH WORD PROBLEMS
Jing Xiong1, Zhongwei Wan2,3, Xiping Hu1, Min Yang2,3, Chengming Li1
1Sun Yat-Sen University, China
2University of Chinese Academy of Sciences, China
3SIAT, Chinese Academy of Sciences, China
xiongj69@mail2.sysu.edu.cn, {huxiping, lichengming}@mail.sysu.edu.cn, {zw.wan1, min.yang}@siat.ac.cn
ABSTRACT
Math word problem (MWP) solving is the task of automatically deriving a solution expression from a math problem given in text. Previous studies suffer from spurious correlations between the input text and the output expression. To mitigate this issue, we propose a self-consistent reasoning framework called SCR, which adopts a pruning strategy to correct the output distribution shift and thus implicitly fix spurious correlative samples. Specifically, we first obtain a sub-network by pruning a roberta2tree model, so that the gap between the output distributions of the original roberta2tree model and the pruned sub-network exposes spurious correlative samples. Then, we calibrate the output distribution shift by applying a symmetric Kullback-Leibler divergence to alleviate spurious correlations. In addition, SCR generates equivalent expressions, thereby capturing the logic of the original text rather than relying on surface hints from it. Extensive experiments on two large-scale benchmarks demonstrate that our model substantially outperforms strong baseline methods.
Index Terms— Math Word Problems, Spurious correlative samples, Pruning, Self-consistency
1. INTRODUCTION
Math word problem (MWP) solving [1] is a challenging symbolic logical reasoning task based on natural language descriptions, and it has recently drawn much attention from researchers studying the reasoning power of large language models [2, 3, 4, 5, 6, 7]. MWPs aim to automatically solve mathematical questions given in natural language, which requires the model not only to understand the natural language but also to reason logically. Table 1 shows several examples of MWPs.
At present, three paradigms of models have achieved excellent performance: seq2seq [8, 9, 10, 11, 12, 13], seq2tree [14, 15], and complex relation extraction [16]. However, all three paradigms suffer from spurious correlations [17, 18, 16]. Take Table 1 as an example: previous works may obtain the same form of mathematical formula, "a÷b×c", for Problem 1 and Problem 2, owing to the similar semantic context (e.g., calculating an amount of money). However, misled by such spurious correlations, the models tend to generate a wrong solution expression for Problem 3, which carries very similar semantic information, such as the words "money", "bank", and "account" that also appear in Problems 1 and 2. Specifically, models that learn spurious information across Problems 1 to 3 are more likely to generate the wrong expression "12500÷5%×15%" instead of "12500÷(5%+15%)" for Problem 3, which asks for the total money in the account.
Problem 1: Tom takes the money from his bank account and has taken 240 dollars of his account for 3 days. If he takes the same amount of money every day, how much money will Tom take for the next 2 days?
Solution Expression: 240÷3×2    Solution: 160

Problem 2: Sherry has deposited 6000 dollars to the bank for the last 5 months. If she saves the same monthly money, how much will she add to the account in the next 3 months?
Solution Expression: 6000÷5×3    Solution: 3600

Problem 3: Uncle Jack spends 5% of his bank account to invest for the trust funds of States and 15% of the account for the shares of Apple Inc. The money he has spent on financial management is 12500 dollars. How much money is in Uncle Jack's account?
Solution Expression: 12500÷(5%+15%)    Solution: 62500
Wrong Solution Expression: 12500÷5%×15%

Table 1. Typical math word problem examples of spurious correlation.
Some recent models address this problem by using variational information bottlenecks [15]. We consider this problem from the perspective of memorization. Recent work has revealed that pruning can make a model forget hard-to-memorize samples [19]. In addition, [20] showed that long-tailed samples are easily forgotten. Such long-tailed samples easily confuse the model, and the model then generates the final result based on shallow hints. A natural hypothesis is that some spurious correlative samples are harder for the model to learn well due to shortcuts, and these samples can be adaptively exposed by pruning [20]. A key question for MWPs is then how to implicitly correct the shortcuts between expressions and original texts when spurious correlative samples are exposed through pruning. Work on the reasoning ability of large models has also revealed that encouraging a model to produce self-consistent outputs can effectively improve reasoning performance when the model produces multiple inferences [2, 3, 4]. However, that work uses voting to encourage self-consistency, which cannot adaptively correct the shortcuts between expressions and original texts online through the loss function.
In this paper, we propose a self-consistent reasoning framework (called SCR) to solve the MWP task. We obtain a sub-network by pruning the roberta2tree model, which we denote as the source network. Our SCR model adaptively finds spurious correlative samples through pruning. Specifically, SCR encourages prediction consistency between the two models through mutual learning [21, 22], which emphasizes samples with inconsistent prediction distributions between the source network and the sub-network.
We summarize our main contributions as follows: (1) We propose a novel self-consistent reasoning framework for MWPs that exposes spurious correlative samples and corrects them adaptively. (2) We conduct extensive experiments on two benchmark datasets (i.e., Math23k and Ape210k). The results demonstrate that our model performs significantly better than strong baselines.
2. METHODOLOGY
A math word problem (MWP) can be denoted by a mapping $F: W \mapsto Y$, where $W = \{w_1, w_2, \dots, w_m\}$ is the problem sequence with $m$ words and $Y = \{y_1, y_2, \dots, y_n\}$ is the solution expression of the problem with $n$ tokens. MWPs aim to establish a model $F$ that generates a correct solution expression $Y$ and calculates the correct answer for the problem $W$.
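For illustration, a concrete instance of the mapping $F: W \mapsto Y$ is sketched below (adapted from Problem 1 in Table 1; the token lists are our simplification, not a sample from the datasets):

```python
# Problem sequence W (m words) and its solution expression Y, written in
# pre-order so that a tree decoder can generate it node by node.
W = ["Tom", "has", "taken", "240", "dollars", "from", "his", "account",
     "for", "3", "days", ".", "How", "much", "will", "he", "take",
     "in", "the", "next", "2", "days", "?"]
Y = ["*", "/", "240", "3", "2"]   # pre-order traversal of 240 / 3 * 2
```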
As illustrated in Figure 1, the proposed SCR is composed of a source network (denoted as $S$) and a sub-network (denoted as $C$). The sub-network is obtained by pruning the source network. The two networks are optimized collaboratively and teach each other throughout the training process. We use the encoder-decoder framework as the backbone of both the source network and the sub-network.
2.1. The Encoder-Decoder Architecture
To efficiently obtain a high-quality representation of the problem, we utilize the RoBERTa model [23] as our encoder. We pass the problem sequence $W$ into the RoBERTa model and obtain the problem representation $H \in \mathbb{R}^{m \times d}$, where $d$ is the embedding size of the encoder. To model the relationships between the quantities in the pre-trained model, we set up a learnable quantity embedding matrix $T_E = \{t_1, t_2, \dots, t_n\}$, similar to the learnable position embeddings in BERT [24]. Before passing the sequence $W$ into the encoder, we first replace each quantity in $W$ with a token $t_i \in T_E$.
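For concreteness, a minimal sketch of this preprocessing and encoding step with the HuggingFace `transformers` library is shown below; the regex, placeholder strings, checkpoint name, and helper names are our assumptions rather than the authors' released code.

```python
# Minimal sketch of quantity-token replacement and RoBERTa encoding (Sec. 2.1).
import re
import torch
from transformers import RobertaModel, RobertaTokenizer

def replace_quantities(text: str):
    """Replace each numeric quantity in the problem text with a [NUMi] token."""
    quantities = re.findall(r"\d+\.?\d*%?", text)
    for i, q in enumerate(quantities):
        text = text.replace(q, f"[NUM{i + 1}]", 1)
    return text, quantities

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")

problem = "Uncle Jack spends 5% of his bank account ... is 12500 dollars."
masked, nums = replace_quantities(problem)
tokenizer.add_tokens([f"[NUM{i + 1}]" for i in range(len(nums))])
encoder.resize_token_embeddings(len(tokenizer))  # new, learnable quantity embeddings

with torch.no_grad():
    inputs = tokenizer(masked, return_tensors="pt")
    H = encoder(**inputs).last_hidden_state      # problem representation H, shape (1, m, d)
```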
Inspired by the GTS model [25], our decoder recursively constructs $Y$ in pre-order traversal. First, the decoder generates the root node $t_{root}$ (the middle operator part). Then, the decoder generates the left child node $t_l$, followed by the right child node $t_r$. This process is iterated until the leaf nodes are generated. Specifically, we apply an attention mechanism to learn the global context vector $G_i$, which is used to generate the current node token $\hat{y}_i$. We denote the digital embeddings produced by the encoder as $T$. The attention mechanism is formulated as follows:

$$
G_i =
\begin{cases}
\text{Attention}(H, t_{root}, t_l), & t_l \notin \emptyset, \\
\text{Attention}(H, t_{root}, t_{sl}), & t_{sl} \notin \emptyset, \\
\text{Attention}(H, t_{root}), & t_l, t_{sl} \in \emptyset,
\end{cases}
\tag{1}
$$

$$
\hat{y}_i = \text{Predict}(G_i, T), \tag{2}
$$

where $\text{Predict}(\cdot)$ is the final prediction layer for producing the tree node.
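A minimal PyTorch sketch of Eqs. (1)-(2) is given below; the additive combination of the goal and child embeddings, the scoring network, and all dimensions are illustrative assumptions rather than the exact GTS/SCR implementation.

```python
# Sketch of Eqs. (1)-(2): attention over H to form the global context vector
# G_i, followed by scoring of candidate node embeddings T.
import torch
import torch.nn as nn

class GoalAttention(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * d, d), nn.Tanh(), nn.Linear(d, 1))

    def forward(self, H: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # H: (m, d) token representations, query: (d,) decoding goal.
        q = query.expand(H.size(0), -1)
        alpha = torch.softmax(self.score(torch.cat([H, q], dim=-1)), dim=0)
        return (alpha * H).sum(dim=0)             # global context vector G_i

def goal_query(t_root, t_l=None, t_sl=None):
    # Eq. (1): the attention query conditions on the left child t_l or the left
    # sub-tree embedding t_sl when one is available (addition is an assumption).
    if t_l is not None:
        return t_root + t_l
    if t_sl is not None:
        return t_root + t_sl
    return t_root

def predict(G_i: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    # Eq. (2): distribution over candidate tree nodes given G_i and embeddings T.
    return torch.softmax(T @ G_i, dim=-1)
```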
[Figure 1: MWP text $W$ with quantities replaced by tokens (e.g., "Uncle Jack spends [NUM1] of his bank account ... [NUM2] of the account ... is [NUM3] ...") → RoBERTa encoder → $H$ → tree decoders of the source network and the pruned sub-network, whose output distributions $P_1$ and $P_2$ are aligned with KL divergence.]
Fig. 1. Overview of the proposed framework SCR.

If the current node is an operator, the tree decoder generates the left and right child nodes and pushes them onto the stack in a top-down manner. If the current node is a number, the merge operation is carried out until a leaf node emerges from the stack, at which point the result is pushed onto the stack of left child nodes for the attention operation. The merge operation then pops the required nodes $t_{op}$ and $t_{subtree}$ from an embedding stack. The recursive construction is formulated as follows:
$$
t_l = \text{Left}(G_i, \hat{y}_i, t_{root}), \tag{3}
$$
$$
t_r = \text{Right}(G_i, \hat{y}_i, t_{root}), \tag{4}
$$
$$
t_m = \text{Merge}(t_{op}, t_{subtree}, t_{m-1}). \tag{5}
$$
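The top-down generation and stack handling described above can be summarized by the following sketch; `decode_step`, the operator set, and the control flow are simplified assumptions standing in for Eqs. (1)-(5), not the authors' exact code.

```python
# Pre-order (top-down) expression generation with an explicit stack.
OPERATORS = {"+", "-", "*", "/", "^"}

def decode_expression(decode_step, root_goal, max_len=64):
    expression, stack = [], [root_goal]          # stack of pending goal vectors
    while stack and len(expression) < max_len:
        goal = stack.pop()
        token, left_goal, right_goal = decode_step(goal)   # Eqs. (1)-(4)
        expression.append(token)
        if token in OPERATORS:
            stack.append(right_goal)             # push right first ...
            stack.append(left_goal)              # ... so the left child is expanded next
        # if token is a number, sub-trees are merged inside decode_step (Eq. (5))
    return expression
```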
2.2. Self-consistent Reasoning
As shown in Figure 1, the proposed SCR comprises a source network and a sub-network, where the sub-network is obtained by pruning the source network. In each iteration, the source network corrects the distribution shift of the output $p_2$ from the sub-network, which implicitly emphasizes the spurious correlative samples. Once the sub-network has been trained in this iteration, it in turn provides supervision signals to correct the distribution shift of the output $p_1$ from the source network. Specifically, we preferentially fix the output distribution of the sub-network when neither network has yet been trained on the samples, so as to better expose spurious correlative samples. At the same time, both networks are also trained with ground-truth supervision signals.
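A minimal sketch of this mutual-learning signal is shown below; the use of token-level logits, the `batchmean` reduction, and the weighting `lam` are our assumptions about how the symmetric KL term could be combined with the ground-truth losses of Eqs. (6)-(7) below.

```python
import torch
import torch.nn.functional as F

def symmetric_kl(p1_logits: torch.Tensor, p2_logits: torch.Tensor) -> torch.Tensor:
    """Symmetric Kullback-Leibler divergence between the two output distributions."""
    log_p1 = F.log_softmax(p1_logits, dim=-1)
    log_p2 = F.log_softmax(p2_logits, dim=-1)
    kl_12 = F.kl_div(log_p2, log_p1.exp(), reduction="batchmean")  # KL(p1 || p2)
    kl_21 = F.kl_div(log_p1, log_p2.exp(), reduction="batchmean")  # KL(p2 || p1)
    return kl_12 + kl_21

def scr_loss(p1_logits, p2_logits, targets, lam=1.0):
    # Ground-truth NLL for both networks plus the symmetric KL consistency term
    # that emphasizes samples where the two models disagree.
    nll_s = F.cross_entropy(p1_logits, targets)
    nll_c = F.cross_entropy(p2_logits, targets)
    return nll_s + nll_c + lam * symmetric_kl(p1_logits, p2_logits)
```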
Formally, the training objective of the source network is to minimize the negative log-likelihood (NLL) loss for each instance $(W, Y)$ from the training data:

$$
\mathcal{L}_S(\theta_S) = - \sum_{i=1}^{n} y_i \log p(\hat{y}_i \mid W; \theta_S), \tag{6}
$$

where $y_i$ is the ground truth at step $i$ and $\theta_S$ denotes the parameters of the source network.
We prune the model parameters $\theta_S$ of the source network to obtain the parameters $\theta_C$ of the sub-network. The training objective of the sub-network $C$ can be defined as:

$$
\mathcal{L}_C(\theta_C) = - \sum_{i=1}^{n} y_i \log p(\hat{y}_i \mid W; \theta_C). \tag{7}
$$
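For illustration, one plausible way to obtain the sub-network's parameters $\theta_C$ from $\theta_S$ is global magnitude pruning with `torch.nn.utils.prune`; the pruning criterion, the pruned module types, and the ratio below are our assumptions, not the paper's setting.

```python
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

def build_subnetwork(source: nn.Module, amount: float = 0.3) -> nn.Module:
    """Copy the source network and globally prune a fraction of its linear weights."""
    sub = copy.deepcopy(source)
    targets = [(m, "weight") for m in sub.modules() if isinstance(m, nn.Linear)]
    prune.global_unstructured(targets, pruning_method=prune.L1Unstructured, amount=amount)
    for module, name in targets:
        prune.remove(module, name)   # bake the pruning masks into the weights
    return sub
```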