SELF-CONSISTENT REASONING FOR SOLVING MATH WORD PROBLEMS
Jing Xiong1, Zhongwei Wan2,3, Xiping Hu1, Min Yang2,3, Chengming Li1
1Sun Yat-Sen University, China
2University of Chinese Academy of Sciences, China
3SIAT, Chinese Academy of Sciences, China
xiongj69@mail2.sysu.edu.cn, {huxiping, lichengming}@mail.sysu.edu.cn, {zw.wan1, min.yang}@siat.ac.cn
ABSTRACT
Math word problem (MWP) solving is the task of automatically deriving a solution expression from a math problem given in text. Previous studies suffer from spurious correlations between the input text and the output expression. To mitigate this issue, we propose a self-consistent reasoning framework called SCR, which adopts a pruning strategy to correct the output distribution shift and thus implicitly fix spurious correlative samples. Specifically, we first obtain a sub-network by pruning a roberta2tree model, so that the gap between the output distributions of the original roberta2tree model and the pruned sub-network exposes spurious correlative samples. Then, we calibrate the output distribution shift by applying a symmetric Kullback-Leibler divergence to alleviate spurious correlations. In addition, SCR generates equivalent expressions, thereby capturing the logic of the original text rather than relying on surface hints from it. Extensive experiments on two large-scale benchmarks demonstrate that our model substantially outperforms strong baseline methods.
Index Terms— Math Word Problems, Spurious correlative samples, Pruning, Self-consistency
1. INTRODUCTION
Math word problem (MWP) solving [1] is a challenging symbolic logical reasoning task based on natural language descriptions, and it has recently drawn much attention from researchers studying the reasoning power of large language models [2, 3, 4, 5, 6, 7]. MWPs aim to automatically solve mathematical questions given in natural language, which requires the model not only to understand the natural language but also to reason logically. Table 1 shows several examples of MWPs.
At present, three paradigms of models have achieved excellent performance: seq2seq [8, 9, 10, 11, 12, 13], seq2tree [14, 15], and complex relation extraction [16]. However, all three paradigms suffer from spurious correlations [17, 18, 16]. Take Table 1 as an example: previous works may obtain the same form of mathematical formula, "a÷b×c", for Problem 1 and Problem 2, owing to the similar semantic context (e.g., calculating an amount of money). However, misled by such spurious correlations, the models tend to generate a wrong solution expression for Problem 3, which carries very similar semantic information, such as the words "money", "bank", and "account" that also appear in Problems 1 and 2. Specifically, models that learn spurious information across Problems 1 to 3 are more likely to generate the wrong expression "12500÷5%×15%" instead of "12500÷(5%+15%)" for Problem 3, which asks for the total money in the account.
Problem 1: Tom takes the money from his bank account and has taken 240 dollars of his account for 3 days. If he takes the same amount of money every day, how much money will Tom take for the next 2 days?
Solution Expression: 240÷3×2    Solution: 160

Problem 2: Sherry has deposited 6000 dollars to the bank for the last 5 months. If she saves the same monthly money, how much will she add to the account in the next 3 months?
Solution Expression: 6000÷5×3    Solution: 3600

Problem 3: Uncle Jack spends 5% of his bank account to invest for the trust funds of States and 15% of the account for the shares of Apple Inc. The money he has spent on financial management is 12500 dollars. How much money is in Uncle Jack's account?
Solution Expression: 12500÷(5%+15%)    Solution: 62500
Wrong Solution Expression: 12500÷5%×15%

Table 1. Typical math word problem examples of spurious correlation.
Some recent models address this problem by using variational information bottlenecks [15]. We consider this problem from the perspective of memorization. Recent work has revealed that pruning can make a model forget hard-to-memorize samples [19]. In addition, [20] showed that long-tailed samples are easily forgotten. Such long-tailed samples easily confuse the model, and the model then generates the final result based on shallow hints. A natural hypothesis is that some spurious correlative samples are harder for the model to learn well due to shortcuts, and these samples can be adaptively exposed by pruning [20]. A key question for MWPs is then how to implicitly correct the shortcuts between expressions and original texts when spurious correlative samples are exposed through pruning. Work on the reasoning ability of large models has also revealed that encouraging a model to produce self-consistent outputs can effectively improve reasoning performance when the model produces multiple inferences [2, 3, 4]. However, that work uses voting to encourage self-consistency, which cannot adaptively correct the shortcuts between expressions and original texts online through the loss function.
In this paper, we propose a self-consistent reasoning framework (called SCR) to solve the MWP task. We obtain a sub-network by pruning the roberta2tree model, which we denote as the source network. Our SCR model adaptively finds spurious correlative samples through pruning. Specifically, SCR encourages prediction consistency between the two models through mutual learning [21, 22], which emphasizes samples with inconsistent prediction distributions between the source network and the sub-network.
We summarize our main contributions as follows: (1) We propose a novel self-consistent reasoning framework for MWPs that exposes spurious correlative samples and corrects them adaptively. (2) We conduct extensive experiments on two benchmark datasets (i.e., Math23k and Ape210k). The results demonstrate that our model performs significantly better than strong baselines.
2. METHODOLOGY
A math word problem (MWP) can be denoted by a mapping $F: W \mapsto Y$, where $W = \{w_1, w_2, \dots, w_m\}$ is the problem sequence with $m$ words and $Y = \{y_1, y_2, \dots, y_n\}$ is the solution expression of the problem with $n$ tokens. MWPs aim to establish a model $F$ that generates a correct solution expression $Y$ and calculates the correct answer for the problem $W$.
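For illustration, a concrete instance of the mapping $F: W \mapsto Y$ is sketched below (adapted from Problem 1 in Table 1; the token lists are our simplification, not a sample from the datasets):

```python
# Problem sequence W (m words) and its solution expression Y, written in
# pre-order so that a tree decoder can generate it node by node.
W = ["Tom", "has", "taken", "240", "dollars", "from", "his", "account",
     "for", "3", "days", ".", "How", "much", "will", "he", "take",
     "in", "the", "next", "2", "days", "?"]
Y = ["*", "/", "240", "3", "2"]   # pre-order traversal of 240 / 3 * 2
```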
As illustrated in Figure 1, the proposed SCR is composed of a source network (denoted as $S$) and a sub-network (denoted as $C$). The sub-network is obtained by pruning the source network. The two networks are optimized collaboratively and teach each other throughout the training process. We use the encoder-decoder framework as the backbone of both the source network and the sub-network.
2.1. The Encoder-Decoder Architecture
To efficiently obtain a high-quality representation of the problem, we utilize the RoBERTa model [23] as our encoder. We pass the problem sequence $W$ into the RoBERTa model and obtain the problem representation $H \in \mathbb{R}^{m \times d}$, where $d$ is the embedding size of the encoder. To model the relationships between the quantities in the pre-trained model, we set up a learnable quantity embedding matrix $T_E = \{t_1, t_2, \dots, t_n\}$, similar to the learnable position embeddings in BERT [24]. Before passing the sequence $W$ into the encoder, we first replace each quantity in $W$ with a token $t_i \in T_E$.
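For concreteness, a minimal sketch of this preprocessing and encoding step with the HuggingFace `transformers` library is shown below; the regex, placeholder strings, checkpoint name, and helper names are our assumptions rather than the authors' released code.

```python
# Minimal sketch of quantity-token replacement and RoBERTa encoding (Sec. 2.1).
import re
import torch
from transformers import RobertaModel, RobertaTokenizer

def replace_quantities(text: str):
    """Replace each numeric quantity in the problem text with a [NUMi] token."""
    quantities = re.findall(r"\d+\.?\d*%?", text)
    for i, q in enumerate(quantities):
        text = text.replace(q, f"[NUM{i + 1}]", 1)
    return text, quantities

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")

problem = "Uncle Jack spends 5% of his bank account ... is 12500 dollars."
masked, nums = replace_quantities(problem)
tokenizer.add_tokens([f"[NUM{i + 1}]" for i in range(len(nums))])
encoder.resize_token_embeddings(len(tokenizer))  # new, learnable quantity embeddings

with torch.no_grad():
    inputs = tokenizer(masked, return_tensors="pt")
    H = encoder(**inputs).last_hidden_state      # problem representation H, shape (1, m, d)
```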
Inspired by the GTS model [25], our decoder recursively constructs $Y$ in pre-order traversal. First, the decoder generates the root node $t_{root}$ (the middle operator part). Then, the decoder generates the left child node $t_l$, followed by the right child node $t_r$. This process is iterated until the leaf nodes are generated. Specifically, we apply an attention mechanism to learn the global context vector $G_i$, which is used to generate the current node token $\hat{y}_i$. We denote the digital embeddings produced by the encoder as $T$. The attention mechanism is formulated as follows:

$$
G_i =
\begin{cases}
\text{Attention}(H, t_{root}, t_l), & t_l \notin \emptyset, \\
\text{Attention}(H, t_{root}, t_{sl}), & t_{sl} \notin \emptyset, \\
\text{Attention}(H, t_{root}), & t_l, t_{sl} \in \emptyset,
\end{cases}
\tag{1}
$$

$$
\hat{y}_i = \text{Predict}(G_i, T), \tag{2}
$$

where $\text{Predict}(\cdot)$ is the final prediction layer for producing the tree node.
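A minimal PyTorch sketch of Eqs. (1)-(2) is given below; the additive combination of the goal and child embeddings, the scoring network, and all dimensions are illustrative assumptions rather than the exact GTS/SCR implementation.

```python
# Sketch of Eqs. (1)-(2): attention over H to form the global context vector
# G_i, followed by scoring of candidate node embeddings T.
import torch
import torch.nn as nn

class GoalAttention(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * d, d), nn.Tanh(), nn.Linear(d, 1))

    def forward(self, H: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # H: (m, d) token representations, query: (d,) decoding goal.
        q = query.expand(H.size(0), -1)
        alpha = torch.softmax(self.score(torch.cat([H, q], dim=-1)), dim=0)
        return (alpha * H).sum(dim=0)             # global context vector G_i

def goal_query(t_root, t_l=None, t_sl=None):
    # Eq. (1): the attention query conditions on the left child t_l or the left
    # sub-tree embedding t_sl when one is available (addition is an assumption).
    if t_l is not None:
        return t_root + t_l
    if t_sl is not None:
        return t_root + t_sl
    return t_root

def predict(G_i: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    # Eq. (2): distribution over candidate tree nodes given G_i and embeddings T.
    return torch.softmax(T @ G_i, dim=-1)
```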
[Figure 1: MWP text $W$ with quantities replaced by tokens (e.g., "Uncle Jack spends [NUM1] of his bank account ... [NUM2] of the account ... is [NUM3] ...") → RoBERTa encoder → $H$ → tree decoders of the source network and the pruned sub-network, whose output distributions $P_1$ and $P_2$ are aligned with KL divergence.]
Fig. 1. Overview of the proposed framework SCR.

If the current node is an operator, the tree decoder generates the left and right child nodes and pushes them onto the stack in a top-down manner. If the current node is a number, the merge operation is carried out until a leaf node emerges from the stack, at which point the result is pushed onto the stack of left child nodes for the attention operation. The merge operation then pops the required nodes $t_{op}$ and $t_{subtree}$ from an embedding stack. The recursive construction is formulated as follows:
$$
t_l = \text{Left}(G_i, \hat{y}_i, t_{root}), \tag{3}
$$
$$
t_r = \text{Right}(G_i, \hat{y}_i, t_{root}), \tag{4}
$$
$$
t_m = \text{Merge}(t_{op}, t_{subtree}, t_{m-1}). \tag{5}
$$
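The top-down generation and stack handling described above can be summarized by the following sketch; `decode_step`, the operator set, and the control flow are simplified assumptions standing in for Eqs. (1)-(5), not the authors' exact code.

```python
# Pre-order (top-down) expression generation with an explicit stack.
OPERATORS = {"+", "-", "*", "/", "^"}

def decode_expression(decode_step, root_goal, max_len=64):
    expression, stack = [], [root_goal]          # stack of pending goal vectors
    while stack and len(expression) < max_len:
        goal = stack.pop()
        token, left_goal, right_goal = decode_step(goal)   # Eqs. (1)-(4)
        expression.append(token)
        if token in OPERATORS:
            stack.append(right_goal)             # push right first ...
            stack.append(left_goal)              # ... so the left child is expanded next
        # if token is a number, sub-trees are merged inside decode_step (Eq. (5))
    return expression
```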
2.2. Self-consistent Reasoning
As shown in Figure 1, the proposed SCR comprises a source network and a sub-network, where the sub-network is obtained by pruning the source network. In each iteration, the source network corrects the distribution shift of the output $p_2$ from the sub-network, which implicitly emphasizes the spurious correlative samples. Once the sub-network has been trained in this iteration, it in turn provides supervision signals to correct the distribution shift of the output $p_1$ from the source network. Specifically, we preferentially fix the output distribution of the sub-network when neither network has yet been trained on the samples, so as to better expose spurious correlative samples. At the same time, both networks are also trained with ground-truth supervision signals.
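A minimal sketch of this mutual-learning signal is shown below; the use of token-level logits, the `batchmean` reduction, and the weighting `lam` are our assumptions about how the symmetric KL term could be combined with the ground-truth losses of Eqs. (6)-(7) below.

```python
import torch
import torch.nn.functional as F

def symmetric_kl(p1_logits: torch.Tensor, p2_logits: torch.Tensor) -> torch.Tensor:
    """Symmetric Kullback-Leibler divergence between the two output distributions."""
    log_p1 = F.log_softmax(p1_logits, dim=-1)
    log_p2 = F.log_softmax(p2_logits, dim=-1)
    kl_12 = F.kl_div(log_p2, log_p1.exp(), reduction="batchmean")  # KL(p1 || p2)
    kl_21 = F.kl_div(log_p1, log_p2.exp(), reduction="batchmean")  # KL(p2 || p1)
    return kl_12 + kl_21

def scr_loss(p1_logits, p2_logits, targets, lam=1.0):
    # Ground-truth NLL for both networks plus the symmetric KL consistency term
    # that emphasizes samples where the two models disagree.
    nll_s = F.cross_entropy(p1_logits, targets)
    nll_c = F.cross_entropy(p2_logits, targets)
    return nll_s + nll_c + lam * symmetric_kl(p1_logits, p2_logits)
```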
Formally, the training objective of the source network is to minimize the negative log-likelihood (NLL) loss for each instance $(W, Y)$ from the training data:

$$
\mathcal{L}_S(\theta_S) = - \sum_{i=1}^{n} y_i \log p(\hat{y}_i \mid W; \theta_S), \tag{6}
$$

where $y_i$ is the ground truth at step $i$ and $\theta_S$ denotes the parameters of the source network.
We prune the model parameters $\theta_S$ of the source network to obtain the parameters $\theta_C$ of the sub-network. The training objective of the sub-network $C$ can be defined as:

$$
\mathcal{L}_C(\theta_C) = - \sum_{i=1}^{n} y_i \log p(\hat{y}_i \mid W; \theta_C). \tag{7}
$$
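For illustration, one plausible way to obtain the sub-network's parameters $\theta_C$ from $\theta_S$ is global magnitude pruning with `torch.nn.utils.prune`; the pruning criterion, the pruned module types, and the ratio below are our assumptions, not the paper's setting.

```python
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

def build_subnetwork(source: nn.Module, amount: float = 0.3) -> nn.Module:
    """Copy the source network and globally prune a fraction of its linear weights."""
    sub = copy.deepcopy(source)
    targets = [(m, "weight") for m in sub.modules() if isinstance(m, nn.Linear)]
    prune.global_unstructured(targets, pruning_method=prune.L1Unstructured, amount=amount)
    for module, name in targets:
        prune.remove(module, name)   # bake the pruning masks into the weights
    return sub
```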