
to solve complex reasoning problems?
In this work, we go beyond simple reasoning tasks and dive into the real application domain of finance to investigate the complex numerical reasoning ability of current modeling paradigms. The finance domain bears natural requirements for realistic, complex numerical reasoning from human labor, such as quantitative analysis of financial reports. We seek to study the real-world scenario of conversational question answering over financial reports: investors or analysts typically ask sequential questions to gain insights into the numerical information in the reports. These questions require extensive calculations and often exhibit cross-turn dependencies, forming chains of numerical reasoning throughout the conversation.
To this end, we propose a new dataset, CONVFINQA (Conversational Finance Question Answering), with 3,892 conversations consisting of 14,115 questions. To construct the dataset, we design a framework that simulates the conversation flow by decomposing and concatenating multi-hop questions from the FinQA (Chen et al., 2021) dataset. We then ask expert annotators to compose the question for each conversation turn based on the simulated conversation flow. Figure 1 shows one example conversation from our dataset. We conduct
comprehensive experiments and analyses on our dataset using both neural symbolic models and prompting-based methods, and summarize the following insights: (1) Both kinds of approaches (with execution accuracy below 70.0%) fall far behind human performance (89.4%). The reasoning chains throughout the conversation make it challenging for models to learn when to refer to or discard the conversation history and how to assemble the reasoning path. (2) Though excelling at simple general reasoning tasks, prompting-based methods perform substantially worse on our task (below 50.0% using GPT-3 175B). They either superficially mimic the given prompts or recall their own knowledge of simple general numerical reasoning, and tend to fail to understand new complex task paradigms in new domains. We believe our new dataset can serve as a challenging and valuable resource for exploring real-world, complex reasoning tasks as the next research focus.
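The conversation-flow simulation above builds conversational reasoning chains out of FinQA's multi-hop questions. As an illustrative sketch only (the paper describes the framework at a high level, and the program syntax with `#k` back-references follows the FinQA convention rather than anything stated here), one turn-by-turn decomposition of a multi-hop reasoning program might look like:

```python
import re

def decompose_program(program: str):
    """Split a FinQA-style program, e.g. 'subtract(5829, 5735), divide(#0, 5735)',
    into one operation per simulated conversation turn."""
    steps = re.findall(r"(\w+)\(([^)]*)\)", program)
    return [(op, [a.strip() for a in args.split(",")]) for op, args in steps]

# Basic arithmetic operations used in FinQA-style programs.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def simulate_turns(program: str):
    """Execute the decomposed steps turn by turn; '#k' refers back to the
    answer of turn k, mirroring how a later question in the simulated
    conversation depends on earlier answers."""
    results = []
    for op, args in decompose_program(program):
        vals = [results[int(a[1:])] if a.startswith("#") else float(a)
                for a in args]
        results.append(OPS[op](*vals))
    return results

# A two-turn chain: first the change in value, then the percent change.
print(simulate_turns("subtract(5829, 5735), divide(#0, 5735)"))
```

Each step then becomes one turn for the annotators to phrase as a natural question, with later turns implicitly referencing earlier answers.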
2 Related Work
Dataset     Size  Mode    Challenge            Domain
SQA         6k    ConvQA  table navigation     general
CSQA        200k  ConvQA  KG reasoning         general
CoQA        8k    ConvQA  co-reference         general
QuAC        14k   ConvQA  open-ended           general
DROP        96k   QA      numerical reasoning  general
MathQA      37k   QA      numerical reasoning  math
FinQA       8k    QA      numerical reasoning  finance
TAT-QA      17k   QA      numerical reasoning  finance
CONVFINQA   4k    ConvQA  numerical reasoning  finance

Table 1: Comparison of CONVFINQA with existing datasets.

Conversational Question Answering
Conversational question answering (ConvQA) (Zaib et al., 2021) has been gaining attention in recent years.
In ConvQA, users can ask follow-up questions beyond the first one to obtain more information. This also mitigates the need to ask a single complex multi-hop question at one time, making the information-seeking procedure more natural. Among previous datasets, SQA (Iyyer et al., 2017) is built by decomposing multi-hop questions based on Wikitables. CSQA (Saha et al., 2018) questions require simple logical operations over knowledge graphs (KGs). CoQA (Reddy et al., 2019) focuses on co-references among conversation turns to be more human-like. QuAC (Choi et al., 2018) focuses on open-ended, exploratory questions. In contrast, our dataset CONVFINQA targets complex numerical reasoning chains among the sequential questions in finance conversations.
Numerical Reasoning
Numerical reasoning ability is often investigated in the form of question answering. The DROP dataset (Dua et al., 2019) explores simple calculations over texts in the general domain. MAWPS (Koncel-Kedziorski et al., 2016) and MathQA (Amini et al., 2019) focus on generating solutions for math word problems. Recently, Wei et al. (2022) demonstrate that large pre-trained LMs can excel at reasoning tasks given proper prompts with natural language explanations. However, their reasoning tasks are mostly simple and general. In this work, we explore complex numerical reasoning in a highly specialized domain.
Financial NLP
Previous work in financial NLP mostly centers on sentiment analysis (Day and Lee, 2016; Akhtar et al., 2017), fraud detection (Han et al., 2018; Wang et al., 2019; Nourbakhsh and Bang, 2019), and opinionated QA (Liu et al., 2020), such as the FiQA dataset (https://sites.google.com/view/fiqa/home) built based on social media. Most recently, Chen et al. (2021) propose the FinQA dataset with multi-hop numerical reasoning questions based on financial reports. TAT-QA (Zhu