CONVFINQA: Exploring the Chain of Numerical Reasoning in
Conversational Finance Question Answering
Zhiyu Chen1, Shiyang Li1, Charese Smiley2, Zhiqiang Ma2,
Sameena Shah2and William Yang Wang1
1University of California, Santa Barbara
2J.P. Morgan
{zhiyuchen,shiyangli,william}@cs.ucsb.edu,
{charese.h.smiley,zhiqiang.ma,sameena.shah}@jpmchase.com
Abstract
With the recent advances in large pre-trained language models, researchers have achieved record performances in NLP tasks that mostly focus on language pattern matching. The community is experiencing a shift in challenge from how to model language to how to imitate the complex reasoning abilities of humans. In this work, we investigate the application domain of finance, which involves real-world, complex numerical reasoning. We propose a new large-scale dataset, CONVFINQA, aiming to study the chain of numerical reasoning in conversational question answering. Our dataset poses great challenges in modeling long-range, complex numerical reasoning paths in real-world conversations. We conduct comprehensive experiments and analyses with both neural symbolic methods and prompting-based methods to provide insights into the reasoning mechanisms of these two divisions. We believe our new dataset should serve as a valuable resource to push forward the exploration of real-world, complex reasoning tasks as the next research focus. Our dataset and code are publicly available1.
1 Introduction
The rapid advancement of large pre-trained language models (LMs) has brought natural language processing research into a new era. Based on the well-known Transformer (Vaswani et al., 2017) architecture, such large pre-trained LMs (Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2020; Sanh et al., 2021; Wang et al., 2022) have set new state-of-the-art results for many NLP tasks, with some of them approaching or even surpassing human performance, as on the SQuAD (Rajpurkar et al., 2016) dataset. We observe that tasks whose essence is modeling language patterns can be addressed well by large pre-trained LMs. However, for other kinds of
1https://github.com/czyssrs/ConvFinQA
Table (from the report):
                                 2010     2009     2008
share-based compensation cost    $18.10   $14.60   $13.80
income tax benefit              -$6.30   -$5.20   -$4.90

Financial report: ... the total income tax benefit recognized for share-based compensation in the accompanying statements of income is also presented.

Conversational QA:
Q1: In the year of 2010, what was the share-based compensation cost?
A1: 18.1
Q2: and what was the income tax benefit?
A2: -6.3
Q3: what was, then, the sum of both?
A3: add(18.1, -6.3) = 11.8
Q4: and what was that sum in 2009?
A4: add(14.6, -5.2) = 9.4
Q5: what, then, was the change in the sum of those amounts from 2009 to 2010?
A5: add(18.1, -6.3), add(14.6, -5.2), subtract(#0, #1) = 2.4

Figure 1: An example from CONVFINQA: each question may depend on previous questions to answer.
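The per-turn programs above can be executed by a small interpreter. The following is a minimal sketch of our own (not code from the paper), assuming each clause is an operation name with two arguments, where a string "#k" refers to the result of the k-th earlier clause, as in A5:

```python
def run_program(clauses):
    """Execute a list of (op, arg1, arg2) clauses sequentially.
    A string argument "#k" refers to the result of the k-th clause;
    the result of the last clause is the answer."""
    ops = {"add": lambda a, b: a + b,
           "subtract": lambda a, b: a - b}
    results = []
    for op, a, b in clauses:
        def resolve(x):
            # Dereference "#k" into the k-th intermediate result.
            if isinstance(x, str) and x.startswith("#"):
                return results[int(x[1:])]
            return x
        results.append(ops[op](resolve(a), resolve(b)))
    return results[-1]

# A3: add(18.1, -6.3) -> roughly 11.8
print(run_program([("add", 18.1, -6.3)]))
# A5: add(18.1, -6.3), add(14.6, -5.2), subtract(#0, #1) -> roughly 2.4
print(run_program([("add", 18.1, -6.3),
                   ("add", 14.6, -5.2),
                   ("subtract", "#0", "#1")]))
```

Note how A5 must re-derive the intermediate sums from earlier turns before taking their difference, which is exactly the cross-turn dependency the dataset targets.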
tasks that require complex reasoning abilities, current research is still far from satisfactory performance (Wei et al., 2022).

Traditional methods for reasoning tasks typically use neural symbolic models to encode the context, generate the reasoning program, and execute it (Liang et al., 2017; Chen et al., 2020). Most recently, it has been shown that sufficiently large pre-trained LMs can excel at reasoning tasks given proper prompts (Wei et al., 2022). But the tasks they experiment with are relatively general and toy-like, such as simple math word problems, and the form of the solutions and reasoning explanations has likely been witnessed by the model during pre-training. This raises an interesting question: which of the two directions is the fundamental way to solve complex reasoning problems?

arXiv:2210.03849v1 [cs.CL] 7 Oct 2022
In this work, we go beyond simple reasoning tasks and dive into the real application domain of finance to investigate the complex numerical reasoning ability of current modeling paradigms. The finance domain bears natural requirements for realistic, complex numerical reasoning from human labor, such as the quantitative analysis of financial reports. We seek to study the real-world scenario of conversational question answering over financial reports: investors or analysts would typically ask sequential questions to get insights into the numerical information in the reports. The questions require extensive calculations and often demonstrate cross-turn dependency, forming chains of numerical reasoning throughout the conversation.
To this end, we propose a new dataset, CONVFINQA (Conversational Finance Question Answering), with 3,892 conversations consisting of 14,115 questions. To construct the dataset, we design a framework to simulate the conversation flow by decomposition and concatenation of the multi-hop questions from the FinQA (Chen et al., 2021) dataset. We then ask expert annotators to compose the question for each conversation turn based on the simulated conversation flow. Figure 1 shows one example conversation from our dataset. We conduct comprehensive experiments and analyses on our dataset using both neural symbolic models and prompting-based methods, and summarize the following insights: (1) Both kinds of approaches (with execution accuracy below 70.0%) fall far behind human performance (89.4%). The reasoning chains throughout the conversation pose great challenges for models to learn when to refer to or discard the conversation history and how to assemble the reasoning path. (2) Though excelling at simple general reasoning tasks, prompting-based methods perform much worse on our task (below 50.0% using GPT-3 175B). They either superficially mimic the given prompts or recall their own knowledge of simple general numerical reasoning; they tend to fail to understand new, complex task paradigms in new domains. We believe our new dataset should serve as a challenging and valuable resource for the exploration of real-world, complex reasoning tasks as the next research focus.
2 Related Work
Dataset      Size   Mode     Challenge            Domain
SQA          6k     ConvQA   table navigation     general
CSQA         200k   ConvQA   KG reasoning         general
CoQA         8k     ConvQA   co-reference         general
QuAC         14k    ConvQA   open-ended           general
DROP         96k    QA       numerical reasoning  general
MathQA       37k    QA       numerical reasoning  math
FinQA        8k     QA       numerical reasoning  finance
TAT-QA       17k    QA       numerical reasoning  finance
CONVFINQA    4k     ConvQA   numerical reasoning  finance

Table 1: Comparison of CONVFINQA with existing datasets.

Conversational Question Answering
Conversational question answering (ConvQA) (Zaib et al., 2021) has been gaining attention in recent years. In ConvQA, users can append multiple questions after the first one to get more information. This also mitigates the need to ask a single complex multi-hop question at one time, making the information-seeking procedure more natural. Among previous datasets, SQA (Iyyer et al., 2017) is built by decomposing multi-hop questions based on WikiTables. CSQA (Saha et al., 2018) questions require simple logical operations over knowledge graphs (KGs). CoQA (Reddy et al., 2019) focuses on co-references among conversation turns to be more human-like. QuAC (Choi et al., 2018) focuses on open-ended, exploratory questions. In contrast, our dataset CONVFINQA targets complex numerical reasoning chains among the sequential questions in finance conversations.
Numerical Reasoning
Numerical reasoning ability is often investigated in the form of question answering. The DROP dataset (Dua et al., 2019) explores simple calculations over texts in the general domain. MAWPS (Koncel-Kedziorski et al., 2016) and MathQA (Amini et al., 2019) focus on generating solutions for math word problems. Recently, Wei et al. (2022) demonstrated that large pre-trained LMs can excel at reasoning tasks given proper prompts with natural language explanations. However, their reasoning tasks are mostly simple and general. In this work, we explore complex numerical reasoning in a highly specialized domain.
Financial NLP
Previous work in financial NLP mostly centers on sentiment analysis (Day and Lee, 2016; Akhtar et al., 2017), fraud detection (Han et al., 2018; Wang et al., 2019; Nourbakhsh and Bang, 2019), and opinionated QA (Liu et al., 2020), such as the FiQA2 dataset built on social media. Most recently, Chen et al. (2021) proposed the FinQA dataset with multi-hop numerical reasoning questions based on financial reports. TAT-QA (Zhu et al., 2021) is another QA dataset with a similar focus. In CONVFINQA, we seek to construct question sequences in the conversational setting, aiming at a more natural experience for real-world usage. Table 1 presents the comparison of our dataset with existing ones.

2https://sites.google.com/view/fiqa/home
3 Task Formulation
Given a financial report containing both textual content T and a structured table B, the user asks a sequence of questions {Q_i}_{i=0}^{n}, where later questions may depend on previous questions to answer. The target is to generate the reasoning program G to be executed to get the answer A to the last question:
P(A | T, B, Q_n) = Σ_i P(G_i | T, B, Q_0, Q_1, ..., Q_{n-1}, Q_n)    (1)
where {G_i} is the set of all possible programs that evaluate to the correct answer. We follow the same domain-specific language (DSL) as FinQA (Chen et al., 2021) to construct the reasoning programs as a sequence of operation-argument clauses (see Appendix A for all operations):

op_1[args_1], op_2[args_2], ..., op_n[args_n]    (2)

We follow the same evaluation metrics as FinQA: execution accuracy to evaluate the final execution result, and program accuracy to evaluate program equivalence.
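To make the distinction between the two metrics concrete, here is a hedged sketch of our own (not the official FinQA evaluation script; the paper's notion of program equivalence may be more permissive, e.g. up to operand order in commutative operations, whereas this sketch uses exact clause-sequence match):

```python
def execute(clauses):
    """Run a sequence of (op, arg1, arg2) clauses; a string "#k" refers
    to the result of the k-th clause; the last result is the answer."""
    ops = {"add": lambda a, b: a + b, "subtract": lambda a, b: a - b,
           "multiply": lambda a, b: a * b, "divide": lambda a, b: a / b}
    results = []
    for op, a, b in clauses:
        def get(x):
            return results[int(x[1:])] if isinstance(x, str) and x.startswith("#") else x
        results.append(ops[op](get(a), get(b)))
    return results[-1]

def execution_match(pred, gold, tol=1e-4):
    # Execution accuracy: do the two programs yield the same result?
    return abs(execute(pred) - execute(gold)) < tol

def program_match(pred, gold):
    # Program accuracy (simplified here to exact clause-sequence match).
    return pred == gold

gold = [("subtract", 11.8, 9.4)]
pred = [("add", 18.1, -6.3), ("add", 14.6, -5.2), ("subtract", "#0", "#1")]
print(execution_match(pred, gold))  # True: both evaluate to 2.4
print(program_match(pred, gold))    # False: the clause sequences differ
```

This also shows why both metrics are reported: a prediction can reach the right number through a different (here, longer) derivation, which execution accuracy rewards but strict program matching does not.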
4 The CONVFINQA Dataset
4.1 Dataset Construction
The Overview
The core challenge of building such a dataset is the construction of a natural, realistic conversational flow: what kinds of questions the queriers may ask, and how these questions logically appear in a conversation. We consulted financial experts to summarize the following key factors that make up a conversation when querying financial reports: (i) the questioner directly queries the surface content; (ii) the questioner asks something that requires calculations over the numbers in the report to answer; (iii) the questioner asks the above two kinds of questions sequentially to form the conversation, to cumulatively query more information or to switch to other aspects.

Directly composing the conversations from scratch involving all the above factors is laborious and costly. To tackle this challenge, we propose
[Figure 2: The simulation process of conversation skeletons. Type I (simple conversation): the reasoning program of one original multi-step question, e.g. op1(arg1, arg2), op2(#0, arg3), is decomposed into one turn per operation, with span-selection turns querying the leaf arguments (query arg1, query arg2) inserted before them. Type II (hybrid conversation): the decompositions of two original multi-step questions over the same report are concatenated into one skeleton, e.g. Turn 1: query arg1; Turn 2: query arg2; Turn 3: op1(arg1, arg2) = #0; Turn 4: op2(#0, arg3); Turn 5: op3(arg3, arg4) = #1; Turn 6: op4(#1, arg4).]
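The Figure 2 simulation can be sketched programmatically. The following is a minimal illustration under our own simplifying assumptions (the function names are hypothetical, span-selection turns are inserted only for the first clause's arguments, and the renumbering of "#" references across the two concatenated programs is omitted):

```python
def decompose(program, insert_spans=True):
    """Turn a multi-step program (a list of (op, arg1, arg2) clauses)
    into a conversation skeleton: one turn per operation, optionally
    preceded by span-selection turns for the first clause's arguments."""
    turns = []
    if insert_spans:
        turns.extend(("query", arg) for arg in program[0][1:])
    turns.extend(program)
    return turns

def concatenate(skeleton1, skeleton2):
    # Hybrid conversation: chain two skeletons over the same report.
    # (Renumbering #0 -> #1 in the second program, as in Figure 2,
    # is omitted here for brevity.)
    return skeleton1 + skeleton2

q1 = [("op1", "arg1", "arg2"), ("op2", "#0", "arg3")]
q2 = [("op3", "arg3", "arg4"), ("op4", "#0", "arg4")]

# Type I: simple conversation from one question's decomposition.
for turn in decompose(q1):
    print(turn)

# Type II: hybrid conversation concatenating two decompositions.
hybrid = concatenate(decompose(q1), decompose(q2, insert_spans=False))
print(len(hybrid))  # 6 turns, matching Figure 2
```

Each skeleton turn is later realized into a natural-language question by expert annotators, as described next.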
a two-step construction framework: (I) conversational QA flow simulation, which produces the conversation skeleton with each turn filled with its reasoning semantics, and (II) question composition, which realizes the reasoning semantics into textual questions.
Conversational QA Flow Simulation
We build the conversation flow based on the decomposition and concatenation of the multi-step reasoning programs (the solutions of the multi-hop questions) in the existing FinQA (Chen et al., 2021) dataset. In FinQA, the authors construct two multi-hop questions for most of its reports. The two FinQA questions for the same report naturally query different but sometimes correlated aspects of the report, inspiring us to integrate them into a natural and realistic conversation. We simulate two types of conversations: Type I: Simple conversation from the