
to solve complex reasoning problems?
In this work, we go beyond simple reasoning tasks and dive into the real application domain of finance to investigate the complex numerical reasoning ability of current modeling paradigms. The finance domain bears natural requirements for realistic, complex numerical reasoning from human labor, such as quantitative analysis of financial reports. We seek to study the real-world scenario of conversational question answering over financial reports: investors or analysts typically ask sequential questions to gain insights into the numerical information in the reports. These questions require extensive calculations and often exhibit cross-turn dependencies, forming chains of numerical reasoning throughout the conversation.
To this end, we propose a new dataset, CONVFINQA (Conversational Finance Question Answering), with 3,892 conversations consisting of 14,115 questions. To construct the dataset, we design a framework that simulates the conversation flow by decomposing and concatenating multi-hop questions from the FinQA (Chen et al., 2021) dataset. We then ask expert annotators to compose the question for each conversation turn based on the simulated conversation flow. Figure 1 shows one example conversation from our dataset. We conduct
comprehensive experiments and analyses on our dataset using both neural symbolic models and prompting-based methods, and summarize the following insights: (1) Both kinds of approaches (with execution accuracy below 70.0%) fall far behind human performance (89.4%). The reasoning chains throughout the conversation make it challenging for models to learn when to refer to or discard the conversation history and how to assemble the reasoning path. (2) Though excelling at simple general reasoning tasks, prompting-based methods perform substantially worse on our task (below 50.0% using GPT-3 175B). They either superficially mimic the given prompts or recall their own knowledge of simple general numerical reasoning, and tend to fail to understand new complex task paradigms in new domains. We believe our new dataset can serve as a challenging and valuable resource for exploring real-world, complex reasoning tasks as the next research focus.
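The conversation-flow simulation above builds conversational reasoning chains out of FinQA's multi-hop questions. As an illustrative sketch only (the paper describes the framework at a high level, and the program syntax with `#k` back-references follows the FinQA convention rather than anything stated here), one turn-by-turn decomposition of a multi-hop reasoning program might look like:

```python
import re

def decompose_program(program: str):
    """Split a FinQA-style program, e.g. 'subtract(5829, 5735), divide(#0, 5735)',
    into one operation per simulated conversation turn."""
    steps = re.findall(r"(\w+)\(([^)]*)\)", program)
    return [(op, [a.strip() for a in args.split(",")]) for op, args in steps]

# Basic arithmetic operations used in FinQA-style programs.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def simulate_turns(program: str):
    """Execute the decomposed steps turn by turn; '#k' refers back to the
    answer of turn k, mirroring how a later question in the simulated
    conversation depends on earlier answers."""
    results = []
    for op, args in decompose_program(program):
        vals = [results[int(a[1:])] if a.startswith("#") else float(a)
                for a in args]
        results.append(OPS[op](*vals))
    return results

# A two-turn chain: first the change in value, then the percent change.
print(simulate_turns("subtract(5829, 5735), divide(#0, 5735)"))
```

Each step then becomes one turn for the annotators to phrase as a natural question, with later turns implicitly referencing earlier answers.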
2 Related Work
Dataset     Size  Mode    Challenge            Domain
SQA         6k    ConvQA  table navigation     general
CSQA        200k  ConvQA  KG reasoning         general
CoQA        8k    ConvQA  co-reference         general
QuAC        14k   ConvQA  open-ended           general
DROP        96k   QA      numerical reasoning  general
MathQA      37k   QA      numerical reasoning  math
FinQA       8k    QA      numerical reasoning  finance
TAT-QA      17k   QA      numerical reasoning  finance
CONVFINQA   4k    ConvQA  numerical reasoning  finance

Table 1: Comparison of CONVFINQA with existing datasets.

Conversational Question Answering
Conversational question answering (ConvQA) (Zaib et al., 2021) has been gaining attention in recent years.
In ConvQA, users can ask follow-up questions beyond the first one to obtain more information. This also mitigates the need to ask a single complex multi-hop question at one time, making the information-seeking procedure more natural. Among previous datasets, SQA (Iyyer et al., 2017) is built by decomposing multi-hop questions based on Wikitables. CSQA (Saha et al., 2018) questions require simple logical operations over knowledge graphs (KGs). CoQA (Reddy et al., 2019) focuses on co-references among conversation turns to be more human-like. QuAC (Choi et al., 2018) focuses on open-ended, exploratory questions. In contrast, our dataset CONVFINQA targets complex numerical reasoning chains among the sequential questions in finance conversations.
Numerical Reasoning
Numerical reasoning ability is often investigated in the form of question answering. The DROP dataset (Dua et al., 2019) explores simple calculations over texts in the general domain. MAWPS (Koncel-Kedziorski et al., 2016) and MathQA (Amini et al., 2019) focus on generating solutions for math word problems. Recently, Wei et al. (2022) demonstrate that large pre-trained LMs can excel at reasoning tasks given proper prompts with natural language explanations. However, their reasoning tasks are mostly simple and general. In this work, we explore complex numerical reasoning in a highly specialized domain.
Financial NLP
Previous work in financial NLP mostly centers on sentiment analysis (Day and Lee, 2016; Akhtar et al., 2017), fraud detection (Han et al., 2018; Wang et al., 2019; Nourbakhsh and Bang, 2019), and opinionated QA (Liu et al., 2020), such as the FiQA dataset (https://sites.google.com/view/fiqa/home) built based on social media. Most recently, Chen et al. (2021) propose the FinQA dataset with multi-hop numerical reasoning questions based on financial reports. TAT-QA (Zhu