OpenCQA: Open-ended Question Answering with Charts
Shankar Kantharaj1, Xuan Long Do2, Rixie Tiffany Ko Leong2,
Jia Qing Tan2, Enamul Hoque1, Shafiq Joty2,3
1York University, Canada, 2Nanyang Technological University, Singapore,
3Salesforce AI Research
{shankark, enamulh}@yorku.ca
{xuanlong001@e, C190022@e, srjoty@}.ntu.edu.sg
Abstract
Charts are very popular for analyzing data and conveying important insights. People often analyze visualizations to answer open-ended questions that require explanatory answers. Answering such questions is often difficult and time-consuming, as it requires significant cognitive and perceptual effort. To address this challenge, we introduce a new task called OpenCQA, in which the goal is to answer an open-ended question about a chart with descriptive texts. We present the annotation process and an in-depth analysis of our dataset. We implement and evaluate a set of baselines under three practical settings. In the first setting, a chart and the accompanying article are provided as input to the model. The second setting provides only the paragraph(s) relevant to the chart instead of the entire article, whereas the third setting requires the model to generate an answer based solely on the chart. Our analysis of the results shows that the top performing models generally produce fluent and coherent text while struggling to perform complex logical and arithmetic reasoning.
1 Introduction
Using data visualizations such as bar charts and line charts to discover critical insights and explain them to others is at the heart of many decision-making tasks (Munzner, 2014). Often, people explore such visualizations to answer high-level questions that involve reasoning and explanations. For example, Figure 1 shows an open-ended question that cannot be answered with a single word or phrase; rather, it requires an explanatory answer. Answering such questions can be time-consuming and mentally taxing, as it requires significant perceptual and cognitive effort. For the particular question in Figure 1, the user needs to find the relevant marks (bars) in the given chart, compare their values, and perform reasoning over them to generate an explanatory answer. Thus, the research question we address in this paper is: can we build systems to automatically answer such open-ended questions about charts with descriptive texts?

Figure 1: A question-answer pair from our dataset.
Question: Compare the Democrats and Republicans views about providing health care to the population?
Answer: While 83% of Democrats say providing high-quality, affordable health care for all should be a top priority, a much smaller share of Republicans (48%) agree.
Chart Question Answering (CQA) is a task in which the goal is to take a chart and a natural language question as input and generate the desired answer as output (Hoque et al., 2022). While CQA has received growing attention in the last few years, existing datasets focus only on close-ended (factoid) questions whose answer is a word or phrase (Kahou et al., 2017; Kafle et al., 2018; Chaudhry et al., 2020; Singh and Shekhar, 2020). These datasets typically use predefined templates to generate synthetic questions, and the answers to these questions come from a closed vocabulary (e.g., ‘yes’, ‘no’, ‘x-axis-label’). PlotQA (Methani et al., 2020) introduces some open-vocabulary questions that require aggregation operations on the underlying chart data; however, the answer is still a number or a word/phrase obtained from the chart. Kim et al. (2020) attempt to automatically explain how the model computes the answer, but only for close-ended questions. To our knowledge, there are no datasets on CQA with open-ended questions.

In this work, we introduce a novel task named OpenCQA, in which the system takes a chart and a question as input and is expected to produce a descriptive answer as output, like the one in Figure 1.
This task is multidisciplinary and challenging in nature, as it involves natural language generation (NLG), information visualization and computer vision. It differs from data-to-text generation and reading comprehension because, unlike text or tables, charts serve a different communicative goal by creating a visual representation of data. Readers can quickly notice important patterns, trends, and outliers from such visual representations, which cannot be easily observed from a table of raw data (Munzner, 2014). By looking at a line chart, one can quickly discern an important trend, whereas scatterplots may visually depict correlations and outliers. Existing NLG approaches for tables do not consider such chart features during generation.
We have developed a benchmark dataset for OpenCQA consisting of 7,724 human-written open-ended questions about a variety of real-world charts and the associated descriptive answers. We formulate three practical task settings. In the first setting, a chart and the article containing the chart are provided as input, and the model generates an answer to an open-ended question. This setting poses an extra challenge, as articles often contain paragraphs that are irrelevant to the question. To make the task more focused, the second setting provides only the paragraph(s) relevant to the chart; hence, we can measure the models’ ability to answer a question without the extra noise from irrelevant text. The third setting is more challenging, as the related text is not provided and the model needs to generate an answer based solely on the chart. This setting is more relevant to real-world scenarios where charts are not accompanied by any explanatory text.

Since the proposed task is completely new, we adapt a variety of state-of-the-art models that utilize multimodal, data2text and extractive summarization methods to serve as strong baselines. We conduct automatic and qualitative evaluations and observe that the top performing models are quite fluent and coherent in generating summaries but lack complex logical reasoning and inference. Our codebase is publicly available at https://github.com/vis-nlp/OpenCQA.
2 Related Work
Our work is related to three lines of prior work.
(i) Chart Summarization
Mittal et al. (1998) and Ferres et al. (2013) adopt a planning-based architecture and use templates to describe charts with texts. These methods can only describe how to read a chart, without summarizing any insights from it. Demir et al. (2012) compute statistics to generate bar chart summaries and simultaneously construct sentence- and discourse-level structures. Chen et al. (2019) use a ResNet (He et al., 2016) to encode a chart and an LSTM decoder to generate a caption. All these studies generate summaries using predefined templates, which may lack naturalness and variation in grammatical structure and lexical choices. Obeid and Hoque (2020) and Kantharaj et al. (2022a) use transformer-based models, while Spreafico and Carenini (2020) use an LSTM-based encoder-decoder model, to generate chart summaries in a data-driven fashion. However, their models only focus on generating a summary that describes the chart, rather than focusing on a specific relevant portion of the chart to answer a question, which is the main focus of our work.
(ii) Visual Question Answering (VQA)
VQA involves answering a question regarding an input image (Antol et al., 2015). To relate the question and the image effectively, researchers focus on fusing textual and visual information together (Lu et al., 2019; Talmor et al., 2021). Cho et al. (2021) introduce VL-T5 and VL-BART as pretrained vision-language models, which achieve competitive results on VQA tasks. Unlike images of real-world objects and scenes, charts encode data using marks (bars, lines) and have an inherent structure, which makes the chart QA task quite different from VQA (Masry et al., 2022).
(iii) Data2text Generation
Data2text models generate a descriptive summary from a data table. Previous work has focused on specific domains such as sports (Barzilay and Lapata, 2005; Wiseman et al., 2017), weather forecasts (Reiter et al., 2005), recipes (Yang et al., 2017) and biographies (Lebret et al., 2016). Others (Parikh et al., 2020; Chen et al., 2020a) have focused on open-domain tasks. Many of these methods use an LSTM-based encoder-decoder architecture (Mei et al., 2016; Lebret et al., 2016; Wiseman et al., 2017), while Gong et al. (2019) find that transformers yield more fluent and coherent outputs. A few approaches focus on generating textual facts with logical inference rather than stating simple facts that can be easily retrieved from the data table (Chen et al., 2020a,b). Unlike the task with data tables, our task involves understanding the visual features of the chart and the natural language question in order to perform reasoning and generate (or extract) texts as answers.
3 Dataset Construction
3.1 Data Collection & Annotation
Building a dataset with open-ended questions and human-written descriptive answers is challenging because there are not many publicly available real-world sources with charts and related textual descriptions. After an exhaustive search, we decided to use charts from Pew Research (pewresearch.org). Pew serves as a suitable source because its articles are written by professional writers covering opinions, market surveys, demographic trends and social issues. The articles are often accompanied by a variety of real-world charts and their summaries. We collected 9,285 chart-summary-article triples scraped from nearly 4,000 articles. However, not all of the charts are suitable for creating open-ended questions. For example, some charts may be too unconventional or too complex, while a few others have poor resolution. Similarly, the text accompanying a chart may not discuss data values in the chart and instead refer to external background facts. Hence, we manually went over all the charts to retain 7,724 samples that we deemed suitable for our study. In particular, we filtered out 1,019 samples as too complex and 542 as samples for which we could not create an open-ended question.

We perform an annotation study on the collected chart data to create question-answer pairs following the four steps below (see Table 8 for an illustrative example). More details of the data collection and annotation process are provided in Appendix A.1.
(1) Question-answer Creation
We asked each crowdworker from Amazon Mechanical Turk to answer three existing questions (created by another crowdworker) for three separate charts, and to create three new question-answer pairs for three new charts. They were provided with the chart and the summary, and were asked to select portions of the text as an answer to the question. The selected segments can be noncontiguous. In this way, we collected two answers from different workers for each question, in order to verify answers and to remove any potential bias in answer selection.
(2) Question Validation and Editing
After collecting the question-answer (QA) pairs, this and the next two steps are performed by five internal annotators who are native speakers of English and have research backgrounds in summarization. Each QA pair is first examined by an annotator who checks whether the question is open-ended in nature and edits it when it is vague, incomplete, or not answerable from the chart. Then, the remaining annotators analyze the questions for grammatical correctness and edit them as needed. Overall, the question was edited in 53% of the question-answer pairs: 22.7% of the cases were minor changes (less than 30% of tokens changed), 15.5% were moderate changes (between 30% and 60% of tokens changed) and 14.8% were major changes (over 60% of tokens changed).
(3) Disagreement Resolution
As mentioned, we obtain two answers from the crowdworkers for each chart-question pair. To resolve any potential bias from one answer and/or disagreement between the two answers, we built an annotation interface where an annotator can either choose one of the two answers or select a new answer from the given summary. The annotator checks whether the answer contains information irrelevant to the question or any text that is not derivable from the chart (e.g., background information). For 18.4% of the cases, the two answers matched exactly. For 68.2% of the samples, the two answers still had high overlap (over 90% token matches); for another 10.1%, the overlap between the answers was moderate (between 30% and 90% token matches); and for the remaining 3.3%, the token matches between the answers were below 30%. While resolving the disagreements between crowdworkers, the annotators chose one of the two answers in 96% of the cases, while for the other 4% they selected a new answer from the summary.
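To make these agreement buckets concrete, here is a minimal sketch, written purely for illustration, of one possible token-overlap measure between two crowdsourced answers; the exact matching procedure used during annotation may differ.

```python
# Illustrative only: a simple Jaccard-style token overlap between two answers.
def token_overlap(answer_a: str, answer_b: str) -> float:
    tokens_a = set(answer_a.lower().split())
    tokens_b = set(answer_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


ratio = token_overlap(
    "83% of Democrats say affordable health care should be a top priority",
    "83% of Democrats say providing affordable health care should be a top priority",
)
print(f"{ratio:.0%}")  # a pair above 90% would fall into the high-overlap bucket
```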
(4) Decontextualization
In some cases, crowdworkers may have left out important information from the summary that is relevant to the question, while in other cases they may have included information that is not derivable from the chart. Thus, after selecting the most appropriate answer, annotators edit it further by adding tokens from the summary or removing tokens as necessary; the result is taken as the extractive answer in the dataset. Also, if needed, they replace the occurrence of a pronoun with its antecedent (a proper noun) in cases where the entity is otherwise unknown, to put the answer in context; this yields the abstractive answer.
3.2 Dataset Analysis
Figure 2a presents some basic statistics about the dataset. The questions and titles are generally short, both under 21 tokens on average. The percentage of tokens overlapping between the extractive answer and the article is 7% on average. Other characteristics of our dataset are as follows.

Figure 2: (a) dataset statistics and chart types; (b) distribution of topics.

Statistics (on average):
  Tokens in article: 1,268.91
  Tokens in summary: 123.35
  Tokens in title: 17.94
  Tokens in question: 11.88
  Tokens in abstractive answer: 56.41
  Tokens in extractive answer: 56.21
  Percentage of tokens extracted from the summary: 52%
  Percentage of tokens extracted from the article: 7%

Chart types (simple / complex):
  Bar: 712 / 4,823
  Line: 234 / 1,667
  Area: 7 / 4
  Scatter: 0 / 42
  Pie: 235 / 0
  Total: 1,188 / 6,536
Table 1: Examples and distribution of question types among 100 randomly selected questions. The corresponding charts for these examples are shown in Figure 6.

  Type        Example                                                                 %
  Identify    What are the current thoughts on direct democracy?                     37%
  Summarize   Explain the distribution of people who know a transgender person?      37%
  Compare     Compare Americans and Germans views about the world economic leader?   20%
  Discover    How do Americans see the coronavirus statistics?                         6%
Chart Types and Topics
Our dataset contains a variety of chart types (Figure 2a). The most common type is the bar chart (71.7%), covering simple as well as stacked and grouped bar charts. The next most common type is the line chart (24.6%). Other types include area charts, scatter plots and pie charts. The dataset also covers a diverse range of topics, including politics, technology, society and media (Figure 2b); about half of the charts cover U.S. Politics & Policy due to the nature of the source.
Question Types
We further analyze the question types using 100 randomly sampled question-answer pairs from our dataset. Table 1 shows the distribution of questions across four main types. Our categorization of questions is based on the specific analytical tasks with visualizations one would have to perform to answer the question (Munzner, 2014). The four categories are: (i) Identify: questions that require identifying a specific target (e.g., a data item or a data attribute) from the chart and describing the characteristics of that target; (ii) Compare: questions that require comparisons between specified targets from the chart; (iii) Summarize: questions that require summarizing the chart based on specified statistical analysis tasks (e.g., describing data distribution, outlier(s) and trend(s)); and (iv) Discover: questions that require analyzing the whole chart to derive key insights through inference and reasoning. Unlike the summarize task, there is no explicit analytical task specified in a discover-type question.

From Table 1, we notice that people often ask to locate one or more portions of a chart (identify and compare) and then characterize them to answer the question. They may also ask for a descriptive answer about trends or data distributions. In contrast, questions that require the user to focus on the whole chart (e.g., discover) are fewer. This suggests that, unlike the chart summarization problem, which focuses on the whole chart, the OpenCQA problem requires the model to identify the relevant portions of the chart to answer the question.
4 OpenCQA Models
Problem Definition
For our OpenCQA problem, we consider three task settings. In the first setup, the model takes a chart and the article containing the chart as input and extracts an answer to an open-ended question. The data for this task can be represented as a set of six-element tuples, D = {⟨C, T, M, Q, D, A⟩_n}_{n=1}^{N}, where C, T, M, Q, D and A represent the chart image, title, metadata, question, document (article) text and answer text, respectively. The metadata M = ⟨C_label, C_bbox⟩ consists of the chart labels, which are text segments extracted from the chart through OCR (e.g., axis labels, data labels), and their respective bounding boxes. In the second setup, the chart summary is provided as input instead of the whole article. The dataset in this setup can be represented as D = {⟨C, T, M, Q, S, A⟩_n}_{n=1}^{N}, where S represents the chart summary. In the third setup, the chart summary is not accessible, and the model must rely only on the chart. This is a more difficult and interesting problem setting, since real-world charts often do not come with explanatory summaries. In this setting, for an input I = ⟨C, T, M, Q⟩, the model has to learn to generate an explanatory answer A.
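To make the tuple definition concrete, the following is a minimal sketch of how one example could be represented in code; the class and field names are our own illustration, not the dataset’s actual schema.

```python
# Illustrative representation of one OpenCQA example (names are hypothetical).
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ChartLabel:
    text: str                                # OCR-extracted text segment (e.g., an axis label)
    bbox: Tuple[float, float, float, float]  # its bounding box on the chart image


@dataclass
class OpenCQAExample:
    chart_image_path: str        # C: the chart image
    title: str                   # T: chart title
    metadata: List[ChartLabel]   # M: OCR labels with bounding boxes
    question: str                # Q: the open-ended question
    context: Optional[str]       # D (article) in setup 1, S (summary) in setup 2, None in setup 3
    answer: str                  # A: the extractive or abstractive answer text
```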
For training the models, we use state-of-the-art extractive and generative QA models. The supervision for the extractive models is the extractive answers, whereas the generative models are supervised with the abstractive, or edited, answers (as described in Section 3.1). Note that the third task setting applies only to generative models, whereas the first and second settings apply to both extractive and generative models. We describe the models below (implementation details are in Appendix A.2).
4.1 Extractive Models
We adopt two extractive models for the two problem setups in which the models extract the answer to the question from the input summary or article.
BERTQA
(Chadha and Sood, 2019) is an extractive QA model that uses directed coattention layers (Xiong et al., 2016) to improve the performance of the original BERT model (Devlin et al., 2019). In this approach, we first pass the question and the text (article or summary) through a BERT model to get the (self-attention based) representations. The model then calculates the cross-attention from question to text and from text to question, and concatenates the resulting vectors to predict the start and end points of the answer span.
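As a rough illustration of this extractive setup (not the authors’ exact coattention implementation), the sketch below runs span prediction with a generic Hugging Face QA head; the checkpoint name and example text are assumptions made for illustration.

```python
# Illustrative extractive QA: predict an answer span for a chart question.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

checkpoint = "bert-large-uncased-whole-word-masking-finetuned-squad"  # assumed off-the-shelf model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

question = "Compare the Democrats and Republicans views about providing health care?"
context = ("While 83% of Democrats say providing high-quality, affordable health care for all "
           "should be a top priority, a much smaller share of Republicans (48%) agree.")

inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start/end positions and decode the predicted span.
start = int(torch.argmax(outputs.start_logits))
end = int(torch.argmax(outputs.end_logits)) + 1
print(tokenizer.decode(inputs["input_ids"][0][start:end]))
```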
ELECTRA
(Clark et al., 2020) proposes a self-supervised representation learning method that emphasizes computational efficiency. In contrast to Masked Language Modeling (MLM), it uses Replaced Token Detection (RTD) as the pretraining task. The training process is inspired by the training of Generative Adversarial Networks (GANs) (Goodfellow et al., 2020). The training examples are first passed to a generator (typically a small MLM-based model) to replace some of the tokens in the input with other probable but incorrect tokens. ELECTRA (the discriminator) is then trained to distinguish between the “original” and “replaced” tokens in the input. This binary classification task is applied to every token, so it requires fewer training examples than MLM training. ELECTRA achieves state-of-the-art results on SQuAD 2.0 (Rajpurkar et al., 2018). We use the same experimental setup as with the BERTQA model.
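To make the RTD objective concrete, here is a minimal sketch, assuming the publicly released google/electra-small-discriminator checkpoint and a made-up corrupted sentence, of how the discriminator scores each token as original or replaced; it illustrates the pretraining task rather than our fine-tuning pipeline.

```python
# Illustrative Replaced Token Detection: score each token as original vs. replaced.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(name)
discriminator = ElectraForPreTraining.from_pretrained(name)

# Suppose a small generator corrupted "chart" into "table" in the original sentence.
corrupted = "The table shows that 83% of Democrats prioritize affordable health care."
inputs = tokenizer(corrupted, return_tensors="pt")

with torch.no_grad():
    logits = discriminator(**inputs).logits  # one replaced-vs-original score per token

for token, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
                        logits[0].tolist()):
    print(f"{token:15s} predicted_replaced={score > 0}")
```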
4.2 Generative Models
GPT-2
(Radford et al., 2019) trains a transformer decoder on the unlabelled BooksCorpus dataset (Zhu et al., 2015) using a conditional language modelling objective. The pretrained model can be fine-tuned on downstream tasks such as textual entailment, similarity and question answering. We fine-tune GPT-2 on the three task settings, where all the input elements in each setting are concatenated as the conditioning input to predict the answer.
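The following is a minimal sketch, under our own assumptions about separators and field names (the paper does not spell out its exact preprocessing here), of how the inputs of one setting can be concatenated into a single conditioning string for GPT-2.

```python
# Illustrative input concatenation for the GPT-2 baseline (field names and
# separator tokens are hypothetical choices, not the paper's exact format).
from typing import List, Optional


def build_gpt2_prompt(title: str, ocr_labels: List[str], question: str,
                      context: Optional[str] = None) -> str:
    parts = [f"title: {title}",
             "labels: " + " | ".join(ocr_labels),
             f"question: {question}"]
    if context is not None:  # article or summary; omitted in the chart-only setting
        parts.append(f"context: {context}")
    return " <sep> ".join(parts) + " <answer>"


prompt = build_gpt2_prompt(
    title="Views on providing health care",
    ocr_labels=["Democrats 83%", "Republicans 48%"],
    question="Compare the Democrats and Republicans views about providing health care?",
)
print(prompt)
```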
BART
(Lewis et al., 2020) uses a standard encoder-decoder transformer architecture. Its pretraining task involves denoising, where text spans in the input are replaced with a single mask token and the decoder is tasked with predicting the original input sequence. BART has been shown to achieve state-of-the-art performance on text generation tasks such as summarization. In each of our three task settings, we concatenate the corresponding inputs and feed them into the model for fine-tuning.
T5
(Raffel et al., 2020) is a unified encoder-decoder transformer model that casts language processing tasks into a text-to-text generation format. It is first pretrained with a ‘fill-in-the-blank’ denoising objective, where 15% of the input tokens are randomly dropped out. Spans of consecutive dropped-out tokens, as well as dropped-out tokens that stand alone, are then replaced by special sentinel tokens, each assigned a token ID that is unique within the input sequence. The decoder then learns to predict the dropped-out tokens, delimited by the corresponding input sentinel tokens plus a final sentinel token. We fine-tune T5 on our tasks using the same input format as with BART.
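As a hedged sketch of one fine-tuning step for these sequence-to-sequence baselines (shown with a t5-base checkpoint; BART is fine-tuned analogously on the same concatenated input format), assuming Hugging Face Transformers and made-up example text:

```python
# Illustrative seq2seq fine-tuning step (checkpoint and texts are placeholders).
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

source = ("question: Compare the Democrats and Republicans views about providing health care? "
          "title: Views on providing health care "
          "context: 83% of Democrats ... 48% of Republicans ...")
target = ("While 83% of Democrats say providing high-quality, affordable health care for all "
          "should be a top priority, a much smaller share of Republicans (48%) agree.")

inputs = tokenizer(source, return_tensors="pt", truncation=True)
labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids

loss = model(**inputs, labels=labels).loss  # teacher-forced cross-entropy
loss.backward()                             # an optimizer step would follow in training
print(float(loss))
```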
VLT5
(Cho et al., 2021) is a T5-based framework that unifies vision-language (VL) tasks as text generation conditioned on multimodal inputs. The input consists of both textual tokens and visual features of objects in the image, extracted by Faster R-CNN (Ren et al., 2015). The model is pretrained on multiple multimodal tasks: language modeling, visual QA, visual grounding, image-text matching and grounded captioning. We fine-tune VL-T5 on our OpenCQA generative task in the following manner. For the textual input, we use the same input format as for T5. For the visual input, we extract the visual features of the different marks in the chart image (e.g., bars, lines) using Mask R-CNN (He et al., 2017) with a ResNet-101 backbone.
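As a rough sketch of the visual-feature step, the snippet below runs a torchvision Mask R-CNN over a chart image and keeps the confident detections; it uses the library’s ResNet-50 variant and a hypothetical chart.png path as stand-ins for the ResNet-101 detector adapted to chart marks described above, so it is an illustration rather than the actual pipeline.

```python
# Illustrative mark detection on a chart image (requires torchvision >= 0.13).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

image = to_tensor(Image.open("chart.png").convert("RGB"))  # hypothetical input path
with torch.no_grad():
    detections = detector([image])[0]

# Keep confident detections; in a full pipeline their boxes and pooled region
# features would be fed to VL-T5 alongside the tokenized question and OCR text.
keep = detections["scores"] > 0.7
boxes = detections["boxes"][keep]
print(boxes.shape)  # (num_detected_marks, 4)
```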
CODR
(Prabhumoye et al., 2021) proposes a document-grounded generation task, where the model uses the information provided in a document to enhance text generation. In their setup, the context and source documents are concatenated and passed to a BART encoder to get a contextualized representation of the document. The same encoder is then applied to the context alone, and both representations are finally concatenated and passed to the decoder.