scriptive answer as output like the one in Figure 1.
This task is multidisciplinary and challenging in
nature as it involves natural language generation
(NLG), information visualization and computer
vision. It differs from data-to-text generation and reading comprehension because, unlike text or tables, charts serve a different communicative goal by creating a visual representation of data. Readers can quickly notice important patterns, trends, and outliers in such visual representations, which cannot be easily observed from a table of raw data (Munzner, 2014). By looking at a line chart, one can quickly discern an important trend, whereas a scatterplot may visually depict correlations and outliers. Existing NLG approaches for tables do not consider such chart features during generation.
We have developed a benchmark dataset for
OpenCQA consisting of 7,724 human-written open-
ended questions about a variety of real-world charts
and the associated descriptive answers. We formulate three practical task settings. In the first setting, a chart and the article containing it are provided as input, and the model generates an answer to an open-ended question. This setting poses an extra challenge, as articles often contain paragraphs that are irrelevant to the question. To make the task more focused, the second setting provides only the paragraph(s) relevant to the chart; hence, we can measure a model's ability to answer a question without the extra noise from irrelevant text. The third setting is more challenging, as no related text is provided and the model must generate an answer based solely on the chart. This setting is closest to real-world scenarios where charts are not accompanied by any explanatory text.
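To make the three settings concrete, the sketch below shows how the model input differs across them. This is a minimal illustration only; the field and function names (Example, build_model_input) are hypothetical and not part of our released codebase.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative input structure for the three OpenCQA task settings.
# All names here are hypothetical, chosen for clarity.

@dataclass
class Example:
    chart_image: bytes                          # rendered chart
    question: str                               # open-ended question
    article: Optional[str] = None               # setting 1: full article text
    relevant_paragraphs: Optional[str] = None   # setting 2: relevant paragraph(s)

def build_model_input(ex: Example, setting: int) -> dict:
    """Assemble the model input for a given task setting."""
    inp = {"chart": ex.chart_image, "question": ex.question}
    if setting == 1:
        inp["context"] = ex.article              # full article (noisy context)
    elif setting == 2:
        inp["context"] = ex.relevant_paragraphs  # relevant text only
    # setting 3: chart and question only, no textual context
    return inp
```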
Since the proposed task is completely new, we adapt a variety of state-of-the-art models that utilize multimodal, data2text, and extractive summarization methods to serve as strong baselines. We conduct automatic and qualitative evaluations and observe that the top-performing models generate quite fluent and coherent summaries, but fall short in complex logical reasoning and inference. Our codebase is publicly available at
https://github.com/vis-nlp/OpenCQA.
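As one illustration of automatic evaluation, the snippet below scores generated answers against gold references with corpus-level BLEU using the sacrebleu library. This is a minimal sketch with made-up example strings; BLEU is one common NLG metric, and the full metric suite and evaluation scripts are in the repository above.

```python
import sacrebleu

# Hypothetical model outputs and gold reference answers.
hypotheses = [
    "The chart shows a steady rise in approval from 2010 to 2020.",
]
references = [
    "Approval ratings increased steadily between 2010 and 2020.",
]

# sacrebleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```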
2 Related Work
Our work is related to three lines of prior work.
(i) Chart Summarization
Mittal et al. (1998)
and Ferres et al. (2013) adopt a planning-based architecture and use templates to describe charts with text. These methods can only describe how to read a chart, without summarizing any insights from it.
the chart. Demir et al. (2012) compute statistics to
generate bar chart summaries and simultaneously
construct sentence- and discourse-level structures.
Chen et al. (2019) use a ResNet (He et al., 2016) to encode a chart and an LSTM decoder to generate a caption. All these studies generate summaries using predefined templates, which may lack naturalness and variation in grammatical structure and lexical choices. Obeid and Hoque (2020) and Kantharaj et al. (2022a) use transformer-based models, while Spreafico and Carenini (2020) use an LSTM-based encoder-decoder model to generate chart summaries in a data-driven fashion. However, their models focus only on generating a summary that describes the chart as a whole, rather than on the specific portion of the chart that answers a question, which is the main focus of our work.
(ii) Visual Question Answering (VQA)
VQA involves answering a question regarding an input image (Antol et al., 2015). To relate the question and the image effectively, researchers focus on fusing textual and visual information (Lu et al., 2019; Talmor et al., 2021). Cho et al. (2021) introduce VL-T5 and VL-BART, pretrained vision-language models that achieve competitive results on VQA tasks. Unlike images of real-world objects and scenes, charts encode data using marks (e.g., bars and lines) and have an inherent structure, which makes the chart QA task quite different from VQA (Masry et al., 2022).
(iii) Data2text Generation
Data2text models generate a descriptive summary from a data table. Previous work has focused on specific domains such as sports (Barzilay and Lapata, 2005; Wiseman et al., 2017), weather forecasts (Reiter et al., 2005), recipes (Yang et al., 2017), and biographies (Lebret et al., 2016). Others (Parikh et al., 2020; Chen et al., 2020a) have focused on open-domain tasks. Many of these methods use an LSTM-based encoder-decoder architecture (Mei et al., 2016; Lebret et al., 2016; Wiseman et al., 2017), while Gong et al. (2019) find that transformers yield more fluent and coherent outputs. A few approaches focus on generating textual facts that require logical inference rather than stating simple facts that can be easily retrieved from the data table (Chen et al., 2020a,b). Unlike tasks over data tables, our task involves understanding the visual features of charts and the natural language questions to perform reasoning in