scriptive answer as output like the one in Figure 1.
This task is multidisciplinary and challenging in
nature as it involves natural language generation
(NLG), information visualization and computer
vision. It differs from data-to-text generation and reading comprehension because, unlike text or tables, charts serve a different communicative goal by creating a visual representation of data. Readers can quickly notice important patterns, trends, and outliers in such visual representations, which cannot be easily observed from a table of raw data (Munzner, 2014). By looking at a line chart, one can quickly discern an important trend, whereas a scatterplot may visually depict correlations and outliers. Existing NLG approaches for tables do not consider such chart features during generation.
We have developed a benchmark dataset for
OpenCQA consisting of 7,724 human-written open-
ended questions about a variety of real-world charts
and the associated descriptive answers. We formulate three practical task settings. In the first setting, a chart and the article containing it are provided as input, and the model generates an answer to an open-ended question. This setting poses an extra challenge, as articles often contain paragraphs that are irrelevant to the question. To make the task more focused, the second setting provides only the paragraph(s) relevant to the chart; hence, we can measure a model's ability to answer a question without the extra noise from irrelevant text. The third setting is more challenging, as no related text is provided and the model must generate an answer based solely on the chart. This setting is closest to real-world scenarios where charts are not accompanied by any explanatory text.
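To make the three settings concrete, the sketch below shows how the model input differs across them. This is a minimal illustration only; the field and function names (Example, build_model_input) are hypothetical and not part of our released codebase.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative input structure for the three OpenCQA task settings.
# All names here are hypothetical, chosen for clarity.

@dataclass
class Example:
    chart_image: bytes                          # rendered chart
    question: str                               # open-ended question
    article: Optional[str] = None               # setting 1: full article text
    relevant_paragraphs: Optional[str] = None   # setting 2: relevant paragraph(s)

def build_model_input(ex: Example, setting: int) -> dict:
    """Assemble the model input for a given task setting."""
    inp = {"chart": ex.chart_image, "question": ex.question}
    if setting == 1:
        inp["context"] = ex.article              # full article (noisy context)
    elif setting == 2:
        inp["context"] = ex.relevant_paragraphs  # relevant text only
    # setting 3: chart and question only, no textual context
    return inp
```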
Since the proposed task is completely new, we adapt a variety of state-of-the-art models that utilize multimodal, data2text, and extractive summarization methods to serve as strong baselines. We conduct automatic and qualitative evaluations and observe that the top-performing models generate quite fluent and coherent summaries, but fall short in complex logical reasoning and inference. Our codebase is publicly available at
https://github.com/vis-nlp/OpenCQA.
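As one illustration of automatic evaluation, the snippet below scores generated answers against gold references with corpus-level BLEU using the sacrebleu library. This is a minimal sketch with made-up example strings; BLEU is one common NLG metric, and the full metric suite and evaluation scripts are in the repository above.

```python
import sacrebleu

# Hypothetical model outputs and gold reference answers.
hypotheses = [
    "The chart shows a steady rise in approval from 2010 to 2020.",
]
references = [
    "Approval ratings increased steadily between 2010 and 2020.",
]

# sacrebleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```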
2 Related Work
Our work is related to three lines of prior work.
(i) Chart Summarization
Mittal et al. (1998)
and Ferres et al. (2013) adopt a planning-based architecture and use templates to describe charts with text. These methods can only describe how to read a chart, without summarizing any insights from it.
the chart. Demir et al. (2012) compute statistics to
generate bar chart summaries and simultaneously
construct sentence- and discourse-level structures.
Chen et al. (2019) use a ResNet (He et al., 2016) to encode a chart and an LSTM decoder to generate a caption. All these studies generate summaries using predefined templates, which may lack naturalness and variation in grammatical structure and lexical choices. Obeid and Hoque (2020) and Kantharaj et al. (2022a) use transformer-based models, while Spreafico and Carenini (2020) use an LSTM-based encoder-decoder model to generate chart summaries in a data-driven fashion. However, their models focus only on generating a summary that describes the chart as a whole, rather than on the specific portion of the chart that answers a question, which is the main focus of our work.
(ii) Visual Question Answering (VQA)
VQA involves answering a question regarding an input image (Antol et al., 2015). To relate the question and the image effectively, researchers focus on fusing textual and visual information (Lu et al., 2019; Talmor et al., 2021). Cho et al. (2021) introduce VL-T5 and VL-BART, pretrained vision-language models that achieve competitive results on VQA tasks. Unlike images of real-world objects and scenes, charts encode data using marks (e.g., bars and lines) and have an inherent structure, which makes the chart QA task quite different from VQA (Masry et al., 2022).
(iii) Data2text Generation
Data2text models generate a descriptive summary from a data table. Previous work has focused on specific domains such as sports (Barzilay and Lapata, 2005; Wiseman et al., 2017), weather forecasts (Reiter et al., 2005), recipes (Yang et al., 2017), and biographies (Lebret et al., 2016). Others (Parikh et al., 2020; Chen et al., 2020a) have focused on open-domain tasks. Many of these methods use an LSTM-based encoder-decoder architecture (Mei et al., 2016; Lebret et al., 2016; Wiseman et al., 2017), while Gong et al. (2019) find that transformers yield more fluent and coherent outputs. A few approaches focus on generating textual facts that require logical inference rather than stating simple facts that can be easily retrieved from the data table (Chen et al., 2020a,b). Unlike tasks over data tables, our task involves understanding the visual features of charts and the natural language questions to perform reasoning in