
What’s Different between Visual Question Answering
for Machine “Understanding” Versus for Accessibility?
Yang Trista Cao∗
University of Maryland
ycao95@umd.edu
Kyle Seelman∗
University of Maryland
kseelman@umd.edu
Kyungjun Lee∗
University of Maryland
kyungjun@umd.edu
Hal Daumé III
University of Maryland
Microsoft Research
me@hal3.name
Abstract
In visual question answering (VQA), a machine must answer a question given an associated image. Recently, accessibility researchers have explored whether VQA can be deployed in a real-world setting where users with visual impairments learn about their environment by capturing their visual surroundings and asking questions. However, most existing benchmarking datasets for VQA focus on machine “understanding,” and it remains unclear how progress on those datasets corresponds to improvements in this real-world use case. We aim to answer this question by evaluating a variety of VQA models to uncover discrepancies between a machine “understanding” dataset (VQA-v2) and an accessibility dataset (VizWiz). Based on our findings, we discuss opportunities and challenges in VQA for accessibility and suggest directions for future work.
1 Introduction
Much research has focused on evaluating and pushing the boundary of machine “understanding”: can machines achieve high scores on tasks thought to require human-like comprehension, including image tagging and captioning (e.g., Lin et al., 2014) and various forms of reasoning (e.g., Wang et al., 2018; Sap et al., 2020)? In recent years, with the advancement of deep learning, machines’ capabilities on these tasks have improved greatly, raising the possibility of deployment. However, adapting machine systems to real life is non-trivial, as real-life situations and users can differ significantly from synthetic and crowd-sourced dataset examples (Shneiderman, 2020). In this paper, we use the visual question answering (VQA) task as an example to call for a shift from developing for machine “understanding” to building machines that make a positive impact on society and people.
∗Equal contribution
Visual question answering (VQA) is a task that requires a model to answer natural language questions based on images. This idea dates back at least to the 1960s in the form of answering questions about pictorial inputs (Coles, 1968; Theune et al., 2007, i.a.), and builds on “intelligence” tests like the total Turing test (Harnad, 1990). Over the past few years, the task has been re-popularized with new modeling techniques and datasets (e.g., Malinowski and Fritz, 2014; Marino et al., 2019). However, beyond testing a model’s multi-modal “understanding,” VQA systems could benefit visually impaired people by answering their questions about the visual world in real time. For simplicity, we call the former view machine understanding VQA (henceforth omitting the scare quotes) and the latter accessibility VQA. The majority of research in VQA (§2) focuses on the machine understanding view. As a result, it is not clear whether VQA model architectures developed and evaluated on machine understanding datasets can be easily adapted to the accessibility setting, as the distribution of images, questions, and answers might be—and, as shown in Figure 1, are—quite different.
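To make the task setup concrete, the sketch below illustrates the consensus-based accuracy metric shared by the VQA-v2 and VizWiz benchmarks, in which each question is paired with ten crowdsourced answers and a prediction counts as fully correct once at least three annotators agree with it. This is a simplified illustration, not the official scorer (which additionally normalizes answer strings and averages over annotator subsets); the function name and example answers here are our own.

```python
from collections import Counter

def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA consensus accuracy: min(#matching human answers / 3, 1).

    Both VQA-v2 and VizWiz collect 10 crowdsourced answers per question;
    the official evaluation code also normalizes strings (articles,
    punctuation, number words) and averages over leave-one-out subsets.
    """
    counts = Counter(a.strip().lower() for a in human_answers)
    matches = counts[prediction.strip().lower()]
    return min(matches / 3.0, 1.0)

# Hypothetical example: ten crowd answers to "What color is this mug?"
answers = ["red", "red", "dark red", "red", "maroon",
           "red", "red", "red", "burgundy", "red"]
print(vqa_accuracy("red", answers))     # 1.0  (7 matches >= 3)
print(vqa_accuracy("maroon", answers))  # ~0.33 (1 match)
```

Because both benchmarks share this metric, accuracy numbers are directly comparable across the two datasets; the differences we study stem from the distributions of images, questions, and answers rather than from the scoring protocol.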
In this work, we aim to investigate the gap between machine understanding VQA and accessibility VQA by uncovering the challenges of adapting machine understanding VQA model architectures to an accessibility VQA dataset. Here, we focus on English VQA systems and datasets; for machine understanding VQA, we use the VQA-v2 dataset (Agrawal et al., 2017), while for accessibility VQA, we use the VizWiz dataset (Gurari et al., 2018) (§3.1). Through performance assessments of seven machine understanding VQA model architectures spanning 2017–2021 (§3.3), we find that model architecture advancements on machine understanding VQA also improve performance on the accessibility task, but that the gap in model performance between the two is still significant