What’s Different between Visual Question Answering
for Machine “Understanding” Versus for Accessibility?
Yang Trista Cao
University of Maryland
ycao95@umd.edu
Kyle Seelman
University of Maryland
kseelman@umd.edu
Kyungjun Lee
University of Maryland
kyungjun@umd.edu
Hal Daumé III
University of Maryland
Microsoft Research
me@hal3.name
Abstract
In visual question answering (VQA), a machine must answer a question given an associated image. Recently, accessibility researchers have explored whether VQA can be deployed in a real-world setting, where users with visual impairments learn about their environment by capturing their visual surroundings and asking questions. However, most of the existing benchmarking datasets for VQA focus on machine “understanding,” and it remains unclear how progress on those datasets corresponds to improvements in this real-world use case. We aim to answer this question by evaluating a variety of VQA models on both a machine “understanding” dataset (VQA-v2) and an accessibility dataset (VizWiz) and analyzing the discrepancies between them. Based on our findings, we discuss opportunities and challenges in VQA for accessibility and suggest directions for future work.
1 Introduction
Much research has focused on evaluating and pushing the boundary of machine “understanding”: can machines achieve high scores on tasks thought to require human-like comprehension, including image tagging and captioning (e.g., Lin et al., 2014), and various forms of reasoning (e.g., Wang et al., 2018; Sap et al., 2020)? In recent years, with the advancement of deep learning, machines' capabilities on these tasks have improved greatly, raising the possibility of deployment. However, deploying machine systems in real life is non-trivial, as real-life situations and users can differ significantly from synthetic and crowdsourced dataset examples (Shneiderman, 2020). In this paper we use the visual question answering (VQA) task as an example to call for more attention to shifting from development focused on machine “understanding” toward building machines that can make a positive impact on society and people.
*Equal contribution
Visual question answering (VQA) is a task that requires a model to answer natural language questions based on images. The idea dates back at least to the 1960s in the form of answering questions about pictorial inputs (Coles, 1968; Theune et al., 2007, i.a.), and builds on “intelligence” tests like the total Turing test (Harnad, 1990). Over the past few years, the task has been re-popularized with new modeling techniques and datasets (e.g., Malinowski and Fritz, 2014; Marino et al., 2019). However, beyond testing a model's multi-modal “understanding,” VQA systems could potentially benefit visually impaired people by answering their questions about the visual world in real time. For simplicity, we call the former view machine understanding VQA (henceforth omitting the scare quotes) and the latter accessibility VQA. The majority of research in VQA (§2) focuses on the machine understanding view. As a result, it is not clear whether VQA model architectures developed and evaluated on machine understanding datasets can be easily adapted to the accessibility setting, as the distributions of images, questions, and answers might be quite different between the two; as shown in Figure 1, they are.
In this work, we aim to investigate the gap between machine understanding VQA and accessibility VQA by uncovering the challenges of adapting machine understanding VQA model architectures to an accessibility VQA dataset. Here, we focus on English VQA systems and datasets; for machine understanding VQA, we use the VQA-v2 dataset (Agrawal et al., 2017), while for accessibility VQA, we use the VizWiz dataset (Gurari et al., 2018) (§3.1). Through performance assessments of seven machine understanding VQA model architectures that span 2017–2021 (§3.3), we find that architectural advancements on machine understanding VQA also improve performance on the accessibility task, but that the gap in model performance between the two is still significant and is increasing (§4.1).
Figure 1: Given similar image content (left: food, right: cat), questions in the machine “understanding” dataset VQA-v2 and the accessibility dataset VizWiz are substantially different. The VizWiz examples show questions that are significantly more specific (with one question even explicitly stating that it is already obvious that this is a can of food), more verbal, and significantly less artificial (as in the cat examples) than the VQA-v2 ones.
This increasing gap in accuracy indicates that adapting model architectures that were developed for machine understanding to assist visually impaired people is challenging, and that model development in this area may reflect architectural overfitting.
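Both VQA-v2 and VizWiz score predictions with essentially the same consensus-based accuracy over ten crowdsourced answers per question: a predicted answer receives full credit if at least three annotators gave it, and proportional credit otherwise. The Python sketch below is a simplified illustration of that metric, not the official implementation (which additionally averages over leave-one-out subsets of the ten answers and applies answer normalization).

```python
from collections import Counter

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified consensus-based VQA accuracy: full credit if at least 3
    of the (typically 10) crowdsourced answers match the prediction,
    proportional credit otherwise. Subset-averaging and answer
    normalization from the official metric are omitted in this sketch."""
    counts = Counter(a.strip().lower() for a in human_answers)
    return min(counts[predicted.strip().lower()] / 3.0, 1.0)

# Toy example: ten crowdsourced answers for a single question.
answers = ["black beans"] * 4 + ["beans"] * 3 + ["canned beans"] * 2 + ["unanswerable"]
print(vqa_accuracy("Black Beans", answers))   # 1.0    (4 matches >= 3)
print(vqa_accuracy("beans", answers))         # 1.0    (exactly 3 matches)
print(vqa_accuracy("canned beans", answers))  # ~0.67  (2 matches)
```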
We then further investigate which types of questions in the accessibility dataset remain hard for the state-of-the-art (SOTA) VQA model architecture (§4.2). We adopt the data challenge taxonomies from Bhattacharya et al. (2019) and Zeng et al. (2020) to perform both quantitative and qualitative error analysis based on these challenge classes. We identify challenge classes within the accessibility dataset that are particularly difficult for the VQA models, pointing to directions for future work. Additionally, we observe that poor model performance on many questions stems not from a failure of the model to learn, but from the need for higher-quality annotations and evaluation metrics.
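One way to make the per-class error analysis concrete is to average per-question accuracy within each challenge class, keeping in mind that a single question can carry several class labels. The sketch below uses placeholder class names; the actual labels come from the taxonomies of Bhattacharya et al. (2019) and Zeng et al. (2020).

```python
from collections import defaultdict

def accuracy_by_challenge_class(results):
    """Average per-question accuracy within each challenge class.
    `results` is a list of (accuracy, classes) pairs; a question can carry
    several challenge-class labels and contributes to each of them."""
    per_class = defaultdict(list)
    for acc, classes in results:
        for cls in classes:
            per_class[cls].append(acc)
    return {cls: sum(scores) / len(scores) for cls, scores in per_class.items()}

# Toy per-question accuracies with placeholder class names; the real labels
# come from the taxonomies of Bhattacharya et al. (2019) and Zeng et al. (2020).
results = [
    (1.0, ["text recognition"]),
    (0.0, ["image blur", "framing"]),
    (0.3, ["text recognition", "image blur"]),
    (0.6, ["object recognition"]),
]
print(accuracy_by_challenge_class(results))
# {'text recognition': 0.65, 'image blur': 0.15, 'framing': 0.0, 'object recognition': 0.6}
```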
2 Related Work
To the best of our knowledge, this is the first work that attempts to quantify and understand the performance gap that VQA models exhibit between the VQA-v2 dataset, which was collected by sighted people, and the VizWiz dataset, which contains images and questions from people with visual impairments and answers from sighted people. Brady et al. (2013) conduct a thorough study of the types of questions people with visual impairments would like answered, and provide a taxonomy of the types of questions asked and the features of such questions. That work was a significant step in understanding the need among people with visual impairments for VQA systems. In combination with our own work, it gives a more complete picture of which kinds of questions not only contribute to better model performance but actually help individuals with visual impairments. Additionally, Zeng et al. (2020) seek to understand the task of answering questions about images from people with visual impairments (i.e., VizWiz) and those from sighted people (i.e., VQA-v2). The authors identify the common vision skills needed in both scenarios and quantify the difficulty of these skills for both humans and computers on both datasets.
Gurari et al. (2018), who published the first visual question answering (VQA) dataset containing images and questions from people with visual impairments, VizWiz, pointed out the artificial setting of other VQA datasets, whose questions are artificially created by sighted people. The VizWiz challenge is based on real-world data and directs researchers working on VQA toward real-world problems. This dataset was built on data collected with a crowdsourcing app,