
What’s Different between Visual Question Answering
for Machine “Understanding” Versus for Accessibility?
Yang Trista Cao∗
University of Maryland
ycao95@umd.edu
Kyle Seelman∗
University of Maryland
kseelman@umd.edu
Kyungjun Lee∗
University of Maryland
kyungjun@umd.edu
Hal Daumé III
University of Maryland
Microsoft Research
me@hal3.name
Abstract
In visual question answering (VQA), a machine must answer a question given an associated image. Recently, accessibility researchers have explored whether VQA can be deployed in a real-world setting where users with visual impairments learn about their environment by capturing their visual surroundings and asking questions. However, most existing benchmarking datasets for VQA focus on machine “understanding,” and it remains unclear how progress on those datasets corresponds to improvements in this real-world use case. We aim to answer this question by evaluating a variety of VQA models to uncover discrepancies between a machine “understanding” dataset (VQA-v2) and an accessibility dataset (VizWiz). Based on our findings, we discuss opportunities and challenges in VQA for accessibility and suggest directions for future work.
1 Introduction
Much research has focused on evaluating and pushing the boundary of machine “understanding”: can machines achieve high scores on tasks thought to require human-like comprehension, including image tagging and captioning (e.g., Lin et al., 2014) and various forms of reasoning (e.g., Wang et al., 2018; Sap et al., 2020)? In recent years, with the advancement of deep learning, machines’ capabilities on these tasks have improved greatly, raising the possibility of deployment. However, adapting machine systems to real life is non-trivial, as real-life situations and users can differ significantly from synthetic and crowd-sourced dataset examples (Shneiderman, 2020). In this paper, we use the visual question answering (VQA) task as an example to call for a shift from developing for machine “understanding” to building machines that make a positive impact on society and people.
∗Equal contribution
Visual question answering (VQA) is a task that requires a model to answer natural language questions based on images. This idea dates back at least to the 1960s in the form of answering questions about pictorial inputs (Coles, 1968; Theune et al., 2007, i.a.), and builds on “intelligence” tests like the total Turing test (Harnad, 1990). Over the past few years, the task has been re-popularized with new modeling techniques and datasets (e.g., Malinowski and Fritz, 2014; Marino et al., 2019). However, beyond testing a model’s multi-modal “understanding,” VQA systems could benefit visually impaired people by answering their questions about the visual world in real time. For simplicity, we call the former view machine understanding VQA (henceforth omitting the scare quotes) and the latter accessibility VQA. The majority of research in VQA (§2) focuses on the machine understanding view. As a result, it is not clear whether VQA model architectures developed and evaluated on machine understanding datasets can be easily adapted to the accessibility setting, as the distribution of images, questions, and answers might be—and, as shown in Figure 1, are—quite different.
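To make the task setup concrete, the sketch below illustrates the consensus-based accuracy metric shared by the VQA-v2 and VizWiz benchmarks, in which each question is paired with ten crowdsourced answers and a prediction counts as fully correct once at least three annotators agree with it. This is a simplified illustration, not the official scorer (which additionally normalizes answer strings and averages over annotator subsets); the function name and example answers here are our own.

```python
from collections import Counter

def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA consensus accuracy: min(#matching human answers / 3, 1).

    Both VQA-v2 and VizWiz collect 10 crowdsourced answers per question;
    the official evaluation code also normalizes strings (articles,
    punctuation, number words) and averages over leave-one-out subsets.
    """
    counts = Counter(a.strip().lower() for a in human_answers)
    matches = counts[prediction.strip().lower()]
    return min(matches / 3.0, 1.0)

# Hypothetical example: ten crowd answers to "What color is this mug?"
answers = ["red", "red", "dark red", "red", "maroon",
           "red", "red", "red", "burgundy", "red"]
print(vqa_accuracy("red", answers))     # 1.0  (7 matches >= 3)
print(vqa_accuracy("maroon", answers))  # ~0.33 (1 match)
```

Because both benchmarks share this metric, accuracy numbers are directly comparable across the two datasets; the differences we study stem from the distributions of images, questions, and answers rather than from the scoring protocol.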
In this work, we aim to investigate the gap between machine understanding VQA and accessibility VQA by uncovering the challenges of adapting machine understanding VQA model architectures to an accessibility VQA dataset. Here, we focus on English VQA systems and datasets; for machine understanding VQA, we use the VQA-v2 dataset (Agrawal et al., 2017), while for accessibility VQA, we use the VizWiz dataset (Gurari et al., 2018) (§3.1). Through performance assessments of seven machine understanding VQA model architectures spanning 2017–2021 (§3.3), we find that model architecture advancements on machine understanding VQA also improve performance on the accessibility task, but that the gap in model performance between the two is still significant