
ignores that keywords derived from guidelines do not necessarily match the words used in reviews posted by users. This mismatch includes, but is not limited to, situations in which users misspell the keywords. More importantly, the presence of certain keywords in a review does not necessarily mean that the review is about accessibility. For example, consider the following review from the dataset of Eler et al. [18]:
“This is the closest game to my old 2001 Kyocera 2235’s
inbuilt game ’Cavern crawler’. Everything is so simple
and easy to comprehend but that doesn’t mean that it
is easy to complete right off of the bat. Going into the
sewers almost literally blind (sight and knowledge of
goods in inventory) is a great touch too. Keep at it. I’ll
support you at least in donations.”
This review contains a set of keywords that could indicate
accessibility (e.g., “old”, “blind” and “sight”) but it is not an
accessibility review. In this review, the word “old” refers to a
device rather than a person. The words “blind” and “sight” re-
fer to knowledge of goods in the game rather than describing a
player’s vision. Therefore, the discovery of accessibility reviews relies heavily on context, and simply searching for the presence of keywords in the review text is insufficient. Due to the overhead of manual identification and the high false-positive rate of automated keyword-based detection, both methods remain impractical for developers to use, and so accessibility reviews remain hard to identify and to prioritize for correction. To address this challenge, it is critical to design a solution with learning capabilities: one that can take a set of examples known to be accessibility reviews, together with another set of examples that are not about accessibility but do contain accessibility-related keywords, and learn how to distinguish between them. Therefore, in this paper, we formulate the identification of accessibility reviews as a binary classification problem and address it with supervised learning. Taking as input a set of accessibility reviews obtained by manual inspection in a previous study [18], we deploy state-of-the-art machine learning models to learn the features, i.e., the textual patterns that are representative of
accessibility reviews. In contrast to relying on words derived from guidelines, our solution extracts features (i.e., words and patterns) from actual user reviews and learns from them. This is critical because there is a semantic gap between the guidelines, formally written at an abstract level, and technology-specific keywords. By features, we refer to a keyword or a set of keywords extracted from accessibility-related reviews that are not only important for classification algorithms, but can also be useful for developers to understand accessibility-related issues and features in their apps. The patterns can be about an app feature that supports accessibility (e.g., “font customization”, “page zooming” or “speed control”); about assistive technology (e.g., “word prediction”, “text to speech” or “voice over”); or about disability comments (e.g., “low vision”, “handicapped”, “deaf” or “blind”).
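To make this formulation concrete, the following is a minimal sketch of such a classification pipeline, not the exact implementation evaluated in this paper: it assumes scikit-learn, and uses TF-IDF n-gram features and GradientBoostingClassifier as stand-ins for the learned textual patterns and the boosted decision trees model discussed later; the labeled examples are hypothetical.

```python
# Minimal sketch of the binary-classification formulation (not the
# exact implementation evaluated in this paper). TF-IDF n-grams and
# GradientBoostingClassifier stand in for the learned textual patterns
# and the boosted decision trees model; the labeled data is hypothetical.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier

reviews = [
    "The voice over feature stopped reading the buttons",    # accessibility
    "Please add text to speech for us low vision users",     # accessibility
    "Going into the sewers almost literally blind is fun",   # not accessibility
    "Closest game to my old phone's built-in dungeon game",  # not accessibility
]
labels = [1, 1, 0, 0]  # 1 = accessibility review, 0 = non-accessibility

pipeline = Pipeline([
    # Word and bigram features let the model pick up patterns such as
    # "text to speech" or "low vision" rather than single keywords.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", GradientBoostingClassifier()),
])
pipeline.fit(reviews, labels)
print(pipeline.predict(["please add a font customization option"]))
```

Specifically, we address the following three research questions in our study: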
RQ1: To what extent can machine learning models accurately distinguish accessibility reviews from non-accessibility reviews?
To answer this research question, we rely on a manually
curated dataset of 2,663 accessibility reviews, which we
augment with another 2,663 non-accessibility reviews.
Then, we perform a comparative study among state-of-the-art binary classification models to identify the model that best distinguishes accessibility reviews from non-accessibility reviews.
RQ2: How effective is our machine learning approach in identifying accessibility reviews?
Opting for a complex solution, i.e., supervised learning, has its own challenges, as models need to be trained, tuned, and maintained. To justify this choice, we compare the best-performing model from the previous research question with two baselines: the string-matching method and the random classifier (a minimal sketch of a string-matching baseline follows this list). This research question verifies whether a simpler solution can achieve competitive results.
RQ3: What is the size of the training dataset needed for the classifier to effectively identify accessibility reviews?
In this research question, we empirically determine the minimum number of training instances, i.e., accessibility reviews, that our best-performing model needs to achieve its best performance. Such information is useful for practitioners to estimate the amount of manual work (i.e., preparation of training data) needed to build this solution.
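For illustration, here is a minimal sketch of the kind of keyword-based string-matching baseline referenced in RQ2; the keyword set is a short hypothetical example, not the full vocabulary used in our experiments.

```python
# Minimal sketch of a keyword-based string-matching baseline
# (illustrative only; this keyword set is hypothetical and far
# smaller than the vocabulary used in the actual experiments).
ACCESSIBILITY_KEYWORDS = {
    "accessibility", "blind", "deaf", "low vision",
    "screen reader", "text to speech", "voice over",
}

def is_accessibility_review(review: str) -> bool:
    """Flag a review if it contains any accessibility-related keyword."""
    text = review.lower()
    return any(keyword in text for keyword in ACCESSIBILITY_KEYWORDS)

# Context-blind matching is the source of this baseline's false
# positives: the Kyocera review above would be flagged via "blind".
print(is_accessibility_review("Going into the sewers almost literally blind"))
```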
We performed our experiments using a dataset of 5,326 user reviews provided by a previous study [18]. Our comparative study has shown that the Boosted Decision Trees model (BDTs-model) achieves the best performance among 8 other state-of-the-art models. Then, we compared our BDTs-model against two baselines: (1) a string-matching algorithm and (2) a random classifier. Our approach provided a significant improvement in the identification of accessibility reviews, outperforming baseline 1 (the keyword-based detector) by 1.574 times and surpassing baseline 2 (the random classifier) by 39.434 times.
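As an illustration of how such a comparative study can be set up, the sketch below cross-validates a few off-the-shelf classifiers on TF-IDF features and reports mean F1 scores. It is a simplified stand-in for our experimental setup: the classifier list, feature extraction, and cross-validation settings shown here are illustrative assumptions.

```python
# Illustrative sketch of a comparative study of classifiers
# (a simplified stand-in for the paper's experimental setup).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def compare_models(reviews, labels):
    """Cross-validate candidate classifiers and print mean F1 scores."""
    candidates = {
        "boosted decision trees": GradientBoostingClassifier(),
        "random forest": RandomForestClassifier(),
        "logistic regression": LogisticRegression(max_iter=1000),
        "linear SVM": LinearSVC(),
    }
    for name, clf in candidates.items():
        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
        scores = cross_val_score(model, reviews, labels, cv=5, scoring="f1")
        print(f"{name}: mean F1 = {scores.mean():.3f}")
```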
The contributions of this paper are:
(1) We present an action research contribution that privileges societal benefit by helping developers automatically detect accessibility-related reviews and filter out irrelevant reviews. We make our model and datasets publicly available (https://smilevo.github.io/access/) for researchers to replicate and extend, and for practitioners to use our web service to filter down their user reviews.
(2) We show that a relatively small training dataset (i.e., 1,500 reviews) suffices to achieve an F1-measure of 85% or higher, outperforming state-of-the-art string-matching methods. The F1-measure continues to improve as more training data is added; a minimal sketch of such a learning-curve analysis follows this list.
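The following sketch illustrates one way to carry out the training-size analysis behind this contribution. It is an assumed setup (scikit-learn's learning_curve with F1 scoring over increasing fractions of the data), not our exact procedure.

```python
# Illustrative learning-curve analysis (an assumed setup, not the
# paper's exact procedure): measure F1 as a function of training-set
# size to find where performance plateaus.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

def f1_learning_curve(reviews, labels):
    """Return (train_sizes, mean_f1) over increasing training sizes."""
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          GradientBoostingClassifier())
    sizes, _, test_scores = learning_curve(
        model, reviews, labels,
        train_sizes=np.linspace(0.1, 1.0, 10),  # 10% .. 100% of the data
        cv=5, scoring="f1",
    )
    return sizes, test_scores.mean(axis=1)
```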
2 RELATED WORK
It is crucial that mobile applications be accessible to allow
all individuals with different abilities to have fair access and