Finding the Needle in a Haystack: On the Automatic
Identification of Accessibility User Reviews
Eman Abdullah AlOmar
Rochester Institute of Technology
Rochester, New York, USA
eman.alomar@mail.rit.edu
Wajdi Aljedaani
University of North Texas
Denton, Texas, USA
wajdialjedaani@my.unt.edu
Murtaza Tamjeed
Rochester Institute of Technology
Rochester, New York, USA
mt1256@rit.edu
Mohamed Wiem Mkaouer
Rochester Institute of Technology
Rochester, New York, USA
mwmvse@rit.edu
Yasmine N. Elglaly
Western Washington University
Bellingham, Washington, USA
elglaly@wwu.edu
ABSTRACT
In recent years, mobile accessibility has become an important trend, with the goal of allowing all users to use any app without major limitations. User reviews include insights that are useful for app evolution. However, as the number of received reviews grows, manually analyzing them becomes tedious and time-consuming, especially when searching for accessibility reviews. The goal of this paper is to support the automated identification of accessibility concerns in user reviews, to help technology professionals prioritize their handling and, thus, create more inclusive apps. Specifically, we design a model that takes user reviews as input, learns their keyword-based features, and makes a binary decision, for a given review, on whether or not it is about accessibility. The model is evaluated using a total of 5,326 mobile app reviews. The findings show that (1) our model can accurately identify accessibility reviews, outperforming two baselines, namely a keyword-based detector and a random classifier; and (2) our model achieves an accuracy of 85% with a relatively small training dataset; however, the accuracy improves as we increase the size of the training dataset.
CCS CONCEPTS
• Human-centered computing → Empirical studies in accessibility; Ubiquitous and mobile devices.
KEYWORDS
Mobile application, user review, accessibility, machine learning.
Permission to make digital or hard copies of all or part of this work for personal
or classroom use is granted without fee provided that copies are not made
or distributed for profit or commercial advantage and that copies bear this
notice and the full citation on the first page. Copyrights for components of this
work owned by others than ACM must be honored. Abstracting with credit is
permitted. To copy otherwise, or republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee. Request permissions
from permissions@acm.org.
CHI ’21, May 8–13, 2021, Yokohama, Japan
©2021 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
ACM Reference Format:
Eman Abdullah AlOmar, Wajdi Aljedaani, Murtaza Tamjeed, Mohamed Wiem Mkaouer, and Yasmine N. Elglaly. 2021. Finding the Needle in a Haystack: On the Automatic Identification of Accessibility User Reviews. In CHI Conference on Human Factors in Computing Systems (CHI ’21), May 8–13, 2021, Yokohama, Japan. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Many mobile applications (apps) have poor accessibility, which makes it difficult for people with disabilities to use such apps [5, 53, 55, 71]. Researchers have presented several methods, tools, frameworks, and guidelines to support developers in creating accessible mobile applications [9, 11, 19, 47, 54, 64]. However, many software developers and designers still do not incorporate accessibility into their software development process, due to lack of awareness or lack of resources, e.g., budget and time [15, 48, 51]. In this paper, we present a method that can help software developers quickly become aware of specific accessibility problems that users encountered with their apps. Our method is based on automatically identifying app reviews that users write on app stores, e.g., the App Store¹, Google Play², and the Amazon Appstore³, where these reviews express accessibility-related feedback.

¹https://www.apple.com/ios/app-store/
²https://play.google.com/store
³https://www.amazon.com/mobile-apps/b?ie=UTF8&node=2350149011
Analyzing app reviews has been used by technology professionals to identify issues with their mobile apps [12, 37, 39]. However, accessibility in user reviews is rarely studied, especially for mobile applications [18]. Identifying accessibility-related reviews is currently done using two main methods: manual identification and automatic detection [18]. The manual identification approach is time-consuming, especially with the vast number of reviews that users upload to the app stores, and so it becomes impractical. The automated detection method employs a string-matching technique, in which a predefined set of keywords is searched for in the app reviews [18]. These keywords were extracted from the British Broadcasting Corporation (BBC) recommendations for mobile accessibility [10]. While this method sounds more practical than the manual one, it has its own drawbacks: the string-matching technique ignores the fact that keywords derived from guidelines do not necessarily match the words expressed in reviews posted by users. This mismatch includes, but is not limited to, situations where users misspell the keywords. More importantly, the presence of certain keywords in a review does not necessarily mean that the review is about accessibility. For example, consider the following review from the Eler et al. dataset [18]:
“This is the closest game to my old 2001 Kyocera 2235’s
inbuilt game ’Cavern crawler’. Everything is so simple
and easy to comprehend but that doesn’t mean that it
is easy to complete right off of the bat. Going into the
sewers almost literally blind (sight and knowledge of
goods in inventory) is a great touch too. Keep at it. I’ll
support you at least in donations.”
This review contains a set of keywords that could indicate accessibility (e.g., “old”, “blind”, and “sight”), but it is not an accessibility review. In this review, the word “old” refers to a device rather than a person. The words “blind” and “sight” refer to knowledge of goods in the game rather than describing a player’s vision. Therefore, the discovery of accessibility reviews heavily relies on context, and so simply searching for keywords in the review text is inefficient. Due to the overhead of manual identification, and the high false-positive rate of automated detection, these two methods remain impractical for developers to use, and so accessibility reviews remain hard to identify and to prioritize for correction. To address this challenge, it is critical to design a solution with learning capabilities, which can take a set of examples that are known to be accessibility reviews, and another set of examples that are not about accessibility but do contain accessibility-related keywords, and learn how to distinguish between them. Therefore, in this paper, we use supervised learning to formulate the identification of accessibility reviews as a binary classification problem. Our model takes as input a set of accessibility reviews obtained by manual inspection in a previous study [18], and we deploy state-of-the-art machine learning models to learn the features, i.e., textual patterns, that are representative of accessibility reviews. In contrast to relying on words derived from guidelines, our solution extracts features (i.e., words and patterns) from actual user reviews and learns from them. This is critical because there is a semantic gap between the guidelines, formally written at an abstract level, and technology-specific keywords. By features, we refer to a keyword or a set of keywords extracted from accessibility-related reviews that are not only important for the classification algorithms, but can also be useful for developers to understand accessibility-related issues and features in their apps. The patterns can be about an app feature that supports accessibility (e.g., “font customization”, “page zooming”, or “speed control”); about assistive technology (e.g., “word prediction”, “text to speech”, or “voice over”); or about disability comments (e.g., “low vision”, “handicapped”, “deaf”, or “blind”). In particular, we addressed the following three research questions in our study:
RQ1: To what extent can machine learning models accurately distinguish accessibility reviews from non-accessibility reviews?
To answer this research question, we rely on a manually curated dataset of 2,663 accessibility reviews, which we augment with another 2,663 non-accessibility reviews. Then we perform a comparative study between state-of-the-art binary classification models, to identify the best model for distinguishing accessibility reviews from non-accessibility reviews.
RQ2: How effective is our machine learning approach in identifying accessibility reviews?
Opting for a complex solution, i.e., supervised learning, has its own challenges, as models need to be trained, tuned, and maintained. To justify our choice of such a solution, we compare the best performing model from the previous research question with two baselines: the string-matching method and a random classifier. This research question verifies whether a simpler solution can deliver competitive results.
RQ3: What size of training dataset does the classifier need to effectively identify accessibility reviews?
In this research question, we empirically extract the minimum number of training instances, i.e., accessibility reviews, needed for our best performing model to achieve its best performance (a learning-curve sketch is given after this list). Such information is useful for practitioners, to estimate the amount of manual work that needs to be done (i.e., preparation of training data) to deploy this solution.
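To make RQ3 concrete, the following is a minimal, hypothetical sketch of how such a learning-curve experiment could be set up with scikit-learn. It is not the paper's exact setup: the input file name is hypothetical, and a gradient-boosted classifier stands in for the paper's Boosted Decision Trees.

```python
# Hypothetical learning-curve experiment for RQ3; not the paper's exact setup.
# Assumes a CSV "reviews_labeled.csv" with columns "review" and "label".
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline

df = pd.read_csv("reviews_labeled.csv")  # hypothetical file name

pipeline = make_pipeline(
    HashingVectorizer(n_features=2**12, alternate_sign=False),
    GradientBoostingClassifier(),  # stand-in for the paper's Boosted Decision Trees
)

# Train on growing slices of the data; the F1 curve shows where gains flatten out.
sizes, _, test_scores = learning_curve(
    pipeline, df["review"], df["label"],
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0],
    cv=5, scoring="f1",
)
for n, fold_scores in zip(sizes, test_scores):
    print(f"{n:>5} training reviews -> mean F1 = {fold_scores.mean():.3f}")
```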
We performed our experiments using a dataset of 5,326 user reviews, provided by a previous study [18]. Our comparative study has shown that the Boosted Decision Trees model (BDT model) achieves the best performance among the nine state-of-the-art models we examined. Then, we compared our BDT model against two baselines: (1) a string-matching algorithm and (2) a random classifier. Our approach provided a significant improvement in the identification of accessibility reviews, outperforming baseline 1 (the keyword-based detector) by 1.574 times, and surpassing baseline 2 (the random classifier) by 39.434 times.
The contributions of this paper are:

(1) We present an action research contribution that privileges societal benefit, by helping developers automatically detect accessibility-related reviews and filter out irrelevant ones. We make our model and datasets publicly available⁴ for researchers to replicate and extend, and for practitioners to use our web service to filter down their user reviews.

(2) We show that a relatively small training dataset (i.e., 1,500 reviews) is enough to achieve an F1-measure of 85% or higher, outperforming state-of-the-art string-matching methods. However, the F1-measure score improves as we add to the training dataset.

⁴https://smilevo.github.io/access/
2 RELATED WORK
It is crucial that mobile applications be accessible, to allow all individuals with different abilities fair access and equal opportunities [27]. Prior studies investigated the accessibility issues raised in Android applications [5, 66], and others evaluated the accessibility of various websites [1, 17, 30, 69]. To the best of our knowledge, there is no study that classifies user reviews of Android applications using machine learning.
In this section, we highlight several previous works that profoundly influenced our approach. We split the related work into three sections: user reviews, which briefly highlights the role of user reviews in app evolution; accessibility in user reviews, which focuses on the detection of accessibility concerns in user reviews; and classification of text documents, where we focus on current approaches to classifying text, such as user reviews, into different taxonomies.
2.1 User Reviews
Many researchers have concluded that reviews and ratings posted by users on app store platforms can play an essential role in apps’ evolution, since most developers consider users’ reviews when working on a new release [12, 37, 45, 49]. Maalej et al. [39] proposed to consider user input as a first means of requirements elicitation in software development. Similarly, Vu et al. [67] emphasized the role of users in the software lifecycle by developing an approach to identify useful information from users’ reviews. Moreover, Seyff et al. [59] suggested continuous requirements elicitation from end-users’ feedback using mobile devices.
Considering that user reviews can be a powerful driver of mobile app evolution, we look into whether we can effectively detect accessibility reviews in users’ feedback. This is important because, in a highly competitive market, identifying accessibility issues from users’ reviews can help developers improve their apps in order to attract more customers and provide better services to users with different abilities.
2.2 Accessibility in User Reviews
Even though user reviews can be a robust tool for mobile app evolution, and even mature apps have many trivial accessibility issues [19, 71], only 1.24% of mobile app users report accessibility issues to app stores [18]. In other words, 98.76% of mobile app users do not post accessibility issues in the form of reviews on app stores. In an effort to find out whether mobile app users post accessibility-related issues to app stores, Eler et al. [18] inspected 214,053 mobile app reviews using a string-matching approach. They relied on a set of 213 keywords derived from 54 BBC recommendations [10] proposed for mobile accessibility. Their approach classified a total of 5,076 reviews as accessibility reviews. However, through a later manual inspection, the researchers found that only 2,663 of those reviews were really about accessibility. We used these 2,663 identified accessibility reviews as one of the two groups in the training set required for supervised machine learning. We created the second group (i.e., non-accessibility reviews) from their total dataset (i.e., 214,053 reviews). So far, this is one of the preliminary studies related to accessibility in mobile app user reviews.
2.3 Classification of Text Documents
Many studies classify app reviews using different taxonomies [12, 16, 28, 41, 46, 49], for various purposes: detection of potential feature requests, bug reports, complaints, praises, etc. Even though many of them identify reviews related to app usability, there is no explicit mention of accessibility-related issues [18].
In contrast to automated approaches, classification of text documents using a set of predefined keywords has been widely performed across different domains in software engineering. For instance, Eler et al. [18] relied on 213 keywords to identify accessibility-related reviews. Stroggylos and Spinellis [62] identified refactoring-related commits using the single keyword “refactor”. Similarly, Ratzinger et al. [52] used 13 keywords to detect refactoring in commit messages. Later, Murphy-Hill et al. [43] replicated Ratzinger’s work on two open-source projects using the 13 keywords Ratzinger used. However, they disproved the previous assumption that commit messages in the version history of programs are indicators of refactoring activities. The reasoning behind their findings is that developers do not always report refactoring activities, as they might associate refactoring with other activities such as adding a feature. AlOmar et al. [2] have also explored how developers document their refactoring activities in commit messages, using a variety of 87 textual patterns (i.e., keywords and phrases). Similarly, we believe users can express accessibility concerns without explicitly using any accessibility keywords from the BBC guidelines, as assumed by Eler et al. [18].
In contrast to the keyword-based approaches, we used an automated machine learning approach, since learning approaches outperform the accuracy of keyword-based approaches by at least 1.45 times [3, 40]. A keyword-based identification approach (i.e., one that relies on an existing set of predefined keywords) can also miss reviews, not only because reviews left by users might not use those keywords to express an accessibility concern, but also because a single word might not be enough to convey an accessibility message. For example, consider the review “I hope someday we change size of the fonts”; here the context conveys an accessibility concern even though the user does not explicitly use keywords such as “disabled”, “blind”, or “low vision”.
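To illustrate both failure modes, below is a minimal sketch of a keyword-based detector of the kind described above, applied to the two reviews quoted in this paper. The keyword list is a small illustrative subset of our choosing, not the 213 BBC-derived keywords of Eler et al. [18].

```python
# Minimal keyword-based detector, sketched for illustration only.
# The keyword set below is a tiny illustrative subset, not the
# 213 BBC-derived keywords used by Eler et al. [18].
KEYWORDS = {"blind", "deaf", "low vision", "voice over", "accessibility"}

def is_accessibility_review(review: str) -> bool:
    # Flag the review if any keyword occurs anywhere in the lowercased text.
    text = review.lower()
    return any(keyword in text for keyword in KEYWORDS)

# False positive: "blind" here describes game knowledge, not a user's vision.
print(is_accessibility_review("Going into the sewers almost literally blind ..."))  # True

# False negative: a genuine accessibility concern with no keyword present.
print(is_accessibility_review("I hope someday we change size of the fonts"))  # False
```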
3 ACCESSIBILITY APP REVIEW CLASSIFICATION
The main goal of this work is to automatically identify accessibility-related reviews in a large dataset of app reviews. Our approach takes a set of reviews as input and makes a binary decision on whether or not each review pertains to accessibility (for simplicity, we refer to them as accessibility reviews and non-accessibility reviews). To do so, we built a classification model using a corpus of reviews and current classification techniques. We then used the classification model to predict the types of new app reviews. Figure 1 provides an overview of the process used in the detection of accessibility reviews. Our approach follows five main steps:
[Figure 1 depicts a five-step pipeline: Step 1, Data Collection (the Eler et al. dataset of 2,663 accessibility reviews, plus a random selection of 2,663 reviews from their 211,390 non-accessibility reviews, giving our dataset of 5,326); Step 2, Data Preparation (tokenization, lemmatization, stop-word removal, noise removal, case normalization); Step 3, Feature Extraction (feature hashing and filter-based feature selection using mutual information); Step 4, Model Selection (nine classifiers: Boosted Decision Tree (BDT), Decision Forest (DF), Logistic Regression (LR), Neural Network (NN), Support Vector Machine (SVM), Averaged Perceptron (AP), Bayes Point Machine (BPM), Decision Jungle (DJ), and Locally Deep SVM (LD-SVM)); and Step 5, Model Evaluation (cross-validation, outputting accessibility review vs. non-accessibility review).]

Figure 1: Accessibility app review classification process.
(1) Data Collection: We used a dataset of app reviews, along with their ground-truth categories previously identified through manual inspection [18], as input for training purposes.

(2) Data Preparation: We applied data cleansing and text preprocessing to this set to prepare the review text for the learning algorithms. The text preprocessing procedures we used are tokenizing, lemmatizing, removing stop words, and removing capitalization.

(3) Feature Extraction: We used feature hashing [68] to extract features (i.e., words) from the preprocessed review text and create a structured feature space.

(4) Model Selection and Tuning: We examined a total of nine classification algorithms to evaluate the performance of the model for prediction. These classifiers were chosen because they are commonly used for the classification of text such as app reviews [28, 31]. After training and evaluating the model, we used a testing dataset to challenge the performance of the model. Since the model had already learned the N-gram vocabulary and weights (discussed in Section 3.3) from the training dataset, the classifier output predicted labels and probability scores for the testing dataset. Since an app review is plain text in our case, we follow the approach provided by Kowsari et al. [33], which discusses trending techniques and algorithms for text classification, similar to [3, 4].

(5) Model Evaluation: We built a training set using the extracted features for the model to learn from. (A minimal end-to-end sketch of steps 2–5 is given after this list.)
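As referenced in step 5, the following is a minimal end-to-end sketch of steps 2–5 using common Python libraries. Treat it as an approximation under stated assumptions: the paper's exact preprocessing, hashing configuration, and Boosted Decision Trees implementation are not reproduced here, and the input file name is hypothetical.

```python
# Minimal sketch of steps 2-5 under assumptions: a hypothetical
# "reviews_labeled.csv" with columns "review" and "label", NLTK for
# preprocessing, and scikit-learn's gradient boosting as a stand-in
# for the paper's Boosted Decision Trees.
# Requires: nltk.download("punkt"), nltk.download("stopwords"),
#           nltk.download("wordnet")
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import cross_val_score

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(review: str) -> str:
    # Step 2: case normalization, tokenization, noise and stop-word
    # removal, then lemmatization of the surviving tokens.
    tokens = word_tokenize(review.lower())
    kept = [lemmatizer.lemmatize(tok) for tok in tokens
            if tok.isalpha() and tok not in stop_words]
    return " ".join(kept)

df = pd.read_csv("reviews_labeled.csv")  # hypothetical file name
texts = df["review"].map(preprocess)

# Step 3: feature hashing maps tokens into a fixed-size vector space
# without storing an explicit vocabulary.
X = HashingVectorizer(n_features=2**12, alternate_sign=False).fit_transform(texts)

# Steps 4-5: train the boosted-trees stand-in and evaluate it with
# cross-validation, as in the model-evaluation step of Figure 1.
scores = cross_val_score(GradientBoostingClassifier(), X, df["label"],
                         cv=10, scoring="f1")
print(f"Mean F1 across folds: {scores.mean():.3f}")
```

Because feature hashing is stateless, the same vectorizer can later be applied to unseen reviews at prediction time without shipping a vocabulary alongside the model.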
3.1 Data Collection
The dataset used for this study, summarized in Table 1, is a collection of the 2,663 accessibility reviews manually validated by Eler et al. [18]. The collected reviews come from 701 apps belonging to 15 different categories, as shown in Figure 2. The dataset excluded all apps under the Theming and System categories, since they usually do not have any interface associated with them. Eler et al. [18] started by collecting 214,053 reviews, then performed string matching using 213 keywords to filter down the reviews and keep only those that may potentially contain information related to accessibility. These keywords are derived from 54 BBC recommendations proposed for mobile accessibility. The string matching reduced the reviews from 214,053 to 5,076 candidate accessibility reviews. However, the manual inspection of these candidate reviews found that only 2,663 were true positives.
Table 1: Statistics of the dataset.
Number of Apps 701
App Categories 15
All Reviews 214,053
Accessibility Reviews 2,663
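The counts above (5,076 keyword-matched candidates, of which 2,663 are true accessibility reviews) imply the precision of the string-matching filter. As a quick check, this is simple arithmetic on the numbers reported by Eler et al. [18], not a result computed in this paper:

```python
# Precision of the string-matching filter, using only the counts reported
# above: 2,663 true positives among 5,076 keyword-matched candidates.
true_positives, candidates = 2663, 5076
print(f"String-matching precision: {true_positives / candidates:.1%}")  # ~52.5%
```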
In order to verify the previous manual labeling of the reviews, we followed the process of Levin et al. [36] and randomly selected a 9% sample of the reviews, i.e., 243 out of the 2,663 reviews. This quantity roughly equates to a sample size with a confidence level of 95% and a confidence interval of 6. We then randomly added another 243 non-accessibility reviews, to end up with a total of 486 reviews. Afterward, one researcher labeled them. The selected data was not exposed
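As a sanity check, the standard finite-population sample-size formula reproduces the 243 reported above. This sketch assumes the usual z = 1.96 for 95% confidence, worst-case p = 0.5, and a 6% margin of error matching the stated confidence interval of 6:

```python
# Finite-population sample size:
#   n = N*z^2*p*(1-p) / (e^2*(N-1) + z^2*p*(1-p))
# Assumes z = 1.96 (95% confidence), p = 0.5 (worst case), e = 0.06.
N, z, p, e = 2663, 1.96, 0.5, 0.06
n = N * z**2 * p * (1 - p) / (e**2 * (N - 1) + z**2 * p * (1 - p))
print(round(n))  # 243, matching the 9% sample reported above
```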