
ignores that keywords derived from guidelines do not necessarily match the words used in reviews posted by users. This mismatch includes, but is not limited to, situations in which users misspell the keywords. More importantly, the presence of certain keywords in a review does not necessarily mean that the review is about accessibility. For example, consider the following review from the dataset of Eler et al. [18]:
“This is the closest game to my old 2001 Kyocera 2235’s
inbuilt game ’Cavern crawler’. Everything is so simple
and easy to comprehend but that doesn’t mean that it
is easy to complete right off of the bat. Going into the
sewers almost literally blind (sight and knowledge of
goods in inventory) is a great touch too. Keep at it. I’ll
support you at least in donations.”
This review contains a set of keywords that could indicate
accessibility (e.g., “old”, “blind” and “sight”) but it is not an
accessibility review. In this review, the word “old” refers to a
device rather than a person. The words “blind” and “sight” re-
fer to knowledge of goods in the game rather than describing a
player’s vision. Therefore, the discovery of accessibility reviews relies heavily on context, and simply searching for the presence of keywords in the review text is insufficient. Due to the overhead of manual identification and the high false-positive rate of automated keyword-based detection, both methods remain impractical for developers to use, and so accessibility reviews remain hard to identify and to prioritize for correction. To address this challenge, it is critical to design a solution with learning capabilities: one that can take a set of examples known to be accessibility reviews, together with another set of examples that are not about accessibility but do contain accessibility-related keywords, and learn how to distinguish between them. Therefore, in this paper, we formulate the identification of accessibility reviews as a binary classification problem and address it with supervised learning. Taking as input a set of accessibility reviews obtained by manual inspection in a previous study [18], we deploy state-of-the-art machine learning models to learn the features, i.e., the textual patterns that are representative of
accessibility reviews. In contrast to relying on words derived from guidelines, our solution extracts features (i.e., words and patterns) from actual user reviews and learns from them. This is critical because there is a semantic gap between the guidelines, formally written at an abstract level, and technology-specific keywords. By features, we refer to a keyword or a set of keywords extracted from accessibility-related reviews that are not only important for classification algorithms, but can also be useful for developers to understand accessibility-related issues and features in their apps. The patterns can be about an app feature that supports accessibility (e.g., “font customization”, “page zooming” or “speed control”); about assistive technology (e.g., “word prediction”, “text to speech” or “voice over”); or about disability comments (e.g., “low vision”, “handicapped”, “deaf” or “blind”).
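To make this formulation concrete, the following is a minimal sketch of such a classification pipeline, not the exact implementation evaluated in this paper: it assumes scikit-learn, and uses TF-IDF n-gram features and GradientBoostingClassifier as stand-ins for the learned textual patterns and the boosted decision trees model discussed later; the labeled examples are hypothetical.

```python
# Minimal sketch of the binary-classification formulation (not the
# exact implementation evaluated in this paper). TF-IDF n-grams and
# GradientBoostingClassifier stand in for the learned textual patterns
# and the boosted decision trees model; the labeled data is hypothetical.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier

reviews = [
    "The voice over feature stopped reading the buttons",    # accessibility
    "Please add text to speech for us low vision users",     # accessibility
    "Going into the sewers almost literally blind is fun",   # not accessibility
    "Closest game to my old phone's built-in dungeon game",  # not accessibility
]
labels = [1, 1, 0, 0]  # 1 = accessibility review, 0 = non-accessibility

pipeline = Pipeline([
    # Word and bigram features let the model pick up patterns such as
    # "text to speech" or "low vision" rather than single keywords.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", GradientBoostingClassifier()),
])
pipeline.fit(reviews, labels)
print(pipeline.predict(["please add a font customization option"]))
```

Specifically, we address the following three research questions in our study: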
RQ1: To what extent can machine learning models accurately distinguish accessibility reviews from non-accessibility reviews?
To answer this research question, we rely on a manually
curated dataset of 2,663 accessibility reviews, which we
augment with another 2,663 non-accessibility reviews.
Then, we perform a comparative study among state-of-the-art binary classification models to identify the model that best distinguishes accessibility reviews from non-accessibility reviews.
RQ2: How effective is our machine learning approach in identifying accessibility reviews?
Opting for a complex solution, i.e., supervised learning, has its own challenges, as models need to be trained, tuned, and maintained. To justify this choice, we compare the best-performing model from the previous research question with two baselines: the string-matching method and the random classifier (a minimal sketch of a string-matching baseline follows this list). This research question verifies whether a simpler solution can achieve competitive results.
RQ3: What is the size of the training dataset needed for the classifier to effectively identify accessibility reviews?
In this research question, we empirically determine the minimum number of training instances, i.e., accessibility reviews, that our best-performing model needs to achieve its best performance. Such information is useful for practitioners to estimate the amount of manual work (i.e., preparation of training data) needed to build this solution.
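For illustration, here is a minimal sketch of the kind of keyword-based string-matching baseline referenced in RQ2; the keyword set is a short hypothetical example, not the full vocabulary used in our experiments.

```python
# Minimal sketch of a keyword-based string-matching baseline
# (illustrative only; this keyword set is hypothetical and far
# smaller than the vocabulary used in the actual experiments).
ACCESSIBILITY_KEYWORDS = {
    "accessibility", "blind", "deaf", "low vision",
    "screen reader", "text to speech", "voice over",
}

def is_accessibility_review(review: str) -> bool:
    """Flag a review if it contains any accessibility-related keyword."""
    text = review.lower()
    return any(keyword in text for keyword in ACCESSIBILITY_KEYWORDS)

# Context-blind matching is the source of this baseline's false
# positives: the Kyocera review above would be flagged via "blind".
print(is_accessibility_review("Going into the sewers almost literally blind"))
```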
We performed our experiments using a dataset of 5,326 user reviews provided by a previous study [18]. Our comparative study has shown that the Boosted Decision Trees model (BDTs-model) achieves the best performance among 8 other state-of-the-art models. Then, we compared our BDTs-model against two baselines: (1) a string-matching algorithm and (2) a random classifier. Our approach provided a significant improvement in the identification of accessibility reviews, outperforming baseline 1 (the keyword-based detector) by 1.574 times and surpassing baseline 2 (the random classifier) by 39.434 times.
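As an illustration of how such a comparative study can be set up, the sketch below cross-validates a few off-the-shelf classifiers on TF-IDF features and reports mean F1 scores. It is a simplified stand-in for our experimental setup: the classifier list, feature extraction, and cross-validation settings shown here are illustrative assumptions.

```python
# Illustrative sketch of a comparative study of classifiers
# (a simplified stand-in for the paper's experimental setup).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def compare_models(reviews, labels):
    """Cross-validate candidate classifiers and print mean F1 scores."""
    candidates = {
        "boosted decision trees": GradientBoostingClassifier(),
        "random forest": RandomForestClassifier(),
        "logistic regression": LogisticRegression(max_iter=1000),
        "linear SVM": LinearSVC(),
    }
    for name, clf in candidates.items():
        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
        scores = cross_val_score(model, reviews, labels, cv=5, scoring="f1")
        print(f"{name}: mean F1 = {scores.mean():.3f}")
```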
The contributions of this paper are:
(1) We present an action research contribution that privileges societal benefit by helping developers automatically detect accessibility-related reviews and filter out irrelevant reviews. We make our model and datasets publicly available (https://smilevo.github.io/access/) for researchers to replicate and extend, and for practitioners to use our web service to filter down their user reviews.
(2) We show that a relatively small training dataset (i.e., 1,500 reviews) suffices to achieve an F1-measure of 85% or higher, outperforming state-of-the-art string-matching methods. The F1-measure continues to improve as more training data is added; a minimal sketch of such a learning-curve analysis follows this list.
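The following sketch illustrates one way to carry out the training-size analysis behind this contribution. It is an assumed setup (scikit-learn's learning_curve with F1 scoring over increasing fractions of the data), not our exact procedure.

```python
# Illustrative learning-curve analysis (an assumed setup, not the
# paper's exact procedure): measure F1 as a function of training-set
# size to find where performance plateaus.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

def f1_learning_curve(reviews, labels):
    """Return (train_sizes, mean_f1) over increasing training sizes."""
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          GradientBoostingClassifier())
    sizes, _, test_scores = learning_curve(
        model, reviews, labels,
        train_sizes=np.linspace(0.1, 1.0, 10),  # 10% .. 100% of the data
        cv=5, scoring="f1",
    )
    return sizes, test_scores.mean(axis=1)
```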
2 RELATED WORK
It is crucial that mobile applications be accessible to allow
all individuals with different abilities to have fair access and