
(2020) where missing labels for Image Button instances were found to be the primary accessibility barrier, we focus on creating annotations useful for identifying icons. In particular, we annotated the 77 most frequent classes of icons based on their appearance. We refer to this task as the Icon Shape task. Second, we identified icon shapes that can have multiple semantic meanings and annotated each such icon with its semantic label. This task is called Icon Semantics. Some examples of such icons can be seen in Figure 1b. Finally, we annotated UI elements, such as icons, text inputs, and checkboxes, and associated them with their text labels. These associations can help us identify meaningful labels for the long tail of icons and UI elements that are not present in our schema but have an associated text label. We refer to this task as the Label Association task. The main contributions of this paper are as follows:
• A large-scale dataset¹ of human annotations for 1) Icon Shape, 2) Icon Semantics, and 3) selected general UI elements (icons, form fields, radio buttons, text fields) and their associated text labels on the RICO dataset.
• Strong benchmark models² based on state-of-the-art architectures (He et al., 2016; Carion et al., 2020; Vaswani et al., 2017), using image-only and multimodal inputs. We present an analysis of these models, evaluating the benefits of using View Hierarchy attributes and optical character recognition (OCR) along with the image pixels.

¹ The datasets are released at https://github.com/google-research-datasets/rico-semantics.
² Benchmark models and code are released at https://github.com/google-research/google-research/tree/master/rico-semantics.
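To make the three annotation tasks concrete, the sketch below shows what a single annotation record for each task might look like. This is a minimal illustration only: the field names, class names, and coordinate convention are assumptions for exposition, not the exact schema of the released files.

```python
# Hypothetical annotation records for the three tasks. Field names,
# class names, and the normalized-coordinate convention are assumptions
# for illustration, not the exact schema of the released dataset.

# Icon Shape: an icon bounding box labelled with one of the 77 appearance classes.
icon_shape_example = {
    "screen_id": "12345",
    "bbox": [0.02, 0.91, 0.10, 0.97],  # [x_min, y_min, x_max, y_max], normalized
    "shape_label": "MAGNIFYING_GLASS",
}

# Icon Semantics: the same shape can carry different meanings, so an
# additional semantic label captures the intended functionality.
icon_semantics_example = {
    "screen_id": "12345",
    "bbox": [0.02, 0.91, 0.10, 0.97],
    "shape_label": "MAGNIFYING_GLASS",
    "semantic_label": "SEARCH",  # e.g. search rather than zoom
}

# Label Association: a UI element (icon, checkbox, text field, ...)
# linked to the on-screen text that describes it.
label_association_example = {
    "screen_id": "12345",
    "element_bbox": [0.05, 0.30, 0.12, 0.36],
    "element_type": "CHECKBOX",
    "label_bbox": [0.14, 0.30, 0.60, 0.36],
    "label_text": "Remember my password",
}
```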
2 Related Work
2.1 Datasets
Large-scale datasets like ImageNet (Deng et al., 2009) played a crucial part in the development of Deep Learning models (Krizhevsky et al., 2012; He et al., 2016) for Image Understanding. Similarly, the release of the RICO dataset (Deka et al., 2017) enabled data-driven modeling for understanding user interfaces of mobile apps. RICO is, to the best of our knowledge, the largest public repository of mobile app data, containing 72k UI screenshots and their View Hierarchies from 9.7k Android apps
spanning 27 categories. Apart from RICO, other datasets include ERICA (Deka et al., 2016) with sequences of user interactions with mobile UIs, and LabelDROID (Chen et al., 2020a), which contains 13.1k mobile UI screenshots and View Hierarchies.
There have been a few efforts to provide additional annotations on RICO. SWIRE (Huang et al., 2019) and VINS (Bunian et al., 2021) added annotations for UI retrieval, and Enrico (Leiva et al., 2020) added annotations for 20 design topics. Liu et al. (2018) automatically generated semantic annotations for UI elements using a convolutional neural network trained on a subset of the data. Recently, Li et al. (2022) released UI element labels on View Hierarchy boxes, including identifying boxes that do not match any element in the UI. Even though some of these works are similar in spirit to the dataset presented in this paper, there are two major differences: 1) Their icon and UI element labels are inferred on boxes extracted from the View Hierarchy, whereas in our work we add human-annotated bounding boxes directly on the image. Due to noise in the View Hierarchies, such as missing and misaligned bounding boxes for UI elements (Li et al., 2020a, 2022), we observe that human annotation increases the number of labelled icons by 47%.
2) The semantic icon labels in Liu et al. (2018) conflate appearance and functionality. For example, “close” and “delete,” “undo” and “back,” and “add” and “expand” are mapped to the same class, even though they represent different functionalities. The Icon Semantics annotations in our dataset specifically aim to distinguish between icons that have the same appearance but different functionality.
2.2 Models
Pixel-based methods for UI understanding have been studied for a long time. They have been used for a variety of applications like GUI testing (Yeh et al., 2009; Chang et al., 2010), identifying similar products across screens (Bell and Bala, 2015), finding similar designs and UIs (Behrang et al., 2018; Bunian et al., 2021), detecting issues in UIs (Liu, 2020), generating accessibility metadata (Zhang et al., 2021), and generating descriptions of elements (Chen et al., 2020a). A recent study by Chen et al. (2020b) compares traditional image processing methods and Deep Learning methods to identify different UI elements. The image-only baseline models studied in this paper are based on Object Detection methods presented in Chen et al.