Towards Better Semantic Understanding of Mobile Interfaces

Srinivas Sunkara1, Maria Wang1, Lijuan Liu1, Gilles Baechler1, Yu-Chung Hsiao1, Jindong (JD) Chen1, Abhanshu Sharma1, James Stout2
1Google Research, 2Google
* Equal contribution, correspondence: {srinivasksun,mariawang}@google.com
Abstract
Improving the accessibility and automation capabilities of mobile devices can have a significant positive impact on the daily lives of countless users. To stimulate research in this direction, we release a human-annotated dataset with approximately 500k unique annotations aimed at increasing the understanding of the functionality of UI elements. This dataset augments images and view hierarchies from RICO, a large dataset of mobile UIs, with annotations for icons based on their shapes and semantics, and with associations between different elements and their corresponding text labels, resulting in a significant increase in the number of UI elements and the categories assigned to them. We also release models using image-only and multimodal inputs; we experiment with various architectures and study the benefits of using multimodal inputs on the new dataset. Our models demonstrate strong performance on an evaluation set of unseen apps, indicating their generalizability to newer screens. These models, combined with the new dataset, can enable innovative functionalities such as referring to UI elements by their labels and improved coverage and semantics for icons, which would go a long way toward making UIs more usable for everyone.
1 Introduction
Mobile devices like phones and tablets have become ubiquitous and indispensable for carrying out our daily activities. It is not an exaggeration to say that usage of mobile devices is becoming a requirement for full participation in society. Recent reports from the WHO and others (World Health Organization, 2021; Ackland and Bourne, 2017) estimate that around 2.2 billion people across the world have some form of vision impairment, of whom 36 million are blind. Accessibility of mobile devices is necessary for these visually impaired users to carry out their daily tasks and is an important tool for their social integration (Ladner, 2015).
Accessibility of mobile apps has improved significantly over the past few years, aided by developments on two main fronts. Firstly, screen readers, like VoiceOver (Apple, 2021c) on iOS and TalkBack (Accessibility, 2021e) on Android, enable visually impaired users to control their phone in an eyes-free manner. Secondly, development tools and standards to enhance accessibility, such as the accessibility guidelines for iOS and Android (Accessibility, 2021a; Apple, 2021b), the Android Accessibility Scanner (Accessibility, 2021b), and the iOS Accessibility Inspector (Apple, 2021a), have helped developers identify and fix accessibility issues in their applications. For most of these utilities, the main source of accessibility data is the accessibility labels (Accessibility, 2021d) provided by app developers. These labels are specified as attributes of the different UI elements in a structured representation of the screen, such as the View Hierarchy, and are made available to screen readers (Accessibility, 2021c; Apple, 2018). Despite the growth in accessibility tools, recent studies (Ross et al., 2020, 2017; Chen et al., 2020a) have found that even the most widely used apps have large gaps in accessibility. For instance, a study by Chen et al. (2020a) of more than 7k apps and 279k screens revealed that around 77% of the apps and 60% of the screens had at least one element without an explicit label. Similarly, Ross et al. (2020) found that, in a population of 10k apps, 53% of the Image Button elements were missing labels.
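
To make the scale of this problem concrete, gaps like these can be surfaced by walking a screen's view hierarchy and checking image elements for content descriptions. The sketch below is a minimal, illustrative script, not part of any released tooling; the JSON key names ("activity", "root", "children", "class", "content-desc") are assumptions based on RICO-style dumps and may differ for other sources.

```python
# Minimal sketch: count image elements that appear to lack accessibility labels
# in a RICO-style view hierarchy JSON. The key names used here ("activity",
# "root", "children", "class", "content-desc") are assumptions and may need
# adjusting for a particular dump format.
import json

def iter_nodes(node):
    """Yield a node and, recursively, all of its descendants."""
    yield node
    for child in node.get("children") or []:
        if child:  # some dumps contain null children
            yield from iter_nodes(child)

def count_unlabeled_images(vh_path):
    with open(vh_path) as f:
        root = json.load(f).get("activity", {}).get("root", {})
    total = unlabeled = 0
    for node in iter_nodes(root):
        if "Image" in (node.get("class") or ""):  # e.g. ImageView, ImageButton
            total += 1
            desc = node.get("content-desc")
            # The content description may be a string, a list, or absent.
            if not desc or (isinstance(desc, list) and not any(desc)):
                unlabeled += 1
    return total, unlabeled
```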
In this paper, we attempt to encourage further research into improving mobile device accessibility and increasing device automation by releasing an enhanced version of the RICO dataset (Deka et al., 2017) with high-quality human annotations aimed at semantic understanding of various UI elements. Firstly, following a study by Ross et al. (2020), where missing labels for Image Button instances were found to be the primary accessibility barrier, we focus on creating annotations useful for identifying icons. In particular, we annotated the 77 most frequent classes of icons based on their appearance. We refer to this task as the Icon Shape task. Secondly, we identified icon shapes which can have multiple semantic meanings and annotated each such icon with its semantic label. This task is called Icon Semantics. Some examples of such icons can be seen in Figure 1b. Finally, we annotate UI elements, like icons, text inputs, checkboxes, etc., and associate them with their text labels. These associations can help us identify meaningful labels for the long tail of icons and UI elements that are not present in our schema but have a textual label associated with them. We refer to this task as the Label Association task.
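
To make these three tasks concrete, the following is one purely illustrative way to represent a single annotation record that covers all of them. The field and label names (e.g. "X:CLOSE") are hypothetical and chosen for exposition; the released files define their own schema.

```python
# Hypothetical record covering the three tasks (Icon Shape, Icon Semantics,
# Label Association). Illustrative only; not the schema of the released dataset.
from dataclasses import dataclass
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels

@dataclass
class ElementAnnotation:
    screen_id: str                 # RICO screenshot identifier
    bounds: BBox                   # human-drawn box on the screenshot
    icon_shape: Optional[str]      # e.g. "X"; None if the element is not an icon
    icon_semantics: Optional[str]  # e.g. "X:CLOSE"; None if the shape is unambiguous
    label_bounds: Optional[BBox]   # box of the associated text label, if any
    label_text: Optional[str]      # text of that label, e.g. "Close"

# Example: an "X" icon whose meaning on this screen is "close", with no text label.
ann = ElementAnnotation(
    screen_id="12345",
    bounds=(24.0, 60.0, 72.0, 108.0),
    icon_shape="X",
    icon_semantics="X:CLOSE",
    label_bounds=None,
    label_text=None,
)
```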
The main contributions of this paper are as follows:

• A large-scale dataset of human annotations for 1) Icon Shape, 2) Icon Semantics, and 3) selected general UI elements (icons, form fields, radio buttons, text fields) and their associated text labels on the RICO dataset. The dataset is released at https://github.com/google-research-datasets/rico-semantics.
• Strong benchmark models based on state-of-the-art methods (He et al., 2016; Carion et al., 2020; Vaswani et al., 2017), using image-only and multimodal inputs with different architectures. We present an analysis of these models, evaluating the benefits of using View Hierarchy attributes and optical character recognition (OCR) along with the image pixels; a simplified sketch of one such multimodal model is shown below. The benchmark models and code are released at https://github.com/google-research/google-research/tree/master/rico-semantics.
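
As a rough illustration of the second contribution, the sketch below shows one way image features and text-token features (e.g. from OCR or View Hierarchy attributes) can be fused before a shared transformer encoder with DETR-style per-token class and box heads. It is a simplified stand-in written in PyTorch, not the released implementation; the class name, hyperparameters, and the 77-way icon label space are assumptions, and details such as positional encodings and the Hungarian set-matching loss are omitted.

```python
# Simplified multimodal sketch (not the released models): fuse ResNet image
# features with embedded text tokens, encode them jointly with a transformer,
# and predict DETR-style class/box outputs per token. Positional encodings and
# the set-matching loss are omitted for brevity.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultimodalUIEncoder(nn.Module):  # hypothetical name
    def __init__(self, vocab_size=10000, d_model=256, num_icon_classes=77):
        super().__init__()
        backbone = resnet50()  # randomly initialised backbone for the sketch
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc
        self.img_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.text_emb = nn.Embedding(vocab_size, d_model)  # OCR / VH word tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.cls_head = nn.Linear(d_model, num_icon_classes + 1)  # +1: "no object"
        self.box_head = nn.Linear(d_model, 4)  # normalised box coordinates

    def forward(self, image, text_tokens):
        # image: (B, 3, H, W); text_tokens: (B, T) integer ids.
        feat = self.img_proj(self.cnn(image))         # (B, d_model, h, w)
        visual = feat.flatten(2).transpose(1, 2)      # (B, h*w, d_model)
        textual = self.text_emb(text_tokens)          # (B, T, d_model)
        tokens = torch.cat([visual, textual], dim=1)  # joint multimodal sequence
        encoded = self.encoder(tokens)
        return self.cls_head(encoded), self.box_head(encoded).sigmoid()

# Dummy forward pass to show the expected shapes.
model = MultimodalUIEncoder()
logits, boxes = model(torch.randn(1, 3, 512, 288),
                      torch.randint(0, 10000, (1, 32)))
```

An image-only baseline in this sketch would simply drop the text branch and pass only the visual tokens to the encoder.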
2 Related Work
2.1 Datasets
Large-scale datasets like ImageNet (Deng et al., 2009) played a crucial part in the development of Deep Learning models (Krizhevsky et al., 2012; He et al., 2016) for image understanding. Similarly, the release of the RICO dataset (Deka et al., 2017) enabled data-driven modeling for understanding the user interfaces of mobile apps. RICO is, to the best of our knowledge, the largest public repository of mobile app data, containing 72k UI screenshots and their View Hierarchies from 9.7k Android apps spanning 27 categories. Apart from RICO, other datasets include ERICA (Deka et al., 2016), with sequences of user interactions with mobile UIs, and LabelDROID (Chen et al., 2020a), which contains 13.1k mobile UI screenshots and View Hierarchies.

There have been a few efforts to provide additional annotations on RICO. SWIRE (Huang et al., 2019) and VINS (Bunian et al., 2021) added annotations for UI retrieval, and Enrico (Leiva et al., 2020) added annotations for 20 design topics. Liu et al. (2018) automatically generated semantic annotations for UI elements using a convolutional neural network trained on a subset of the data. Recently, Li et al. (2022) released UI element labels for view hierarchy boxes, including labels identifying boxes that do not match any element in the UI. Even though some of these works are similar in spirit to the dataset presented in this paper, there are two major differences: 1) Their icon and UI element labels are inferred on boxes extracted from the View Hierarchy, whereas, in our work, we add human-annotated bounding boxes directly on the image. Due to noise in the view hierarchies, such as missing and misaligned bounding boxes for UI elements (Li et al., 2020a, 2022), we observe that human annotation increases the number of labelled icons by 47%. 2) The semantic icon labels in Liu et al. (2018) conflate appearance and functionality. For example, "close" and "delete", "undo" and "back", and "add" and "expand" are mapped to the same class, even though they represent different functionalities. The Icon Semantics annotations in our dataset specifically try to distinguish between icons with the same appearance but different functionality.
2.2 Models
Pixel-based methods for UI understanding have been studied for a long time. They have been used for a variety of applications, like GUI testing (Yeh et al., 2009; Chang et al., 2010), identifying similar products across screens (Bell and Bala, 2015), finding similar designs and UIs (Behrang et al., 2018; Bunian et al., 2021), detecting issues in UIs (Liu, 2020), generating accessibility metadata (Zhang et al., 2021), and generating descriptions of elements (Chen et al., 2020a). A recent study by Chen et al. (2020b) compares traditional image processing methods and Deep Learning methods for identifying different UI elements. The image-only baseline models studied in this paper are based on object detection methods presented in Chen et al. (2020b), Zhang et al. (2021), Chen et al. (2020a), and Carion et al. (2020).

(a) Examples of 76 icon shape annotations. We include classes that reflect the social aspects of app usage, e.g., "person" for profile and community, sharing via popular apps such as Facebook and Twitter, etc.

(b) Examples of 38 icon semantics annotations, in the format <shape>:<semantics>. Note that a single shape may represent multiple semantics, depending on the context; e.g., the "X" shape may mean "close", "delete text", or "multiply". We use an umbrella semantic class "OTHER" to cover semantics not included in our proposed set of classes.

Figure 1: In this paper, we annotated the RICO dataset with both icon shapes and their semantics, to encourage further research on app automation and accessibility. The existing icon annotations from Liu et al. (2018) were algorithmically generated, with 10% of them verified; however, we observed that 32% were missing labels compared to our full human annotations. We release our annotations in the hope of contributing back to the community.
Extending to other modalities beyond pixels, Banovic et al. (2012) use video tutorials to understand UIs and annotate them with additional information. Li et al. (2021) use only the screen information for identifying embeddings of UI elements. Hurst et al. (2010) use both the screen and accessibility API information to identify interaction targets in UIs, Chang et al. (2011) use similar inputs to detect and identify certain UI elements, and Nguyen et al. (2018) use them to identify similar UI designs. Multimodal inputs have also been used for understanding screen contents, such as generating element descriptions (Li et al., 2020b), training UI embeddings for multiple downstream tasks (He et al., 2021; Bai et al., 2021), and denoising data and predicting bounding box types (Li et al., 2022).
3 Datasets, taxonomy and annotation
To enable an accessible, hands-free experience for mobile users, it is necessary for the system to understand the functionality of the different UI elements on the screen. To learn data-driven models that enable these functionalities, we use the RICO dataset (Deka et al., 2017). RICO spans >9K apps and >72K UIs, each with a screenshot and information regarding the structure of the UI in the form of a View Hierarchy (VH). Besides the bounding boxes of the different UI elements, the VH contains useful attributes for each element, such as its class name and any developer-provided content description.
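
For readers who want to inspect these screenshot/VH pairs directly, the sketch below loads an image together with its VH and collects a (class, bounds) entry for every element. It assumes Pillow for image loading and, as before, RICO-style JSON key names, which may differ in practice.

```python
# Minimal sketch: pair a RICO screenshot with its View Hierarchy and collect
# the class name and bounding box of every element. Key names ("activity",
# "root", "children", "bounds", "class") are assumptions about the dump format.
import json
from PIL import Image  # assumes Pillow is installed

def load_screen(image_path, vh_path):
    image = Image.open(image_path)
    with open(vh_path) as f:
        root = json.load(f).get("activity", {}).get("root", {})
    elements, stack = [], [root]
    while stack:
        node = stack.pop()
        if not node:
            continue
        if node.get("bounds"):
            # Bounds are typically [left, top, right, bottom] in screen pixels.
            elements.append((node.get("class") or "", node["bounds"]))
        stack.extend(node.get("children") or [])
    return image, elements
```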