
(2020) where missing labels for Image Button instances were found to be the primary accessibility barrier, we focus on creating annotations useful for identifying icons. In particular, we annotated the 77 most frequent classes of icons based on their appearance. We refer to this task as the Icon Shape task. Second, we identified icon shapes that can have multiple semantic meanings and annotated each such icon with its semantic label. This task is called Icon Semantics. Some examples of such icons can be seen in Figure 1b. Finally, we annotated UI elements, such as icons, text inputs, and checkboxes, and associated them with their text labels. These associations can help us identify meaningful labels for the long tail of icons and UI elements that are not present in our schema but have an associated text label. We refer to this task as the Label Association task. The main contributions of this paper are as follows:
• A large-scale dataset¹ of human annotations for 1) Icon Shape, 2) Icon Semantics, and 3) selected general UI elements (icons, form fields, radio buttons, text fields) and their associated text labels on the RICO dataset.
• Strong benchmark models² based on state-of-the-art architectures (He et al., 2016; Carion et al., 2020; Vaswani et al., 2017), using image-only and multimodal inputs. We present an analysis of these models, evaluating the benefits of using View Hierarchy attributes and optical character recognition (OCR) along with the image pixels.

¹ The datasets are released at https://github.com/google-research-datasets/rico-semantics.
² Benchmark models and code are released at https://github.com/google-research/google-research/tree/master/rico-semantics.
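To make the three annotation tasks concrete, the sketch below shows what a single annotation record for each task might look like. This is a minimal illustration only: the field names, class names, and coordinate convention are assumptions for exposition, not the exact schema of the released files.

```python
# Hypothetical annotation records for the three tasks. Field names,
# class names, and the normalized-coordinate convention are assumptions
# for illustration, not the exact schema of the released dataset.

# Icon Shape: an icon bounding box labelled with one of the 77 appearance classes.
icon_shape_example = {
    "screen_id": "12345",
    "bbox": [0.02, 0.91, 0.10, 0.97],  # [x_min, y_min, x_max, y_max], normalized
    "shape_label": "MAGNIFYING_GLASS",
}

# Icon Semantics: the same shape can carry different meanings, so an
# additional semantic label captures the intended functionality.
icon_semantics_example = {
    "screen_id": "12345",
    "bbox": [0.02, 0.91, 0.10, 0.97],
    "shape_label": "MAGNIFYING_GLASS",
    "semantic_label": "SEARCH",  # e.g. search rather than zoom
}

# Label Association: a UI element (icon, checkbox, text field, ...)
# linked to the on-screen text that describes it.
label_association_example = {
    "screen_id": "12345",
    "element_bbox": [0.05, 0.30, 0.12, 0.36],
    "element_type": "CHECKBOX",
    "label_bbox": [0.14, 0.30, 0.60, 0.36],
    "label_text": "Remember my password",
}
```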
2 Related Work
2.1 Datasets
Large-scale datasets like ImageNet (Deng et al., 2009) played a crucial part in the development of Deep Learning models (Krizhevsky et al., 2012; He et al., 2016) for Image Understanding. Similarly, the release of the RICO dataset (Deka et al., 2017) enabled data-driven modeling for understanding user interfaces of mobile apps. RICO is, to the best of our knowledge, the largest public repository of mobile app data, containing 72k UI screenshots and their View Hierarchies from 9.7k Android apps
spanning 27 categories. Apart from RICO, other datasets include ERICA (Deka et al., 2016) with sequences of user interactions with mobile UIs, and LabelDROID (Chen et al., 2020a), which contains 13.1k mobile UI screenshots and View Hierarchies.
There have been a few efforts to provide additional annotations on RICO. SWIRE (Huang et al., 2019) and VINS (Bunian et al., 2021) added annotations for UI retrieval, and Enrico (Leiva et al., 2020) added annotations for 20 design topics. Liu et al. (2018) automatically generated semantic annotations for UI elements using a convolutional neural network trained on a subset of the data. Recently, Li et al. (2022) released UI element labels on View Hierarchy boxes, including identifying boxes that do not match any element in the UI. Even though some of these works are similar in spirit to the dataset presented in this paper, there are two major differences: 1) Their icon and UI element labels are inferred on boxes extracted from the View Hierarchy, whereas in our work we add human-annotated bounding boxes directly on the image. Due to noise in the View Hierarchies, such as missing and misaligned bounding boxes for UI elements (Li et al., 2020a, 2022), we observe that human annotation increases the number of labelled icons by 47%.
2) The semantic icon labels in Liu et al. (2018) conflate appearance and functionality. For example, “close” and “delete,” “undo” and “back,” and “add” and “expand” are mapped to the same class, even though they represent different functionalities. The Icon Semantics annotations in our dataset specifically aim to distinguish between icons that have the same appearance but different functionality.
2.2 Models
Pixel-based methods for UI understanding have been studied for a long time. They have been used for a variety of applications like GUI testing (Yeh et al., 2009; Chang et al., 2010), identifying similar products across screens (Bell and Bala, 2015), finding similar designs and UIs (Behrang et al., 2018; Bunian et al., 2021), detecting issues in UIs (Liu, 2020), generating accessibility metadata (Zhang et al., 2021), and generating descriptions of elements (Chen et al., 2020a). A recent study by Chen et al. (2020b) compares traditional image processing methods and Deep Learning methods to identify different UI elements. The image-only baseline models studied in this paper are based on Object Detection methods presented in Chen et al.