
tester tends to select the “OK” event to execute because it
is more likely to bring the app to a new page. “Cancel” is
likely to be the next event to consider because “restart”,
“back”, and “menu” are general events, so the tester may
already have experience executing them from testing other
apps. In summary, to decide whether an event has a higher
priority to be executed, the tester may need to consider its
“features”, such as how many times it has been executed
(i.e., execution frequency) and the content of the widget.
DinoDroid is able to automatically learn a behavior model
from a set of existing apps based on these features, and the
learned model can then be used to test new apps.
Tab.(a)-Tab.(c) in Fig. 1 illustrate DinoDroid. In this
example, DinoDroid dynamically records the feature values
for each event, including the execution frequency, the
number of unexecuted events on the next page (i.e., the
child page), and the text associated with the event. Next,
DinoDroid employs a deep neural network to predict the
accumulative reward (i.e., Q value) of each event on the
current page based on the aforementioned features, and
selects the event that has the largest Q value to execute.
Tab.(a) shows the feature values and Q values when the
page appears for the first time. Since “OK” has the largest
Q value, it is selected for execution. DinoDroid continues
exploring the events on the new page and updating the Q
values. When this page appears a second time, the Q value
of the event for the “OK” button decreases because it has
already been executed. As a result, “Cancel” now has the
largest Q value and is selected for execution. In this case
(Tab.(b)), the child page of “OK” contains 10 unexecuted
events. However, if the child page contained zero unexecuted
events (Tab.(c)), the Q value would be much smaller. This
is because DinoDroid tends to select events whose child
pages contain more unexecuted events.
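To make this concrete, the following is a minimal Python sketch (not DinoDroid's actual implementation) of how per-event features and greedy selection by predicted Q value could be organized. The Event fields, the select_event helper, and the toy_q stand-in for the learned network are illustrative assumptions.

```python
# Minimal sketch: each event on the current page carries a feature vector,
# and the event with the largest predicted Q value is chosen for execution.
# `predict_q` stands in for the learned deep neural network.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Event:
    widget_text: str        # e.g., "OK", "Cancel", "back"
    exec_frequency: int     # how many times this event was executed
    child_unexecuted: int   # unexecuted events on the child page

def select_event(events: List[Event],
                 predict_q: Callable[[Event], float]) -> Event:
    """Greedy policy: execute the event with the largest predicted Q value."""
    return max(events, key=predict_q)

# Hypothetical usage with a hand-written stand-in for the learned model:
def toy_q(e: Event) -> float:
    return e.child_unexecuted - 2.0 * e.exec_frequency

page = [Event("OK", 1, 10), Event("Cancel", 0, 0), Event("back", 3, 0)]
print(select_event(page, toy_q).widget_text)   # -> "OK"
```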
The underlying assumption of our approach is that the
application under test follows the principle of least surprise
(PLS). If an app does not meet the PLS, e.g., an “OK” textual
widget is incorrectly associated with the functionality of
“Cancel”, it would mislead DinoDroid in finding the right
events to execute. Specifically, DinoDroid exploits the
learned knowledge to execute the correct events that result
in higher code coverage or trigger bugs.
2.2 Background
2.2.1 Q-Learning
Q-learning [24] is a model-free reinforcement learning
method that seeks to learn a behavior policy for any finite
Markov decision process (FMDP). Q-learning finds an
optimal policy, π, that maximizes the expected cumulative
reward over a sequence of actions. It is based on
trial-and-error learning, in which an agent interacts with
its environment and assigns utility estimates, known as Q
values, to each state-action pair.
Fig. 2: Deep Q-Networks

As shown in Fig. 2, the agent iteratively interacts with the
outside environment. At each iteration $t$, the agent selects an
action $a_t \in A$ based on the current state $s_t \in S$ and executes
it on the outside environment. After exercising the action,
there is a new state $s_{t+1} \in S$, which can be observed by
the agent. In the meantime, an immediate reward $r_t \in R$ is
received. Then the agent updates the Q values using the
Bellman equation [25] as follows:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \cdot \big(r_t + \gamma \cdot \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\big)$$
In this equation, $\alpha$ is a learning rate between 0 and 1, $\gamma$
is a discount factor between 0 and 1, $s_t$ is the state at time
$t$, and $a_t$ is the action taken at time $t$. Once learned, these Q
values can be used to determine the optimal behavior in each
state by selecting the action $a_t = \arg\max_a Q(s_t, a)$.
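For illustration, below is a minimal tabular Q-learning sketch of this update rule; the state/action types and the values chosen for α and γ are illustrative assumptions, not DinoDroid's settings.

```python
# Minimal tabular Q-learning sketch of the Bellman update above.
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9                 # learning rate and discount factor
Q = defaultdict(float)                  # Q[(state, action)] -> Q value

def update(s_t, a_t, r_t, s_next, actions_next):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a Q(s',a) - Q(s,a))."""
    best_next = max((Q[(s_next, a)] for a in actions_next), default=0.0)
    Q[(s_t, a_t)] += ALPHA * (r_t + GAMMA * best_next - Q[(s_t, a_t)])

def act(s_t, actions):
    """Greedy policy: a_t = argmax_a Q(s_t, a)."""
    return max(actions, key=lambda a: Q[(s_t, a)])
```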
2.2.2 Deep Q-Networks
Deep Q-networks (DQN) scale classic Q-learning to more
complex state and action spaces [26], [27]. In classical
Q-learning, the values $Q(s_t, a_t)$ are stored in and looked up
from a Q-table, which can only handle fully observed,
low-dimensional state and action spaces. As shown in Fig. 2,
DQN instead uses a deep neural network (DNN), specifically
a convolutional neural network (CNN) [28], i.e., a multi-layered
neural network that, for a given state $s_t$, outputs a Q value
$Q(s_t, a)$ for each action $a$. Because a neural network can take
high-dimensional inputs and produce outputs over large action
spaces, DQN is able to scale to more complex state and action
spaces. A neural network can also generalize Q values to
unseen states, which is not possible when using a Q-table.
DQN trains the network to minimize the temporal difference
(TD) [29] error by using the following loss function [27]:
$$\text{loss} = r_t + \gamma \cdot \max_a Q(s_{t+1}, a) - Q(s_t, a_t)$$
where $\gamma$ is the discount factor, which is between 0 and 1. In
other words, with the input $(s_t, a_t)$, the neural network is
trained to predict the Q value as:
$$Q(s_t, a_t) = r_t + \gamma \cdot \max_a Q(s_{t+1}, a) \qquad (1)$$
So in a training sample, the input is $(s_t, a_t)$ and the output
is the corresponding Q value, which can be computed by
$r_t + \gamma \cdot \max_a Q(s_{t+1}, a)$.
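As an illustration of how such a training sample could be formed and fit, the sketch below uses a small Keras network; the TensorFlow/Keras library choice, the network architecture, and the input sizes are assumptions for the example, not DinoDroid's actual configuration.

```python
# Sketch of forming a DQN training target per Equation (1); the network maps
# a state to one Q value per action, and only the taken action's target is
# updated. Architecture and sizes are illustrative assumptions.
import numpy as np
import tensorflow as tf

GAMMA = 0.9
STATE_DIM, N_ACTIONS = 8, 4             # illustrative sizes

q_network = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(STATE_DIM,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(N_ACTIONS),    # Q(s, a) for each action a
])
q_network.compile(optimizer="adam", loss="mse")  # squared-TD-error style loss

def train_step(s_t, a_t, r_t, s_next):
    """Fit the network toward the target r + gamma * max_a Q(s', a)."""
    q_next = q_network.predict(s_next[None, :], verbose=0)[0]
    target = q_network.predict(s_t[None, :], verbose=0)[0]
    target[a_t] = r_t + GAMMA * np.max(q_next)   # Equation (1) for action a_t
    q_network.fit(s_t[None, :], target[None, :], epochs=1, verbose=0)
```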
2.3 Terminologies
A GUI widget is a graphical element of an app, such as
a button, a text field, or a check box. An event is an
executable GUI widget with a particular event type (e.g.,
click, long-click, swipe, edit), so a widget can be associated
with one or more events. In our setting, a state $s$ represents
an app page (i.e., the set of widgets shown on the current
screen; if the set of widgets differs, it is a different page).
We use $s_t$ to represent the current state and $s_{t+1}$ to
represent the next state. A reward $r$ is calculated based on
the improvement of coverage. If code coverage increases, $r$
is assigned a positive number (r=5 by default); otherwise, $r$
is assigned a negative number (r=-2 by default). An Agent