
tester tends to select the “OK” event to execute because it
is more likely to bring the app to a new page. “Cancel” is
likely to be the next event to consider because “restart”,
“back”, and “menu” are general events, so the tester may
already have experience executing them from testing other
apps. In summary, to decide whether an event has a higher
priority to be executed, the tester may need to consider its
“features”, such as how many times it has been executed
(i.e., execution frequency) and the content of the widget.
DinoDroid is able to automatically learn a behavior model
from a set of existing apps based on these features, and the
learned model can then be used to test new apps.
Tab.(a)-Tab.(c) in Fig. 1 illustrate DinoDroid. In this
example, DinoDroid dynamically records the feature values
for each event, including the execution frequency, the
number of unexecuted events on the next page (i.e., the
child page), and the text associated with the event. Next,
DinoDroid employs a deep neural network to predict the
accumulative reward (i.e., Q value) of each event on the
current page based on the aforementioned features, and
selects the event that has the largest Q value to execute.
Tab.(a) shows the feature values and Q values when the
page appears for the first time. Since “OK” has the largest
Q value, it is selected for execution. DinoDroid continues
exploring the events on the new page and updating the Q
values. When this page appears a second time, the Q value
of the event for the “OK” button decreases because it has
already been executed. As a result, “Cancel” now has the
largest Q value and is selected for execution. In this case
(Tab.(b)), the child page of “OK” contains 10 unexecuted
events. However, if the child page contained zero unexecuted
events (Tab.(c)), the Q value would be much smaller. This
is because DinoDroid tends to select events whose child
pages contain more unexecuted events.
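To make this concrete, the following is a minimal Python sketch (not DinoDroid's actual implementation) of how per-event features and greedy selection by predicted Q value could be organized. The Event fields, the select_event helper, and the toy_q stand-in for the learned network are illustrative assumptions.

```python
# Minimal sketch: each event on the current page carries a feature vector,
# and the event with the largest predicted Q value is chosen for execution.
# `predict_q` stands in for the learned deep neural network.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Event:
    widget_text: str        # e.g., "OK", "Cancel", "back"
    exec_frequency: int     # how many times this event was executed
    child_unexecuted: int   # unexecuted events on the child page

def select_event(events: List[Event],
                 predict_q: Callable[[Event], float]) -> Event:
    """Greedy policy: execute the event with the largest predicted Q value."""
    return max(events, key=predict_q)

# Hypothetical usage with a hand-written stand-in for the learned model:
def toy_q(e: Event) -> float:
    return e.child_unexecuted - 2.0 * e.exec_frequency

page = [Event("OK", 1, 10), Event("Cancel", 0, 0), Event("back", 3, 0)]
print(select_event(page, toy_q).widget_text)   # -> "OK"
```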
The underlying assumption of our approach is that the
application under test follows the principle of least surprise
(PLS). If an app does not meet the PLS, e.g., an “OK” textual
widget is incorrectly associated with the functionality of
“Cancel”, it would mislead DinoDroid in finding the right
events to execute. Specifically, DinoDroid exploits the
learned knowledge to execute the correct events that result
in higher code coverage or trigger bugs.
2.2 Background
2.2.1 Q-Learning
Q-learning [24] is a model-free reinforcement learning
method that seeks to learn a behavior policy for any finite
Markov decision process (FMDP). Q-learning finds an
optimal policy, π, that maximizes the expected cumulative
reward over a sequence of actions. It is based on
trial-and-error learning, in which an agent interacts with
its environment and assigns utility estimates, known as Q
values, to each state-action pair.
Fig. 2: Deep Q-Networks

As shown in Fig. 2, the agent iteratively interacts with the
outside environment. At each iteration $t$, the agent selects an
action $a_t \in A$ based on the current state $s_t \in S$ and executes
it on the outside environment. After exercising the action,
there is a new state $s_{t+1} \in S$, which can be observed by
the agent. In the meantime, an immediate reward $r_t \in R$ is
received. Then the agent updates the Q values using the
Bellman equation [25] as follows:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \cdot \big(r_t + \gamma \cdot \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\big)$$
In this equation, $\alpha$ is a learning rate between 0 and 1, $\gamma$
is a discount factor between 0 and 1, $s_t$ is the state at time
$t$, and $a_t$ is the action taken at time $t$. Once learned, these Q
values can be used to determine the optimal behavior in each
state by selecting the action $a_t = \arg\max_a Q(s_t, a)$.
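For illustration, below is a minimal tabular Q-learning sketch of this update rule; the state/action types and the values chosen for α and γ are illustrative assumptions, not DinoDroid's settings.

```python
# Minimal tabular Q-learning sketch of the Bellman update above.
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9                 # learning rate and discount factor
Q = defaultdict(float)                  # Q[(state, action)] -> Q value

def update(s_t, a_t, r_t, s_next, actions_next):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a Q(s',a) - Q(s,a))."""
    best_next = max((Q[(s_next, a)] for a in actions_next), default=0.0)
    Q[(s_t, a_t)] += ALPHA * (r_t + GAMMA * best_next - Q[(s_t, a_t)])

def act(s_t, actions):
    """Greedy policy: a_t = argmax_a Q(s_t, a)."""
    return max(actions, key=lambda a: Q[(s_t, a)])
```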
2.2.2 Deep Q-Networks
Deep Q-networks (DQN) scale classic Q-learning to more
complex state and action spaces [26], [27]. In classical
Q-learning, the values $Q(s_t, a_t)$ are stored in and looked up
from a Q-table, which can only handle fully observed,
low-dimensional state and action spaces. As shown in Fig. 2,
DQN instead uses a deep neural network (DNN), specifically
a convolutional neural network (CNN) [28], i.e., a multi-layered
neural network that, for a given state $s_t$, outputs a Q value
$Q(s_t, a)$ for each action $a$. Because a neural network can take
high-dimensional inputs and produce outputs over large action
spaces, DQN is able to scale to more complex state and action
spaces. A neural network can also generalize Q values to
unseen states, which is not possible when using a Q-table.
DQN trains the network to minimize the temporal difference
(TD) [29] error by using the following loss function [27]:
$$\text{loss} = r_t + \gamma \cdot \max_a Q(s_{t+1}, a) - Q(s_t, a_t)$$
where $\gamma$ is the discount factor, which is between 0 and 1. In
other words, with the input $(s_t, a_t)$, the neural network is
trained to predict the Q value as:
$$Q(s_t, a_t) = r_t + \gamma \cdot \max_a Q(s_{t+1}, a) \qquad (1)$$
So in a training sample, the input is $(s_t, a_t)$ and the output
is the corresponding Q value, which can be computed by
$r_t + \gamma \cdot \max_a Q(s_{t+1}, a)$.
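As an illustration of how such a training sample could be formed and fit, the sketch below uses a small Keras network; the TensorFlow/Keras library choice, the network architecture, and the input sizes are assumptions for the example, not DinoDroid's actual configuration.

```python
# Sketch of forming a DQN training target per Equation (1); the network maps
# a state to one Q value per action, and only the taken action's target is
# updated. Architecture and sizes are illustrative assumptions.
import numpy as np
import tensorflow as tf

GAMMA = 0.9
STATE_DIM, N_ACTIONS = 8, 4             # illustrative sizes

q_network = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(STATE_DIM,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(N_ACTIONS),    # Q(s, a) for each action a
])
q_network.compile(optimizer="adam", loss="mse")  # squared-TD-error style loss

def train_step(s_t, a_t, r_t, s_next):
    """Fit the network toward the target r + gamma * max_a Q(s', a)."""
    q_next = q_network.predict(s_next[None, :], verbose=0)[0]
    target = q_network.predict(s_t[None, :], verbose=0)[0]
    target[a_t] = r_t + GAMMA * np.max(q_next)   # Equation (1) for action a_t
    q_network.fit(s_t[None, :], target[None, :], epochs=1, verbose=0)
```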
2.3 Terminologies
A GUI widget is a graphical element of an app, such as
a button, a text field, or a check box. An event is an
executable GUI widget with a particular event type (e.g.,
click, long-click, swipe, edit), so a widget can be associated
with one or more events. In our setting, a state $s$ represents
an app page (i.e., the set of widgets shown on the current
screen; if the set of widgets differs, it is a different page).
We use $s_t$ to represent the current state and $s_{t+1}$ to
represent the next state. A reward $r$ is calculated based on
the improvement of coverage. If code coverage increases, $r$
is assigned a positive number (r=5 by default); otherwise, $r$
is assigned a negative number (r=-2 by default). An Agent