Many-Objective Reinforcement Learning for Online
Testing of DNN-Enabled Systems
Fitash Ul Haq
University of Luxembourg
Luxembourg
fitash.ulhaq@uni.lu
Donghwan Shin
University of Sheffield
Sheffield, United Kingdom
d.shin@sheffield.ac.uk
Lionel C. Briand
University of Luxembourg
Luxembourg
University of Ottawa
Ottawa, Canada
lionel.briand@uni.lu
Abstract—Deep Neural Networks (DNNs) have been widely
used to perform real-world tasks in cyber-physical systems such
as Autonomous Driving Systems (ADS). Ensuring the correct
behavior of such DNN-Enabled Systems (DES) is a crucial topic.
Online testing is one of the promising modes for testing such
systems with their application environments (simulated or real)
in a closed loop, taking into account the continuous interaction
between the systems and their environments. However, the
environmental variables (e.g., lighting conditions) that might
change during the systems’ operation in the real world, causing
the DES to violate requirements (safety, functional), are often
kept constant during the execution of an online test scenario due to two major challenges: (1) the space of all possible scenarios to explore would become even larger if they changed, and (2) there are typically many requirements to test simultaneously.
In this paper, we present MORLOT (Many-Objective Rein-
forcement Learning for Online Testing), a novel online testing ap-
proach to address these challenges by combining Reinforcement
Learning (RL) and many-objective search. MORLOT leverages
RL to incrementally generate sequences of environmental changes
while relying on many-objective search to determine the changes
so that they are more likely to achieve any of the uncovered
objectives. We empirically evaluate MORLOT using CARLA,
a high-fidelity simulator widely used for autonomous driving
research, integrated with Transfuser, a DNN-enabled ADS for
end-to-end driving. The evaluation results show that MORLOT
is significantly more effective and efficient than alternatives with a
large effect size. In other words, MORLOT is a good option to test
DES with dynamically changing environments while accounting
for multiple safety requirements.
Index Terms—DNN Testing, Reinforcement learning, Many-objective search, Self-driving cars, Online testing
I. INTRODUCTION
Deep Neural Networks (DNNs) have been widely used
to perform various tasks, such as object classification [1],
speech recognition [2] and object detection [3]. With the recent
advances in Deep Learning (DL), DNNs are increasingly
applied to safety-critical software systems, such as Automated
Driving Systems (ADS) that drive a vehicle with the aim
of preventing safety and functional violations (e.g., colliding with other vehicles). Therefore, ensuring the correct behavior of such DNN-Enabled Systems (DES) has emerged as a fundamental problem of software testing.

This work has been carried out as part of the COSMOS Project, which has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No. 957254. This work was also supported by NSERC of Canada under the Discovery and CRC programs. Part of this work was done while the author was affiliated with the University of Luxembourg, Luxembourg.
Online testing [4] is one of the promising modes of DES
testing that accounts for the closed-loop interaction between
the DES under test and its environment. In online testing,
the DES under test is embedded into the application envi-
ronment (e.g., a simulated driving environment for ADS) and
is monitored to detect the violations of safety and functional
requirements during the closed-loop interaction. Such interac-
tion helps account for the accumulation of (possibly small)
errors over time, eventually leading to requirements violations
that often cannot be detected by offline testing [5]. As a
result, online testing has been actively applied in many testing
approaches recently [6, 7, 8, 9, 10].
However, existing online testing approaches for DES exhibit
at least one of two critical limitations. First, they do not
account for the fact that there are often many safety and
functional requirements, possibly independent of each other,
that must be considered together in practice. Though one
could simply repeat an existing test approach for individual
requirements, it is inefficient due to its inability to dynamically
distribute the test budget (e.g., time) over many requirements
according to the feasibility of requirements violations. For
example, if one of the requirements cannot be violated, the
pre-assigned budget for this requirement would simply be
wasted. Furthermore, dividing the limited test budget across
many requirements may result in too small a budget for
testing individual requirements thoroughly. Second, they do
not vary dynamic environmental elements, such as neighboring
vehicles and weather conditions, during test case (i.e., test
scenario) execution. For example, certain weather conditions
(e.g., sunny) remain the same throughout a test scenario,
whereas in reality they may change over time, which can
trigger requirements violations. This is mainly because the
number of possible test scenarios increases exponentially when
considering individual dynamic elements’ changes (i.e., time
series). However, not accounting for such dynamic environ-
ments could significantly limit test effectiveness by limiting
the scope of the test scenarios being considered.
Contributions. To overcome the above limitations, we
present MORLOT (Many-Objective Reinforcement Learning
for Online Testing), a novel online testing approach for DES.
MORLOT leverages two distinct approaches: (1) Reinforce-
ment Learning (RL) [11] to generate the sequences of changes
to the dynamic elements of the environment with the aim
of causing requirements violations, and (2) many-objective
search [12, 13] to efficiently satisfy as many independent
requirements as possible within a limited testing budget.
The combination of both approaches works as follows: (1)
RL incrementally generates the sequence of changes for the
dynamic elements of the environment, and (2) those changes
are determined by many-objective search such that they are
more likely to achieve any of the uncovered objectives (i.e., the requirements not yet violated). In other words, changes in the dynamic elements tend to be driven by the objectives closest to being satisfied. For example, if there are three objectives $o_1$, $o_2$, and $o_3$ where $o_2$ is closest to being satisfied, MORLOT incrementally appends those changes that help in achieving $o_2$ to the sequence. Furthermore, by keeping a record of uncovered objectives, as state-of-the-art many-objective search algorithms for test suite generation do, MORLOT focuses on the uncovered ones throughout the search process.
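To make this combination more concrete, the following minimal Python sketch illustrates one plausible reading of the loop just described; all identifiers (env, q_agent, fitness, executed_changes, etc.) are hypothetical, the fitness convention (0 or below means the objective is achieved) is an assumption for illustration, and the actual MORLOT algorithm is presented in Section V.

```python
# Illustrative sketch only: a simplified view of how RL-driven environment changes
# could be steered by the uncovered objective closest to being satisfied.
# All interfaces are hypothetical placeholders, not the actual MORLOT implementation.

def morlot_episode(env, q_agent, objectives, archive, max_steps):
    """Run one simulation episode, appending one environment change per time step."""
    state = env.reset()
    uncovered = [o for o in objectives if o not in archive]
    for _ in range(max_steps):
        if not uncovered:
            break
        # Target the uncovered objective whose fitness is closest to violation.
        target = min(uncovered, key=lambda o: o.fitness(state))
        # RL picks the next change to the dynamic elements, rewarded by how much
        # it decreases the targeted objective's fitness.
        change = q_agent.select_action(state, target)
        next_state = env.apply(change)
        reward = target.fitness(state) - target.fitness(next_state)
        q_agent.update(state, change, reward, next_state)
        # Record any objective achieved (i.e., requirement violated) at this step.
        for o in list(uncovered):
            if o.fitness(next_state) <= 0:
                archive[o] = env.executed_changes()
                uncovered.remove(o)
        state = next_state
```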
Though MORLOT can be applied to any DES interacting
with an environment including dynamically changeable ele-
ments, it is evaluated on DNN-enabled Autonomous Driving Systems (DADS). Specifically, we use Transfuser [14], the highest-ranked publicly available DADS in the CARLA Autonomous Driving Leaderboard [15] at the time of our evaluation, and CARLA [16], a high-fidelity simulator
that has been widely used for training and validating DADS.
Our evaluation results, involving more than 600 computing
hours, show that MORLOT is significantly more effective and
efficient at finding safety and functional violations than state-
of-the-art, many-objective search-based testing approaches tai-
lored for test suite generation [12, 13] and random search.
Our contributions can be summarized as follows:
- MORLOT, a novel approach that efficiently generates com-
plex test scenarios, including sequential changes to the
dynamic elements of the environment of the DES under test,
by leveraging both RL and many-objective search;
- An empirical evaluation of MORLOT in terms of test effec-
tiveness and efficiency and comparison with alternatives;
- A publicly available replication package, including the im-
plementation of MORLOT and instructions to set up our
case study.
Significance. For DES that continuously interact with their
operational environments and have many safety and func-
tional requirements to be tested, performing online testing
efficiently to identify as many requirements violations as
possible, without arbitrarily limiting the space of test scenar-
ios, is essential. However, taking into account the sequential
changes of dynamic elements of an environment over time is
extremely challenging since it renders the test scenario space
exponentially larger as test scenario time increases. MORLOT
addresses the problem by leveraging and carefully combining
RL and many-objective search, thus providing an important
and novel contribution towards scalable online testing of real-
world DES in practice.
Paper Structure. The rest of the paper is structured as
follows. Section II provides a brief background on RL and
many-objective search. Section III formalizes the problem of
DES online testing with a dynamically changing environment.
Section IV discusses and contrasts related work. Section V
describes our proposed approach, starting from a generic RL-
based test generation approach and then MORLOT, a many-
objective reinforcement learning approach for online testing.
Section VI evaluates the test effectiveness and efficiency of
MORLOT using an open-source DNN-based ADS with a high-
fidelity driving simulator. Section VII concludes the paper.
Section VIII provides details on the replication package.
II. BACKGROUND
A. Reinforcement Learning
Reinforcement Learning (RL) is about learning how to perform a sequence of actions to achieve a goal, iterating through trial and error to learn the best action for a given state [11].
In particular, RL involves the interaction between an RL
agent (i.e., the learner) and its surrounding environment, which
is formalized by a Markov Decision Process (MDP). At each
time step $j$, the RL agent observes the environment's state $s_j$ and takes an action $a_j$ based on its own policy $\pi$ (i.e., the mapping between states and actions). At the next time step $j+1$, the agent first gets a reward $w_{j+1}$ indicating how well taking $a_j$ in $s_j$ helped in achieving the goal, updates $\pi$ based on $w_{j+1}$, and then continues to interact with its environment. From these interactions (trial and error), the agent is expected to learn the unknown optimal policy $\pi^*$ that can select the best action maximizing the expected sum of future rewards in any state.
An important assumption underlying MDPs is that states satisfy the Markov property: a state captures information about all aspects of the past agent–environment interactions that make a difference for the future [11]. In other words, $w_{j+1}$ and $s_{j+1}$ depend only on $s_j$ and $a_j$, and are independent of the previous states $s_{j-1}, s_{j-2}, \dots, s_1$ and actions $a_{j-1}, a_{j-2}, \dots, a_1$. This assumption allows the RL agent to take an action by considering only the current state, as opposed to all past states (and actions).
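Stated formally (a standard formulation, not spelled out as an explicit equation in the text), the Markov property amounts to:

\[ P(s_{j+1}, w_{j+1} \mid s_j, a_j, s_{j-1}, a_{j-1}, \dots, s_1, a_1) = P(s_{j+1}, w_{j+1} \mid s_j, a_j) \]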
In general, there are two types of RL methods: (1) tabu-
lar-based and (2) approximation-based. Tabular-based meth-
ods [17, 18] use tables or arrays to store the expected sum
of future rewards for each state. Though they are applicable
only if state and action spaces are small enough to be
represented in tables or arrays, they can often find exactly
the optimal policy [11]. Discretization can be used to control
the size of state and action spaces, especially when states
and actions are continuous. It is essential to apply the right
degree of discretization since coarse-grained discretization
may make it impossible for the agent to distinguish between
states that require different actions, resulting in significant
loss of information. When the state space is enormous and
cannot be easily discretized without significant information
loss, approximation-based methods [19] can be used where
the expected sum of future rewards for a newly discovered
state can be approximated based on known states (often with
the help of state abstraction when they are too complex to
directly compare). While they can address complex problems
in very large state spaces, they can only provide approximate
solutions.
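As a small illustration of the discretization mentioned above, a continuous observation can be mapped to a coarse discrete state usable as a table key. The variables and bin widths below are arbitrary examples chosen for illustration, not the discretization used by MORLOT.

```python
# Illustrative only: bucket a continuous observation into a small discrete state.
# The chosen variables and bin widths are hypothetical examples.

def discretize(distance_m: float, relative_speed_mps: float) -> tuple[int, int]:
    """Bucket distance to the vehicle in front (5 m bins, capped at 50 m)
    and relative speed (2 m/s bins, capped at +/-10 m/s)."""
    dist_bin = min(int(distance_m // 5), 10)
    speed_bin = max(-5, min(5, int(relative_speed_mps // 2)))
    return dist_bin, speed_bin
```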
One of the most commonly used tabular-based reinforce-
ment learning algorithms is Q-learning due to its simplicity
and guaranteed convergence to an optimal policy [17]. It stores
and iteratively updates the expected sum of future rewards
for each state-action pair in a table (a.k.a. Q-table) through trial and error. A properly updated Q-table therefore indicates the best action to choose in a given state. A more detailed explanation, including how to update a Q-table to ensure convergence, can be found in Sutton and Barto [11].
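For reference, a minimal tabular Q-learning sketch of the kind summarized above is shown below; the hyper-parameter values and the epsilon-greedy action selection are placeholder choices for illustration, not the settings used later in our experiments.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch; ALPHA, GAMMA, and EPSILON are placeholders.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
q_table = defaultdict(float)  # maps (state, action) to the expected sum of future rewards

def select_action(state, actions):
    """Epsilon-greedy choice over the Q-table."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def update(state, action, reward, next_state, actions):
    """Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] += ALPHA * (reward + GAMMA * best_next - q_table[(state, action)])
```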
B. Many-Objective Search
Many-objective search [20, 21] refers to simultaneously optimizing many objectives, typically more than four, using meta-heuristic search algorithms, such as MOEA/D [22] and
NSGA-III [23]. With the help of fitness functions that quantify
the goodness of candidate solutions in terms of individual ob-
jectives, the algorithms can deliver a set of solutions satisfying
as many objectives as possible.
In software testing, many-objective search has been applied
to solve testing problems with many test requirements. The
idea is to recast such testing problems into optimization
problems by carefully defining fitness functions for all ob-
jectives. However, since the number of objectives (i.e., test requirements) is often much larger than four (e.g., covering all branches in a program), researchers have developed algorithms tailored to testing. MOSA [12] is a well-known algorithm dedicated to test suite generation with many objectives. It uses an evolutionary algorithm to achieve each objective individually, steering the search towards uncovered objectives and keeping an archive of the best test cases achieving the objectives.
By defining fitness functions to measure the likelihood of
covering individual branches in a program, MOSA can ef-
fectively and efficiently generate a test suite covering as many
branches as possible [12]. FITEST [13] is another state-of-the-
art algorithm that extends MOSA by decreasing the population size as the number of uncovered objectives decreases, thereby improving efficiency; in contrast, MOSA maintains the same population size during the entire search.
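The archive idea underlying both algorithms, keeping for each objective the best test case found so far and no longer searching for objectives already covered, can be illustrated by the following simplified sketch; the fitness convention (0 means covered) and the per-objective bookkeeping are assumptions made for this example rather than the exact MOSA or FITEST procedure.

```python
# Simplified sketch of MOSA/FITEST-style archive keeping. The fitness convention
# (0 = objective covered) is an assumption made for illustration.

def update_archive(archive, uncovered, population, fitness):
    """archive: objective -> (best fitness, test case); uncovered: set of objectives."""
    for test_case in population:
        for objective in list(uncovered):
            score = fitness(objective, test_case)
            if objective not in archive or score < archive[objective][0]:
                archive[objective] = (score, test_case)
            if score == 0:  # objective covered: stop searching for it
                uncovered.discard(objective)
    return archive, uncovered
```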
III. PROBLEM DESCRIPTION
In this section, we provide a precise problem description
regarding the automated online testing of DNN-enabled sys-
tems (DES) by dynamically changing their environment during
simulation. As a working example, we use a DNN-enabled
autonomous driving system (DADS) to illustrate our main
points, but the description can be generalized to any DES.
In online testing, a DADS under test is embedded into and
interacts with its driving environment. However, because of the
risks and costs it entails, online testing is usually performed
with a simulator rather than on-road vehicle testing. Using a
simulator enables the control of the driving environment, such
as weather conditions, lighting conditions, and the behavior
of other actors (e.g., vehicles on the road and pedestrians).
During simulation, the DADS continuously interacts with
the environment by observing the environment via the ego
vehicle’s sensors (e.g., camera and LIDAR) and driving the
ego vehicle through commands (e.g., steering, throttle, and
braking). Due to the closed-loop interaction between the
DADS and its environment, simulation is an effective in-
strument to check if any requirements violation can occur
under realistic conditions. Notice that such violations can be
triggered by dynamically changing the driving environment
during simulation; for example, certain changes to the speed
of the vehicle in front, which can often occur in practice due
to impaired driving or sudden stops, can cause a collision. The
goal of DADS online testing, based on dynamically changing
the environment, is to find a minimal set of test cases, each
of them changing the environment in a different way, to cause
the DADS to violate as many requirements as possible.
Specifically, let $d$ be the DADS under test on board the ego vehicle and $E = (X, K)$ be the environment, where $X = \{x_1, x_2, \dots\}$ is a set of dynamic elements of the simulation (e.g., actors other than the ego vehicle, and the environment conditions) and $K$ is a set of static elements (e.g., roads, buildings, and trees). Each dynamic element $x \in X$ can be further decomposed into a sequence $x = \langle x_1, x_2, \dots, x_J \rangle$, where $J$ is the duration of the simulation and $x_j$ is the value of $x$ at time $j \in \{1, 2, \dots, J\}$ (e.g., the position, speed, and acceleration of the vehicle in front at time $j$). Based on that, we can define $X_j = \{x_j \mid x \in X\}$ as capturing the set of values of all the dynamic elements in $X$ at time $j$, and $E_j = (X_j, K)$ as indicating the set of values of all environmental elements (both dynamic and static) in $E$ at time $j$. Let a test case $t = \langle X_1, \dots, X_J \rangle$ be a sequence of values of the dynamic elements for $J$ time steps. By running $d$ in $E$ with $t$, using a simulator, at each time step $j \in \{1, 2, \dots, J\}$, $d$ observes the snapshot of $E_j = (X_j, K)$ using the sensors (e.g., camera and LIDAR) and generates driving commands $C_j$ (e.g., throttle, steering, and braking) for the ego vehicle. Note that what $d$ observes from the same $E$ varies depending on the ego vehicle's dynamics (e.g., position, speed, and acceleration) computed by $C$; for example, $d$ observes the relative distance between the ego vehicle and the vehicle in front, which naturally varies depending on the position of the ego vehicle. For the next time step $j+1$, the simulator computes $E_{j+1} = (X_{j+1}, K)$ according to $t$ and generates what $d$ observes from $E_{j+1}$ by taking into account $C_j$.
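The closed-loop execution of a test case $t$ just described can be summarized by the following conceptual sketch; the simulator and DADS interfaces are hypothetical placeholders (CARLA's actual API differs), and the sketch only mirrors the interaction protocol defined above.

```python
# Conceptual sketch of executing a test case t = <X_1, ..., X_J> in a simulator.
# The simulator/DADS interfaces are hypothetical placeholders, not CARLA's API.

def run_test_case(dads, simulator, static_elements, t):
    """Drive the ego vehicle for J steps while applying the dynamic-element values in t."""
    simulator.initialize(static_elements)
    commands = None
    trace = []
    for x_j in t:
        # The simulator computes E_j = (X_j, K) from the test case and
        # the ego vehicle's previous commands.
        observations = simulator.step(dynamic_values=x_j, ego_commands=commands)
        # The DADS d observes E_j through the ego vehicle's sensors and produces C_j.
        commands = dads.drive(observations)
        trace.append((observations, commands))
    return trace  # monitored afterwards to compute the violation degrees v(r, d, E, t, j)
```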
Given a set of requirements (safety, functional) $R$, the degree of violation of a requirement $r \in R$ produced by $d$ in $E$ for $t$ at any time step $j \in \{1, \dots, J\}$, denoted by $v(r, d, E, t, j)$, can be measured by monitoring the simulation of $d$ in $E$ for $t$. For example, the distance between the ego vehicle and the vehicle in front can be used to measure how close they are to colliding. If $v(r, d, E, t, j)$ is greater than