Many-Objective Reinforcement Learning for Online
Testing of DNN-Enabled Systems
Fitash Ul Haq
University of Luxembourg
Luxembourg
fitash.ulhaq@uni.lu
Donghwan Shin
University of Sheffield
Sheffield, United Kingdom
d.shin@sheffield.ac.uk
Lionel C. Briand
University of Luxembourg
Luxembourg
University of Ottawa
Ottawa, Canada
lionel.briand@uni.lu
Abstract—Deep Neural Networks (DNNs) have been widely
used to perform real-world tasks in cyber-physical systems such
as Autonomous Driving Systems (ADS). Ensuring the correct
behavior of such DNN-Enabled Systems (DES) is a crucial topic.
Online testing is one of the promising modes for testing such
systems with their application environments (simulated or real)
in a closed loop, taking into account the continuous interaction
between the systems and their environments. However, the
environmental variables (e.g., lighting conditions) that might
change during the systems’ operation in the real world, causing
the DES to violate requirements (safety, functional), are often
kept constant during the execution of an online test scenario due to two major challenges: (1) the space of all possible scenarios to explore would become even larger if they changed, and (2) there are typically many requirements to test simultaneously.
In this paper, we present MORLOT (Many-Objective Rein-
forcement Learning for Online Testing), a novel online testing ap-
proach to address these challenges by combining Reinforcement
Learning (RL) and many-objective search. MORLOT leverages
RL to incrementally generate sequences of environmental changes
while relying on many-objective search to determine the changes
so that they are more likely to achieve any of the uncovered
objectives. We empirically evaluate MORLOT using CARLA,
a high-fidelity simulator widely used for autonomous driving
research, integrated with Transfuser, a DNN-enabled ADS for
end-to-end driving. The evaluation results show that MORLOT
is significantly more effective and efficient than alternatives with a
large effect size. In other words, MORLOT is a good option to test
DES with dynamically changing environments while accounting
for multiple safety requirements.
Index Terms—DNN Testing, Reinforcement learning, Many-objective search, Self-driving cars, Online testing
I. INTRODUCTION
Deep Neural Networks (DNNs) have been widely used
to perform various tasks, such as object classification [1],
speech recognition [2] and object detection [3]. With the recent
advances in Deep Learning (DL), DNNs are increasingly
applied to safety-critical software systems, such as Automated
Driving Systems (ADS) that drive a vehicle with the aim
of preventing safety and functional violations (e.g., colliding with other vehicles). Therefore, ensuring the correct behavior of such DNN-Enabled Systems (DES) has emerged as a fundamental problem of software testing.

This work has been carried out as part of the COSMOS Project, which has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No. 957254. This work was also supported by NSERC of Canada under the Discovery and CRC programs. Part of this work was done while the author was affiliated with the University of Luxembourg, Luxembourg.
Online testing [4] is one of the promising modes of DES
testing that accounts for the closed-loop interaction between
the DES under test and its environment. In online testing,
the DES under test is embedded into the application envi-
ronment (e.g., a simulated driving environment for ADS) and
is monitored to detect the violations of safety and functional
requirements during the closed-loop interaction. Such interac-
tion helps account for the accumulation of (possibly small)
errors over time, eventually leading to requirements violations
that often cannot be detected by offline testing [5]. As a
result, online testing has been actively applied in many testing
approaches recently [6, 7, 8, 9, 10].
However, existing online testing approaches for DES exhibit
at least one of two critical limitations. First, they do not
account for the fact that there are often many safety and
functional requirements, possibly independent of each other,
that must be considered together in practice. Though one
could simply repeat an existing test approach for individual
requirements, it is inefficient due to its inability to dynamically
distribute the test budget (e.g., time) over many requirements
according to the feasibility of requirements violations. For
example, if one of the requirements cannot be violated, the
pre-assigned budget for this requirement would simply be
wasted. Furthermore, dividing the limited test budget across
many requirements may result in too small a budget for
testing individual requirements thoroughly. Second, they do
not vary dynamic environmental elements, such as neighboring
vehicles and weather conditions, during test case (i.e., test
scenario) execution. For example, certain weather conditions
(e.g., sunny) remain the same throughout a test scenario,
whereas in reality they may change over time, which can
trigger requirements violations. This is mainly because the
number of possible test scenarios increases exponentially when
considering individual dynamic elements’ changes (i.e., time
series). However, not accounting for such dynamic environ-
ments could significantly limit test effectiveness by limiting
the scope of the test scenarios being considered.
Contributions. To overcome the above limitations, we
present MORLOT (Many-Objective Reinforcement Learning
for Online Testing), a novel online testing approach for DES.
MORLOT leverages two distinct approaches: (1) Reinforce-
ment Learning (RL) [11] to generate the sequences of changes
to the dynamic elements of the environment with the aim
of causing requirements violations, and (2) many-objective
search [12, 13] to efficiently satisfy as many independent
requirements as possible within a limited testing budget.
The combination of both approaches works as follows: (1)
RL incrementally generates the sequence of changes for the
dynamic elements of the environment, and (2) those changes
are determined by many-objective search such that they are
more likely to achieve any of the uncovered objectives (i.e., the requirements not yet violated). In other words, changes in the dynamic elements tend to be driven by the objectives closest to being satisfied. For example, if there are three objectives $o_1$, $o_2$, and $o_3$ where $o_2$ is closest to being satisfied, MORLOT incrementally appends those changes that help in achieving $o_2$ to the sequence. Furthermore, by keeping a record of uncovered objectives, as state-of-the-art many-objective search algorithms for test suite generation do, MORLOT focuses on the uncovered ones throughout the search process.
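To make this combination more concrete, the following minimal Python sketch illustrates one plausible reading of the loop just described; all identifiers (env, q_agent, fitness, executed_changes, etc.) are hypothetical, the fitness convention (0 or below means the objective is achieved) is an assumption for illustration, and the actual MORLOT algorithm is presented in Section V.

```python
# Illustrative sketch only: a simplified view of how RL-driven environment changes
# could be steered by the uncovered objective closest to being satisfied.
# All interfaces are hypothetical placeholders, not the actual MORLOT implementation.

def morlot_episode(env, q_agent, objectives, archive, max_steps):
    """Run one simulation episode, appending one environment change per time step."""
    state = env.reset()
    uncovered = [o for o in objectives if o not in archive]
    for _ in range(max_steps):
        if not uncovered:
            break
        # Target the uncovered objective whose fitness is closest to violation.
        target = min(uncovered, key=lambda o: o.fitness(state))
        # RL picks the next change to the dynamic elements, rewarded by how much
        # it decreases the targeted objective's fitness.
        change = q_agent.select_action(state, target)
        next_state = env.apply(change)
        reward = target.fitness(state) - target.fitness(next_state)
        q_agent.update(state, change, reward, next_state)
        # Record any objective achieved (i.e., requirement violated) at this step.
        for o in list(uncovered):
            if o.fitness(next_state) <= 0:
                archive[o] = env.executed_changes()
                uncovered.remove(o)
        state = next_state
```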
Though MORLOT can be applied to any DES interacting
with an environment including dynamically changeable ele-
ments, it is evaluated on DNN-enabled Autonomous Driving Systems (DADS). Specifically, we use Transfuser [14], the highest-ranked publicly available DADS in the CARLA Autonomous Driving Leaderboard [15] at the time of our evaluation, and CARLA [16], a high-fidelity simulator
that has been widely used for training and validating DADS.
Our evaluation results, involving more than 600 computing
hours, show that MORLOT is significantly more effective and
efficient at finding safety and functional violations than state-
of-the-art, many-objective search-based testing approaches tai-
lored for test suite generation [12, 13] and random search.
Our contributions can be summarized as follows:
- MORLOT, a novel approach that efficiently generates com-
plex test scenarios, including sequential changes to the
dynamic elements of the environment of the DES under test,
by leveraging both RL and many-objective search;
- An empirical evaluation of MORLOT in terms of test effec-
tiveness and efficiency and comparison with alternatives;
- A publicly available replication package, including the im-
plementation of MORLOT and instructions to set up our
case study.
Significance. For DES that continuously interact with their
operational environments and have many safety and func-
tional requirements to be tested, performing online testing
efficiently to identify as many requirements violations as
possible, without arbitrarily limiting the space of test scenar-
ios, is essential. However, taking into account the sequential
changes of dynamic elements of an environment over time is
extremely challenging since it renders the test scenario space
exponentially larger as test scenario time increases. MORLOT
addresses the problem by leveraging and carefully combining
RL and many-objective search, thus providing an important
and novel contribution towards scalable online testing of real-
world DES in practice.
Paper Structure. The rest of the paper is structured as
follows. Section II provides a brief background on RL and
many-objective search. Section III formalizes the problem of
DES online testing with a dynamically changing environment.
Section IV discusses and contrasts related work. Section V
describes our proposed approach, starting from a generic RL-
based test generation approach and then MORLOT, a many-
objective reinforcement learning approach for online testing.
Section VI evaluates the test effectiveness and efficiency of
MORLOT using an open-source DNN-based ADS with a high-
fidelity driving simulator. Section VII concludes the paper.
Section VIII provides details on the replication package.
II. BACKGROUND
A. Reinforcement Learning
Reinforcement Learning (RL) is about learning how to perform a sequence of actions to achieve a goal, iterating through trial and error to learn the best action for a given state [11].
In particular, RL involves the interaction between an RL
agent (i.e., the learner) and its surrounding environment, which
is formalized by a Markov Decision Process (MDP). At each
time step $j$, the RL agent observes the environment's state $s_j$ and takes an action $a_j$ based on its own policy $\pi$ (i.e., the mapping between states and actions). At the next time step $j+1$, the agent first gets a reward $w_{j+1}$ indicating how well taking $a_j$ in $s_j$ helped in achieving the goal, updates $\pi$ based on $w_{j+1}$, and then continues to interact with its environment. From these interactions (trial and error), the agent is expected to learn the unknown optimal policy $\pi^*$ that can select the best action maximizing the expected sum of future rewards in any state.
An important assumption underlying MDPs is that states satisfy the Markov property: a state captures information about all aspects of the past agent–environment interactions that make a difference for the future [11]. In other words, $w_{j+1}$ and $s_{j+1}$ depend only on $s_j$ and $a_j$, and are independent of the previous states $s_{j-1}, s_{j-2}, \dots, s_1$ and actions $a_{j-1}, a_{j-2}, \dots, a_1$. This assumption allows the RL agent to take an action by considering only the current state, as opposed to all past states (and actions).
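Stated formally (a standard formulation, not spelled out as an explicit equation in the text), the Markov property amounts to:

\[ P(s_{j+1}, w_{j+1} \mid s_j, a_j, s_{j-1}, a_{j-1}, \dots, s_1, a_1) = P(s_{j+1}, w_{j+1} \mid s_j, a_j) \]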
In general, there are two types of RL methods: (1) tabu-
lar-based and (2) approximation-based. Tabular-based meth-
ods [17, 18] use tables or arrays to store the expected sum
of future rewards for each state. Though they are applicable
only if state and action spaces are small enough to be
represented in tables or arrays, they can often find exactly
the optimal policy [11]. Discretization can be used to control
the size of state and action spaces, especially when states
and actions are continuous. It is essential to apply the right
degree of discretization since coarse-grained discretization
may make it impossible for the agent to distinguish between
states that require different actions, resulting in significant
loss of information. When the state space is enormous and
cannot be easily discretized without significant information
loss, approximation-based methods [19] can be used where
the expected sum of future rewards for a newly discovered
state can be approximated based on known states (often with
the help of state abstraction when they are too complex to
directly compare). While they can address complex problems
in very large state spaces, they can only provide approximate
solutions.
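As a small illustration of the discretization mentioned above, a continuous observation can be mapped to a coarse discrete state usable as a table key. The variables and bin widths below are arbitrary examples chosen for illustration, not the discretization used by MORLOT.

```python
# Illustrative only: bucket a continuous observation into a small discrete state.
# The chosen variables and bin widths are hypothetical examples.

def discretize(distance_m: float, relative_speed_mps: float) -> tuple[int, int]:
    """Bucket distance to the vehicle in front (5 m bins, capped at 50 m)
    and relative speed (2 m/s bins, capped at +/-10 m/s)."""
    dist_bin = min(int(distance_m // 5), 10)
    speed_bin = max(-5, min(5, int(relative_speed_mps // 2)))
    return dist_bin, speed_bin
```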
One of the most commonly used tabular-based reinforce-
ment learning algorithms is Q-learning due to its simplicity
and guaranteed convergence to an optimal policy [17]. It stores
and iteratively updates the expected sum of future rewards
for each state-action pair in a table (a.k.a. Q-table) through trial and error. A properly updated Q-table therefore indicates the best action to choose in a given state. A more detailed explanation, including how to update a Q-table to ensure convergence, can be found in Sutton and Barto [11].
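For reference, a minimal tabular Q-learning sketch of the kind summarized above is shown below; the hyper-parameter values and the epsilon-greedy action selection are placeholder choices for illustration, not the settings used later in our experiments.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch; ALPHA, GAMMA, and EPSILON are placeholders.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
q_table = defaultdict(float)  # maps (state, action) to the expected sum of future rewards

def select_action(state, actions):
    """Epsilon-greedy choice over the Q-table."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def update(state, action, reward, next_state, actions):
    """Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] += ALPHA * (reward + GAMMA * best_next - q_table[(state, action)])
```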
B. Many-Objective Search
Many-objective search [20, 21] refers to simultaneously optimizing many objectives, typically more than four, using meta-heuristic search algorithms, such as MOEA/D [22] and
NSGA-III [23]. With the help of fitness functions that quantify
the goodness of candidate solutions in terms of individual ob-
jectives, the algorithms can deliver a set of solutions satisfying
as many objectives as possible.
In software testing, many-objective search has been applied
to solve testing problems with many test requirements. The
idea is to recast such testing problems into optimization
problems by carefully defining fitness functions for all ob-
jectives. However, since the number of objectives (i.e., test requirements) is often much larger than four (e.g., covering all branches in a program), researchers have developed algorithms tailored to testing. MOSA [12] is a well-known algorithm dedicated to test suite generation with many objectives. It uses an evolutionary algorithm to achieve each objective individually, steering the search towards uncovered objectives and keeping an archive of the best test cases achieving the objectives.
By defining fitness functions to measure the likelihood of
covering individual branches in a program, MOSA can ef-
fectively and efficiently generate a test suite covering as many
branches as possible [12]. FITEST [13] is another state-of-the-
art algorithm that extends MOSA by decreasing the population size as the number of uncovered objectives decreases, thereby improving efficiency; in contrast, MOSA maintains the same population size during the entire search.
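The archive idea underlying both algorithms, keeping for each objective the best test case found so far and no longer searching for objectives already covered, can be illustrated by the following simplified sketch; the fitness convention (0 means covered) and the per-objective bookkeeping are assumptions made for this example rather than the exact MOSA or FITEST procedure.

```python
# Simplified sketch of MOSA/FITEST-style archive keeping. The fitness convention
# (0 = objective covered) is an assumption made for illustration.

def update_archive(archive, uncovered, population, fitness):
    """archive: objective -> (best fitness, test case); uncovered: set of objectives."""
    for test_case in population:
        for objective in list(uncovered):
            score = fitness(objective, test_case)
            if objective not in archive or score < archive[objective][0]:
                archive[objective] = (score, test_case)
            if score == 0:  # objective covered: stop searching for it
                uncovered.discard(objective)
    return archive, uncovered
```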
III. PROBLEM DESCRIPTION
In this section, we provide a precise problem description
regarding the automated online testing of DNN-enabled sys-
tems (DES) by dynamically changing their environment during
simulation. As a working example, we use a DNN-enabled
autonomous driving system (DADS) to illustrate our main
points, but the description can be generalized to any DES.
In online testing, a DADS under test is embedded into and
interacts with its driving environment. However, because of the
risks and costs it entails, online testing is usually performed
with a simulator rather than on-road vehicle testing. Using a
simulator enables the control of the driving environment, such
as weather conditions, lighting conditions, and the behavior
of other actors (e.g., vehicles on the road and pedestrians).
During simulation, the DADS continuously interacts with
the environment by observing the environment via the ego
vehicle’s sensors (e.g., camera and LIDAR) and driving the
ego vehicle through commands (e.g., steering, throttle, and
braking). Due to the closed-loop interaction between the
DADS and its environment, simulation is an effective in-
strument to check if any requirements violation can occur
under realistic conditions. Notice that such violations can be
triggered by dynamically changing the driving environment
during simulation; for example, certain changes to the speed
of the vehicle in front, which can often occur in practice due
to impaired driving or sudden stops, can cause a collision. The
goal of DADS online testing, based on dynamically changing
the environment, is to find a minimal set of test cases, each
of them changing the environment in a different way, to cause
the DADS to violate as many requirements as possible.
Specifically, let $d$ be the DADS under test on board the ego vehicle and $E = (X, K)$ be the environment, where $X = \{x_1, x_2, \dots\}$ is a set of dynamic elements of the simulation (e.g., actors other than the ego vehicle, and the environment conditions) and $K$ is a set of static elements (e.g., roads, buildings, and trees). Each dynamic element $x \in X$ can be further decomposed into a sequence $x = \langle x_1, x_2, \dots, x_J \rangle$, where $J$ is the duration of the simulation and $x_j$ is the value of $x$ at time $j \in \{1, 2, \dots, J\}$ (e.g., the position, speed, and acceleration of the vehicle in front at time $j$). Based on that, we can define $X_j = \{x_j \mid x \in X\}$ as capturing the set of values of all the dynamic elements in $X$ at time $j$, and $E_j = (X_j, K)$ as indicating the set of values of all environmental elements (both dynamic and static) in $E$ at time $j$. Let a test case $t = \langle X_1, \dots, X_J \rangle$ be a sequence of values of the dynamic elements for $J$ time steps. By running $d$ in $E$ with $t$, using a simulator, at each time step $j \in \{1, 2, \dots, J\}$, $d$ observes the snapshot of $E_j = (X_j, K)$ using the sensors (e.g., camera and LIDAR) and generates driving commands $C_j$ (e.g., throttle, steering, and braking) for the ego vehicle. Note that what $d$ observes from the same $E$ varies depending on the ego vehicle's dynamics (e.g., position, speed, and acceleration) computed by $C$; for example, $d$ observes the relative distance between the ego vehicle and the vehicle in front, which naturally varies depending on the position of the ego vehicle. For the next time step $j+1$, the simulator computes $E_{j+1} = (X_{j+1}, K)$ according to $t$ and generates what $d$ observes from $E_{j+1}$ by taking into account $C_j$.
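The closed-loop execution of a test case $t$ just described can be summarized by the following conceptual sketch; the simulator and DADS interfaces are hypothetical placeholders (CARLA's actual API differs), and the sketch only mirrors the interaction protocol defined above.

```python
# Conceptual sketch of executing a test case t = <X_1, ..., X_J> in a simulator.
# The simulator/DADS interfaces are hypothetical placeholders, not CARLA's API.

def run_test_case(dads, simulator, static_elements, t):
    """Drive the ego vehicle for J steps while applying the dynamic-element values in t."""
    simulator.initialize(static_elements)
    commands = None
    trace = []
    for x_j in t:
        # The simulator computes E_j = (X_j, K) from the test case and
        # the ego vehicle's previous commands.
        observations = simulator.step(dynamic_values=x_j, ego_commands=commands)
        # The DADS d observes E_j through the ego vehicle's sensors and produces C_j.
        commands = dads.drive(observations)
        trace.append((observations, commands))
    return trace  # monitored afterwards to compute the violation degrees v(r, d, E, t, j)
```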
Given a set of requirements (safety, functional) $R$, the degree of violation of a requirement $r \in R$ produced by $d$ in $E$ for $t$ at any time step $j \in \{1, \dots, J\}$, denoted by $v(r, d, E, t, j)$, can be measured by monitoring the simulation of $d$ in $E$ for $t$. For example, the distance between the ego vehicle and the vehicle in front can be used to measure how close they are to colliding. If $v(r, d, E, t, j)$ is greater than