Learning Control Policies for Stochastic Systems with Reach-avoid Guarantees
Đorđe Žikelić*1, Mathias Lechner*2, Thomas A. Henzinger1, Krishnendu Chatterjee1
1Institute of Science and Technology Austria (ISTA)
2Massachusetts Institute of Technology (MIT)
*Equal contribution
Correspondence to djordje.zikelic@ist.ac.at, mlechner@mit.edu
Abstract
We study the problem of learning controllers for discrete-
time non-linear stochastic dynamical systems with formal
reach-avoid guarantees. This work presents the first method
for providing formal reach-avoid guarantees, which combine
and generalize stability and safety guarantees, with a tolerable probability threshold $p \in [0,1]$ over the infinite time
horizon in general Lipschitz continuous systems. Our method leverages advances in the machine learning literature and represents formal certificates as neural networks. In particular, we
learn a certificate in the form of a reach-avoid supermartin-
gale (RASM), a novel notion that we introduce in this work.
Our RASMs provide reachability and avoidance guarantees
by imposing constraints on what can be viewed as a stochastic
extension of level sets of Lyapunov functions for deterministic
systems. Our approach solves several important problems – it
can be used to learn a control policy from scratch, to verify
a reach-avoid specification for a fixed control policy, or to
fine-tune a pre-trained policy if it does not satisfy the reach-
avoid specification. We validate our approach on
3
stochastic
non-linear reinforcement learning tasks.
Introduction
Reinforcement learning (RL) has achieved impressive results
in solving non-linear control problems, resulting in an inter-
est to deploy RL algorithms in safety-critical applications.
However, most RL algorithms focus solely on optimizing
expected performance and do not take safety constraints into
account [55]. This raises concerns about their applicability
to safety-critical domains in which unsafe behavior can lead
to catastrophic consequences [7, 23]. Complicating matters, models are usually imperfect approximations of real systems obtained from observed data, so they often need to account for uncertainty, which is modelled via stochastic disturbances. Formal safety verification of policies learned
via RL algorithms and design of learning algorithms that take
safety constraints into account have thus become very active
research topics.
Reach-avoid constraints are among the most common and practically relevant constraints appearing in safety-critical applications, and they generalize both reachability and safety constraints [53]. Given a target region and an unsafe region, the
reach-avoid constraint requires that a system controlled by a
policy converges to the target region while avoiding the un-
safe region. For instance, a lane-keeping constraint requires
a self-driving car to reach its destination without leaving the
allowed car lanes [60]. In the case of stochastic control prob-
lems, reach-avoid constraints are also specified by a minimal
probability with which the system controlled by a policy
needs to satisfy the reach-avoid constraint.
In this work, we consider discrete-time stochastic control
problems under reach-avoid constraints. Following the re-
cent trend that aims to leverage advances in deep RL for safe
control, we propose a learning method that learns a control
policy together with a formal reach-avoid certificate in the
form of a reach-avoid supermartingale (RASM), a novel no-
tion that we introduce in this work. Informally, an RASM
is a function assigning nonnegative real values to each state
that is required to strictly decrease in expected value until
the target region is reached, but needs to strictly increase for
the system to reach the unsafe region. By carefully choosing
the ratio of the initial level set of the RASM and the least
level set that the RASM needs to attain for the system to
reach the unsafe region (here we use the standard level set
terminology of Lyapunov functions [27]), we obtain a formal
reach-avoid certificate. The name of RASMs is chosen to
emphasize the connection to supermartingale processes in
probability theory [64]. Our RASMs significantly generalize
and unify stochastic control barrier functions, which are a standard certificate for safe control of stochastic systems [45], and ranking supermartingales, which were introduced to certify probability 1 reachability and stability in [38].
Contributions.
This work presents the first control
method that provides formal reach-avoid guarantees for con-
trol of stochastic systems with a specified probability thresh-
old over the infinite time horizon in Lipschitz continuous
systems. In contrast, the existing approaches to control under
reach-avoid constraints are only applicable to finite horizon
settings, polynomial stochastic systems or to deterministic
systems (see the following section for an overview of re-
lated work). Moreover, our method simultaneously learns
the control policy and the RASM certificate in the form of
neural networks and is applicable to general non-linear sys-
tems. This contrasts the existing methods from the literature
that are based on stochastic control barrier functions, which
utilize convex optimization tools to compute control policies
and are restricted to polynomial system dynamics and policies [45, 52, 49, 65]. Our algorithm draws insight from estab-
lished methods for learning Lyapunov functions for stability
in deterministic control problems [48, 15, 1], which were demonstrated to be more efficient than the existing convex optimization methods and were adapted in [38] for probability 1 reachability and stability verification. Finally, our
method learns a suitable policy on demand, or alternatively,
verifies reach-avoid properties of a fixed Lipschitz continuous
control policy. We experimentally validate our method on 3 stochastic RL tasks and show that it efficiently learns control
policies with probabilistic reach-avoid guarantees in practice.
Related Work
Deterministic control problems
There is extensive litera-
ture on safe control, with most works certifying stability via
Lyapunov functions [27] or safety via control barrier functions [6]. Most early works rely either on hand-designed
certificates, or automate their computation through convex
optimization methods such as sum-of-squares (SOS) pro-
gramming [29, 43, 30]. Automation via SOS programming is
restricted to problems with polynomial system dynamics and
does not scale well with dimension. A promising approach
to overcome these limitations is to learn a control policy to-
gether with a safety certificate in the form of neural networks,
for instance see [48, 54, 31, 14, 47]. In particular, [15, 1]
learn a control policy and a certificate as neural networks by
using a learner-verifier framework which repeatedly learns
a candidate policy and a certificate and then tries to either
verify or refine them. Our method extends some of these
ideas to stochastic systems.
Stochastic control problems
Safe control of stochastic sys-
tems has received comparatively less attention. Most exist-
ing approaches are abstraction based – they consider finite-
time horizon systems and approximate them via a finite-state
Markov decision process (MDP). The constrained control
problem is then solved for the MDP. Due to accumulation
of the approximation error in each time step, the size of the
MDP state space needs to grow with the length of the con-
sidered time horizon, making these methods applicable only to
systems that evolve over fixed finite time horizons. Notable
examples include [51, 35, 12, 62, 61, 20]. Another line of
work considers polynomial systems and utilizes stochastic
control barrier functions and convex optimization tools to
compute polynomial control policies [45, 52, 49, 65].
Constrained MDPs
Safe RL has also been studied in the
context of constrained MDPs (CMDPs) [5, 24]. An agent
in a CMDP must satisfy hard constraints on expected cost
for one or more auxiliary notions of cost aggregated over an
episode. Several works study RL algorithms for CMDPs [
58
],
notably the Constrained Policy Optimization (CPO) [
3
] or
the method [
18
] which proposed a Lyapunov method for
solving CMDPs. While these algorithms perform well, their
constraints are satisfied in expectation which makes them
less suitable for safety-critical systems.
Safe RL via shielding
Some approaches ensure safety by
computing two control policies – the main policy that opti-
mizes the expected reward, and the backup policy that the
system falls back to whenever a safety constraint may be
violated [41, 44, 4, 22, 25]. The backup policy can thus be
of simpler form. Shielding for stochastic linear systems with
additive disturbances has been considered in [63]. The approaches of [39, 8] are applicable to stochastic non-linear systems; however, their safety guarantees are statistical: their algorithms are randomized with parameters $\delta, \epsilon \in (0,1)$ and, with probability $1-\delta$, they compute an action that is safe in the current state with probability at least $1-\epsilon$. The statistical error is accumulated
at each state, hence these approaches are not suitable for infi-
nite or long time horizons. In contrast, our approach targets
formal guarantees for infinite time horizon problems.
Safe exploration
Model-free RL algorithms need to explore
the state space in order to learn high performing actions.
Safe exploration RL restricts exploration in a way which
ensures that given safety constraints are satisfied. The most
common approach to ensuring safe exploration is learning
the system dynamics’ uncertainty bounds and limiting the
exploratory actions within a high probability safety region,
with the existing methods based on Gaussian Processes [33, 57, 9], linearized models [21], deep robust regression [40], safe padding [28] and Bayesian neural networks [37]. Recent
work has also considered learning stable stochastic dynamics
from data [59, 36].
Probabilistic program analysis
Supermartingales have also
been used for the analysis of probabilistic programs (PPs).
In particular, ranking supermartingales (RSMs) were originally used to prove almost-sure termination in PPs [13], and [2] learns RSMs in PPs.
Supermartingales were also used for probabilistic termination
and safety analysis in PPs [17, 16].
Preliminaries
We consider discrete-time stochastic dynamical systems defined by the equation
$$\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t, \omega_t), \qquad \mathbf{x}_0 \in \mathcal{X}_0.$$
The function $f : \mathcal{X} \times \mathcal{U} \times \mathcal{N} \to \mathcal{X}$ defines system dynamics, where $\mathcal{X} \subseteq \mathbb{R}^m$ is the system state space, $\mathcal{U} \subseteq \mathbb{R}^n$ is the control action space and $\mathcal{N} \subseteq \mathbb{R}^p$ is the stochastic disturbance space. We use $t \in \mathbb{N}_0$ to denote the time index, $\mathbf{x}_t \in \mathcal{X}$ the state of the system, $\mathbf{u}_t \in \mathcal{U}$ the action and $\omega_t \in \mathcal{N}$ the stochastic disturbance vector at time $t$. The set $\mathcal{X}_0 \subseteq \mathcal{X}$ is the set of initial states. The action $\mathbf{u}_t$ is chosen according to a control policy $\pi : \mathcal{X} \to \mathcal{U}$, i.e. $\mathbf{u}_t = \pi(\mathbf{x}_t)$. The stochastic disturbance vector $\omega_t$ is sampled according to a specified probability distribution $d$ over $\mathbb{R}^p$. The dynamics function $f$, control policy $\pi$ and probability distribution $d$ together define a stochastic feedback loop system.
A sequence $(\mathbf{x}_t, \mathbf{u}_t, \omega_t)_{t \in \mathbb{N}_0}$ of state-action-disturbance triples is a trajectory of the system if, for each $t \in \mathbb{N}_0$, we have $\mathbf{u}_t = \pi(\mathbf{x}_t)$, $\omega_t \in \mathrm{support}(d)$ and $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t, \omega_t)$. For each initial state $\mathbf{x}_0 \in \mathcal{X}$, the system induces a Markov process which gives rise to the probability space over the set of all trajectories that start in $\mathbf{x}_0$ [46]. We denote the probability measure and the expectation in this probability space by $\mathbb{P}_{\mathbf{x}_0}$ and $\mathbb{E}_{\mathbf{x}_0}$.
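To make these semantics concrete, the following is a minimal simulation sketch of one such stochastic feedback loop. The dynamics function, policy and disturbance distribution below are hypothetical placeholders chosen only for illustration; they are not the dynamical systems or the neural network policies considered in this work.

```python
# Minimal sketch of the feedback loop x_{t+1} = f(x_t, u_t, w_t), u_t = pi(x_t).
# f, pi and the disturbance distribution are illustrative placeholders.
import numpy as np

def f(x, u, w):
    # Hypothetical Lipschitz dynamics: damped state plus control and noise.
    return 0.9 * x + 0.1 * u + w

def pi(x):
    # Hypothetical feedback policy (in this work, a neural network).
    return -0.5 * x

def sample_disturbance(rng):
    # Disturbance distribution d: zero-mean, clipped Gaussian noise.
    return np.clip(rng.normal(0.0, 0.05, size=2), -0.2, 0.2)

def simulate(x0, horizon, rng):
    """Roll out one trajectory of (state, action, disturbance) triples."""
    x, traj = np.asarray(x0, dtype=float), []
    for _ in range(horizon):
        u = pi(x)
        w = sample_disturbance(rng)
        traj.append((x.copy(), u, w))
        x = f(x, u, w)
    return traj

trajectory = simulate(x0=[1.0, -0.5], horizon=100, rng=np.random.default_rng(0))
```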
Assumptions
We assume that $\mathcal{X} \subseteq \mathbb{R}^m$, $\mathcal{X}_0 \subseteq \mathbb{R}^m$, $\mathcal{U} \subseteq \mathbb{R}^n$ and $\mathcal{N} \subseteq \mathbb{R}^p$ are all Borel-measurable, which is a technical assumption necessary for the system semantics to be mathematically well-defined. We also assume that $\mathcal{X} \subseteq \mathbb{R}^m$ is compact and that the dynamics function $f$ is Lipschitz continuous, which are common assumptions in control theory.
Probabilistic reach-avoid problem
Let $\mathcal{X}_t \subseteq \mathcal{X}$ and $\mathcal{X}_u \subseteq \mathcal{X}$ be disjoint Borel-measurable subsets of $\mathbb{R}^m$, which we refer to as the target set and the unsafe set, respectively. Let $p \in [0,1]$ be a probability threshold. Our goal is to learn a control policy which guarantees that, with probability at least $p$, the system reaches the target set $\mathcal{X}_t$ without reaching the unsafe set $\mathcal{X}_u$. Formally, we want to learn a control policy $\pi$ such that, for any initial state $\mathbf{x}_0 \in \mathcal{X}_0$, we have
$$\mathbb{P}_{\mathbf{x}_0}\big[\mathrm{ReachAvoid}(\mathcal{X}_t, \mathcal{X}_u)\big] \geq p,$$
where $\mathrm{ReachAvoid}(\mathcal{X}_t, \mathcal{X}_u) = \{(\mathbf{x}_t, \mathbf{u}_t, \omega_t)_{t \in \mathbb{N}_0} \mid \exists t \in \mathbb{N}_0.\ \mathbf{x}_t \in \mathcal{X}_t \wedge (\forall t' \leq t.\ \mathbf{x}_{t'} \notin \mathcal{X}_u)\}$ is the set of trajectories that reach $\mathcal{X}_t$ without reaching $\mathcal{X}_u$.
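As an illustration only, and not as part of our formal guarantee, this probability can be estimated empirically by Monte Carlo simulation over a truncated horizon. The sketch below is hedged: it reuses the placeholder dynamics, policy and disturbance sampler from the earlier sketch and assumes simple ball-shaped target and unsafe sets, whereas the guarantee we certify is over the infinite time horizon.

```python
# Hedged Monte Carlo estimate of the reach-avoid probability over a truncated
# horizon. Uses the placeholder f, pi, sample_disturbance defined above; the
# target and unsafe sets below are hypothetical.
import numpy as np

def in_target(x):
    return np.linalg.norm(x) <= 0.1   # assumed target set X_t

def in_unsafe(x):
    return np.linalg.norm(x) >= 5.0   # assumed unsafe set X_u

def estimate_reach_avoid(x0, horizon=200, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(n_samples):
        x = np.asarray(x0, dtype=float)
        for _ in range(horizon):
            if in_unsafe(x):
                break                  # avoidance violated
            if in_target(x):
                successes += 1         # reached X_t before X_u
                break
            x = f(x, pi(x), sample_disturbance(rng))
    return successes / n_samples

p_hat = estimate_reach_avoid([1.0, -0.5])
```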
We restrict to the cases when either $p < 1$, or $p = 1$ and $\mathcal{X}_u = \emptyset$. Our approach is not applicable to the case $p = 1$ and $\mathcal{X}_u \neq \emptyset$ due to technical issues that arise in defining our formal certificate, which we discuss in the following section. We remark that probabilistic reachability is a special instance of our problem obtained by setting $\mathcal{X}_u = \emptyset$. On the other hand, we cannot directly obtain the probabilistic safety problem by assuming any specific form of the target set $\mathcal{X}_t$; however, we will show in the following section that our method implies probabilistic safety with respect to $\mathcal{X}_u$ if we provide it with $\mathcal{X}_t = \emptyset$.
Theoretical Results
We now present our framework for formally certifying a
reach-avoid constraint with a given probability threshold. Our
framework is based on the novel notion of reach-avoid super-
martingales (RASMs) that we introduce in this work. Note
that, in this section only, we assume that the policy is fixed.
In the next section, we will present our algorithm for learning
policies that provide formal reach-avoid guarantees in which
RASMs will be an integral ingredient. In what follows, we
consider a discrete-time stochastic dynamical system defined
as in the previous section. For now, we assume that the probability threshold is strictly smaller than $1$, i.e. $p < 1$. We will later show that our approach straightforwardly extends to the case $p = 1$ and $\mathcal{X}_u = \emptyset$.
Reach-avoid supermartingales
We define a reach-avoid su-
permartingale (RASM) to be a continuous function $V : \mathcal{X} \to \mathbb{R}$ that assigns real values to system states. The name is cho-
sen to emphasize the connection to supermartingale processes
from probability theory [64], which we will explore later in
order to prove the effectiveness of RASMs for verifying
reach-avoid properties. The value of $V$ is required to be nonnegative over the state space $\mathcal{X}$ (Nonnegativity condition), to be bounded from above by $1$ over the set of initial states $\mathcal{X}_0$ (Initial condition) and to be bounded from below by $\frac{1}{1-p}$ over the set of unsafe states $\mathcal{X}_u$ (Safety condition). Hence, in order for a system trajectory to reach an unsafe state and violate the safety specification, the value of the RASM $V$ needs to increase at least $\frac{1}{1-p}$ times along the trajectory. Finally, we require the existence of $\epsilon > 0$ such that the value of $V$ decreases in expected value by at least $\epsilon$ after every one-step evolution of the system from every system state $\mathbf{x} \in \mathcal{X} \backslash \mathcal{X}_t$ for which $V(\mathbf{x}) \leq \frac{1}{1-p}$ (Expected decrease condition). Intuitively, this last condition imposes that the system has a tendency to strictly decrease the value of $V$ until either the target set $\mathcal{X}_t$ is reached or a state with $V(\mathbf{x}) \geq \frac{1}{1-p}$ is reached. However, as the value of $V$ needs to increase at least $\frac{1}{1-p}$ times in order for the system to reach an unsafe state, these four conditions will allow us to use RASMs to certify that the reach-avoid constraint is satisfied with probability at least $p$.
Definition 1 (Reach-avoid supermartingales). Let $\mathcal{X}_t \subseteq \mathcal{X}$ and $\mathcal{X}_u \subseteq \mathcal{X}$ be the target set and the unsafe set, and let $p \in [0,1)$ be the probability threshold. A continuous function $V : \mathcal{X} \to \mathbb{R}$ is said to be a reach-avoid supermartingale (RASM) with respect to $\mathcal{X}_t$, $\mathcal{X}_u$ and $p$ if it satisfies:
1. Nonnegativity condition. $V(\mathbf{x}) \geq 0$ for each $\mathbf{x} \in \mathcal{X}$.
2. Initial condition. $V(\mathbf{x}) \leq 1$ for each $\mathbf{x} \in \mathcal{X}_0$.
3. Safety condition. $V(\mathbf{x}) \geq \frac{1}{1-p}$ for each $\mathbf{x} \in \mathcal{X}_u$.
4. Expected decrease condition. There exists $\epsilon > 0$ such that, for each $\mathbf{x} \in \mathcal{X} \backslash \mathcal{X}_t$ at which $V(\mathbf{x}) \leq \frac{1}{1-p}$, we have $V(\mathbf{x}) \geq \mathbb{E}_{\omega \sim d}\big[V(f(\mathbf{x}, \pi(\mathbf{x}), \omega))\big] + \epsilon$.
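For intuition, the following is a hedged sketch of checking the four conditions of Definition 1 at finitely many sampled states, with the expectation in the Expected decrease condition replaced by a Monte Carlo estimate. A formal certificate must satisfy these conditions over the entire state space, not just at samples, so this finite-sample check is only illustrative; it reuses the placeholder functions from the earlier sketches, and the candidate $V$ is assumed to be any callable (in this work, a neural network).

```python
# Hedged finite-sample check of the RASM conditions from Definition 1.
# V is a candidate certificate (any callable); f, pi, sample_disturbance and
# in_target are the placeholders from the earlier sketches.
import numpy as np

def expected_next_value(V, x, rng, n_noise=64):
    # Monte Carlo estimate of E_{w~d}[ V(f(x, pi(x), w)) ].
    return np.mean([V(f(x, pi(x), sample_disturbance(rng))) for _ in range(n_noise)])

def check_rasm_on_samples(V, states, init_states, unsafe_states, p, eps, seed=0):
    rng = np.random.default_rng(seed)
    threshold = 1.0 / (1.0 - p)
    ok_nonneg = all(V(x) >= 0.0 for x in states)                 # condition 1
    ok_init = all(V(x) <= 1.0 for x in init_states)              # condition 2
    ok_safety = all(V(x) >= threshold for x in unsafe_states)    # condition 3
    ok_decrease = all(                                           # condition 4
        V(x) >= expected_next_value(V, x, rng) + eps
        for x in states
        if not in_target(x) and V(x) <= threshold
    )
    return ok_nonneg and ok_init and ok_safety and ok_decrease
```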
Comparison to Lyapunov functions
The defining proper-
ties of RASMs hint at a connection to Lyapunov functions for
deterministic control systems. However, the key difference
between Lyapunov functions and our RASMs is that Lya-
punov functions deterministically decrease in value whereas
RASMs decrease in expectation. Deterministic decrease en-
sures that each level set of a Lyapunov function, i.e. a set of
states at which the value of the Lyapunov function is at most $l$ for some $l \geq 0$, is an invariant of the system. However,
it is in general not possible to impose such a condition on
stochastic systems. In contrast, our RASMs only require ex-
pected decrease in the level, and the Initial and the Safety conditions can be viewed as conditions on the maximal initial level set and the minimal unsafe level set. The choice of a
ratio of these two level values allows us to use existing results
from martingale theory in order to obtain probabilistic avoid-
ance guarantees, while the Expected decrease condition by $\epsilon > 0$ furthermore provides us with probabilistic reachability
guarantees.
Certifying reach-avoid constraints via RASMs
We now show that the existence of an $\epsilon$-RASM for some $\epsilon > 0$ implies that the reach-avoid constraint is satisfied with probability at least $p$.
Theorem 1. Let $\mathcal{X}_t \subseteq \mathcal{X}$ and $\mathcal{X}_u \subseteq \mathcal{X}$ be the target set and the unsafe set, respectively, and let $p \in [0,1)$ be the probability threshold. Suppose that there exists an RASM $V$ with respect to $\mathcal{X}_t$, $\mathcal{X}_u$ and $p$. Then, for every $\mathbf{x}_0 \in \mathcal{X}_0$, $\mathbb{P}_{\mathbf{x}_0}[\mathrm{ReachAvoid}(\mathcal{X}_t, \mathcal{X}_u)] \geq p$.
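For intuition, the avoidance part of Theorem 1 follows a standard supermartingale level-crossing argument. The following is a hedged sketch, assuming the maximal inequality for nonnegative supermartingales (Ville's inequality) applied to the process $(V(\mathbf{x}_t))_{t \in \mathbb{N}_0}$ stopped at the first time $T$ at which $\mathcal{X}_t$ is reached or $V$ exceeds $\frac{1}{1-p}$:
$$\mathbb{P}_{\mathbf{x}_0}\Big[\sup_{t \in \mathbb{N}_0} V(\mathbf{x}_{t \wedge T}) \geq \tfrac{1}{1-p}\Big] \;\leq\; (1-p)\, V(\mathbf{x}_0) \;\leq\; 1-p,$$
where the last step uses the Initial condition $V(\mathbf{x}_0) \leq 1$. Since every unsafe state has $V \geq \frac{1}{1-p}$ by the Safety condition, the unsafe set is reached before the target set with probability at most $1-p$; the reachability part additionally uses the $\epsilon$-decrease to argue that, with probability $1$, the system eventually reaches $\mathcal{X}_t$ or exceeds the level $\frac{1}{1-p}$.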
The complete proof of Theorem 1 is provided in the Appendix. However, in what follows we sketch the key ideas behind our proof, in order to illustrate the applicability of martingale theory to reasoning about stochastic systems, which we believe has significant potential for applications beyond the scope of this work. To prove the theorem, we first show that an $\epsilon$-RASM $V$ induces a supermartingale [64] in