Learning Control Policies for Stochastic Systems with Reach-avoid Guarantees
Đorđe Žikelić*1, Mathias Lechner*2, Thomas A. Henzinger1, Krishnendu Chatterjee1
1Institute of Science and Technology Austria (ISTA)
2Massachusetts Institute of Technology (MIT)
*Equal contribution
Correspondence to djordje.zikelic@ist.ac.at, mlechner@mit.edu
Abstract
We study the problem of learning controllers for discrete-
time non-linear stochastic dynamical systems with formal
reach-avoid guarantees. This work presents the first method
for providing formal reach-avoid guarantees, which combine
and generalize stability and safety guarantees, with a tolerable probability threshold $p \in [0,1]$ over the infinite time
horizon in general Lipschitz continuous systems. Our method leverages advances in the machine learning literature and represents formal certificates as neural networks. In particular, we
learn a certificate in the form of a reach-avoid supermartin-
gale (RASM), a novel notion that we introduce in this work.
Our RASMs provide reachability and avoidance guarantees
by imposing constraints on what can be viewed as a stochastic
extension of level sets of Lyapunov functions for deterministic
systems. Our approach solves several important problems – it
can be used to learn a control policy from scratch, to verify
a reach-avoid specification for a fixed control policy, or to
fine-tune a pre-trained policy if it does not satisfy the reach-
avoid specification. We validate our approach on
3
stochastic
non-linear reinforcement learning tasks.
Introduction
Reinforcement learning (RL) has achieved impressive results
in solving non-linear control problems, resulting in an inter-
est to deploy RL algorithms in safety-critical applications.
However, most RL algorithms focus solely on optimizing
expected performance and do not take safety constraints into
account [55]. This raises concerns about their applicability
to safety-critical domains in which unsafe behavior can lead
to catastrophic consequences [7, 23]. Complicating matters, models are usually imperfect approximations of real systems obtained from observed data, so they often need to account for uncertainty, which is modelled via stochastic disturbances. Formal safety verification of policies learned
via RL algorithms and design of learning algorithms that take
safety constraints into account have thus become very active
research topics.
Reach-avoid constraints are among the most common and practically relevant constraints appearing in safety-critical applications, and they generalize both reachability and safety constraints [53]. Given a target region and an unsafe region, the
reach-avoid constraint requires that a system controlled by a
policy converges to the target region while avoiding the un-
safe region. For instance, a lane-keeping constraint requires
a self-driving car to reach its destination without leaving the
allowed car lanes [60]. In the case of stochastic control prob-
lems, reach-avoid constraints are also specified by a minimal
probability with which the system controlled by a policy
needs to satisfy the reach-avoid constraint.
In this work, we consider discrete-time stochastic control
problems under reach-avoid constraints. Following the re-
cent trend that aims to leverage advances in deep RL for safe
control, we propose a learning method that learns a control
policy together with a formal reach-avoid certificate in the
form of a reach-avoid supermartingale (RASM), a novel no-
tion that we introduce in this work. Informally, an RASM
is a function assigning nonnegative real values to each state
that is required to strictly decrease in expected value until
the target region is reached, but needs to strictly increase for
the system to reach the unsafe region. By carefully choosing
the ratio of the initial level set of the RASM and the least
level set that the RASM needs to attain for the system to
reach the unsafe region (here we use the standard level set
terminology of Lyapunov functions [27]), we obtain a formal
reach-avoid certificate. The name of RASMs is chosen to
emphasize the connection to supermartingale processes in
probability theory [64]. Our RASMs significantly generalize
and unify stochastic control barrier functions, which are a standard certificate for safe control of stochastic systems [45], and ranking supermartingales, which were introduced to certify probability 1 reachability and stability in [38].
Contributions.
This work presents the first control
method that provides formal reach-avoid guarantees for con-
trol of stochastic systems with a specified probability thresh-
old over the infinite time horizon in Lipschitz continuous
systems. In contrast, the existing approaches to control under
reach-avoid constraints are only applicable to finite horizon
settings, polynomial stochastic systems or to deterministic
systems (see the following section for an overview of re-
lated work). Moreover, our method simultaneously learns
the control policy and the RASM certificate in the form of
neural networks and is applicable to general non-linear sys-
tems. This contrasts the existing methods from the literature
that are based on stochastic control barrier functions, which
utilize convex optimization tools to compute control policies
and are restricted to polynomial system dynamics and policies [45, 52, 49, 65]. Our algorithm draws insight from estab-
lished methods for learning Lyapunov functions for stability
in deterministic control problems [48, 15, 1], which were demonstrated to be more efficient than the existing convex optimization methods and were adapted in [38] for probability 1 reachability and stability verification. Finally, our
method learns a suitable policy on demand, or alternatively,
verifies reach-avoid properties of a fixed Lipschitz continuous
control policy. We experimentally validate our method on 3 stochastic RL tasks and show that it efficiently learns control
policies with probabilistic reach-avoid guarantees in practice.
Related Work
Deterministic control problems
There is extensive litera-
ture on safe control, with most works certifying stability via
Lyapunov functions [27] or safety via control barrier functions [6]. Most early works rely either on hand-designed
certificates, or automate their computation through convex
optimization methods such as sum-of-squares (SOS) pro-
gramming [29, 43, 30]. Automation via SOS programming is
restricted to problems with polynomial system dynamics and
does not scale well with dimension. A promising approach
to overcome these limitations is to learn a control policy to-
gether with a safety certificate in the form of neural networks,
for instance see [48, 54, 31, 14, 47]. In particular, [15, 1]
learn a control policy and a certificate as neural networks by
using a learner-verifier framework which repeatedly learns
a candidate policy and a certificate and then tries to either
verify or refine them. Our method extends some of these
ideas to stochastic systems.
Stochastic control problems
Safe control of stochastic sys-
tems has received comparatively less attention. Most exist-
ing approaches are abstraction based – they consider finite-
time horizon systems and approximate them via a finite-state
Markov decision process (MDP). The constrained control
problem is then solved for the MDP. Due to accumulation
of the approximation error in each time step, the size of the
MDP state space needs to grow with the length of the con-
sidered time horizon, making these methods applicable only to
systems that evolve over fixed finite time horizons. Notable
examples include [51, 35, 12, 62, 61, 20]. Another line of
work considers polynomial systems and utilizes stochastic
control barrier functions and convex optimization tools to
compute polynomial control policies [45, 52, 49, 65].
Constrained MDPs
Safe RL has also been studied in the
context of constrained MDPs (CMDPs) [5, 24]. An agent
in a CMDP must satisfy hard constraints on expected cost
for one or more auxiliary notions of cost aggregated over an
episode. Several works study RL algorithms for CMDPs [
58
],
notably the Constrained Policy Optimization (CPO) [
3
] or
the method [
18
] which proposed a Lyapunov method for
solving CMDPs. While these algorithms perform well, their
constraints are satisfied in expectation which makes them
less suitable for safety-critical systems.
Safe RL via shielding
Some approaches ensure safety by
computing two control policies – the main policy that opti-
mizes the expected reward, and the backup policy that the
system falls back to whenever a safety constraint may be
violated [41, 44, 4, 22, 25]. The backup policy can thus be
of simpler form. Shielding for stochastic linear systems with
additive disturbances has been considered in [63]. The approaches of [39, 8] are applicable to stochastic non-linear systems; however, their safety guarantees are statistical: their algorithms are randomized with parameters $\delta, \epsilon \in (0,1)$ and, with probability $1-\delta$, they compute an action that is safe in the current state with probability at least $1-\epsilon$. The statistical error is accumulated
at each state, hence these approaches are not suitable for infi-
nite or long time horizons. In contrast, our approach targets
formal guarantees for infinite time horizon problems.
Safe exploration
Model-free RL algorithms need to explore
the state space in order to learn high performing actions.
Safe exploration RL restricts exploration in a way which
ensures that given safety constraints are satisfied. The most
common approach to ensuring safe exploration is learning
the system dynamics’ uncertainty bounds and limiting the
exploratory actions within a high probability safety region,
with the existing methods based on Gaussian Processes [33, 57, 9], linearized models [21], deep robust regression [40], safe padding [28] and Bayesian neural networks [37]. Recent
work has also considered learning stable stochastic dynamics
from data [59, 36].
Probabilistic program analysis
Supermartingales have also
been used for the analysis of probabilistic programs (PPs).
In particular, ranking supermartingales (RSMs) were originally used to prove almost-sure termination in PPs [13], and [2] learns RSMs in PPs.
Supermartingales were also used for probabilistic termination
and safety analysis in PPs [17, 16].
Preliminaries
We consider discrete-time stochastic dynamical systems defined by the equation
$$\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t, \omega_t), \qquad \mathbf{x}_0 \in \mathcal{X}_0.$$
The function $f : \mathcal{X} \times \mathcal{U} \times \mathcal{N} \to \mathcal{X}$ defines system dynamics, where $\mathcal{X} \subseteq \mathbb{R}^m$ is the system state space, $\mathcal{U} \subseteq \mathbb{R}^n$ is the control action space and $\mathcal{N} \subseteq \mathbb{R}^p$ is the stochastic disturbance space. We use $t \in \mathbb{N}_0$ to denote the time index, $\mathbf{x}_t \in \mathcal{X}$ the state of the system, $\mathbf{u}_t \in \mathcal{U}$ the action and $\omega_t \in \mathcal{N}$ the stochastic disturbance vector at time $t$. The set $\mathcal{X}_0 \subseteq \mathcal{X}$ is the set of initial states. The action $\mathbf{u}_t$ is chosen according to a control policy $\pi : \mathcal{X} \to \mathcal{U}$, i.e. $\mathbf{u}_t = \pi(\mathbf{x}_t)$. The stochastic disturbance vector $\omega_t$ is sampled according to a specified probability distribution $d$ over $\mathbb{R}^p$. The dynamics function $f$, control policy $\pi$ and probability distribution $d$ together define a stochastic feedback loop system.
A sequence $(\mathbf{x}_t, \mathbf{u}_t, \omega_t)_{t \in \mathbb{N}_0}$ of state-action-disturbance triples is a trajectory of the system if, for each $t \in \mathbb{N}_0$, we have $\mathbf{u}_t = \pi(\mathbf{x}_t)$, $\omega_t \in \mathrm{support}(d)$ and $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t, \omega_t)$. For each initial state $\mathbf{x}_0 \in \mathcal{X}$, the system induces a Markov process which gives rise to the probability space over the set of all trajectories that start in $\mathbf{x}_0$ [46]. We denote the probability measure and the expectation in this probability space by $\mathbb{P}_{\mathbf{x}_0}$ and $\mathbb{E}_{\mathbf{x}_0}$.
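To make these semantics concrete, the following is a minimal simulation sketch of one such stochastic feedback loop. The dynamics function, policy and disturbance distribution below are hypothetical placeholders chosen only for illustration; they are not the dynamical systems or the neural network policies considered in this work.

```python
# Minimal sketch of the feedback loop x_{t+1} = f(x_t, u_t, w_t), u_t = pi(x_t).
# f, pi and the disturbance distribution are illustrative placeholders.
import numpy as np

def f(x, u, w):
    # Hypothetical Lipschitz dynamics: damped state plus control and noise.
    return 0.9 * x + 0.1 * u + w

def pi(x):
    # Hypothetical feedback policy (in this work, a neural network).
    return -0.5 * x

def sample_disturbance(rng):
    # Disturbance distribution d: zero-mean, clipped Gaussian noise.
    return np.clip(rng.normal(0.0, 0.05, size=2), -0.2, 0.2)

def simulate(x0, horizon, rng):
    """Roll out one trajectory of (state, action, disturbance) triples."""
    x, traj = np.asarray(x0, dtype=float), []
    for _ in range(horizon):
        u = pi(x)
        w = sample_disturbance(rng)
        traj.append((x.copy(), u, w))
        x = f(x, u, w)
    return traj

trajectory = simulate(x0=[1.0, -0.5], horizon=100, rng=np.random.default_rng(0))
```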
Assumptions
We assume that $\mathcal{X} \subseteq \mathbb{R}^m$, $\mathcal{X}_0 \subseteq \mathbb{R}^m$, $\mathcal{U} \subseteq \mathbb{R}^n$ and $\mathcal{N} \subseteq \mathbb{R}^p$ are all Borel-measurable, which is a technical assumption necessary for the system semantics to be mathematically well-defined. We also assume that $\mathcal{X} \subseteq \mathbb{R}^m$ is compact and that the dynamics function $f$ is Lipschitz continuous, which are common assumptions in control theory.
Probabilistic reach-avoid problem
Let $\mathcal{X}_t \subseteq \mathcal{X}$ and $\mathcal{X}_u \subseteq \mathcal{X}$ be disjoint Borel-measurable subsets of $\mathbb{R}^m$, which we refer to as the target set and the unsafe set, respectively. Let $p \in [0,1]$ be a probability threshold. Our goal is to learn a control policy which guarantees that, with probability at least $p$, the system reaches the target set $\mathcal{X}_t$ without reaching the unsafe set $\mathcal{X}_u$. Formally, we want to learn a control policy $\pi$ such that, for any initial state $\mathbf{x}_0 \in \mathcal{X}_0$, we have
$$\mathbb{P}_{\mathbf{x}_0}\big[\mathrm{ReachAvoid}(\mathcal{X}_t, \mathcal{X}_u)\big] \geq p,$$
where $\mathrm{ReachAvoid}(\mathcal{X}_t, \mathcal{X}_u) = \{(\mathbf{x}_t, \mathbf{u}_t, \omega_t)_{t \in \mathbb{N}_0} \mid \exists t \in \mathbb{N}_0.\ \mathbf{x}_t \in \mathcal{X}_t \wedge (\forall t' \leq t.\ \mathbf{x}_{t'} \notin \mathcal{X}_u)\}$ is the set of trajectories that reach $\mathcal{X}_t$ without reaching $\mathcal{X}_u$.
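As an illustration only, and not as part of our formal guarantee, this probability can be estimated empirically by Monte Carlo simulation over a truncated horizon. The sketch below is hedged: it reuses the placeholder dynamics, policy and disturbance sampler from the earlier sketch and assumes simple ball-shaped target and unsafe sets, whereas the guarantee we certify is over the infinite time horizon.

```python
# Hedged Monte Carlo estimate of the reach-avoid probability over a truncated
# horizon. Uses the placeholder f, pi, sample_disturbance defined above; the
# target and unsafe sets below are hypothetical.
import numpy as np

def in_target(x):
    return np.linalg.norm(x) <= 0.1   # assumed target set X_t

def in_unsafe(x):
    return np.linalg.norm(x) >= 5.0   # assumed unsafe set X_u

def estimate_reach_avoid(x0, horizon=200, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(n_samples):
        x = np.asarray(x0, dtype=float)
        for _ in range(horizon):
            if in_unsafe(x):
                break                  # avoidance violated
            if in_target(x):
                successes += 1         # reached X_t before X_u
                break
            x = f(x, pi(x), sample_disturbance(rng))
    return successes / n_samples

p_hat = estimate_reach_avoid([1.0, -0.5])
```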
We restrict to the cases when either $p < 1$, or $p = 1$ and $\mathcal{X}_u = \emptyset$. Our approach is not applicable to the case $p = 1$ and $\mathcal{X}_u \neq \emptyset$ due to technical issues that arise in defining our formal certificate, which we discuss in the following section. We remark that probabilistic reachability is a special instance of our problem obtained by setting $\mathcal{X}_u = \emptyset$. On the other hand, we cannot directly obtain the probabilistic safety problem by assuming any specific form of the target set $\mathcal{X}_t$; however, we will show in the following section that our method implies probabilistic safety with respect to $\mathcal{X}_u$ if we provide it with $\mathcal{X}_t = \emptyset$.
Theoretical Results
We now present our framework for formally certifying a
reach-avoid constraint with a given probability threshold. Our
framework is based on the novel notion of reach-avoid super-
martingales (RASMs) that we introduce in this work. Note
that, in this section only, we assume that the policy is fixed.
In the next section, we will present our algorithm for learning
policies that provide formal reach-avoid guarantees in which
RASMs will be an integral ingredient. In what follows, we
consider a discrete-time stochastic dynamical system defined
as in the previous section. For now, we assume that the probability threshold is strictly smaller than $1$, i.e. $p < 1$. We will later show that our approach straightforwardly extends to the case $p = 1$ and $\mathcal{X}_u = \emptyset$.
Reach-avoid supermartingales
We define a reach-avoid su-
permartingale (RASM) to be a continuous function $V : \mathcal{X} \to \mathbb{R}$ that assigns real values to system states. The name is cho-
sen to emphasize the connection to supermartingale processes
from probability theory [64], which we will explore later in
order to prove the effectiveness of RASMs for verifying
reach-avoid properties. The value of $V$ is required to be nonnegative over the state space $\mathcal{X}$ (Nonnegativity condition), to be bounded from above by $1$ over the set of initial states $\mathcal{X}_0$ (Initial condition) and to be bounded from below by $\frac{1}{1-p}$ over the set of unsafe states $\mathcal{X}_u$ (Safety condition). Hence, in order for a system trajectory to reach an unsafe state and violate the safety specification, the value of the RASM $V$ needs to increase at least $\frac{1}{1-p}$ times along the trajectory. Finally, we require the existence of $\epsilon > 0$ such that the value of $V$ decreases in expected value by at least $\epsilon$ after every one-step evolution of the system from every system state $\mathbf{x} \in \mathcal{X} \backslash \mathcal{X}_t$ for which $V(\mathbf{x}) \leq \frac{1}{1-p}$ (Expected decrease condition). Intuitively, this last condition imposes that the system has a tendency to strictly decrease the value of $V$ until either the target set $\mathcal{X}_t$ is reached or a state with $V(\mathbf{x}) \geq \frac{1}{1-p}$ is reached. However, as the value of $V$ needs to increase at least $\frac{1}{1-p}$ times in order for the system to reach an unsafe state, these four conditions will allow us to use RASMs to certify that the reach-avoid constraint is satisfied with probability at least $p$.
Definition 1 (Reach-avoid supermartingales). Let $\mathcal{X}_t \subseteq \mathcal{X}$ and $\mathcal{X}_u \subseteq \mathcal{X}$ be the target set and the unsafe set, and let $p \in [0,1)$ be the probability threshold. A continuous function $V : \mathcal{X} \to \mathbb{R}$ is said to be a reach-avoid supermartingale (RASM) with respect to $\mathcal{X}_t$, $\mathcal{X}_u$ and $p$ if it satisfies:
1. Nonnegativity condition. $V(\mathbf{x}) \geq 0$ for each $\mathbf{x} \in \mathcal{X}$.
2. Initial condition. $V(\mathbf{x}) \leq 1$ for each $\mathbf{x} \in \mathcal{X}_0$.
3. Safety condition. $V(\mathbf{x}) \geq \frac{1}{1-p}$ for each $\mathbf{x} \in \mathcal{X}_u$.
4. Expected decrease condition. There exists $\epsilon > 0$ such that, for each $\mathbf{x} \in \mathcal{X} \backslash \mathcal{X}_t$ at which $V(\mathbf{x}) \leq \frac{1}{1-p}$, we have $V(\mathbf{x}) \geq \mathbb{E}_{\omega \sim d}\big[V(f(\mathbf{x}, \pi(\mathbf{x}), \omega))\big] + \epsilon$.
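For intuition, the following is a hedged sketch of checking the four conditions of Definition 1 at finitely many sampled states, with the expectation in the Expected decrease condition replaced by a Monte Carlo estimate. A formal certificate must satisfy these conditions over the entire state space, not just at samples, so this finite-sample check is only illustrative; it reuses the placeholder functions from the earlier sketches, and the candidate $V$ is assumed to be any callable (in this work, a neural network).

```python
# Hedged finite-sample check of the RASM conditions from Definition 1.
# V is a candidate certificate (any callable); f, pi, sample_disturbance and
# in_target are the placeholders from the earlier sketches.
import numpy as np

def expected_next_value(V, x, rng, n_noise=64):
    # Monte Carlo estimate of E_{w~d}[ V(f(x, pi(x), w)) ].
    return np.mean([V(f(x, pi(x), sample_disturbance(rng))) for _ in range(n_noise)])

def check_rasm_on_samples(V, states, init_states, unsafe_states, p, eps, seed=0):
    rng = np.random.default_rng(seed)
    threshold = 1.0 / (1.0 - p)
    ok_nonneg = all(V(x) >= 0.0 for x in states)                 # condition 1
    ok_init = all(V(x) <= 1.0 for x in init_states)              # condition 2
    ok_safety = all(V(x) >= threshold for x in unsafe_states)    # condition 3
    ok_decrease = all(                                           # condition 4
        V(x) >= expected_next_value(V, x, rng) + eps
        for x in states
        if not in_target(x) and V(x) <= threshold
    )
    return ok_nonneg and ok_init and ok_safety and ok_decrease
```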
Comparison to Lyapunov functions
The defining proper-
ties of RASMs hint at a connection to Lyapunov functions for
deterministic control systems. However, the key difference
between Lyapunov functions and our RASMs is that Lya-
punov functions deterministically decrease in value whereas
RASMs decrease in expectation. Deterministic decrease en-
sures that each level set of a Lyapunov function, i.e. a set of
states at which the value of the Lyapunov function is at most $l$ for some $l \geq 0$, is an invariant of the system. However,
it is in general not possible to impose such a condition on
stochastic systems. In contrast, our RASMs only require ex-
pected decrease in the level, and the Initial and the Safety conditions can be viewed as conditions on the maximal initial level set and the minimal unsafe level set. The choice of a
ratio of these two level values allows us to use existing results
from martingale theory in order to obtain probabilistic avoid-
ance guarantees, while the Expected decrease condition by $\epsilon > 0$ furthermore provides us with probabilistic reachability
guarantees.
Certifying reach-avoid constraints via RASMs
We now show that the existence of an $\epsilon$-RASM for some $\epsilon > 0$ implies that the reach-avoid constraint is satisfied with probability at least $p$.
Theorem 1. Let $\mathcal{X}_t \subseteq \mathcal{X}$ and $\mathcal{X}_u \subseteq \mathcal{X}$ be the target set and the unsafe set, respectively, and let $p \in [0,1)$ be the probability threshold. Suppose that there exists an RASM $V$ with respect to $\mathcal{X}_t$, $\mathcal{X}_u$ and $p$. Then, for every $\mathbf{x}_0 \in \mathcal{X}_0$, $\mathbb{P}_{\mathbf{x}_0}[\mathrm{ReachAvoid}(\mathcal{X}_t, \mathcal{X}_u)] \geq p$.
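For intuition, the avoidance part of Theorem 1 follows a standard supermartingale level-crossing argument. The following is a hedged sketch, assuming the maximal inequality for nonnegative supermartingales (Ville's inequality) applied to the process $(V(\mathbf{x}_t))_{t \in \mathbb{N}_0}$ stopped at the first time $T$ at which $\mathcal{X}_t$ is reached or $V$ exceeds $\frac{1}{1-p}$:
$$\mathbb{P}_{\mathbf{x}_0}\Big[\sup_{t \in \mathbb{N}_0} V(\mathbf{x}_{t \wedge T}) \geq \tfrac{1}{1-p}\Big] \;\leq\; (1-p)\, V(\mathbf{x}_0) \;\leq\; 1-p,$$
where the last step uses the Initial condition $V(\mathbf{x}_0) \leq 1$. Since every unsafe state has $V \geq \frac{1}{1-p}$ by the Safety condition, the unsafe set is reached before the target set with probability at most $1-p$; the reachability part additionally uses the $\epsilon$-decrease to argue that, with probability $1$, the system eventually reaches $\mathcal{X}_t$ or exceeds the level $\frac{1}{1-p}$.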
The complete proof of Theorem 1 is provided in the Appendix. However, in what follows we sketch the key ideas behind our proof, in order to illustrate the applicability of martingale theory to reasoning about stochastic systems, which we believe has significant potential for applications beyond the scope of this work. To prove the theorem, we first show that an $\epsilon$-RASM $V$ induces a supermartingale [64] in