Learning Provably Stabilizing Neural Controllers
for Discrete-Time Stochastic Systems
Matin Ansaripour1, Krishnendu Chatterjee2, Thomas A. Henzinger2,
Mathias Lechner3, and Ðorđe Žikelić2
1Sharif University of Technology, Tehran, Iran
matinansaripour@gmail.com
2Institute of Science and Technology Austria (ISTA), Klosterneuburg, Austria
{krishnendu.chatterjee, tah, djordje.zikelic}@ist.ac.at
3Massachusetts Institute of Technology, Cambridge, MA, USA
mlechner@mit.edu

This work was supported in part by the ERC-2020-AdG 101020093, ERC CoG 863818 (FoRM-SMArt) and the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 665385.
Abstract.
We consider the problem of learning control policies in discrete-
time stochastic systems which guarantee that the system stabilizes within
some specified stabilization region with probability 1. Our approach is
based on the novel notion of stabilizing ranking supermartingales (sRSMs)
that we introduce in this work. Our sRSMs overcome the limitation of
methods proposed in previous works whose applicability is restricted
to systems in which the stabilizing region cannot be left once entered
under any control policy. We present a learning procedure that learns a
control policy together with an sRSM that formally certifies probability 1
stability, both learned as neural networks. We show that this procedure
can also be adapted to formally verifying that, under a given Lipschitz
continuous control policy, the stochastic system stabilizes within some
stabilizing region with probability 1. Our experimental evaluation shows
that our learning procedure can successfully learn provably stabilizing
policies in practice.
Keywords: Learning-based control · Stochastic systems · Martingales · Formal verification · Stabilization.
1 Introduction
Machine learning based methods and in particular reinforcement learning (RL)
present a promising approach to solving highly non-linear control problems. This
has sparked interest in the deployment of learning-based control methods in
safety-critical autonomous systems such as self-driving cars or healthcare devices.
However, the key challenge for their deployment in real-world scenarios is that
they do not consider hard safety constraints. For instance, the main objective of
RL is to maximize expected reward [46], but doing so provides no guarantees of the system's safety. A more recent paradigm in safe RL considers constrained Markov decision processes (cMDPs) [4,26,50,3,20], which are equipped with both
a reward function and an auxiliary cost function. The goal of these works is then
to maximize expected reward while keeping expected cost below some tolerable
threshold. While these methods do enhance safety, they only ensure empirically
that the expected cost function is below the threshold and do not provide any
formal guarantees on constraint satisfaction.
This is particularly concerning for safety-critical applications, in which unsafe
behavior of the system might have fatal consequences. Thus, a fundamental
challenge for deploying learning-based methods in safety-critical autonomous
systems applications is formally certifying safety of learned control policies [5,25].
Stability is a fundamental safety constraint in control theory, which requires
the system to converge to and eventually stay within some specified stabilizing
region with probability 1, a.k.a. almost-sure (a.s.) asymptotic stability [31,33].
Most existing research on learning policies for a control system with formal
guarantees on stability considers deterministic systems and employs Lyapunov
functions [31] for certifying the system's stability. In particular, a Lyapunov function is learned jointly with the control policy [7,42,14,1]. Informally, a Lyapunov function is a function that maps system states to nonnegative real numbers whose value decreases after every one-step evolution of the system until the stabilizing region is reached. Recently, [37] proposed a learning procedure for learning ranking supermartingales (RSMs) [10] for certifying a.s. asymptotic
stability in discrete-time stochastic systems. RSMs generalize Lyapunov functions
to supermartingale processes in probability theory [54] and decrease in value in
expectation upon every one-step evolution of the system.
While these works present significant advances in learning control policies with
formal stability guarantees as well as formal stability verification, they are either
only applicable to deterministic systems or assume that the stabilizing set is closed
under system dynamics, i.e., the agent cannot leave it once entered. In particular,
the work of [37] reduces stability in stochastic systems to an a.s. reachability
condition by assuming that the agent cannot leave the stabilization set. However,
this assumption may not hold in real-world settings because the agent may be
able to leave the stabilizing set with some positive probability due to the existence
of stochastic disturbances, see Figure 1. We illustrate the importance of relaxing
this assumption on the classical example of balancing a pendulum in the upright
position, which we also study in our experimental evaluation. The closedness
under system dynamics assumption implies that, once the pendulum is in an
upright position, it is ensured to stay upright and not move away. However, this
is not a very realistic assumption due to the possible existence of minor disturbances
which the controller needs to balance out. The closedness under system dynamics
assumption essentially assumes the existence of a balancing control policy which
takes care of this problem. In contrast, our method does not assume such a
balancing policy and learns a control policy which ensures, with probability 1, that (1) the pendulum reaches the upright position and (2) the pendulum eventually stays upright.
While the removal of the assumption that a stabilizing region cannot be left
may appear to be a small improvement, in formal methods this is well-understood
to be a significant and difficult step. With the assumption, the desired controller
has an a.s. reachability objective. Without the assumption, the desired controller
has an a.s. persistence (or co-Büchi) objective, namely, to reach and stay in the
stabilizing region with probability 1. Verification or synthesis for reachability
conditions allows, in general, much simpler techniques than verification or synthesis
for persistence conditions. For example, in non-stochastic systems, reachability can
be expressed in alternation-free µ-calculus (i.e., fixpoint computation), whereas
persistence requires alternation (i.e., nested fixpoint computation). Technically,
reachability conditions are found on the first level of the Borel hierarchy, while
persistence conditions are found on the second level [12]. It is, therefore, not
surprising that also over continuous and stochastic state spaces, reachability
techniques are insufficient for solving persistence problems.
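For concreteness, the two objectives can be written as follows, where X_s denotes the stabilizing region (this display is our illustration and is not taken from the paper; the formal notion of stability used in this work is given in Definition 1 below):

```latex
% Almost-sure reachability: the stabilizing region is visited at least once.
\mathbb{P}_{x_0}\big[\exists\, t \in \mathbb{N}_0 :\; x_t \in \mathcal{X}_s\big] = 1
% Almost-sure persistence (co-B\"uchi): from some time on, the system stays in X_s.
\mathbb{P}_{x_0}\big[\exists\, T \in \mathbb{N}_0\ \forall\, t \ge T :\; x_t \in \mathcal{X}_s\big] = 1
```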
In this work, we present the following three contributions.
1. Theoretical contributions. In this work, we introduce stabilizing ranking
supermartingales (sRSMs) and prove that they certify a.s. asymptotic stability
in discrete-time stochastic systems even when the stabilizing set is not
assumed to be closed under system dynamics. The key novelty of our sRSMs
compared to RSMs is that they also impose an expected decrease condition
within a part of the stabilizing region. The additional condition ensures that,
once entered, the agent leaves the stabilizing region with probability at most p < 1. Thus, we show that the probability of the agent entering and leaving the stabilizing region N times is at most p^N, which by letting N → ∞ implies
that the agent eventually stabilizes within the region with probability 1. The
key conceptual novelty is that we combine the convergence results of RSMs
which were also exploited in [37] with a concentration bound on the supremum
value of a supermartingale process. This combined reasoning allows us to
formally guarantee a.s. asymptotic stability even for systems in which the
stabilizing region is not closed under system dynamics. We remark that
our proof that sRSMs certify a.s. asymptotic stability is not an immediate
application of results from martingale theory, but that it introduces a novel
method to reason about eventual stabilization within a set. We present this
novel method in the proof of Theorem 1. Finally, we show that sRSMs not
only present qualitative results to certify a.s. asymptotic stability but also
present quantitative upper bounds on the number of time steps that the
system may spend outside of the stabilization set prior to stabilization.
2. Algorithmic contributions. Following our theoretical results on sRSMs,
we present an algorithm for learning a control policy jointly with an sRSM
that certifies a.s. asymptotic stability. The method parametrizes both the
policy and the sRSM as neural networks and draws insight from established
procedures for learning neural network Lyapunov functions [14] and RSMs [37]. It loops between a learner module that jointly trains a policy and
an sRSM candidate and a verifier module that certifies the learned sRSM
candidate by formally checking whether all sRSM conditions are satisfied.
If the sRSM candidate violates some sRSM conditions, the verifier module
produces counterexamples that are added to the learner module’s training set
to guide the learner in the next loop iteration. Otherwise, if the verification
is successful and the algorithm outputs a policy, then the policy guarantees
a.s. asymptotic stability. By fixing the control policy and only learning and
verifying the sRSM, our algorithm can also be used to verify that a given
control policy guarantees a.s. asymptotic stability. This verification procedure
only requires that the control policy is a Lipschitz continuous function.
3. Experimental contributions. We experimentally evaluate our learning procedure on two stochastic RL tasks in which the stabilizing region is not closed
under system dynamics and show that our learning procedure successfully
learns control policies with a.s. asymptotic stability guarantees for both tasks.
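The following Python sketch makes the learner-verifier loop of contribution 2 concrete. It is a minimal schematic, not the paper's implementation: `train_step` and `find_counterexamples` are hypothetical callables that stand in for the learner's gradient-based training and for the formal verifier of Section 4, respectively.

```python
def learn_stabilizing_policy(train_step, find_counterexamples,
                             policy_net, srsm_net, initial_states, max_iters=100):
    """Schematic CEGIS-style learner-verifier loop (illustration only)."""
    train_set = list(initial_states)
    for _ in range(max_iters):
        # Learner: jointly update the policy and the sRSM candidate on train_set,
        # penalizing violations of the sRSM conditions at the collected states.
        train_step(policy_net, srsm_net, train_set)
        # Verifier: formally check the sRSM conditions; returns violating states, if any.
        counterexamples = find_counterexamples(policy_net, srsm_net)
        if not counterexamples:
            # Verification succeeded: the learned policy comes with an sRSM certificate.
            return policy_net, srsm_net
        # Otherwise, feed the counterexamples back to the learner and repeat.
        train_set.extend(counterexamples)
    return None  # no certificate found within the iteration budget
```

Fixing `policy_net` and updating only `srsm_net` inside `train_step` turns the same loop into the verification procedure for a given Lipschitz continuous policy.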
Organization. The rest of this work is organized as follows. Section 2 contains
preliminaries. In Section 3, we introduce our novel notion of stabilizing ranking su-
permartingales and prove that they provide a sound certificate for a.s. asymptotic
stability, which is the main theoretical contribution of our work. In Section 4, we
present the learner-verifier procedure for jointly learning a control policy together
with an sRSM that formally certifies a.s. asymptotic stability. In Section 5, we
experimentally evaluate our approach. We survey related work in Section 6.
Finally, we conclude in Section 7.
2 Preliminaries
We consider a discrete-time stochastic dynamical system of the form
x_{t+1} = f(x_t, π(x_t), ω_t),

where f : X × U × N → X is a dynamics function, π : X → U is a control policy, and ω_t ∈ N is a stochastic disturbance vector. Here, we use X ⊆ R^n to denote the state space, U ⊆ R^m the action space, and N ⊆ R^p the stochastic disturbance space of the system. In each time step, ω_t is sampled according to a probability distribution d over N, independently from the previous samples.
A sequence (x_t, u_t, ω_t)_{t ∈ N_0} of state-action-disturbance triples is a trajectory of the system if u_t = π(x_t), ω_t ∈ support(d), and x_{t+1} = f(x_t, u_t, ω_t) hold for each t ∈ N_0. For each state x_0 ∈ X, the system induces a Markov process and defines a probability space over the set of all trajectories that start in x_0 [41], with the probability measure and expectation operators P_{x_0} and E_{x_0}.
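As a purely illustrative example of the system model (not taken from the paper), the following sketch samples a finite trajectory prefix of x_{t+1} = f(x_t, π(x_t), ω_t) for a toy one-dimensional system; the dynamics `f`, the linear `policy`, and the Gaussian disturbance distribution are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u, w):
    # Toy 1-D dynamics chosen for illustration: damped state plus control plus noise.
    return 0.9 * x + 0.1 * u + w

def policy(x):
    # Placeholder linear feedback policy pi(x); the paper learns pi as a neural network.
    return -0.5 * x

def sample_trajectory(x0, horizon=50):
    """Sample a finite prefix (x_t, u_t, w_t) of a trajectory of the system."""
    xs, us, ws = [x0], [], []
    x = x0
    for _ in range(horizon):
        u = policy(x)
        w = rng.normal(0.0, 0.05)  # disturbance w_t ~ d, sampled i.i.d. in each step
        x = f(x, u, w)
        us.append(u)
        ws.append(w)
        xs.append(x)
    return xs, us, ws

xs, _, _ = sample_trajectory(x0=1.0)
print(xs[-1])  # the state after 50 steps stays close to 0 for this toy system
```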
Assumptions. The state space X ⊆ R^n, the action space U ⊆ R^m, and the stochastic disturbance space N ⊆ R^p are all assumed to be Borel-measurable. Furthermore, we assume that the system has a bounded maximal step size under any policy π, i.e. that there exists Δ > 0 such that for every x ∈ X, ω ∈ N, and policy π we have ||x − f(x, π(x), ω)||_1 ≤ Δ. Note that this is a realistic
assumption that is satisfied in many real-world scenarios, e.g. a self-driving car
can only traverse a certain maximal distance within each time step, whose bound depends on the maximal speed that the car can attain.
For our learning procedure in Section 4, we assume that X ⊆ R^n is compact and that f is Lipschitz continuous, which are common assumptions in control theory. Given two metric spaces (X, d_X) and (Y, d_Y), a function g : X → Y is said to be Lipschitz continuous if there exists a constant L > 0 such that for every x_1, x_2 ∈ X we have d_Y(g(x_1), g(x_2)) ≤ L · d_X(x_1, x_2). We say that L is a Lipschitz constant of g. For the verification procedure when the control policy π is given, we also assume that π is Lipschitz continuous. This is also a common assumption in control theory and RL that allows for a rich class of policies including neural network policies, as all standard activation functions such as ReLU, sigmoid or tanh are Lipschitz continuous [47]. Finally, in Section 4 we assume that the stochastic disturbance space N is bounded or that d is a product of independent univariate distributions, which is needed for efficient sampling and expected value computation.
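To illustrate why the Lipschitz assumption is compatible with neural network policies, the sketch below computes a (generally loose) Lipschitz upper bound for a small fully connected ReLU network as the product of the induced 1-norms of its weight matrices. This is a standard bound and only an illustration; it is not necessarily the bound computed by the verifier in Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Weight matrices of a small fully connected ReLU network pi: R^2 -> R (toy values).
weights = [rng.normal(size=(16, 2)), rng.normal(size=(16, 16)), rng.normal(size=(1, 16))]

def lipschitz_upper_bound(weight_matrices):
    """Loose Lipschitz bound w.r.t. the l1-norm: product of induced matrix 1-norms.

    Each layer x -> act(W x + b) is Lipschitz with constant at most ||W||_1,
    because ReLU, sigmoid, and tanh are all 1-Lipschitz.
    """
    bound = 1.0
    for W in weight_matrices:
        bound *= np.linalg.norm(W, ord=1)  # induced 1-norm = max absolute column sum
    return bound

print(lipschitz_upper_bound(weights))
```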
Almost-sure asymptotic stability. There are several notions of stability in stochastic systems. In this work, we consider the notion of almost-sure asymptotic stability [33], which requires the system to eventually converge to and stay within the stabilizing set. In order to define this formally, for each x ∈ X let d(x, X_s) = inf_{x_s ∈ X_s} ||x − x_s||_1, where || · ||_1 is the l_1-norm on R^n.

Definition 1. A Borel-measurable set X_s ⊆ X is almost-surely (a.s.) asymptotically stable if for each initial state x_0 ∈ X we have

P_{x_0}[ lim_{t→∞} d(x_t, X_s) = 0 ] = 1.
The above definition slightly differs from that of [33], which considers the special case of a singleton X_s = {0}. The reason for this difference is that, analogously to [37] and to the existing works on learning stabilizing policies in deterministic systems [7,42,14], we need to consider stability with respect to an
open neighborhood of the origin for learning to be numerically stable.
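For intuition only, Definition 1 can be probed empirically by simulating trajectories of a fixed closed-loop system and checking how close their tails stay to X_s. This yields statistical evidence rather than the formal probability-1 guarantee that the certificates of Section 3 provide; the dynamics, policy, and stabilizing set below are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def step(x, w):
    # Placeholder closed-loop dynamics f(x, pi(x), w) for a toy 1-D system.
    return 0.9 * x - 0.05 * np.tanh(x) + w

def dist_to_Xs(x, radius=0.2):
    # l1-distance to the stabilizing set X_s = [-radius, radius].
    return max(abs(x) - radius, 0.0)

def empirical_stability_rate(x0, horizon=200, tail=20, trials=1000, tol=1e-2):
    """Fraction of simulated trajectories whose last `tail` states remain tol-close to X_s."""
    stable = 0
    for _ in range(trials):
        x, dists = x0, []
        for _ in range(horizon):
            x = step(x, rng.normal(0.0, 0.02))
            dists.append(dist_to_Xs(x))
        if max(dists[-tail:]) <= tol:
            stable += 1
    return stable / trials

print(empirical_stability_rate(x0=1.0))  # expected to be 1.0 or very close for this toy system
```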
3 Theoretical Results
We now introduce our novel notion of stabilizing ranking supermartingales
(sRSMs). We then show that sRSMs can be used to formally certify a.s. asymptotic
stability with respect to a fixed policy
π
without requiring that the stabilizing set
is closed under system dynamics. To that end, in this section we assume that the
policy
π
is fixed. In the next section, we will then present our learning procedure.
Prior work – ranking supermartingales (RSMs). In order to motivate our sRSMs and to explain their novelty, we first recall ranking supermartingales (RSMs) [10] that were used in [37] for certifying a.s. asymptotic stability under a given policy π, when the stabilizing set is assumed to be closed under system dynamics.
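For orientation, the core RSM conditions have roughly the following shape (our paraphrase of the standard formulation in [10,37]; the precise conditions stated in those works may differ in details): a function η : X → R is an RSM for X_s if, for some ε > 0,

```latex
\eta(x) \ge 0 \quad \text{for all } x \in \mathcal{X}, \qquad
\mathbb{E}_{\omega \sim d}\big[\eta(f(x, \pi(x), \omega))\big] \le \eta(x) - \varepsilon
\quad \text{for all } x \in \mathcal{X} \setminus \mathcal{X}_s.
```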