the system’s safety. A more recent paradigm in safe RL considers constrained Markov decision processes (cMDPs) [4,26,50,3,20], which are equipped with both
a reward function and an auxiliary cost function. The goal of these works is then
to maximize expected reward while keeping expected cost below some tolerable
threshold. While these methods do enhance safety, they only ensure empirically
that the expected cost function is below the threshold and do not provide any
formal guarantees on constraint satisfaction.
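As a rough sketch in our own notation (not a formulation taken verbatim from these works), the constrained objective has the form
\[
\max_{\pi} \; \mathbb{E}_{\pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big] \quad \text{subject to} \quad \mathbb{E}_{\pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\Big] \le d,
\]
where $r$ denotes the reward, $c$ the auxiliary cost, $\gamma$ a discount factor, and $d$ the tolerable threshold.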
This is particularly concerning for safety-critical applications, in which unsafe
behavior of the system might have fatal consequences. Thus, a fundamental
challenge for deploying learning-based methods in safety-critical autonomous
systems applications is formally certifying safety of learned control policies [5,25].
Stability is a fundamental safety constraint in control theory, which requires
the system to converge to and eventually stay within some specified stabilizing
region with probability 1, a.k.a. almost-sure (a.s.) asymptotic stability [31,33].
Most existing research on learning policies for a control system with formal
guarantees on stability considers deterministic systems and employs Lyapunov
functions [31] for certifying the system’s stability. In particular, a Lyapunov function is learned jointly with the control policy [7,42,14,1]. Informally, a Lyapunov
function is a function that maps system states to nonnegative real numbers
whose value decreases after every one-step evolution of the system until the
stabilizing region is reached. Recently, [37] proposed a procedure for learning ranking supermartingales (RSMs) [10] for certifying a.s. asymptotic
stability in discrete-time stochastic systems. RSMs generalize Lyapunov functions
to supermartingale processes in probability theory [54] and decrease in value in
expectation upon every one-step evolution of the system.
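To make the two certificates concrete, the following is a sketch of their core decrease conditions in our notation (the cited works impose further technical conditions). For a deterministic system $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \pi(\mathbf{x}_t))$ with stabilizing region $\mathcal{X}_s$, a Lyapunov function $V \ge 0$ satisfies
\[
V\big(f(\mathbf{x}, \pi(\mathbf{x}))\big) < V(\mathbf{x}) \quad \text{for every } \mathbf{x} \notin \mathcal{X}_s,
\]
whereas for a stochastic system $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \pi(\mathbf{x}_t), \omega_t)$ with stochastic disturbance $\omega_t$, an RSM $V \ge 0$ satisfies the expected decrease condition
\[
\mathbb{E}_{\omega}\Big[V\big(f(\mathbf{x}, \pi(\mathbf{x}), \omega)\big)\Big] \le V(\mathbf{x}) - \epsilon \quad \text{for every } \mathbf{x} \notin \mathcal{X}_s,
\]
for some fixed $\epsilon > 0$.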
While these works present significant advances in learning control policies with
formal stability guarantees as well as formal stability verification, they are either
only applicable to deterministic systems or assume that the stabilizing set is closed
under system dynamics, i.e., the agent cannot leave it once entered. In particular,
the work of [37] reduces stability in stochastic systems to an a.s. reachability condition by assuming that the agent cannot leave the stabilizing set. However,
this assumption may not hold in real-world settings because the agent may be
able to leave the stabilizing set with some positive probability due to the existence
of stochastic disturbances (see Figure 1). We illustrate the importance of relaxing
this assumption on the classical example of balancing a pendulum in the upright
position, which we also study in our experimental evaluation. The closedness under system dynamics assumption implies that, once the pendulum is in the upright position, it is guaranteed to stay upright and not move away. However, this is not a very realistic assumption, due to the possible existence of minor disturbances which the controller needs to balance out. The closedness under system dynamics assumption essentially presupposes a balancing control policy that takes care of this problem. In contrast, our method does not assume such a
balancing policy and learns a control policy which ensures that, with probability 1, both (1) the pendulum reaches the upright position and (2) the pendulum eventually stays upright.