the system’s safety. A more recent paradigm in safe RL considers constrained Markov decision processes (cMDPs) [4,26,50,3,20], which are equipped with both
a reward function and an auxiliary cost function. The goal of these works is then
to maximize expected reward while keeping expected cost below some tolerable
threshold. While these methods do enhance safety, they only ensure empirically
that the expected cost function is below the threshold and do not provide any
formal guarantees on constraint satisfaction.
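As a rough sketch in our own notation (not a formulation taken verbatim from these works), the constrained objective has the form
\[
\max_{\pi} \; \mathbb{E}_{\pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big] \quad \text{subject to} \quad \mathbb{E}_{\pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\Big] \le d,
\]
where $r$ denotes the reward, $c$ the auxiliary cost, $\gamma$ a discount factor, and $d$ the tolerable threshold.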
This is particularly concerning for safety-critical applications, in which unsafe
behavior of the system might have fatal consequences. Thus, a fundamental
challenge for deploying learning-based methods in safety-critical autonomous
systems applications is formally certifying safety of learned control policies [5,25].
Stability is a fundamental safety constraint in control theory, which requires
the system to converge to and eventually stay within some specified stabilizing
region with probability 1, a.k.a. almost-sure (a.s.) asymptotic stability [31,33].
Most existing research on learning policies for a control system with formal
guarantees on stability considers deterministic systems and employs Lyapunov
functions [31] for certifying the system’s stability. In particular, a Lyapunov function is learned jointly with the control policy [7,42,14,1]. Informally, a Lyapunov
function is a function that maps system states to nonnegative real numbers
whose value decreases after every one-step evolution of the system until the
stabilizing region is reached. Recently, [37] proposed a procedure for learning ranking supermartingales (RSMs) [10] for certifying a.s. asymptotic
stability in discrete-time stochastic systems. RSMs generalize Lyapunov functions
to supermartingale processes in probability theory [54] and decrease in value in
expectation upon every one-step evolution of the system.
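To make the two certificates concrete, the following is a sketch of their core decrease conditions in our notation (the cited works impose further technical conditions). For a deterministic system $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \pi(\mathbf{x}_t))$ with stabilizing region $\mathcal{X}_s$, a Lyapunov function $V \ge 0$ satisfies
\[
V\big(f(\mathbf{x}, \pi(\mathbf{x}))\big) < V(\mathbf{x}) \quad \text{for every } \mathbf{x} \notin \mathcal{X}_s,
\]
whereas for a stochastic system $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \pi(\mathbf{x}_t), \omega_t)$ with stochastic disturbance $\omega_t$, an RSM $V \ge 0$ satisfies the expected decrease condition
\[
\mathbb{E}_{\omega}\Big[V\big(f(\mathbf{x}, \pi(\mathbf{x}), \omega)\big)\Big] \le V(\mathbf{x}) - \epsilon \quad \text{for every } \mathbf{x} \notin \mathcal{X}_s,
\]
for some fixed $\epsilon > 0$.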
While these works present significant advances in learning control policies with
formal stability guarantees as well as formal stability verification, they are either
only applicable to deterministic systems or assume that the stabilizing set is closed
under system dynamics, i.e., the agent cannot leave it once entered. In particular,
the work of [37] reduces stability in stochastic systems to an a.s. reachability condition by assuming that the agent cannot leave the stabilizing set. However,
this assumption may not hold in real-world settings because the agent may be
able to leave the stabilizing set with some positive probability due to the existence
of stochastic disturbances (see Figure 1). We illustrate the importance of relaxing
this assumption on the classical example of balancing a pendulum in the upright
position, which we also study in our experimental evaluation. The closedness under system dynamics assumption implies that, once the pendulum is in the upright position, it is guaranteed to stay upright and not move away. However, this is not a very realistic assumption, due to the possible existence of minor disturbances which the controller needs to balance out. The closedness under system dynamics assumption essentially presupposes a balancing control policy that takes care of this problem. In contrast, our method does not assume such a
balancing policy and learns a control policy which ensures that, with probability 1, both (1) the pendulum reaches the upright position and (2) the pendulum eventually stays upright.