
Moving away from our specific example, the above five properties would be ideal for safe RL in many
applications where safety is naturally human-defined. Unfortunately, there are no existing safe RL methods
that can provably satisfy all of these properties. In particular, existing safe RL methods all fail to satisfy
these properties for at least one of the following reasons: (1) they assume a safety function is fully known
(e.g. Chow et al., 2018; Simão et al., 2021); (2) they learn safety using numeric feedback (e.g. Wachi and
Sui, 2020; Amani et al., 2021); or (3) they require a human-in-the-loop who can intervene and monitor the
safety of every action taken in real time (Saunders et al., 2018). In addition, none of these existing methods
address the issue of asking for only minimal feedback, an important consideration that places the problem
at the interface of RL and active learning.
In this paper, we present a novel safe RL framework encompassing the above concerns, involving an
offline safety-labeling oracle that can provide binary feedback on the safety of state-action pairs in between
episodes. We provide a meta-algorithm, SABRE, which can be applied to any MDP class within this
safety framework, given a blackbox RL algorithm for that MDP class. This algorithm utilizes ideas from
disagreement-based active learning, iteratively exploring the state space to find all regions where there is
disagreement on possible safety. Under appropriate technical assumptions, including that the blackbox
RL algorithm can provably optimize any given reward, SABRE will satisfy all of the above five properties,
and will return an approximately optimal safe policy with high probability. Importantly, we provide high-
probability bounds on the number of samples and calls to the labeling oracle needed, and show that they
are both polynomial, and that the latter is lower order than the former. Finally, we provide some discussion
of how this meta-algorithm approach may be applied in various settings of both theoretical and practical
interest.
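To make the high-level idea concrete, the following is a purely illustrative sketch (not SABRE itself, which is specified later in the paper) of a disagreement-driven loop of this kind; the interfaces `version_space`, `blackbox_rl`, and `oracle` are assumed stand-ins for exposition.

```python
# Hypothetical illustration only -- not the SABRE algorithm. All interfaces
# (version_space, blackbox_rl, oracle) are assumed stand-ins for exposition.

def disagreement_driven_safety_loop(version_space, blackbox_rl, oracle, n_iters):
    """Iteratively explore where candidate safety functions disagree,
    query the offline oracle between episodes, then optimize reward
    over actions that every remaining safety hypothesis deems safe."""
    for _ in range(n_iters):
        # Reward the blackbox RL algorithm for reaching (s, a) pairs whose
        # safety is still ambiguous under the current version space.
        exploration_reward = lambda s, a: float(version_space.disagrees_on(s, a))
        visited = blackbox_rl.explore(exploration_reward)

        # Offline, between episodes: binary safety labels on ambiguous pairs.
        labels = {(s, a): oracle.is_safe(s, a) for (s, a) in visited}
        version_space.update(labels)

    # Finally, optimize the true reward restricted to certified-safe actions.
    return blackbox_rl.optimize(safe_set=version_space.certified_safe())
```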
Math Notation For any natural number $n \in \mathbb{N}$, we let $[n]$ denote the set $\{1, 2, \cdots, n\}$. For any countable
set $\mathcal{X}$, we let $\Delta(\mathcal{X})$ denote the space of all probability distributions over $\mathcal{X}$. Given sets $\mathcal{X}$ and $\mathcal{Y}$, we let
$\mathcal{X} \to \mathcal{Y}$ denote the set of all functions from $\mathcal{X}$ to $\mathcal{Y}$. Lastly, $\mathrm{sign}(y)$ denotes the sign function, which takes a
value of $1$ if $y > 0$, a value of $0$ at $y = 0$, and $-1$ if $y < 0$.
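As a trivial illustration of this notation, the helpers below (hypothetical names, not defined in the paper) implement $[n]$ and $\mathrm{sign}(\cdot)$ in Python.

```python
# Minimal illustration of the notation above; `bracket` and `sign` are
# hypothetical helper names, not anything defined in the paper.

def bracket(n: int) -> list[int]:
    """[n] = {1, 2, ..., n}."""
    return list(range(1, n + 1))

def sign(y: float) -> int:
    """sign(y) = 1 if y > 0, 0 if y == 0, and -1 if y < 0."""
    return (y > 0) - (y < 0)
```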
2 Problem Setup
Markov Decision Process We consider learning in finite-horizon Markov decision processes (MDPs). A
reward-free MDP is characterized by a tuple $(\mathcal{S}, \mathcal{A}, T, \mu, H)$, where $\mathcal{S}$ is a given (potentially infinitely large)
state space, $\mathcal{A}$ is a finite action space with $|\mathcal{A}| = A$, $T : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is a transition operator, $\mu \in \Delta(\mathcal{S})$ is
an initial state distribution, and $H \in \mathbb{N}$ is the horizon. An MDP (with reward) is a tuple $(\mathcal{S}, \mathcal{A}, T, R, \mu, H)$,
where $R : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the reward function. We assume that the transition operator and reward function
are time-homogeneous; i.e., $T$ and $R$ do not depend explicitly on the time index.¹ We will let $(M, R)$ refer
to the MDP we are learning in, where $M$ is the corresponding reward-free MDP.
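As a purely illustrative encoding (not part of the paper's formalism), one might represent these tuples by their samplers; all class and field names below are assumptions.

```python
# Illustrative encoding of the tuples (S, A, T, mu, H) and (S, A, T, R, mu, H).
# States are left opaque since S may be infinite; all names here are assumptions.
from dataclasses import dataclass
from typing import Any, Callable

State = Any
Action = int        # actions identified with {0, ..., A - 1}

@dataclass
class RewardFreeMDP:
    A: int                                   # number of actions, |A|
    T: Callable[[State, Action], State]      # sampler for s' ~ T(. | s, a)
    mu: Callable[[], State]                  # sampler for s_1 ~ mu(.)
    H: int                                   # horizon

@dataclass
class MDP:
    M: RewardFreeMDP                         # reward-free part
    R: Callable[[State, Action], float]      # time-homogeneous reward in [0, 1]
```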
The agent interacts with an MDP over a series of rounds. In each round the agent successively takes
actions according to some policy, which is a mapping from states to distributions over actions, in order to generate an episode.
Specifically, in the $n$'th round for each $n \in \mathbb{N}$, the agent selects some policy $\pi^n \in \mathcal{S} \to \Delta(\mathcal{A})$, which it then
uses in order to generate an episode $\tau^n = (s^n_1, a^n_1, r^n_1, \ldots, s^n_H, a^n_H, r^n_H, s^n_{H+1})$, where $s^n_1 \sim \mu(\cdot)$, and for each
$h \in [H]$ we have $a^n_h \sim \pi^n(\cdot \mid s^n_h)$, $r^n_h = R(s^n_h, a^n_h)$, and $s^n_{h+1} \sim T(\cdot \mid s^n_h, a^n_h)$.
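Under the illustrative encoding above, one round of interaction amounts to a simple rollout loop; `pi(s)` is assumed to return a sampled action $a \sim \pi(\cdot \mid s)$.

```python
# Sketch of generating one episode with a policy pi: S -> Delta(A),
# assuming the illustrative MDP encoding above.

def rollout(mdp, pi):
    """Generate tau = [(s_1, a_1, r_1), ..., (s_H, a_H, r_H), (s_{H+1},)],
    where pi(s) returns a sampled action a ~ pi(. | s)."""
    s = mdp.M.mu()                      # s_1 ~ mu(.)
    episode = []
    for _ in range(mdp.M.H):
        a = pi(s)                       # a_h ~ pi(. | s_h)
        r = mdp.R(s, a)                 # r_h = R(s_h, a_h)
        episode.append((s, a, r))
        s = mdp.M.T(s, a)               # s_{h+1} ~ T(. | s_h, a_h)
    episode.append((s,))                # trailing state s_{H+1}
    return episode
```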
For any given policy $\pi$, we let $\mathbb{E}_\pi[\cdot]$ and $\mathbb{P}_\pi(\cdot)$ denote the expectation and probability operators over
trajectories generated using $\pi$ in the MDP $(M, R)$. Also, for any policy $\pi$, we define the value function
$V^\pi \in \mathcal{S} \to \mathbb{R}$ according to $V^\pi(s) = \mathbb{E}_\pi[\sum_{h=1}^{H} R(s_h, a_h) \mid s_1 = s]$, and we similarly define the value of the
policy according to $V(\pi) = \mathbb{E}_{s \sim \mu(\cdot)}[V^\pi(s)]$.
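Since $V(\pi)$ is an expectation of the cumulative reward over trajectories, it can be approximated by averaging returns over Monte Carlo rollouts, as in the sketch below (reusing the hypothetical `rollout` helper above).

```python
# Monte Carlo estimate of V(pi) = E_{s ~ mu}[V^pi(s)], reusing `rollout` above.

def estimate_value(mdp, pi, n_episodes: int = 1000) -> float:
    total = 0.0
    for _ in range(n_episodes):
        episode = rollout(mdp, pi)
        total += sum(r for (_, _, r) in episode[:-1])   # sum of the H rewards
    return total / n_episodes
```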
Our main metric of success for reinforcement learning is suboptimality (SubOpt). For any policy class
$\Pi \subseteq \mathcal{S} \to \Delta(\mathcal{A})$ and any policy $\hat{\pi} \in \Pi$, we define $\mathrm{SubOpt}(\hat{\pi}; \Pi) = \sup_{\pi \in \Pi} V(\pi) - V(\hat{\pi})$. Then, one of our
primary goals is to learn a policy $\hat{\pi}$ that has low suboptimality with respect to a given set of policies that it
must choose from, in a relatively small number of episodes.
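For a small, finite policy class $\Pi$, this quantity could be estimated directly by comparing value estimates, as sketched below; this is for illustration only, since enumerating $\Pi$ is infeasible in general.

```python
# Illustrative estimate of SubOpt(pi_hat; Pi) for a small, finite policy class Pi.

def suboptimality(mdp, pi_hat, Pi, n_episodes: int = 1000) -> float:
    """sup_{pi in Pi} V(pi) - V(pi_hat), with values estimated by rollouts."""
    best = max(estimate_value(mdp, pi, n_episodes) for pi in Pi)
    return best - estimate_value(mdp, pi_hat, n_episodes)
```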
Over the past few decades, many reinforcement learning algorithms have been developed that can provably
obtain low suboptimality using a small number of episodes, for various classes of MDPs. Our focus is on
adapting such RL algorithms, in order to additionally take into account safety considerations. For these
¹This is w.l.o.g. since we can always include the current time index in the observation definition.