
estimates the utility of taking a certain action in the environment, which serves as a supervision signal for the policy, also called the actor. The SLAC method has excellent sample efficiency in the safety-agnostic partially observable setting, which makes it a promising candidate to adapt to high-dimensional settings with safety constraints. At its core, SLAC is an actor-critic approach, so it extends naturally to account for safety through a safety critic. We extend SLAC in three ways to create our safe RL approach for partially observable settings (Safe SLAC): (1) the latent variable model also predicts cost violations, (2) we learn a safety critic that predicts the discounted cost return, and (3) we modify the policy training procedure to optimize a safety-constrained objective via a Lagrangian relaxation, solved with dual gradient descent on the primary objective and the Lagrange multiplier, which circumvents the inherent difficulty of constrained optimization.
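For concreteness, the sketch below illustrates one way to implement such a Lagrangian-relaxed actor update with dual gradient descent on the multiplier. All names (reward_q, safety_q, cost_budget) and the log-parameterization of the multiplier are illustrative assumptions, not the exact Safe SLAC implementation.

```python
import torch

# Minimal sketch of a Lagrangian relaxation solved with dual gradient descent.
# All names and hyperparameters are illustrative, not the exact Safe SLAC code.
log_lambda = torch.zeros(1, requires_grad=True)      # log of the Lagrange multiplier
lambda_optimizer = torch.optim.Adam([log_lambda], lr=3e-4)

def actor_loss(reward_q, safety_q):
    # Primal step: maximize the reward critic while penalizing the safety
    # critic's predicted discounted cost return, weighted by the multiplier.
    lam = log_lambda.exp().detach()                  # held constant during the actor update
    return (-reward_q + lam * safety_q).mean()

def multiplier_loss(safety_q, cost_budget):
    # Dual step: grow the multiplier when the predicted cost return exceeds
    # the budget, shrink it otherwise (gradient descent on the negated dual).
    violation = (safety_q - cost_budget).detach().mean()
    return -(log_lambda.exp() * violation)
```

In each iteration, one would step the actor optimizer on actor_loss and lambda_optimizer on multiplier_loss; parameterizing the multiplier by its logarithm is one common way to keep it nonnegative.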
We evaluate Safe SLAC using a set of benchmark environments introduced by Ray et al. [2019]. The empirical
evaluation shows competitive results compared with complex state-of-the-art approaches.
2 Related work
Established baseline algorithms for safe reinforcement learning in the fully observable setting include constrained policy optimization [CPO; Achiam et al., 2017] and TRPO-Lagrangian [Ray et al., 2019], a cost-constrained variant of trust region policy optimization [TRPO; Schulman et al., 2015].
While TRPO-Lagrangian uses an adaptive Lagrange multiplier to solve the constrained problem with primal-dual
optimization, CPO solves the problem of constraint satisfaction analytically during the policy update.
The method most closely related to ours is the Lagrangian model-based agent [LAMBDA; As et al., 2022], which also addresses the problem of learning a safe policy from pixel observations under high partial observability. LAMBDA uses the partially stochastic dynamics model introduced by Hafner et al. [2019]. The authors take a Bayesian approach to the dynamics model, sampling from the posterior over parameters to obtain different instantiations of this model. For each instantiation, simulated trajectories are sampled. Then, the worst cost return and the best reward return are used to train critic functions that provide a gradient to the policy. LAMBDA shows performance competitive with baseline algorithms; however, it comes with two major trade-offs. First, by taking a pessimistic approach, the learned policy attains a lower cost return than the allowed cost budget. A less pessimistic approach that uses the entire allowed cost budget may yield a constraint-satisfying policy with a higher reward return. Second, the LAMBDA training procedure involves generating many samples from its latent variable model to estimate the optimistic/pessimistic temporal difference updates.
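As a rough illustration of this ensemble-style optimism/pessimism (not LAMBDA's actual implementation; shapes and names are assumptions), the aggregation over sampled model instantiations might look as follows.

```python
import torch

# Illustrative only: given returns simulated under K dynamics models sampled
# from an (assumed) posterior, take the best reward return and the worst cost
# return per starting state. Input shapes: [K, batch].
def optimistic_pessimistic_targets(reward_returns, cost_returns):
    best_reward = reward_returns.max(dim=0).values   # optimistic w.r.t. reward
    worst_cost = cost_returns.max(dim=0).values      # pessimistic w.r.t. cost
    return best_reward, worst_cost
```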
While the reinforcement learning literature covers numerous safety perspectives [García and Fernández, 2015, Pecka and Svoboda, 2014], we focus on constraining the behavior of the agent in expectation. A class of methods known as shielding ensures safety already during training by relying on temporal logic specifications of safety [Alshiekh et al., 2018, Jansen et al., 2020]. Such methods, however, require extensive prior knowledge in the form of a (partial) model of the environment [Carr et al., 2022].
3 Constrained partially observable Markov decision processes
In reinforcement learning, an agent learns to sequentially interact with an environment to maximize some signal of utility. This problem setting is typically modeled as a Markov decision process [MDP; Sutton and Barto, 2018], in which the environment is composed of a set of states $\mathcal{S}$ and a set of actions $\mathcal{A}$. At each timestep $t$, the agent receives the current environment state $s_t \in \mathcal{S}$ and executes an action $a_t \in \mathcal{A}$ according to the policy $\pi$: $a_t \sim \pi(a_t \mid s_t)$. This action results in a new state according to the transition dynamics $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$ and a scalar reward signal $r_t = r(s_t, a_t) \in \mathbb{R}$, where $r$ is the reward function. The goal is for the agent to learn an optimal policy $\pi^*$ such that the expectation of the discounted, accumulated reward in the environment under that policy is maximized, i.e., $\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_t \gamma^t r_t\right]$ with $\gamma \in [0, 1)$. We use $\rho_\pi$ to denote the distribution over trajectories induced in the environment by a policy $\pi$.
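As a small worked example of the objective above, a Monte Carlo estimate of the discounted return of one sampled trajectory can be computed as follows; this is plain illustrative code, not part of any cited method.

```python
# Minimal sketch: discounted return of one trajectory for the objective above,
# accumulated backwards so each reward r_t is weighted by gamma**t.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: discounted_return([1.0, 0.0, 1.0]) == 1.0 + 0.99**2 == 1.9801
```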
In a partially observable Markov decision process [POMDP; Kaelbling et al., 1998], the agent cannot observe the true state $s_t$ of the MDP and instead receives an observation $x_t \in \mathcal{X}$ that provides partial information about the state, sampled from the observation function $x_t \sim X(s_t)$. In this setting, learning a policy $\pi$ is more difficult than in MDPs, since the true state of the underlying MDP is not known and must be inferred from sequences of observations for the policy to be optimal. Therefore, the optimal policy is a function of a history of observations and actions, $\pi(a_t \mid x_{1:t}, a_{1:t-1})$. In practice, representing such a policy can be infeasible, so the policy is often conditioned on a compact representation of the history.
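One common way to obtain such a compact representation is a recurrent encoder over the observation-action history, as in the sketch below. The architecture, module names, and dimensions are illustrative assumptions, not the model used later in this paper.

```python
import torch
import torch.nn as nn

# Minimal sketch: condition a policy on a compact summary of the
# observation-action history, here produced by a GRU. All dimensions and
# names are illustrative.
class HistoryPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(obs_dim + act_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, observations, prev_actions):
        # observations: [batch, T, obs_dim]; prev_actions: [batch, T, act_dim]
        history = torch.cat([observations, prev_actions], dim=-1)
        summary, _ = self.encoder(history)                # [batch, T, hidden_dim]
        return torch.tanh(self.head(summary[:, -1]))      # action from the latest summary
```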