SAFE REINFORCEMENT LEARNING FROM PIXELS
USING A STOCHASTIC LATENT REPRESENTATION
Yannick Hogewind    Thiago D. Simão    Tal Kachman    Nils Jansen
ABSTRACT
We address the problem of safe reinforcement learning from pixel observations. Inherent challenges
in such settings are (1) a trade-off between reward optimization and adhering to safety constraints,
(2) partial observability, and (3) high-dimensional observations. We formalize the problem in a
constrained, partially observable Markov decision process framework, where an agent obtains dis-
tinct reward and safety signals. To address the curse of dimensionality, we employ a novel safety
critic using the stochastic latent actor-critic (SLAC) approach. The latent variable model predicts
rewards and safety violations, and we use the safety critic to train safe policies. Using well-known
benchmark environments, we demonstrate performance competitive with existing approaches with
respect to computational requirements, final reward return, and satisfaction of the safety constraints.
1 Introduction
As reinforcement learning (RL) algorithms are increasingly applied in the real world [Mnih et al., 2015, Jumper et al.,
2021, Fu et al., 2021], their safety becomes ever more important as both model complexity and uncertainty increase.
Considerable effort has been made to increase the safety of RL [Liu et al., 2021]. However, major
challenges remain that prevent the deployment of RL in the real world [Dulac-Arnold et al., 2021]. Most approaches
to safe RL are limited to fully observable settings, neglecting issues such as noisy or imprecise sensors. Moreover,
realistic environments exhibit high-dimensional observation spaces and are largely out of reach for the state-of-the-
art. In this work, we present an effective safe RL approach that handles partial observability with high-dimensional
observation spaces in the form of pixel observations.
In line with prior work, we formalize the safety requirements using a constrained Markov decision process [CMDP;
Altman, 1999]. The objective is to learn a policy that maximizes a reward while constraining the expected return of a
scalar cost signal below a certain value [Achiam et al., 2017]. According to the reward hypothesis, it could be possible
to encode safety requirements directly in the reward signal. However, as argued by Ray et al. [2019], safe RL based
only on a scalar reward carries the issue of designing a suitable reward function. In particular, balancing the trade-off
between reward optimization and safety within a single reward is a difficult problem. Moreover, over-engineering
rewards to capture complex safety requirements risks triggering negative side effects that only surface after integration
into a broader system [Abbeel and Ng, 2005, Amodei et al., 2016]. Constrained RL addresses this issue via a
clear separation of reward and safety.
Reinforcement learning from pixels typically suffers from sample inefficiency, as it requires many interactions with the
environment. In the case of safe RL, improving the sample efficiency is especially crucial as each interaction with the
environment, before the agent reaches a safe policy, has an opportunity to cause harm [Zanger et al., 2021]. Moreover,
to ensure safety, there is an incentive to act pessimistically with regard to the cost [As et al., 2022]. This conservative
assessment of safety, in turn, may yield a lower reward performance than is possible within the safety constraints.
Our contribution. We introduce Safe SLAC, an extension of the Stochastic Latent Actor Critic approach [SLAC;
Lee et al., 2020]. SLAC learns a stochastic latent variable model of the environment dynamics, to address the fact
that optimal policies in partially observable settings must estimate the underlying state of the environment from the
observations. The model predicts the next observation, the next latent state, and the reward based on the current
observation and current latent state. This latent state inferred by the model is then used to provide the input for an
actor-critic approach [Konda and Tsitsiklis, 1999]. The actor-critic algorithm involves learning a critic function that
estimates the utility of taking a certain action in the environment, which serves as a supervision signal for the policy,
also named the actor. The SLAC method has excellent sample efficiency in the safety-agnostic partially observable
setting, which renders it a promising candidate to adapt to high-dimensional settings with safety constraints. At its
core, SLAC is an actor-critic approach, so it carries the potential for a natural extension to consider safety with a safety
critic. We extend SLAC in three ways to create our safe RL approach for partially observable settings (Safe SLAC):
(1) the latent variable model also predicts cost violations, (2) we learn a safety critic that predicts the discounted
cost return, and (3) we modify the policy training procedure to optimize a safety-constrained objective via a
Lagrangian relaxation, solved with dual gradient descent on the primary objective and the Lagrange multiplier, to
overcome the inherent difficulty of constrained optimization.
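To make the Lagrangian relaxation in contribution (3) concrete, the sketch below shows a common form of the dual update: the multiplier grows when the estimated cost return exceeds the budget d and shrinks otherwise, shifting the policy objective toward constraint satisfaction. It is a minimal illustration with assumed names (e.g., `cost_critic_estimate`), not the exact Safe SLAC update.

```python
import torch

# Minimal sketch of dual gradient ascent on the Lagrange multiplier.
# `cost_critic_estimate` is assumed to approximate the discounted cost return
# of the current policy; `cost_budget` plays the role of the threshold d.
log_lam = torch.zeros(1, requires_grad=True)          # optimize log(lambda) so lambda stays positive
lam_optimizer = torch.optim.Adam([log_lam], lr=3e-4)

def lagrange_multiplier_step(cost_critic_estimate: torch.Tensor, cost_budget: float) -> float:
    # Ascend in lambda on (cost - budget): lambda increases while the
    # constraint is violated and decreases once it is satisfied.
    lam_loss = -(log_lam.exp() * (cost_critic_estimate.detach() - cost_budget)).mean()
    lam_optimizer.zero_grad()
    lam_loss.backward()
    lam_optimizer.step()
    return log_lam.exp().item()
```

The policy update then maximizes the reward critic's estimate minus λ times the safety critic's estimate, so a larger multiplier trades reward for safety.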
We evaluate Safe SLAC using a set of benchmark environments introduced by Ray et al. [2019]. The empirical
evaluation shows competitive results compared with complex state-of-the-art approaches.
2 Related work
Established baseline algorithms for safe reinforcement learning in the fully observable setting include constrained
policy optimization [CPO; Achiam et al., 2017], as well as TRPO-Lagrangian [Ray et al., 2019], a cost-constrained
variant of trust region policy optimization [TRPO; Schulman et al., 2015].
While TRPO-Lagrangian uses an adaptive Lagrange multiplier to solve the constrained problem with primal-dual
optimization, CPO solves the problem of constraint satisfaction analytically during the policy update.
The method most closely related to ours is the Lagrangian model-based agent [LAMBDA; As et al., 2022], which also addresses
the problem of learning a safe policy from pixel observations under high partial observability. LAMBDA uses the
partially stochastic dynamics model introduced by Hafner et al. [2019]. The authors take a Bayesian approach to
the dynamics model, sampling from the posterior over parameters to obtain different instantiations of this model. For
each instantiation, simulated trajectories are sampled. Then, the worst cost return and best reward return are used to
train critic functions that provide a gradient to the policy. LAMBDA shows performance competitive with baseline
algorithms; however, it involves two major trade-offs. First, by taking a pessimistic approach, the learned policy attains
a lower cost return than the allowed cost budget. A less pessimistic approach that uses the entirety of the allowed cost
budget may yield a constraint-satisfying policy with a higher reward return. Second, the LAMBDA training procedure
involves generating many samples from its latent variable model to estimate the optimistic/pessimistic temporal difference
updates.
While the reinforcement learning literature offers numerous perspectives on safety [García and Fernández, 2015, Pecka and
Svoboda, 2014], we focus on constraining the behavior of the agent in expectation. A method called shielding ensures
safety already during training, using temporal logic specifications for safety [Alshiekh et al., 2018, Jansen et al., 2020].
Such methods, however, require extensive prior knowledge in the form of a (partial) model of the environment [Carr
et al., 2022].
3 Constrained partially observable Markov decision processes
In reinforcement learning, an agent learns to sequentially interact with an environment to maximize some signal of
utility. This problem setting is typically modeled as a Markov decision process [MDP; Sutton and Barto, 2018], in
which the environment is composed of a set of states $\mathcal{S}$ and a set of actions $\mathcal{A}$. At each timestep $t$, the agent receives
the current environment state $s_t \in \mathcal{S}$ and executes an action $a_t \in \mathcal{A}$, according to the policy $\pi$: $a_t \sim \pi(a_t \mid s_t)$.
This action results in a new state according to the transition dynamics $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$ and a scalar reward
signal $r_t = r(s_t, a_t) \in \mathbb{R}$, where $r$ is the reward function. The goal is for the agent to learn an optimal policy $\pi^\star$
such that the expectation of the discounted, accumulated reward in the environment under that policy is maximized, i.e.,
$\pi^\star = \arg\max_\pi \mathbb{E}\left[\sum_t \gamma^t r_t\right]$ with $\gamma \in [0, 1)$. We use $\rho_\pi$ to denote the distribution over trajectories induced in the
environment by a policy $\pi$.
In a partially observable Markov decision process [POMDP; Kaelbling et al., 1998], the agent cannot observe the
true state $s_t$ of the MDP and instead receives some observation $x_t \in \mathcal{X}$ that provides partial information about the
state, sampled from the observation function $x_t \sim X(s_t)$. In this setting, learning a policy $\pi$ is more difficult than in
MDPs since the true state of the underlying MDP is not known and must be inferred from sequences of observations
for the policy to be optimal. Therefore, the optimal policy is a function of a history of observations and actions,
$\pi(a_t \mid x_{1:t}, a_{1:t-1})$. In practice, representing such a policy can be infeasible, so the policy is often conditioned on a
compact representation of the history.
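As a simple illustration of such a compact history representation (not the latent-variable approach this paper builds on), a recurrent encoder can summarize past observations and actions into a fixed-size vector for the policy. The class below is a hypothetical sketch; architecture and sizes are assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch: a history-conditioned Gaussian policy pi(a_t | x_{1:t}, a_{1:t-1})
# that compresses the interaction history with a GRU.
class HistoryPolicy(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.GRU(obs_dim + action_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2 * action_dim)  # mean and log-std of the action distribution

    def forward(self, observations: torch.Tensor, prev_actions: torch.Tensor):
        # observations: (batch, t, obs_dim); prev_actions: (batch, t, action_dim)
        summary, _ = self.encoder(torch.cat([observations, prev_actions], dim=-1))
        mean, log_std = self.head(summary[:, -1]).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
```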
While the reward hypothesis states that desired agent behavior can plausibly be represented in a single scalar reward
signal, in practice it can be difficult to define a reward function that balances different objectives [Vamplew et al.,
2022, Roy et al., 2022]. The same is true for safety: as argued by Ray et al. [2019], a useful definition of safety
is a constraint on the behavior, which states the problem as a constrained MDP [CMDP; Altman, 1999], with analogous
definitions for constrained POMDPs [CPOMDP; Isom et al., 2008, Lee et al., 2018, Walraven and Spaan, 2018]. A
scalar cost variable $c_t \in \mathbb{R}$ at each time step of the C(PO)MDP, according to a cost function $c_t = c(s_t, a_t)$, serves
as a measure of safety violation. The objective of the reinforcement learning problem then changes to constrain the
accumulated cost under a given safety threshold $d \in \mathbb{R}$:

$$
\pi^\star = \arg\max_\pi \; \mathbb{E}\!\left[\sum_t \gamma^t r_t\right] \quad \text{s.t.} \quad \mathbb{E}\!\left[\sum_t c_t\right] < d. \tag{1}
$$
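As an illustration, both expectations in Equation (1) can be estimated by Monte Carlo averaging over sampled trajectories. The sketch below assumes trajectories are given as per-step reward and cost sequences; names and the budget value are illustrative only.

```python
import numpy as np

# Hedged sketch: empirical estimates of the two quantities in Eq. (1).
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_objective_and_constraint(trajectories, gamma=0.99, cost_budget=25.0):
    # trajectories: list of (rewards, costs) pairs, one pair per sampled episode
    reward_objective = np.mean([discounted_return(r, gamma) for r, _ in trajectories])
    expected_cost = np.mean([sum(c) for _, c in trajectories])  # undiscounted, as in Eq. (1)
    return reward_objective, expected_cost < cost_budget
```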
Next, we review RL algorithms for POMDPs and CMDPs, laying the foundations for the safe RL algorithm for
CPOMDPs with unknown environment dynamics that we propose in Section 5.
4 Stochastic latent actor-critic
Soft actor-critic [SAC; Haarnoja et al., 2018a] is an RL approach based on the maximum entropy framework. Besides
the traditional reinforcement learning objective of maximizing the reward, it also aims to maximize the entropy of the
policy. This encourages the agent to visit states that allow high-entropy action distributions. This approach can be robust
to disturbances in the dynamics of the environment [Eysenbach and Levine, 2022]. In practice, it can be challenging
to determine a weight for the entropy term in the objective a priori, so instead, the entropy is constrained to a minimum
value [Haarnoja et al., 2018b]. A stochastic policy is particularly interesting to our work since a deterministic policy
might be suboptimal in the CPOMDP setting [Kim et al., 2011].
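For reference, the entropy constraint of Haarnoja et al. [2018b] can itself be enforced with a dual update on the temperature: the weight of the entropy term rises whenever the policy's entropy falls below a target. The sketch below uses a common heuristic target and assumed names; it is not tied to this paper's hyperparameters.

```python
import torch

action_dim = 2                                   # illustrative action dimensionality
log_alpha = torch.zeros(1, requires_grad=True)   # optimize log(alpha) so alpha > 0
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -float(action_dim)              # common heuristic: -|A|

def temperature_step(log_probs: torch.Tensor) -> float:
    # log_probs: log pi(a|s) for actions freshly sampled from the current policy.
    # alpha increases when the policy entropy -E[log pi] drops below the target.
    alpha_loss = -(log_alpha.exp() * (log_probs + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```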
Stochastic latent actor-critic [SLAC; Lee et al., 2020] is an RL algorithm that addresses partial observability and high-
dimensional observations by combining a probabilistic sequential latent variable model with an actor-critic approach.
The sequential latent variable model aims to infer the true state of the environment by considering the environment
as having an unseen true latent state $z_t$ and latent dynamics $p(z_{t+1} \mid z_t, a_t)$. This model generates observations
$x_t \sim p(x_t \mid z_t)$ and rewards $r_t \sim p(r_t \mid z_t)$. By approximating these latent dynamics in the latent variable model using
approximate variational inference, the learned model can infer the latent state from previous observations and actions
as $q(z_t \mid x_{0:t}, a_{0:t})$. Consequently, SLAC can be viewed as an adaptation of SAC to POMDPs, using the inferred latent
state $z$ as input for the critic in SAC.
Equation (2) describes the latent variable model, which is parameterized by $\psi$. This model infers a low-dimensional
latent state representation $z$ from a history of high-dimensional observations $x$ and actions $a$ gathered through
interaction with the environment. This model is trained by sampling trajectories from the environment and in-
ferring a latent state from each trajectory according to the distributions given in Equation (3). The model uses
this inferred latent state to predict the next observation and reward according to the distributions in Equation (2).
$$
\begin{aligned}
\mathbf{z}^1_1 &\sim p(\mathbf{z}^1_1), \\
\mathbf{z}^2_1 &\sim p_\psi(\mathbf{z}^2_1 \mid \mathbf{z}^1_1), \\
\mathbf{z}^1_{t+1} &\sim p_\psi(\mathbf{z}^1_{t+1} \mid \mathbf{z}^2_t, \mathbf{a}_t), \\
\mathbf{z}^2_{t+1} &\sim p_\psi(\mathbf{z}^2_{t+1} \mid \mathbf{z}^1_{t+1}, \mathbf{z}^2_t, \mathbf{a}_t), \\
\mathbf{x}_t &\sim p_\psi(\mathbf{x}_t \mid \mathbf{z}^1_t, \mathbf{z}^2_t), \\
r_t &\sim p_\psi(r_t \mid \mathbf{z}^1_t, \mathbf{z}^2_t, \mathbf{a}_t, \mathbf{z}^1_{t+1}, \mathbf{z}^2_{t+1}).
\end{aligned}
\tag{2}
$$

$$
\begin{aligned}
\mathbf{z}^1_1 &\sim q_\psi(\mathbf{z}^1_1 \mid \mathbf{x}_1), \\
\mathbf{z}^2_1 &\sim p_\psi(\mathbf{z}^2_1 \mid \mathbf{z}^1_1), \\
\mathbf{z}^1_{t+1} &\sim q_\psi(\mathbf{z}^1_{t+1} \mid \mathbf{x}_{t+1}, \mathbf{z}^2_t, \mathbf{a}_t), \\
\mathbf{z}^2_{t+1} &\sim p_\psi(\mathbf{z}^2_{t+1} \mid \mathbf{z}^1_{t+1}, \mathbf{z}^2_t, \mathbf{a}_t).
\end{aligned}
\tag{3}
$$
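To make the factorization above concrete, the sketch below implements the two-level latent state of Equations (2) and (3) with diagonal Gaussians; the posterior consumes a precomputed image feature. Class names, layer sizes, and latent dimensions are assumptions for illustration, and the observation and reward decoders of Equation (2) are omitted.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Maps its concatenated inputs to a diagonal Gaussian distribution."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * out_dim))

    def forward(self, *inputs):
        mean, log_std = self.net(torch.cat(inputs, dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

class LatentModel(nn.Module):
    def __init__(self, z1_dim=32, z2_dim=256, action_dim=2, feature_dim=256):
        super().__init__()
        # generative conditionals p_psi of Eq. (2)
        self.prior_z1 = GaussianHead(z2_dim + action_dim, z1_dim)              # p(z1_{t+1} | z2_t, a_t)
        self.prior_z2 = GaussianHead(z1_dim + z2_dim + action_dim, z2_dim)     # p(z2_{t+1} | z1_{t+1}, z2_t, a_t)
        # inference conditional q_psi of Eq. (3)
        self.post_z1 = GaussianHead(feature_dim + z2_dim + action_dim, z1_dim) # q(z1_{t+1} | x_{t+1}, z2_t, a_t)

    def infer_step(self, feature_next, z2, action):
        # Eq. (3): sample z1_{t+1} from the posterior, then z2_{t+1} from the shared conditional.
        z1_next = self.post_z1(feature_next, z2, action).rsample()
        z2_next = self.prior_z2(z1_next, z2, action).rsample()
        return z1_next, z2_next

    def generate_step(self, z2, action):
        # Eq. (2): sample z1_{t+1} from the prior instead; used for the KL term in Eq. (4).
        z1_next = self.prior_z1(z2, action).rsample()
        z2_next = self.prior_z2(z1_next, z2, action).rsample()
        return z1_next, z2_next
```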
The parameters $\psi$ of the model are trained to optimize the objective in Equation (4), in which $D_{\mathrm{KL}}$ is the Kullback-
Leibler divergence, an asymmetric measure of the difference between two distributions commonly used in variational
inference [Kingma and Welling, 2014]:
$$
J_M(\psi) = \mathop{\mathbb{E}}_{\mathbf{z}_{1:\tau+1} \sim q_\psi}\!\left[\sum_{t=0}^{\tau} \Big( -\log p_\psi(\mathbf{x}_{t+1} \mid \mathbf{z}_{t+1}) - \log p_\psi(r_{t+1} \mid \mathbf{z}_{t+1}) + D_{\mathrm{KL}}\big(q_\psi(\mathbf{z}_{t+1} \mid \mathbf{x}_{t+1}, \mathbf{z}_t, \mathbf{a}_t) \,\|\, p_\psi(\mathbf{z}_{t+1} \mid \mathbf{z}_t, \mathbf{a}_t)\big) \Big)\right]. \tag{4}
$$
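A corresponding sketch of the loss in Equation (4), assuming the model exposes torch distribution objects for the observation decoder, the reward predictor, the posterior, and the prior (names are illustrative):

```python
import torch

# Hedged sketch of J_M(psi): negative log-likelihood of observations and rewards
# plus the KL between posterior and prior, averaged over a batch of sequences.
def model_loss(obs_dist, reward_dist, posterior, prior, observations, rewards):
    recon_nll = -obs_dist.log_prob(observations).sum(dim=(-3, -2, -1))   # sum over image dims
    reward_nll = -reward_dist.log_prob(rewards)
    kl = torch.distributions.kl_divergence(posterior, prior).sum(dim=-1)  # sum over latent dims
    return (recon_nll + reward_nll + kl).mean()
```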
The resulting inferred latent state, which captures information about previous observations and expected reward in
future time steps, serves as the input for the critic in SAC.