
estimates the utility of taking a certain action in the environment, which serves as a supervision signal for the policy, also called the actor. The SLAC method has excellent sample efficiency in the safety-agnostic partially observable setting, which makes it a promising candidate to adapt to high-dimensional settings with safety constraints. At its core, SLAC is an actor-critic approach, so it extends naturally to account for safety through a safety critic. We extend SLAC in three ways to create our safe RL approach for partially observable settings (Safe SLAC): (1) the latent variable model also predicts cost violations, (2) we learn a safety critic that predicts the discounted cost return, and (3) we modify the policy training procedure to optimize a safety-constrained objective via a Lagrangian relaxation, solved with dual gradient descent on the primary objective and the Lagrange multiplier, which circumvents the inherent difficulty of constrained optimization.
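For concreteness, the sketch below illustrates one way to implement such a Lagrangian-relaxed actor update with dual gradient descent on the multiplier. All names (reward_q, safety_q, cost_budget) and the log-parameterization of the multiplier are illustrative assumptions, not the exact Safe SLAC implementation.

```python
import torch

# Minimal sketch of a Lagrangian relaxation solved with dual gradient descent.
# All names and hyperparameters are illustrative, not the exact Safe SLAC code.
log_lambda = torch.zeros(1, requires_grad=True)      # log of the Lagrange multiplier
lambda_optimizer = torch.optim.Adam([log_lambda], lr=3e-4)

def actor_loss(reward_q, safety_q):
    # Primal step: maximize the reward critic while penalizing the safety
    # critic's predicted discounted cost return, weighted by the multiplier.
    lam = log_lambda.exp().detach()                  # held constant during the actor update
    return (-reward_q + lam * safety_q).mean()

def multiplier_loss(safety_q, cost_budget):
    # Dual step: grow the multiplier when the predicted cost return exceeds
    # the budget, shrink it otherwise (gradient descent on the negated dual).
    violation = (safety_q - cost_budget).detach().mean()
    return -(log_lambda.exp() * violation)
```

In each iteration, one would step the actor optimizer on actor_loss and lambda_optimizer on multiplier_loss; parameterizing the multiplier by its logarithm is one common way to keep it nonnegative.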
We evaluate Safe SLAC using a set of benchmark environments introduced by Ray et al. [2019]. The empirical
evaluation shows competitive results compared with complex state-of-the-art approaches.
2 Related work
Established baseline algorithms for safe reinforcement learning in the fully observable setting include constrained policy optimization [CPO; Achiam et al., 2017] and TRPO-Lagrangian [Ray et al., 2019], a cost-constrained variant of trust region policy optimization [TRPO; Schulman et al., 2015].
While TRPO-Lagrangian uses an adaptive Lagrange multiplier to solve the constrained problem with primal-dual
optimization, CPO solves the problem of constraint satisfaction analytically during the policy update.
The method most closely related to ours is the Lagrangian model-based agent [LAMBDA; As et al., 2022], which also addresses the problem of learning a safe policy from pixel observations under high partial observability. LAMBDA uses the partially stochastic dynamics model introduced by Hafner et al. [2019]. The authors take a Bayesian approach to the dynamics model, sampling from the posterior over parameters to obtain different instantiations of this model. For each instantiation, simulated trajectories are sampled. Then, the worst cost return and the best reward return are used to train critic functions that provide a gradient to the policy. LAMBDA shows performance competitive with baseline algorithms; however, it comes with two major trade-offs. First, by taking a pessimistic approach, the learned policy attains a lower cost return than the allowed cost budget. A less pessimistic approach that uses the entire allowed cost budget may yield a constraint-satisfying policy with a higher reward return. Second, the LAMBDA training procedure involves generating many samples from its latent variable model to estimate the optimistic/pessimistic temporal difference updates.
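As a rough illustration of this ensemble-style optimism/pessimism (not LAMBDA's actual implementation; shapes and names are assumptions), the aggregation over sampled model instantiations might look as follows.

```python
import torch

# Illustrative only: given returns simulated under K dynamics models sampled
# from an (assumed) posterior, take the best reward return and the worst cost
# return per starting state. Input shapes: [K, batch].
def optimistic_pessimistic_targets(reward_returns, cost_returns):
    best_reward = reward_returns.max(dim=0).values   # optimistic w.r.t. reward
    worst_cost = cost_returns.max(dim=0).values      # pessimistic w.r.t. cost
    return best_reward, worst_cost
```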
While the reinforcement learning literature covers numerous safety perspectives [García and Fernández, 2015, Pecka and Svoboda, 2014], we focus on constraining the behavior of the agent in expectation. A class of methods known as shielding ensures safety already during training by relying on temporal logic specifications of safety [Alshiekh et al., 2018, Jansen et al., 2020]. Such methods, however, require extensive prior knowledge in the form of a (partial) model of the environment [Carr et al., 2022].
3 Constrained partially observable Markov decision processes
In reinforcement learning, an agent learns to sequentially interact with an environment to maximize some signal of utility. This problem setting is typically modeled as a Markov decision process [MDP; Sutton and Barto, 2018], in which the environment is composed of a set of states $\mathcal{S}$ and a set of actions $\mathcal{A}$. At each timestep $t$, the agent receives the current environment state $s_t \in \mathcal{S}$ and executes an action $a_t \in \mathcal{A}$ according to the policy $\pi$: $a_t \sim \pi(a_t \mid s_t)$. This action results in a new state according to the transition dynamics $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$ and a scalar reward signal $r_t = r(s_t, a_t) \in \mathbb{R}$, where $r$ is the reward function. The goal is for the agent to learn an optimal policy $\pi^*$ such that the expectation of the discounted, accumulated reward in the environment under that policy is maximized, i.e., $\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_t \gamma^t r_t\right]$ with $\gamma \in [0, 1)$. We use $\rho_\pi$ to denote the distribution over trajectories induced in the environment by a policy $\pi$.
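As a small worked example of the objective above, a Monte Carlo estimate of the discounted return of one sampled trajectory can be computed as follows; this is plain illustrative code, not part of any cited method.

```python
# Minimal sketch: discounted return of one trajectory for the objective above,
# accumulated backwards so each reward r_t is weighted by gamma**t.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: discounted_return([1.0, 0.0, 1.0]) == 1.0 + 0.99**2 == 1.9801
```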
In a partially observable Markov decision process [POMDP; Kaelbling et al., 1998], the agent cannot observe the true state $s_t$ of the MDP and instead receives an observation $x_t \in \mathcal{X}$ that provides partial information about the state, sampled from the observation function $x_t \sim X(s_t)$. In this setting, learning a policy $\pi$ is more difficult than in MDPs, since the true state of the underlying MDP is not known and must be inferred from sequences of observations for the policy to be optimal. Therefore, the optimal policy is a function of a history of observations and actions, $\pi(a_t \mid x_{1:t}, a_{1:t-1})$. In practice, representing such a policy can be infeasible, so the policy is often conditioned on a compact representation of the history.
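One common way to obtain such a compact representation is a recurrent encoder over the observation-action history, as in the sketch below. The architecture, module names, and dimensions are illustrative assumptions, not the model used later in this paper.

```python
import torch
import torch.nn as nn

# Minimal sketch: condition a policy on a compact summary of the
# observation-action history, here produced by a GRU. All dimensions and
# names are illustrative.
class HistoryPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(obs_dim + act_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, observations, prev_actions):
        # observations: [batch, T, obs_dim]; prev_actions: [batch, T, act_dim]
        history = torch.cat([observations, prev_actions], dim=-1)
        summary, _ = self.encoder(history)                # [batch, T, hidden_dim]
        return torch.tanh(self.head(summary[:, -1]))      # action from the latest summary
```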