Safe Policy Improvement
in Constrained Markov Decision Processes
Luigi Berducci¹ and Radu Grosu¹
¹ CPS Group, TU Wien, Austria
Abstract. The automatic synthesis of a policy through reinforcement
learning (RL) from a given set of formal requirements depends on the
construction of a reward signal and consists of the iterative application
of many policy-improvement steps. The synthesis algorithm has to bal-
ance target, safety, and comfort requirements in a single objective and to
guarantee that the policy improvement does not increase the number of
safety-requirement violations, especially for safety-critical applications.
In this work, we present a solution to the synthesis problem by solving its
two main challenges: reward-shaping from a set of formal requirements
and safe policy update. For the first, we propose an automatic reward-
shaping procedure, defining a scalar reward signal compliant with the
task specification. For the second, we introduce an algorithm ensuring
that the policy is improved in a safe fashion, with high-confidence guar-
antees. We also discuss the adoption of a model-based RL algorithm to
efficiently use the collected data and train a model-free agent on the
predicted trajectories, where safety violations do not have the same impact as
in the real world. Finally, we demonstrate on standard control
benchmarks that the resulting learning procedure is effective and robust
even under heavy perturbations of the hyperparameters.
Keywords: Reinforcement Learning · Safe Policy Improvement · Formal Specification
1 Introduction
Reinforcement Learning (RL) has become a practical approach for solving com-
plex control tasks in increasingly challenging environments. However, despite the
availability of a large set of standard benchmarks with well-defined structure and
reward signals, solving new problems remains an art.
There are two major challenges in applying RL to new synthesis problems.
The first arises from the need to define a good reward signal for the problem.
We illustrate this challenge with an autonomous-driving (AD) application. In
AD, we have numerous requirements that have to be mapped into a single scalar
reward signal. In realistic applications, more than 200 requirements need to be
considered when assessing the course of action [14]. Moreover, determining the
relative importance of the different requirements is a highly non-trivial task. In
this realm, there are a plethora of regulations, ranging from safety and traffic
rules to performance, comfort, legal, and ethical requirements.
The second challenge concerns the RL algorithm with which we intend to
search for an optimal policy. Most modern RL algorithms, especially model-free
policy-gradient methods [53], require an iterative interaction with the environment,
from which they collect fresh experiences for further improving the policy. Despite
their effectiveness, a slight change in the policy parameters can have unexpected
consequences on the resulting performance. This often results in unstable learning
curves that strongly depend on the tuning of hyperparameters and algorithmic design
choices [28]. The lack of guarantees and robustness in the policy improvement
still limits the application of RL outside controlled environments or simulators.
In this paper, we tackle the two problems mentioned above by proposing a
complete design pipeline for safe policies: from the definition of the problem to
the safe optimization of the policy (Fig. 1). We discuss how to structure the
problem from a set of formal requirements and enrich the reward signal while
keeping a sound formulation. Moreover, we define the conditions which char-
acterize a correct-by-construction algorithm in this context and demonstrate
its realizability with high-confidence off-policy evaluation. We propose two al-
gorithms, one model-free and one model-based. We formally prove
that they are correct by construction and evaluate their performance empirically.
Fig. 1. Overall RL pipeline to solve formally-specified control tasks.
2 Motivating Example
We motivate our work with a cart-pole example extended with safety, target, and
comfort requirements, as follows: A pole is attached to a cart that moves between
a left and a right limit, within a flat and frictionless environment. The environ-
ment has a target area within the limits and a static obstacle hanging above the
track. We define five requirements for the cart-pole, as shown in Table 1.
We aim to teach the cart-pole to satisfy all requirements. The system is
controlled by applying a continuous force to the cart, which moves the cart-pole
left and right with different velocities. In order to reach the goal and satisfy the
target requirement Req1, the cart-pole must perform an uncomfortable and potentially
unsafe maneuver: moving with a perfectly balanced pole would make the pole collide
with the obstacle and violate the safety requirement Req3, so the cart-pole must lose
balance and pass below the obstacle. Furthermore, if the obstacle is too large, or the
cart does not apply enough force after passing the obstacle, it may not be able to
reach the target without the pole falling, thus violating the safety requirement Req2.
A sequence of pictures showing the cart-pole successfully overcoming the obstacle in
the environment is depicted in Figure 2.

Req ID   Description
Req1     The cart shall reach the target in bounded time
Req2     The pole shall never fall from the cart
Req3     The pole shall never collide with the obstacle
Req4     The cart shall never leave the left/right limits
Req5     The cart shall keep the pole balanced within a comfortably small angle as often as possible

Table 1. Cart-pole example – informal requirements
Fig. 2. A cart-pole overcomes a hanging obstacle.
Observe that not all requirements have the same importance for this task.
We regard the safety requirements as fundamental constraints on the agent’s
behavior. A safety violation compromises the validity of the entire episode. The
target requirement expresses the main objective, and its completion is the agent's
primary purpose. Finally, comfort requirements are secondary objectives that should
be optimized as long as they do not interfere with the other two classes of
requirements. In the remainder of this paper, we will use this motivating example
to illustrate the steps that lead to the proposed methodology.
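To make the task structure concrete, the following sketch shows one possible encoding of the requirements in Table 1 as Boolean predicates over the cart-pole state. It is purely illustrative: the state fields, limits, and thresholds are assumptions of this sketch, not the environment used in the paper.

```python
from dataclasses import dataclass

@dataclass
class CartPoleState:
    """Illustrative state; field names and semantics are assumptions."""
    x: float         # cart position along the track
    theta: float     # pole angle from the vertical (rad)
    collided: bool   # True if the pole touched the hanging obstacle
    fallen: bool     # True if the pole fell from the cart

# Hypothetical constants, chosen only for this sketch.
X_LEFT, X_RIGHT = -2.5, 2.5          # track limits (Req4)
X_TARGET_LO, X_TARGET_HI = 1.8, 2.2  # target area (Req1)
THETA_COMFORT = 0.1                  # comfortable pole angle in rad (Req5)

# Safety requirements: must hold at every step of an episode.
def req2_no_fall(s: CartPoleState) -> bool:
    return not s.fallen

def req3_no_collision(s: CartPoleState) -> bool:
    return not s.collided

def req4_stay_within_limits(s: CartPoleState) -> bool:
    return X_LEFT <= s.x <= X_RIGHT

# Target requirement: must hold at least once within the time bound.
def req1_reach_target(s: CartPoleState) -> bool:
    return X_TARGET_LO <= s.x <= X_TARGET_HI

# Comfort requirement: should hold as often as possible.
def req5_small_angle(s: CartPoleState) -> bool:
    return abs(s.theta) <= THETA_COMFORT
```

Grouping these predicates into safety, target, and comfort classes mirrors the priority ordering above: the safety predicates become hard constraints of the resulting constrained MDP, while the target and comfort predicates shape the reward.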
3 Related Work
Safety is a well-known and highly-researched problem in the RL community, and
the related literature is vast [11,23]. Many different definitions and approaches
have emerged over the years, so we will first clarify the interpretations of safety
that we adopt. Much work considers safety as a property of the trained policy,
i.e., whether the agent satisfies a set of constraints. These works either try to
converge to a safe policy at the end of training [5,10,15,2], or try to ensure
safety during the entire training process [51,16,45,4,18,54,64,56].
Taking a design perspective, we instead consider safety as a property
of the RL algorithm. By safety, we mean that a safe algorithm will not
return a policy with performance below a given threshold, with high probability
guarantees. Building on this interpretation, we will later define the conditions
for correct-by-construction RL algorithms. In the following, we review the main
approaches relevant to our contribution.
Safe policy improvement. An early representative of these algorithms is CPI
[34,46], which provides monotonically improving updates at the cost of enormous
sample complexity. More recently, [52] introduced TRPO, the first policy-gradient
algorithm which builds on trust-region methods to guarantee policy improve-
ment. However, its sound formulation must be relaxed in practice for computa-
tional tractability. The class of algorithms most relevant to our work is based on
off-policy evaluation [47,57]. A seminal work in this field is HCPI [58,59], which uses
concentration inequalities [40] to define high-confidence lower bounds on the
importance-sampling estimates [47] of the policy performance. More recently, this
class of algorithms has been shown to be part of the general Seldonian frame-
work [60]. We build on this formalism to define our approach, and we differ from
existing work by proposing a novel interface with a set of formally specified
requirements and by proposing Daedalus-style solutions [58] to tackle the problem of
iterative improvement in an online setting.
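To convey the intuition behind this family of estimators, the sketch below computes a clipped per-trajectory importance-sampling estimate of a candidate policy's return, together with a Hoeffding-style high-confidence lower bound. It is a simplified illustration of the underlying idea, not the exact estimator or concentration bound used in HCPI [58,59]; the function name, the clipping scheme, and the normalization of returns to [0, g_max] are assumptions of this sketch.

```python
import math

def is_lower_bound(trajectories, pi_e, gamma=0.99, clip=10.0,
                   g_max=1.0, delta=0.05):
    """Hoeffding-style (1 - delta)-confidence lower bound on J(pi_e).

    trajectories: episodes collected with a behavior policy pi_b; each episode
        is a list of (state, action, reward, behavior_prob) tuples, where
        behavior_prob = pi_b(action | state) was logged at collection time.
    pi_e: function (state, action) -> probability under the evaluation policy.
    Rewards are assumed non-negative, with returns normalized to [0, g_max].
    """
    estimates = []
    for episode in trajectories:
        rho, ret, discount = 1.0, 0.0, 1.0
        for state, action, reward, behavior_prob in episode:
            rho *= pi_e(state, action) / behavior_prob  # importance weight
            ret += discount * reward
            discount *= gamma
        # Clipping keeps every estimate bounded; for non-negative returns it
        # only biases the estimate downwards, so the bound remains valid.
        estimates.append(min(rho, clip) * ret)

    n = len(estimates)
    mean = sum(estimates) / n
    # Each clipped estimate lies in [0, clip * g_max]; Hoeffding's inequality
    # then yields a (1 - delta)-confidence lower bound on its expectation.
    return mean - clip * g_max * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
```

A safe policy-improvement loop in this spirit accepts a candidate policy only if such a lower bound exceeds the performance threshold associated with the currently deployed behavior policy.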
Policy evaluation with statistical model checking. Providing statistical guarantees
on the evaluation of a policy against a temporal specification is commonly addressed
with statistical model checking (SMC) [3,35]. Recent works have proposed SMC for
evaluating an NN decision policy operating in an MDP [24]. However, while SMC
relies on collecting a large number of executions of the policy in the environment
and on hypothesis testing to provide performance bounds, our off-policy
setting tackles a different problem. In off-policy evaluation, the model of the
environment is generally unknown and we cannot deploy the decision policy
in the environment because it could be costly or dangerous. Unlike SMC, we
cannot directly collect statistics with the decision policy. Instead, we try to
approximate its expected performance using data collected by another policy.
RL with temporal logic. Much prior work adopts temporal logic (TL) in
RL. Some of it focuses on the decomposition of a complex task into many
subtasks [33,61]; other work focuses on formulations tailored to tasks specified in
TL [21,36,31,29]. Several works use the quantitative semantics of TL (e.g., STL and
its variants) to derive a reward signal [37,32,7]. However, they describe the task as a
monolithic TL specification and compute the reward on complete [37] or truncated
trajectories [7]. In this work, we use HPRS, introduced in [9], which combines the
individual evaluation of requirements into a hierarchically-weighted reward that
captures the priority among the various classes of requirements, thereby mitigating
reward sparsity and the resulting credit-assignment problem.
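As a rough, hypothetical illustration of this hierarchical idea (and not the actual HPRS formula from [9]), a per-step reward could combine the three requirement classes so that a safety violation can never be compensated by target or comfort progress:

```python
def hierarchical_reward(safety_ok, target_score, comfort_score,
                        w_target=1.0, w_comfort=0.1):
    """Illustrative hierarchically-weighted per-step reward.

    safety_ok: True iff all safety requirements hold at this step.
    target_score, comfort_score: values in [0, 1] measuring progress towards
        the target and satisfaction of the comfort requirements.
    A safety violation dominates: no amount of target or comfort progress
    can compensate for it.
    """
    if not safety_ok:
        return -1.0
    return w_target * target_score + w_comfort * comfort_score
```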
Multi-objective RL. Multi-objective RL (MORL) studies the optimization of
multiple and often conflicting objectives. MORL algorithms learn single or mul-
tiple policies [50,38]. There exist several techniques to combine multiple reward
signals into a single scalar value (i.e., scalarization), such as linear or non-linear
projections [42,8,62]. Other approaches formulate structured rewards by impos-
ing or assuming a preference ranking on the objectives and finding an equilib-
rium among them [22,55,65,1]. [13] proposes to decompose the task specification
into many requirements. However, they do not consider any structured priority
among requirements and rely on the arbitrary choice of weights for each of them.
In general, balancing safety and performance has connections with
MORL. However, we focus on the safety aspects and on how to guarantee, with
high probability, that the returned policy does not fall below a given performance
threshold. These guarantees are not present in MORL approaches.
Hierarchically Structured Requirements. The use of partially ordered re-
quirements to formalize complex tasks has been proposed in a few works. The
rulebook formalism [14] represents a set of prioritized requirements, and it has
been used for evaluating the behaviors produced by a planner [14] or for generating
adversarial testing scenarios [63]. In [48], a complementary inverse-RL approach
learns dependencies among formal requirements from existing demonstrations.
Conversely, we use the requirements to describe our
task and define a Constrained MDP, and then focus on the design of a safe
policy-improvement algorithm.
Model-based reinforcement learning. Among the first representatives of
model-based RL algorithms is PILCO [19], which learns a Gaussian-process dynamics
model for systems with low-dimensional states. More recent works showed that it is
possible to exploit expressive neural-network models to learn complex dynamics in
robotic systems [41], and to use them for planning [17] or policy learning [30]. In the context
of control from pixel images, world models [25] proved that it is possible to
learn accurate dynamics models for POMDPs by using noisy high-dimensional
observations instead of exact states. Their application to planning [27], and
later to policy learning [26], has achieved state-of-the-art performance
in many benchmarks and has recently been applied to real-world robots [12].
4 Preliminaries
4.1 Reinforcement Learning
Reinforcement Learning (RL) aims to infer a policy for an intelligent agent that
takes actions in an environment so as to maximize some notion of expected return.
The environment is typically modeled as a Markov decision process (MDP),
and the return is defined as the cumulative discounted reward.
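Concretely, denoting by r_t the reward collected at step t, the agent maximizes the expected discounted return

    J(π) = E_{τ∼π} [ ∑_{t=0}^{∞} γ^t r_t ],

where the expectation is taken over trajectories τ = (s_0, a_0, r_0, s_1, a_1, r_1, ...) obtained by following the policy π; this is the standard RL objective.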
Definition 1. A Markov Decision Process (MDP) is a tuple M = (S, A, p, R, γ),
where S is a set of states; A is a set of possible actions; p : S × A × S → [0, 1] is
a transition probability function (where p(s_{t+1} | a_t, s_t) describes the probability of
transitioning to state s_{t+1} when taking action a_t in state s_t)