Safe Policy Improvement
in Constrained Markov Decision Processes
Luigi Berducci¹ and Radu Grosu¹
¹ CPS Group, TU Wien, Austria
Abstract. The automatic synthesis of a policy through reinforcement
learning (RL) from a given set of formal requirements depends on the
construction of a reward signal and consists of the iterative application
of many policy-improvement steps. The synthesis algorithm has to bal-
ance target, safety, and comfort requirements in a single objective and to
guarantee that the policy improvement does not increase the number of
safety-requirement violations, especially for safety-critical applications.
In this work, we present a solution to the synthesis problem by solving its
two main challenges: reward-shaping from a set of formal requirements
and safe policy update. For the first, we propose an automatic reward-
shaping procedure, defining a scalar reward signal compliant with the
task specification. For the second, we introduce an algorithm ensuring
that the policy is improved in a safe fashion, with high-confidence guar-
antees. We also discuss the adoption of a model-based RL algorithm to
efficiently use the collected data and train a model-free agent on the
predicted trajectories, where safety violations do not have the same impact as
in the real world. Finally, we demonstrate on standard control
benchmarks that the resulting learning procedure is effective and robust
even under heavy perturbations of the hyperparameters.
Keywords: Reinforcement Learning · Safe Policy Improvement · Formal Specification
1 Introduction
Reinforcement Learning (RL) has become a practical approach for solving com-
plex control tasks in increasingly challenging environments. However, despite the
availability of a large set of standard benchmarks with well-defined structure and
reward signals, solving new problems remains an art.
There are two major challenges in applying RL to new synthesis problems.
The first arises from the need to define a good reward signal for the problem.
We illustrate this challenge with an autonomous-driving (AD) application. In
AD, we have numerous requirements that have to be mapped into a single scalar
reward signal. In realistic applications, more than 200 requirements need to be
considered when assessing the course of action [14]. Moreover, determining the
relative importance of the different requirements is a highly non-trivial task. In
this realm, there are a plethora of regulations, ranging from safety and traffic
rules to performance, comfort, legal, and ethical requirements.
The second challenge concerns the RL algorithm with which we intend to
search for an optimal policy. Most modern RL algorithms, especially model-free
policy-gradient methods [53], require an iterative interaction with the environment,
from which they collect fresh experiences for further improving the policy. Despite
their effectiveness, a slight change in the policy parameters can have unexpected
consequences on the resulting performance. This often results in unstable learning
curves that strongly depend on the tuning of hyperparameters and algorithmic design
choices [28]. The lack of guarantees and robustness in the policy improvement
still limits the application of RL outside controlled environments or simulators.
In this paper, we tackle the two problems mentioned above by proposing a
complete design pipeline for safe policies: from the definition of the problem to
the safe optimization of the policy (Fig. 1). We discuss how to structure the
problem from a set of formal requirements and enrich the reward signal while
keeping a sound formulation. Moreover, we define the conditions which char-
acterize a correct-by-construction algorithm in this context and demonstrate
its realizability with high-confidence off-policy evaluation. We propose two al-
gorithms, one model-free and one model-based. We formally prove
that they are correct by construction and evaluate their performance empirically.
Fig. 1. Overall RL pipeline to solve formally-specified control tasks.
2 Motivating Example
We motivate our work with a cart-pole example extended with safety, target, and
comfort requirements, as follows: A pole is attached to a cart that moves between
a left and a right limit, within a flat and frictionless environment. The environ-
ment has a target area within the limits and a static obstacle hanging above the
track. We define five requirements for the cart-pole, as shown in Table 1.
We aim to teach the cart-pole to satisfy all requirements. The system is
controlled by applying a continuous force to the cart, which moves the cart-pole
left and right with different velocities. In order to reach the goal and satisfy the
target requirement Req1, the cart-pole must perform an uncomfortable and potentially
unsafe maneuver: moving with a perfectly balanced pole would make the pole collide
with the obstacle and violate the safety requirement Req3, so the cart-pole must lose
balance and pass below the obstacle. Furthermore, if the obstacle is too large, or the
cart does not apply enough force after passing the obstacle, it may not be able to
reach the target without the pole falling, thus violating the safety requirement Req2.
A sequence of pictures showing the cart-pole successfully overcoming the obstacle in
the environment is depicted in Figure 2.

Req ID   Description
Req1     The cart shall reach the target in bounded time
Req2     The pole shall never fall from the cart
Req3     The pole shall never collide with the obstacle
Req4     The cart shall never leave the left/right limits
Req5     The cart shall keep the pole balanced within a comfortably small angle as often as possible

Table 1. Cart-pole example – informal requirements
Fig. 2. A cart-pole overcomes a hanging obstacle.
Observe that not all requirements have the same importance for this task.
We regard the safety requirements as fundamental constraints on the agent’s
behavior. A safety violation compromises the validity of the entire episode. The
target requirement expresses the main objective, and its completion is the agent's
primary purpose. Finally, comfort requirements are secondary objectives that should
be optimized as long as they do not interfere with the other two classes of
requirements. In the remainder of this paper, we will use this motivating example
to illustrate the steps that lead to the proposed methodology.
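To make the task structure concrete, the following sketch shows one possible encoding of the requirements in Table 1 as Boolean predicates over the cart-pole state. It is purely illustrative: the state fields, limits, and thresholds are assumptions of this sketch, not the environment used in the paper.

```python
from dataclasses import dataclass

@dataclass
class CartPoleState:
    """Illustrative state; field names and semantics are assumptions."""
    x: float         # cart position along the track
    theta: float     # pole angle from the vertical (rad)
    collided: bool   # True if the pole touched the hanging obstacle
    fallen: bool     # True if the pole fell from the cart

# Hypothetical constants, chosen only for this sketch.
X_LEFT, X_RIGHT = -2.5, 2.5          # track limits (Req4)
X_TARGET_LO, X_TARGET_HI = 1.8, 2.2  # target area (Req1)
THETA_COMFORT = 0.1                  # comfortable pole angle in rad (Req5)

# Safety requirements: must hold at every step of an episode.
def req2_no_fall(s: CartPoleState) -> bool:
    return not s.fallen

def req3_no_collision(s: CartPoleState) -> bool:
    return not s.collided

def req4_stay_within_limits(s: CartPoleState) -> bool:
    return X_LEFT <= s.x <= X_RIGHT

# Target requirement: must hold at least once within the time bound.
def req1_reach_target(s: CartPoleState) -> bool:
    return X_TARGET_LO <= s.x <= X_TARGET_HI

# Comfort requirement: should hold as often as possible.
def req5_small_angle(s: CartPoleState) -> bool:
    return abs(s.theta) <= THETA_COMFORT
```

Grouping these predicates into safety, target, and comfort classes mirrors the priority ordering above: the safety predicates become hard constraints of the resulting constrained MDP, while the target and comfort predicates shape the reward.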
3 Related Work
Safety is a well-known and highly-researched problem in the RL community, and
the related literature is vast [11,23]. Many different definitions and approaches
have emerged over the years, so we will first clarify the interpretations of safety
that we adopt. Much work considers safety as a property of the trained policy,
i.e., whether the agent satisfies a set of constraints. These works either try to
converge to a safe policy at the end of training [5,10,15,2], or try to ensure
safety during the entire training process [51,16,45,4,18,54,64,56].
Taking a design perspective, we instead consider safety as a property
of the RL algorithm. By safety, we mean that a safe algorithm will not
return a policy with performance below a given threshold, with high probability
guarantees. Building on this interpretation, we will later define the conditions
for correct-by-construction RL algorithms. In the following, we review the main
approaches relevant to our contribution.
Safe policy improvement. An early representative of these algorithms is CPI
[34,46], which provides monotonically improving updates at the cost of enormous
sample complexity. More recently, [52] introduced TRPO, the first policy-gradient
algorithm which builds on trust-region methods to guarantee policy improve-
ment. However, its sound formulation must be relaxed in practice for computa-
tional tractability. The class of algorithms most relevant to our work is based on
off-policy evaluation [47,57]. A seminal work in this field is HCPI [58,59], which uses
concentration inequalities [40] to define high-confidence lower bounds on the
importance-sampling estimates [47] of the policy performance. More recently, this
class of algorithms has been shown to be part of the general Seldonian frame-
work [60]. We build on this formalism to define our approach, and we differ from
existing work by proposing a novel interface with a set of formally specified
requirements and by proposing Daedalus-style solutions [58] to tackle the problem of
iterative improvement in an online setting.
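To convey the intuition behind this family of estimators, the sketch below computes a clipped per-trajectory importance-sampling estimate of a candidate policy's return, together with a Hoeffding-style high-confidence lower bound. It is a simplified illustration of the underlying idea, not the exact estimator or concentration bound used in HCPI [58,59]; the function name, the clipping scheme, and the normalization of returns to [0, g_max] are assumptions of this sketch.

```python
import math

def is_lower_bound(trajectories, pi_e, gamma=0.99, clip=10.0,
                   g_max=1.0, delta=0.05):
    """Hoeffding-style (1 - delta)-confidence lower bound on J(pi_e).

    trajectories: episodes collected with a behavior policy pi_b; each episode
        is a list of (state, action, reward, behavior_prob) tuples, where
        behavior_prob = pi_b(action | state) was logged at collection time.
    pi_e: function (state, action) -> probability under the evaluation policy.
    Rewards are assumed non-negative, with returns normalized to [0, g_max].
    """
    estimates = []
    for episode in trajectories:
        rho, ret, discount = 1.0, 0.0, 1.0
        for state, action, reward, behavior_prob in episode:
            rho *= pi_e(state, action) / behavior_prob  # importance weight
            ret += discount * reward
            discount *= gamma
        # Clipping keeps every estimate bounded; for non-negative returns it
        # only biases the estimate downwards, so the bound remains valid.
        estimates.append(min(rho, clip) * ret)

    n = len(estimates)
    mean = sum(estimates) / n
    # Each clipped estimate lies in [0, clip * g_max]; Hoeffding's inequality
    # then yields a (1 - delta)-confidence lower bound on its expectation.
    return mean - clip * g_max * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
```

A safe policy-improvement loop in this spirit accepts a candidate policy only if such a lower bound exceeds the performance threshold associated with the currently deployed behavior policy.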
Policy evaluation with statistical model checking. Providing statistical guarantees
on the evaluation of a policy against a temporal specification is commonly addressed
with statistical model checking (SMC) [3,35]. Recent works have proposed SMC for
evaluating an NN decision policy operating in an MDP [24]. However, while SMC
relies on collecting a large number of executions of the policy in the environment
and on hypothesis testing to provide performance bounds, our off-policy
setting tackles a different problem. In off-policy evaluation, the model of the
environment is generally unknown and we cannot deploy the decision policy
in the environment because it could be costly or dangerous. Unlike SMC, we
cannot directly collect statistics with the decision policy. Instead, we try to
approximate its expected performance using data collected by another policy.
RL with temporal logic. Much prior work adopts temporal logic (TL) in
RL. Some of it focuses on the decomposition of a complex task into many
subtasks [33,61]; other work focuses on formulations tailored to tasks specified in
TL [21,36,31,29]. Several works use the quantitative semantics of TL (e.g., STL and
its variants) to derive a reward signal [37,32,7]. However, they describe the task as a
monolithic TL specification and compute the reward on complete [37] or truncated
trajectories [7]. In this work, we use HPRS, introduced in [9], which combines the
individual evaluation of requirements into a hierarchically-weighted reward that
captures the priority among the various classes of requirements, thereby mitigating
reward sparsity and the resulting credit-assignment problem.
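As a rough, hypothetical illustration of this hierarchical idea (and not the actual HPRS formula from [9]), a per-step reward could combine the three requirement classes so that a safety violation can never be compensated by target or comfort progress:

```python
def hierarchical_reward(safety_ok, target_score, comfort_score,
                        w_target=1.0, w_comfort=0.1):
    """Illustrative hierarchically-weighted per-step reward.

    safety_ok: True iff all safety requirements hold at this step.
    target_score, comfort_score: values in [0, 1] measuring progress towards
        the target and satisfaction of the comfort requirements.
    A safety violation dominates: no amount of target or comfort progress
    can compensate for it.
    """
    if not safety_ok:
        return -1.0
    return w_target * target_score + w_comfort * comfort_score
```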
Multi-objective RL. Multi-objective RL (MORL) studies the optimization of
multiple and often conflicting objectives. MORL algorithms learn single or mul-
tiple policies [50,38]. There exist several techniques to combine multiple reward
signals into a single scalar value (i.e., scalarization), such as linear or non-linear
projections [42,8,62]. Other approaches formulate structured rewards by impos-
ing or assuming a preference ranking on the objectives and finding an equilib-
rium among them [22,55,65,1]. [13] proposes to decompose the task specification
into many requirements. However, they do not consider any structured priority
among requirements and rely on the arbitrary choice of weights for each of them.
In general, balancing safety and performance has connections with
MORL. However, we focus on the safety aspects and on how to guarantee, with
high probability, that the returned policy does not fall below a given performance
threshold. These guarantees are not present in MORL approaches.
Hierarchically Structured Requirements. The use of partially ordered re-
quirements to formalize complex tasks has been proposed in a few works. The
rulebook formalism [14] represents a set of prioritized requirements, and it has
been used for evaluating the behaviors produced by a planner [14] or for generating
adversarial testing scenarios [63]. In [48], a complementary inverse-RL approach
learns dependencies among formal requirements from existing demonstrations.
Conversely, we use the requirements to describe our
task and define a Constrained MDP, and then focus on the design of a safe
policy-improvement algorithm.
Model-based reinforcement learning. Among the first representatives of
model-based RL algorithms is PILCO [19], which learns a Gaussian-process dynamics
model for systems with low-dimensional states. More recent works showed that it is
possible to exploit expressive neural-network models to learn complex dynamics in
robotic systems [41], and to use them for planning [17] or policy learning [30]. In the context
of control from pixel images, world models [25] proved that it is possible to
learn accurate dynamics models for POMDPs by using noisy high-dimensional
observations instead of exact states. Their application to planning [27], and
later to policy learning [26], has achieved state-of-the-art performance
in many benchmarks and has recently been applied to real-world robots [12].
4 Preliminaries
4.1 Reinforcement Learning
Reinforcement Learning (RL) aims to infer a policy for an intelligent agent that
takes actions in an environment so as to maximize some notion of expected return.
The environment is typically modeled as a Markov decision process (MDP),
and the return is defined as the cumulative discounted reward.
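Concretely, denoting by r_t the reward collected at step t, the agent maximizes the expected discounted return

    J(π) = E_{τ∼π} [ ∑_{t=0}^{∞} γ^t r_t ],

where the expectation is taken over trajectories τ = (s_0, a_0, r_0, s_1, a_1, r_1, ...) obtained by following the policy π; this is the standard RL objective.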
Definition 1. A Markov Decision Process (MDP) is a tuple M = (S, A, p, R, γ),
where S is a set of states; A is a set of possible actions; p : S × A × S → [0, 1] is
a transition probability function (where p(s_{t+1} | a_t, s_t) describes the probability of
transitioning to state s_{t+1} when taking action a_t in state s_t)