
Moving away from our specific example, the above five properties would be ideal for safe RL in many
applications where safety is naturally human-defined. Unfortunately, there are no existing safe RL methods
that can provably satisfy all of these properties. In particular, existing safe RL methods all fail to satisfy
these properties for at least one of the following reasons: (1) they assume a safety function is fully known
(e.g. Chow et al., 2018; Simão et al., 2021); (2) they learn safety using numeric feedback (e.g. Wachi and
Sui, 2020; Amani et al., 2021); or (3) they require a human-in-the-loop who can intervene and monitor the
safety of every action taken in real time (Saunders et al., 2018). In addition, none of these existing methods
address the issue of asking for only minimal feedback, an important consideration that places the problem
at the interface of RL and active learning.
In this paper, we present a novel safe RL framework encompassing the above concerns, involving an
offline safety-labeling oracle that can provide binary feedback on the safety of state-action pairs in between
episodes. We provide a meta-algorithm, SABRE, which can be applied to any MDP class within this
safety framework, given a blackbox RL algorithm for that MDP class. This algorithm utilizes ideas from
disagreement-based active learning, iteratively exploring the state space to find all regions where there is
disagreement on possible safety. Under appropriate technical assumptions, including that the blackbox
RL algorithm can provably optimize any given reward, SABRE will satisfy all of the above five properties,
and will return an approximately optimal safe policy with high probability. Importantly, we provide high-
probability bounds on the number of samples and calls to the labeling oracle needed, and show that they
are both polynomial, and that the latter is lower order than the former. Finally, we provide some discussion
of how this meta-algorithm approach may be applied in various settings of both theoretical and practical
interest.
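To make the high-level idea concrete, the following is a purely illustrative sketch (not SABRE itself, which is specified later in the paper) of a disagreement-driven loop of this kind; the interfaces `version_space`, `blackbox_rl`, and `oracle` are assumed stand-ins for exposition.

```python
# Hypothetical illustration only -- not the SABRE algorithm. All interfaces
# (version_space, blackbox_rl, oracle) are assumed stand-ins for exposition.

def disagreement_driven_safety_loop(version_space, blackbox_rl, oracle, n_iters):
    """Iteratively explore where candidate safety functions disagree,
    query the offline oracle between episodes, then optimize reward
    over actions that every remaining safety hypothesis deems safe."""
    for _ in range(n_iters):
        # Reward the blackbox RL algorithm for reaching (s, a) pairs whose
        # safety is still ambiguous under the current version space.
        exploration_reward = lambda s, a: float(version_space.disagrees_on(s, a))
        visited = blackbox_rl.explore(exploration_reward)

        # Offline, between episodes: binary safety labels on ambiguous pairs.
        labels = {(s, a): oracle.is_safe(s, a) for (s, a) in visited}
        version_space.update(labels)

    # Finally, optimize the true reward restricted to certified-safe actions.
    return blackbox_rl.optimize(safe_set=version_space.certified_safe())
```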
Math Notation For any natural number $n \in \mathbb{N}$, we let $[n]$ denote the set $\{1, 2, \cdots, n\}$. For any countable
set $\mathcal{X}$, we let $\Delta(\mathcal{X})$ denote the space of all probability distributions over $\mathcal{X}$. Given sets $\mathcal{X}$ and $\mathcal{Y}$, we let
$\mathcal{X} \to \mathcal{Y}$ denote the set of all functions from $\mathcal{X}$ to $\mathcal{Y}$. Lastly, $\mathrm{sign}(y)$ denotes the sign function, which takes a
value of $1$ if $y > 0$, a value of $0$ at $y = 0$, and $-1$ if $y < 0$.
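As a trivial illustration of this notation, the helpers below (hypothetical names, not defined in the paper) implement $[n]$ and $\mathrm{sign}(\cdot)$ in Python.

```python
# Minimal illustration of the notation above; `bracket` and `sign` are
# hypothetical helper names, not anything defined in the paper.

def bracket(n: int) -> list[int]:
    """[n] = {1, 2, ..., n}."""
    return list(range(1, n + 1))

def sign(y: float) -> int:
    """sign(y) = 1 if y > 0, 0 if y == 0, and -1 if y < 0."""
    return (y > 0) - (y < 0)
```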
2 Problem Setup
Markov Decision Process We consider learning in finite-horizon Markov decision processes (MDPs). A
reward-free MDP is characterized by a tuple $(\mathcal{S}, \mathcal{A}, T, \mu, H)$, where $\mathcal{S}$ is a given (potentially infinitely large)
state space, $\mathcal{A}$ is a finite action space with $|\mathcal{A}| = A$, $T : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is a transition operator, $\mu \in \Delta(\mathcal{S})$ is
an initial state distribution, and $H \in \mathbb{N}$ is the horizon. An MDP (with reward) is a tuple $(\mathcal{S}, \mathcal{A}, T, R, \mu, H)$,
where $R : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the reward function. We assume that the transition operator and reward function
are time-homogeneous; i.e., $T$ and $R$ do not depend explicitly on the time index.¹ We will let $(M, R)$ refer
to the MDP we are learning in, where $M$ is the corresponding reward-free MDP.
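As a purely illustrative encoding (not part of the paper's formalism), one might represent these tuples by their samplers; all class and field names below are assumptions.

```python
# Illustrative encoding of the tuples (S, A, T, mu, H) and (S, A, T, R, mu, H).
# States are left opaque since S may be infinite; all names here are assumptions.
from dataclasses import dataclass
from typing import Any, Callable

State = Any
Action = int        # actions identified with {0, ..., A - 1}

@dataclass
class RewardFreeMDP:
    A: int                                   # number of actions, |A|
    T: Callable[[State, Action], State]      # sampler for s' ~ T(. | s, a)
    mu: Callable[[], State]                  # sampler for s_1 ~ mu(.)
    H: int                                   # horizon

@dataclass
class MDP:
    M: RewardFreeMDP                         # reward-free part
    R: Callable[[State, Action], float]      # time-homogeneous reward in [0, 1]
```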
The agent interacts with an MDP over a series of rounds. In each round the agent successively takes
actions according to some policy, which is a mapping from states to distributions over actions, in order to generate an episode.
Specifically, in the $n$'th round for each $n \in \mathbb{N}$, the agent selects some policy $\pi^n \in \mathcal{S} \to \Delta(\mathcal{A})$, which it then
uses in order to generate an episode $\tau^n = (s^n_1, a^n_1, r^n_1, \ldots, s^n_H, a^n_H, r^n_H, s^n_{H+1})$, where $s^n_1 \sim \mu(\cdot)$, and for each
$h \in [H]$ we have $a^n_h \sim \pi^n(\cdot \mid s^n_h)$, $r^n_h = R(s^n_h, a^n_h)$, and $s^n_{h+1} \sim T(\cdot \mid s^n_h, a^n_h)$.
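Under the illustrative encoding above, one round of interaction amounts to a simple rollout loop; `pi(s)` is assumed to return a sampled action $a \sim \pi(\cdot \mid s)$.

```python
# Sketch of generating one episode with a policy pi: S -> Delta(A),
# assuming the illustrative MDP encoding above.

def rollout(mdp, pi):
    """Generate tau = [(s_1, a_1, r_1), ..., (s_H, a_H, r_H), (s_{H+1},)],
    where pi(s) returns a sampled action a ~ pi(. | s)."""
    s = mdp.M.mu()                      # s_1 ~ mu(.)
    episode = []
    for _ in range(mdp.M.H):
        a = pi(s)                       # a_h ~ pi(. | s_h)
        r = mdp.R(s, a)                 # r_h = R(s_h, a_h)
        episode.append((s, a, r))
        s = mdp.M.T(s, a)               # s_{h+1} ~ T(. | s_h, a_h)
    episode.append((s,))                # trailing state s_{H+1}
    return episode
```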
For any given policy $\pi$, we let $\mathbb{E}_\pi[\cdot]$ and $\mathbb{P}_\pi(\cdot)$ denote the expectation and probability operators over
trajectories generated using $\pi$ in the MDP $(M, R)$. Also, for any policy $\pi$, we define the value function
$V^\pi \in \mathcal{S} \to \mathbb{R}$ according to $V^\pi(s) = \mathbb{E}_\pi[\sum_{h=1}^{H} R(s_h, a_h) \mid s_1 = s]$, and we similarly define the value of the
policy according to $V(\pi) = \mathbb{E}_{s \sim \mu(\cdot)}[V^\pi(s)]$.
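Since $V(\pi)$ is an expectation of the cumulative reward over trajectories, it can be approximated by averaging returns over Monte Carlo rollouts, as in the sketch below (reusing the hypothetical `rollout` helper above).

```python
# Monte Carlo estimate of V(pi) = E_{s ~ mu}[V^pi(s)], reusing `rollout` above.

def estimate_value(mdp, pi, n_episodes: int = 1000) -> float:
    total = 0.0
    for _ in range(n_episodes):
        episode = rollout(mdp, pi)
        total += sum(r for (_, _, r) in episode[:-1])   # sum of the H rewards
    return total / n_episodes
```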
Our main metric of success for reinforcement learning is suboptimality (SubOpt). For any policy class
$\Pi \subseteq \mathcal{S} \to \Delta(\mathcal{A})$ and any policy $\hat{\pi} \in \Pi$, we define $\mathrm{SubOpt}(\hat{\pi}; \Pi) = \sup_{\pi \in \Pi} V(\pi) - V(\hat{\pi})$. Then, one of our
primary goals is to learn a policy $\hat{\pi}$ that has low suboptimality with respect to a given set of policies that it
must choose from, in a relatively small number of episodes.
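For a small, finite policy class $\Pi$, this quantity could be estimated directly by comparing value estimates, as sketched below; this is for illustration only, since enumerating $\Pi$ is infeasible in general.

```python
# Illustrative estimate of SubOpt(pi_hat; Pi) for a small, finite policy class Pi.

def suboptimality(mdp, pi_hat, Pi, n_episodes: int = 1000) -> float:
    """sup_{pi in Pi} V(pi) - V(pi_hat), with values estimated by rollouts."""
    best = max(estimate_value(mdp, pi, n_episodes) for pi in Pi)
    return best - estimate_value(mdp, pi_hat, n_episodes)
```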
Over the past few decades, many reinforcement learning algorithms have been developed that can provably
obtain low suboptimality using a small number of episodes, for various classes of MDPs. Our focus is on
adapting such RL algorithms, in order to additionally take into account safety considerations. For these
¹This is w.l.o.g. since we can always include the current time index in the observation definition.