
Environment Design for Reinforcement Learning. Environment design and curriculum learning for RL aim to design a sequence of environments with increasing difficulty to improve the training of an autonomous agent (Narvekar et al., 2020). However, in contrast to our problem setup, observations in generated training environments are cheap, since collecting them only involves actions from an autonomous agent, not a human expert. As such, approaches like domain randomisation (Tobin et al., 2017; Akkaya et al., 2019) can be practical for RL, whereas they can be extremely inefficient and wasteful in an IRL setting. Moreover, in IRL we typically work with a handful of rounds only, so that slowly improving the environment generation process over thousands of training episodes (i.e., rounds) is impractical (Dennis et al., 2020; Gur et al., 2021). As a result, most methods that are viable for RL can be expected to be unsuitable for the IRL problem.
3. Problem Formulation
We now formally introduce the Environment Design for
Inverse Reinforcement Learning framework. A Markov Decision Process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma, \omega)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $T \colon \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is a transition function, $R \colon \mathcal{S} \to \mathbb{R}$ is a reward function, $\gamma$ a discount factor, and $\omega$ an initial state distribution. We assume that there is a set of transition functions $\mathcal{T}$ from which $T$ can be selected. Similar models have been considered for the RL problem under the name of Underspecified MDPs (Dennis et al., 2020) or Configurable MDPs (Metelli et al., 2018; Ramponi et al., 2021).
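To fix ideas, these objects admit a simple tabular instantiation; the sketch below is illustrative only (its names and array layout are ours, not the paper's).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TabularMDP:
    """Finite MDP (S, A, T, R, gamma, omega) with array-based components."""
    T: np.ndarray      # transition function, shape (S, A, S); each T[s, a] sums to 1
    R: np.ndarray      # state-based reward function, shape (S,)
    gamma: float       # discount factor
    omega: np.ndarray  # initial state distribution, shape (S,)

# The environment set of an underspecified/configurable MDP is then simply a
# collection of admissible transition functions over shared state/action spaces.
EnvironmentSet = list[np.ndarray]  # each element has shape (S, A, S)
```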
We assume that the true reward function, denoted $R^\star$, is unknown to the learner and consider the situation where the learner gets to interact with the human expert in a sequence of $m$ rounds.¹ More precisely, every round $k \in [m]$, the learner gets to select a demo environment $T_k \in \mathcal{T}$ for which an expert trajectory $\tau_k$ is observed. Our objective is to adaptively select a sequence of demo environments $T_1, \dots, T_m$ so as to recover a robust estimate of the unknown reward function. We describe the general framework for this interaction between learner and human expert in Framework 1. To summarise, a problem-instance in our setting is given by $(\mathcal{S}, \mathcal{A}, \mathcal{T}, R^\star, \gamma, \omega, m)$, where $\mathcal{T}$ is a set of environments, $R^\star$ is the unknown reward function, and $m$ the learner's budget.
From Framework 1 we see that the Environment Design for
IRL problem has two main ingredients: a) choosing useful
demo environments for the human to demonstrate the task
in (Section 4), and b) inferring the reward function from
expert demonstrations in multiple environments (Section 5).
¹ Typically, expert demonstrations are a limited resource as they involve expensive human input. We thus consider a limited budget of $m$ expert trajectories that the learner is able to obtain.
Framework 1 Environment Design for IRL
1: input set of environments $\mathcal{T}$, resources $m \in \mathbb{N}$
2: for $k = 1, \dots, m$ do
3:   Choose an environment $T_k \in \mathcal{T}$
4:   Observe expert trajectory $\tau_k$ in environment $T_k$
5:   Estimate rewards from observations up to round $k$
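For concreteness, the interaction loop of Framework 1 can be sketched as follows; choose_environment, query_expert, and update_posterior are hypothetical placeholders for the environment-selection rule (Section 4), the human expert, and the reward-inference step (Section 5), not the paper's API.

```python
def environment_design_for_irl(env_set, m, prior,
                               choose_environment, query_expert, update_posterior):
    """Illustrative sketch of Framework 1 with placeholder callables."""
    belief = prior                     # current belief P(.) over reward functions
    observations = []                  # D_{1:k} = ((tau_1, T_1), ..., (tau_k, T_k))
    for k in range(1, m + 1):
        T_k = choose_environment(env_set, belief)        # pick demo environment T_k in T
        tau_k = query_expert(T_k)                        # expert trajectory tau_k in T_k
        observations.append((tau_k, T_k))                # record D_k = (tau_k, T_k)
        belief = update_posterior(belief, observations)  # posterior P(. | D_{1:k})
    return belief
```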
3.1. Preliminaries and Notation
Throughout the paper, $R$ denotes a generic reward function, whereas $R^\star$ refers to the true (unknown) reward function. We let $\Pi$ denote a generic policy space. Now, $V^\pi_{R,T}(s) := \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi, T, s_0 = s\right]$ is the expected discounted return, i.e., value function, of a policy $\pi$ under some reward function $R$ and transition function $T$ in state $s$. For the value under the initial state distribution $\omega$, we then merely write $V^\pi_{R,T} := \mathbb{E}_{s \sim \omega}[V^\pi_{R,T}(s)]$ and denote its maximum by $V^*_{R,T} := \max_\pi V^\pi_{R,T}$. We accordingly refer to the $Q$-values under a policy $\pi$ by $Q^\pi_{R,T}(s, a)$ and their optimal values by $Q^*_{R,T}(s, a)$. In the following, we let $\pi^*_{R,T}$ always denote the optimal policy w.r.t. $R$ and $T$, i.e., the policy maximising the expected discounted return in the MDP $(\mathcal{S}, \mathcal{A}, T, R, \gamma, \omega)$.
In the following, we let $\tau$ denote expert trajectories. Note that in Framework 1 every such trajectory is generated w.r.t. some transition dynamics $T$. In round $k$, we thus observe $D_k = (\tau_k, T_k)$, i.e., the expert trajectory $\tau_k$ in environment $T_k$. We then write $D_{1:k} = (D_1, \dots, D_k)$ for all observations up to (and including) the $k$-th round. Moreover, we let $P(\cdot \mid D_{1:k})$ denote the posterior over reward functions given observations $D_{1:k}$. For the prior $P(\cdot)$, we introduce the convention that $P(\cdot) = P(\cdot \mid D_{1:0})$.
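To illustrate the value functions defined above, the following sketch (ours, not part of the paper) computes $V^\pi_{R,T}(s)$ by solving the linear Bellman equations and $V^*_{R,T}(s)$ by value iteration for a tabular MDP with state-based rewards.

```python
import numpy as np

def policy_value(T, R, gamma, policy):
    """V^pi_{R,T}(s) for all s: solves (I - gamma * P_pi) V = R, with
    T of shape (S, A, S), R of shape (S,), policy of shape (S, A)."""
    P_pi = np.einsum("sa,sat->st", policy, T)        # state-to-state kernel under pi
    return np.linalg.solve(np.eye(R.shape[0]) - gamma * P_pi, R)

def optimal_value(T, R, gamma, tol=1e-8, max_iter=10_000):
    """V^*_{R,T}(s) for all s via value iteration."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        Q = R[:, None] + gamma * np.einsum("sat,t->sa", T, V)  # Q-values Q(s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V

def value_under_omega(V, omega):
    """Scalar value E_{s ~ omega}[V(s)] under the initial state distribution."""
    return float(omega @ V)
```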
4. Environment Design via Maximin Regret
Our goal is to adaptively select demo environments for the
expert based on our current belief about the reward function.
In Section 4.1, we introduce a maximin Bayesian regret
objective for the environment design process which aims to
select demo environments so as to ensure that our reward
estimate is robust. Section 4.2 then deals with the selection
of such environments when the set of environments exhibits
a useful decomposable structure. We additionally provide a
way to approximate the process when the set has an arbitrary
structure or is challenging to construct.
4.1. Maximin Bayesian Regret
We begin by reflecting on the potential loss of an agent when deploying a policy $\pi$ under transition function $T$ and the true reward function $R^\star$, given by the difference
$$\ell_{R^\star}(T, \pi) := V^*_{R^\star, T} - V^\pi_{R^\star, T}.$$
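For a tabular MDP, this loss can be evaluated directly; the sketch below is illustrative and assumes the hypothetical policy_value, optimal_value, and value_under_omega helpers sketched in Section 3.1, applied to whichever reward function $R$ (in particular $R^\star$) the loss is taken with respect to.

```python
def deployment_loss(T, R, gamma, omega, policy):
    """ell_R(T, pi) = V^*_{R,T} - V^pi_{R,T}, both evaluated under the initial
    state distribution omega (reuses the tabular sketches from Section 3.1)."""
    v_star = value_under_omega(optimal_value(T, R, gamma), omega)
    v_pi = value_under_omega(policy_value(T, R, gamma, policy), omega)
    return v_star - v_pi
```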