Environment Design for Inverse Reinforcement Learning

Thomas Kleine Buening¹*   Victor Villin²*   Christos Dimitrakakis²
Abstract
Learning a reward function from demonstrations suffers from low sample-efficiency. Even with abundant data, current inverse reinforcement learning methods that focus on learning from a single environment can fail to handle slight changes in the environment dynamics. We tackle these challenges through adaptive environment design. In our framework, the learner repeatedly interacts with the expert, with the former selecting environments to identify the reward function as quickly as possible from the expert's demonstrations in said environments. This results in improvements in both sample-efficiency and robustness, as we show experimentally, for both exact and approximate inference.
1. Introduction
Reinforcement Learning (RL) is a powerful framework for autonomous decision-making in games (Mnih et al., 2015), continuous control problems (Lillicrap et al., 2015), and robotics (Levine et al., 2016). However, specifying suitable reward functions remains one of the main barriers to the wider application of RL in real-world settings, and methods that allow us to communicate tasks without manually defining reward functions could be of great practical value. One such approach is Inverse Reinforcement Learning (IRL), which aims to find a reward function that explains observed (human) behaviour (Russell, 1998; Ng & Russell, 2000).
Much of the recent effort in IRL has been devoted to making existing methods more sample-efficient as well as robust to changes in the environment dynamics (Arora & Doshi, 2021). Sample-efficiency is crucial, as data requires expensive human input. We also need robust estimates of the unknown reward function, so that the resulting policies remain near-optimal even when the deployed environment dynamics differ from the ones we learned from.

*Equal contribution. ¹The Alan Turing Institute, London, UK. ²Université de Neuchâtel, Neuchâtel, Switzerland. Correspondence to: Thomas Kleine Buening <tbuening@turing.ac.uk>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).
However, recent work has found that IRL methods tend to heavily specialise ("overfit") to the specific transition dynamics under which the demonstrations were provided, thereby failing to generalise even across minor changes in the environment (Toyer et al., 2020). More generally, even with unlimited access to expert demonstrations, we may still fail to learn suitable reward functions from a fixed environment. In particular, prior work has explored the identifiability problem in IRL (Cao et al., 2021; Kim et al., 2021), illustrating the inherent limitations of IRL when learning from expert demonstrations in a single, fixed environment.

In our study, we consider the situation where we can design a sequence of environments in which the expert will demonstrate the task. This can either mean slightly modifying a base environment, or selecting an environment from a finite set. Crafting new environments can involve simple adjustments such as relocating objects or adding obstacles, which can be done with little effort and cost. In open-world settings, simply having different task conditions (e.g., different cars, locations, or time-of-day in a vehicle scenario) amounts to a different environment.
Contributions. We propose algorithms for designing environments in order to infer human goals from demonstrations. This requires two key components: firstly, an environment design algorithm and, secondly, an inference algorithm for data from multiple environment dynamics. Our hypothesis is that intelligent environment design can significantly improve both the sample-efficiency of IRL methods and the robustness of learned rewards against variations in the environment dynamics. An example where our approach is applicable is given in Figure 1, where we need to learn the reward function (i.e., the location of the goal and lava squares). In summary, our contributions are:
1. An environment design framework that selects informative demo environments for the expert (Section 3).

2. An objective based on maximin Bayesian regret to choose environments in a way that compels the expert to provide useful information about the unknown reward function (Section 4).

3. An extension of Bayesian IRL and Maximum Entropy IRL to multiple environments (Section 5). We provide concrete implementations for the extensions of MCMC Bayesian IRL (Ramachandran & Amir, 2007) and Adversarial IRL (AIRL) (Fu et al., 2018).

4. We conduct extensive experiments to evaluate our approaches (Section 6). We test learned reward functions in unknown transition dynamics across various environments, including continuous mazes and MuJoCo benchmarks (Todorov et al., 2012). We compare against several other IRL and imitation learning algorithms, such as Robust Imitation learning with Multiple perturbed Environments (RIME) (Chae et al., 2022).

5. Our results illustrate the superior robustness of our algorithms and the effectiveness of the environment design framework. This shows that active environment selection significantly improves both the sample-efficiency of IRL and the robustness of learned rewards (generalisability).

Figure 1 ((a) 1st round, (b) 2nd round, (c) 3rd round): The expert navigates to the closest of the three possible goal squares while avoiding lava in adaptively selected maze environments. For three consecutive rounds (a)-(c), we display the mazes chosen by ED-BIRL (Algorithm 2 in Section 5) as well as the current reward estimate after observing an expert trajectory in the current and past mazes. By adaptively designing environments and combining the expert demonstrations, we can recover the locations of all goal and most lava squares. In contrast, from observations in a fixed environment, e.g., repeatedly observing the expert in maze (a), it would be impossible to recover all relevant aspects of the reward function, i.e., the location of the goal squares, as only the nearest goal square would be visited by the expert (repeatedly). Observing the human expert in new and carefully curated environments can lead to a more precise and robust estimate of the unknown reward function.
2. Related Work
Inverse Reinforcement Learning. The goal of IRL (Russell, 1998; Ng & Russell, 2000) is to find a reward function that explains observed behaviour, which is assumed to be approximately optimal. Two of the most popular approaches to the IRL problem are Bayesian IRL (Ramachandran & Amir, 2007; Rothkopf & Dimitrakakis, 2011; Choi & Kim, 2011) and Maximum Entropy IRL (Ziebart et al., 2008; Ho & Ermon, 2016; Finn et al., 2016). In this work, we extend both IRL formulations to demonstrations under varying environment dynamics. Note that this differs from the situation where we observe demonstrations by experts of varying quality (Castro et al., 2019), or demonstrations by experts that optimise different rewards (Ramponi et al., 2020; Likmeta et al., 2021), in a fixed environment. Moreover, Cao et al. (2021) and Rolland et al. (2022) study the identifiability of the true reward function in IRL and show that, when observing experts under different environment dynamics, the true reward function can be identified up to a constant under certain conditions. However, it is important to note that in all of these cases the learner is passive and does not actively seek information about the reward function by choosing specific experts or environments.
Active Inverse Reinforcement Learning. The environment design problem that we consider in this paper can be viewed as one of active reward elicitation (Lopes et al., 2009). Prior work on active reward learning has focused on querying the expert for additional demonstrations in specific states (Lopes et al., 2009; Brown et al., 2018; Lindner et al., 2021; 2022), mainly with the goal of resolving the uncertainty that is due to the expert's policy not being specified accurately in these states. In contrast, we consider the situation where we cannot directly query the expert for additional information in specific states, but instead sequentially choose environments for the expert to act in. Importantly, this means that the same state can be visited under different transition dynamics, which can be crucial to distinguish the true reward function among multiple plausible candidates (Cao et al., 2021; Rolland et al., 2022).

In other related work, Amin et al. (2017) consider a repeated IRL setting in which the learner can choose any task for the expert to complete (with full information of the expert policy). He & Dragan (2021) study an iterative reward design setup where a human provides the learner with a proxy reward function, upon which the learner tries to choose an edge-case environment in which the proxy fails so that the human revises their proxy. In a similar vein, Buening et al. (2022) introduced Interactive IRL, where the learner interacts with a human in a collaborative Stackelberg game without knowledge of the joint reward function. This setting is similar to the framework presented in this paper in that the leader in a Stackelberg game can be viewed as designing environments by committing to specific policies.
Environment Design for Reinforcement Learning. Environment design and curriculum learning for RL aim to design a sequence of environments with increasing difficulty to improve the training of an autonomous agent (Narvekar et al., 2020). However, in contrast to our problem setup, observations in generated training environments are cheap, since this only involves actions from an autonomous agent, not a human expert. As such, approaches like domain randomisation (Tobin et al., 2017; Akkaya et al., 2019) can be practical for RL, whereas they can be extremely inefficient and wasteful in an IRL setting. Moreover, in IRL we typically work with a handful of rounds only, so that slowly improving the environment generation process over thousands of training episodes (i.e., rounds) is impractical (Dennis et al., 2020; Gur et al., 2021). As a result, most methods that are viable for RL can be expected to be unsuitable for the IRL problem.
3. Problem Formulation
We now formally introduce the Environment Design for Inverse Reinforcement Learning framework. A Markov Decision Process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma, \omega)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $T \colon \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is a transition function, $R \colon \mathcal{S} \to \mathbb{R}$ is a reward function, $\gamma$ a discount factor, and $\omega$ an initial state distribution. We assume that there is a set of transition functions $\mathcal{T}$ from which $T$ can be selected. Similar models have been considered for the RL problem under the name of Underspecified MDPs (Dennis et al., 2020) or Configurable MDPs (Metelli et al., 2018; Ramponi et al., 2021).
We assume that the true reward function, denoted $R^*$, is unknown to the learner and consider the situation where the learner gets to interact with the human expert in a sequence of $m$ rounds.¹ More precisely, every round $k \in [m]$, the learner gets to select a demo environment $T_k \in \mathcal{T}$ for which an expert trajectory $\tau_k$ is observed. Our objective is to adaptively select a sequence of demo environments $T_1, \ldots, T_m$ so as to recover a robust estimate of the unknown reward function. We describe the general framework for this interaction between learner and human expert in Framework 1. To summarise, a problem instance in our setting is given by $(\mathcal{S}, \mathcal{A}, \mathcal{T}, R^*, \gamma, \omega, m)$, where $\mathcal{T}$ is a set of environments, $R^*$ is the unknown reward function, and $m$ the learner's budget.
From Framework 1 we see that the Environment Design for IRL problem has two main ingredients: a) choosing useful demo environments for the human to demonstrate the task in (Section 4), and b) inferring the reward function from expert demonstrations in multiple environments (Section 5).
¹Typically, expert demonstrations are a limited resource, as they involve expensive human input. We thus consider a limited budget of $m$ expert trajectories that the learner is able to obtain.
Framework 1 Environment Design for IRL
1: input set of environments $\mathcal{T}$, resources $m \in \mathbb{N}$
2: for $k = 1, \ldots, m$ do
3:   Choose an environment $T_k \in \mathcal{T}$
4:   Observe expert trajectory $\tau_k$ in environment $T_k$
5:   Estimate rewards from observations up to round $k$
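To make the interaction loop concrete, here is a minimal Python sketch of Framework 1. The callables `select_environment`, `query_expert`, and `estimate_rewards` are placeholders for the components developed in Sections 4 and 5; their names and signatures are our own assumptions, not code from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Trajectory = List[Tuple[int, int]]        # (state, action) pairs
Observation = Tuple[Trajectory, object]   # D_k = (tau_k, T_k)


def environment_design_for_irl(
    environments: Sequence[object],
    budget: int,
    select_environment: Callable[[Sequence[object], List[Observation]], object],
    query_expert: Callable[[object], Trajectory],
    estimate_rewards: Callable[[List[Observation]], object],
):
    """Generic interaction loop of Framework 1 (sketch)."""
    observations: List[Observation] = []
    reward_estimate = None
    for _ in range(budget):                                     # rounds k = 1, ..., m
        T_k = select_environment(environments, observations)    # line 3
        tau_k = query_expert(T_k)                                # line 4
        observations.append((tau_k, T_k))                        # D_k = (tau_k, T_k)
        reward_estimate = estimate_rewards(observations)         # line 5
    return reward_estimate, observations
```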
3.1. Preliminaries and Notation
Throughout the paper, $R$ denotes a generic reward function, whereas $R^*$ refers to the true (unknown) reward function. We let $\Pi$ denote a generic policy space. Now, $V^{\pi}_{R,T}(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi, T, s_0 = s\right]$ is the expected discounted return, i.e., value function, of a policy $\pi$ under some reward function $R$ and transition function $T$ in state $s$. For the value under the initial state distribution $\omega$, we then merely write $V^{\pi}_{R,T} := \mathbb{E}_{s \sim \omega}[V^{\pi}_{R,T}(s)]$ and denote its maximum by $V^*_{R,T} := \max_{\pi} V^{\pi}_{R,T}$. We accordingly refer to the $Q$-values under a policy $\pi$ by $Q^{\pi}_{R,T}(s,a)$ and their optimal values by $Q^*_{R,T}(s,a)$. In the following, we let $\pi^*_{R,T}$ always denote the optimal policy w.r.t. $R$ and $T$, i.e., the policy maximising the expected discounted return in the MDP $(\mathcal{S}, \mathcal{A}, T, R, \gamma, \omega)$.
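As a concrete illustration of these definitions, the following sketch evaluates $V^{\pi}_{R,T}$ and $Q^{\pi}_{R,T}$ for a tabular MDP by solving the linear system $V = R + \gamma T_\pi V$. The array layout and function names are our own assumptions for illustration, not code from the paper.

```python
import numpy as np


def policy_value(T, R, policy, gamma, omega):
    """V^pi_{R,T}: expected discounted return of `policy` under dynamics T and reward R.

    T:      (S, A, S) transition tensor, T[s, a, s'] = P(s' | s, a)
    R:      (S,) state rewards
    policy: (S, A) action probabilities
    gamma:  discount factor in [0, 1)
    omega:  (S,) initial state distribution
    """
    S = R.shape[0]
    # State-to-state transition matrix induced by the policy: T_pi[s, s'].
    T_pi = np.einsum("sa,sap->sp", policy, T)
    # Solve (I - gamma * T_pi) V = R for the state-value vector V(s).
    V = np.linalg.solve(np.eye(S) - gamma * T_pi, R)
    return float(omega @ V), V   # scalar V^pi_{R,T} and per-state values


def q_values(T, R, V, gamma):
    """Q-values: Q(s, a) = R(s) + gamma * sum_{s'} T(s'|s,a) V(s')."""
    return R[:, None] + gamma * np.einsum("sap,p->sa", T, V)
```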
In the following, we let $\tau$ denote expert trajectories. Note that in Framework 1 every such trajectory is generated w.r.t. some transition dynamics $T$. In round $k$, we thus observe $\mathcal{D}_k = (\tau_k, T_k)$, i.e., the expert trajectory $\tau_k$ in environment $T_k$. We then write $\mathcal{D}_{1:k} = (\mathcal{D}_1, \ldots, \mathcal{D}_k)$ for all observations up to (and including) the $k$-th round. Moreover, we let $P(\cdot \mid \mathcal{D}_{1:k})$ denote the posterior over reward functions given observations $\mathcal{D}_{1:k}$. For the prior $P(\cdot)$, we introduce the convention that $P(\cdot) = P(\cdot \mid \mathcal{D}_{1:0})$.
4. Environment Design via Maximin Regret
Our goal is to adaptively select demo environments for the expert based on our current belief about the reward function. In Section 4.1, we introduce a maximin Bayesian regret objective for the environment design process, which aims to select demo environments so as to ensure that our reward estimate is robust. Section 4.2 then deals with the selection of such environments when the set of environments exhibits a useful decomposable structure. We additionally provide a way to approximate the process when the set has an arbitrary structure or is challenging to construct.
4.1. Maximin Bayesian Regret
We begin by reflecting on the potential loss of an agent when deploying a policy $\pi$ under transition function $T$ and the true reward function $R^*$, given by the difference

$$\mathcal{R}_{R^*}(T, \pi) := V^*_{R^*,T} - V^{\pi}_{R^*,T}.$$
The reward function $R^*$ is unknown to us, so that we can instead use our belief $P$ over reward functions² and consider the Bayesian regret (i.e., loss) of a policy $\pi$ under $T$ and $P$, given by

$$\mathrm{BR}_P(T, \pi) := \mathbb{E}_{R \sim P}\big[\mathcal{R}_R(T, \pi)\big].$$

The concept of Bayesian regret is well known from, e.g., online optimisation and online learning (Russo & Van Roy, 2014) and has been utilised for IRL in a slightly different form by Brown et al. (2018). The idea is that, given a (prior) belief about some parameter, we evaluate our policy against an oracle that knows the true parameter. Typically, under such uncertainty about the true parameter (in our case, the reward function), we are interested in policies minimising the Bayesian regret:

$$\min_{\pi \in \Pi} \mathrm{BR}_P(T, \pi).$$
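With the belief represented by posterior samples, the Bayesian regret of a policy can be estimated directly from its definition. The sketch below assumes helper routines `optimal_value` (e.g., value iteration) and `policy_value` (such as the earlier sketch); the names and signatures are ours, not the paper's.

```python
def bayesian_regret(T, policy, reward_samples, gamma, omega, optimal_value, policy_value):
    """Monte Carlo estimate of BR_P(T, pi) = E_{R ~ P}[V*_{R,T} - V^pi_{R,T}].

    reward_samples: reward functions drawn from the (empirical) belief P.
    optimal_value:  callable returning V*_{R,T} under the initial distribution omega.
    policy_value:   callable returning (V^pi_{R,T}, per-state values) under omega.
    """
    regrets = []
    for R in reward_samples:
        v_star = optimal_value(R, T, gamma, omega)
        v_pi, _ = policy_value(T, R, policy, gamma, omega)
        regrets.append(v_star - v_pi)
    return sum(regrets) / len(regrets)
```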
To derive an objective for the environment design problem, we then consider the maximin problem given by the worst-case environment $T$ for our current belief over reward functions $P$:³

$$\max_{T \in \mathcal{T}} \min_{\pi \in \Pi} \mathrm{BR}_P(T, \pi). \qquad (1)$$

What this means is that we search for an environment $T \in \mathcal{T}$ such that the regret-minimising policy w.r.t. $P$ performs the worst compared to the optimal policies w.r.t. the reward candidates $R \sim P$. In other words, the maximin environment $T$ from (1) can be viewed as the environment in which we expect our current reward estimate to perform the worst. Choosing environments for the expert according to (1) also has the advantage that maximin environments are in general solvable for the expert, since the regret in degenerate or purely adversarial environments will be close to zero. Moreover, the regret objective is performance-based and not only uncertainty-based, such as entropy-based objectives (Lopes et al., 2009). This is typically desired, as reducing our uncertainty about the rewards in states that are not relevant under any transition function in $\mathcal{T}$ (e.g., states that are not being visited by any optimal policy) is unnecessary and generally a wasteful use of our budget. Finally, we also see that if the Bayesian regret objective becomes zero, the posterior mean is guaranteed to be optimal in every demo environment.
Lemma 4.1. If for some posterior $P(\cdot \mid \mathcal{D})$ we have $\max_{T \in \mathcal{T}} \min_{\pi \in \Pi} \mathrm{BR}_P(T, \pi) = 0$, then the posterior mean $\bar{R} = \mathbb{E}_P[R]$ is optimal for every $T \in \mathcal{T}$, i.e., $\bar{R}$ induces an optimal policy in every environment contained in $\mathcal{T}$.
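When $\mathcal{T}$ and the candidate policies are small finite sets, objective (1) can be approximated by brute force, reusing the hypothetical `bayesian_regret` helper sketched above. This is a naive illustration for intuition only; Section 4.2 discusses how to exploit structure instead.

```python
def maximin_environment(environments, candidate_policies, reward_samples,
                        gamma, omega, optimal_value, policy_value):
    """Brute-force approximation of max_{T in T} min_{pi in Pi} BR_P(T, pi) (sketch)."""
    best_T, best_regret = None, float("-inf")
    for T in environments:
        # Inner minimisation: the regret-minimising policy for this environment.
        inner = min(
            bayesian_regret(T, pi, reward_samples, gamma, omega,
                            optimal_value, policy_value)
            for pi in candidate_policies
        )
        # Outer maximisation: keep the environment where even that policy does worst.
        if inner > best_regret:
            best_T, best_regret = T, inner
    return best_T, best_regret
```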
²When we do not have a posterior over rewards, it is still possible to build a pseudo-belief upon point estimates. This approach is explained later in Section 5.2.

³We consider $\max_T \min_\pi$ and not the reverse, as we are interested in the maximin environment (and not the minimax policy).
It is worth noting that our maximin Bayesian regret objective resembles several approaches to robust reinforcement learning (e.g., Roy et al., 2017; Zhou et al., 2021; Buening et al., 2023; Zhou et al., 2024). However, it differs in that we are interested in the maximin environment (not the minimax policy) and the Bayesian regret is defined w.r.t. a set of environments $\mathcal{T}$ and a distribution over reward functions.
4.2. Finding Maximin Environments
Structured Environments. Often the set of environments has a useful structure that can be exploited to search the space of environments efficiently. We here consider the special case where each environment $T \in \mathcal{T}$ is built from a collection of transition matrices $T_s$. Similar setups can be found in the robust dynamic programming literature (e.g., Iyengar, 2005; Nilim & El Ghaoui, 2005; Xu & Mannor, 2010; Mannor et al., 2016).

Let $T_s \in \mathbb{R}^{S \times A}$ denote a state-transition matrix dictating the transition probabilities in state $s$. We can identify any transition function $T$ with a family of state-transition matrices $\{T_s\}_{s \in \mathcal{S}}$. We then say that an environment set $\mathcal{T}$ allows us to make state-individual transition choices if there exist sets $\mathcal{T}_s$ such that $\mathcal{T} = \{\{T_s\}_{s \in \mathcal{S}} : T_s \in \mathcal{T}_s\}$. In other words, we can choose a new environment $T$ by arbitrarily combining transition matrices for each state. Note that this of course allows for the case where the transitions in some state $s$ are fixed, i.e., $\mathcal{T}_s = \{T_s\}$. When we can make such state-individual transition choices, we can use an extended value iteration approach to approximate the maximin environment efficiently. The extended value iteration algorithm is specified in Appendix B, Algorithm 4.
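Purely to illustrate the state-individual choice mechanism, here is a robust-dynamic-programming-style value iteration that, for a fixed reward function, selects in every state the candidate transition matrix maximising the achievable backup. The tabular layout and the specific selection rule are our own assumptions; this is not a reproduction of Algorithm 4.

```python
import numpy as np


def extended_value_iteration(state_choices, R, gamma, n_iters=1000, tol=1e-8):
    """Value iteration with state-individual transition choices (illustrative sketch).

    state_choices[s]: list of candidate transition matrices T_s for state s,
                      each of shape (A, S) with rows summing to one.
    R:                (S,) state rewards; gamma: discount factor.
    Returns the value estimate and, per state, the index of the chosen T_s.
    """
    S = len(state_choices)
    V = np.zeros(S)
    chosen = [0] * S
    for _ in range(n_iters):
        V_new = np.empty(S)
        for s, candidates in enumerate(state_choices):
            # Best achievable backup per candidate: max_a [R(s) + gamma * T_s[a] @ V].
            backups = [R[s] + gamma * np.max(Ts @ V) for Ts in candidates]
            chosen[s] = int(np.argmax(backups))   # state-individual transition choice
            V_new[s] = backups[chosen[s]]
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, chosen
```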
Arbitrary Environments. In some situations, the set of environments $\mathcal{T}$ may not exhibit any useful structure. Moreover, we may not even have explicit knowledge of the transition functions in $\mathcal{T}$, but can only access a set of corresponding simulators. In this case, we are left with approximating the maximin environment (1) by sampling simulators from $\mathcal{T}$ and performing policy rollouts. We describe the complete procedure in Appendix B, Algorithm 5.
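A possible Monte Carlo sketch of this idea, assuming each element of $\mathcal{T}$ exposes a Gym-style `reset`/`step` interface and that an RL subroutine `solve_policy` is available; these interfaces are our assumptions, and the sketch simplifies Algorithm 5.

```python
import numpy as np


def rollout_return(simulator, policy, reward_fn, gamma, horizon=200):
    """Discounted return of one rollout of `policy` in `simulator` under `reward_fn`."""
    s = simulator.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, _, done, *_ = simulator.step(a)
        total += discount * reward_fn(s)
        discount *= gamma
        if done:
            break
    return total


def estimated_regret(simulator, policy, reward_samples, solve_policy, gamma, n_rollouts=20):
    """Rollout-based estimate of the Bayesian regret of `policy` in one simulator."""
    regrets = []
    for R in reward_samples:
        pi_star = solve_policy(simulator, R)   # approximately optimal policy for R
        v_star = np.mean([rollout_return(simulator, pi_star, R, gamma)
                          for _ in range(n_rollouts)])
        v_pi = np.mean([rollout_return(simulator, policy, R, gamma)
                        for _ in range(n_rollouts)])
        regrets.append(v_star - v_pi)
    return float(np.mean(regrets))
```

The maximin environment would then be approximated by sampling simulators from $\mathcal{T}$ and keeping the one in which the regret-minimising policy still incurs the largest estimated regret.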
Flexible Environment Set Construction. Although our assumption initially considers the simplest scenario where any environment within $\mathcal{T}$ is accessible at any time, this may become impractical when the process of building environments is labour-intensive. Instead of probing the entire set $\mathcal{T}$ at each environment design step, our framework allows some flexibility. With little approximation in Framework 1, Line 3, we can select the next environment from a new subset of $\mathcal{T}$. This allows users to progressively build new environments with every additionally desired round.
Algorithm 2 ED-BIRL: Environment Design for BIRL
1: input environments $\mathcal{T}$, prior $P$, budget $m \in \mathbb{N}$
2: for $k = 1, \ldots, m$ do
3:   Sample rewards from $P(\cdot \mid \mathcal{D}_{1:k-1})$ using BIRL
4:   Construct empirical distribution $\hat{P}_{k-1}$ from samples
5:   Find $T_k \in \arg\max_{T} \min_{\pi} \mathrm{BR}_{\hat{P}_{k-1}}(T, \pi)$
6:   Observe trajectory $\tau_k$ in $T_k$, i.e., $\mathcal{D}_k = (\tau_k, T_k)$
7: return BIRL($\mathcal{D}_{1:m}$)
5. Inverse Reinforcement Learning with Multiple Environments
We now explain how to learn about the reward function from demonstrations that were provided under multiple environment dynamics. To do so, we will extend Bayesian and MaxEnt IRL methods to this setting, and combine them with environment design to obtain the ED-BIRL and ED-AIRL algorithms, respectively. While ED-BIRL is designed for simple tabular problems due to its high complexity, ED-AIRL can be applied to environments with large or continuous action/observation spaces.
5.1. The Bayesian Setting: ED-BIRL
The Bayesian perspective on the IRL problem provides a principled way to reason about reward uncertainty (Ramachandran & Amir, 2007). Typically, the human is modelled by a Boltzmann-rational policy (Jeon et al., 2020). This means that for a given reward function $R$ and transition function $T$, the expert acts according to a softmax policy

$$\pi^{\text{softmax}}_{R,T}(a \mid s) = \frac{\exp\big(c\, Q^*_{R,T}(s,a)\big)}{\sum_{a'} \exp\big(c\, Q^*_{R,T}(s,a')\big)},$$

where the parameter $c$ relates to our judgement of the expert's optimality.⁴ Given a prior distribution $P(\cdot)$, the goal of Bayesian IRL is to recover the posterior distribution $P(\cdot \mid \mathcal{D})$ and to either sample from the posterior using MCMC (Ramachandran & Amir, 2007; Rothkopf & Dimitrakakis, 2011) or perform MAP estimation (Choi & Kim, 2011).

In our case, the data is given by the sequence $\mathcal{D}_{1:k} = (\mathcal{D}_1, \ldots, \mathcal{D}_k)$ with $\mathcal{D}_i = (\tau_i, T_i)$. We see that this is no obstacle, as the likelihood factorises as $P(\mathcal{D}_{1:k} \mid R) = \prod_{i \le k} P(\tau_i \mid R, T_i)$, since the expert trajectories (i.e., expert policies) are conditionally independent given the reward function and transition function. The likelihood of each expert demonstration is then given by $P(\tau_i \mid R, T_i) = \prod_{(s,a) \in \tau_i} \pi^{\text{softmax}}_{R,T_i}(a \mid s)$. Hence, the reward posterior can be expressed as

$$P(R \mid \mathcal{D}_{1:k}) \propto \prod_{i \le k} \prod_{(s,a) \in \tau_i} \pi^{\text{softmax}}_{R,T_i}(a \mid s) \cdot P(R). \qquad (2)$$
⁴Note that when using MCMC Bayesian IRL methods, we can also perform inference over the parameter $c$ and need not assume knowledge of the expert's optimality.
Algorithm 3 ED-AIRL: Environment Design for AIRL
1: input environments $\mathcal{T}$, budget $m \in \mathbb{N}$
2: for $k = 1, \ldots, m$ do
3:   if $k = 1$ then
4:     Choose $T_k$ arbitrarily from $\mathcal{T}$
5:   else
6:     Let $\hat{P}_{k-1} \equiv \mathcal{U}(\{R_1, \ldots, R_{k-1}\})$
7:     Find $T_k \in \arg\max_{T} \mathrm{BR}_{\hat{P}_{k-1}}(T, \pi^*_{R_{1:k-1},T})$
8:   Observe trajectory $\tau_k$ in $T_k$, i.e., $\mathcal{D}_k = (\tau_k, T_k)$
9:   Compute point estimate $R_k = \text{AIRL}(\mathcal{D}_k)$
10:  Compute best guess $R_{1:k} = \text{AIRL-ME}(\mathcal{D}_{1:k})$
11: return AIRL-ME($\mathcal{D}_{1:m}$)
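A compact Python rendering of this loop, where `airl`, `airl_me`, `solve_policy`, `regret_of`, and `query_expert` stand in for the corresponding subroutines; the names and signatures are ours, chosen for illustration.

```python
import random


def ed_airl(environments, budget, airl, airl_me, solve_policy, regret_of, query_expert):
    """Sketch of Algorithm 3 (ED-AIRL)."""
    observations, point_estimates = [], []
    best_guess = None
    for k in range(1, budget + 1):
        if k == 1:
            T_k = random.choice(list(environments))        # line 4: arbitrary first environment
        else:
            # Lines 6-7: pseudo-belief uniform over R_1, ..., R_{k-1};
            # evaluate the optimal policy for the current best guess in each T.
            def avg_regret(T):
                pi = solve_policy(T, best_guess)            # pi*_{R_{1:k-1}, T}
                return sum(regret_of(T, pi, R) for R in point_estimates) / len(point_estimates)
            T_k = max(environments, key=avg_regret)
        tau_k = query_expert(T_k)                           # line 8
        observations.append((tau_k, T_k))
        point_estimates.append(airl([observations[-1]]))    # line 9: R_k = AIRL(D_k)
        best_guess = airl_me(observations)                  # line 10: R_{1:k} = AIRL-ME(D_{1:k})
    return airl_me(observations)                            # line 11
```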
As a result, we can, for instance, sample from the posterior using the Policy-Walk algorithm from Ramachandran & Amir (2007) with minor modifications, or the Metropolis-Hastings Simplex-Walk algorithm from Buening et al. (2022). We generally denote any Bayesian IRL algorithm that is capable of sampling from the posterior by BIRL. Plugging BIRL into our environment design framework, we get the ED-BIRL procedure detailed in Algorithm 2. Note that, in practice, we approximate the posterior $P(\cdot \mid \mathcal{D}_{1:k})$ by sampling rewards and constructing an empirical distribution $\hat{P}_k$.
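Putting the pieces together, a minimal sketch of ED-BIRL, assuming a `birl_sample` routine that returns posterior reward samples and a `maximin_environment` routine in the spirit of the one sketched in Section 4; the signatures here are simplified and are our own.

```python
def ed_birl(environments, prior, budget, birl_sample, maximin_environment, query_expert,
            n_samples=100):
    """Sketch of Algorithm 2 (ED-BIRL)."""
    observations = []
    for _ in range(budget):
        # Lines 3-4: approximate P(. | D_{1:k-1}) by an empirical distribution of samples.
        reward_samples = birl_sample(observations, prior, n_samples)
        # Line 5: choose the environment with the largest minimal Bayesian regret.
        T_k, _ = maximin_environment(environments, reward_samples)
        # Line 6: obtain a fresh expert demonstration in T_k.
        tau_k = query_expert(T_k)
        observations.append((tau_k, T_k))
    # Line 7: final inference over all collected demonstrations.
    return birl_sample(observations, prior, n_samples)
```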
5.2. The MaxEnt Setting: AIRL-ME and ED-AIRL
In the following, we describe how to extend MaxEnt IRL methods to demonstrations from multiple environments, and use them in combination with environment design for IRL.

Reward Inference with Multiple Environments. In MaxEnt IRL, the reward function is assumed to be parameterised by some parameter $\theta$. Given observations $\mathcal{D}_{1:k} = (\mathcal{D}_1, \ldots, \mathcal{D}_k)$ with $\mathcal{D}_i = (\tau_i, T_i)$, our goal for multiple environments is to solve the maximum likelihood problem

$$\arg\max_{\theta} \sum_{(\tau, T) \in \mathcal{D}_{1:k}} \log P(\tau \mid \theta, T), \qquad (3)$$

where we again used that trajectories are independent conditional on the reward parameter $\theta$ and the dynamics $T$. Consequently, the only difference to the original MaxEnt IRL formulation is that we now sum over pairs $(\tau, T)$ instead of just trajectories $\tau$. The specific algorithm we consider here is Adversarial IRL (AIRL) (Fu et al., 2018), which frames the optimisation of (3) as a generative adversarial network.
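Compared to single-environment MaxEnt IRL, objective (3) only changes how the data is batched: log-likelihood terms are accumulated per (trajectory, dynamics) pair. A schematic sketch, assuming a differentiable `traj_log_likelihood(theta, tau, T)` is available (in AIRL this term is realised through the discriminator rather than computed in closed form):

```python
def multi_env_objective(theta, observations, traj_log_likelihood):
    """Objective (3): sum of per-pair log-likelihoods (sketch).

    observations: the list of (tau_i, T_i) pairs collected in Framework 1.
    """
    return sum(traj_log_likelihood(theta, tau, T) for tau, T in observations)
```

A maximum-likelihood estimate of $\theta$ is then obtained by running any gradient-based optimiser on this sum.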
To extend AIRL to multiple environments, we can consider $k$ distinct policies $\pi_1, \ldots, \pi_k$ used to generate trajectories in environments $T_1, \ldots, T_k$, respectively, and discriminators