
Environment Design for Reinforcement Learning. Environment design and curriculum learning for RL aim to design a sequence of environments with increasing difficulty to improve the training of an autonomous agent (Narvekar et al., 2020). However, in contrast to our problem setup, observations in generated training environments are cheap, since collecting them only involves actions from an autonomous agent, not a human expert. As such, approaches like domain randomisation (Tobin et al., 2017; Akkaya et al., 2019) can be practical for RL, whereas they can be extremely inefficient and wasteful in an IRL setting. Moreover, in IRL we typically work with a handful of rounds only, so that slowly improving the environment generation process over thousands of training episodes (i.e., rounds) is impractical (Dennis et al., 2020; Gur et al., 2021). As a result, most methods that are viable for RL can be expected to be unsuitable for the IRL problem.
3. Problem Formulation
We now formally introduce the Environment Design for
Inverse Reinforcement Learning framework. A Markov Decision Process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma, \omega)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $T \colon \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is a transition function, $R \colon \mathcal{S} \to \mathbb{R}$ is a reward function, $\gamma$ a discount factor, and $\omega$ an initial state distribution. We assume that there is a set of transition functions $\mathcal{T}$ from which $T$ can be selected. Similar models have been considered for the RL problem under the name of Underspecified MDPs (Dennis et al., 2020) or Configurable MDPs (Metelli et al., 2018; Ramponi et al., 2021).
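To fix ideas, these objects admit a simple tabular instantiation; the sketch below is illustrative only (its names and array layout are ours, not the paper's).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TabularMDP:
    """Finite MDP (S, A, T, R, gamma, omega) with array-based components."""
    T: np.ndarray      # transition function, shape (S, A, S); each T[s, a] sums to 1
    R: np.ndarray      # state-based reward function, shape (S,)
    gamma: float       # discount factor
    omega: np.ndarray  # initial state distribution, shape (S,)

# The environment set of an underspecified/configurable MDP is then simply a
# collection of admissible transition functions over shared state/action spaces.
EnvironmentSet = list[np.ndarray]  # each element has shape (S, A, S)
```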
We assume that the true reward function, denoted $R^\star$, is unknown to the learner and consider the situation where the learner gets to interact with the human expert in a sequence of $m$ rounds.¹ More precisely, every round $k \in [m]$, the learner gets to select a demo environment $T_k \in \mathcal{T}$ for which an expert trajectory $\tau_k$ is observed. Our objective is to adaptively select a sequence of demo environments $T_1, \dots, T_m$ so as to recover a robust estimate of the unknown reward function. We describe the general framework for this interaction between learner and human expert in Framework 1. To summarise, a problem-instance in our setting is given by $(\mathcal{S}, \mathcal{A}, \mathcal{T}, R^\star, \gamma, \omega, m)$, where $\mathcal{T}$ is a set of environments, $R^\star$ is the unknown reward function, and $m$ the learner's budget.
From Framework 1 we see that the Environment Design for
IRL problem has two main ingredients: a) choosing useful
demo environments for the human to demonstrate the task
in (Section 4), and b) inferring the reward function from
expert demonstrations in multiple environments (Section 5).
¹ Typically, expert demonstrations are a limited resource as they involve expensive human input. We thus consider a limited budget of $m$ expert trajectories that the learner is able to obtain.
Framework 1 Environment Design for IRL
1: input set of environments $\mathcal{T}$, resources $m \in \mathbb{N}$
2: for $k = 1, \dots, m$ do
3:   Choose an environment $T_k \in \mathcal{T}$
4:   Observe expert trajectory $\tau_k$ in environment $T_k$
5:   Estimate rewards from observations up to round $k$
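For concreteness, the interaction loop of Framework 1 can be sketched as follows; choose_environment, query_expert, and update_posterior are hypothetical placeholders for the environment-selection rule (Section 4), the human expert, and the reward-inference step (Section 5), not the paper's API.

```python
def environment_design_for_irl(env_set, m, prior,
                               choose_environment, query_expert, update_posterior):
    """Illustrative sketch of Framework 1 with placeholder callables."""
    belief = prior                     # current belief P(.) over reward functions
    observations = []                  # D_{1:k} = ((tau_1, T_1), ..., (tau_k, T_k))
    for k in range(1, m + 1):
        T_k = choose_environment(env_set, belief)        # pick demo environment T_k in T
        tau_k = query_expert(T_k)                        # expert trajectory tau_k in T_k
        observations.append((tau_k, T_k))                # record D_k = (tau_k, T_k)
        belief = update_posterior(belief, observations)  # posterior P(. | D_{1:k})
    return belief
```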
3.1. Preliminaries and Notation
Throughout the paper, $R$ denotes a generic reward function, whereas $R^\star$ refers to the true (unknown) reward function. We let $\Pi$ denote a generic policy space. Now, $V^\pi_{R,T}(s) := \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi, T, s_0 = s\right]$ is the expected discounted return, i.e., value function, of a policy $\pi$ under some reward function $R$ and transition function $T$ in state $s$. For the value under the initial state distribution $\omega$, we then merely write $V^\pi_{R,T} := \mathbb{E}_{s \sim \omega}[V^\pi_{R,T}(s)]$ and denote its maximum by $V^*_{R,T} := \max_\pi V^\pi_{R,T}$. We accordingly refer to the $Q$-values under a policy $\pi$ by $Q^\pi_{R,T}(s, a)$ and their optimal values by $Q^*_{R,T}(s, a)$. In the following, we let $\pi^*_{R,T}$ always denote the optimal policy w.r.t. $R$ and $T$, i.e., the policy maximising the expected discounted return in the MDP $(\mathcal{S}, \mathcal{A}, T, R, \gamma, \omega)$.
In the following, we let $\tau$ denote expert trajectories. Note that in Framework 1 every such trajectory is generated w.r.t. some transition dynamics $T$. In round $k$, we thus observe $D_k = (\tau_k, T_k)$, i.e., the expert trajectory $\tau_k$ in environment $T_k$. We then write $D_{1:k} = (D_1, \dots, D_k)$ for all observations up to (and including) the $k$-th round. Moreover, we let $P(\cdot \mid D_{1:k})$ denote the posterior over reward functions given observations $D_{1:k}$. For the prior $P(\cdot)$, we introduce the convention that $P(\cdot) = P(\cdot \mid D_{1:0})$.
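To illustrate the value functions defined above, the following sketch (ours, not part of the paper) computes $V^\pi_{R,T}(s)$ by solving the linear Bellman equations and $V^*_{R,T}(s)$ by value iteration for a tabular MDP with state-based rewards.

```python
import numpy as np

def policy_value(T, R, gamma, policy):
    """V^pi_{R,T}(s) for all s: solves (I - gamma * P_pi) V = R, with
    T of shape (S, A, S), R of shape (S,), policy of shape (S, A)."""
    P_pi = np.einsum("sa,sat->st", policy, T)        # state-to-state kernel under pi
    return np.linalg.solve(np.eye(R.shape[0]) - gamma * P_pi, R)

def optimal_value(T, R, gamma, tol=1e-8, max_iter=10_000):
    """V^*_{R,T}(s) for all s via value iteration."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        Q = R[:, None] + gamma * np.einsum("sat,t->sa", T, V)  # Q-values Q(s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V

def value_under_omega(V, omega):
    """Scalar value E_{s ~ omega}[V(s)] under the initial state distribution."""
    return float(omega @ V)
```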
4. Environment Design via Maximin Regret
Our goal is to adaptively select demo environments for the
expert based on our current belief about the reward function.
In Section 4.1, we introduce a maximin Bayesian regret
objective for the environment design process which aims to
select demo environments so as to ensure that our reward
estimate is robust. Section 4.2 then deals with the selection
of such environments when the set of environments exhibits
a useful decomposable structure. We additionally provide a
way to approximate the process when the set has an arbitrary
structure or is challenging to construct.
4.1. Maximin Bayesian Regret
We begin by reflecting on the potential loss of an agent when deploying a policy $\pi$ under transition function $T$ and the true reward function $R^\star$, given by the difference
$$\ell_{R^\star}(T, \pi) := V^*_{R^\star, T} - V^\pi_{R^\star, T}.$$
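For a tabular MDP, this loss can be evaluated directly; the sketch below is illustrative and assumes the hypothetical policy_value, optimal_value, and value_under_omega helpers sketched in Section 3.1, applied to whichever reward function $R$ (in particular $R^\star$) the loss is taken with respect to.

```python
def deployment_loss(T, R, gamma, omega, policy):
    """ell_R(T, pi) = V^*_{R,T} - V^pi_{R,T}, both evaluated under the initial
    state distribution omega (reuses the tabular sketches from Section 3.1)."""
    v_star = value_under_omega(optimal_value(T, R, gamma), omega)
    v_pi = value_under_omega(policy_value(T, R, gamma, policy), omega)
    return v_star - v_pi
```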