
[15, 58, 41, 34, 17, 7, 29, 53] or embodied AI tasks where the agent must generalize its behavior to unseen physical spaces at test time [59, 61, 28, 72]. A number of methods have been proposed which have shown promising performance in PCG environments with sparse rewards, such as RIDE [56], AGAC [25] and NovelD [76]. These methods propose different intrinsic reward functions, such as the change in representation in a latent space, the divergence between the predictions of a policy and an adversary, or the difference between random network prediction errors at two consecutive states. Although not presented as a central algorithmic feature, these methods also include a count-based bonus which is computed at the episode level; a sketch of this kind of bonus is given below.
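To make this concrete, the following is a minimal sketch of a count-based episodic bonus of the kind referred to above. The class name, the choice of $1/\sqrt{N_{ep}(o)}$ as the bonus, and the hashable `obs_key` argument are illustrative assumptions; the exact form of the episodic count term, and how it is combined with the main intrinsic reward, differs between methods such as RIDE, AGAC and NovelD.

```python
from collections import defaultdict


class CountBasedEpisodicBonus:
    """Illustrative count-based episodic bonus: counts are kept per episode
    and cleared at every episode boundary."""

    def __init__(self):
        self.counts = defaultdict(int)

    def reset(self):
        # Called at every episode boundary: episodic counts start from zero.
        self.counts.clear()

    def bonus(self, obs_key):
        # obs_key must be hashable, e.g. bytes(observation) for a small grid.
        self.counts[obs_key] += 1
        return 1.0 / self.counts[obs_key] ** 0.5
```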
In this work, we take a closer look at exploration in CMDPs, where each episode corresponds to a
different environment context. We first show that, surprisingly, the count-based episodic bonus that is
often included as a heuristic is in fact essential for good performance, and current methods fail if it is
omitted. Furthermore, due to this dependence on a count-based term, existing methods fail on more
complex tasks with irrelevant features or dynamic entities, where each observation is rarely seen
more than once. We find that performance can be improved by counting certain features extracted
from the observations, rather than the observations themselves. However, different features are useful
for different tasks, making it difficult to design a feature extractor that performs well across all tasks.
To address this fundamental limitation, we propose a new method, E3B, which uses an elliptical
bonus [6, 20, 42] at the episode level that can be seen as a natural generalization of a count-based episodic bonus to continuous state spaces, and that is paired with a self-supervised feature learning method using an inverse dynamics model. Our algorithm is simple to implement, scalable to large or infinite state spaces, and achieves state-of-the-art performance across 16 challenging tasks from the MiniHack suite [58], without the need for task-specific prior knowledge. It also matches existing methods on hard exploration tasks from the VizDoom environment [37], and significantly outperforms existing methods in reward-free exploration on the Habitat embodied AI environment [59, 69]. This demonstrates that E3B scales to rich, high-dimensional pixel-based observations and real-world scenarios. Our code is available at https://github.com/facebookresearch/e3b.
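As a rough sketch of how an elliptical episodic bonus of this kind can be maintained within an episode (the full E3B implementation is available in the repository above), the code below computes $\phi^\top C^{-1}\phi$ for an episode-level matrix $C = \lambda I + \sum_i \phi(s_i)\phi(s_i)^\top$ that is reset at every episode boundary. The class name, the `ridge` parameter and the use of a plain NumPy vector for the embedding $\phi(s)$ are illustrative assumptions; in practice $\phi$ would come from a learned feature extractor such as an inverse dynamics model.

```python
import numpy as np


class EllipticalEpisodicBonus:
    """Illustrative elliptical episodic bonus (cf. [6, 20, 42]): the inverse
    of the episode's regularized covariance matrix is maintained
    incrementally and reset at every episode boundary."""

    def __init__(self, dim, ridge=0.1):
        self.dim = dim
        self.ridge = ridge
        self.reset()

    def reset(self):
        # C_0 = ridge * I, so C_0^{-1} = I / ridge.
        self.inv_cov = np.eye(self.dim) / self.ridge

    def bonus(self, phi):
        phi = np.asarray(phi, dtype=np.float64)
        u = self.inv_cov @ phi
        b = float(phi @ u)                       # b = phi^T C^{-1} phi
        # Rank-one (Sherman-Morrison) update of C^{-1} after adding phi phi^T to C.
        self.inv_cov -= np.outer(u, u) / (1.0 + b)
        return b
```

Maintaining $C^{-1}$ with a rank-one update avoids inverting a $d \times d$ matrix at every step, so the per-step cost is quadratic rather than cubic in the embedding dimension.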
2 Background
2.1 Contextual MDPs
We consider a Contextual Markov Decision Process¹ (CMDP [30]) given by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{C}, P, r, \mu_{\mathcal{C}}, \mu_{\mathcal{S}})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{C}$ is the context space, $P$ is the transition function, $r$ is the reward function, $\mu_{\mathcal{C}}$ is the distribution over contexts and $\mu_{\mathcal{S}}$ is the conditional initial state distribution. At each episode, we sample a context $c \sim \mu_{\mathcal{C}}$, an initial state $s_0 \sim \mu_{\mathcal{S}}(\cdot \mid c)$, and subsequent states in the episode are sampled according to $s_{t+1} \sim P(\cdot \mid s_t, a_t, c)$. Let $d_\pi^c$ denote the distribution over states induced by following policy $\pi$ in context $c$. The goal is to learn a policy $\pi$ which maximizes the expected return over all contexts, i.e. $R = \mathbb{E}_{c \sim \mu_{\mathcal{C}},\, s \sim d_\pi^c,\, a \sim \pi(s)}[r(s, a)]$.
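For concreteness, the sketch below rolls out a single episode under this definition. The `env` object and its methods (`sample_context`, `sample_initial_state`, `step`, `reward`) are a hypothetical interface standing in for $(\mu_{\mathcal{C}}, \mu_{\mathcal{S}}, P, r)$; the key point is that the context $c$ is drawn once per episode and every transition and reward is conditioned on it.

```python
def cmdp_episode(env, policy, rng):
    """Illustrative rollout of one CMDP episode under a hypothetical
    environment interface; `policy` maps a state to an action."""
    c = env.sample_context(rng)                # c ~ mu_C
    s = env.sample_initial_state(c, rng)       # s_0 ~ mu_S(. | c)
    episode_return, done = 0.0, False
    while not done:
        a = policy(s)                          # a ~ pi(. | s)
        episode_return += env.reward(s, a)     # r(s, a)
        s, done = env.step(s, a, c, rng)       # s_{t+1} ~ P(. | s_t, a_t, c)
    return episode_return
```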
Examples of CMDPs include procedurally-generated environments, such as ProcGen [17], MiniGrid [15], NetHack [41], or MiniHack [58], where each context $c$ corresponds to the random seed used to generate the environment; in this case, the number of contexts $|\mathcal{C}|$ is effectively infinite. Other examples include embodied AI environments [59, 69, 28, 61, 72], where the agent is placed in different simulated houses and must navigate to a location or find an object. In this setting, each context $c \in \mathcal{C}$ represents a house identifier and the number of houses $|\mathcal{C}|$ is typically between 20 and 1000. For an in-depth review of the literature on CMDPs and generalization in RL, see [39].
2.2 Exploration Bonuses
If the environment rewards are sparse, learning a policy using simple $\epsilon$-greedy exploration may require intractably many samples. We therefore consider methods that augment the external reward function $r$ with an intrinsic reward bonus $b$. A number of intrinsic bonuses that encourage exploration in singleton (non-contextual) MDPs have been proposed, including pseudocounts [9], intrinsic curiosity modules (ICM) [52] and random network distillation (RND) error [13]. At a high level, these methods
¹Technically, some of the environments we consider are Contextual Partially Observed MDPs (CPOMDPs), but we follow the convention in [39] and adopt the CMDP framework for simplicity. For CPOMDPs, we use recurrent networks or frame stacking to convert to CMDPs, as done in prior work [46].