Exploration via Elliptical Episodic Bonuses
Mikael Henaff
Meta AI Research
mikaelhenaff@meta.com
Roberta Raileanu
Meta AI Research
raileanu@meta.com
Minqi Jiang
University College London
Meta AI Research
meta@fb.com
Tim Rocktäschel
University College London
t.rocktaschel@cs.ucl.ac.uk
Abstract
In recent years, a number of reinforcement learning (RL) methods have been pro-
posed to explore complex environments which differ across episodes. In this work,
we show that the effectiveness of these methods critically relies on a count-based
episodic term in their exploration bonus. As a result, despite their success in
relatively simple, noise-free settings, these methods fall short in more realistic
scenarios where the state space is vast and prone to noise. To address this lim-
itation, we introduce Exploration via Elliptical Episodic Bonuses (E3B), a new
method which extends count-based episodic bonuses to continuous state spaces and
encourages an agent to explore states that are diverse under a learned embedding
within each episode. The embedding is learned using an inverse dynamics model
in order to capture controllable aspects of the environment. Our method sets a
new state-of-the-art across 16 challenging tasks from the MiniHack suite, with-
out requiring task-specific inductive biases. E3B also matches existing methods
on sparse reward, pixel-based Vizdoom environments, and outperforms existing
methods in reward-free exploration on Habitat, demonstrating that it can scale to
high-dimensional pixel-based observations and realistic environments.
1 Introduction
Exploration in environments with sparse rewards is a fundamental challenge in reinforcement learning
(RL). In the tabular setting, provably optimal algorithms have existed since the early 2000s [36, 10]. More recently, exploration has been studied in the context of deep RL, and a number of empirically successful methods have been proposed, such as pseudocounts [9], intrinsic curiosity modules (ICM) [52], and random network distillation (RND) [13]. These methods rely on intrinsically generated
exploration bonuses that reward the agent for visiting states that are novel according to some measure.
Different measures of novelty have been proposed, such as the likelihood of a state under a learned
density model, the error of a forward dynamics model, or the loss on a random prediction task. These
approaches have proven effective on hard exploration problems, as exemplified by the Atari games
Montezuma's Revenge and Pitfall [22].
The approaches above are, however, designed for singleton RL tasks, where the agent is spawned
in the same environment in every episode. Recently, several studies have drawn attention to the
fact that RL agents exhibit poor generalization across environments, and that even minor changes
to the environment can lead to substantial degradation in performance [75, 35, 74, 18, 38]. This
has motivated the creation of benchmarks in the Contextual Markov Decision Process (CMDP)
framework, where different episodes correspond to different environments that nevertheless share
certain characteristics. Examples of CMDPs include procedurally generated (PCG) environments
[15, 58, 41, 34, 17, 7, 29, 53] or embodied AI tasks where the agent must generalize its behavior to unseen physical spaces at test time [59, 61, 28, 72]. A number of methods have been proposed which have shown promising performance in PCG environments with sparse rewards, such as RIDE [56], AGAC [25] and NovelD [76]. These methods propose different intrinsic reward functions, such as
the change in representation in a latent space, the divergence between the predictions of a policy and
an adversary, or the difference between random network prediction errors at two consecutive states.
Although not presented as a central algorithmic feature, these methods also include a count-based
bonus which is computed at the episode level.
In this work, we take a closer look at exploration in CMDPs, where each episode corresponds to a
different environment context. We first show that, surprisingly, the count-based episodic bonus that is
often included as a heuristic is in fact essential for good performance, and current methods fail if it is
omitted. Furthermore, due to this dependence on a count-based term, existing methods fail on more
complex tasks with irrelevant features or dynamic entities, where each observation is rarely seen
more than once. We find that performance can be improved by counting certain features extracted
from the observations, rather than the observations themselves. However, different features are useful
for different tasks, making it difficult to design a feature extractor that performs well across all tasks.
To address this fundamental limitation, we propose a new method, E3B, which uses an elliptical
bonus [6, 20, 42] at the episode level that can be seen as a natural generalization of a count-based episodic bonus to continuous state spaces, and that is paired with a self-supervised feature learning method using an inverse dynamics model. Our algorithm is simple to implement, scalable to large or infinite state spaces, and achieves state-of-the-art performance across 16 challenging tasks from the MiniHack suite [58], without the need for task-specific prior knowledge. It also matches existing methods on hard exploration tasks from the VizDoom environment [37], and significantly outperforms existing methods in reward-free exploration on the Habitat embodied AI environment [59, 69]. This
demonstrates that E3B scales to rich, high dimensional pixel-based observations and real-world
scenarios. Our code is available at https://github.com/facebookresearch/e3b.
2 Background
2.1 Contextual MDPs
We consider a Contextual Markov Decision Process$^1$ (CMDP [30]) given by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{C}, P, r, \mu_{\mathcal{C}}, \mu_{\mathcal{S}})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{C}$ is the context space, $P$ is the transition function, $r$ is the reward function, $\mu_{\mathcal{C}}$ is the distribution over contexts and $\mu_{\mathcal{S}}$ is the conditional initial state distribution. At each episode, we sample a context $c \sim \mu_{\mathcal{C}}$, an initial state $s_0 \sim \mu_{\mathcal{S}}(\cdot \mid c)$, and subsequent states in the episode are sampled according to $s_{t+1} \sim P(\cdot \mid s_t, a_t, c)$. Let $d^c_\pi$ denote the distribution over states induced by following policy $\pi$ in context $c$. The goal is to learn a policy $\pi$ which maximizes the expected return over all contexts, i.e. $R = \mathbb{E}_{c \sim \mu_{\mathcal{C}},\, s \sim d^c_\pi,\, a \sim \pi(s)}[r(s, a)]$.
Examples of CMDPs include procedurally-generated environments, such as ProcGen [17], MiniGrid [15], NetHack [41], or MiniHack [58], where each context $c$ corresponds to the random seed used to generate the environment; in this case, the number of contexts $|\mathcal{C}|$ is effectively infinite. Other examples include embodied AI environments [59, 69, 28, 61, 72], where the agent is placed in different simulated houses and must navigate to a location or find an object. In this setting, each context $c \in \mathcal{C}$ represents a house identifier and the number of houses $|\mathcal{C}|$ is typically between 20 and 1000. For an in-depth review of the literature on CMDPs and generalization in RL, see [39].
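To make the episode-generation process concrete, below is a minimal sketch of one CMDP episode under the definitions above, assuming a generic environment interface; `sample_context`, `reset`, and `step` are hypothetical placeholders standing in for $\mu_{\mathcal{C}}$, $\mu_{\mathcal{S}}$, and $P$, not the API of any particular benchmark.

```python
# Hypothetical sketch of one CMDP episode, following the definitions above.
# sample_context(), env.reset(), env.step() are placeholder interfaces, not a real API.

def run_episode(env, policy, max_steps=100):
    c = env.sample_context()          # c ~ mu_C (e.g. a procedural-generation seed)
    s = env.reset(context=c)          # s_0 ~ mu_S(. | c)
    total_return = 0.0
    for t in range(max_steps):
        a = policy(s)                 # a_t ~ pi(. | s_t)
        s, r, done = env.step(a)      # s_{t+1} ~ P(. | s_t, a_t, c), r = r(s_t, a_t)
        total_return += r
        if done:
            break
    return total_return               # one sample of the return whose expectation over contexts is maximized
```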
2.2 Exploration Bonuses
If the environment rewards are sparse, learning a policy using simple $\epsilon$-greedy exploration may require intractably many samples. We therefore consider methods that augment the external reward function $r$ with an intrinsic reward bonus $b$. A number of intrinsic bonuses that encourage exploration in singleton (non-contextual) MDPs have been proposed, including pseudocounts [9], intrinsic curiosity modules (ICM) [52] and random network distillation (RND) error [13]. At a high level, these methods define an intrinsic reward bonus that is high if the current state is different from the previous states visited by the agent, and low if it is similar (according to some measure).

$^1$Technically, some of the environments we consider are Contextual Partially Observed MDPs (CPOMDPs), but we follow the convention in [39] and adopt the CMDP framework for simplicity. For CPOMDPs, we use recurrent networks or frame stacking to convert to CMDPs, as done in prior work [46].
Method | Exploration bonus
RIDE [56] | $\|\phi(s_{t+1}) - \phi(s_t)\|_2 \cdot 1/\sqrt{N_e(s_{t+1})}$
NovelD [76] | $\left[ b_{\mathrm{RND}}(s_{t+1}) - \alpha \cdot b_{\mathrm{RND}}(s_t) \right]_{+} \cdot \mathbb{I}[N_e(s_{t+1}) = 1]$
AGAC [25] | $D_{\mathrm{KL}}(\pi(\cdot|s_t) \,\|\, \pi_{\mathrm{adv}}(\cdot|s_t)) + \beta \cdot 1/\sqrt{N_e(s_{t+1})}$

Table 1: Summary of recent exploration methods for procedurally-generated environments. Each exploration bonus has a count-based episodic term (the factor involving $N_e$).
More recently, several methods have been proposed for and evaluated on procedurally-generated
MDPs. All use different exploration bonuses, which are summarized in Table 1. RIDE [56] defines a bonus based on the distance between the embeddings of two consecutive observations, NovelD [76] uses a bonus based on the difference between two consecutive RND bonuses, and AGAC [25] uses the KL divergence between the predictions of the agent's policy and those of an adversarial policy trained to mimic it (see the original works for details). In addition, all three methods have a term in their bonus (the count-based episodic term in Table 1) which depends on $N_e(s_{t+1})$, the number of times $s_{t+1}$ has been encountered during the current episode. Although presented as a heuristic, below we will show that without this count-based episodic term, all three methods fail to learn. This in turn limits their effectiveness in more complex, dynamic, and noisy environments.
3 Importance and Limitations of Count-Based Episodic Bonuses
We now discuss in more detail the importance and limitations of the count-based episodic bonuses
used in RIDE, AGAC and NovelD, which depend on $N_e(s_t)$. Figure 1 shows results for the three
methods with and without their respective count-based episodic terms, on one of the MiniGrid
environments used in prior work. When the count-based terms are removed, all three methods fail to
learn. Similar trends apply for other MiniGrid environments (see Appendix D.1). This shows that the
episodic bonus is in fact essential for good performance.
Figure 1: Results for RIDE, AGAC and NovelD with and without the count-based episodic bonus,
over 5 random seeds (solid line indicates the mean, shaded region indicates one standard deviation).
All three algorithms fail if the count-based episodic bonus is removed.
However, the count-based episodic bonus suffers from a fundamental limitation, which is similar to
that faced by count-based approaches in general: if each state is unique, then $N_e(s_t)$ will always be 1 and the episodic bonus is no longer meaningful. This is the case for many real-world applications.
For example, a household robot’s state as recorded by its camera might include moving trees outside
the window, clocks showing the time or images on a television screen which are not relevant for its
tasks, but nevertheless make each state unique.
Previous works [56, 25, 76] have used the MiniGrid test suite [15] for evaluation, where observations
are less noisy and do not typically contain irrelevant information. Thus, methods relying on episodic
counts have been effective in these scenarios. However, in more complex environments such as
MiniHack [58] or with high-dimensional pixel-based observations, episodic count-based approaches
can cease to be viable.
A possible alternative could be to design a function to extract relevant features from each state
and feed them to the count-based episodic bonus. Specifically, instead of defining a bonus using
$N_e(s_t)$, we could define the bonus using $N_e(\phi(s_t))$, where $\phi$ is a hand-designed feature extractor. For example, in the paper introducing the MiniHack suite [58], the RIDE implementation uses $\phi(s_t) = (x_t, y_t)$, where $(x_t, y_t)$ is the spatial location of the agent at time $t$. However, this approach
relies heavily on task-specific knowledge.
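To illustrate what such a feature-counting bonus looks like in practice, here is a minimal sketch assuming a dictionary-style observation; `extract_position` and the observation keys are hypothetical stand-ins for $\phi$, not taken from any of the cited implementations.

```python
from collections import defaultdict

def extract_position(obs):
    # Hypothetical hand-designed feature extractor phi(s_t) = (x_t, y_t).
    return (obs["agent_x"], obs["agent_y"])

class EpisodicFeatureCounter:
    """Count-based episodic bonus computed on phi(s_t) instead of s_t."""
    def __init__(self, phi):
        self.phi = phi
        self.counts = defaultdict(int)

    def reset(self):
        # Called at the start of every episode, since N_e is an episodic count.
        self.counts.clear()

    def bonus(self, obs):
        key = self.phi(obs)
        self.counts[key] += 1
        # NovelD-style indicator I[N_e(phi(s_t)) = 1]; RIDE/AGAC would use 1/sqrt(N_e) instead.
        return 1.0 if self.counts[key] == 1 else 0.0
```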
Figure 2 shows results on two tasks from the MiniHack suite for three NovelD variants (we decided
to focus our study on variants of NovelD, since it was previously shown to outperform competing
methods on MiniGrid [76]). The first, NOVELD, denotes the standard formulation which uses the bonus $\mathbb{I}[N_e(s_t) = 1]$. The second, NOVELD-POSITION, uses the positional feature encoding described above, i.e. the episodic bonus is defined as $\mathbb{I}[N_e(\phi(s_t)) = 1]$ with $\phi(s_t) = (x_t, y_t)$. The third, NOVELD-MESSAGE, uses a feature encoding where $\phi(s_t)$ extracts the message portion of the state $s_t$, similarly to [47] (both encodings are explained in more detail in Section 5).
Figure 2: Performance of NovelD with different feature extractors: entire observation, agent location, or environment message. Results are averaged over 5 seeds and the shaded region represents one standard deviation.
In contrast to the MiniGrid environments, here standard NOVELD fails completely due to the presence of a time counter feature in the MiniHack observations, which makes each observation in the episode unique. Using the positional encoding enables NOVELD-POSITION to solve the MultiRoom task, but this method fails on the Freeze task. On the other hand, using the message encoding enables NOVELD-MESSAGE to succeed on the Freeze task, but it fails on the MultiRoom one. This illustrates that different feature extractors are effective on different tasks, and that designing one which is broadly effective is challenging. Therefore, robust new methods which do not require task-specific engineering are needed.
4 Elliptical Episodic Bonuses
In this section we describe Exploration via Elliptical Episodic Bonuses (E3B), our algorithm for
exploration in contextual MDPs. It is designed to address the shortcomings of count-based episodic
bonuses described above, with two aims in mind. First, we would like an episodic bonus that can
be used with continuous state representations, unlike the count-based bonus which requires discrete
states. Second, we would like a representation learning method that only captures information about
the environment that is relevant for the task at hand. The first requirement is met by using an elliptical
bonus [6, 20, 42], which provides a continuous analog to the count-based bonus, while the second
requirement is met by using a representation learned with an inverse dynamics model [52, 56].
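As a rough sketch of the inverse-dynamics idea, an embedding $\phi$ can be trained to predict the action taken between two consecutive states; the architecture, dimensions, and hyperparameters below are placeholder assumptions, not the paper's exact setup (which is described later).

```python
import torch
import torch.nn as nn

# Sketch of an inverse-dynamics objective: train phi so that (phi(s_t), phi(s_{t+1}))
# predicts the action a_t. All dimensions and architectures are illustrative placeholders.
obs_dim, embed_dim, n_actions = 128, 256, 6

phi = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))
inverse_head = nn.Sequential(nn.Linear(2 * embed_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))
optimizer = torch.optim.Adam(
    list(phi.parameters()) + list(inverse_head.parameters()), lr=1e-4
)

def inverse_dynamics_loss(s_t, s_tp1, a_t):
    # a_t: tensor of action indices taken between s_t and s_{t+1}
    z_t, z_tp1 = phi(s_t), phi(s_tp1)
    logits = inverse_head(torch.cat([z_t, z_tp1], dim=-1))
    return nn.functional.cross_entropy(logits, a_t)

# Example update on random stand-in data for a batch of transitions:
s_t, s_tp1 = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
a_t = torch.randint(0, n_actions, (32,))
loss = inverse_dynamics_loss(s_t, s_tp1, a_t)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```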
A summary of the method is shown in Figure 3. We define an intrinsic reward based on the position
of the current state’s embedding with respect to an ellipse fit on the embeddings of previous states
encountered within the same episode. This bonus is then combined with the environment reward and
used to update the agent’s policy. The next two sections describe the elliptical bonus and embedding
method in detail.
4.1 Elliptical Episodic Bonus
Given a feature encoding $\phi$, at each time step $t$ in the episode the elliptical bonus $b$ is defined as follows:

$$b(s_t) = \phi(s_t)^{\top} C_{t-1}^{-1} \phi(s_t), \qquad C_{t-1} = \sum_{i=1}^{t-1} \phi(s_i)\phi(s_i)^{\top} + \lambda I \qquad (1)$$

Here $\lambda I$ is a regularization term to ensure that the matrix $C_{t-1}$ is non-singular, where $\lambda$ is a scalar coefficient and $I$ is the identity matrix.
Figure 3: Overview of E3B. At each step in an episode, an ellipse is fit on the embeddings of previous
states encountered within the episode. Intrinsic rewards are computed based on the position of
the current state’s embedding with respect to the ellipse: the further outside the ellipse, the higher
the reward. These are combined with environment rewards, the policy is updated, and the process
repeated. The state embeddings are learned using an inverse dynamics model.
The reward optimized by the algorithm is then defined as $\bar{r}(s_t, a_t) = r(s_t, a_t) + \beta \cdot b(s_t)$, where $r(s_t, a_t)$ is the extrinsic reward provided by the environment and $\beta$ is a scalar term balancing the tradeoff between exploration and exploitation.
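A direct, unoptimized implementation of Equation 1 and the combined reward might look like the following sketch; the feature dimension and the $\lambda$ and $\beta$ values are arbitrary placeholders.

```python
import numpy as np

class EllipticalEpisodicBonus:
    """Naive per-episode implementation of Eq. (1): recompute C_{t-1}^{-1} at every step."""
    def __init__(self, feature_dim, lam=0.1):
        self.C = lam * np.eye(feature_dim)       # C_0 = lambda * I

    def bonus(self, feat):
        # b(s_t) = phi(s_t)^T C_{t-1}^{-1} phi(s_t)
        b = feat @ np.linalg.inv(self.C) @ feat
        # Then add phi(s_t) phi(s_t)^T to the running matrix, so it is used from step t+1 on.
        self.C += np.outer(feat, feat)
        return b

# Combined reward: r_bar = r + beta * b(s_t)
beta = 1.0
ellipse = EllipticalEpisodicBonus(feature_dim=8)
feat, r = np.random.randn(8), 0.0                # placeholder embedding and extrinsic reward
r_bar = r + beta * ellipse.bonus(feat)
```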
Intuition. One perspective is that the elliptical bonus is a natural generalization of a count-based episodic bonus [2]. To see this, observe that if the problem is tabular and $\phi$ is a one-hot encoding of the state, then $C_{t-1}$ will be a diagonal matrix whose entries contain the counts corresponding to each state encountered in the episode. Its inverse $C_{t-1}^{-1}$ will also be a diagonal matrix whose entries are inverse state visitation counts, and the bilinear form $\phi(s_t)^{\top} C_{t-1}^{-1} \phi(s_t)$ reads off the entry corresponding to the current state $s_t$, yielding a bonus of $1/N_e(s_t)$.
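This reduction can be checked numerically; the short sketch below uses one-hot features and a near-zero $\lambda$ and compares the elliptical bonus to $1/N_e(s)$ (the specific episode and state counts are arbitrary).

```python
import numpy as np

# Minimal numerical check of the tabular intuition above: with one-hot features and
# lambda -> 0, the elliptical bonus reduces to 1 / N_e(s).
n_states, lam = 4, 1e-8
phi = np.eye(n_states)                      # one-hot feature encoding
episode = [0, 1, 0, 2, 0]                   # states visited so far in the episode

C = lam * np.eye(n_states)
for s in episode:
    C += np.outer(phi[s], phi[s])           # C_{t-1} = sum_i phi(s_i) phi(s_i)^T + lambda * I

C_inv = np.linalg.inv(C)
for s in [0, 1, 2]:
    bonus = phi[s] @ C_inv @ phi[s]         # Eq. (1)
    count = episode.count(s)
    print(f"state {s}: bonus = {bonus:.4f}, 1/N_e = {1 / count:.4f}")
```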
For a more general geometric interpretation, if $\phi(s_0), ..., \phi(s_{t-1})$ are roughly centered at zero, then $C_{t-1}$ can be viewed as their unnormalized covariance matrix. Now consider the eigendecomposition $C_{t-1} = U^{\top} \Lambda U$, where $\Lambda$ is the diagonal matrix whose entries are the eigenvalues $\lambda_1, ..., \lambda_n$ (these are real since $C_{t-1}$ is symmetric). Letting $z = U\phi(s_t) = (z_1, ..., z_n)$ be the set of coordinates of $\phi(s_t)$ in the eigenspace of $C_{t-1}$, we can rewrite the elliptical bonus as:

$$b(s_t) = z^{\top} \Lambda^{-1} z = \sum_{i=1}^{n} \frac{z_i^2}{\lambda_i}$$

The bonus increases the more $\phi(s_t)$ is aligned with the eigenvectors corresponding to smaller eigenvalues of $C_{t-1}$ (directions of low data density), and decreases the more it is aligned with eigenvectors corresponding to larger eigenvalues (directions of high data density). An illustration for $n = 2$ is shown in Figure 10 of Appendix B. In our experiments, $n$ is typically between 256 and 1024.
Efficient Computation. The matrix $C_{t-1}$ needs to be inverted at each step $t$ in the episode, an operation which is cubic in the dimension of $\phi$ and may thus be expensive. To address this, we use the Sherman-Morrison matrix identity [62] to perform fast rank-1 updates in quadratic time:

$$C_t^{-1} = \begin{cases} \dfrac{1}{\lambda} I & \text{if } t = 0 \\[6pt] C_{t-1}^{-1} - \dfrac{C_{t-1}^{-1}\,\phi(s_t)\,\phi(s_t)^{\top}\,(C_{t-1}^{-1})^{\top}}{1 + \phi(s_t)^{\top} C_{t-1}^{-1}\,\phi(s_t)} & \text{if } t \geq 1 \end{cases}$$
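A minimal sketch of maintaining $C_t^{-1}$ with this rank-1 update (with placeholder dimensions and random embeddings standing in for $\phi(s_t)$) is shown below.

```python
import numpy as np

def sherman_morrison_update(C_inv, feat):
    """Rank-1 update of C^{-1} after adding feat feat^T to C (O(n^2) instead of O(n^3))."""
    u = C_inv @ feat                              # C_{t-1}^{-1} phi(s_t)
    denom = 1.0 + feat @ u                        # 1 + phi^T C_{t-1}^{-1} phi
    return C_inv - np.outer(u, u) / denom

# Episode loop: maintain C^{-1} directly, never invert a matrix.
n, lam = 8, 0.1
C_inv = np.eye(n) / lam                           # C_0^{-1} = (1/lambda) I
for _ in range(5):
    feat = np.random.randn(n)                     # placeholder for phi(s_t)
    bonus = feat @ C_inv @ feat                   # elliptical bonus b(s_t), Eq. (1)
    C_inv = sherman_morrison_update(C_inv, feat)  # incorporate phi(s_t) for the next step
```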