
[15, 58, 41, 34, 17, 7, 29, 53] or embodied AI tasks where the agent must generalize its behavior to unseen physical spaces at test time [59, 61, 28, 72]. A number of methods have been proposed which have shown promising performance in PCG environments with sparse rewards, such as RIDE [56], AGAC [25] and NovelD [76]. These methods propose different intrinsic reward functions, such as the change in representation in a latent space, the divergence between the predictions of a policy and an adversary, or the difference between random network prediction errors at two consecutive states. Although not presented as a central algorithmic feature, these methods also include a count-based bonus which is computed at the episode level; a sketch of this kind of bonus is given below.
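To make this concrete, the following is a minimal sketch of a count-based episodic bonus of the kind referred to above. The class name, the choice of $1/\sqrt{N_{ep}(o)}$ as the bonus, and the hashable `obs_key` argument are illustrative assumptions; the exact form of the episodic count term, and how it is combined with the main intrinsic reward, differs between methods such as RIDE, AGAC and NovelD.

```python
from collections import defaultdict


class CountBasedEpisodicBonus:
    """Illustrative count-based episodic bonus: counts are kept per episode
    and cleared at every episode boundary."""

    def __init__(self):
        self.counts = defaultdict(int)

    def reset(self):
        # Called at every episode boundary: episodic counts start from zero.
        self.counts.clear()

    def bonus(self, obs_key):
        # obs_key must be hashable, e.g. bytes(observation) for a small grid.
        self.counts[obs_key] += 1
        return 1.0 / self.counts[obs_key] ** 0.5
```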
In this work, we take a closer look at exploration in CMDPs, where each episode corresponds to a
different environment context. We first show that, surprisingly, the count-based episodic bonus that is
often included as a heuristic is in fact essential for good performance, and current methods fail if it is
omitted. Furthermore, due to this dependence on a count-based term, existing methods fail on more
complex tasks with irrelevant features or dynamic entities, where each observation is rarely seen
more than once. We find that performance can be improved by counting certain features extracted
from the observations, rather than the observations themselves. However, different features are useful
for different tasks, making it difficult to design a feature extractor that performs well across all tasks.
To address this fundamental limitation, we propose a new method, E3B, which uses an elliptical
bonus [6, 20, 42] at the episode level that can be seen as a natural generalization of a count-based episodic bonus to continuous state spaces, and that is paired with a self-supervised feature learning method using an inverse dynamics model. Our algorithm is simple to implement, scalable to large or infinite state spaces, and achieves state-of-the-art performance across 16 challenging tasks from the MiniHack suite [58], without the need for task-specific prior knowledge. It also matches existing methods on hard exploration tasks from the VizDoom environment [37], and significantly outperforms existing methods in reward-free exploration on the Habitat embodied AI environment [59, 69]. This demonstrates that E3B scales to rich, high-dimensional pixel-based observations and real-world scenarios. Our code is available at https://github.com/facebookresearch/e3b.
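As a rough sketch of how an elliptical episodic bonus of this kind can be maintained within an episode (the full E3B implementation is available in the repository above), the code below computes $\phi^\top C^{-1}\phi$ for an episode-level matrix $C = \lambda I + \sum_i \phi(s_i)\phi(s_i)^\top$ that is reset at every episode boundary. The class name, the `ridge` parameter and the use of a plain NumPy vector for the embedding $\phi(s)$ are illustrative assumptions; in practice $\phi$ would come from a learned feature extractor such as an inverse dynamics model.

```python
import numpy as np


class EllipticalEpisodicBonus:
    """Illustrative elliptical episodic bonus (cf. [6, 20, 42]): the inverse
    of the episode's regularized covariance matrix is maintained
    incrementally and reset at every episode boundary."""

    def __init__(self, dim, ridge=0.1):
        self.dim = dim
        self.ridge = ridge
        self.reset()

    def reset(self):
        # C_0 = ridge * I, so C_0^{-1} = I / ridge.
        self.inv_cov = np.eye(self.dim) / self.ridge

    def bonus(self, phi):
        phi = np.asarray(phi, dtype=np.float64)
        u = self.inv_cov @ phi
        b = float(phi @ u)                       # b = phi^T C^{-1} phi
        # Rank-one (Sherman-Morrison) update of C^{-1} after adding phi phi^T to C.
        self.inv_cov -= np.outer(u, u) / (1.0 + b)
        return b
```

Maintaining $C^{-1}$ with a rank-one update avoids inverting a $d \times d$ matrix at every step, so the per-step cost is quadratic rather than cubic in the embedding dimension.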
2 Background
2.1 Contextual MDPs
We consider a Contextual Markov Decision Process¹ (CMDP [30]) given by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{C}, P, r, \mu_{\mathcal{C}}, \mu_{\mathcal{S}})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{C}$ is the context space, $P$ is the transition function, $r$ is the reward function, $\mu_{\mathcal{C}}$ is the distribution over contexts and $\mu_{\mathcal{S}}$ is the conditional initial state distribution. At each episode, we sample a context $c \sim \mu_{\mathcal{C}}$, an initial state $s_0 \sim \mu_{\mathcal{S}}(\cdot \mid c)$, and subsequent states in the episode are sampled according to $s_{t+1} \sim P(\cdot \mid s_t, a_t, c)$. Let $d_\pi^c$ denote the distribution over states induced by following policy $\pi$ in context $c$. The goal is to learn a policy $\pi$ which maximizes the expected return over all contexts, i.e. $R = \mathbb{E}_{c \sim \mu_{\mathcal{C}},\, s \sim d_\pi^c,\, a \sim \pi(s)}[r(s, a)]$.
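For concreteness, the sketch below rolls out a single episode under this definition. The `env` object and its methods (`sample_context`, `sample_initial_state`, `step`, `reward`) are a hypothetical interface standing in for $(\mu_{\mathcal{C}}, \mu_{\mathcal{S}}, P, r)$; the key point is that the context $c$ is drawn once per episode and every transition and reward is conditioned on it.

```python
def cmdp_episode(env, policy, rng):
    """Illustrative rollout of one CMDP episode under a hypothetical
    environment interface; `policy` maps a state to an action."""
    c = env.sample_context(rng)                # c ~ mu_C
    s = env.sample_initial_state(c, rng)       # s_0 ~ mu_S(. | c)
    episode_return, done = 0.0, False
    while not done:
        a = policy(s)                          # a ~ pi(. | s)
        episode_return += env.reward(s, a)     # r(s, a)
        s, done = env.step(s, a, c, rng)       # s_{t+1} ~ P(. | s_t, a_t, c)
    return episode_return
```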
Examples of CMDPs include procedurally-generated environments, such as ProcGen [17], MiniGrid [15], NetHack [41], or MiniHack [58], where each context $c$ corresponds to the random seed used to generate the environment; in this case, the number of contexts $|\mathcal{C}|$ is effectively infinite. Other examples include embodied AI environments [59, 69, 28, 61, 72], where the agent is placed in different simulated houses and must navigate to a location or find an object. In this setting, each context $c \in \mathcal{C}$ represents a house identifier and the number of houses $|\mathcal{C}|$ is typically between 20 and 1000. For an in-depth review of the literature on CMDPs and generalization in RL, see [39].
2.2 Exploration Bonuses
If the environment rewards are sparse, learning a policy using simple $\epsilon$-greedy exploration may require intractably many samples. We therefore consider methods that augment the external reward function $r$ with an intrinsic reward bonus $b$. A number of intrinsic bonuses that encourage exploration in singleton (non-contextual) MDPs have been proposed, including pseudocounts [9], intrinsic curiosity modules (ICM) [52] and random network distillation (RND) error [13]. At a high level, these methods
¹Technically, some of the environments we consider are Contextual Partially Observed MDPs (CPOMDPs), but we follow the convention in [39] and adopt the CMDP framework for simplicity. For CPOMDPs, we use recurrent networks or frame stacking to convert to CMDPs, as done in prior work [46].