A Mixture of Surprises for Unsupervised Reinforcement Learning

Andrew Zhao1  Matthieu Gaetan Lin2  Yangguang Li3  Yong-Jin Liu2†  Gao Huang1†

1Department of Automation, BNRist, Tsinghua University
2Department of Computer Science, BNRist, Tsinghua University
3SenseTime
{zqc21,lyh21}@mails.tsinghua.edu.cn, liyangguang@sensetime.com,
{liuyongjin,gaohuang}@tsinghua.edu.cn
Abstract
Unsupervised reinforcement learning aims at learning a generalist policy in a
reward-free manner for fast adaptation to downstream tasks. Most of the existing
methods propose to provide an intrinsic reward based on surprise. Maximizing
or minimizing surprise drives the agent to either explore or gain control over its
environment. However, both strategies rely on a strong assumption: the entropy of
the environment’s dynamics is either high or low. This assumption may not always
hold in real-world scenarios, where the entropy of the environment’s dynamics
may be unknown. Hence, choosing between the two objectives is a dilemma. We
propose a novel yet simple mixture of policies to address this concern, allowing us
to optimize an objective that simultaneously maximizes and minimizes the surprise.
Concretely, we train one mixture component whose objective is to maximize
the surprise and another whose objective is to minimize the surprise. Hence,
our method does not make assumptions about the entropy of the environment’s
dynamics. We call our method a Mixture Of SurpriseS (MOSS) for unsupervised
reinforcement learning. Experimental results show that our simple method achieves
state-of-the-art performance on the URLB benchmark, outperforming previous
pure surprise maximization-based objectives. Our code is available at:
https://github.com/LeapLabTHU/MOSS.
1 Introduction
Humans can learn meaningful behaviors without external supervision, i.e., in an unsupervised manner,
and then adapt those behaviors to new tasks [3]. Inspired by this, unsupervised reinforcement
learning decomposes the reinforcement learning (RL) problem into a pretraining phase and a finetuning
phase [32]. During the pretraining phase, an agent prepares for all possible tasks that a user might select.
Afterward, the agent tries to figure out the selected task as quickly as possible during finetuning [24].
Doing so allows solving the RL problem in a meaningful order [3], e.g., a cook first has to look at
what is in the fridge before deciding what to cook. Unsupervised representation learning has shown
great success in computer vision [27] and natural language processing [10]; however, one challenge
is that RL includes both behavior learning and representation learning [32, 53].
Equal contribution.
†Corresponding authors.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Current unsupervised RL methods provide an intrinsic reward to the agent during a pretraining phase
to tackle the behavior learning problem [32]. Intuitively, this intrinsic reward should incentivize
the agent to understand its environment [15]. Current methods formulate the intrinsic reward as
either maximizing or minimizing surprise [45, 6, 36, 57, 42, 43, 11, 35, 31, 17, 47, 46, 13], where
knowledge-based methods quantify surprise as the uncertainty of a prediction model [11, 42, 43] and
data-based methods measure surprise as an information-theoretic quantity [45, 6, 57, 36, 5]³. Surprise
maximization methods [24, 31, 36] formulate the problem as an exploration problem. Intuitively, to
prepare for all possible downstream tasks, an agent has to explore the state space and figure out what is
possible in the environment. For instance, one instantiation is to maximize the agent's state entropy
[36]. In contrast to surprise maximization, another line of work takes inspiration from the free-energy
principle [22, 21, 20] and proposes to minimize the surprise [6, 45]. In particular, these works argue
that external perturbations naturally provide the agent with surprising events. Hence, an agent can
learn meaningful behaviors by minimizing its surprise, and minimizing this quantity requires gaining
control over these external perturbations [45, 6]. For example, SMiRL [6] proposed to minimize the
agent's state entropy.
Figure 1: Outline of our Mixture Of SurpriseS (MOSS) strategy. We provide the agent with two paths, each
corresponding to maximizing or minimizing surprise. During the pretraining phase, the agent gathers experience
from both paths using the skills from the corresponding toolbox. Finally, during the finetuning phase, the agent
can choose any skills from any toolbox.
A closer look at these two approaches indicates a strong assumption. Substantial external perturbations
already naturally provide an agent with a high state entropy [6]. Therefore, surprise-seeking
approaches assume that the environment does not provide significant external perturbations. On
the other hand, in an environment without any external perturbations, the agent already achieves
minimum entropy by not acting [45]. Therefore, surprise minimization approaches assume that
the environment provides external perturbations for the agent to control. However, in real-world
scenarios, it is often difficult to quantify the entropy of the environment's dynamics beforehand, or
the agent might face both settings. For example, a butler robot performs mundane chores daily in
a household. However, the robot might also encounter surprising events (e.g., putting out a fire).
Therefore, a fixed assumption on the entropy of the environment dynamics is often not possible. In
other words, choosing between surprise maximization or minimization for pretraining is a dilemma.
Simultaneously optimizing these two opposite objectives does not make sense. Instead, in this
paper, we show that competence-based methods [32] offer a simple yet effective way to combine the
benefits of these two objectives. In addition to conditioning the policy on the state, competence-based
methods condition the policy on a latent vector [31, 35, 26, 55, 23, 1, 18]. Conditioning the policy
offers an appealing way to formulate the unsupervised RL problem. During a pretraining phase,
the agent tries to learn a set of possible behaviors in the environment and distills them into skills.
Then, during finetuning, the agent tries to figure out the selected task as quickly as possible, using
the repertoire of skills gathered during pretraining. For example, as illustrated in Fig. 1, we view
skill distributions as a mixture of policies instead of a single policy. In particular, we train one set
of skills whose objective is to maximize the surprise and another set of skills whose objective is to
minimize the surprise. Surprisingly, this paper shows that this simple approach, which simultaneously
optimizes two contradicting objectives, works well in practice. Our primary contribution, presented
in Section 4, is a simple intrinsic reward called MOSS that does not make assumptions about the
entropy of the environment's dynamics. In Section 5, our experimental results on URLB [32] and
ViZDoom [30] show that, surprisingly, our MOSS method achieves state-of-the-art results.

³Refer to [32] for detailed definitions of data-based, knowledge-based, and competence-based methods.
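To make the mixture view concrete, below is a minimal, illustrative sketch (not the actual MOSS reward, which is defined in Section 4): each sampled skill carries a mode indicator marking which mixture component it belongs to, and the sign of a generic surprise-based intrinsic reward is flipped accordingly. The uniform skill prior, the half-and-half mode sampling, and the toy `surprise` value are assumptions made for illustration.

```python
import numpy as np

def sample_skill(skill_dim: int, rng: np.random.Generator):
    """Sample a skill vector and a mode: mode=+1 maximizes surprise, mode=-1 minimizes it."""
    z = rng.uniform(-1.0, 1.0, size=skill_dim)   # continuous skill vector from a uniform prior
    mode = rng.choice([+1.0, -1.0])              # which mixture component this skill belongs to
    return z, mode

def mixture_intrinsic_reward(surprise: float, mode: float) -> float:
    """One reward rule for both components: the surprise-minimizing
    component simply receives the negated surprise."""
    return mode * surprise

# Toy usage: pretend `surprise` came from any surprise estimator (e.g., an entropy term).
rng = np.random.default_rng(0)
z, mode = sample_skill(skill_dim=8, rng=rng)
print(mixture_intrinsic_reward(surprise=0.73, mode=mode))
```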
We organize the paper as follows. Section 3 briefly analyzes previous unsupervised RL algorithms
under the surprise framework. Then, in Section 4, we introduce our MOSS method. Next, experimental
results in Section 5 show that on URLB [32] and ViZDoom [30], our MOSS method improves
upon previous pure maximization and minimization methods. Finally, we provide discussions and
limitations in Section 6.
2 Preliminaries
Markov Decision Process. Unsupervised RL methods studied in this paper operate under a Markov
Decision Process (MDP) [51]. In particular, we specify an MDP as a tuple
$\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, r^{\text{ext}}, \rho, \gamma)$,
where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space of the environment.
$T(\mathbf{S}' \mid \mathbf{S}, \mathbf{A})$ represents the state-transition dynamics,
$\rho : \mathcal{S} \to [0, 1]$ is the initial state distribution, and $\gamma \in [0, 1)$ is the discount
factor. At each discrete time step $t \in \mathbb{Z}$, the agent receives a state and performs an action, which
we denote as $\mathbf{S}_t \in \mathcal{S}$ and $\mathbf{A}_t \in \mathcal{A}$, respectively. During pretraining, unsupervised RL algorithms
compute an intrinsic reward $r^{\text{int}}$; during the finetuning phase, the agent receives the extrinsic reward
$r^{\text{ext}}$ given by the environment at each interaction.
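As a rough sketch of how the two phases differ, the loop below feeds the agent an intrinsic reward during pretraining and the environment's extrinsic reward during finetuning. The `env`/`agent` interfaces and the `intrinsic_reward` callable are generic placeholders (a gym-style API is assumed), not a specific method from this paper.

```python
from typing import Any, Callable

def run_phase(env: Any,
              agent: Any,
              steps: int,
              pretraining: bool,
              intrinsic_reward: Callable[[Any, Any, Any], float]) -> None:
    """Generic interaction loop: the only difference between the two
    phases is which reward the agent is trained on."""
    state = env.reset()
    for _ in range(steps):
        action = agent.act(state)
        next_state, extrinsic_reward, done, _ = env.step(action)
        if pretraining:
            # reward-free pretraining: ignore r_ext, use a self-generated r_int
            reward = intrinsic_reward(state, action, next_state)
        else:
            # finetuning: use the task reward r_ext given by the environment
            reward = extrinsic_reward
        agent.update(state, action, reward, next_state, done)
        state = env.reset() if done else next_state
```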
Skill. Intuitively, a skill is an abstraction of a specific behavior (e.g., walking), and in practice, a
skill is a latent-conditioned policy [17]. Given a latent vector $\mathbf{z}$, we denote a skill as
$\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t, \mathbf{z})$,
where $\pi_\theta$ is the policy parameterized by $\theta$. For instance, during pretraining, the latent vectors are
sampled every $n$ steps such that the latent vector $\mathbf{z}$ is associated with the behavior executed during
the associated $n$ steps.
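The sketch below illustrates how such a latent-conditioned policy is typically rolled out during pretraining, with $\mathbf{z}$ resampled every $n$ steps so that each latent is tied to an $n$-step segment of behavior. The uniform skill prior and the `env`/`policy` interfaces are illustrative assumptions.

```python
import numpy as np

def rollout_with_skills(env, policy, episode_len: int, n: int, skill_dim: int, seed: int = 0):
    """Roll out a latent-conditioned policy pi(a | s, z), resampling z every n steps.

    `env` must expose reset()/step(action) and `policy` must map (state, z) -> action;
    both are assumed interfaces, not a specific library.
    """
    rng = np.random.default_rng(seed)
    trajectory = []
    state = env.reset()
    z = rng.uniform(-1.0, 1.0, size=skill_dim)
    for t in range(episode_len):
        if t > 0 and t % n == 0:
            # a new skill governs the next n-step segment of behavior
            z = rng.uniform(-1.0, 1.0, size=skill_dim)
        action = policy(state, z)
        next_state, _, done, _ = env.step(action)
        trajectory.append((state, z, action, next_state))
        state = next_state
        if done:
            break
    return trajectory
```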
Mutual Information. Knowledge-based, data-based, and competence-based methods have different
measures of surprise. The study in this paper falls into the category of competence-based methods.
In particular, data-based and competence-based methods rely on an information-theoretic definition
of surprise, i.e., entropy. Previous competence-based methods acquire skills by maximizing the mutual
information [16] between $\mathcal{T}$ and skills $\mathbf{Z}$:
$$
\begin{aligned}
I(\mathcal{T};\mathbf{Z}) &= H[\mathcal{T}] - H[\mathcal{T}\mid\mathbf{Z}] \qquad &(1) \\
&= H[\mathbf{Z}] - H[\mathbf{Z}\mid\mathcal{T}], \qquad &(2)
\end{aligned}
$$
where $\mathcal{T}$ can be the states $\mathbf{S}$, the joint distribution of state transitions $(\mathbf{S}',\mathbf{S})$, or the state transitions
$(\mathbf{S}'\mid\mathbf{S})$. In particular, these methods differ in how they decompose the mutual information. Theoretically,
these different decompositions are equivalent, i.e., they all maximize the mutual information
between states and skills. However, the particular choice greatly influences the performance in
practice, as optimizing this objective relies on approximations.
To motivate the potential of competence-based methods over data-based or knowledge-based methods,
we provide an intuitive understanding of Eq. (1). On the one hand, the entropy term says that we
want skills that, in aggregate, explore the state space; we use it as a proxy for learning skills that cover
the set of possible behaviors. On the other hand, it is not enough to learn skills that randomly go to
different places. We want to reuse those skills as accurately as possible, meaning we need to be able
to discriminate or predict the agent's state transitions from skills. To do so, we minimize the conditional
entropy. In other words, appropriate skills should cover the set of possible behaviors and should be
easily distinguishable.
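As a quick numerical sanity check that the two decompositions in Eqs. (1) and (2) coincide, the snippet below evaluates both for a small joint distribution $p(t, z)$; the probability table is made up purely for illustration.

```python
import numpy as np

# Arbitrary joint distribution p(t, z) over 3 "transitions" and 2 skills (entries sum to 1).
p_tz = np.array([[0.20, 0.05],
                 [0.10, 0.25],
                 [0.05, 0.35]])
p_t = p_tz.sum(axis=1)          # marginal over T
p_z = p_tz.sum(axis=0)          # marginal over Z

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

H_t = entropy(p_t)
H_z = entropy(p_z)
H_tz = entropy(p_tz.ravel())
H_t_given_z = H_tz - H_z        # H[T|Z] = H[T,Z] - H[Z]
H_z_given_t = H_tz - H_t        # H[Z|T] = H[T,Z] - H[T]

print(H_t - H_t_given_z)        # Eq. (1): H[T] - H[T|Z]
print(H_z - H_z_given_t)        # Eq. (2): H[Z] - H[Z|T]  -- same value
```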
3 Information-Theoretic Skill Discovery
Competence-based methods employ different intrinsic rewards to maximize mutual information: (1)
discriminability-based and (2) exploratory-based intrinsic rewards. For example, the former rewards
the agent for discriminable skills. In contrast, the latter rewards the agent for skills that effectively
cover the state space, using a KNN density estimator [48, 36] to approximate the entropy term. Below
we analyze both approaches.
3.1 Discriminability-based Intrinsic Reward
Previous work such as DIAYN [17] uses a discriminability-based intrinsic reward. They use the
decomposition in Eq. (2) and, given a variational distribution $q_\phi$, they optimize the following
variational lower bound:
$$
\begin{aligned}
I(\mathbf{S};\mathbf{Z}) &= \mathrm{KL}\big(p(\mathbf{s},\mathbf{z}) \,\|\, p(\mathbf{s})p(\mathbf{z})\big) \\
&= \mathbb{E}_{\mathbf{s},\mathbf{z}\sim p(\mathbf{s},\mathbf{z})}\!\left[\log \frac{q_\phi(\mathbf{z}\mid\mathbf{s})}{p(\mathbf{z})}\right] + \mathbb{E}_{\mathbf{s}\sim p(\mathbf{s})}\big[\mathrm{KL}\big(p(\mathbf{z}\mid\mathbf{s}) \,\|\, q_\phi(\mathbf{z}\mid\mathbf{s})\big)\big] \\
&\geq \mathbb{E}_{\mathbf{z},\mathbf{s}\sim p(\mathbf{z},\mathbf{s})}\big[\log q_\phi(\mathbf{z}\mid\mathbf{s})\big] - \mathbb{E}_{\mathbf{z}\sim p(\mathbf{z})}\big[\log p(\mathbf{z})\big],
\end{aligned}
$$
where $\mathbf{Z}\sim p(\mathbf{Z})$ is a discrete random variable. In particular, the discriminator rewards the agent
if it can guess the skill from the state, i.e., $r^{\text{int}}(\mathbf{s}) \triangleq \log q_\phi(\mathbf{z}\mid\mathbf{s}) - \log p(\mathbf{z})$. These methods may
easily run into a chicken-and-egg problem, where skills learn to be diverse using the discriminator's
output. However, the discriminator cannot learn to discriminate skills if the skills are not diverse.
Hence, it discourages the agent from exploring. Previous work [49] has tried to solve this problem by
decoupling the aleatoric uncertainty from the epistemic uncertainty.
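The following is a minimal sketch of a discriminability-based reward of this form, assuming a uniform prior over a small number of discrete skills; the linear-softmax "discriminator" is a stand-in for the trained classifier $q_\phi(\mathbf{z}\mid\mathbf{s})$ used in practice, not the actual DIAYN network.

```python
import numpy as np

def discriminator_log_prob(state: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Stand-in discriminator: log q_phi(z | s) as a softmax over linear features.
    In DIAYN this would be a trained neural-network classifier."""
    logits = W @ state                                  # (n_skills,)
    logits -= logits.max()                              # numerical stability
    return logits - np.log(np.exp(logits).sum())

def diayn_style_reward(state: np.ndarray, skill: int, n_skills: int, W: np.ndarray) -> float:
    """r_int(s) = log q_phi(z|s) - log p(z), with a uniform skill prior p(z) = 1/n_skills."""
    log_q = discriminator_log_prob(state, W)[skill]
    log_p = -np.log(n_skills)
    return float(log_q - log_p)

# Toy usage with random "discriminator" weights.
rng = np.random.default_rng(0)
n_skills, state_dim = 4, 6
W = rng.normal(size=(n_skills, state_dim))
s = rng.normal(size=state_dim)
print(diayn_style_reward(s, skill=2, n_skills=n_skills, W=W))
```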
Furthermore, solving the chicken-and-egg problem is not enough, since a state must map to a single
skill in Eq. (2). Accordingly, methods that use the decomposition in Eq. (2) require the skill space
to be smaller than the state space so that the skills remain distinguishable. Since previous work showed
that a high-dimensional continuous skill space empirically performs better, the intrinsic reward should
not rely on the decomposition in Eq. (2). Intuitively, a continuous skill space allows one (1) to interpolate
between skills and (2) to represent a larger set of skills in a more compact representation.
Instead, in Eq. (1), rather than requiring that skills be predictable from states, $H[\mathbf{Z}\mid\mathcal{T}]$, we require that states
be predictable from skills, $H[\mathcal{T}\mid\mathbf{Z}]$. Since any state should be predictable from a given skill, the skill
space must be larger than the state space (i.e., we do not want a skill mapping to multiple states).
Therefore, another work [47] relies on a variational bound on Eq. (1):
$$
\begin{aligned}
I(\mathbf{S}';\mathbf{Z}\mid\mathbf{S}) &= \mathbb{E}_{\mathbf{z},\mathbf{s},\mathbf{s}'\sim p(\mathbf{z},\mathbf{s},\mathbf{s}')}\!\left[\log \frac{q_\phi(\mathbf{s}'\mid\mathbf{s},\mathbf{z})}{p(\mathbf{s}'\mid\mathbf{s})}\right] + \mathbb{E}_{\mathbf{z},\mathbf{s}\sim p(\mathbf{z},\mathbf{s})}\big[\mathrm{KL}\big(p(\mathbf{s}'\mid\mathbf{s},\mathbf{z}) \,\|\, q_\phi(\mathbf{s}'\mid\mathbf{s},\mathbf{z})\big)\big] \\
&\geq \mathbb{E}_{\mathbf{z},\mathbf{s},\mathbf{s}'\sim p(\mathbf{z},\mathbf{s},\mathbf{s}')}\!\left[\log \frac{q_\phi(\mathbf{s}'\mid\mathbf{s},\mathbf{z})}{p(\mathbf{s}'\mid\mathbf{s})}\right],
\end{aligned}
$$
which uses a continuous skill space; however, it does not scale to high dimensions, as the intrinsic
reward $r^{\text{int}}(\mathbf{s}',\mathbf{a},\mathbf{s}) \triangleq \log \frac{q_\phi(\mathbf{s}'\mid\mathbf{s},\mathbf{z})}{p(\mathbf{s}'\mid\mathbf{s})}$ relies on an estimation of $p(\mathbf{s}'\mid\mathbf{s})$. In particular, they assume
that $p(\mathbf{z}\mid\mathbf{s}) = p(\mathbf{z})$. Intuitively, given $\mathbf{s}$, if we assume each element $(\mathbf{z})_i$ of the latent vector to be
independent of the others, the error of this assumption will be scaled by the dimension of $\mathbf{z}$.
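To make the estimation issue concrete: the marginal $p(\mathbf{s}'\mid\mathbf{s})$ is intractable, and skill-dynamics methods in this family typically approximate it by averaging the learned model over $L$ skills sampled from the prior, which is exactly where the $p(\mathbf{z}\mid\mathbf{s}) = p(\mathbf{z})$ assumption enters. The sketch below illustrates this under those assumptions; the linear-Gaussian dynamics model is a stand-in for a learned network, not the architecture of [47].

```python
import numpy as np

def log_q(next_state, state, z, A, B):
    """Stand-in skill-dynamics model q_phi(s'|s,z): a unit-variance Gaussian whose
    mean is a linear function of (s, z). In practice this is a learned network."""
    mean = A @ state + B @ z
    diff = next_state - mean
    d = next_state.shape[0]
    return -0.5 * (diff @ diff) - 0.5 * d * np.log(2.0 * np.pi)

def variational_intrinsic_reward(next_state, state, z, A, B, L=64, rng=None):
    """r_int ~ log q(s'|s,z) - log( (1/L) sum_i q(s'|s,z_i) ), z_i ~ p(z),
    i.e., the marginal p(s'|s) is approximated under the assumption p(z|s) = p(z)."""
    if rng is None:
        rng = np.random.default_rng(0)
    log_num = log_q(next_state, state, z, A, B)
    zs = rng.uniform(-1.0, 1.0, size=(L, z.shape[0]))      # samples from the skill prior
    log_terms = np.array([log_q(next_state, state, zi, A, B) for zi in zs])
    # log-mean-exp for a numerically stable estimate of log p(s'|s)
    m = log_terms.max()
    log_denom = m + np.log(np.exp(log_terms - m).mean())
    return float(log_num - log_denom)

# Toy usage with random linear dynamics.
rng = np.random.default_rng(1)
state_dim, skill_dim = 5, 3
A = rng.normal(scale=0.1, size=(state_dim, state_dim))
B = rng.normal(scale=0.1, size=(state_dim, skill_dim))
s, z = rng.normal(size=state_dim), rng.uniform(-1, 1, size=skill_dim)
s_next = A @ s + B @ z + 0.1 * rng.normal(size=state_dim)
print(variational_intrinsic_reward(s_next, s, z, A, B, rng=rng))
```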
3.2 Exploratory-based Intrinsic Reward
As aforementioned, a discriminability-based intrinsic reward may easily run into a chicken-and-egg
problem. Hence, other work [31, 35] uses an exploratory-based intrinsic reward to address this
problem. In other words, they use the decomposition in Eq. (1), where the intrinsic reward
explicitly rewards the agent for exploring through $H[\mathcal{T}]$.

Previous work [36] maximizes the state entropy, i.e., $\mathcal{T} = \mathbf{S}$. However, the authors in [4] argue that this
often results in the discriminator simply memorizing the last state of each skill. Instead, CIC [31]
proposes to maximize the entropy of the joint distribution of state transitions (which we refer to as the
joint entropy from now on), i.e., $H[\mathbf{S}', \mathbf{S}]$.
Therefore, CIC [31] proposes to estimate the conditional entropy $H[\mathbf{S}', \mathbf{S} \mid \mathbf{Z}]$ using noise contrastive
estimation [31, 25] and the joint entropy $H[\mathbf{S}', \mathbf{S}]$ using a $k$-nearest-neighbor estimator [48] to handle
a high-dimensional skill space.
KNN-density estimation. Previous works [36, 31] approximate the joint entropy $H[\mathbf{S}', \mathbf{S}]$ using a
$k$-nearest-neighbor estimator [48]. Given a random sample of size $N$,
$\{(\mathbf{S}'^{(i)}, \mathbf{S}^{(i)})\}_{i=1}^{N} \sim P(\mathbf{S}', \mathbf{S})$,
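The general particle-based recipe behind such estimators [48], as used for intrinsic rewards in [36, 31], scores each sample by the (log) distance to its $k$ nearest neighbors within a batch, so larger nearest-neighbor distances signal higher entropy. The sketch below is a minimal, illustrative version of that recipe; the `1 +` inside the log and the averaging over the $k$ neighbors are common implementation choices rather than the exact estimator used by CIC or MOSS.

```python
import numpy as np

def knn_entropy_reward(batch: np.ndarray, k: int = 12) -> np.ndarray:
    """Particle-based entropy proxy: each row of `batch` (e.g., a (state, next_state)
    pair flattened into a vector) is rewarded by the log of its average distance to
    its k nearest neighbors within the batch. Larger distances -> higher entropy."""
    diffs = batch[:, None, :] - batch[None, :, :]       # pairwise differences, (N, N, D)
    dists = np.sqrt((diffs ** 2).sum(-1))               # pairwise Euclidean distances, (N, N)
    np.fill_diagonal(dists, np.inf)                     # exclude self-distance
    knn = np.sort(dists, axis=1)[:, :k]                 # k nearest neighbors per sample
    return np.log(1.0 + knn.mean(axis=1))               # one intrinsic reward per sample

# Toy usage on random "transitions".
rng = np.random.default_rng(0)
transitions = rng.normal(size=(256, 10))
print(knn_entropy_reward(transitions, k=12)[:5])
```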