A Mixture of Surprises for Unsupervised Reinforcement Learning

Andrew Zhao1  Matthieu Gaetan Lin2  Yangguang Li3  Yong-Jin Liu2†  Gao Huang1†

1Department of Automation, BNRist, Tsinghua University
2Department of Computer Science, BNRist, Tsinghua University
3SenseTime
{zqc21,lyh21}@mails.tsinghua.edu.cn, liyangguang@sensetime.com,
{liuyongjin,gaohuang}@tsinghua.edu.cn
Abstract
Unsupervised reinforcement learning aims at learning a generalist policy in a
reward-free manner for fast adaptation to downstream tasks. Most of the existing
methods propose to provide an intrinsic reward based on surprise. Maximizing
or minimizing surprise drives the agent to either explore or gain control over its
environment. However, both strategies rely on a strong assumption: the entropy of
the environment’s dynamics is either high or low. This assumption may not always
hold in real-world scenarios, where the entropy of the environment’s dynamics
may be unknown. Hence, choosing between the two objectives is a dilemma. We
propose a novel yet simple mixture of policies to address this concern, allowing us
to optimize an objective that simultaneously maximizes and minimizes the surprise.
Concretely, we train one mixture component whose objective is to maximize
the surprise and another whose objective is to minimize the surprise. Hence,
our method does not make assumptions about the entropy of the environment’s
dynamics. We call our method a Mixture Of SurpriseS (MOSS) for unsupervised
reinforcement learning. Experimental results show that our simple method achieves
state-of-the-art performance on the URLB benchmark, outperforming previous
pure surprise maximization-based objectives. Our code is available at:
https://github.com/LeapLabTHU/MOSS.
1 Introduction
Humans can learn meaningful behaviors without external supervision, i.e., in an unsupervised manner,
and then adapt those behaviors to new tasks [3]. Inspired by this, unsupervised reinforcement
learning decomposes the reinforcement learning (RL) problem into a pretraining phase and a finetuning
phase [32]. During the pretraining phase, an agent prepares for all possible tasks that a user might select.
Afterward, the agent tries to figure out the selected task as quickly as possible during finetuning [24].
Doing so allows solving the RL problem in a meaningful order [3], e.g., a cook first has to look at
what is in the fridge before deciding what to cook. Unsupervised representation learning has shown
great success in computer vision [27] and natural language processing [10]; however, one challenge
is that RL includes both behavior learning and representation learning [32, 53].
Equal contribution.
†Corresponding authors.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Current unsupervised RL methods provide an intrinsic reward to the agent during a pretraining phase
to tackle the behavior learning problem [32]. Intuitively, this intrinsic reward should incentivize
the agent to understand its environment [15]. Current methods formulate the intrinsic reward as
either maximizing or minimizing surprise [45, 6, 36, 57, 42, 43, 11, 35, 31, 17, 47, 46, 13], where
knowledge-based methods quantify surprise as the uncertainty of a prediction model [11, 42, 43] and
data-based methods measure surprise as an information-theoretic quantity [45, 6, 57, 36, 5]³. Surprise
maximization methods [24, 31, 36] formulate the problem as an exploration problem. Intuitively, to
prepare for all possible downstream tasks, an agent has to explore the state space and figure out what is
possible in the environment. For instance, one instantiation is to maximize the agent's state entropy
[36]. In contrast to surprise maximization, another line of work takes inspiration from the free-energy
principle [22, 21, 20] and proposes to minimize the surprise [6, 45]. In particular, these works argue
that external perturbations naturally provide the agent with surprising events. Hence, an agent can
learn meaningful behaviors by minimizing its surprise, and minimizing this quantity requires gaining
control over these external perturbations [45, 6]. For example, SMiRL [6] proposed to minimize the
agent's state entropy.
Figure 1: Outline of our Mixture Of SurpriseS (MOSS) strategy. We provide the agent with two paths, each
corresponding to maximizing or minimizing surprise. During the pretraining phase, the agent gathers experience
from both paths using the skills from the corresponding toolbox. Finally, during the finetuning phase, the agent
can choose any skills from any toolbox.
A closer look at these two approaches indicates a strong assumption. Substantial external perturbations
already naturally provide an agent with a high state entropy [6]. Therefore, surprise-seeking
approaches assume that the environment does not provide significant external perturbations. On
the other hand, in an environment without any external perturbations, the agent already achieves
minimum entropy by not acting [45]. Therefore, surprise minimization approaches assume that
the environment provides external perturbations for the agent to control. However, in real-world
scenarios, it is often difficult to quantify the entropy of the environment's dynamics beforehand, or
the agent might face both settings. For example, a butler robot performs mundane chores daily in
a household. However, the robot might also encounter surprising events (e.g., putting out a fire).
Therefore, a fixed assumption on the entropy of the environment dynamics is often not possible. In
other words, choosing between surprise maximization or minimization for pretraining is a dilemma.
Simultaneously optimizing these two opposite objectives does not make sense. Instead, in this
paper, we show that competence-based methods [32] offer a simple yet effective way to combine the
benefits of these two objectives. In addition to conditioning the policy on the state, competence-based
methods condition the policy on a latent vector [31, 35, 26, 55, 23, 1, 18]. Conditioning the policy
offers an appealing way to formulate the unsupervised RL problem. During a pretraining phase,
the agent tries to learn a set of possible behaviors in the environment and distills them into skills.
Then, during finetuning, the agent tries to figure out the selected task as quickly as possible, using
the repertoire of skills gathered during pretraining. For example, as illustrated in Fig. 1, we view
skill distributions as a mixture of policies instead of a single policy. In particular, we train one set
of skills whose objective is to maximize the surprise and another set of skills whose objective is to
minimize the surprise. Surprisingly, this paper shows that this simple approach, which simultaneously
optimizes two contradicting objectives, works well in practice. Our primary contribution, presented
in Section 4, is a simple intrinsic reward called MOSS that does not make assumptions about the
entropy of the environment's dynamics. In Section 5, our experimental results on URLB [32] and
ViZDoom [30] show that, surprisingly, our MOSS method achieves state-of-the-art results.

³Refer to [32] for detailed definitions of data-based, knowledge-based, and competence-based methods.
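To make the mixture view concrete, below is a minimal, illustrative sketch (not the actual MOSS reward, which is defined in Section 4): each sampled skill carries a mode indicator marking which mixture component it belongs to, and the sign of a generic surprise-based intrinsic reward is flipped accordingly. The uniform skill prior, the half-and-half mode sampling, and the toy `surprise` value are assumptions made for illustration.

```python
import numpy as np

def sample_skill(skill_dim: int, rng: np.random.Generator):
    """Sample a skill vector and a mode: mode=+1 maximizes surprise, mode=-1 minimizes it."""
    z = rng.uniform(-1.0, 1.0, size=skill_dim)   # continuous skill vector from a uniform prior
    mode = rng.choice([+1.0, -1.0])              # which mixture component this skill belongs to
    return z, mode

def mixture_intrinsic_reward(surprise: float, mode: float) -> float:
    """One reward rule for both components: the surprise-minimizing
    component simply receives the negated surprise."""
    return mode * surprise

# Toy usage: pretend `surprise` came from any surprise estimator (e.g., an entropy term).
rng = np.random.default_rng(0)
z, mode = sample_skill(skill_dim=8, rng=rng)
print(mixture_intrinsic_reward(surprise=0.73, mode=mode))
```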
We organize the paper as follows. Section 3 briefly analyzes previous unsupervised RL algorithms
under the surprise framework. Then, in Section 4, we introduce our MOSS method. Next, experimental
results in Section 5 show that on URLB [32] and ViZDoom [30], our MOSS method improves
upon previous pure maximization and minimization methods. Finally, we provide discussions and
limitations in Section 6.
2 Preliminaries
Markov Decision Process. Unsupervised RL methods studied in this paper operate under a Markov
Decision Process (MDP) [51]. In particular, we specify an MDP as a tuple
$\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, r^{\text{ext}}, \rho, \gamma)$,
where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space of the environment.
$T(\mathbf{S}' \mid \mathbf{S}, \mathbf{A})$ represents the state-transition dynamics,
$\rho : \mathcal{S} \to [0, 1]$ is the initial state distribution, and $\gamma \in [0, 1)$ is the discount
factor. At each discrete time step $t \in \mathbb{Z}$, the agent receives a state and performs an action, which
we denote as $\mathbf{S}_t \in \mathcal{S}$ and $\mathbf{A}_t \in \mathcal{A}$, respectively. During pretraining, unsupervised RL algorithms
compute an intrinsic reward $r^{\text{int}}$; during the finetuning phase, the agent receives the extrinsic reward
$r^{\text{ext}}$ given by the environment at each interaction.
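As a rough sketch of how the two phases differ, the loop below feeds the agent an intrinsic reward during pretraining and the environment's extrinsic reward during finetuning. The `env`/`agent` interfaces and the `intrinsic_reward` callable are generic placeholders (a gym-style API is assumed), not a specific method from this paper.

```python
from typing import Any, Callable

def run_phase(env: Any,
              agent: Any,
              steps: int,
              pretraining: bool,
              intrinsic_reward: Callable[[Any, Any, Any], float]) -> None:
    """Generic interaction loop: the only difference between the two
    phases is which reward the agent is trained on."""
    state = env.reset()
    for _ in range(steps):
        action = agent.act(state)
        next_state, extrinsic_reward, done, _ = env.step(action)
        if pretraining:
            # reward-free pretraining: ignore r_ext, use a self-generated r_int
            reward = intrinsic_reward(state, action, next_state)
        else:
            # finetuning: use the task reward r_ext given by the environment
            reward = extrinsic_reward
        agent.update(state, action, reward, next_state, done)
        state = env.reset() if done else next_state
```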
Skill. Intuitively, a skill is an abstraction of a specific behavior (e.g., walking), and in practice, a
skill is a latent-conditioned policy [17]. Given a latent vector $\mathbf{z}$, we denote a skill as
$\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t, \mathbf{z})$,
where $\pi_\theta$ is the policy parameterized by $\theta$. For instance, during pretraining, the latent vectors are
sampled every $n$ steps such that the latent vector $\mathbf{z}$ is associated with the behavior executed during
the associated $n$ steps.
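The sketch below illustrates how such a latent-conditioned policy is typically rolled out during pretraining, with $\mathbf{z}$ resampled every $n$ steps so that each latent is tied to an $n$-step segment of behavior. The uniform skill prior and the `env`/`policy` interfaces are illustrative assumptions.

```python
import numpy as np

def rollout_with_skills(env, policy, episode_len: int, n: int, skill_dim: int, seed: int = 0):
    """Roll out a latent-conditioned policy pi(a | s, z), resampling z every n steps.

    `env` must expose reset()/step(action) and `policy` must map (state, z) -> action;
    both are assumed interfaces, not a specific library.
    """
    rng = np.random.default_rng(seed)
    trajectory = []
    state = env.reset()
    z = rng.uniform(-1.0, 1.0, size=skill_dim)
    for t in range(episode_len):
        if t > 0 and t % n == 0:
            # a new skill governs the next n-step segment of behavior
            z = rng.uniform(-1.0, 1.0, size=skill_dim)
        action = policy(state, z)
        next_state, _, done, _ = env.step(action)
        trajectory.append((state, z, action, next_state))
        state = next_state
        if done:
            break
    return trajectory
```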
Mutual Information. Knowledge-based, data-based, and competence-based methods have different
measures of surprise. The study in this paper falls into the category of competence-based methods.
In particular, data-based and competence-based methods rely on an information-theoretic definition
of surprise, i.e., entropy. Previous competence-based methods acquire skills by maximizing the mutual
information [16] between $\mathcal{T}$ and skills $\mathbf{Z}$:
$$
\begin{aligned}
I(\mathcal{T};\mathbf{Z}) &= H[\mathcal{T}] - H[\mathcal{T}\mid\mathbf{Z}] \qquad &(1) \\
&= H[\mathbf{Z}] - H[\mathbf{Z}\mid\mathcal{T}], \qquad &(2)
\end{aligned}
$$
where $\mathcal{T}$ can be the states $\mathbf{S}$, the joint distribution of state transitions $(\mathbf{S}',\mathbf{S})$, or the state transitions
$(\mathbf{S}'\mid\mathbf{S})$. In particular, these methods differ in how they decompose the mutual information. Theoretically,
these different decompositions are equivalent, i.e., they all maximize the mutual information
between states and skills. However, the particular choice greatly influences the performance in
practice, as optimizing this objective relies on approximations.
To motivate the potential of competence-based methods over data-based or knowledge-based methods,
we provide an intuitive understanding of Eq. (1). On the one hand, the entropy term says that we
want skills that, in aggregate, explore the state space; we use it as a proxy for learning skills that cover
the set of possible behaviors. On the other hand, it is not enough to learn skills that randomly go to
different places. We want to reuse those skills as accurately as possible, meaning we need to be able
to discriminate or predict the agent's state transitions from skills. To do so, we minimize the conditional
entropy. In other words, appropriate skills should cover the set of possible behaviors and should be
easily distinguishable.
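As a quick numerical sanity check that the two decompositions in Eqs. (1) and (2) coincide, the snippet below evaluates both for a small joint distribution $p(t, z)$; the probability table is made up purely for illustration.

```python
import numpy as np

# Arbitrary joint distribution p(t, z) over 3 "transitions" and 2 skills (entries sum to 1).
p_tz = np.array([[0.20, 0.05],
                 [0.10, 0.25],
                 [0.05, 0.35]])
p_t = p_tz.sum(axis=1)          # marginal over T
p_z = p_tz.sum(axis=0)          # marginal over Z

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

H_t = entropy(p_t)
H_z = entropy(p_z)
H_tz = entropy(p_tz.ravel())
H_t_given_z = H_tz - H_z        # H[T|Z] = H[T,Z] - H[Z]
H_z_given_t = H_tz - H_t        # H[Z|T] = H[T,Z] - H[T]

print(H_t - H_t_given_z)        # Eq. (1): H[T] - H[T|Z]
print(H_z - H_z_given_t)        # Eq. (2): H[Z] - H[Z|T]  -- same value
```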
3 Information-Theoretic Skill Discovery
Competence-based methods employ different intrinsic rewards to maximize mutual information: (1)
discriminability-based and (2) exploratory-based intrinsic rewards. For example, the former rewards
the agent for discriminable skills. In contrast, the latter rewards the agent for skills that effectively
cover the state space, using a KNN density estimator [48, 36] to approximate the entropy term. Below
we analyze both approaches.
3.1 Discriminability-based Intrinsic Reward
Previous work such as DIAYN [17] uses a discriminability-based intrinsic reward. They use the
decomposition in Eq. (2) and, given a variational distribution $q_\phi$, they optimize the following
variational lower bound:
$$
\begin{aligned}
I(\mathbf{S};\mathbf{Z}) &= \mathrm{KL}\big(p(\mathbf{s},\mathbf{z}) \,\|\, p(\mathbf{s})p(\mathbf{z})\big) \\
&= \mathbb{E}_{\mathbf{s},\mathbf{z}\sim p(\mathbf{s},\mathbf{z})}\!\left[\log \frac{q_\phi(\mathbf{z}\mid\mathbf{s})}{p(\mathbf{z})}\right] + \mathbb{E}_{\mathbf{s}\sim p(\mathbf{s})}\big[\mathrm{KL}\big(p(\mathbf{z}\mid\mathbf{s}) \,\|\, q_\phi(\mathbf{z}\mid\mathbf{s})\big)\big] \\
&\geq \mathbb{E}_{\mathbf{z},\mathbf{s}\sim p(\mathbf{z},\mathbf{s})}\big[\log q_\phi(\mathbf{z}\mid\mathbf{s})\big] - \mathbb{E}_{\mathbf{z}\sim p(\mathbf{z})}\big[\log p(\mathbf{z})\big],
\end{aligned}
$$
where $\mathbf{Z}\sim p(\mathbf{Z})$ is a discrete random variable. In particular, the discriminator rewards the agent
if it can guess the skill from the state, i.e., $r^{\text{int}}(\mathbf{s}) \triangleq \log q_\phi(\mathbf{z}\mid\mathbf{s}) - \log p(\mathbf{z})$. These methods may
easily run into a chicken-and-egg problem, where skills learn to be diverse using the discriminator's
output. However, the discriminator cannot learn to discriminate skills if the skills are not diverse.
Hence, it discourages the agent from exploring. Previous work [49] has tried to solve this problem by
decoupling the aleatoric uncertainty from the epistemic uncertainty.
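The following is a minimal sketch of a discriminability-based reward of this form, assuming a uniform prior over a small number of discrete skills; the linear-softmax "discriminator" is a stand-in for the trained classifier $q_\phi(\mathbf{z}\mid\mathbf{s})$ used in practice, not the actual DIAYN network.

```python
import numpy as np

def discriminator_log_prob(state: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Stand-in discriminator: log q_phi(z | s) as a softmax over linear features.
    In DIAYN this would be a trained neural-network classifier."""
    logits = W @ state                                  # (n_skills,)
    logits -= logits.max()                              # numerical stability
    return logits - np.log(np.exp(logits).sum())

def diayn_style_reward(state: np.ndarray, skill: int, n_skills: int, W: np.ndarray) -> float:
    """r_int(s) = log q_phi(z|s) - log p(z), with a uniform skill prior p(z) = 1/n_skills."""
    log_q = discriminator_log_prob(state, W)[skill]
    log_p = -np.log(n_skills)
    return float(log_q - log_p)

# Toy usage with random "discriminator" weights.
rng = np.random.default_rng(0)
n_skills, state_dim = 4, 6
W = rng.normal(size=(n_skills, state_dim))
s = rng.normal(size=state_dim)
print(diayn_style_reward(s, skill=2, n_skills=n_skills, W=W))
```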
Furthermore, solving the chicken-and-egg problem is not enough, since a state must map to a single
skill in Eq. (2). Accordingly, methods that use the decomposition in Eq. (2) require the skill space
to be smaller than the state space so that the skills remain distinguishable. Since previous work showed
that a high-dimensional continuous skill space empirically performs better, the intrinsic reward should
not rely on the decomposition in Eq. (2). Intuitively, a continuous skill space allows one (1) to interpolate
between skills and (2) to represent a larger set of skills in a more compact representation.
Instead, in Eq. (1), rather than requiring that skills be predictable from states, $H[\mathbf{Z}\mid\mathcal{T}]$, we require that states
be predictable from skills, $H[\mathcal{T}\mid\mathbf{Z}]$. Since any state should be predictable from a given skill, the skill
space must be larger than the state space (i.e., we do not want a skill mapping to multiple states).
Therefore, another work [47] relies on a variational bound on Eq. (1):
$$
\begin{aligned}
I(\mathbf{S}';\mathbf{Z}\mid\mathbf{S}) &= \mathbb{E}_{\mathbf{z},\mathbf{s},\mathbf{s}'\sim p(\mathbf{z},\mathbf{s},\mathbf{s}')}\!\left[\log \frac{q_\phi(\mathbf{s}'\mid\mathbf{s},\mathbf{z})}{p(\mathbf{s}'\mid\mathbf{s})}\right] + \mathbb{E}_{\mathbf{z},\mathbf{s}\sim p(\mathbf{z},\mathbf{s})}\big[\mathrm{KL}\big(p(\mathbf{s}'\mid\mathbf{s},\mathbf{z}) \,\|\, q_\phi(\mathbf{s}'\mid\mathbf{s},\mathbf{z})\big)\big] \\
&\geq \mathbb{E}_{\mathbf{z},\mathbf{s},\mathbf{s}'\sim p(\mathbf{z},\mathbf{s},\mathbf{s}')}\!\left[\log \frac{q_\phi(\mathbf{s}'\mid\mathbf{s},\mathbf{z})}{p(\mathbf{s}'\mid\mathbf{s})}\right],
\end{aligned}
$$
which uses a continuous skill space; however, it does not scale to high dimensions, as the intrinsic
reward $r^{\text{int}}(\mathbf{s}',\mathbf{a},\mathbf{s}) \triangleq \log \frac{q_\phi(\mathbf{s}'\mid\mathbf{s},\mathbf{z})}{p(\mathbf{s}'\mid\mathbf{s})}$ relies on an estimation of $p(\mathbf{s}'\mid\mathbf{s})$. In particular, they assume
that $p(\mathbf{z}\mid\mathbf{s}) = p(\mathbf{z})$. Intuitively, given $\mathbf{s}$, if we assume each element $(\mathbf{z})_i$ of the latent vector to be
independent of the others, the error of this assumption will be scaled by the dimension of $\mathbf{z}$.
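To make the estimation issue concrete: the marginal $p(\mathbf{s}'\mid\mathbf{s})$ is intractable, and skill-dynamics methods in this family typically approximate it by averaging the learned model over $L$ skills sampled from the prior, which is exactly where the $p(\mathbf{z}\mid\mathbf{s}) = p(\mathbf{z})$ assumption enters. The sketch below illustrates this under those assumptions; the linear-Gaussian dynamics model is a stand-in for a learned network, not the architecture of [47].

```python
import numpy as np

def log_q(next_state, state, z, A, B):
    """Stand-in skill-dynamics model q_phi(s'|s,z): a unit-variance Gaussian whose
    mean is a linear function of (s, z). In practice this is a learned network."""
    mean = A @ state + B @ z
    diff = next_state - mean
    d = next_state.shape[0]
    return -0.5 * (diff @ diff) - 0.5 * d * np.log(2.0 * np.pi)

def variational_intrinsic_reward(next_state, state, z, A, B, L=64, rng=None):
    """r_int ~ log q(s'|s,z) - log( (1/L) sum_i q(s'|s,z_i) ), z_i ~ p(z),
    i.e., the marginal p(s'|s) is approximated under the assumption p(z|s) = p(z)."""
    if rng is None:
        rng = np.random.default_rng(0)
    log_num = log_q(next_state, state, z, A, B)
    zs = rng.uniform(-1.0, 1.0, size=(L, z.shape[0]))      # samples from the skill prior
    log_terms = np.array([log_q(next_state, state, zi, A, B) for zi in zs])
    # log-mean-exp for a numerically stable estimate of log p(s'|s)
    m = log_terms.max()
    log_denom = m + np.log(np.exp(log_terms - m).mean())
    return float(log_num - log_denom)

# Toy usage with random linear dynamics.
rng = np.random.default_rng(1)
state_dim, skill_dim = 5, 3
A = rng.normal(scale=0.1, size=(state_dim, state_dim))
B = rng.normal(scale=0.1, size=(state_dim, skill_dim))
s, z = rng.normal(size=state_dim), rng.uniform(-1, 1, size=skill_dim)
s_next = A @ s + B @ z + 0.1 * rng.normal(size=state_dim)
print(variational_intrinsic_reward(s_next, s, z, A, B, rng=rng))
```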
3.2 Exploratory-based Intrinsic Reward
As aforementioned, a discriminability-based intrinsic reward may easily run into a chicken-and-egg
problem. Hence, other work [31, 35] uses an exploratory-based intrinsic reward to address this
problem. In other words, they use the decomposition in Eq. (1), where the intrinsic reward
explicitly rewards the agent for exploring through $H[\mathcal{T}]$.

Previous work [36] maximizes the state entropy, i.e., $\mathcal{T} = \mathbf{S}$. However, the authors in [4] argue that this
often results in the discriminator simply memorizing the last state of each skill. Instead, CIC [31]
proposes to maximize the entropy of the joint distribution of state transitions (which we refer to as the
joint entropy from now on), i.e., $H[\mathbf{S}', \mathbf{S}]$.
Therefore, CIC [31] proposes to estimate the conditional entropy $H[\mathbf{S}', \mathbf{S} \mid \mathbf{Z}]$ using noise contrastive
estimation [31, 25] and the joint entropy $H[\mathbf{S}', \mathbf{S}]$ using a $k$-nearest-neighbor estimator [48] to handle
a high-dimensional skill space.
KNN-density estimation. Previous works [36, 31] approximate the joint entropy $H[\mathbf{S}', \mathbf{S}]$ using a
$k$-nearest-neighbor estimator [48]. Given a random sample of size $N$,
$\{(\mathbf{S}'^{(i)}, \mathbf{S}^{(i)})\}_{i=1}^{N} \sim P(\mathbf{S}', \mathbf{S})$,
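The general particle-based recipe behind such estimators [48], as used for intrinsic rewards in [36, 31], scores each sample by the (log) distance to its $k$ nearest neighbors within a batch, so larger nearest-neighbor distances signal higher entropy. The sketch below is a minimal, illustrative version of that recipe; the `1 +` inside the log and the averaging over the $k$ neighbors are common implementation choices rather than the exact estimator used by CIC or MOSS.

```python
import numpy as np

def knn_entropy_reward(batch: np.ndarray, k: int = 12) -> np.ndarray:
    """Particle-based entropy proxy: each row of `batch` (e.g., a (state, next_state)
    pair flattened into a vector) is rewarded by the log of its average distance to
    its k nearest neighbors within the batch. Larger distances -> higher entropy."""
    diffs = batch[:, None, :] - batch[None, :, :]       # pairwise differences, (N, N, D)
    dists = np.sqrt((diffs ** 2).sum(-1))               # pairwise Euclidean distances, (N, N)
    np.fill_diagonal(dists, np.inf)                     # exclude self-distance
    knn = np.sort(dists, axis=1)[:, :k]                 # k nearest neighbors per sample
    return np.log(1.0 + knn.mean(axis=1))               # one intrinsic reward per sample

# Toy usage on random "transitions".
rng = np.random.default_rng(0)
transitions = rng.normal(size=(256, 10))
print(knn_entropy_reward(transitions, k=12)[:5])
```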