minimize the surprise. Surprisingly, this paper shows that this simple approach, which simultaneously optimizes two contradicting objectives, works well in practice. Our primary contribution, presented in Section 4, is a simple intrinsic reward called MOSS that does not make assumptions about the entropy of the environment's dynamics. In Section 5, our experimental results on URLB [32] and ViZDoom [30] show that our MOSS method achieves state-of-the-art results.
We organize the paper as follows. Section 3 briefly analyzes previous unsupervised RL algorithms under the surprise framework. Then, in Section 4, we introduce our MOSS method. Next, experimental results in Section 5 show that, on URLB [32] and ViZDoom [30], our MOSS method improves upon previous pure maximization and minimization methods. Finally, we provide a discussion and limitations in Section 6.
2 Preliminaries
Markov Decision Process. Unsupervised RL methods studied in this paper operate under a Markov Decision Process (MDP) [51]. In particular, we specify an MDP as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, r^{\text{ext}}, \rho, \gamma)$, where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space of the environment. $T(S' \mid S, A)$ represents the state-transition dynamics, $\rho: \mathcal{S} \to [0, 1]$ is the initial state distribution, and $\gamma \in [0, 1)$ is the discount factor. At each discrete time step $t \in \mathbb{Z}^{*}$, the agent receives a state and performs an action, which we denote as $S_t \in \mathcal{S}$ and $A_t \in \mathcal{A}$, respectively. During pretraining, unsupervised RL algorithms compute an intrinsic reward $r^{\text{int}}$; during the finetune phase, the agent receives the extrinsic reward $r^{\text{ext}}$ given by the environment at each interaction.
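To make the two phases concrete, here is a minimal sketch of the pretrain/finetune reward protocol, assuming a generic Gym-style environment loop and hypothetical `agent` and `intrinsic_reward` objects; none of these names come from the paper.

```python
# Minimal sketch of the pretrain/finetune reward protocol (hypothetical API).
def rollout(env, agent, intrinsic_reward=None, num_steps=1000):
    state = env.reset()
    for _ in range(num_steps):
        action = agent.act(state)
        next_state, r_ext, done, _ = env.step(action)
        # Pretraining: substitute the intrinsic reward for the extrinsic one.
        # Finetuning: pass intrinsic_reward=None and use r_ext as given.
        reward = intrinsic_reward(state, action, next_state) if intrinsic_reward else r_ext
        agent.observe(state, action, reward, next_state, done)
        state = env.reset() if done else next_state
```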
Skill. Intuitively, a skill is an abstraction of a specific behavior (e.g., walking), and in practice, a skill is a latent-conditioned policy [17]. Given a latent vector $z$, we denote a skill as $\pi_\theta(a_t \mid s_t, z)$, where $\pi_\theta$ is the policy parameterized by $\theta$. For instance, during pretraining, the latent vectors are sampled every $n$ steps such that the latent vector $z$ is associated with the behavior executed during the associated $n$ steps.
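As an illustration (not the paper's implementation), a latent-conditioned skill policy can be sketched as a network that consumes the concatenation of the state and the latent $z$; the network shape, skill prior, and resampling period $n$ below are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a latent-conditioned skill policy pi_theta(a_t | s_t, z).
class SkillPolicy(nn.Module):
    def __init__(self, state_dim, skill_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + skill_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state, z):
        # The skill vector z is concatenated to the state observation.
        return self.net(torch.cat([state, z], dim=-1))

def maybe_resample_skill(t, n, z, skill_dim):
    # During pretraining, z is redrawn every n steps so that each latent
    # is tied to the behavior executed over that window.
    return torch.randn(skill_dim) if t % n == 0 else z
```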
Mutual Information. Knowledge-based, data-based, and competence-based methods have different measures of surprise. The study in this paper falls into the category of competence-based methods. In particular, data-based and competence-based methods rely on an information-theoretic definition of surprise, i.e., entropy. Previous competence-based methods acquire skills by maximizing the mutual information [16] between $\mathcal{T}$ and skills $Z$:
\begin{align}
I(\mathcal{T}; Z) &= H[\mathcal{T}] - H[\mathcal{T} \mid Z] \tag{1} \\
&= H[Z] - H[Z \mid \mathcal{T}], \tag{2}
\end{align}
where $\mathcal{T}$ can be the states $S$, the joint distribution of state transitions $(S', S)$, or the state transitions $(S' \mid S)$. In particular, these methods differ in how they decompose the mutual information. Theoretically, these different decompositions are equivalent, i.e., they all maximize the mutual information between states and skills. However, the particular choice greatly influences the performance in practice as optimizing this objective relies on approximations.
To motivate the potential of competence-based methods over data-based or knowledge-based methods, we provide an intuitive understanding of Eq. (1). On the one hand, the entropy term says that we want skills that, in aggregate, explore the state space; we use it as a proxy for learning skills that cover the set of possible behaviors. On the other hand, it is not enough to learn skills that randomly go to different places: we want to reuse those skills as accurately as possible, meaning we need to be able to discriminate or predict the agent's state transitions from skills. To do so, we minimize the conditional entropy. In other words, appropriate skills should cover the set of possible behaviors and should be easily distinguishable.
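For concreteness, the reverse decomposition in Eq. (2) is often optimized with a learned skill discriminator, as in DIAYN-style methods: with a fixed skill prior, $H[Z]$ is constant, so maximizing $I(\mathcal{T}; Z)$ amounts to rewarding states from which the executed skill can be recognized. The sketch below assumes a discrete uniform prior over skills and a hypothetical `discriminator` network; it is an illustration, not MOSS itself.

```python
import math
import torch
import torch.nn.functional as F

# Sketch of a discriminability-style intrinsic reward for Eq. (2):
# r_int = log q_phi(z | s) - log p(z), with a uniform prior over num_skills skills.
def discriminability_reward(discriminator, state, skill_id, num_skills):
    logits = discriminator(state)                    # q_phi(z | s), shape (num_skills,)
    log_q = F.log_softmax(logits, dim=-1)[skill_id]  # log-probability of the executed skill
    log_p = math.log(1.0 / num_skills)               # log p(z) under the uniform prior
    return (log_q - log_p).detach()                  # treated as a fixed reward signal
```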
3 Information-Theoretic Skill Discovery
Competence-based methods employ different intrinsic rewards to maximize mutual information: (1) discriminability-based and (2) exploratory-based intrinsic rewards. The former rewards the agent for discriminable skills, whereas the latter rewards the agent for skills that effectively cover the state space, using a KNN density estimator [48, 36] to approximate the entropy term. Below we analyze both approaches.
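As a rough illustration of the exploratory-based side, a particle-based $k$-NN estimator rewards states that lie in sparsely visited regions of (an embedding of) the state space; the constants and the choice of embedding below are assumptions, not the exact estimator used in the cited works.

```python
import torch

# Sketch of a k-NN particle entropy proxy: reward each state by the
# log-distance to its k-th nearest neighbor within the batch.
def knn_entropy_reward(states, k=12):
    # states: (batch, dim) tensor of state embeddings
    dists = torch.cdist(states, states)              # pairwise Euclidean distances
    knn_dists, _ = dists.topk(k + 1, largest=False)  # +1 skips the zero self-distance
    kth = knn_dists[:, -1]                           # distance to the k-th neighbor
    return torch.log(1.0 + kth)                      # larger in sparsely visited regions
```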