intensive, and it is also difficult to scale up. An alternative is learning in simulated environments,
which has recently surged in interest. These simulators, such as Habitat [61, 70], RLBench [27],
House3D [77], and AI2THOR [36], enable the agent to interact with its environment. However, the
visual complexity of these simulated environments is far from matching the intricate real world. The
key limitation is that hand-designing a simulation that comes close enough to what a camera in the
real world would capture is both challenging and tedious.
To remedy this issue, we propose a conceptually simple yet effective method that leverages existing
diverse datasets, builds an environment with high semantic complexity from them, and then performs
interactive learning in this environment. We do so by leveraging deep generative models that are
trained on static datasets and introducing transition dynamics in the latent space of the generative model.
Specifically, at each time step, the transition dynamics simply mix the action with a randomly sampled
latent. An exponential moving average is then applied for temporal persistency, imitating the prevalent
temporal persistency of the real world. Finally, the resulting latent is decoded to an image using a
trained generator. The generator is a conditional generative model that is conditioned on a prompt,
e.g., a class label sampled at the beginning of an episode, to achieve further temporal persistency.
For the generative model, we use conditional StyleGAN [32] in this work, chosen for its simplicity,
although our method is not restricted to it and can also be applied to other generative models such as
the language-conditioned model DALL-E [56].
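To make this concrete, the following Python sketch illustrates one way such a latent-space environment could look. The generator interface G(z, c), the action scale, and the EMA coefficient are illustrative assumptions for the sketch, not the exact settings used in this work.

    import torch

    class LatentManifoldEnv:
        """Hypothetical sketch: an interactive environment built on a frozen
        conditional generator (e.g. StyleGAN). Mixing and EMA coefficients
        are assumptions, not the paper's specification."""

        def __init__(self, generator, num_classes, latent_dim=512,
                     action_scale=1.0, ema_rate=0.9):
            self.G = generator            # frozen conditional generator: G(z, c) -> image
            self.num_classes = num_classes
            self.latent_dim = latent_dim
            self.action_scale = action_scale
            self.ema_rate = ema_rate      # temporal-persistency coefficient (assumed value)

        def reset(self):
            # Sample a prompt (here, a class label) once per episode for persistency.
            self.c = torch.randint(self.num_classes, (1,))
            self.z = torch.randn(1, self.latent_dim)
            return self._render()

        def step(self, action):
            # Mix the agent's action (shape (1, latent_dim)) with a fresh random latent.
            noise = torch.randn(1, self.latent_dim)
            proposal = self.action_scale * action + noise
            # Exponential moving average keeps consecutive observations similar.
            self.z = self.ema_rate * self.z + (1.0 - self.ema_rate) * proposal
            return self._render()

        @torch.no_grad()
        def _render(self):
            # Decode the current latent into an image observation.
            return self.G(self.z, self.c)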
We employ unsupervised reinforcement learning (RL) [41] to explore this environment, motivated
by how intelligent creatures acquire perception and action skills through curiosity [23, 35, 60].
Specifically, we use the nonparametric entropy maximization method APT [44], which
encourages the agent to actively explore the environment to seek novel and unseen observations.
Similar to other pixel-based unsupervised RL methods, APT learns an abstract representation by
using off-the-shelf data augmentation and contrastive learning techniques from vision [37, 40, 39].
While effective, designing these techniques requires domain knowledge. We show that by simply
leveraging the temporal nature of the environment, representations can be learned effectively. We do so by maximizing
the similarity between representations of current and next observations based on a siamese network [5],
without needing domain knowledge or data augmentation. Our method is named playing in
the latent manifold for unsupervised pretraining (PALM).
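For illustration, the sketch below pairs a particle-based (k-nearest-neighbor) intrinsic reward in the spirit of APT with a temporal similarity objective between embeddings of consecutive observations. The specific reward form, the stop-gradient placement, and the use of cosine similarity are assumptions made for the sketch, not a verbatim description of PALM.

    import torch
    import torch.nn.functional as F

    def knn_entropy_reward(embeddings, k=12):
        """Nonparametric intrinsic reward: average distance to the k nearest
        neighbors in representation space (APT-style particle estimate).
        Batch size and k are illustrative assumptions."""
        dists = torch.cdist(embeddings, embeddings)            # (B, B) pairwise distances
        knn_dists, _ = dists.topk(k + 1, largest=False)        # index 0 is the self-distance
        return torch.log(1.0 + knn_dists[:, 1:].mean(dim=-1))  # reward per state

    def temporal_similarity_loss(encoder, predictor, obs_t, obs_tp1):
        """Maximize agreement between representations of consecutive observations.
        The stop-gradient on the target branch is an assumed anti-collapse choice."""
        online = predictor(encoder(obs_t))                     # (B, D)
        with torch.no_grad():
            target = encoder(obs_tp1)                          # (B, D), no gradient
        return -F.cosine_similarity(online, target, dim=-1).mean()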
We conduct experiments on CIFAR classification and out-of-distribution detection by transferring
our unsupervised exploratory pretrained representations from StyleGAN-based environments. Our
experiments show that the learned representations achieve results competitive with state-of-the-
art methods in image recognition and out-of-distribution detection, despite being trained only on
synthesized data without data augmentation. We also train StyleGAN on observation data collected
from Atari and apply our method to it. We find that the learned representation helps maximize
rewards in many Atari games. Our major contributions are summarized below:
• We present a surprisingly simple yet effective approach that leverages generative models as an
interactive environment for unsupervised RL. By doing so, we connect vision datasets with
RL and enable learning representations by actively interacting with the environment.
• We demonstrate that exploration techniques used in unsupervised RL incentivize the RL agent
to learn representations from a synthetic environment without data augmentations.
• We show that PALM matches SOTA self-supervised representation learning methods on
CIFAR and out-of-distribution benchmarks.
• We show that PALM outperforms strong model-free and model-based RL trained from scratch.
It also achieves scores competitive with SOTA exploratory pretraining RL and offline-data
pretraining RL methods.
2 Related work
Exploratory pretraining in RL
Having an unsupervised pretraining stage before finetuning on
the target task has been explored in reinforcement learning to improve downstream task performance.
One common approach has been to allow the agent a period of fully unsupervised interaction with the
environment, during which the agent is trained to maximize a surrogate exploration-based objective, such
as the diversity of the states it encounters [44, 43, 78]. Others have proposed using self-supervised
objectives to generate intrinsic rewards that encourage agents to visit new states, such as the loss of an