Palm up: Playing in the Latent Manifold for
Unsupervised Pretraining
Hao Liu
UC Berkeley
Tom Zahavy
DeepMind
Volodymyr Mnih
DeepMind
Satinder Singh
DeepMind
Abstract
Large and diverse datasets have been the cornerstones of many impressive advance-
ments in artificial intelligence. Intelligent creatures, however, learn by interacting
with the environment, which changes the input sensory signals and the state of the
environment. In this work, we aim to bring the best of both worlds and propose
an algorithm that exhibits an exploratory behavior whilst it utilizes large diverse
datasets. Our key idea is to leverage deep generative models that are pretrained on
static datasets and introduce a transition dynamics model in the latent space. The transition
dynamics simply mixes an action with a randomly sampled latent and then applies
an exponential moving average for temporal persistency; the resulting latent is
decoded to an image using the pretrained generator. We then employ an unsupervised
reinforcement learning algorithm to explore in this environment and perform un-
supervised representation learning on the collected data. We further leverage the
temporal information of this data to pair data points as a natural supervision for rep-
resentation learning. Our experiments suggest that the learned representations can
be successfully transferred to downstream tasks in both vision and reinforcement
learning domains.
1 Introduction
Large and diverse datasets have been the cornerstones of many impressive successes at the frontier of artificial intelligence, such as protein folding [64, 28], image recognition [55, 13], and understanding natural language [6, 12]. Training machine learning models on diverse datasets that cover a breadth of human-written text and natural images often dramatically improves performance and enables impressive generalization capabilities [16, 6, 55, 12]. These models are learned from a fixed set of images, videos, or text, guided by supervision that comes from ground-truth labels [13, 55], self-supervised contrastive learning [10], or masked token prediction [6, 12], among other approaches.
In contrast, intelligent creatures learn to perform interactions that actively change the input sensory signals and the state of the environment towards a desired configuration. For example, in psychology, it has been shown that interaction with the environment is vital for developing flexible and general intelligence [23, 74, 17]. During the first few months of interactions, infants develop meaningful understandings about objects [67] and prefer to look at exemplars from a novel class (e.g., dogs) after observing exemplars from a different class (e.g., cats) [54]. We hypothesize that by situating an agent in an environment of high semantic complexity, the agent can develop interesting cognitive abilities. Given the effectiveness of learning from large and diverse datasets and the significance of interactive learning behaviors in intelligent creatures, it is therefore important to connect the two to build better learning algorithms.
To bridge the gap, perhaps one straightforward solution is learning in the diverse real world [1, 53]; however, teaching a robot to interact in the physical world is both time-consuming and resource-intensive, and it is also difficult to scale up. An alternative is learning in simulated environments, which have seen a surge of interest recently. These simulators, such as Habitat [61, 70], RLBench [27], House3D [77], and AI2THOR [36], enable the agent to interact with its environment. However, the visual complexity of these simulated environments is far from matching the intricate real world. The key limitation is that building a hand-designed simulation that is close enough to what a camera in the real world would capture is both challenging and tedious.

This work was partially done at DeepMind. Correspondence to hao.liu@cs.berkeley.edu

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
To remedy this issue, we propose a conceptually simple yet effective method that leverages existing diverse datasets, builds an environment of high semantic complexity from them, and then performs interactive learning in this environment. We do so by leveraging deep generative models that are trained on static datasets and introducing transition dynamics in the latent space of the generative model. Specifically, at each time step, the transition dynamics simply mixes the action with a randomly sampled latent and then applies an exponential moving average for temporal persistency, imitating the temporal persistency prevalent in the real world. Finally, the resulting latent is decoded to an image using a trained generator. The generator is a conditional generative model conditioned on a prompt, e.g., a class label, sampled at the beginning of an episode to achieve further temporal persistency. For the generative model, we use a conditional StyleGAN [32] in this work, chosen for its simplicity, although our method is not restricted to it and can also be applied to other generative models such as the language-conditioned model DALL·E [56].
We employ unsupervised reinforcement learning (RL) [41] to explore this environment, motivated by how intelligent creatures acquire perception and action skills through curiosity [23, 35, 60]. Specifically, we use the nonparametric entropy maximization method APT [44], which encourages the agent to actively explore the environment to seek novel and unseen observations. Similar to other pixel-based unsupervised RL methods, APT learns an abstract representation using off-the-shelf data augmentation and contrastive learning techniques from vision [37, 40, 39]. While effective, designing these techniques requires domain knowledge. We show that by simply leveraging the temporal nature of the collected data, representations can be learned effectively. We do so by maximizing the similarity between the representations of the current and next observations with a Siamese network [5], without relying on domain knowledge or data augmentation. We name our method playing in the latent manifold for unsupervised pretraining (PALM).
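To make this objective concrete, below is a minimal sketch of one way such a temporal similarity loss could look, assuming a SimSiam-style encoder/predictor pair with a stop-gradient on the target branch; the function and module names are illustrative and the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def temporal_similarity_loss(encoder, predictor, obs_t, obs_tp1):
    """Sketch of a Siamese loss that pulls together the representations of two
    consecutive observations (o_t, o_{t+1}) collected by the exploring agent.
    No data augmentation is used; temporal adjacency provides the positive pair."""
    z_t = encoder(obs_t)      # representation of the observation at time t
    z_tp1 = encoder(obs_tp1)  # representation of the observation at time t+1
    p_t = predictor(z_t)      # predictor head applied to each branch
    p_tp1 = predictor(z_tp1)

    def neg_cos(p, z):
        # Negative cosine similarity with stop-gradient on the target branch.
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

    # Symmetrized over the two time steps.
    return 0.5 * (neg_cos(p_t, z_tp1) + neg_cos(p_tp1, z_t))
```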
We conduct experiments on CIFAR classification and out-of-distribution detection by transferring the representations obtained from unsupervised exploratory pretraining in StyleGAN-based environments. Our experiments show that the learned representations achieve results competitive with state-of-the-art methods in image recognition and out-of-distribution detection, despite being trained only on synthesized data and without data augmentation. We also train StyleGAN on observation data collected from Atari and apply our method to it. We find that the learned representations help in maximizing rewards in many Atari games. Our major contributions are summarized below:

• We present a surprisingly simple yet effective approach that leverages generative models as an interactive environment for unsupervised RL. By doing so, we connect vision datasets with RL and enable learning representations by actively interacting with the environment.
• We demonstrate that exploration techniques used in unsupervised RL incentivize the RL agent to learn representations from a synthetic environment without data augmentations.
• We show that PALM matches SOTA self-supervised representation learning methods on CIFAR and out-of-distribution benchmarks.
• We show that PALM outperforms strong model-free and model-based RL trained from scratch, and achieves scores competitive with SOTA exploratory-pretraining and offline-data-pretraining RL methods.
2 Related work
Exploratory pretraining in RL
Having an unsupervised pretraining stage before finetuning on the target task has been explored in reinforcement learning to improve downstream task performance. One common approach has been to allow the agent a period of fully unsupervised interaction with the environment, during which the agent is trained to maximize a surrogate exploration-based objective such as the diversity of the states it encounters [44, 43, 78]. Others have proposed using self-supervised objectives to generate intrinsic rewards that encourage agents to visit new states, such as the loss of an inverse dynamics model [52, 7]. SGI [63] combines forward predictive representation learning [62] with an inverse dynamics model [52] and demonstrates the power of representation pretraining for downstream RL tasks. Massive-scale unsupervised pretraining has also shown strong results [8]. Laskin et al. [41] conducted a comparison of different unsupervised pretraining reinforcement learning algorithms. Finally, Chaplot et al. [9] and Weihs et al. [75] studied training RL agents in game simulators and transferring their representations to various vision tasks. Their environments are equipped with carefully chosen domain-specific reward functions to guide the learning of the RL agent, and the architectures of their RL agents are fairly complicated.
Our work differs in that we do not rely on hand-crafted simulators and renderers, which require a large amount of domain knowledge and effort to build; instead, we leverage generative models as renderers. Unlike much prior work on unsupervised pretraining in RL, our work does not focus on improving transfer performance to downstream RL tasks, although it can be used for this purpose.
Training with synthetic data
Using deep generative models as a source of synthetic data for representation learning has been studied in prior work [59, 26, 33]. These generative models are fit to real image datasets and produce realistic-looking images as samples. Baradad et al. [2] studied using data sampled from randomly initialized generative models to train contrastive representations. Gowal et al. [20] studied combining data sampled from pretrained generative models with real data for adversarial training and demonstrated improved robustness. The use of synthesized data has also been explored in reinforcement learning under the heading of domain randomization [72], where 3D synthetic data is rendered under a variety of lighting conditions to transfer to real environments where the lighting may be unknown. Our approach does away with the hand-crafted simulation engine entirely by making the training data diverse through unsupervised exploration. Different from these works, ours focuses on leveraging a generative model as an interactive environment and learns representations without using data augmentation.
Temporal persistent representation
Using temporally persistent information for representation learning has been proposed in the past with motivations similar to ours. It has been used in learning representations from videos [3, 76, 48, 19, 14, 51] by minimizing various metrics of representation difference over a temporal segment. Learning persistent representations has also been explored in reinforcement learning, where it has been shown to improve data efficiency [50, 65, 62, 79] and downstream task performance [68, 63]. In relation to these prior efforts, our work studies visual representation learning from interaction experience grounded in real-world data.
3 Preliminary
Unsupervised reinforcement learning
Reinforcement learning considers the problem of finding an optimal policy for an agent that interacts with an uncertain environment and collects reward per action [69]. The agent maximizes its cumulative reward by interacting with its environment. Formally, this problem can be viewed as a Markov decision process (MDP) defined by $(\mathcal{S}, \mathcal{A}, T, \rho_0, r, \gamma)$, where $\mathcal{S} \subseteq \mathbb{R}^{n_s}$ is a set of $n_s$-dimensional states, $\mathcal{A} \subseteq \mathbb{R}^{n_a}$ is a set of $n_a$-dimensional actions, $T: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the state transition probability distribution, $\rho_0: \mathcal{S} \to [0, 1]$ is the distribution over initial states, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. At an environment state $s \in \mathcal{S}$, the agent takes an action $a \in \mathcal{A}$ under the (unknown) environment dynamics defined by the transition probability $T(s' \mid s, a)$, and the reward function yields a reward immediately following the action $a_t$ performed in state $s_t$. In value-based reinforcement learning, the agent learns an estimate of the expected discounted return, a.k.a. the state-action value function, $Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\!\left[\sum_{l=0}^{\infty} \gamma^{l} r(s_{t+l}, a_{t+l})\right]$. A new policy can be derived from the value function by acting $\epsilon$-greedily with respect to the action values (discrete actions) or by using the policy gradient to maximize the value function (continuous actions).

In unsupervised reinforcement learning, the reward function is defined as some form of intrinsic reward that is agnostic to the standard task-specific reward function, $r := r_{\text{intrinsic}}$. The intrinsic reward function is usually constructed to encourage better exploration and is computed from the states and actions collected by the agent.
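For concreteness, the following is a minimal sketch of a particle-based (k-nearest-neighbor) intrinsic reward in the spirit of APT [44]; the function name, scaling, and choice of averaging are assumptions, not the exact formulation used in the paper.

```python
import torch

def knn_intrinsic_reward(batch_repr, memory_repr, k=12, eps=1e-6):
    """Nonparametric entropy-style intrinsic reward: each state's reward grows
    with its distance to its k nearest neighbors in representation space,
    encouraging the agent to visit novel regions.

    batch_repr:  (B, D) representations of the current batch of states.
    memory_repr: (N, D) representations of previously collected states, N >= k.
    """
    dists = torch.cdist(batch_repr, memory_repr)          # (B, N) pairwise distances
    knn_dists, _ = dists.topk(k, dim=1, largest=False)    # distances to the k nearest neighbors
    # Particle-based entropy estimate: log of the mean kNN distance per state.
    return torch.log(eps + knn_dists.mean(dim=1))
```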
Generative adversarial model
Generative Adversarial Networks (GANs) [18] consider the problem of generating photorealistic images. The StyleGAN [30, 32, 31] architecture is among the state of the art in high-resolution image generation for a multitude of different natural image categories such as faces, buildings, and animals.
To generate high-quality, high-resolution images, StyleGAN makes use of a specialized generator architecture which consists of a mapping network and a synthesis network. The mapping network converts a latent vector $z \in \mathcal{Z}$, with $\mathcal{Z} \subseteq \mathbb{R}^{n}$, into a vector $w \in \mathcal{W}$ in an intermediate latent space, with $\mathcal{W} \subseteq \mathbb{R}^{n}$. The mapping network is implemented as a multilayer perceptron that typically consists of 8 layers. The resulting vector $w$ in the intermediate latent space is then transformed by learned affine transformations and used as input to the synthesis network.
The synthesis network consists of multiple blocks, each of which takes three inputs. First, each block takes a feature map that contains the current content of the image being generated. Second, each block takes a transformed representation of the vector $w$ as input to its style parts, followed by a normalization of the feature map. Third, each block receives per-pixel noise that is added to the feature maps to provide stochastic variation.
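As a rough illustration of the structure described above, here is a heavily simplified, unconditional sketch of a StyleGAN-like mapping network and a single style-modulated synthesis block; the layer sizes, the AdaIN-style modulation, and the module names are placeholders rather than the actual StyleGAN implementation.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """MLP that maps a latent z in Z to an intermediate latent w in W."""
    def __init__(self, latent_dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)  # w in the intermediate latent space W

class SynthesisBlock(nn.Module):
    """One block of the synthesis network: content feature map, style from w, noise."""
    def __init__(self, in_ch, out_ch, w_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.to_style = nn.Linear(w_dim, out_ch * 2)  # learned affine: per-channel scale and bias
        self.norm = nn.InstanceNorm2d(out_ch)

    def forward(self, x, w, noise=None):
        x = self.conv(x)
        if noise is not None:
            x = x + noise                      # per-pixel noise for stochastic detail
        scale, bias = self.to_style(w).chunk(2, dim=1)
        x = self.norm(x)                       # normalize the feature map
        return x * (1 + scale[..., None, None]) + bias[..., None, None]
```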
4 Method
Figure 1: Overview of the proposed method. (A) A conditional generative model is pretrained on a static dataset and conditioned on a prompt (e.g., a language description or class label) sampled at the beginning of an episode; a transition dynamics is defined in the latent space of the generator, and the agent maximizes the nonparametric entropy of its experience in the learned representation space. (B) The transition dynamics mixes a randomly sampled latent with the action (of the same dimension), followed by an exponential moving average for temporal persistency; the resulting latent is decoded to an image using the pretrained generator. (C) The representation is learned without data augmentation by maximizing the representation similarity between two consecutive synthesized observations with a Siamese network. The agent is updated using unsupervised reinforcement learning with the representation detached.
Our objective is to leverage pretrained deep generative models $G: \mathcal{Z} \times \mathcal{C} \to \mathcal{S}$, where $\mathcal{Z}$ denotes the latent space, $\mathcal{C}$ denotes prompts or labels, and $\mathcal{S}$ denotes observations, to build an interactive environment and train an unsupervised reinforcement learning agent in this environment for representation pretraining.
4.1 Latent environment dynamics
The transition dynamics are designed in the latent space of StyleGAN. At the beginning of an episode, a class label or prompt $c$ is randomly sampled. At each time step, $c$ and a latent $z_t$ that depends on the action and the previous latent, $z_t = T(a, z_{t-1})$, are transformed by the synthesis network of StyleGAN into an image, which serves as the observation $s_t = G(z_t, c)$. We note that while the environment is conditioned on the label, the ground-truth label information is not directly used by PALM.
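A minimal sketch of this latent environment is given below, assuming the action has the same dimensionality as the latent and using illustrative mixing and EMA coefficients; the coefficient values and the class interface are assumptions rather than the paper's exact settings.

```python
import torch

class LatentEnvDynamics:
    """Sketch of a PALM-style latent environment: at each step, mix the action with
    a freshly sampled latent, smooth it with an exponential moving average for
    temporal persistency, and decode the result with a pretrained (conditional)
    generator."""

    def __init__(self, generator, latent_dim, mix=0.5, ema=0.9):
        self.G = generator        # pretrained generator G(z, c) -> image
        self.latent_dim = latent_dim
        self.mix = mix            # weight of the action in the latent mixture (assumed)
        self.ema = ema            # EMA coefficient for temporal persistency (assumed)

    def reset(self, prompt):
        self.c = prompt                           # class label or prompt for this episode
        self.z = torch.randn(self.latent_dim)     # initial latent
        return self.G(self.z, self.c)

    def step(self, action):
        noise = torch.randn(self.latent_dim)      # randomly sampled latent
        mixed = self.mix * action + (1 - self.mix) * noise
        self.z = self.ema * self.z + (1 - self.ema) * mixed  # temporal persistency via EMA
        return self.G(self.z, self.c)             # decode the latent to an image observation
```

In practice, the generator here would be the pretrained conditional StyleGAN synthesis pipeline described above, so the agent's actions steer a trajectory through its latent space rather than a hand-built simulator.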