intensive, and it is also difficult to scale up. An alternative is learning in simulated environments,
which has recently surged in interest. These simulators, such as Habitat [61, 70], RLBench [27],
House3D [77], and AI2THOR [36], enable the agent to interact with its environment. However, the
visual complexity of these simulated environments is far from matching the intricate real world. The
key limitation is that hand-designing a simulation that comes close enough to what a camera in the
real world would capture is both challenging and tedious.
To remedy this issue, we propose a conceptually simple yet effective method that leverages existing
diverse datasets, builds an environment with high semantic complexity from them, and then performs
interactive learning in this environment. We do so by leveraging deep generative models that are
trained on static datasets and introducing transition dynamics in the latent space of the generative model.
Specifically, at each time step, the transition dynamics simply mix the action with a randomly sampled
latent. An exponential moving average is then applied for temporal persistency, imitating the prevalent
temporal persistency of the real world. Finally, the resulting latent is decoded to an image using a
trained generator. The generator is a conditional generative model that is conditioned on a prompt,
e.g., a class label sampled at the beginning of an episode, to achieve further temporal persistency.
For the generative model, we use conditional StyleGAN [32] in this work, chosen for its simplicity,
although our method is not restricted to it and can also be applied to other generative models such as
the language-conditioned model DALL-E [56].
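To make this concrete, the following Python sketch illustrates one way such a latent-space environment could look. The generator interface G(z, c), the action scale, and the EMA coefficient are illustrative assumptions for the sketch, not the exact settings used in this work.

    import torch

    class LatentManifoldEnv:
        """Hypothetical sketch: an interactive environment built on a frozen
        conditional generator (e.g. StyleGAN). Mixing and EMA coefficients
        are assumptions, not the paper's specification."""

        def __init__(self, generator, num_classes, latent_dim=512,
                     action_scale=1.0, ema_rate=0.9):
            self.G = generator            # frozen conditional generator: G(z, c) -> image
            self.num_classes = num_classes
            self.latent_dim = latent_dim
            self.action_scale = action_scale
            self.ema_rate = ema_rate      # temporal-persistency coefficient (assumed value)

        def reset(self):
            # Sample a prompt (here, a class label) once per episode for persistency.
            self.c = torch.randint(self.num_classes, (1,))
            self.z = torch.randn(1, self.latent_dim)
            return self._render()

        def step(self, action):
            # Mix the agent's action (shape (1, latent_dim)) with a fresh random latent.
            noise = torch.randn(1, self.latent_dim)
            proposal = self.action_scale * action + noise
            # Exponential moving average keeps consecutive observations similar.
            self.z = self.ema_rate * self.z + (1.0 - self.ema_rate) * proposal
            return self._render()

        @torch.no_grad()
        def _render(self):
            # Decode the current latent into an image observation.
            return self.G(self.z, self.c)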
We employ unsupervised reinforcement learning (RL) [41] to explore this environment, motivated
by how intelligent creatures acquire perception and action skills through curiosity [23, 35, 60].
Specifically, we use the nonparametric entropy maximization method APT [44], which
encourages the agent to actively explore the environment to seek novel and unseen observations.
Similar to other pixel-based unsupervised RL methods, APT learns an abstract representation by
using off-the-shelf data augmentation and contrastive learning techniques from vision [37, 40, 39].
While effective, designing these techniques requires domain knowledge. We show that by simply
leveraging the temporal nature of the environment, representations can be learned effectively. We do so by maximizing
the similarity between representations of current and next observations based on a siamese network [5],
without needing domain knowledge or data augmentation. Our method is named playing in
the latent manifold for unsupervised pretraining (PALM).
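For illustration, the sketch below pairs a particle-based (k-nearest-neighbor) intrinsic reward in the spirit of APT with a temporal similarity objective between embeddings of consecutive observations. The specific reward form, the stop-gradient placement, and the use of cosine similarity are assumptions made for the sketch, not a verbatim description of PALM.

    import torch
    import torch.nn.functional as F

    def knn_entropy_reward(embeddings, k=12):
        """Nonparametric intrinsic reward: average distance to the k nearest
        neighbors in representation space (APT-style particle estimate).
        Batch size and k are illustrative assumptions."""
        dists = torch.cdist(embeddings, embeddings)            # (B, B) pairwise distances
        knn_dists, _ = dists.topk(k + 1, largest=False)        # index 0 is the self-distance
        return torch.log(1.0 + knn_dists[:, 1:].mean(dim=-1))  # reward per state

    def temporal_similarity_loss(encoder, predictor, obs_t, obs_tp1):
        """Maximize agreement between representations of consecutive observations.
        The stop-gradient on the target branch is an assumed anti-collapse choice."""
        online = predictor(encoder(obs_t))                     # (B, D)
        with torch.no_grad():
            target = encoder(obs_tp1)                          # (B, D), no gradient
        return -F.cosine_similarity(online, target, dim=-1).mean()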
We conduct experiments on CIFAR classification and out-of-distribution detection by transferring
our unsupervised exploratory pretrained representations from StyleGAN-based environments. Our
experiments show that the learned representations achieve results competitive with state-of-the-
art methods in image recognition and out-of-distribution detection, despite being trained only on
synthesized data without data augmentation. We also train StyleGAN on observation data collected
from Atari and apply our method to it. We find that the learned representation helps maximize
rewards in many Atari games. Our major contributions are summarized below:
• We present a surprisingly simple yet effective approach that leverages generative models as an
interactive environment for unsupervised RL. By doing so, we connect vision datasets with
RL and enable learning representations by actively interacting with the environment.
• We demonstrate that exploration techniques used in unsupervised RL incentivize the RL agent
to learn representations from a synthetic environment without data augmentations.
• We show that PALM matches SOTA self-supervised representation learning methods on
CIFAR and out-of-distribution benchmarks.
• We show that PALM outperforms strong model-free and model-based RL trained from scratch.
It also achieves scores competitive with SOTA exploratory pretraining RL and offline-data
pretraining RL methods.
2 Related work
Exploratory pretraining in RL
Having an unsupervised pretraining stage before finetuning on
the target task has been explored in reinforcement learning to improve downstream task performance.
One common approach has been to allow the agent a period of fully unsupervised interaction with the
environment, during which the agent is trained to maximize a surrogate exploration-based objective, such
as the diversity of the states it encounters [44, 43, 78]. Others have proposed using self-supervised
objectives to generate intrinsic rewards that encourage agents to visit new states, such as the loss of an