CLUTR: Curriculum Learning via Unsupervised Task Representation Learning

Abdus Salam Azad¹, Izzeddin Gur², Jasper Emhoff¹, Nathaniel Alexis¹, Aleksandra Faust², Pieter Abbeel¹, Ion Stoica¹
Abstract
Reinforcement Learning (RL) algorithms are often known for sample inefficiency and difficult generalization. Recently, Unsupervised Environment Design (UED) emerged as a new paradigm for zero-shot generalization by simultaneously learning a task distribution and agent policies on the generated tasks. This is a non-stationary process in which the task distribution evolves along with the agent policies, creating instability over time. While past works demonstrated the potential of such approaches, sampling effectively from the task space remains an open challenge, bottlenecking these approaches. To this end, we introduce CLUTR: a novel unsupervised curriculum learning algorithm that decouples task representation and curriculum learning into a two-stage optimization. It first trains a recurrent variational autoencoder on randomly generated tasks to learn a latent task manifold. Next, a teacher agent creates a curriculum by maximizing a minimax REGRET-based objective on a set of latent tasks sampled from this manifold. Using the fixed, pretrained task manifold, we show that CLUTR successfully overcomes the non-stationarity problem and improves stability. Our experimental results show CLUTR outperforms PAIRED, a principled and popular UED method, in the challenging CarRacing and navigation environments, achieving 10.6X and 45% improvements in zero-shot generalization, respectively. CLUTR also performs comparably to the non-UED state-of-the-art for CarRacing, while requiring 500X fewer environment interactions.
¹University of California, Berkeley. ²Google Research. Correspondence to: Abdus Salam Azad <salam azad@berkeley.edu>.
1. Introduction
Deep Reinforcement Learning (RL) has shown exciting progress in the past decade in many challenging domains, including Atari (Mnih et al., 2015), Dota (Berner et al., 2019), and Go (Silver et al., 2016). However, deep RL is also known for its sample inefficiency and difficult generalization: performing poorly on unseen tasks or failing altogether with the slightest change (Cobbe et al., 2019; Azad et al., 2022; Zhang et al., 2018). While Curriculum Learning (CL) algorithms have been shown to improve RL sample efficiency by adapting the training task distribution, i.e., the curriculum (Portelas et al., 2020; Narvekar et al., 2020), recently a class of unsupervised CL algorithms called Unsupervised Environment Design (UED) (Dennis et al., 2020; Jiang et al., 2021a) has shown promising zero-shot generalization by automatically generating the training tasks and adapting the curriculum simultaneously.
UED algorithms employ a teacher that generates training tasks by sampling the free parameters of the environment (e.g., the start, goal, and obstacle locations for a navigation task); the teacher can be either adaptive or random. Contemporary adaptive UED teachers, i.e., PAIRED (Dennis et al., 2020) and REPAIRED (Jiang et al., 2021a), are implemented as RL agents whose action space is the free task parameters. The teacher agent aims to generate tasks that maximize the student agent's regret, defined as the performance gap between the student agent and an optimal policy. Despite promising zero-shot generalization, adaptive-teacher UEDs are still sample inefficient.
This sample inefficiency is attributed primarily to the difficulty of training a regret-based RL teacher (Parker-Holder et al., 2022). First, the teacher receives a sparse reward only after specifying the full parameterization of a task, leading to a long-horizon credit assignment problem. Additionally, the teacher agent faces a combinatorial explosion problem if the parameter space is permutation invariant: e.g., for a navigation task, a set of obstacles corresponds to factorially many different permutations of the parameters.¹ Most importantly, the teacher needs to simultaneously learn a task manifold, from scratch, to generate training tasks and navigate this manifold to induce an efficient curriculum. However, the teacher learns this task manifold implicitly, based on the student's regret, and as the student is continuously co-learning with the teacher, the task manifold also keeps evolving over time. Hence, the simultaneous learning of the task manifold and the curriculum results in instability over time and makes this a difficult learning problem.

¹ Consider a 13x13 grid for a navigation task, where the locations are numbered from 1 to 169. Also consider a wall made of four obstacles spanning the locations {21, 22, 23, 24}. This wall can be represented using any permutation of this set, e.g., {22, 24, 23, 21} or {23, 21, 24, 22}, resulting in a combinatorial explosion.
To address the above-mentioned challenges, we present Curriculum Learning via Unsupervised Task Representation Learning (CLUTR). At the core of CLUTR lies a hierarchical graphical model that decouples task representation learning from curriculum learning. We develop a variational approximation to the UED problem and employ a Recurrent Variational AutoEncoder (VAE), pretrained in an unsupervised manner, to learn a latent task manifold. Unlike contemporary adaptive teachers, which build tasks from scratch one parameter at a time, the CLUTR teacher generates tasks in a single timestep by sampling points from the latent task manifold and uses the generative model to translate them into complete tasks. The CLUTR teacher learns the curriculum by navigating the pretrained and fixed task manifold via maximizing regret. By utilizing a pretrained latent task manifold, the CLUTR teacher can be trained as a contextual bandit, overcoming the long-horizon credit assignment problem, and can create a curriculum much more efficiently, improving stability at no cost to its effectiveness. Finally, by carefully introducing bias into the training corpus (such as sorting each parameter vector), CLUTR solves the combinatorial explosion problem of the parameter space without using any costly environment interactions.
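As a small, self-contained illustration (not taken from the paper) of why permutation invariance blows up the parameter space and how sorting removes the redundancy, consider the wall example from footnote 1:

```python
from itertools import permutations

# A wall made of obstacles at grid locations {21, 22, 23, 24} (see footnote 1) can appear
# in the task parameter vector in any of 4! = 24 orders, even though all of them describe
# the same task. Sorting maps every ordering to one canonical parameterization.
wall = [21, 22, 23, 24]
orderings = set(permutations(wall))
print(len(orderings))                              # 24 equivalent parameter vectors
print({tuple(sorted(p)) for p in orderings})       # {(21, 22, 23, 24)}: one after sorting
```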
While CLUTR can be integrated with any adaptive-teacher UED, we implement CLUTR on top of PAIRED, one of the most principled and popular UEDs. Our experimental results show that CLUTR outperforms PAIRED, both in terms of generalization and sample efficiency, in the challenging pixel-based continuous CarRacing and partially observable discrete navigation tasks. For CarRacing, CLUTR achieves 10.6X higher zero-shot generalization on the F1 benchmark (Jiang et al., 2021a), which is modeled on 20 real-life F1 racing tracks. Furthermore, CLUTR performs comparably to the non-UED attention-based CarRacing SOTA (Tang et al., 2020), outperforming it on nine of the 20 test tracks while requiring 500X fewer environment interactions. In navigation tasks, CLUTR outperforms PAIRED in 14 out of the 16 unseen tasks, achieving a 45% higher solve rate.
In summary, we make the following contributions: i) we introduce CLUTR, a novel adaptive-teacher UED algorithm, derived from a hierarchical graphical model for UEDs, that augments the teacher with unsupervised task-representation learning; ii) by decoupling task representation learning from curriculum learning, CLUTR solves the long-horizon credit assignment and combinatorial explosion problems faced by regret-based adaptive-teacher UEDs such as PAIRED; iii) our experimental results show that CLUTR significantly outperforms PAIRED, both in terms of generalization and sample efficiency, in two challenging domains: CarRacing and navigation.
2. Related Work
Unsupervised Curriculum Design: Dennis et al. (2020) were the first to formalize UED and introduced the minimax regret-based UED teacher algorithm, PAIRED, with a strong theoretical robustness guarantee. However, gradient-based multi-agent RL has no convergence guarantees and often fails to converge in practice (Mazumdar et al., 2019). Pre-existing techniques like Domain Randomization (DR) (Jakobi, 1997; Sadeghi & Levine, 2016; Tobin et al., 2017) and minimax adversarial curriculum learning (Morimoto & Doya, 2005; Pinto et al., 2017) also fall under the category of UEDs. The DR teacher follows a uniform random strategy, while minimax adversarial teachers follow the maximin criterion, i.e., they generate tasks that minimize the returns of the agent. POET (Wang et al., 2019) and Enhanced POET (Wang et al., 2020) also approached UED, before PAIRED, using an evolutionary approach with a co-evolving population of tasks and agents.
Recently, Jiang et al. (2021a) proposed Dual Curriculum Design (DCD), a novel class of UEDs that augments UED generation methods (e.g., DR and PAIRED) with replay capabilities. DCD involves two teachers: one actively generates tasks with PAIRED or DR, while the other curates the curriculum to replay previously generated tasks with Prioritized Level Replay (PLR) (Jiang et al., 2021b). Jiang et al. (2021a) show that, even with random generation (i.e., DR), PLR can learn minimax-regret agents at Nash equilibrium when the students are updated only on replayed levels (but not when levels are first generated, i.e., without PLR's exploratory student gradient updates) and a regret-based scoring function is used; they call this variation Robust PLR. They also introduce REPAIRED, which combines PAIRED with Robust PLR. Parker-Holder et al. (2022) introduce ACCEL, which improves on Robust PLR by allowing tasks to be edited/mutated with an evolutionary algorithm. Currently, random-teacher UEDs outperform adaptive-teacher UED methods.
While CLUTR and other PAIRED variants actively adapt task generation to the performance of the agents, other algorithms such as PLR generate tasks from a fixed random task distribution, resulting in two categories of UED methods: i) adaptive-teacher/generator-based UEDs and ii) random-generator-based UEDs. The existing adaptive-teacher UEDs are variants of PAIRED that try to improve PAIRED in different aspects, but they are still susceptible to the instability caused by an evolving task manifold. Unlike other PAIRED variants, CLUTR introduces a novel variational formulation with VAE-style pretraining for task-manifold learning to solve this instability issue, and it can be applied to, and potentially improve, any adaptive-teacher UED. Random-generator UEDs, on the other hand, focus on identifying or prioritizing which of the randomly generated tasks to present to the student, and are orthogonal to our proposed approach.
Representation Learning: Variational AutoEncoders (Kingma & Welling, 2013; Rezende et al., 2014; Higgins et al., 2016) have been widely used for their ability to capture high-level semantic information from low-level data and for their generative properties, in a wide variety of complex domains such as computer vision (Razavi et al., 2019; Gulrajani et al., 2016; Zhang et al., 2021; 2022), natural language (Bowman et al., 2015; Jain et al., 2017), speech (Chorowski et al., 2019), and music (Jiang et al., 2020). VAEs have also been used in RL for representing image observations (Kendall et al., 2019; Yarats et al., 2021) and generating goals (Nair et al., 2018). While CLUTR also utilizes similar VAEs, unlike prior work, it combines them in a new curriculum learning algorithm to learn a latent task manifold. Florensa et al. (2018) also proposed a curriculum learning algorithm, however for latent-space goal generation using a Generative Adversarial Network.
3. Background
3.1. Unsupervised Environment Design (UED)
As formalized by Dennis et al. (2020), UED is the problem of inducing a curriculum by designing a distribution of concrete, fully specified environments from an underspecified environment with free parameters. The fully specified environments are represented as Partially Observable Markov Decision Processes (POMDPs) given by the tuple $(A, O, S, T, I, R, \gamma)$, where $A$, $O$, and $S$ denote the action, observation, and state spaces, respectively, $I: S \to O$ is the observation function, $R: S \to \mathbb{R}$ is the reward function, $T: S \times A \to \Delta(S)$ is the transition function, and $\gamma$ is the discount factor. The underspecified environments are defined in terms of an Underspecified Partially Observable Markov Decision Process (UPOMDP), represented by the tuple $M = (A, O, \Theta, S^M, T^M, I^M, R^M, \gamma)$. Here $\Theta$ is a set representing the free parameters of the environment and is incorporated into the transition function as $T^M: S \times A \times \Theta \to \Delta(S)$. Assigning a value to $\vec{\theta}$ results in a regular POMDP, i.e., UPOMDP + $\vec{\theta}$ = POMDP. Traditionally (e.g., in Dennis et al. (2020) and Jiang et al. (2021a)), $\Theta$ is considered a trajectory of environment parameters $\vec{\theta}$, or just $\theta$, which we call a task in this paper. For example, $\theta$ can be a concrete navigation task represented by a sequence of obstacle locations. We denote a concrete environment generated with the parameter $\vec{\theta} \in \Theta$ as $M_{\vec{\theta}}$, or simply $M_\theta$. The value of a policy $\pi$ in $M_\theta$ is defined as $V^{\theta}(\pi) = \mathbb{E}[\sum_{t=0}^{T} r_t \gamma^t]$, where $r_t$ is the reward obtained by $\pi$ in $M_\theta$ at timestep $t$.
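To make these definitions concrete, here is a minimal Python sketch (illustrative names only, not from any UED codebase) of instantiating a UPOMDP with a parameter vector $\vec{\theta}$ and estimating $V^{\theta}(\pi)$ from the rewards of one rollout.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class POMDP:
    """A concrete environment M_theta, obtained by fixing the free parameters."""
    theta: Sequence[int]   # e.g., obstacle/agent/goal locations for a navigation task
    gamma: float           # discount factor

@dataclass
class UPOMDP:
    """An underspecified environment: everything but the free parameters Theta is fixed."""
    gamma: float

    def instantiate(self, theta: Sequence[int]) -> POMDP:
        # UPOMDP + theta = POMDP
        return POMDP(theta=theta, gamma=self.gamma)

def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Single-rollout estimate of V^theta(pi) = E[sum_t r_t * gamma^t]."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Usage: instantiate a task and score one rollout of some policy in it.
upomdp = UPOMDP(gamma=0.99)
m_theta = upomdp.instantiate(theta=[21, 22, 23, 24])       # a concrete navigation task
print(discounted_return([0.0, 0.0, 1.0], m_theta.gamma))   # 0.9801
```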
3.2. PAIRED
PAIRED (Dennis et al., 2020) solves UED with an adversarial game involving three players²: the agent $\pi^P$ and an antagonist $\pi^A$, which are trained on tasks generated by the teacher $\tilde{\theta}$. The PAIRED objective is $\max_{\tilde{\theta}, \pi^A} \min_{\pi^P} U(\pi^P, \pi^A, \tilde{\theta}) = \mathbb{E}_{\theta \sim \tilde{\theta}}[\mathrm{REGRET}^{\theta}(\pi^P, \pi^A)]$. Regret is defined as the difference between the discounted returns obtained by the antagonist and the agent on the generated tasks, i.e., $\mathrm{REGRET}^{\theta}(\pi^P, \pi^A) = V^{\theta}(\pi^A) - V^{\theta}(\pi^P)$. The PAIRED teacher agent is defined as $\Lambda: \Pi \to \Delta(\Theta^T)$, where $\Pi$ is the set of possible agent policies and $\Theta^T$ is the set of possible tasks. The teacher is trained with an RL algorithm using $U$ as its reward, while the protagonist and antagonist agents are trained using the usual discounted rewards from the environments.
Dennis et al. (2020) also introduced the flexible regret objective, an alternative regret approximation that is less susceptible to local optima. It is defined as the difference between the average of the agent's and antagonist's returns and the return of the policy that achieved the highest average return.

² In the original PAIRED paper, the primary student agent was named the protagonist. Throughout this paper, we refer to it simply as the agent.
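The regret signal that drives the teacher can be summarized in a few lines. The sketch below (hypothetical helper names, not PAIRED's actual code) estimates $\mathrm{REGRET}^{\theta}(\pi^P, \pi^A)$ from one rollout of each student on the same generated task.

```python
from typing import Sequence

def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def regret(agent_rewards: Sequence[float],
           antagonist_rewards: Sequence[float],
           gamma: float = 0.99) -> float:
    """REGRET^theta(pi_P, pi_A) = V^theta(pi_A) - V^theta(pi_P), estimated from one rollout each."""
    return (discounted_return(antagonist_rewards, gamma)
            - discounted_return(agent_rewards, gamma))

# The antagonist solves the task quickly, the agent does not: regret is positive,
# so a regret-maximizing teacher is rewarded for proposing this task.
print(regret(agent_rewards=[0.0, 0.0, 0.0], antagonist_rewards=[0.0, 1.0]))  # 0.99
```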
4. Curriculum Learning via Unsupervised Task Representation Learning

In this section, we formally present CLUTR as a latent UED and discuss it in detail.
4.1. Formulation of CLUTR
Figure 1: Hierarchical Graphical Model for CLUTR.

At the core of CLUTR is the latent generative model representing the latent task manifold. Let us assume that $R$ is a random variable denoting a measure of success over the agent and the antagonist, and let $z$ be a latent random variable that generates environments/tasks, denoted by the random variable $E$. We use the graphical model shown in Figure 1 to formulate CLUTR. Both $E$ and $R$ are observed variables, while $z$ is an unobserved latent variable. $R$ can cover a broad range of measures used in different UED methods, including PAIRED and DR (Domain Randomization). In PAIRED, $R$ represents the REGRET.
We use a variational formulation of UED, using the above graphical model to derive the following ELBO for CLUTR, where $\mathrm{VAE}(z, E)$ denotes the VAE objective:

$\mathrm{ELBO} \propto \mathrm{VAE}(z, E) + \mathrm{REGRET}(R, E) \qquad (1)$
We share the details of this derivation in Section A.1 of the Appendix. The above ELBO (Eq. 1) defines the optimization objective for CLUTR, which can be seen as optimizing the VAE objective with a regret-based regularization term, and vice versa. As previously discussed, it is difficult to train a UED teacher while jointly optimizing for both the curriculum and the task representations. Hence, we propose a two-level optimization for CLUTR. First, we pretrain a VAE to learn unsupervised task representations; then, in the curriculum learning phase, we optimize for regret to generate the curriculum while keeping the VAE fixed. In Section 5.3, we empirically show that this two-level optimization performs better than the joint optimization of Eq. 1, i.e., finetuning the VAE decoder with the regret loss during the curriculum learning phase.
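A common way to realize "keeping the VAE fixed" in practice is to freeze the decoder's parameters before the curriculum phase. The short PyTorch fragment below sketches this with hypothetical stand-in modules; it is not the paper's code.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: `task_vae` for the pretrained task VAE (output of stage 1)
# and `teacher` for the latent-space teacher policy trained in stage 2.
task_vae = nn.LSTM(8, 16)
teacher = nn.Linear(16, 4)

# Stage 2: the task manifold stays fixed; only the teacher (and students) receive gradients,
# so regret is never backpropagated into the VAE.
for p in task_vae.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(teacher.parameters(), lr=3e-4)
```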
4.2. Unsupervised Latent Task Representation Learning

As discussed above, we use a Variational AutoEncoder (VAE) to model our generative latent task manifold. Aligning with Dennis et al. (2020) and Jiang et al. (2021a), we represent a task $\theta$ as a sequence of integers; for example, in a navigation task, these integers denote obstacle, agent, and goal locations. We use an LSTM-based Recurrent VAE (Bowman et al., 2015) to learn task representations from integer sequences. We learn an embedding for each integer and use cross-entropy over the sequences to measure the reconstruction error. This design choice makes CLUTR applicable to task parameterizations beyond integer sequences, e.g., to sentences or images. To train our VAEs, we generate random tasks by uniformly sampling from $\Theta^T$, the set of possible tasks. Thus, we do not require any interaction with the environment to learn the task manifold. Such unsupervised training of the task manifold is very useful in practice, as interactions with the environment/simulator are much more costly than sampling. Furthermore, we sort the input sequences, fully or partially, when they are permutation invariant, i.e., when they essentially represent a set. By sorting the training sequences, we avoid the combinatorial explosion faced by other adaptive UED teachers.
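As a concrete illustration of this stage, below is a minimal PyTorch sketch of an LSTM-based recurrent VAE trained on randomly generated, sorted integer-sequence tasks. It is not the paper's implementation: the vocabulary size, sequence length, network sizes, and the reuse of token 0 as a start symbol are illustrative assumptions; only the overall recipe (integer embeddings, cross-entropy reconstruction, KL term, sorted random training data, no environment interaction) mirrors the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SEQ_LEN, EMB, HID, LATENT = 170, 25, 32, 128, 64   # e.g., grid locations 0-169 (illustrative)

class TaskVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, batch_first=True)
        self.to_mu = nn.Linear(HID, LATENT)
        self.to_logvar = nn.Linear(HID, LATENT)
        self.z_to_h = nn.Linear(LATENT, HID)
        self.decoder = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def encode(self, tokens):                       # tokens: (B, SEQ_LEN) integer task parameters
        _, (h, _) = self.encoder(self.embed(tokens))
        return self.to_mu(h[-1]), self.to_logvar(h[-1])

    def decode(self, z, targets):                   # teacher-forced reconstruction logits
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        sos = torch.zeros_like(targets[:, :1])      # token 0 reused as a <start> symbol for brevity
        dec_in = torch.cat([sos, targets[:, :-1]], dim=1)
        out, _ = self.decoder(self.embed(dec_in), (h0, c0))
        return self.out(out)                        # (B, SEQ_LEN, VOCAB)

    def forward(self, tokens):
        mu, logvar = self.encode(tokens)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decode(z, tokens), mu, logvar

def vae_loss(logits, tokens, mu, logvar):
    recon = F.cross_entropy(logits.reshape(-1, VOCAB), tokens.reshape(-1))   # cross-entropy reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def sample_sorted_tasks(batch):
    """Uniformly sampled tasks, sorted to remove permutation redundancy; no env interaction needed."""
    tasks = torch.randint(0, VOCAB, (batch, SEQ_LEN))
    return torch.sort(tasks, dim=1).values

model = TaskVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(3):                               # a few illustrative gradient steps
    tokens = sample_sorted_tasks(64)
    logits, mu, logvar = model(tokens)
    loss = vae_loss(logits, tokens, mu, logvar)
    opt.zero_grad(); loss.backward(); opt.step()
```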
4.3. CLUTR
We define CLUTR following the objective given in Eq. 1. CLUTR uses the same curriculum objective as PAIRED, $\mathrm{REGRET}(R, E) = \mathrm{REGRET}^{\theta}(\pi^P, \pi^A)$, where $\theta$ denotes a task, i.e., a concrete assignment to the free parameters of the environment $E$. Unlike the PAIRED teacher, which generates $\theta$ directly, the CLUTR teacher policy is defined as $\Lambda: \Pi \to \Delta(\mathcal{Z})$, where $\Pi$ is the set of possible agent policies and $\mathcal{Z}$ is the latent space. Thus, the CLUTR teacher is a latent environment designer: it samples a latent vector $z$, and $\theta$ is generated by the VAE decoder function $G: \mathcal{Z} \to \Theta$. We present the outline of CLUTR in Algorithm 1. The CLUTR outline is very similar to PAIRED, differing only in the first two lines of the main loop, which incorporate the latent space.

Algorithm 1 CLUTR
1: Pretrain the VAE with randomly sampled tasks from $\Theta$
2: Randomly initialize the Agent $\pi^P$, the Antagonist $\pi^A$, and the Teacher $\tilde{\Lambda}$
3: repeat
4:   Generate a latent task vector $z \sim \mathcal{Z}$ from the teacher
5:   Create the POMDP $M_\theta$, where $\theta = G(z)$ and $G$ is the VAE decoder function
6:   Collect the Agent trajectory $\tau^P$ in $M_\theta$. Compute $U^{\theta}(\pi^P) = \sum_{t=0}^{T} r_t \gamma^t$
7:   Collect the Antagonist trajectory $\tau^A$ in $M_\theta$. Compute $U^{\theta}(\pi^A) = \sum_{t=0}^{T} r_t \gamma^t$
8:   Compute $\mathrm{REGRET}^{\theta}(\pi^P, \pi^A) = U^{\theta}(\pi^A) - U^{\theta}(\pi^P)$
9:   Train the Agent policy $\pi^P$ with an RL update and reward $R(\tau^P) = U^{\theta}(\pi^P)$
10:  Train the Antagonist policy $\pi^A$ with an RL update and reward $R(\tau^A) = U^{\theta}(\pi^A)$
11:  Train the Teacher policy $\tilde{\Lambda}$ with an RL update and reward $R(\tau^{\tilde{\Lambda}}) = \mathrm{REGRET}$
12: until not converged
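The following bare-bones Python sketch mirrors the main loop of Algorithm 1 with toy stand-ins (hypothetical functions such as decode_task, rollout, and rl_update) in place of the real VAE decoder, environments, and PPO updates; it is meant only to show the control flow: the teacher proposes a full task in one step via the frozen decoder $G$ and is rewarded with the regret.

```python
import random

GAMMA = 0.99

def decode_task(z):                     # stand-in for the frozen VAE decoder G: Z -> Theta
    return [int(abs(v) * 100) % 169 + 1 for v in z]

def rollout(policy, theta):             # stand-in for collecting a trajectory in M_theta
    return [random.random() for _ in range(10)]       # per-step rewards

def discounted_return(rewards, gamma=GAMMA):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def rl_update(policy, reward):          # stand-in for a PPO/actor-critic update
    pass

agent, antagonist, teacher = object(), object(), object()

for step in range(5):                                    # repeat ... until not converged
    z = [random.gauss(0.0, 1.0) for _ in range(4)]       # line 4: teacher samples a latent task vector
    theta = decode_task(z)                               # line 5: full task in a single timestep
    u_agent = discounted_return(rollout(agent, theta))            # line 6
    u_antagonist = discounted_return(rollout(antagonist, theta))  # line 7
    regret = u_antagonist - u_agent                      # line 8: REGRET^theta(pi_P, pi_A)
    rl_update(agent, u_agent)                            # line 9: students maximize their own returns
    rl_update(antagonist, u_antagonist)                  # line 10
    rl_update(teacher, regret)                           # line 11: teacher maximizes regret (bandit-style)
```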
Now we discuss a couple of additional properties of CLUTR compared to other adaptive-teacher UEDs, i.e., PAIRED and REPAIRED. First, the CLUTR teacher samples from the latent space $\mathcal{Z}$ and thus generates a task in a single timestep. Note that this is not possible for other adaptive UED teachers, as they operate in parameter space and generate one task parameter at a time, conditioned on the state of the partially generated task so far. Furthermore, because adaptive-teacher UEDs typically observe the state of their partially generated task to generate the next parameters, they require different teacher architectures for environments with different state spaces. The CLUTR teacher architecture, however, is agnostic to the problem domain and does not depend on the state space, so the same architecture can be used across different environments.
4.4. CLUTR in the context of the contemporary UED method landscape

As discussed in Section 2, contemporary UED methods can be characterized by i) their teacher type (random/fixed or learned/adaptive) and ii) their use of replay. To clearly place CLUTR in the context of contemporary UEDs, we discuss another important aspect of curriculum learning algorithms: how the task manifold is learned. The random-generator UEDs (e.g., DR, PLR) do not learn a task manifold. Regret-based adaptive teachers, i.e., PAIRED and REPAIRED, learn an implicit task manifold (e.g., the hidden state of the teacher LSTM) from scratch, but it is not utilized explicitly. It is trained via RL, based on the regret estimates of the tasks they generate. Hence, these task manifolds depend on the quality of the estimates, which in turn depends on the overall health of the multi-agent RL training. Furthermore, they do not take the actual task structures into account. In contrast, CLUTR introduces an explicit task manifold, modeled with a VAE, that can represent a local neighborhood structure capturing the similarity of tasks, subject to the parameter space being used. Hence, similar tasks (in terms of parameterization) are placed nearby in the latent space. Intuitively, this local neighborhood structure should help the teacher navigate the manifold effectively. The above discussion illustrates that CLUTR, along with PAIRED and REPAIRED, forms a category of UEDs that generate tasks based on a learned task manifold, orthogonal to the random-generation-based methods, with CLUTR being the only one utilizing an unsupervised generative task manifold. Table 1 summarizes the similarities and differences.

| Algorithm  | Task Representation Learning                | Teacher Model | UED Method | Replay Method  |
| DR         | -                                           | Random        | DR         | -              |
| PLR        | -                                           | Random        | DR         | PLR            |
| Robust PLR | -                                           | Random        | DR         | Robust PLR     |
| ACCEL      | -                                           | Random        | DR         | DR + Evolution |
| PAIRED     | Implicit via RL                             | Learned       | Regret     | -              |
| REPAIRED   | Implicit via RL                             | Learned       | Regret     | Robust PLR     |
| CLUTR      | Explicit via Unsupervised Generative Model  | Learned       | Regret     | -              |

Table 1: A comparative characterization of contemporary UED methods.
5. Experiments
In this section, we evaluate CLUTR in two challenging domains: i) pixel-based CarRacing with continuous control and dense rewards, and ii) partially observable navigation tasks with discrete control and sparse rewards. We compare CLUTR primarily with PAIRED to analyze its impact on improving adaptive-teacher UED algorithms, experimenting with two commonly used regret objectives: standard and flexible. As discussed in Sections 2 and 4.4, there are other random-generator and adaptive-teacher UEDs employing techniques complementary or orthogonal to our approach. For completeness, we compare CLUTR with such existing UED methods in Sections C.1 and D of the Appendix.

We then empirically investigate the following hypotheses:

H1: Simultaneous learning of the latent task manifold and the curriculum degrades performance (Section 5.3).

H2: Training the VAE on sorted data solves the combinatorial explosion problem (Section 5.4).

Finally, we analyze the CLUTR curriculum from multiple aspects, comparing it with PAIRED to gain a closer understanding. Full details of the environments, network architectures, training hyperparameters, and VAE training are provided in the Appendix.
5.1. CLUTR Performance on Pixel-Based Continuous Control CarRacing Environment

The CarRacing environment (Jiang et al., 2021a; Brockman et al., 2016) requires the agent to drive a full lap around a closed-loop racing track modeled with Bézier curves (Mortenson, 1999) of up to 12 control points. Both CLUTR and PAIRED were trained for 2M timesteps for the flexible regret objective and for 5M timesteps for the standard regret objective experiments. We train the VAE on 1 million randomly generated tracks for 1 million gradient updates. Note that only one VAE was trained and used for all the experiments (10 independent runs, both objectives). We evaluate the agents on the F1 benchmark (Jiang et al., 2021a), containing 20 test tracks modeled on real-life F1 racing tracks. These tracks are significantly out of distribution compared to any tracks the UED teachers can generate with just 12 control points. Further details on the environment, network architectures, VAE training, and detailed experimental results with analysis can be found in Sections B.1, B.2, B.4, and C of the Appendix, respectively.

Figure 2 shows the mean return obtained by CLUTR and PAIRED on the full F1 benchmark. We independently experimented with both the standard and flexible regret objectives. We notice that PAIRED performs miserably with standard regret in these tasks. However, implementing