CLUTR: Curriculum Learning via Unsupervised Task Representation Learning

Abdus Salam Azad¹, Izzeddin Gur², Jasper Emhoff¹, Nathaniel Alexis¹, Aleksandra Faust², Pieter Abbeel¹, Ion Stoica¹
Abstract
Reinforcement Learning (RL) algorithms are often known for sample inefficiency and difficult generalization. Recently, Unsupervised Environment Design (UED) emerged as a new paradigm for zero-shot generalization by simultaneously learning a task distribution and agent policies on the generated tasks. This is a non-stationary process in which the task distribution evolves along with the agent policies, creating instability over time. While past works demonstrated the potential of such approaches, sampling effectively from the task space remains an open challenge, bottlenecking these approaches. To this end, we introduce CLUTR: a novel unsupervised curriculum learning algorithm that decouples task representation and curriculum learning into a two-stage optimization. It first trains a recurrent variational autoencoder on randomly generated tasks to learn a latent task manifold. Next, a teacher agent creates a curriculum by maximizing a minimax REGRET-based objective on a set of latent tasks sampled from this manifold. Using the fixed, pretrained task manifold, we show that CLUTR successfully overcomes the non-stationarity problem and improves stability. Our experimental results show CLUTR outperforms PAIRED, a principled and popular UED method, in the challenging CarRacing and navigation environments, achieving 10.6X and 45% improvements in zero-shot generalization, respectively. CLUTR also performs comparably to the non-UED state-of-the-art for CarRacing, while requiring 500X fewer environment interactions.
¹University of California, Berkeley. ²Google Research. Correspondence to: Abdus Salam Azad <salam azad@berkeley.edu>.
1. Introduction
Deep Reinforcement Learning (RL) has shown exciting progress in the past decade in many challenging domains, including Atari (Mnih et al., 2015), Dota (Berner et al., 2019), and Go (Silver et al., 2016). However, deep RL is also known for its sample inefficiency and difficult generalization: performing poorly on unseen tasks or failing altogether with the slightest change (Cobbe et al., 2019; Azad et al., 2022; Zhang et al., 2018). While Curriculum Learning (CL) algorithms have been shown to improve RL sample efficiency by adapting the training task distribution, i.e., the curriculum (Portelas et al., 2020; Narvekar et al., 2020), recently a class of unsupervised CL algorithms called Unsupervised Environment Design (UED) (Dennis et al., 2020; Jiang et al., 2021a) has shown promising zero-shot generalization by automatically generating the training tasks and adapting the curriculum simultaneously.
UED algorithms employ a teacher that generates training tasks by sampling the free parameters of the environment (e.g., the start, goal, and obstacle locations for a navigation task); the teacher can be either adaptive or random. Contemporary adaptive UED teachers, i.e., PAIRED (Dennis et al., 2020) and REPAIRED (Jiang et al., 2021a), are implemented as RL agents whose action space is the free task parameters. The teacher agent aims to generate tasks that maximize the student agent's regret, defined as the performance gap between the student agent and an optimal policy. Despite promising zero-shot generalization, adaptive-teacher UEDs are still sample inefficient.
This sample inefficiency is attributed primarily to the difficulty of training a regret-based RL teacher (Parker-Holder et al., 2022). First, the teacher receives a sparse reward only after specifying the full parameterization of a task, leading to a long-horizon credit assignment problem. Additionally, the teacher agent faces a combinatorial explosion problem if the parameter space is permutation invariant: e.g., for a navigation task, a set of obstacles corresponds to factorially many different permutations of the parameters.¹ Most importantly, the teacher needs to simultaneously learn a task manifold, from scratch, to generate training tasks and navigate this manifold to induce an efficient curriculum. However, the teacher learns this task manifold implicitly, based on the student's regret, and as the student is continuously co-learning with the teacher, the task manifold also keeps evolving over time. Hence, the simultaneous learning of the task manifold and the curriculum results in instability over time and makes this a difficult learning problem.

¹ Consider a 13x13 grid for a navigation task, where the locations are numbered from 1 to 169. Also consider a wall made of four obstacles spanning the locations {21, 22, 23, 24}. This wall can be represented using any permutation of this set, e.g., {22, 24, 23, 21} or {23, 21, 24, 22}, resulting in a combinatorial explosion.
To address the above-mentioned challenges, we present Curriculum Learning via Unsupervised Task Representation Learning (CLUTR). At the core of CLUTR lies a hierarchical graphical model that decouples task representation learning from curriculum learning. We develop a variational approximation to the UED problem and employ a Recurrent Variational AutoEncoder (VAE), pretrained in an unsupervised manner, to learn a latent task manifold. Unlike contemporary adaptive teachers, which build tasks from scratch one parameter at a time, the CLUTR teacher generates tasks in a single timestep by sampling points from the latent task manifold and uses the generative model to translate them into complete tasks. The CLUTR teacher learns the curriculum by navigating the pretrained and fixed task manifold via maximizing regret. By utilizing a pretrained latent task manifold, the CLUTR teacher can be trained as a contextual bandit, overcoming the long-horizon credit assignment problem, and can create a curriculum much more efficiently, improving stability at no cost to its effectiveness. Finally, by carefully introducing bias into the training corpus (such as sorting each parameter vector), CLUTR solves the combinatorial explosion problem of the parameter space without using any costly environment interactions.
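As a small, self-contained illustration (not taken from the paper) of why permutation invariance blows up the parameter space and how sorting removes the redundancy, consider the wall example from footnote 1:

```python
from itertools import permutations

# A wall made of obstacles at grid locations {21, 22, 23, 24} (see footnote 1) can appear
# in the task parameter vector in any of 4! = 24 orders, even though all of them describe
# the same task. Sorting maps every ordering to one canonical parameterization.
wall = [21, 22, 23, 24]
orderings = set(permutations(wall))
print(len(orderings))                              # 24 equivalent parameter vectors
print({tuple(sorted(p)) for p in orderings})       # {(21, 22, 23, 24)}: one after sorting
```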
While CLUTR can be integrated with any adaptive-teacher UED, we implement CLUTR on top of PAIRED, one of the most principled and popular UEDs. Our experimental results show that CLUTR outperforms PAIRED, both in terms of generalization and sample efficiency, in the challenging pixel-based continuous CarRacing and partially observable discrete navigation tasks. For CarRacing, CLUTR achieves 10.6X higher zero-shot generalization on the F1 benchmark (Jiang et al., 2021a), which is modeled on 20 real-life F1 racing tracks. Furthermore, CLUTR performs comparably to the non-UED attention-based CarRacing SOTA (Tang et al., 2020), outperforming it on nine of the 20 test tracks while requiring 500X fewer environment interactions. In navigation tasks, CLUTR outperforms PAIRED in 14 out of the 16 unseen tasks, achieving a 45% higher solve rate.
In summary, we make the following contributions: i) we introduce CLUTR, a novel adaptive-teacher UED algorithm, derived from a hierarchical graphical model for UEDs, that augments the teacher with unsupervised task-representation learning; ii) by decoupling task representation learning from curriculum learning, CLUTR solves the long-horizon credit assignment and combinatorial explosion problems faced by regret-based adaptive-teacher UEDs such as PAIRED; iii) our experimental results show that CLUTR significantly outperforms PAIRED, both in terms of generalization and sample efficiency, in two challenging domains: CarRacing and navigation.
2. Related Work
Unsupervised Curriculum Design: Dennis et al. (2020) were the first to formalize UED and introduced the minimax regret-based UED teacher algorithm, PAIRED, with a strong theoretical robustness guarantee. However, gradient-based multi-agent RL has no convergence guarantees and often fails to converge in practice (Mazumdar et al., 2019). Pre-existing techniques like Domain Randomization (DR) (Jakobi, 1997; Sadeghi & Levine, 2016; Tobin et al., 2017) and minimax adversarial curriculum learning (Morimoto & Doya, 2005; Pinto et al., 2017) also fall under the category of UEDs. The DR teacher follows a uniform random strategy, while minimax adversarial teachers follow the maximin criterion, i.e., they generate tasks that minimize the returns of the agent. POET (Wang et al., 2019) and Enhanced POET (Wang et al., 2020) also approached UED, before PAIRED, using an evolutionary approach with a co-evolving population of tasks and agents.
Recently, Jiang et al. (2021a) proposed Dual Curriculum Design (DCD), a novel class of UEDs that augments UED generation methods (e.g., DR and PAIRED) with replay capabilities. DCD involves two teachers: one actively generates tasks with PAIRED or DR, while the other curates the curriculum to replay previously generated tasks with Prioritized Level Replay (PLR) (Jiang et al., 2021b). Jiang et al. (2021a) show that, even with random generation (i.e., DR), PLR can learn minimax-regret agents at Nash equilibrium when the students are updated only on replayed levels (but not when levels are first generated, i.e., without PLR's exploratory student gradient updates) and a regret-based scoring function is used; they call this variation Robust PLR. They also introduce REPAIRED, which combines PAIRED with Robust PLR. Parker-Holder et al. (2022) introduce ACCEL, which improves on Robust PLR by allowing tasks to be edited/mutated with an evolutionary algorithm. Currently, random-teacher UEDs outperform adaptive-teacher UED methods.
While CLUTR and other PAIRED variants actively adapt task generation to the performance of the agents, other algorithms such as PLR generate tasks from a fixed random task distribution, resulting in two categories of UED methods: i) adaptive-teacher/generator-based UEDs and ii) random-generator-based UEDs. The existing adaptive-teacher UEDs are variants of PAIRED that try to improve PAIRED in different aspects, but they are still susceptible to the instability caused by an evolving task manifold. Unlike other PAIRED variants, CLUTR introduces a novel variational formulation with VAE-style pretraining for task-manifold learning to solve this instability issue, and it can be applied to, and potentially improve, any adaptive-teacher UED. Random-generator UEDs, on the other hand, focus on identifying or prioritizing which of the randomly generated tasks to present to the student, and are orthogonal to our proposed approach.
Representation Learning: Variational AutoEncoders (Kingma & Welling, 2013; Rezende et al., 2014; Higgins et al., 2016) have been widely used for their ability to capture high-level semantic information from low-level data and for their generative properties, in a wide variety of complex domains such as computer vision (Razavi et al., 2019; Gulrajani et al., 2016; Zhang et al., 2021; 2022), natural language (Bowman et al., 2015; Jain et al., 2017), speech (Chorowski et al., 2019), and music (Jiang et al., 2020). VAEs have also been used in RL for representing image observations (Kendall et al., 2019; Yarats et al., 2021) and generating goals (Nair et al., 2018). While CLUTR also utilizes similar VAEs, unlike prior work, it combines them in a new curriculum learning algorithm to learn a latent task manifold. Florensa et al. (2018) also proposed a curriculum learning algorithm, however for latent-space goal generation using a Generative Adversarial Network.
3. Background
3.1. Unsupervised Environment Design (UED)
As formalized by Dennis et al. (2020), UED is the problem of inducing a curriculum by designing a distribution of concrete, fully specified environments from an underspecified environment with free parameters. The fully specified environments are represented as Partially Observable Markov Decision Processes (POMDPs) given by the tuple $(A, O, S, T, I, R, \gamma)$, where $A$, $O$, and $S$ denote the action, observation, and state spaces, respectively, $I: S \to O$ is the observation function, $R: S \to \mathbb{R}$ is the reward function, $T: S \times A \to \Delta(S)$ is the transition function, and $\gamma$ is the discount factor. The underspecified environments are defined in terms of an Underspecified Partially Observable Markov Decision Process (UPOMDP), represented by the tuple $M = (A, O, \Theta, S^M, T^M, I^M, R^M, \gamma)$. Here $\Theta$ is a set representing the free parameters of the environment and is incorporated into the transition function as $T^M: S \times A \times \Theta \to \Delta(S)$. Assigning a value to $\vec{\theta}$ results in a regular POMDP, i.e., UPOMDP + $\vec{\theta}$ = POMDP. Traditionally (e.g., in Dennis et al. (2020) and Jiang et al. (2021a)), $\Theta$ is considered a trajectory of environment parameters $\vec{\theta}$, or just $\theta$, which we call a task in this paper. For example, $\theta$ can be a concrete navigation task represented by a sequence of obstacle locations. We denote a concrete environment generated with the parameter $\vec{\theta} \in \Theta$ as $M_{\vec{\theta}}$, or simply $M_\theta$. The value of a policy $\pi$ in $M_\theta$ is defined as $V^{\theta}(\pi) = \mathbb{E}[\sum_{t=0}^{T} r_t \gamma^t]$, where $r_t$ is the reward obtained by $\pi$ in $M_\theta$ at timestep $t$.
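To make these definitions concrete, here is a minimal Python sketch (illustrative names only, not from any UED codebase) of instantiating a UPOMDP with a parameter vector $\vec{\theta}$ and estimating $V^{\theta}(\pi)$ from the rewards of one rollout.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class POMDP:
    """A concrete environment M_theta, obtained by fixing the free parameters."""
    theta: Sequence[int]   # e.g., obstacle/agent/goal locations for a navigation task
    gamma: float           # discount factor

@dataclass
class UPOMDP:
    """An underspecified environment: everything but the free parameters Theta is fixed."""
    gamma: float

    def instantiate(self, theta: Sequence[int]) -> POMDP:
        # UPOMDP + theta = POMDP
        return POMDP(theta=theta, gamma=self.gamma)

def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Single-rollout estimate of V^theta(pi) = E[sum_t r_t * gamma^t]."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Usage: instantiate a task and score one rollout of some policy in it.
upomdp = UPOMDP(gamma=0.99)
m_theta = upomdp.instantiate(theta=[21, 22, 23, 24])       # a concrete navigation task
print(discounted_return([0.0, 0.0, 1.0], m_theta.gamma))   # 0.9801
```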
3.2. PAIRED
PAIRED (Dennis et al., 2020) solves UED with an adversarial game involving three players²: the agent $\pi^P$ and an antagonist $\pi^A$, which are trained on tasks generated by the teacher $\tilde{\theta}$. The PAIRED objective is $\max_{\tilde{\theta}, \pi^A} \min_{\pi^P} U(\pi^P, \pi^A, \tilde{\theta}) = \mathbb{E}_{\theta \sim \tilde{\theta}}[\mathrm{REGRET}^{\theta}(\pi^P, \pi^A)]$. Regret is defined as the difference between the discounted returns obtained by the antagonist and the agent on the generated tasks, i.e., $\mathrm{REGRET}^{\theta}(\pi^P, \pi^A) = V^{\theta}(\pi^A) - V^{\theta}(\pi^P)$. The PAIRED teacher agent is defined as $\Lambda: \Pi \to \Delta(\Theta^T)$, where $\Pi$ is the set of possible agent policies and $\Theta^T$ is the set of possible tasks. The teacher is trained with an RL algorithm using $U$ as its reward, while the protagonist and antagonist agents are trained using the usual discounted rewards from the environments.
Dennis et al. (2020) also introduced the flexible regret objective, an alternative regret approximation that is less susceptible to local optima. It is defined as the difference between the average of the agent's and antagonist's returns and the return of the policy that achieved the highest average return.

² In the original PAIRED paper, the primary student agent was named the protagonist. Throughout this paper, we refer to it simply as the agent.
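The regret signal that drives the teacher can be summarized in a few lines. The sketch below (hypothetical helper names, not PAIRED's actual code) estimates $\mathrm{REGRET}^{\theta}(\pi^P, \pi^A)$ from one rollout of each student on the same generated task.

```python
from typing import Sequence

def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def regret(agent_rewards: Sequence[float],
           antagonist_rewards: Sequence[float],
           gamma: float = 0.99) -> float:
    """REGRET^theta(pi_P, pi_A) = V^theta(pi_A) - V^theta(pi_P), estimated from one rollout each."""
    return (discounted_return(antagonist_rewards, gamma)
            - discounted_return(agent_rewards, gamma))

# The antagonist solves the task quickly, the agent does not: regret is positive,
# so a regret-maximizing teacher is rewarded for proposing this task.
print(regret(agent_rewards=[0.0, 0.0, 0.0], antagonist_rewards=[0.0, 1.0]))  # 0.99
```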
4. Curriculum Learning via Unsupervised Task Representation Learning

In this section, we formally present CLUTR as a latent UED and discuss it in detail.
4.1. Formulation of CLUTR
Figure 1: Hierarchical Graphical Model for CLUTR.

At the core of CLUTR is the latent generative model representing the latent task manifold. Let us assume that $R$ is a random variable denoting a measure of success over the agent and the antagonist, and let $z$ be a latent random variable that generates environments/tasks, denoted by the random variable $E$. We use the graphical model shown in Figure 1 to formulate CLUTR. Both $E$ and $R$ are observed variables, while $z$ is an unobserved latent variable. $R$ can cover a broad range of measures used in different UED methods, including PAIRED and DR (Domain Randomization). In PAIRED, $R$ represents the REGRET.
We use a variational formulation of UED, using the above graphical model to derive the following ELBO for CLUTR, where $\mathrm{VAE}(z, E)$ denotes the VAE objective:

$\mathrm{ELBO} \propto \mathrm{VAE}(z, E) + \mathrm{REGRET}(R, E) \qquad (1)$
We share the details of this derivation in Section A.1 of the Appendix. The above ELBO (Eq. 1) defines the optimization objective for CLUTR, which can be seen as optimizing the VAE objective with a regret-based regularization term, and vice versa. As previously discussed, it is difficult to train a UED teacher while jointly optimizing for both the curriculum and the task representations. Hence, we propose a two-level optimization for CLUTR. First, we pretrain a VAE to learn unsupervised task representations; then, in the curriculum learning phase, we optimize for regret to generate the curriculum while keeping the VAE fixed. In Section 5.3, we empirically show that this two-level optimization performs better than the joint optimization of Eq. 1, i.e., finetuning the VAE decoder with the regret loss during the curriculum learning phase.
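A common way to realize "keeping the VAE fixed" in practice is to freeze the decoder's parameters before the curriculum phase. The short PyTorch fragment below sketches this with hypothetical stand-in modules; it is not the paper's code.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: `task_vae` for the pretrained task VAE (output of stage 1)
# and `teacher` for the latent-space teacher policy trained in stage 2.
task_vae = nn.LSTM(8, 16)
teacher = nn.Linear(16, 4)

# Stage 2: the task manifold stays fixed; only the teacher (and students) receive gradients,
# so regret is never backpropagated into the VAE.
for p in task_vae.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(teacher.parameters(), lr=3e-4)
```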
4.2. Unsupervised Latent Task Representation Learning

As discussed above, we use a Variational AutoEncoder (VAE) to model our generative latent task manifold. Aligning with Dennis et al. (2020) and Jiang et al. (2021a), we represent a task $\theta$ as a sequence of integers; for example, in a navigation task, these integers denote obstacle, agent, and goal locations. We use an LSTM-based Recurrent VAE (Bowman et al., 2015) to learn task representations from integer sequences. We learn an embedding for each integer and use cross-entropy over the sequences to measure the reconstruction error. This design choice makes CLUTR applicable to task parameterizations beyond integer sequences, e.g., to sentences or images. To train our VAEs, we generate random tasks by uniformly sampling from $\Theta^T$, the set of possible tasks. Thus, we do not require any interaction with the environment to learn the task manifold. Such unsupervised training of the task manifold is very useful in practice, as interactions with the environment/simulator are much more costly than sampling. Furthermore, we sort the input sequences, fully or partially, when they are permutation invariant, i.e., when they essentially represent a set. By sorting the training sequences, we avoid the combinatorial explosion faced by other adaptive UED teachers.
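As a concrete illustration of this stage, below is a minimal PyTorch sketch of an LSTM-based recurrent VAE trained on randomly generated, sorted integer-sequence tasks. It is not the paper's implementation: the vocabulary size, sequence length, network sizes, and the reuse of token 0 as a start symbol are illustrative assumptions; only the overall recipe (integer embeddings, cross-entropy reconstruction, KL term, sorted random training data, no environment interaction) mirrors the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SEQ_LEN, EMB, HID, LATENT = 170, 25, 32, 128, 64   # e.g., grid locations 0-169 (illustrative)

class TaskVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, batch_first=True)
        self.to_mu = nn.Linear(HID, LATENT)
        self.to_logvar = nn.Linear(HID, LATENT)
        self.z_to_h = nn.Linear(LATENT, HID)
        self.decoder = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def encode(self, tokens):                       # tokens: (B, SEQ_LEN) integer task parameters
        _, (h, _) = self.encoder(self.embed(tokens))
        return self.to_mu(h[-1]), self.to_logvar(h[-1])

    def decode(self, z, targets):                   # teacher-forced reconstruction logits
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        sos = torch.zeros_like(targets[:, :1])      # token 0 reused as a <start> symbol for brevity
        dec_in = torch.cat([sos, targets[:, :-1]], dim=1)
        out, _ = self.decoder(self.embed(dec_in), (h0, c0))
        return self.out(out)                        # (B, SEQ_LEN, VOCAB)

    def forward(self, tokens):
        mu, logvar = self.encode(tokens)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decode(z, tokens), mu, logvar

def vae_loss(logits, tokens, mu, logvar):
    recon = F.cross_entropy(logits.reshape(-1, VOCAB), tokens.reshape(-1))   # cross-entropy reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def sample_sorted_tasks(batch):
    """Uniformly sampled tasks, sorted to remove permutation redundancy; no env interaction needed."""
    tasks = torch.randint(0, VOCAB, (batch, SEQ_LEN))
    return torch.sort(tasks, dim=1).values

model = TaskVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(3):                               # a few illustrative gradient steps
    tokens = sample_sorted_tasks(64)
    logits, mu, logvar = model(tokens)
    loss = vae_loss(logits, tokens, mu, logvar)
    opt.zero_grad(); loss.backward(); opt.step()
```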
4.3. CLUTR
We define CLUTR following the objective given in Eq. 1. CLUTR uses the same curriculum objective as PAIRED, $\mathrm{REGRET}(R, E) = \mathrm{REGRET}^{\theta}(\pi^P, \pi^A)$, where $\theta$ denotes a task, i.e., a concrete assignment to the free parameters of the environment $E$. Unlike the PAIRED teacher, which generates $\theta$ directly, the CLUTR teacher policy is defined as $\Lambda: \Pi \to \Delta(\mathcal{Z})$, where $\Pi$ is the set of possible agent policies and $\mathcal{Z}$ is the latent space. Thus, the CLUTR teacher is a latent environment designer: it samples a latent vector $z$, and $\theta$ is generated by the VAE decoder function $G: \mathcal{Z} \to \Theta$. We present the outline of CLUTR in Algorithm 1. The CLUTR outline is very similar to PAIRED, differing only in the first two lines of the main loop, which incorporate the latent space.

Algorithm 1 CLUTR
1: Pretrain the VAE with randomly sampled tasks from $\Theta$
2: Randomly initialize the Agent $\pi^P$, the Antagonist $\pi^A$, and the Teacher $\tilde{\Lambda}$
3: repeat
4:   Generate a latent task vector $z \sim \mathcal{Z}$ from the teacher
5:   Create the POMDP $M_\theta$, where $\theta = G(z)$ and $G$ is the VAE decoder function
6:   Collect the Agent trajectory $\tau^P$ in $M_\theta$. Compute $U^{\theta}(\pi^P) = \sum_{t=0}^{T} r_t \gamma^t$
7:   Collect the Antagonist trajectory $\tau^A$ in $M_\theta$. Compute $U^{\theta}(\pi^A) = \sum_{t=0}^{T} r_t \gamma^t$
8:   Compute $\mathrm{REGRET}^{\theta}(\pi^P, \pi^A) = U^{\theta}(\pi^A) - U^{\theta}(\pi^P)$
9:   Train the Agent policy $\pi^P$ with an RL update and reward $R(\tau^P) = U^{\theta}(\pi^P)$
10:  Train the Antagonist policy $\pi^A$ with an RL update and reward $R(\tau^A) = U^{\theta}(\pi^A)$
11:  Train the Teacher policy $\tilde{\Lambda}$ with an RL update and reward $R(\tau^{\tilde{\Lambda}}) = \mathrm{REGRET}$
12: until not converged
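The following bare-bones Python sketch mirrors the main loop of Algorithm 1 with toy stand-ins (hypothetical functions such as decode_task, rollout, and rl_update) in place of the real VAE decoder, environments, and PPO updates; it is meant only to show the control flow: the teacher proposes a full task in one step via the frozen decoder $G$ and is rewarded with the regret.

```python
import random

GAMMA = 0.99

def decode_task(z):                     # stand-in for the frozen VAE decoder G: Z -> Theta
    return [int(abs(v) * 100) % 169 + 1 for v in z]

def rollout(policy, theta):             # stand-in for collecting a trajectory in M_theta
    return [random.random() for _ in range(10)]       # per-step rewards

def discounted_return(rewards, gamma=GAMMA):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def rl_update(policy, reward):          # stand-in for a PPO/actor-critic update
    pass

agent, antagonist, teacher = object(), object(), object()

for step in range(5):                                    # repeat ... until not converged
    z = [random.gauss(0.0, 1.0) for _ in range(4)]       # line 4: teacher samples a latent task vector
    theta = decode_task(z)                               # line 5: full task in a single timestep
    u_agent = discounted_return(rollout(agent, theta))            # line 6
    u_antagonist = discounted_return(rollout(antagonist, theta))  # line 7
    regret = u_antagonist - u_agent                      # line 8: REGRET^theta(pi_P, pi_A)
    rl_update(agent, u_agent)                            # line 9: students maximize their own returns
    rl_update(antagonist, u_antagonist)                  # line 10
    rl_update(teacher, regret)                           # line 11: teacher maximizes regret (bandit-style)
```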
Now we discuss a couple of additional properties of CLUTR compared to other adaptive-teacher UEDs, i.e., PAIRED and REPAIRED. First, the CLUTR teacher samples from the latent space $\mathcal{Z}$ and thus generates a task in a single timestep. Note that this is not possible for other adaptive UED teachers, as they operate in parameter space and generate one task parameter at a time, conditioned on the state of the partially generated task so far. Furthermore, because adaptive-teacher UEDs typically observe the state of their partially generated task to generate the next parameters, they require different teacher architectures for environments with different state spaces. The CLUTR teacher architecture, however, is agnostic to the problem domain and does not depend on the state space, so the same architecture can be used across different environments.
4.4. CLUTR in the context of the contemporary UED method landscape

As discussed in Section 2, contemporary UED methods can be characterized by i) their teacher type (random/fixed or learned/adaptive) and ii) their use of replay. To clearly place CLUTR in the context of contemporary UEDs, we discuss another important aspect of curriculum learning algorithms: how the task manifold is learned. The random-generator UEDs (e.g., DR, PLR) do not learn a task manifold. Regret-based adaptive teachers, i.e., PAIRED and REPAIRED, learn an implicit task manifold (e.g., the hidden state of the teacher LSTM) from scratch, but it is not utilized explicitly. It is trained via RL, based on the regret estimates of the tasks they generate. Hence, these task manifolds depend on the quality of the estimates, which in turn depends on the overall health of the multi-agent RL training. Furthermore, they do not take the actual task structures into account. In contrast, CLUTR introduces an explicit task manifold, modeled with a VAE, that can represent a local neighborhood structure capturing the similarity of tasks, subject to the parameter space being used. Hence, similar tasks (in terms of parameterization) are placed nearby in the latent space. Intuitively, this local neighborhood structure should help the teacher navigate the manifold effectively. The above discussion illustrates that CLUTR, along with PAIRED and REPAIRED, forms a category of UEDs that generate tasks based on a learned task manifold, orthogonal to the random-generation-based methods, with CLUTR being the only one utilizing an unsupervised generative task manifold. Table 1 summarizes the similarities and differences.

| Algorithm  | Task Representation Learning                | Teacher Model | UED Method | Replay Method  |
| DR         | -                                           | Random        | DR         | -              |
| PLR        | -                                           | Random        | DR         | PLR            |
| Robust PLR | -                                           | Random        | DR         | Robust PLR     |
| ACCEL      | -                                           | Random        | DR         | DR + Evolution |
| PAIRED     | Implicit via RL                             | Learned       | Regret     | -              |
| REPAIRED   | Implicit via RL                             | Learned       | Regret     | Robust PLR     |
| CLUTR      | Explicit via Unsupervised Generative Model  | Learned       | Regret     | -              |

Table 1: A comparative characterization of contemporary UED methods.
5. Experiments
In this section, we evaluate CLUTR in two challenging domains: i) pixel-based CarRacing with continuous control and dense rewards, and ii) partially observable navigation tasks with discrete control and sparse rewards. We compare CLUTR primarily with PAIRED to analyze its impact on improving adaptive-teacher UED algorithms, experimenting with two commonly used regret objectives: standard and flexible. As discussed in Sections 2 and 4.4, there are other random-generator and adaptive-teacher UEDs employing techniques complementary or orthogonal to our approach. For completeness, we compare CLUTR with such existing UED methods in Sections C.1 and D of the Appendix.

We then empirically investigate the following hypotheses:

H1: Simultaneous learning of the latent task manifold and the curriculum degrades performance (Section 5.3).

H2: Training the VAE on sorted data solves the combinatorial explosion problem (Section 5.4).

Finally, we analyze the CLUTR curriculum from multiple aspects, comparing it with PAIRED to gain a closer understanding. Full details of the environments, network architectures, training hyperparameters, and VAE training are provided in the Appendix.
5.1. CLUTR Performance on Pixel-Based Continuous Control CarRacing Environment

The CarRacing environment (Jiang et al., 2021a; Brockman et al., 2016) requires the agent to drive a full lap around a closed-loop racing track modeled with Bézier curves (Mortenson, 1999) of up to 12 control points. Both CLUTR and PAIRED were trained for 2M timesteps for the flexible regret objective and for 5M timesteps for the standard regret objective experiments. We train the VAE on 1 million randomly generated tracks for 1 million gradient updates. Note that only one VAE was trained and used for all the experiments (10 independent runs, both objectives). We evaluate the agents on the F1 benchmark (Jiang et al., 2021a), containing 20 test tracks modeled on real-life F1 racing tracks. These tracks are significantly out of distribution compared to any tracks the UED teachers can generate with just 12 control points. Further details on the environment, network architectures, VAE training, and detailed experimental results with analysis can be found in Sections B.1, B.2, B.4, and C of the Appendix, respectively.

Figure 2 shows the mean return obtained by CLUTR and PAIRED on the full F1 benchmark. We independently experimented with both the standard and flexible regret objectives. We notice that PAIRED performs miserably with standard regret in these tasks. However, implementing