
the teacher needs to simultaneously learn a task manifold, from scratch, to generate training tasks and navigate this manifold to induce an efficient curriculum. However, the teacher learns this task manifold only implicitly, based on the student's regret, and because the student is continuously co-learning with the teacher, the task manifold also keeps evolving over time. Hence, learning the task manifold and the curriculum simultaneously is unstable and makes for a difficult learning problem.
To address the above-mentioned challenges, we present Curriculum Learning via Unsupervised Task Representation Learning (CLUTR). At the core of CLUTR lies a hierarchical graphical model that decouples task representation learning from curriculum learning. We develop a variational approximation to the UED problem and employ a Recurrent Variational AutoEncoder (VAE), pretrained in an unsupervised manner, to learn a latent task manifold. Unlike contemporary adaptive teachers, which build tasks from scratch one parameter at a time, the CLUTR teacher generates tasks in a single timestep by sampling points from the latent task manifold and using the generative model to translate them into complete tasks. The CLUTR teacher learns the curriculum by navigating the pretrained and fixed task manifold to maximize regret. By utilizing a pretrained latent task manifold, the CLUTR teacher can be trained as a contextual bandit, overcoming the long-horizon credit assignment problem, and can create a curriculum much more efficiently, improving stability at no cost to effectiveness. Finally, by carefully introducing bias into the training corpus (such as sorting each parameter vector), CLUTR addresses the combinatorial explosion of the parameter space without using any costly environment interactions.
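To make this two-stage pipeline concrete, the sketch below illustrates a recurrent VAE pretrained without supervision on randomly sampled (sorted) task-parameter sequences, and a teacher that proposes a single latent point which the frozen decoder expands into a complete task. This is only an illustrative sketch under our own assumptions, not the released implementation; all names and hyperparameters (TaskVAE, vocab, seq_len, etc.) are placeholders.

```python
# Illustrative sketch only: a recurrent VAE over discrete task-parameter
# sequences, pretrained unsupervised, plus a teacher that emits a complete
# task in one step by decoding a sampled latent point.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskVAE(nn.Module):
    def __init__(self, vocab=64, seq_len=12, emb=32, hid=128, latent=16):
        super().__init__()
        self.vocab, self.seq_len = vocab, seq_len
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.to_mu = nn.Linear(hid, latent)
        self.to_logvar = nn.Linear(hid, latent)
        self.from_z = nn.Linear(latent, hid)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def loss(self, tokens):
        # Encode the parameter sequence into a latent Gaussian.
        _, h = self.encoder(self.embed(tokens))
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # Teacher-forced reconstruction of the same sequence from z.
        dec, _ = self.decoder(self.embed(tokens), self.from_z(z).unsqueeze(0))
        logits = self.out(dec)
        rec = F.cross_entropy(logits[:, :-1].reshape(-1, self.vocab),
                              tokens[:, 1:].reshape(-1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return rec + kl

    @torch.no_grad()
    def generate(self, z):
        # Greedy autoregressive decoding: latent point -> full parameter vector.
        h = self.from_z(z).unsqueeze(0)
        tok = torch.zeros(z.size(0), 1, dtype=torch.long)
        out = []
        for _ in range(self.seq_len):
            o, h = self.decoder(self.embed(tok), h)
            tok = self.out(o[:, -1]).argmax(-1, keepdim=True)
            out.append(tok)
        return torch.cat(out, dim=1)

# Stage 1: unsupervised pretraining on random, *sorted* parameter vectors
# (sorting removes the permutation ambiguity of set-valued tasks).
vae = TaskVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
for _ in range(100):                               # a few illustrative steps
    params, _ = torch.randint(0, 64, (256, 12)).sort(dim=1)
    opt.zero_grad()
    vae.loss(params).backward()
    opt.step()

# Stage 2: the teacher acts on the frozen manifold. One "action" is a latent
# point z; the decoder turns it into a complete task in a single step. The
# teacher is then updated as a contextual bandit on the observed regret
# (e.g., antagonist return minus protagonist return in a PAIRED-style setup).
z = torch.randn(1, 16)
task = vae.generate(z)
```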
While CLUTR can be integrated with any adaptive-teacher UED, we implement CLUTR on top of PAIRED, one of the most principled and popular UEDs. Our experimental results show that CLUTR outperforms PAIRED, both in terms of generalization and sample efficiency, in the challenging pixel-based continuous CarRacing and partially observable discrete navigation tasks. For CarRacing, CLUTR achieves 10.6X higher zero-shot generalization on the F1 benchmark (Jiang et al., 2021a), modeled on 20 real-life F1 racing tracks. Furthermore, CLUTR performs comparably to the non-UED attention-based CarRacing SOTA (Tang et al., 2020), outperforming it on nine of the 20 test tracks while requiring 500X fewer environment interactions. In navigation tasks, CLUTR outperforms PAIRED in 14 out of the 16 unseen tasks, achieving a 45% higher solve rate.
In summary, we make the following contributions: i) we
introduce CLUTR, a novel adaptive-teacher UED algorithm
derived from a hierarchical graphical model for UEDs, that
augments the teacher with unsupervised task-representation learning; ii) CLUTR, by decoupling task representation learning from curriculum learning, solves the long-horizon credit
assignment and the combinatorial explosion problems faced
by regret-based adaptive-teacher UEDs such as PAIRED.
iii) Our experimental results show CLUTR significantly
outperforms PAIRED, both in terms of generalization and
sample efficiency, in two challenging domains: CarRacing
and navigation.
2. Related Work
Unsupervised Curriculum Design:
Dennis et al. (2020) were the first to formalize UED and introduced the minimax-regret-based UED teacher algorithm PAIRED, with a strong theoretical robustness guarantee. However, gradient-based multi-agent RL has no convergence guarantees and often fails to converge in practice (Mazumdar et al., 2019). Pre-existing techniques such as Domain Randomization (DR) (Jakobi, 1997; Sadeghi & Levine, 2016; Tobin et al., 2017) and minimax adversarial curriculum learning (Morimoto & Doya, 2005; Pinto et al., 2017) also fall under the category of UEDs. The DR teacher follows a uniformly random strategy, while minimax adversarial teachers follow the maximin criterion, i.e., they generate tasks that minimize the returns of the agent. POET (Wang et al., 2019) and Enhanced POET (Wang et al., 2020) also approached UED, before PAIRED, by co-evolving a population of tasks and agents.
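The difference between these teacher strategies can be summarized schematically. The sketch below is our own illustration, not code from any of the cited works; task_space, student_return, and antagonist_return are hypothetical callables and containers supplied by the caller.

```python
# Schematic comparison of UED teacher objectives (illustrative only).
import random

def dr_teacher(task_space):
    # Domain Randomization: uniformly random tasks, ignoring the student.
    return random.choice(task_space)

def minimax_teacher(task_space, student_return):
    # Minimax adversarial: pick the task minimizing the student's return.
    return min(task_space, key=student_return)

def minimax_regret_teacher(task_space, student_return, antagonist_return):
    # Minimax regret (PAIRED-style): maximize the gap between what an
    # antagonist can achieve and what the student currently achieves.
    return max(task_space, key=lambda t: antagonist_return(t) - student_return(t))
```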
Recently, Jiang et al. (2021a) proposed Dual Curriculum Design (DCD): a novel class of UEDs that augments UED generation methods (e.g., DR and PAIRED) with replay capabilities. DCD involves two teachers: one actively generates tasks with PAIRED or DR, while the other curates the curriculum with Prioritized Level Replay (PLR) (Jiang et al., 2021b), replaying previously generated tasks. Jiang et al. (2021a) show that, even with random generation (i.e., DR), if students are updated only on replayed levels (not when levels are first generated, i.e., without PLR's exploratory student gradient updates) and levels are scored by regret, PLR can also learn minimax-regret agents at Nash equilibrium; they call this variation Robust PLR. They also introduce REPAIRED, which combines PAIRED with Robust PLR. Parker-Holder et al. (2022) introduce ACCEL, which improves on Robust PLR by allowing tasks to be edited and mutated with an evolutionary algorithm. Currently, random-teacher UEDs outperform adaptive-teacher UED methods.
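To clarify how replay-only student updates fit together with regret-based scoring, the following sketch gives our reading of a Robust-PLR-style training step. It is not the released implementation; generate_level, rollout, estimate_regret, and update_student are hypothetical callables passed in by the caller, and regret scores are assumed non-negative.

```python
# Illustrative sketch of a Robust-PLR-style step (not the released code):
# the student receives gradient updates only on *replayed* levels; freshly
# generated levels are merely scored by regret and stored for later replay.
import random

def robust_plr_step(level_buffer, generate_level, rollout,
                    estimate_regret, update_student,
                    replay_prob=0.5, buffer_size=256):
    if level_buffer and random.random() < replay_prob:
        # Replay branch: sample a stored level in proportion to its regret
        # score and train the student on it.
        levels, scores = zip(*level_buffer)
        level = random.choices(levels, weights=scores, k=1)[0]
        update_student(rollout(level))
    else:
        # Generation branch: evaluate a new random level and score it, but do
        # NOT update the student (no exploratory gradient updates).
        level = generate_level()
        level_buffer.append((level, estimate_regret(rollout(level))))
        level_buffer.sort(key=lambda pair: pair[1], reverse=True)
        del level_buffer[buffer_size:]      # keep the highest-regret levels
    return level_buffer
```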
While CLUTR and other PAIRED variants actively adapt task generation to the performance of the agents, other algorithms such as PLR generate tasks from a fixed random task distribution, resulting in two categories of UED methods: i) adaptive-teacher/generator-based UEDs and ii) random-generator-based UEDs. The existing adaptive-teacher UEDs