2 Related Work
Curriculum reinforcement learning.
Curriculum reinforcement learning (CRL) [6, 16] focuses on the generation of training environments for RL agents. CRL pursues several objectives: improving learning efficiency on difficult tasks (time-to-threshold), maximizing return (asymptotic performance), or transferring policies to solve unseen tasks (generalization). From a domain randomization perspective, Active Domain Randomization [5, 17] uses curricula to diversify the physical parameters of the simulator and thereby facilitate generalization in sim-to-real transfer. From a game-theoretic perspective, adversarial training has been developed to improve the robustness of RL agents in unseen environments [18, 19, 20, 21]. From an intrinsic motivation perspective, methods have been proposed to create curricula even in the absence of a target task to be accomplished [22, 13, 23].
CRL as an interpolation of distributions.
In this work, we focus on another line of work that interprets CRL as an explicit interpolation between an auxiliary task distribution and a difficult task distribution [8, 9, 10, 11]. Self-Paced Reinforcement Learning (SPRL) [8] generates intermediate distributions by measuring task distribution similarity with the Kullback–Leibler (KL) divergence. However, as we will show in this paper, the KL divergence has several shortcomings that may impede the use of these algorithms. First, although the formulation of [8, 9, 10, 11] does not restrict the distribution class, the algorithmic realization requires explicit computation of the KL divergence, which is analytically tractable only for a restricted family of distributions. Second, using the KL divergence implicitly assumes an $\ell_2$ Euclidean space, which ignores the manifold structure that arises when parameterizing RL environments. In this work, we instead use the Wasserstein distance to measure the distance between distributions. Unlike the KL divergence, the Wasserstein distance takes the ground metric into account and opens up a wide variety of task distance measures.
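To make the role of the ground metric concrete, recall the standard definition of the $p$-Wasserstein distance between two task distributions $\mu, \nu$ over a context space $\mathcal{C}$ (this is textbook optimal-transport background, not a formulation specific to our method):
\[
W_p(\mu, \nu) = \Big( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{C} \times \mathcal{C}} d(c, c')^p \, \mathrm{d}\pi(c, c') \Big)^{1/p},
\]
where $\Pi(\mu, \nu)$ is the set of couplings with marginals $\mu$ and $\nu$, and $d$ is a ground metric on $\mathcal{C}$. In contrast, the KL divergence $D_{\mathrm{KL}}(\mu \,\|\, \nu) = \int \log \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \, \mathrm{d}\mu$ depends only on the density ratio and is oblivious to $d$.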
CRL using Optimal Transport.
Hindsight Goal Generation (HGG) [24] aims to address the poor exploration of Hindsight Experience Replay (HER): it approximately computes a 2-Wasserstein barycenter to guide hindsight goals towards the target distribution, forming an implicit curriculum. Concurrent to our work, CURROT [25] also uses optimal transport to generate intermediate tasks explicitly, formulating CRL as a constrained optimization problem with the 2-Wasserstein distance measuring the distance between distributions. The main difference is that we propose task-dependent contextual distance metrics and directly treat the interpolation as the geodesic from the source to the target distribution.
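For background on this geodesic view (a standard optimal-transport fact rather than a description of either algorithm's exact update), when an optimal transport map $T$ from a source task distribution $\mu_0$ to a target $\mu_1$ exists under the 2-Wasserstein distance (e.g., for absolutely continuous $\mu_0$), the constant-speed geodesic between them is the displacement interpolation
\[
\mu_t = \big( (1 - t)\,\mathrm{Id} + t\,T \big)_{\#} \, \mu_0, \qquad t \in [0, 1],
\]
which satisfies $W_2(\mu_0, \mu_t) = t\, W_2(\mu_0, \mu_1)$, so intermediate task distributions advance from source to target at a constant rate.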
Gradual domain adaptation in semi-supervised learning.
Gradual domain adaptation (GDA) [26, 27, 28, 29, 30, 31, 32, 33] considers the problem of transferring a classifier trained on labeled data in a source domain to a target domain with only unlabeled data. GDA addresses this problem by designing a sequence of learning tasks: at each stage, the classifier is retrained on pseudolabels produced by the classifier from the previous stage, as sketched below. Most of the existing literature assumes that intermediate domains exist. However, a few works aim to tackle the problem when intermediate domains, or their index (i.e., stage in the curriculum), are not readily available. A coarse-to-fine framework has been proposed to sort and index intermediate domain data [33]. Another study creates virtual samples from intermediate distributions by interpolating representations of examples from the source and target domains, and suggests using the optimal transport map to create interpolated data in semi-supervised learning [32]. It is demonstrated theoretically in [27] that the optimal path of samples is the geodesic interpolation defined by the optimal transport map. Our work is inspired by the divide-and-conquer paradigm in GDA and also uses the geodesic as our curriculum plan, although in a different learning paradigm.
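To illustrate the gradual self-training loop described above, the following is a minimal sketch (the model class, domain ordering, and function names are illustrative assumptions, not the exact procedure of any cited work):

# Minimal sketch of gradual self-training for GDA.
# Assumes the intermediate domains are given, ordered from source-like
# to target-like; the linear model is an illustrative choice.
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def gradual_self_training(X_source, y_source, intermediate_domains, X_target):
    """Adapt a classifier along a sequence of unlabeled domains."""
    model = LogisticRegression(max_iter=1000).fit(X_source, y_source)
    for X_stage in list(intermediate_domains) + [X_target]:
        # Pseudolabels come from the classifier of the previous stage.
        pseudo_labels = model.predict(X_stage)
        model = clone(model).fit(X_stage, pseudo_labels)
    return model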
3 Preliminary
3.1 Contextual Markov Decision Process
A contextual Markov decision process (CMDP) extends the standard single-task MDP to a multi-task setting. In this work, we consider discounted infinite-horizon CMDPs, represented by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{C}, \mathcal{A}, R, P, p_0, \rho, \gamma)$. Here, $\mathcal{S}$ is the state space, $\mathcal{C}$ is the context space, $\mathcal{A}$ is the action space, $R: \mathcal{S} \times \mathcal{A} \times \mathcal{C} \mapsto \mathbb{R}$ is the context-dependent reward function, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{C} \mapsto \Delta(\mathcal{S})$ is the context-dependent transition function, $p_0: \mathcal{C} \mapsto \Delta(\mathcal{S})$ is the context-dependent initial state distribution, $\rho \in \Delta(\mathcal{C})$ is the context distribution, and $\gamma \in (0, 1)$ is the discount factor. Note that goal-conditioned reinforcement learning [12] can be considered as a special case of the CMDP.
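For reference, a minimal sketch of the learning objective in a CMDP, assuming a context-conditioned policy $\pi(a \mid s, c)$ (the specific curriculum objective used in this work is introduced later), is the expected discounted return under the context distribution:
\[
J(\pi, \rho) = \mathbb{E}_{c \sim \rho} \, \mathbb{E}_{s_0 \sim p_0(c), \, a_t \sim \pi(\cdot \mid s_t, c), \, s_{t+1} \sim P(s_t, a_t, c)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, c) \right].
\]
A curriculum then amounts to choosing the sequence of context distributions under which this objective is optimized as training progresses.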