Curriculum Reinforcement Learning using Optimal Transport via Gradual Domain Adaptation
Peide Huang, Mengdi Xu, Jiacheng Zhu, Laixi Shi, Fei Fang, Ding Zhao
Carnegie Mellon University
Pittsburgh, PA 15213
{peideh, mengdixu, jzhu4, laixis, feifang, dingzhao}@andrew.cmu.edu
Abstract
Curriculum Reinforcement Learning (CRL) aims to create a sequence of tasks,
starting from easy ones and gradually learning towards difficult tasks. In this work,
we focus on the idea of framing CRL as interpolations between a source (auxiliary)
and a target task distribution. Although existing studies have shown the great
potential of this idea, it remains unclear how to formally quantify and generate the
movement between task distributions. Inspired by the insights from gradual domain
adaptation in semi-supervised learning, we create a natural curriculum by breaking
down the potentially large task distributional shift in CRL into smaller shifts. We
propose GRADIENT, which formulates CRL as an optimal transport problem with
a tailored distance metric between tasks. Specifically, we generate a sequence of
task distributions as a geodesic interpolation (i.e., Wasserstein barycenter) between
the source and target distributions. Different from many existing methods, our
algorithm considers a task-dependent contextual distance metric and is capable
of handling nonparametric distributions in both continuous and discrete context
settings. In addition, we theoretically show that GRADIENT enables smooth
transfer between subsequent stages in the curriculum under certain conditions. We
conduct extensive experiments in locomotion and manipulation tasks and show that
our proposed GRADIENT achieves higher performance than baselines in terms of
learning efficiency and asymptotic performance.
1 Introduction
Reinforcement Learning (RL) [1] has demonstrated great potential in solving complex decision-making tasks [2], including but not limited to video games [3], chess [4], and robotic manipulation [5]. Among them, various prior works highlight daunting challenges resulting from sparse rewards. For example, in the maze navigation task, since the agent needs to navigate from the initial position to the goal to receive a positive reward, the task requires a large amount of randomized exploration. One solution to this issue is Curriculum Reinforcement Learning (CRL) [6, 7], whose objective is to create a sequence of environments to facilitate the learning of difficult tasks.
Although there are different interpretations of CRL, we focus on the one that views a curriculum as a sequence of task distributions that interpolate between a source (auxiliary) task distribution and a target task distribution [8, 9, 10, 11, 12, 13]. This interpretation allows more general representations of tasks and a wider range of objectives, such as generalization (a uniform distribution over the task collection) or learning to complete difficult tasks (a subset of the task collection). Again we use maze navigation as an example. Given a fixed maze layout and goal, a task is defined by a start position, and the task distribution is a categorical distribution over all possible start positions. With a target distribution putting mass over start positions far away from the goal position, a natural curriculum to accelerate the learning process is to put start positions close to the goal position first and gradually move them towards the target distribution.
Figure 1: Intermediate task distributions generated by (a) linear interpolation and (b) our method (GRADIENT) for the maze navigation task. The green cell represents the goal. The red cells represent the initial positions (the darker the color, the higher the probability). The first column shows the source task distribution and the last column shows the target task distribution. In (a), linear interpolation does not cover cells where both the source and the target have zero probability, which hardly benefits learning. In contrast, (b) GRADIENT creates a curriculum that gradually morphs from the source to the target, covering tasks of intermediate difficulty and improving learning efficiency.
However, most of the existing methods that interpret the curriculum as shifting distributions use the Kullback–Leibler (KL) divergence to measure the distance between distributions. This setting imposes several restrictions. First, due to either the problem formulation or computational feasibility, existing methods often require the distribution to be parameterized, e.g., as a Gaussian [8, 9, 10, 13], which limits their usage in practice. Second, most of the existing algorithms using KL divergence implicitly assume an $\ell_2$ Euclidean space, which ignores the manifold structure when parameterizing RL environments [14].
In light of the aforementioned issues with existing CRL methods, we propose GRADIENT, an algorithm that creates a sequence of task distributions gradually morphing from the source to the target distribution using Optimal Transport (OT). GRADIENT approaches CRL from a gradual domain adaptation (GDA) perspective, breaking the potentially large domain shift between the source and the target into smaller shifts to enable efficient and smooth policy transfer. In this work, we first define a distance metric between individual tasks. We then find a sequence of task distributions that interpolate between the easy and the difficult task distribution by computing Wasserstein barycenters. GRADIENT is able to deal with both discrete and continuous environment parameter spaces, as well as nonparametric distributions (represented either by explicit categorical distributions or by implicit empirical distributions of particles). Under some conditions [15], GRADIENT provably ensures a smooth adaptation from one stage to the next.
We summarize our main contributions as follows:
1. We propose GRADIENT, a novel CRL framework based on optimal transport that generates gradually morphing intermediate task distributions. As a result, GRADIENT requires little effort to transfer between subsequent stages and therefore improves learning efficiency towards difficult tasks.
2. We develop the $\pi$-contextual-distance to measure task similarity and compute Wasserstein barycenters as intermediate task distributions. Our proposed method is able to deal with both continuous and discrete context spaces as well as nonparametric distributions. We also prove a theoretical bound on policy transfer performance which leads to practical insights.
3. We demonstrate empirically that GRADIENT achieves stronger learning efficiency and asymptotic performance in a wide range of locomotion and manipulation tasks when compared with state-of-the-art CRL baselines.
2 Related Work

Curriculum reinforcement learning. Curriculum reinforcement learning (CRL) [6, 16] focuses on the generation of training environments for RL agents. There are several objectives in CRL: improving learning efficiency towards difficult tasks (time-to-threshold), maximizing return (asymptotic performance), or transferring policies to solve unseen tasks (generalization). From a domain randomization perspective, Active Domain Randomization [5, 17] uses curricula to diversify the physical parameters of the simulator to facilitate generalization in sim-to-real transfer. From a game-theoretic perspective, adversarial training has also been developed to improve the robustness of RL agents in unseen environments [18, 19, 20, 21]. From an intrinsic motivation perspective, methods have been proposed to create curricula even in the absence of a target task to be accomplished [22, 13, 23].
CRL as an interpolation of distributions. In this work, we focus on another stream of works that interprets CRL as an explicit interpolation between an auxiliary task distribution and a difficult task distribution [8, 9, 10, 11]. Self-Paced Reinforcement Learning (SPRL) [8] generates intermediate distributions by measuring task distribution similarity using the Kullback–Leibler (KL) divergence. However, as we show in this paper, KL divergence brings several shortcomings that may impede the usage of these algorithms. First, although the formulation of [8, 9, 10, 11] does not restrict the distribution class, the algorithmic realization requires the explicit computation of the KL divergence, which is analytically tractable only for a restricted family of distributions. Second, using KL divergence implicitly assumes an $\ell_2$ Euclidean space, which ignores the manifold structure when parameterizing RL environments. In this work, we use the Wasserstein distance instead of the KL divergence to measure the distance between distributions. Unlike the KL divergence, the Wasserstein distance takes the ground metric into account and opens up a wide variety of task distance measures.
CRL using Optimal Transport. Hindsight Goal Generation (HGG) [24] aims to solve the poor-exploration problem of Hindsight Experience Replay (HER). HGG approximately computes a 2-Wasserstein barycenter to guide hindsight goals towards the target distribution in an implicit curriculum. Concurrent to our work, CURROT [25] also uses optimal transport to generate intermediate tasks explicitly. CURROT formulates CRL as a constrained optimization problem with the 2-Wasserstein distance measuring the distance between distributions. The main difference is that we propose task-dependent contextual distance metrics and directly treat the interpolation as the geodesic from the source to the target distribution.
Gradual domain adaptation in semi-supervised learning. Gradual domain adaptation (GDA) [26, 27, 28, 29, 30, 31, 32, 33] considers the problem of transferring a classifier trained on a source domain with labeled data to a target domain with unlabeled data. GDA solves this problem by designing a sequence of learning tasks: at each stage, the classifier is retrained on pseudo-labels created by the classifier from the previous stage. Most of the existing literature assumes that intermediate domains exist. However, a few works aim to tackle the setting where intermediate domains, or their index (i.e., stage in the curriculum), are not readily available. A coarse-to-fine framework is proposed to sort and index intermediate domain data [33]. Another study proposes to create virtual samples from intermediate distributions by interpolating representations of examples from the source and target domains, and suggests using the optimal transport map to create interpolated data in semi-supervised learning [32]. It is demonstrated theoretically in [27] that the optimal path of samples is the geodesic interpolation defined by the optimal transport map. Our work is inspired by the divide-and-conquer paradigm in GDA and also uses the geodesic as our curriculum plan (although in a different learning paradigm).
3 Preliminaries

3.1 Contextual Markov Decision Process

A contextual Markov decision process (CMDP) extends the standard single-task MDP to a multi-task setting. In this work, we consider discounted infinite-horizon CMDPs, represented by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{C}, \mathcal{A}, R, P, p_0, \rho, \gamma)$. Here, $\mathcal{S}$ is the state space, $\mathcal{C}$ is the context space, $\mathcal{A}$ is the action space, $R: \mathcal{S} \times \mathcal{A} \times \mathcal{C} \mapsto \mathbb{R}$ is the context-dependent reward function, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{C} \mapsto \Delta(\mathcal{S})$ is the context-dependent transition function, $p_0: \mathcal{C} \mapsto \Delta(\mathcal{S})$ is the context-dependent initial state distribution, $\rho \in \Delta(\mathcal{C})$ is the context distribution, and $\gamma \in (0, 1)$ is the discount factor. Note that goal-conditioned reinforcement learning [12] can be considered a special case of the CMDP.
To sample a trajectory $\tau := \{s_t, a_t, r_t\}_{t=0}^{\infty}$ in a CMDP, the context $c \sim \rho$ is randomly generated by the environment at the beginning of each episode. Starting from the initial state $s_0 \sim p_0(\cdot \mid c)$, at each time step $t$ the agent follows a policy $\pi$ to select an action $a_t \sim \pi(\cdot \mid s_t, c)$ and receives a reward $R(s_t, a_t, c)$. The environment then transitions to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t, c)$. Contextual reinforcement learning naturally extends the original RL objective to include the context distribution $\rho$. To find the optimal policy, we need to solve the following optimization problem:
$$\max_{\pi} V^{\pi}(\rho) = \max_{\pi} \mathbb{E}_{\tau}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, c) \,\middle|\, c \sim \rho;\ \pi \right] \tag{1}$$
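To make the objective concrete, below is a minimal Monte Carlo sketch of estimating $V^{\pi}(\rho)$ by sampling contexts and rolling out the policy. The `env` and `policy` interfaces (a `reset(context=...)` method and a `step` returning `(state, reward, done)`) are hypothetical placeholders, not part of the paper.

```python
import numpy as np

def estimate_value(env, policy, sample_context, gamma=0.99,
                   n_episodes=100, horizon=1000):
    """Monte Carlo estimate of V^pi(rho) = E_{c~rho} E_tau[sum_t gamma^t R(s_t, a_t, c)]."""
    returns = []
    for _ in range(n_episodes):
        c = sample_context()              # c ~ rho, drawn once per episode
        s = env.reset(context=c)          # s_0 ~ p_0(. | c)
        ep_return, discount = 0.0, 1.0
        for _ in range(horizon):          # truncate the infinite horizon
            a = policy(s, c)              # a_t ~ pi(. | s_t, c)
            s, r, done = env.step(a)      # s_{t+1} ~ P(. | s_t, a_t, c), r = R(s_t, a_t, c)
            ep_return += discount * r
            discount *= gamma
            if done:
                break
        returns.append(ep_return)
    return float(np.mean(returns))
```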
3.2 Optimal Transport

Wasserstein distance. The Kantorovich problem [34], a classic problem in optimal transport [35], aims to find the optimal coupling $\theta$ which minimizes the transportation cost between measures $\mu, \nu \in \mathcal{M}(\mathcal{C})$. The Wasserstein distance thereby defines a distance between probability distributions:
$$W_d(\mu, \nu) = \inf_{\theta \in \Theta(\mu, \nu)} \int_{\mathcal{C} \times \mathcal{C}} d(c_s, c_t)\, \mathrm{d}\theta(c_s, c_t), \quad \text{subject to } \Theta = \{\theta : \gamma^1_{\#}\theta = \mu,\ \gamma^2_{\#}\theta = \nu\} \tag{2}$$
where $\mathcal{C}$ is the support space, $\Theta(\mu, \nu)$ is the set of all couplings between $\mu$ and $\nu$, $d(\cdot, \cdot): \mathcal{C} \times \mathcal{C} \mapsto \mathbb{R}_{\geq 0}$ is a distance function, $\gamma^1$ and $\gamma^2$ are the projections from $\mathcal{C} \times \mathcal{C}$ onto its first and second factors, and $T_{\#}P$ generally denotes the push-forward measure of $P$ by a map $T$ [36, 37, 38]. This optimization is well-defined and the optimal $\theta$ always exists under mild conditions [35].
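For a small discrete context space, the Kantorovich problem in Equation (2) reduces to a linear program that can be solved exactly. Below is a minimal sketch using the POT library (`pip install pot`); the context coordinates and histograms are illustrative, and the Euclidean ground metric stands in for the contextual distance introduced later in Section 4.2.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Illustrative discrete context set (e.g., candidate start positions in a maze).
contexts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
mu = np.array([0.7, 0.3, 0.0, 0.0])   # source task distribution
nu = np.array([0.0, 0.0, 0.4, 0.6])   # target task distribution

# Ground metric d(c_s, c_t); Euclidean here as a simple surrogate.
M = ot.dist(contexts, contexts, metric='euclidean')

coupling = ot.emd(mu, nu, M)   # optimal coupling theta*
W = ot.emd2(mu, nu, M)         # Wasserstein distance W_d(mu, nu)
```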
Wasserstein Geodesic and Wasserstein Barycenter. To construct a curriculum that allows the agent to efficiently solve the difficult task distribution, we follow the Wasserstein geodesic [39], the shortest path [27] under the Wasserstein distance between the source and the target distributions. While the original Wasserstein barycenter problem [40, 41] focuses on the Fréchet mean of multiple probability measures in a space endowed with the Wasserstein metric, we consider only two distributions $\mu_0$ and $\mu_1$. The set of barycenters between $\mu_0$ and $\mu_1$ is the geodesic curve given by McCann's interpolation [42, 43]. Thus, the interpolation between two given distributions $\mu_0$ and $\mu_1$ is defined as:
$$\nu_{\alpha} = \arg\min_{\nu'_{\alpha}}\ (1 - \alpha)\, W_d(\mu_0, \nu'_{\alpha}) + \alpha\, W_d(\nu'_{\alpha}, \mu_1), \tag{3}$$
where each $\alpha \in [0, 1]$ specifies one unique interpolating distribution on the geodesic. While the computational cost of the Wasserstein distance objective in Equation (2) could be a potential obstacle, we can follow entropic optimal transport and utilize the celebrated Sinkhorn's algorithm [44]. Moreover, we can adopt a smoothing bias to solve for scalable debiased Sinkhorn barycenters [45].
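Below is a minimal sketch of Equation (3) for the fixed-support case, using POT's entropic (Sinkhorn-based) barycenter solver; the regularization strength and the shared support are illustrative choices, not values from the paper.

```python
import numpy as np
import ot

def interpolate(mu0, mu1, M, alpha, reg=1e-2):
    """Approximate nu_alpha = argmin (1-alpha) W_d(mu0, .) + alpha W_d(., mu1)
    for two histograms mu0, mu1 on a shared support with ground-cost matrix M."""
    A = np.vstack([mu0, mu1]).T                 # input histograms as columns
    weights = np.array([1.0 - alpha, alpha])    # barycentric weights
    return ot.bregman.barycenter(A, M, reg, weights=weights)

# Example: sweep alpha from the source (0) to the target (1), e.g.
# curriculum = [interpolate(mu0, mu1, M, a) for a in np.linspace(0.0, 1.0, 6)]
```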
4 Curriculum Reinforcement Learning using Optimal Transport
We first formulate the curriculum generation as an interpolation problem between probability
distributions in Section 4.1. Then we propose a distance measure between contexts in Section 4.2
in order to compute the Wasserstein barycenter. Next, we show our main algorithm, GRADIENT,
in Section 4.3 and an associated theorem that provides practical insights in Section 4.4.
4.1 Problem Formulation

Formally, given a target task distribution $\nu(c) \in \Delta(\mathcal{C})$, we aim to automatically generate a curriculum of task distributions $\rho_0(c), \rho_1(c), \ldots, \rho_K(c)$ with $K$ stages that enables the agent to gradually adapt from the auxiliary task distribution $\mu(c)$ to the target $\nu(c)$, i.e., $\rho_0(c) \equiv \mu(c)$ and $\rho_K(c) \equiv \nu(c)$. If the context space $\mathcal{C}$ is discrete (sometimes called fixed-support in OT [35]) with cardinality $|\mathcal{C}|$, the task distribution is represented by a categorical distribution, e.g., $\mu(c) = [p(c_1), p(c_2), \ldots, p(c_{|\mathcal{C}|})]$, where $p(c_i) \geq 0$ and $\sum_i p(c_i) = 1$. Whereas if the context space $\mathcal{C}$ is continuous (sometimes called free-support in OT), the task distribution is approximated by a set of particles sampled from the distribution, e.g., $\mu(c) \approx \hat{\mu}(c) = \frac{1}{n_s} \sum_{i=1}^{n_s} \delta_{c_i^s}(c)$, where $\delta_{c_i^s}(c)$ is a Dirac delta at $c_i^s$. This highlights the capability of our formulation and algorithms to deal with nonparametric distributions, in contrast to existing algorithms [8, 9, 10, 11, 13].
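As a concrete illustration of the two representations above (the numbers below are made up for the sketch, not taken from the experiments):

```python
import numpy as np

# Fixed-support (discrete) contexts: explicit categorical distribution over |C| cells.
n_cells = 25
mu = np.zeros(n_cells); mu[0] = 1.0          # source: all mass on one easy start cell
nu = np.zeros(n_cells); nu[-5:] = 0.2        # target: mass spread over five far-away cells

# Free-support (continuous) contexts: empirical distribution of particles,
# mu_hat(c) = (1/n_s) * sum_i delta_{c_i^s}(c).
rng = np.random.default_rng(0)
source_particles = rng.normal(loc=0.0, scale=0.1, size=(100, 2))   # c_i^s ~ mu
target_particles = rng.normal(loc=3.0, scale=0.1, size=(100, 2))   # c_i^t ~ nu
particle_weights = np.full(100, 1.0 / 100)                          # uniform weights
```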
4.2 Contextual Distance Metrics

To define the distance between contexts, we start by defining the distance between states. Bisimulation metrics [46, 47, 48] measure states' behavioral equivalence by defining a recursive relationship between states. Recent literature in RL [49, 14] uses the bisimulation concept to train state encoders that facilitate multi-tasking and generalization. [50] uses the bisimulation concept to enforce the policy to visit state-action pairs close to the support of logged transitions in offline RL. However, the bisimulation metric is inherently "pessimistic" because it considers equivalence under all actions, even actions that never lead to positive outcomes for the agent [48]. To address this issue, [48] proposes on-policy $\pi$-bisimulation, which considers the dynamics induced by the policy $\pi$. Similarly, we extend this notion to the CMDP setting and define the contextual $\pi$-bisimulation metric:
$$d^{\pi}_{c_i, c_j}(s_i, s_j) = \big|R^{\pi}_{s_i, c_i} - R^{\pi}_{s_j, c_j}\big| + \gamma\, W_d\big(\mathcal{P}^{\pi}_{s_i, c_i}, \mathcal{P}^{\pi}_{s_j, c_j}\big) \tag{4}$$
where $R^{\pi}_{s, c} := \sum_a \pi(a \mid s)\, R(s, a, c)$ and $\mathcal{P}^{\pi}_{s, c} := \sum_a P(\cdot \mid s, a, c)\, \pi(a \mid s)$. With the definition of the contextual $\pi$-bisimulation metric, we are ready to propose the $\pi$-contextual-distance to measure the distance between two contexts:

Definition 4.1 ($\pi$-contextual-distance). Given a CMDP $\mathcal{M} = (\mathcal{S}, \mathcal{C}, \mathcal{A}, R, P, p_0, \rho, \gamma)$, the distance between two contexts $d^{\pi}(c_i, c_j)$ under the policy $\pi$ is defined as
$$d^{\pi}(c_i, c_j) = \mathbb{E}_{s_i \sim p_0(\cdot \mid c_i),\ s_j \sim p_0(\cdot \mid c_j)}\left[ d^{\pi}_{c_i, c_j}(s_i, s_j) \right] \tag{5}$$
Conceptually, the $\pi$-contextual-distance approximately measures the performance difference between two contexts $c_i$ and $c_j$ under $\pi$. The algorithm to compute a simplified version of this metric under some conditions is detailed in Appendix C.1. Note that it is difficult to compute this metric precisely in general. In practice, we can design and compute surrogate contextual distance metrics, depending on the specific tasks. There are situations where it is reasonable to use $\ell_2$ as a surrogate metric when the contextual distance resembles the Euclidean distance, such as in some goal-conditioned continuous environments [24]. In addition, although the contextual distance is not a strict metric, we can still use it as a ground metric in the OT computation. With the concept of a contextual distance metric, we now introduce the algorithm that generates intermediate task distributions to enable the agent to gradually transfer from the source to the target task distribution.
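As an illustration of the $\ell_2$ surrogate mentioned above, here is a minimal sketch of a surrogate contextual distance and the resulting ground-cost matrix that can be plugged into the OT solvers; the goal-conditioned-style contexts are a hypothetical example, not a definitive implementation of the $\pi$-contextual-distance.

```python
import numpy as np

def surrogate_context_distance(c_i, c_j):
    """l2 surrogate for the pi-contextual-distance; reasonable when contexts
    (e.g., goal or start positions) live in a space that resembles Euclidean space."""
    return float(np.linalg.norm(np.asarray(c_i) - np.asarray(c_j)))

def ground_cost(source_contexts, target_contexts):
    """Pairwise ground-cost matrix d(c_i, c_j) used as the OT ground metric."""
    return np.array([[surrogate_context_distance(ci, cj) for cj in target_contexts]
                     for ci in source_contexts])
```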
4.3 Algorithms

We present our main algorithm in Algorithm 1. We introduce an interpolation factor $\Delta\alpha$ to decide the difference between two subsequent task distributions in the curriculum: a smaller $\Delta\alpha$ means a smaller difference between subsequent task distributions. In the algorithm, we treat $\Delta\alpha$ as a constant for simplicity, but in effect it can be scheduled or even adaptive, which we leave to future work. Note that $\Delta\alpha$ is analogous to the KL divergence constraint in [9, 10, 8]; the main difference is that we use the Wasserstein distance instead of the KL divergence.

At the beginning of each stage in the curriculum, we add $\Delta\alpha$ to the previous $\alpha$ (starting from $\alpha = 0$). We then pass $\alpha$ into the function ComputeBarycenter (Algorithm 2) to generate the intermediate task distribution. Note that when $\alpha = 0$ and $\alpha = 1$, the generated distributions are the source and the target distribution, respectively. In the ComputeBarycenter function, the computation method differs depending on whether the context is discrete or continuous. After generating the intermediate task distribution, we optimize the agent on this task distribution until the cumulative return $G$ reaches the threshold $\bar{G}$. Then the curriculum enters the next stage and repeats the process until $\alpha = 1$. In other words, the path of the intermediate task distributions is the OT geodesic on the manifold defined by the Wasserstein distance.
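The prose above can be summarized in the following sketch of the main curriculum loop. It is a simplified reading of Algorithm 1 for the fixed-support case, not the authors' released implementation; `train_until_threshold` stands in for the RL training subroutine and `compute_barycenter` for Algorithm 2.

```python
import numpy as np
import ot

def compute_barycenter(mu0, mu1, M, alpha, reg=1e-2):
    """Fixed-support intermediate distribution nu_alpha (cf. Equation (3))."""
    if alpha <= 0.0:
        return mu0
    if alpha >= 1.0:
        return mu1
    A = np.vstack([mu0, mu1]).T
    return ot.bregman.barycenter(A, M, reg, weights=np.array([1.0 - alpha, alpha]))

def gradient_curriculum(mu0, mu1, M, train_until_threshold,
                        delta_alpha=0.1, return_threshold=0.0):
    """Sketch of the GRADIENT loop: step alpha, build the intermediate task
    distribution, train until the return threshold, repeat until alpha = 1."""
    alpha = 0.0
    while alpha < 1.0:
        alpha = min(1.0, alpha + delta_alpha)            # advance one curriculum stage
        rho = compute_barycenter(mu0, mu1, M, alpha)     # intermediate task distribution
        train_until_threshold(rho, return_threshold)     # optimize agent until G >= G_bar
```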
4.4 Theoretical Analysis

The proposed algorithm benefits from breaking the difficulty of learning in the target domain into multiple small challenges by designing a sequence of $K$ stages that gradually morph towards the target. This motivates us to theoretically answer the following question:

Can we achieve smooth transfer to a new stage based on the optimal policy of the previous stage?