Curriculum Reinforcement Learning using Optimal Transport via Gradual Domain Adaptation
Peide Huang, Mengdi Xu, Jiacheng Zhu, Laixi Shi, Fei Fang, Ding Zhao
Carnegie Mellon University
Pittsburgh, PA 15213
{peideh, mengdixu, jzhu4, laixis, feifang, dingzhao}@andrew.cmu.edu
Abstract
Curriculum Reinforcement Learning (CRL) aims to create a sequence of tasks,
starting from easy ones and gradually learning towards difficult tasks. In this work,
we focus on the idea of framing CRL as interpolations between a source (auxiliary)
and a target task distribution. Although existing studies have shown the great
potential of this idea, it remains unclear how to formally quantify and generate the
movement between task distributions. Inspired by the insights from gradual domain
adaptation in semi-supervised learning, we create a natural curriculum by breaking
down the potentially large task distributional shift in CRL into smaller shifts. We
propose GRADIENT, which formulates CRL as an optimal transport problem with
a tailored distance metric between tasks. Specifically, we generate a sequence of
task distributions as a geodesic interpolation (i.e., Wasserstein barycenter) between
the source and target distributions. Different from many existing methods, our
algorithm considers a task-dependent contextual distance metric and is capable
of handling nonparametric distributions in both continuous and discrete context
settings. In addition, we theoretically show that GRADIENT enables smooth
transfer between subsequent stages in the curriculum under certain conditions. We
conduct extensive experiments in locomotion and manipulation tasks and show that
our proposed GRADIENT achieves higher performance than baselines in terms of
learning efficiency and asymptotic performance.
1 Introduction
Reinforcement Learning (RL) [1] has demonstrated great potential in solving complex decision-making tasks [2], including but not limited to video games [3], chess [4], and robotic manipulation [5]. Among them, various prior works highlight daunting challenges resulting from sparse rewards. For example, in the maze navigation task, since the agent needs to navigate from the initial position to the goal to receive a positive reward, the task requires a large amount of randomized exploration. One solution to this issue is Curriculum Reinforcement Learning (CRL) [6, 7], whose objective is to create a sequence of environments to facilitate the learning of difficult tasks.
Although there are different interpretations of CRL, we focus on the one that views a curriculum as a sequence of task distributions that interpolate between a source (auxiliary) task distribution and a target task distribution [8, 9, 10, 11, 12, 13]. This interpretation allows more general representations of tasks and a wider range of objectives, such as generalization (a uniform distribution over the task collection) or learning to complete difficult tasks (a subset of the task collection). Again we use maze navigation as an example. Given a fixed maze layout and goal, a task is defined by a start position, and the task distribution is a categorical distribution over all possible start positions. With a target distribution putting mass over start positions far away from the goal position, a natural curriculum to accelerate the learning process is to put start positions close to the goal position first and gradually move them towards the target distribution.
Figure 1: Intermediate task distributions generated by (a) linear interpolation and (b) our method (GRADIENT) for the maze navigation task. The green cell represents the goal. The red cells represent the initial positions (the darker the color, the higher the probability). The first column shows the source task distribution and the last column shows the target task distribution. In (a), linear interpolation does not cover cells where both the source and the target have zero probability, which hardly benefits learning. In contrast, (b) GRADIENT creates a curriculum that gradually morphs from the source to the target, covering tasks of intermediate difficulty and improving learning efficiency.
However, most of the existing methods that interpret the curriculum as shifting distributions use the Kullback–Leibler (KL) divergence to measure the distance between distributions. This setting imposes several restrictions. First, due to either the problem formulation or computational feasibility, existing methods often require the distribution to be parameterized, e.g., as a Gaussian [8, 9, 10, 13], which limits their usage in practice. Second, most of the existing algorithms using KL divergence implicitly assume an $\ell_2$ Euclidean space, which ignores the manifold structure when parameterizing RL environments [14].
In light of the aforementioned issues with existing CRL methods, we propose GRADIENT, an algorithm that creates a sequence of task distributions gradually morphing from the source to the target distribution using Optimal Transport (OT). GRADIENT approaches CRL from a gradual domain adaptation (GDA) perspective, breaking the potentially large domain shift between the source and the target into smaller shifts to enable efficient and smooth policy transfer. In this work, we first define a distance metric between individual tasks. We then find a sequence of task distributions that interpolate between the easy and the difficult task distribution by computing Wasserstein barycenters. GRADIENT is able to deal with both discrete and continuous environment parameter spaces, as well as nonparametric distributions (represented either by explicit categorical distributions or by implicit empirical distributions of particles). Under some conditions [15], GRADIENT provably ensures a smooth adaptation from one stage to the next.
We summarize our main contributions as follows:
1. We propose GRADIENT, a novel CRL framework based on optimal transport that generates gradually morphing intermediate task distributions. As a result, GRADIENT requires little effort to transfer between subsequent stages and therefore improves learning efficiency towards difficult tasks.
2. We develop the $\pi$-contextual-distance to measure task similarity and compute Wasserstein barycenters as intermediate task distributions. Our proposed method is able to deal with both continuous and discrete context spaces as well as nonparametric distributions. We also prove a theoretical bound on policy transfer performance which leads to practical insights.
3. We demonstrate empirically that GRADIENT achieves stronger learning efficiency and asymptotic performance in a wide range of locomotion and manipulation tasks when compared with state-of-the-art CRL baselines.
2 Related Work

Curriculum reinforcement learning. Curriculum reinforcement learning (CRL) [6, 16] focuses on the generation of training environments for RL agents. There are several objectives in CRL: improving learning efficiency towards difficult tasks (time-to-threshold), maximizing return (asymptotic performance), or transferring policies to solve unseen tasks (generalization). From a domain randomization perspective, Active Domain Randomization [5, 17] uses curricula to diversify the physical parameters of the simulator to facilitate generalization in sim-to-real transfer. From a game-theoretic perspective, adversarial training has also been developed to improve the robustness of RL agents in unseen environments [18, 19, 20, 21]. From an intrinsic motivation perspective, methods have been proposed to create curricula even in the absence of a target task to be accomplished [22, 13, 23].
CRL as an interpolation of distributions. In this work, we focus on another stream of works that interprets CRL as an explicit interpolation between an auxiliary task distribution and a difficult task distribution [8, 9, 10, 11]. Self-Paced Reinforcement Learning (SPRL) [8] generates intermediate distributions by measuring task distribution similarity using the Kullback–Leibler (KL) divergence. However, as we show in this paper, KL divergence brings several shortcomings that may impede the usage of these algorithms. First, although the formulation of [8, 9, 10, 11] does not restrict the distribution class, the algorithmic realization requires the explicit computation of the KL divergence, which is analytically tractable only for a restricted family of distributions. Second, using KL divergence implicitly assumes an $\ell_2$ Euclidean space, which ignores the manifold structure when parameterizing RL environments. In this work, we use the Wasserstein distance instead of the KL divergence to measure the distance between distributions. Unlike the KL divergence, the Wasserstein distance takes the ground metric into account and opens up a wide variety of task distance measures.
CRL using Optimal Transport. Hindsight Goal Generation (HGG) [24] aims to solve the poor-exploration problem of Hindsight Experience Replay (HER). HGG approximately computes a 2-Wasserstein barycenter to guide hindsight goals towards the target distribution in an implicit curriculum. Concurrent to our work, CURROT [25] also uses optimal transport to generate intermediate tasks explicitly. CURROT formulates CRL as a constrained optimization problem with the 2-Wasserstein distance measuring the distance between distributions. The main difference is that we propose task-dependent contextual distance metrics and directly treat the interpolation as the geodesic from the source to the target distribution.
Gradual domain adaptation in semi-supervised learning. Gradual domain adaptation (GDA) [26, 27, 28, 29, 30, 31, 32, 33] considers the problem of transferring a classifier trained on a source domain with labeled data to a target domain with unlabeled data. GDA solves this problem by designing a sequence of learning tasks: at each stage, the classifier is retrained on pseudo-labels created by the classifier from the previous stage. Most of the existing literature assumes that intermediate domains exist. However, a few works aim to tackle the setting where intermediate domains, or their index (i.e., stage in the curriculum), are not readily available. A coarse-to-fine framework is proposed to sort and index intermediate domain data [33]. Another study proposes to create virtual samples from intermediate distributions by interpolating representations of examples from the source and target domains, and suggests using the optimal transport map to create interpolated data in semi-supervised learning [32]. It is demonstrated theoretically in [27] that the optimal path of samples is the geodesic interpolation defined by the optimal transport map. Our work is inspired by the divide-and-conquer paradigm in GDA and also uses the geodesic as our curriculum plan (although in a different learning paradigm).
3 Preliminaries

3.1 Contextual Markov Decision Process

A contextual Markov decision process (CMDP) extends the standard single-task MDP to a multi-task setting. In this work, we consider discounted infinite-horizon CMDPs, represented by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{C}, \mathcal{A}, R, P, p_0, \rho, \gamma)$. Here, $\mathcal{S}$ is the state space, $\mathcal{C}$ is the context space, $\mathcal{A}$ is the action space, $R: \mathcal{S} \times \mathcal{A} \times \mathcal{C} \mapsto \mathbb{R}$ is the context-dependent reward function, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{C} \mapsto \Delta(\mathcal{S})$ is the context-dependent transition function, $p_0: \mathcal{C} \mapsto \Delta(\mathcal{S})$ is the context-dependent initial state distribution, $\rho \in \Delta(\mathcal{C})$ is the context distribution, and $\gamma \in (0, 1)$ is the discount factor. Note that goal-conditioned reinforcement learning [12] can be considered a special case of the CMDP.
To sample a trajectory $\tau := \{s_t, a_t, r_t\}_{t=0}^{\infty}$ in a CMDP, the context $c \sim \rho$ is randomly generated by the environment at the beginning of each episode. Starting from the initial state $s_0 \sim p_0(\cdot \mid c)$, at each time step $t$ the agent follows a policy $\pi$ to select an action $a_t \sim \pi(\cdot \mid s_t, c)$ and receives a reward $R(s_t, a_t, c)$. The environment then transitions to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t, c)$. Contextual reinforcement learning naturally extends the original RL objective to include the context distribution $\rho$. To find the optimal policy, we need to solve the following optimization problem:
$$\max_{\pi} V^{\pi}(\rho) = \max_{\pi} \mathbb{E}_{\tau}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, c) \,\middle|\, c \sim \rho;\ \pi \right] \tag{1}$$
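To make the objective concrete, below is a minimal Monte Carlo sketch of estimating $V^{\pi}(\rho)$ by sampling contexts and rolling out the policy. The `env` and `policy` interfaces (a `reset(context=...)` method and a `step` returning `(state, reward, done)`) are hypothetical placeholders, not part of the paper.

```python
import numpy as np

def estimate_value(env, policy, sample_context, gamma=0.99,
                   n_episodes=100, horizon=1000):
    """Monte Carlo estimate of V^pi(rho) = E_{c~rho} E_tau[sum_t gamma^t R(s_t, a_t, c)]."""
    returns = []
    for _ in range(n_episodes):
        c = sample_context()              # c ~ rho, drawn once per episode
        s = env.reset(context=c)          # s_0 ~ p_0(. | c)
        ep_return, discount = 0.0, 1.0
        for _ in range(horizon):          # truncate the infinite horizon
            a = policy(s, c)              # a_t ~ pi(. | s_t, c)
            s, r, done = env.step(a)      # s_{t+1} ~ P(. | s_t, a_t, c), r = R(s_t, a_t, c)
            ep_return += discount * r
            discount *= gamma
            if done:
                break
        returns.append(ep_return)
    return float(np.mean(returns))
```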
3.2 Optimal Transport

Wasserstein distance. The Kantorovich problem [34], a classic problem in optimal transport [35], aims to find the optimal coupling $\theta$ which minimizes the transportation cost between measures $\mu, \nu \in \mathcal{M}(\mathcal{C})$. The Wasserstein distance thereby defines a distance between probability distributions:
$$W_d(\mu, \nu) = \inf_{\theta \in \Theta(\mu, \nu)} \int_{\mathcal{C} \times \mathcal{C}} d(c_s, c_t)\, \mathrm{d}\theta(c_s, c_t), \quad \text{subject to } \Theta = \{\theta : \gamma^1_{\#}\theta = \mu,\ \gamma^2_{\#}\theta = \nu\} \tag{2}$$
where $\mathcal{C}$ is the support space, $\Theta(\mu, \nu)$ is the set of all couplings between $\mu$ and $\nu$, $d(\cdot, \cdot): \mathcal{C} \times \mathcal{C} \mapsto \mathbb{R}_{\geq 0}$ is a distance function, $\gamma^1$ and $\gamma^2$ are the projections from $\mathcal{C} \times \mathcal{C}$ onto its first and second factors, and $T_{\#}P$ generally denotes the push-forward measure of $P$ by a map $T$ [36, 37, 38]. This optimization is well-defined and the optimal $\theta$ always exists under mild conditions [35].
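For a small discrete context space, the Kantorovich problem in Equation (2) reduces to a linear program that can be solved exactly. Below is a minimal sketch using the POT library (`pip install pot`); the context coordinates and histograms are illustrative, and the Euclidean ground metric stands in for the contextual distance introduced later in Section 4.2.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Illustrative discrete context set (e.g., candidate start positions in a maze).
contexts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
mu = np.array([0.7, 0.3, 0.0, 0.0])   # source task distribution
nu = np.array([0.0, 0.0, 0.4, 0.6])   # target task distribution

# Ground metric d(c_s, c_t); Euclidean here as a simple surrogate.
M = ot.dist(contexts, contexts, metric='euclidean')

coupling = ot.emd(mu, nu, M)   # optimal coupling theta*
W = ot.emd2(mu, nu, M)         # Wasserstein distance W_d(mu, nu)
```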
Wasserstein Geodesic and Wasserstein Barycenter. To construct a curriculum that allows the agent to efficiently solve the difficult task distribution, we follow the Wasserstein geodesic [39], the shortest path [27] under the Wasserstein distance between the source and the target distributions. While the original Wasserstein barycenter problem [40, 41] focuses on the Fréchet mean of multiple probability measures in a space endowed with the Wasserstein metric, we consider only two distributions $\mu_0$ and $\mu_1$. The set of barycenters between $\mu_0$ and $\mu_1$ is the geodesic curve given by McCann's interpolation [42, 43]. Thus, the interpolation between two given distributions $\mu_0$ and $\mu_1$ is defined as:
$$\nu_{\alpha} = \arg\min_{\nu'_{\alpha}}\ (1 - \alpha)\, W_d(\mu_0, \nu'_{\alpha}) + \alpha\, W_d(\nu'_{\alpha}, \mu_1), \tag{3}$$
where each $\alpha \in [0, 1]$ specifies one unique interpolating distribution on the geodesic. While the computational cost of the Wasserstein distance objective in Equation (2) could be a potential obstacle, we can follow entropic optimal transport and utilize the celebrated Sinkhorn's algorithm [44]. Moreover, we can adopt a smoothing bias to solve for scalable debiased Sinkhorn barycenters [45].
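Below is a minimal sketch of Equation (3) for the fixed-support case, using POT's entropic (Sinkhorn-based) barycenter solver; the regularization strength and the shared support are illustrative choices, not values from the paper.

```python
import numpy as np
import ot

def interpolate(mu0, mu1, M, alpha, reg=1e-2):
    """Approximate nu_alpha = argmin (1-alpha) W_d(mu0, .) + alpha W_d(., mu1)
    for two histograms mu0, mu1 on a shared support with ground-cost matrix M."""
    A = np.vstack([mu0, mu1]).T                 # input histograms as columns
    weights = np.array([1.0 - alpha, alpha])    # barycentric weights
    return ot.bregman.barycenter(A, M, reg, weights=weights)

# Example: sweep alpha from the source (0) to the target (1), e.g.
# curriculum = [interpolate(mu0, mu1, M, a) for a in np.linspace(0.0, 1.0, 6)]
```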
4 Curriculum Reinforcement Learning using Optimal Transport
We first formulate the curriculum generation as an interpolation problem between probability
distributions in Section 4.1. Then we propose a distance measure between contexts in Section 4.2
in order to compute the Wasserstein barycenter. Next, we show our main algorithm, GRADIENT,
in Section 4.3 and an associated theorem that provides practical insights in Section 4.4.
4.1 Problem Formulation

Formally, given a target task distribution $\nu(c) \in \Delta(\mathcal{C})$, we aim to automatically generate a curriculum of task distributions $\rho_0(c), \rho_1(c), \ldots, \rho_K(c)$ with $K$ stages that enables the agent to gradually adapt from the auxiliary task distribution $\mu(c)$ to the target $\nu(c)$, i.e., $\rho_0(c) \equiv \mu(c)$ and $\rho_K(c) \equiv \nu(c)$. If the context space $\mathcal{C}$ is discrete (sometimes called fixed-support in OT [35]) with cardinality $|\mathcal{C}|$, the task distribution is represented by a categorical distribution, e.g., $\mu(c) = [p(c_1), p(c_2), \ldots, p(c_{|\mathcal{C}|})]$, where $p(c_i) \geq 0$ and $\sum_i p(c_i) = 1$. Whereas if the context space $\mathcal{C}$ is continuous (sometimes called free-support in OT), the task distribution is approximated by a set of particles sampled from the distribution, e.g., $\mu(c) \approx \hat{\mu}(c) = \frac{1}{n_s} \sum_{i=1}^{n_s} \delta_{c_i^s}(c)$, where $\delta_{c_i^s}(c)$ is a Dirac delta at $c_i^s$. This highlights the capability of our formulation and algorithms to deal with nonparametric distributions, in contrast to existing algorithms [8, 9, 10, 11, 13].
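As a concrete illustration of the two representations above (the numbers below are made up for the sketch, not taken from the experiments):

```python
import numpy as np

# Fixed-support (discrete) contexts: explicit categorical distribution over |C| cells.
n_cells = 25
mu = np.zeros(n_cells); mu[0] = 1.0          # source: all mass on one easy start cell
nu = np.zeros(n_cells); nu[-5:] = 0.2        # target: mass spread over five far-away cells

# Free-support (continuous) contexts: empirical distribution of particles,
# mu_hat(c) = (1/n_s) * sum_i delta_{c_i^s}(c).
rng = np.random.default_rng(0)
source_particles = rng.normal(loc=0.0, scale=0.1, size=(100, 2))   # c_i^s ~ mu
target_particles = rng.normal(loc=3.0, scale=0.1, size=(100, 2))   # c_i^t ~ nu
particle_weights = np.full(100, 1.0 / 100)                          # uniform weights
```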
4.2 Contextual Distance Metrics

To define the distance between contexts, we start by defining the distance between states. Bisimulation metrics [46, 47, 48] measure states' behavioral equivalence by defining a recursive relationship between states. Recent literature in RL [49, 14] uses the bisimulation concept to train state encoders that facilitate multi-tasking and generalization. [50] uses the bisimulation concept to enforce the policy to visit state-action pairs close to the support of logged transitions in offline RL. However, the bisimulation metric is inherently "pessimistic" because it considers equivalence under all actions, even actions that never lead to positive outcomes for the agent [48]. To address this issue, [48] proposes on-policy $\pi$-bisimulation, which considers the dynamics induced by the policy $\pi$. Similarly, we extend this notion to the CMDP setting and define the contextual $\pi$-bisimulation metric:
$$d^{\pi}_{c_i, c_j}(s_i, s_j) = \big|R^{\pi}_{s_i, c_i} - R^{\pi}_{s_j, c_j}\big| + \gamma\, W_d\big(\mathcal{P}^{\pi}_{s_i, c_i}, \mathcal{P}^{\pi}_{s_j, c_j}\big) \tag{4}$$
where $R^{\pi}_{s, c} := \sum_a \pi(a \mid s)\, R(s, a, c)$ and $\mathcal{P}^{\pi}_{s, c} := \sum_a P(\cdot \mid s, a, c)\, \pi(a \mid s)$. With the definition of the contextual $\pi$-bisimulation metric, we are ready to propose the $\pi$-contextual-distance to measure the distance between two contexts:

Definition 4.1 ($\pi$-contextual-distance). Given a CMDP $\mathcal{M} = (\mathcal{S}, \mathcal{C}, \mathcal{A}, R, P, p_0, \rho, \gamma)$, the distance between two contexts $d^{\pi}(c_i, c_j)$ under the policy $\pi$ is defined as
$$d^{\pi}(c_i, c_j) = \mathbb{E}_{s_i \sim p_0(\cdot \mid c_i),\ s_j \sim p_0(\cdot \mid c_j)}\left[ d^{\pi}_{c_i, c_j}(s_i, s_j) \right] \tag{5}$$
Conceptually, the $\pi$-contextual-distance approximately measures the performance difference between two contexts $c_i$ and $c_j$ under $\pi$. The algorithm to compute a simplified version of this metric under some conditions is detailed in Appendix C.1. Note that it is difficult to compute this metric precisely in general. In practice, we can design and compute surrogate contextual distance metrics, depending on the specific tasks. There are situations where it is reasonable to use $\ell_2$ as a surrogate metric when the contextual distance resembles the Euclidean distance, such as in some goal-conditioned continuous environments [24]. In addition, although the contextual distance is not a strict metric, we can still use it as a ground metric in the OT computation. With the concept of a contextual distance metric, we now introduce the algorithm that generates intermediate task distributions to enable the agent to gradually transfer from the source to the target task distribution.
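As an illustration of the $\ell_2$ surrogate mentioned above, here is a minimal sketch of a surrogate contextual distance and the resulting ground-cost matrix that can be plugged into the OT solvers; the goal-conditioned-style contexts are a hypothetical example, not a definitive implementation of the $\pi$-contextual-distance.

```python
import numpy as np

def surrogate_context_distance(c_i, c_j):
    """l2 surrogate for the pi-contextual-distance; reasonable when contexts
    (e.g., goal or start positions) live in a space that resembles Euclidean space."""
    return float(np.linalg.norm(np.asarray(c_i) - np.asarray(c_j)))

def ground_cost(source_contexts, target_contexts):
    """Pairwise ground-cost matrix d(c_i, c_j) used as the OT ground metric."""
    return np.array([[surrogate_context_distance(ci, cj) for cj in target_contexts]
                     for ci in source_contexts])
```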
4.3 Algorithms

We present our main algorithm in Algorithm 1. We introduce an interpolation factor $\Delta\alpha$ to decide the difference between two subsequent task distributions in the curriculum: a smaller $\Delta\alpha$ means a smaller difference between subsequent task distributions. In the algorithm, we treat $\Delta\alpha$ as a constant for simplicity, but in effect it can be scheduled or even adaptive, which we leave to future work. Note that $\Delta\alpha$ is analogous to the KL divergence constraint in [9, 10, 8]; the main difference is that we use the Wasserstein distance instead of the KL divergence.

At the beginning of each stage in the curriculum, we add $\Delta\alpha$ to the previous $\alpha$ (starting from $\alpha = 0$). We then pass $\alpha$ into the function ComputeBarycenter (Algorithm 2) to generate the intermediate task distribution. Note that when $\alpha = 0$ and $\alpha = 1$, the generated distributions are the source and the target distribution, respectively. In the ComputeBarycenter function, the computation method differs depending on whether the context is discrete or continuous. After generating the intermediate task distribution, we optimize the agent on this task distribution until the cumulative return $G$ reaches the threshold $\bar{G}$. Then the curriculum enters the next stage and repeats the process until $\alpha = 1$. In other words, the path of the intermediate task distributions is the OT geodesic on the manifold defined by the Wasserstein distance.
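The prose above can be summarized in the following sketch of the main curriculum loop. It is a simplified reading of Algorithm 1 for the fixed-support case, not the authors' released implementation; `train_until_threshold` stands in for the RL training subroutine and `compute_barycenter` for Algorithm 2.

```python
import numpy as np
import ot

def compute_barycenter(mu0, mu1, M, alpha, reg=1e-2):
    """Fixed-support intermediate distribution nu_alpha (cf. Equation (3))."""
    if alpha <= 0.0:
        return mu0
    if alpha >= 1.0:
        return mu1
    A = np.vstack([mu0, mu1]).T
    return ot.bregman.barycenter(A, M, reg, weights=np.array([1.0 - alpha, alpha]))

def gradient_curriculum(mu0, mu1, M, train_until_threshold,
                        delta_alpha=0.1, return_threshold=0.0):
    """Sketch of the GRADIENT loop: step alpha, build the intermediate task
    distribution, train until the return threshold, repeat until alpha = 1."""
    alpha = 0.0
    while alpha < 1.0:
        alpha = min(1.0, alpha + delta_alpha)            # advance one curriculum stage
        rho = compute_barycenter(mu0, mu1, M, alpha)     # intermediate task distribution
        train_until_threshold(rho, return_threshold)     # optimize agent until G >= G_bar
```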
4.4 Theoretical Analysis

The proposed algorithm benefits from breaking the difficulty of learning in the target domain into multiple small challenges by designing a sequence of $K$ stages that gradually morph towards the target. This motivates us to theoretically answer the following question:

Can we achieve smooth transfer to a new stage based on the optimal policy of the previous stage?