CEIP: Combining Explicit and Implicit Priors for
Reinforcement Learning with Demonstrations
Kai Yan Alexander G. Schwing Yu-Xiong Wang
University of Illinois Urbana-Champaign
{kaiyan3, aschwing, yxw}@illinois.edu
https://github.com/289371298/CEIP
Abstract
Although reinforcement learning has found widespread use in dense reward settings, training autonomous agents with sparse rewards remains challenging. To
address this difficulty, prior work has shown promising results when using not
only task-specific demonstrations but also task-agnostic albeit somewhat related
demonstrations. In most cases, the available demonstrations are distilled into an
implicit prior, commonly represented via a single deep net. Explicit priors in the
form of a database that can be queried have also been shown to lead to encouraging
results. To better benefit from available demonstrations, we develop a method
to Combine Explicit and Implicit Priors (CEIP). CEIP exploits multiple implicit
priors in the form of normalizing flows in parallel to form a single complex prior.
Moreover, CEIP uses an effective explicit retrieval and push-forward mechanism
to condition the implicit priors. In three challenging environments, we find the
proposed CEIP method to improve upon sophisticated state-of-the-art techniques.
1 Introduction
Reinforcement learning (RL) has found widespread use across domains from robotics [57] and game AI [44] to recommender systems [6]. Despite its success, reinforcement learning is also known to be sample inefficient. For instance, training a robot arm with sparse rewards to sort objects from scratch still requires many training steps, if it is at all feasible [46].
To increase the sample efficiency of reinforcement learning, prior work aims to leverage demonstrations [4, 34, 40]. These demonstrations can be task-specific [4, 17], i.e., they directly correspond to and address the task of interest. More recently, the use of task-agnostic demonstrations has also been studied [14, 16, 34, 46], showing that demonstrations for loosely related tasks can enhance the sample efficiency of reinforcement learning agents.
To benefit from either of these two types of demonstrations, most work distills the information within the demonstrations into an implicit prior, by encoding available demonstrations in a deep net. For example, SKiLD [34] and FIST [16] use a variational auto-encoder (VAE) to encode the “skills,” i.e., action sequences, in a latent space, and train a prior conditioned on states based on demonstrations to use the skills. Differently, PARROT [46] adopts a state-conditional normalizing flow to encode a transformation from a latent space to the actual action space. However, the idea of using the available demonstrations as an explicit prior has not received a lot of attention. Explicit priors enable the agent to maintain a database of demonstrations, which can be used to retrieve state-action sequences given an agent's current state. This technique has been utilized in robotics [32, 47] and early attempts of reinforcement learning with demonstrations [4]. It was also implemented as a baseline in [14]. One notable recent exception is FIST [16], which queries a database of demonstrations using the current state to retrieve a likely next state. The use of an explicit prior was shown to greatly enhance the
performance. However, FIST uses pure imitation learning without any RL, and hence forgoes the opportunity to correct itself through trial and error if the imitation is not good enough.
Our key insight is to leverage demonstrations both explicitly and implicitly, thus benefiting from both worlds. To achieve this, we develop CEIP, a method which combines explicit and implicit priors. CEIP leverages demonstrations implicitly by learning a transformation from a latent space to the real action space via normalizing flows. More importantly, different from prior work such as PARROT and FIST, which combine all the information within a single deep net, CEIP selects the most useful prior by combining multiple flows in parallel to form a single large flow. To benefit from demonstrations explicitly, CEIP augments the input of the normalizing flow with a likely future state, which is retrieved via a lookup from a database of transitions. For an effective retrieval, we propose a push-forward technique which ensures that the database returns future states that have not been referred to yet, encouraging the agent to complete the whole trajectory even if it fails on a single task.
We evaluate the proposed approach on three challenging environments: fetchreach [36], kitchen [11], and office [45]. In each environment, we study the use of both task-specific and task-agnostic demonstrations. We observe that integrating an explicit prior, especially with our proposed push-forward technique, greatly improves results. Notably, the proposed approach works well on sophisticated long-horizon robotics tasks with a few, or sometimes even one, task-specific demonstration.
2 Preliminaries
Reinforcement Learning. Reinforcement learning (RL) aims to train an agent to make the ‘best’ decision towards completing a particular task in a given environment. The environment and the task are often described as a Markov Decision Process (MDP), which is defined by a tuple $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$. In timestep $t$ of the Markov process, the agent observes the current state $s_t \in \mathcal{S}$, and executes an action $a_t \in \mathcal{A}$ following some probability distribution, i.e., policy $\pi(a_t|s_t) \in \Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ denotes the probability simplex over elements in space $\mathcal{A}$. Upon executing action $a_t$, the state of the agent changes to $s_{t+1}$ following the dynamics of the environment, which are governed by the transition function $T(s_t, a_t): \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$. Meanwhile, the agent receives a reward $r(s_t, a_t) \in \mathbb{R}$. The agent aims to maximize the cumulative reward $\sum_t \gamma^t r(s_t, a_t)$, where $\gamma \in [0, 1]$ is the discount factor. One complete run in an environment is called an episode, and the corresponding state-action pairs $\tau = \{(s_1, a_1), (s_2, a_2), \dots\}$ form a trajectory $\tau$.
Normalizing Flows. A normalizing flow [24] is a generative model that transforms elements $z_0$ drawn from a simple distribution $p_z$, e.g., a Gaussian, to elements $a_0$ drawn from a more complex distribution $p_a$. For this transformation, a bijective function $f$ is used, i.e., $a_0 = f(z_0)$. The use of a bijective function ensures that the log-likelihood of the more complex distribution at any point is tractable and that samples of such a distribution can be easily generated by taking samples from the simple distribution and pushing them through the flow. Formally, the core idea of a normalizing flow can be summarized via $p_a(a_0) = p_z\big(f^{-1}(a_0)\big) \left| \frac{\partial f^{-1}(a)}{\partial a} \right|_{a=a_0}$, where $|\cdot|$ is the determinant (guaranteed positive by flow designs), $a$ is a random variable with the desired more complex distribution, and $z$ is a random variable governed by a simple distribution. To efficiently compute the determinant of the Jacobian matrix of $f^{-1}$, special constraints are imposed on the form of $f$. For example, coupling flows like RealNVP [8] and autoregressive flows [31] impose the Jacobian of $f^{-1}$ to be triangular.
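For intuition, consider a 1-layer conditional affine flow of the kind used later in Sec. 3.2.1, with a standard-normal base distribution; the following worked example (a sketch, not taken verbatim from the paper) shows why its log-likelihood is cheap to evaluate:
$$a = f(z; u) = \exp\{c(u)\} \odot z + d(u), \qquad z \sim \mathcal{N}(0, I),$$
$$f^{-1}(a; u)_j = \big(a_j - d_j(u)\big)\exp\{-c_j(u)\}, \qquad \frac{\partial f^{-1}(a)}{\partial a} = \operatorname{diag}\big(\exp\{-c_1(u)\}, \dots, \exp\{-c_q(u)\}\big),$$
$$\log p_a(a \mid u) = \log p_z\big(f^{-1}(a; u)\big) - \sum_{j=1}^{q} c_j(u),$$
i.e., the Jacobian is diagonal and the determinant term reduces to a sum of the predicted log-scales.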
3 CEIP: Combining Explicit and Implicit Priors

3.1 Overview

As illustrated in Fig. 1, our goal is to train an autonomous agent to solve challenging tasks despite sparse rewards, such as controlling a robot arm to complete item manipulation tasks (like turning on a switch or opening a cabinet). For this we aim to benefit from available demonstrations. Formally, we consider a task-specific dataset $D^{\text{TS}} = \{\tau^{\text{TS}}_1, \tau^{\text{TS}}_2, \dots, \tau^{\text{TS}}_m\}$, where $\tau^{\text{TS}}_i$ is the $i$-th trajectory of the task-specific dataset, and a task-agnostic dataset $D^{\text{TA}} = \bigcup\{D_i \mid i \in \{1, 2, 3, \dots, n\}\}$, where $D_i = \{\tau^i_1, \tau^i_2, \dots, \tau^i_{m_i}\}$ subsumes the demonstration trajectories for the $i$-th task in the task-agnostic dataset. Each trajectory $\tau = \{(s_1, a_1), (s_2, a_2), \dots\}$ in the dataset is a state-action pair sequence of a complete episode, where $s$ is the state and $a$ is the action. We assume that the number of available task-specific trajectories is very small, i.e., $\sum_{i=1}^{n} m_i \gg m$, which is common in practice. For readability, we will also refer to $D^{\text{TS}}$ using $D_{n+1}$.

Figure 1: Overview of our proposed approach, CEIP. Our approach can be divided into three steps: a) cluster the task-agnostic dataset into different tasks, and then train one flow on each of the $n$ tasks of the task-agnostic dataset; b) train a flow on the task-specific dataset, and then train the coefficients to combine the $n+1$ flows into one large flow $f^{\text{TS}}$, which is the implicit prior; c) conduct reinforcement learning on the target task; for each timestep, we perform a dataset lookup in the task-specific dataset to find the state most similar to the current state $s$, and return the likely next state $\hat{s}_{\text{next}}$ in the trajectory, which is the explicit prior.
Our approach leverages demonstrations implicitly by training a normalizing flow $f^{\text{TS}}$, which transforms the probability distribution represented by a policy $\pi(z|s)$ over a simple latent probability space $\mathcal{Z}$, i.e., $z \in \mathcal{Z}$, into a reasonable expert policy over the space of real-world actions $\mathcal{A}$. As before, $s$ is the current environment state. Thus, the downstream RL agent only needs to learn a policy $\pi(z|s)$ that results in a probability distribution over the latent space $\mathcal{Z}$, which is subsequently mapped via the flow $f^{\text{TS}}$ to a real-world action $a \in \mathcal{A}$. Intuitively, the MDP in the latent space is governed by a less complex probability distribution, making it easier to train, because the flow increases the exposure of more likely actions while reducing the chance that a less likely action is chosen. This is because the flow reduces the probability mass for less likely actions given the current state.

Task-agnostic demonstrations contain useful patterns that may be related to the task at hand. However, not all task-agnostic data are equally useful, as different task-agnostic data may require exposing different parts of the action space. Therefore, different from prior work where all data are fed into the same deep net model, we first partition the task-agnostic dataset into different groups according to task similarity so as to increase flexibility. For this we use a classical $k$-means algorithm. We then train a different flow $f_i$ on each of the groups, and finally combine the flows via learned coefficients into a single flow $f^{\text{TS}}$. Beneficially, this process permits exposing different parts of the action space as needed, according to perceived task similarity.

Lastly, our approach further leverages demonstrations explicitly, by conditioning the flow not only on the current state but also on a likely next state, to better inform the agent of the state it should try to achieve with its current action. In the following, we first discuss the implicit prior of CEIP in Sec. 3.2; afterward we discuss our explicit prior in Sec. 3.3, and the downstream reinforcement learning with both priors in Sec. 3.4.
3.2 Implicit Prior

To better benefit from demonstrations implicitly, we use a 1-layer normalizing flow as the backbone of our implicit prior. It essentially corresponds to a conditioned affine transformation of a Gaussian distribution. We choose a flow-based model instead of a VAE-based one for two reasons: 1) as the dimensionality before and after the transformation via a normalizing flow remains identical and since the flow is invertible, the agent is guaranteed to have control over the whole action space. This ensures that all parts of the action space are accessible, which is not guaranteed by VAE-based methods like SKiLD or FIST; 2) normalizing flows, especially coupling flows such as RealNVP [8], can be easily stacked horizontally, so that the combination of parallel flows is also a flow. Among feasible flow models, we found that the simplest 1-layer flow suffices to achieve good results, and is even more robust in training than a more complex RealNVP. Next, in Sec. 3.2.1 we first introduce details regarding the normalizing flow $f_i$, before we discuss in Sec. 3.2.2 how to combine the flows into one flow $f^{\text{TS}}$ applicable to the task for which the task-specific dataset contains demonstrations.
Figure 2: An illustration of how we combine different flows into one large flow for the task-specific dataset. Each red block of “NN” stands for a neural network. Note that $c_i(u)$ and $d_i(u)$ are vectors, while $\mu_i$ and $\lambda_i$ are the $i$-th dimension of $\mu(u)$ and $\lambda(u)$.
3.2.1 Normalizing Flow Prior. For each task $i$ in the task-agnostic dataset, i.e., for each $D_i$, we train a conditional 1-layer normalizing flow $f_i(z; u) = a$ which maps a latent space variable $z \in \mathbb{R}^q$ to an action $a \in \mathbb{R}^q$, where $q$ is the number of dimensions of the real-valued action vector. We let $u$ refer to a conditioning variable. In our case, $u$ is either the current environment state $s$ (if no explicit prior is used) or a concatenation of the current and a likely next state $[s, s_{\text{next}}]$ (if an explicit prior is used). Concretely, the formulation of our 1-layer flow is
$$f_i(z; u) = a = \exp\{c_i(u)\} \odot z + d_i(u), \qquad (1)$$
where $c_i(u) \in \mathbb{R}^q$ and $d_i(u) \in \mathbb{R}^q$ are trainable deep nets, and $\odot$ refers to the Hadamard product. The $\exp$ function is applied elementwise. When training the flow, we sample state-action pairs (without explicit prior) or transitions (with explicit prior) $(u, a)$ from the dataset $D_i$, and maximize the log-likelihood $\mathbb{E}_{(u,a)\sim D_i} \log p(a|u)$; refer to [24] for how to maximize this objective.
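As a concrete illustration, the following is a minimal PyTorch sketch of such a conditional 1-layer flow and its maximum-likelihood objective; the network sizes, optimizer settings, and data-loader interface are illustrative assumptions rather than the exact released implementation.

```python
import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    """1-layer conditional affine flow: a = exp(c(u)) * z + d(u)  (cf. Eq. (1))."""
    def __init__(self, cond_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        # c(u) and d(u) are small MLPs conditioned on u = s or u = [s, s_next].
        self.c = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))
        self.d = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))

    def forward(self, z, u):
        # Map a latent-space sample z to a real-world action a.
        return torch.exp(self.c(u)) * z + self.d(u)

    def log_prob(self, a, u):
        # Change of variables with a standard-normal base distribution:
        # z = (a - d(u)) / exp(c(u)),  log p(a|u) = log N(z; 0, I) - sum_j c_j(u).
        c, d = self.c(u), self.d(u)
        z = (a - d) * torch.exp(-c)
        base = torch.distributions.Normal(torch.zeros_like(z), torch.ones_like(z))
        return base.log_prob(z).sum(-1) - c.sum(-1)

def train_flow(flow, loader, epochs=50, lr=3e-4):
    # Maximize E_{(u,a)~D_i} log p(a|u) over one sub-dataset D_i.
    opt = torch.optim.Adam(flow.parameters(), lr=lr)
    for _ in range(epochs):
        for u, a in loader:  # u is s (or [s, s_next]); a is the demonstrated action
            loss = -flow.log_prob(a, u).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
```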
In the discussion above, we assume the decomposition of the task-agnostic dataset into tasks to be given. If such a decomposition is not provided (e.g., for the kitchen and office environments in our experiments), we perform $k$-means clustering to divide the task-agnostic dataset into different parts. The clustering algorithm operates on the last state of a trajectory, which is used to represent the whole trajectory. The intuition is two-fold. First, for many real-world MDPs, achieving a particular terminal state is more important than the actions taken [12]. For example, when we control a robot to pick and place items, we want all target items to reach the right place eventually; however, we do not care too much about the actions taken to achieve this state. Second, among all the states, the final state is often the most informative about the task that the agent has completed. The number of clusters $k$ in the $k$-means algorithm is a hyperparameter, which empirically should be larger than the number of dimensions of the action space. Though we assume the task-agnostic dataset is partitioned into labeled clusters, our experiments show that our approach is robust and good results are achieved even without a precise ground-truth decomposition.
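A sketch of this clustering step, assuming scikit-learn's KMeans and trajectories stored as lists of (state, action) pairs; the grouping interface is hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_task_agnostic(trajectories, k):
    """Partition task-agnostic trajectories into k pseudo-tasks by their final state."""
    # The last state of each trajectory represents the whole trajectory.
    final_states = np.stack([traj[-1][0] for traj in trajectories])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(final_states)
    # Group trajectories by cluster label: D_1, ..., D_k.
    return [[traj for traj, lbl in zip(trajectories, labels) if lbl == i] for i in range(k)]
```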
In addition to the flows for the clusters of the task-agnostic dataset, we train a flow $f_{n+1}(z; u) = a$ on the task-specific dataset $D_{n+1} = D^{\text{TS}}$, using the same maximum log-likelihood loss. Training this flow is optional but always possible, since the task-specific data are available by assumption. It is not necessary when the task is relatively simple and the episodes are short (e.g., the fetchreach environment in the experiment section), but it becomes particularly helpful in scenarios where some subtasks of a task sequence only appear in the task-specific dataset (e.g., the kitchen environment).
3.2.2 Few-shot Adaptation. The flow models discussed in Sec. 3.2.1 learn which parts of the action space should be more strongly exposed from the latent space. However, not all flows expose parts of the action space that are useful for the current state. For example, the target task may require the agent to move its gripper upwards at a particular location, whereas in the task-agnostic dataset the robot more often moves the gripper downwards to finish another task. In order to select the most useful prior, we need to tune our set of flows learned on the task-agnostic datasets to the small number of trajectories available in the task-specific dataset. To ensure that this does not lead to overfitting, as only a very small number of task-specific trajectories are available, we train a set of coefficients that selects the flow that works best for the current task. Concretely, given all the trained flows, we train a set of coefficients to combine the flows $f_1$ to $f_n$ trained on the task-agnostic data, as well as the flow $f_{n+1}$ trained on the task-specific data. The coefficients select from the set of available flows the most useful one. To achieve this, we use the combination flow illustrated in Fig. 2, which is formally specified as follows:
$$f^{\text{TS}}(z; u) = \left( \sum_{i=1}^{n+1} \mu_i(u) \exp\{c_i(u)\} \right) \odot z + \left( \sum_{i=1}^{n+1} \lambda_i(u)\, d_i(u) \right). \qquad (2)$$
Here, $\mu_i(u) \in \mathbb{R}$ and $\lambda_i(u) \in \mathbb{R}$ are the $i$-th entries of the deep nets $\mu(u) \in \mathbb{R}^{n+1}$ and $\lambda(u) \in \mathbb{R}^{n+1}$, respectively, which yield the coefficients, while the deep nets $c_i$ and $d_i$ are frozen. As before, the $\exp$ function is applied elementwise. We use a softplus activation and an offset at the output of $\mu$ to force $\mu_i(u) \ge 10^{-4}$ for any $i$, for numerical stability. Note that the combined flow $f^{\text{TS}}$, consisting of multiple 1-layer flows, is also a 1-layer normalizing flow. Hence, all the compelling properties over VAE-based architectures described at the beginning of Sec. 3.2 remain valid. To train the combined flow, we use the same log-likelihood loss $\mathbb{E}_{(u,a)\sim D^{\text{TS}}} \log p(a|u)$ as that for training single flows. Here, we optimize the deep nets $\mu(u)$ and $\lambda(u)$ which parameterize $f^{\text{TS}}$.

Obviously, the employed combination of flows can be straightforwardly extended to more complicated flows, e.g., RealNVP [8] or Glow [22]. However, we found the discussed simple formulation to work remarkably well and to be robust.
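For illustration, the combination in Eq. (2) can be sketched as follows, reusing the `ConditionalAffineFlow` sketch above; the coefficient-network sizes and the exact softplus offset are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedFlow(nn.Module):
    """Combine n+1 frozen 1-layer flows via learned coefficients mu(u), lambda(u)  (cf. Eq. (2))."""
    def __init__(self, flows, cond_dim: int, hidden: int = 128):
        super().__init__()
        self.flows = nn.ModuleList(flows)
        for p in self.flows.parameters():
            p.requires_grad_(False)          # c_i(u) and d_i(u) stay frozen
        n_plus_1 = len(flows)
        self.mu_net = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_plus_1))
        self.lam_net = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_plus_1))

    def _scale_shift(self, u):
        # mu_i(u) >= 1e-4 via softplus plus an offset, as described in Sec. 3.2.2.
        mu = F.softplus(self.mu_net(u)) + 1e-4          # (batch, n+1)
        lam = self.lam_net(u)                           # (batch, n+1)
        scale = sum(mu[:, i:i+1] * torch.exp(f.c(u)) for i, f in enumerate(self.flows))
        shift = sum(lam[:, i:i+1] * f.d(u) for i, f in enumerate(self.flows))
        return scale, shift

    def forward(self, z, u):
        scale, shift = self._scale_shift(u)
        return scale * z + shift

    def log_prob(self, a, u):
        # The combination is still a 1-layer affine flow, so the same
        # change-of-variables formula applies with the combined scale.
        scale, shift = self._scale_shift(u)
        z = (a - shift) / scale
        base = torch.distributions.Normal(torch.zeros_like(z), torch.ones_like(z))
        return base.log_prob(z).sum(-1) - torch.log(scale).sum(-1)
```

Training then maximizes `log_prob` on the task-specific dataset while only `mu_net` and `lam_net` receive gradients.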
3.3 Explicit Prior

Beyond distilling information from demonstrations into deep nets which are then used as implicit priors, we find the explicit use of demonstrations to also be remarkably useful. To benefit, we encode future state information into the input of the flow. More specifically, instead of sampling $(s, a)$ pairs from a dataset $D$ for training the flows, we consider sampling a transition $(s, a, s_{\text{next}})$ from $D$. During training, we concatenate $s$ and $s_{\text{next}}$ before feeding it into a flow, i.e., $u = [s, s_{\text{next}}]$ instead of $u = s$.

However, we do not know the future state $s_{\text{next}}$ when deploying the policy. To obtain an estimate, we use task-specific demonstrations as explicit priors. More formally, we use the trajectories within the task-specific dataset $D^{\text{TS}}$ as a database. This is manageable as we assume the task-specific dataset to be small. For each environment step of reinforcement learning with current state $s$, we perform a lookup, where $s$ is the query, the states $s_{\text{key}}$ in the trajectories are the keys, and their corresponding next states $s_{\text{next}}$ are the values. Concretely, we assume $s_{\text{next}}$ belongs to trajectory $\tau$ in the task-specific dataset $D^{\text{TS}}$, and define $\hat{s}_{\text{next}}$ as the result of the database retrieval with respect to the given query $s$, i.e.,
$$\hat{s}_{\text{next}} = \operatorname*{argmin}_{s_{\text{next}} \,|\, (s_{\text{key}}, a, s_{\text{next}}) \in D^{\text{TS}}} \left[ (s_{\text{key}} - s)^2 + C \cdot \delta(s_{\text{next}}) \right], \quad \text{where}$$
$$\delta(s_{\text{next}}) = \begin{cases} 1 & \text{if } \exists\, s'_{\text{next}} \in \tau \text{ s.t. } s'_{\text{next}} \text{ is no earlier than } s_{\text{next}} \text{ in } \tau \text{ and has been retrieved}, \\ 0 & \text{otherwise}. \end{cases} \qquad (3)$$
In Eq. (3), $C$ is a constant and $\delta$ is the indicator function. We set $u = [s, \hat{s}_{\text{next}}]$ as the condition, feed it into the trained flow $f^{\text{TS}}$, and map the latent space element $z$ obtained from the RL policy to the real-world action $a$. The penalty term $\delta$ is a push-forward technique, which aims to push the agent to move forward instead of staying put, imposing monotonicity on the retrieved $\hat{s}_{\text{next}}$. Consider an agent at a particular state $s$ and a flow $f^{\text{TS}}$, conditioned on $u = [s, \hat{s}_{\text{next}}]$, which maps the chosen action $z$ to a real-world action $a$ that does not modify the environment. Without the penalty term, the agent will remain at the same state and retrieve the same likely next state, which again maps onto the action that does not change the environment. Intuitively, this term discourages 1) retrieving the same state twice, and 2) returning to earlier states in a given trajectory. In our experiments, we set $C = 1$.
3.4 Reinforcement Learning with Priors

Given the implicit and explicit priors, we use RL to train a policy $\pi(z|s)$ to accomplish the target task demonstrated in the task-specific dataset. As shown in Fig. 1, the RL agent receives a state $s$ and provides a latent space element $z$. The conditioning variable of the flow is retrieved via the dataset lookup described in Sec. 3.3, and the real-world action $a$ is then computed using the flow. Note, our approach is suitable for any RL method, i.e., the policy $\pi(z|s)$ can be trained using any RL algorithm, such as proximal policy optimization (PPO) [43] or soft actor-critic (SAC) [15].
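To make the interplay of the pieces concrete, a sketch of a single environment step is shown below; the `policy.sample` interface, the older gym-style `env.step` return signature, and the reuse of the sketches above are assumptions for illustration:

```python
import numpy as np
import torch

def environment_step(env, s, policy, flow, explicit_prior):
    """One environment step with both priors: the RL policy acts in the latent space Z,
    and the combined flow, conditioned on u = [s, s_next_hat], maps z to a real action."""
    s_next_hat = explicit_prior.query(s)                                       # explicit prior lookup (Sec. 3.3)
    u = torch.as_tensor(np.concatenate([s, s_next_hat]), dtype=torch.float32).unsqueeze(0)
    z = torch.as_tensor(policy.sample(s), dtype=torch.float32).unsqueeze(0)    # z ~ pi(z|s), e.g., from PPO/SAC
    a = flow(z, u).squeeze(0).detach().numpy()                                 # implicit prior maps z to action a
    s_new, reward, done, info = env.step(a)                                    # older gym-style 4-tuple return
    return s_new, reward, done
```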