CEIP: Combining Explicit and Implicit Priors for
Reinforcement Learning with Demonstrations
Kai Yan Alexander G. Schwing Yu-Xiong Wang
University of Illinois Urbana-Champaign
{kaiyan3, aschwing, yxw}@illinois.edu
https://github.com/289371298/CEIP
Abstract
Although reinforcement learning has found widespread use in dense reward settings, training autonomous agents with sparse rewards remains challenging. To
address this difficulty, prior work has shown promising results when using not
only task-specific demonstrations but also task-agnostic albeit somewhat related
demonstrations. In most cases, the available demonstrations are distilled into an
implicit prior, commonly represented via a single deep net. Explicit priors in the
form of a database that can be queried have also been shown to lead to encouraging
results. To better benefit from available demonstrations, we develop a method
to Combine Explicit and Implicit Priors (CEIP). CEIP exploits multiple implicit
priors in the form of normalizing flows in parallel to form a single complex prior.
Moreover, CEIP uses an effective explicit retrieval and push-forward mechanism
to condition the implicit priors. In three challenging environments, we find the
proposed CEIP method to improve upon sophisticated state-of-the-art techniques.
1 Introduction
Reinforcement learning (RL) has found widespread use across domains from robotics [57] and game AI [44] to recommender systems [6]. Despite its success, reinforcement learning is also known to be sample inefficient. For instance, training a robot arm with sparse rewards to sort objects from scratch still requires many training steps, if it is at all feasible [46].
To increase the sample efficiency of reinforcement learning, prior work aims to leverage demonstrations [4, 34, 40]. These demonstrations can be task-specific [4, 17], i.e., they directly correspond to and address the task of interest. More recently, the use of task-agnostic demonstrations has also been studied [14, 16, 34, 46], showing that demonstrations for loosely related tasks can enhance the sample efficiency of reinforcement learning agents.
To benefit from either of these two types of demonstrations, most work distills the information within the demonstrations into an implicit prior, by encoding available demonstrations in a deep net. For example, SKiLD [34] and FIST [16] use a variational auto-encoder (VAE) to encode the “skills,” i.e., action sequences, in a latent space, and train a prior conditioned on states based on demonstrations to use the skills. Differently, PARROT [46] adopts a state-conditional normalizing flow to encode a transformation from a latent space to the actual action space. However, the idea of using the available demonstrations as an explicit prior has not received a lot of attention. Explicit priors enable the agent to maintain a database of demonstrations, which can be used to retrieve state-action sequences given an agent's current state. This technique has been utilized in robotics [32, 47] and early attempts of reinforcement learning with demonstrations [4]. It was also implemented as a baseline in [14]. One notable recent exception is FIST [16], which queries a database of demonstrations using the current state to retrieve a likely next state. The use of an explicit prior was shown to greatly enhance the
performance. However, FIST uses pure imitation learning without any RL, and hence forgoes the opportunity to correct itself through trial and error if the imitation is not good enough.
Our key insight is to leverage demonstrations both explicitly and implicitly, thus benefiting from both worlds. To achieve this, we develop CEIP, a method which combines explicit and implicit priors. CEIP leverages demonstrations implicitly by learning a transformation from a latent space to the real action space via normalizing flows. More importantly, different from prior work such as PARROT and FIST, which combine all the information within a single deep net, CEIP selects the most useful prior by combining multiple flows in parallel to form a single large flow. To benefit from demonstrations explicitly, CEIP augments the input of the normalizing flow with a likely future state, which is retrieved via a lookup from a database of transitions. For an effective retrieval, we propose a push-forward technique which ensures that the database returns future states that have not been referred to yet, encouraging the agent to complete the whole trajectory even if it fails on a single task.
We evaluate the proposed approach on three challenging environments: fetchreach [36], kitchen [11], and office [45]. In each environment, we study the use of both task-specific and task-agnostic demonstrations. We observe that integrating an explicit prior, especially with our proposed push-forward technique, greatly improves results. Notably, the proposed approach works well on sophisticated long-horizon robotics tasks with a few, or sometimes even one, task-specific demonstration.
2 Preliminaries
Reinforcement Learning. Reinforcement learning (RL) aims to train an agent to make the ‘best’ decision towards completing a particular task in a given environment. The environment and the task are often described as a Markov Decision Process (MDP), which is defined by a tuple $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$. In timestep $t$ of the Markov process, the agent observes the current state $s_t \in \mathcal{S}$, and executes an action $a_t \in \mathcal{A}$ following some probability distribution, i.e., policy $\pi(a_t|s_t) \in \Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ denotes the probability simplex over elements in space $\mathcal{A}$. Upon executing action $a_t$, the state of the agent changes to $s_{t+1}$ following the dynamics of the environment, which are governed by the transition function $T(s_t, a_t): \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$. Meanwhile, the agent receives a reward $r(s_t, a_t) \in \mathbb{R}$. The agent aims to maximize the cumulative reward $\sum_t \gamma^t r(s_t, a_t)$, where $\gamma \in [0, 1]$ is the discount factor. One complete run in an environment is called an episode, and the corresponding state-action pairs $\tau = \{(s_1, a_1), (s_2, a_2), \dots\}$ form a trajectory $\tau$.
Normalizing Flows. A normalizing flow [24] is a generative model that transforms elements $z_0$ drawn from a simple distribution $p_z$, e.g., a Gaussian, to elements $a_0$ drawn from a more complex distribution $p_a$. For this transformation, a bijective function $f$ is used, i.e., $a_0 = f(z_0)$. The use of a bijective function ensures that the log-likelihood of the more complex distribution at any point is tractable and that samples of such a distribution can be easily generated by taking samples from the simple distribution and pushing them through the flow. Formally, the core idea of a normalizing flow can be summarized via $p_a(a_0) = p_z\big(f^{-1}(a_0)\big) \left| \frac{\partial f^{-1}(a)}{\partial a} \right|_{a=a_0}$, where $|\cdot|$ is the determinant (guaranteed positive by flow designs), $a$ is a random variable with the desired more complex distribution, and $z$ is a random variable governed by a simple distribution. To efficiently compute the determinant of the Jacobian matrix of $f^{-1}$, special constraints are imposed on the form of $f$. For example, coupling flows like RealNVP [8] and autoregressive flows [31] impose the Jacobian of $f^{-1}$ to be triangular.
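For intuition, consider a 1-layer conditional affine flow of the kind used later in Sec. 3.2.1, with a standard-normal base distribution; the following worked example (a sketch, not taken verbatim from the paper) shows why its log-likelihood is cheap to evaluate:
$$a = f(z; u) = \exp\{c(u)\} \odot z + d(u), \qquad z \sim \mathcal{N}(0, I),$$
$$f^{-1}(a; u)_j = \big(a_j - d_j(u)\big)\exp\{-c_j(u)\}, \qquad \frac{\partial f^{-1}(a)}{\partial a} = \operatorname{diag}\big(\exp\{-c_1(u)\}, \dots, \exp\{-c_q(u)\}\big),$$
$$\log p_a(a \mid u) = \log p_z\big(f^{-1}(a; u)\big) - \sum_{j=1}^{q} c_j(u),$$
i.e., the Jacobian is diagonal and the determinant term reduces to a sum of the predicted log-scales.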
3 CEIP: Combining Explicit and Implicit Priors

3.1 Overview

As illustrated in Fig. 1, our goal is to train an autonomous agent to solve challenging tasks despite sparse rewards, such as controlling a robot arm to complete item manipulation tasks (like turning on a switch or opening a cabinet). For this we aim to benefit from available demonstrations. Formally, we consider a task-specific dataset $D^{\text{TS}} = \{\tau^{\text{TS}}_1, \tau^{\text{TS}}_2, \dots, \tau^{\text{TS}}_m\}$, where $\tau^{\text{TS}}_i$ is the $i$-th trajectory of the task-specific dataset, and a task-agnostic dataset $D^{\text{TA}} = \bigcup\{D_i \mid i \in \{1, 2, 3, \dots, n\}\}$, where $D_i = \{\tau^i_1, \tau^i_2, \dots, \tau^i_{m_i}\}$ subsumes the demonstration trajectories for the $i$-th task in the task-agnostic dataset. Each trajectory $\tau = \{(s_1, a_1), (s_2, a_2), \dots\}$ in the dataset is a state-action pair sequence of a complete episode, where $s$ is the state and $a$ is the action. We assume that the number of available task-specific trajectories is very small, i.e., $\sum_{i=1}^{n} m_i \gg m$, which is common in practice. For readability, we will also refer to $D^{\text{TS}}$ using $D_{n+1}$.

Figure 1: Overview of our proposed approach, CEIP. Our approach can be divided into three steps: a) cluster the task-agnostic dataset into different tasks, and then train one flow on each of the $n$ tasks of the task-agnostic dataset; b) train a flow on the task-specific dataset, and then train the coefficients to combine the $n+1$ flows into one large flow $f^{\text{TS}}$, which is the implicit prior; c) conduct reinforcement learning on the target task; for each timestep, we perform a dataset lookup in the task-specific dataset to find the state most similar to the current state $s$, and return the likely next state $\hat{s}_{\text{next}}$ in the trajectory, which is the explicit prior.
Our approach leverages demonstrations implicitly by training a normalizing flow $f^{\text{TS}}$, which transforms the probability distribution represented by a policy $\pi(z|s)$ over a simple latent probability space $\mathcal{Z}$, i.e., $z \in \mathcal{Z}$, into a reasonable expert policy over the space of real-world actions $\mathcal{A}$. As before, $s$ is the current environment state. Thus, the downstream RL agent only needs to learn a policy $\pi(z|s)$ that results in a probability distribution over the latent space $\mathcal{Z}$, which is subsequently mapped via the flow $f^{\text{TS}}$ to a real-world action $a \in \mathcal{A}$. Intuitively, the MDP in the latent space is governed by a less complex probability distribution, making it easier to train, because the flow increases the exposure of more likely actions while reducing the chance that a less likely action is chosen. This is because the flow reduces the probability mass for less likely actions given the current state.

Task-agnostic demonstrations contain useful patterns that may be related to the task at hand. However, not all task-agnostic data are equally useful, as different task-agnostic data may require exposing different parts of the action space. Therefore, different from prior work where all data are fed into the same deep net model, we first partition the task-agnostic dataset into different groups according to task similarity so as to increase flexibility. For this we use a classical $k$-means algorithm. We then train a different flow $f_i$ on each of the groups, and finally combine the flows via learned coefficients into a single flow $f^{\text{TS}}$. Beneficially, this process permits exposing different parts of the action space as needed, according to perceived task similarity.

Lastly, our approach further leverages demonstrations explicitly, by conditioning the flow not only on the current state but also on a likely next state, to better inform the agent of the state it should try to achieve with its current action. In the following, we first discuss the implicit prior of CEIP in Sec. 3.2; afterward we discuss our explicit prior in Sec. 3.3, and the downstream reinforcement learning with both priors in Sec. 3.4.
3.2 Implicit Prior

To better benefit from demonstrations implicitly, we use a 1-layer normalizing flow as the backbone of our implicit prior. It essentially corresponds to a conditioned affine transformation of a Gaussian distribution. We choose a flow-based model instead of a VAE-based one for two reasons: 1) as the dimensionality before and after the transformation via a normalizing flow remains identical and since the flow is invertible, the agent is guaranteed to have control over the whole action space. This ensures that all parts of the action space are accessible, which is not guaranteed by VAE-based methods like SKiLD or FIST; 2) normalizing flows, especially coupling flows such as RealNVP [8], can be easily stacked horizontally, so that the combination of parallel flows is also a flow. Among feasible flow models, we found that the simplest 1-layer flow suffices to achieve good results, and is even more robust in training than a more complex RealNVP. Next, in Sec. 3.2.1 we first introduce details regarding the normalizing flow $f_i$, before we discuss in Sec. 3.2.2 how to combine the flows into one flow $f^{\text{TS}}$ applicable to the task for which the task-specific dataset contains demonstrations.
Figure 2: An illustration of how we combine different flows into one large flow for the task-specific dataset. Each red block of “NN” stands for a neural network. Note that $c_i(u)$ and $d_i(u)$ are vectors, while $\mu_i$ and $\lambda_i$ are the $i$-th dimension of $\mu(u)$ and $\lambda(u)$.
3.2.1 Normalizing Flow Prior. For each task $i$ in the task-agnostic dataset, i.e., for each $D_i$, we train a conditional 1-layer normalizing flow $f_i(z; u) = a$ which maps a latent space variable $z \in \mathbb{R}^q$ to an action $a \in \mathbb{R}^q$, where $q$ is the number of dimensions of the real-valued action vector. We let $u$ refer to a conditioning variable. In our case, $u$ is either the current environment state $s$ (if no explicit prior is used) or a concatenation of the current and a likely next state $[s, s_{\text{next}}]$ (if an explicit prior is used). Concretely, the formulation of our 1-layer flow is
$$f_i(z; u) = a = \exp\{c_i(u)\} \odot z + d_i(u), \qquad (1)$$
where $c_i(u) \in \mathbb{R}^q$ and $d_i(u) \in \mathbb{R}^q$ are trainable deep nets, and $\odot$ refers to the Hadamard product. The $\exp$ function is applied elementwise. When training the flow, we sample state-action pairs (without explicit prior) or transitions (with explicit prior) $(u, a)$ from the dataset $D_i$, and maximize the log-likelihood $\mathbb{E}_{(u,a)\sim D_i} \log p(a|u)$; refer to [24] for how to maximize this objective.
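As a concrete illustration, the following is a minimal PyTorch sketch of such a conditional 1-layer flow and its maximum-likelihood objective; the network sizes, optimizer settings, and data-loader interface are illustrative assumptions rather than the exact released implementation.

```python
import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    """1-layer conditional affine flow: a = exp(c(u)) * z + d(u)  (cf. Eq. (1))."""
    def __init__(self, cond_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        # c(u) and d(u) are small MLPs conditioned on u = s or u = [s, s_next].
        self.c = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))
        self.d = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))

    def forward(self, z, u):
        # Map a latent-space sample z to a real-world action a.
        return torch.exp(self.c(u)) * z + self.d(u)

    def log_prob(self, a, u):
        # Change of variables with a standard-normal base distribution:
        # z = (a - d(u)) / exp(c(u)),  log p(a|u) = log N(z; 0, I) - sum_j c_j(u).
        c, d = self.c(u), self.d(u)
        z = (a - d) * torch.exp(-c)
        base = torch.distributions.Normal(torch.zeros_like(z), torch.ones_like(z))
        return base.log_prob(z).sum(-1) - c.sum(-1)

def train_flow(flow, loader, epochs=50, lr=3e-4):
    # Maximize E_{(u,a)~D_i} log p(a|u) over one sub-dataset D_i.
    opt = torch.optim.Adam(flow.parameters(), lr=lr)
    for _ in range(epochs):
        for u, a in loader:  # u is s (or [s, s_next]); a is the demonstrated action
            loss = -flow.log_prob(a, u).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
```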
In the discussion above, we assume the decomposition of the task-agnostic dataset into tasks to be given. If such a decomposition is not provided (e.g., for the kitchen and office environments in our experiments), we perform $k$-means clustering to divide the task-agnostic dataset into different parts. The clustering algorithm operates on the last state of a trajectory, which is used to represent the whole trajectory. The intuition is two-fold. First, for many real-world MDPs, achieving a particular terminal state is more important than the actions taken [12]. For example, when we control a robot to pick and place items, we want all target items to reach the right place eventually; however, we do not care too much about the actions taken to achieve this state. Second, among all the states, the final state is often the most informative about the task that the agent has completed. The number of clusters $k$ in the $k$-means algorithm is a hyperparameter, which empirically should be larger than the number of dimensions of the action space. Though we assume the task-agnostic dataset is partitioned into labeled clusters, our experiments show that our approach is robust and good results are achieved even without a precise ground-truth decomposition.
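A sketch of this clustering step, assuming scikit-learn's KMeans and trajectories stored as lists of (state, action) pairs; the grouping interface is hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_task_agnostic(trajectories, k):
    """Partition task-agnostic trajectories into k pseudo-tasks by their final state."""
    # The last state of each trajectory represents the whole trajectory.
    final_states = np.stack([traj[-1][0] for traj in trajectories])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(final_states)
    # Group trajectories by cluster label: D_1, ..., D_k.
    return [[traj for traj, lbl in zip(trajectories, labels) if lbl == i] for i in range(k)]
```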
In addition to the flows for the clusters of the task-agnostic dataset, we train a flow $f_{n+1}(z; u) = a$ on the task-specific dataset $D_{n+1} = D^{\text{TS}}$, using the same maximum log-likelihood loss. Training this flow is optional but always possible, since the task-specific data are available by assumption. It is not necessary when the task is relatively simple and the episodes are short (e.g., the fetchreach environment in the experiment section), but it becomes particularly helpful in scenarios where some subtasks of a task sequence only appear in the task-specific dataset (e.g., the kitchen environment).
3.2.2 Few-shot Adaptation. The flow models discussed in Sec. 3.2.1 learn which parts of the action space should be more strongly exposed from the latent space. However, not all flows expose parts of the action space that are useful for the current state. For example, the target task may require the agent to move its gripper upwards at a particular location, whereas in the task-agnostic dataset the robot more often moves the gripper downwards to finish another task. In order to select the most useful prior, we need to tune our set of flows learned on the task-agnostic datasets to the small number of trajectories available in the task-specific dataset. To ensure that this does not lead to overfitting, as only a very small number of task-specific trajectories are available, we train a set of coefficients that selects the flow that works best for the current task. Concretely, given all the trained flows, we train a set of coefficients to combine the flows $f_1$ to $f_n$ trained on the task-agnostic data, as well as the flow $f_{n+1}$ trained on the task-specific data. The coefficients select from the set of available flows the most useful one. To achieve this, we use the combination flow illustrated in Fig. 2, which is formally specified as follows:
$$f^{\text{TS}}(z; u) = \left( \sum_{i=1}^{n+1} \mu_i(u) \exp\{c_i(u)\} \right) \odot z + \left( \sum_{i=1}^{n+1} \lambda_i(u)\, d_i(u) \right). \qquad (2)$$
Here, $\mu_i(u) \in \mathbb{R}$ and $\lambda_i(u) \in \mathbb{R}$ are the $i$-th entries of the deep nets $\mu(u) \in \mathbb{R}^{n+1}$ and $\lambda(u) \in \mathbb{R}^{n+1}$, respectively, which yield the coefficients, while the deep nets $c_i$ and $d_i$ are frozen. As before, the $\exp$ function is applied elementwise. We use a softplus activation and an offset at the output of $\mu$ to force $\mu_i(u) \ge 10^{-4}$ for any $i$, for numerical stability. Note that the combined flow $f^{\text{TS}}$, consisting of multiple 1-layer flows, is also a 1-layer normalizing flow. Hence, all the compelling properties over VAE-based architectures described at the beginning of Sec. 3.2 remain valid. To train the combined flow, we use the same log-likelihood loss $\mathbb{E}_{(u,a)\sim D^{\text{TS}}} \log p(a|u)$ as that for training single flows. Here, we optimize the deep nets $\mu(u)$ and $\lambda(u)$ which parameterize $f^{\text{TS}}$.

Obviously, the employed combination of flows can be straightforwardly extended to more complicated flows, e.g., RealNVP [8] or Glow [22]. However, we found the discussed simple formulation to work remarkably well and to be robust.
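For illustration, the combination in Eq. (2) can be sketched as follows, reusing the `ConditionalAffineFlow` sketch above; the coefficient-network sizes and the exact softplus offset are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedFlow(nn.Module):
    """Combine n+1 frozen 1-layer flows via learned coefficients mu(u), lambda(u)  (cf. Eq. (2))."""
    def __init__(self, flows, cond_dim: int, hidden: int = 128):
        super().__init__()
        self.flows = nn.ModuleList(flows)
        for p in self.flows.parameters():
            p.requires_grad_(False)          # c_i(u) and d_i(u) stay frozen
        n_plus_1 = len(flows)
        self.mu_net = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_plus_1))
        self.lam_net = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_plus_1))

    def _scale_shift(self, u):
        # mu_i(u) >= 1e-4 via softplus plus an offset, as described in Sec. 3.2.2.
        mu = F.softplus(self.mu_net(u)) + 1e-4          # (batch, n+1)
        lam = self.lam_net(u)                           # (batch, n+1)
        scale = sum(mu[:, i:i+1] * torch.exp(f.c(u)) for i, f in enumerate(self.flows))
        shift = sum(lam[:, i:i+1] * f.d(u) for i, f in enumerate(self.flows))
        return scale, shift

    def forward(self, z, u):
        scale, shift = self._scale_shift(u)
        return scale * z + shift

    def log_prob(self, a, u):
        # The combination is still a 1-layer affine flow, so the same
        # change-of-variables formula applies with the combined scale.
        scale, shift = self._scale_shift(u)
        z = (a - shift) / scale
        base = torch.distributions.Normal(torch.zeros_like(z), torch.ones_like(z))
        return base.log_prob(z).sum(-1) - torch.log(scale).sum(-1)
```

Training then maximizes `log_prob` on the task-specific dataset while only `mu_net` and `lam_net` receive gradients.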
3.3 Explicit Prior

Beyond distilling information from demonstrations into deep nets which are then used as implicit priors, we find the explicit use of demonstrations to also be remarkably useful. To benefit, we encode future state information into the input of the flow. More specifically, instead of sampling $(s, a)$ pairs from a dataset $D$ for training the flows, we consider sampling a transition $(s, a, s_{\text{next}})$ from $D$. During training, we concatenate $s$ and $s_{\text{next}}$ before feeding it into a flow, i.e., $u = [s, s_{\text{next}}]$ instead of $u = s$.

However, we do not know the future state $s_{\text{next}}$ when deploying the policy. To obtain an estimate, we use task-specific demonstrations as explicit priors. More formally, we use the trajectories within the task-specific dataset $D^{\text{TS}}$ as a database. This is manageable as we assume the task-specific dataset to be small. For each environment step of reinforcement learning with current state $s$, we perform a lookup, where $s$ is the query, the states $s_{\text{key}}$ in the trajectories are the keys, and their corresponding next states $s_{\text{next}}$ are the values. Concretely, we assume $s_{\text{next}}$ belongs to trajectory $\tau$ in the task-specific dataset $D^{\text{TS}}$, and define $\hat{s}_{\text{next}}$ as the result of the database retrieval with respect to the given query $s$, i.e.,
$$\hat{s}_{\text{next}} = \operatorname*{argmin}_{s_{\text{next}} \,|\, (s_{\text{key}}, a, s_{\text{next}}) \in D^{\text{TS}}} \left[ (s_{\text{key}} - s)^2 + C \cdot \delta(s_{\text{next}}) \right], \quad \text{where}$$
$$\delta(s_{\text{next}}) = \begin{cases} 1 & \text{if } \exists\, s'_{\text{next}} \in \tau \text{ s.t. } s'_{\text{next}} \text{ is no earlier than } s_{\text{next}} \text{ in } \tau \text{ and has been retrieved}, \\ 0 & \text{otherwise}. \end{cases} \qquad (3)$$
In Eq. (3), $C$ is a constant and $\delta$ is the indicator function. We set $u = [s, \hat{s}_{\text{next}}]$ as the condition, feed it into the trained flow $f^{\text{TS}}$, and map the latent space element $z$ obtained from the RL policy to the real-world action $a$. The penalty term $\delta$ is a push-forward technique, which aims to push the agent to move forward instead of staying put, imposing monotonicity on the retrieved $\hat{s}_{\text{next}}$. Consider an agent at a particular state $s$ and a flow $f^{\text{TS}}$, conditioned on $u = [s, \hat{s}_{\text{next}}]$, which maps the chosen action $z$ to a real-world action $a$ that does not modify the environment. Without the penalty term, the agent will remain at the same state and retrieve the same likely next state, which again maps onto the action that does not change the environment. Intuitively, this term discourages 1) retrieving the same state twice, and 2) returning to earlier states in a given trajectory. In our experiments, we set $C = 1$.
3.4 Reinforcement Learning with Priors

Given the implicit and explicit priors, we use RL to train a policy $\pi(z|s)$ to accomplish the target task demonstrated in the task-specific dataset. As shown in Fig. 1, the RL agent receives a state $s$ and provides a latent space element $z$. The conditioning variable of the flow is retrieved via the dataset lookup described in Sec. 3.3, and the real-world action $a$ is then computed using the flow. Note, our approach is suitable for any RL method, i.e., the policy $\pi(z|s)$ can be trained using any RL algorithm, such as proximal policy optimization (PPO) [43] or soft actor-critic (SAC) [15].
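To make the interplay of the pieces concrete, a sketch of a single environment step is shown below; the `policy.sample` interface, the older gym-style `env.step` return signature, and the reuse of the sketches above are assumptions for illustration:

```python
import numpy as np
import torch

def environment_step(env, s, policy, flow, explicit_prior):
    """One environment step with both priors: the RL policy acts in the latent space Z,
    and the combined flow, conditioned on u = [s, s_next_hat], maps z to a real action."""
    s_next_hat = explicit_prior.query(s)                                       # explicit prior lookup (Sec. 3.3)
    u = torch.as_tensor(np.concatenate([s, s_next_hat]), dtype=torch.float32).unsqueeze(0)
    z = torch.as_tensor(policy.sample(s), dtype=torch.float32).unsqueeze(0)    # z ~ pi(z|s), e.g., from PPO/SAC
    a = flow(z, u).squeeze(0).detach().numpy()                                 # implicit prior maps z to action a
    s_new, reward, done, info = env.step(a)                                    # older gym-style 4-tuple return
    return s_new, reward, done
```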