
performance. However, FIST uses pure imitation learning without any RL, and hence forgoes the opportunity for trial-and-error correction when the imitation is imperfect.
Our key insight is to leverage demonstrations both explicitly and implicitly, thus benefiting from both worlds. To achieve this, we develop CEIP, a method which combines explicit and implicit priors. CEIP leverages implicit demonstrations by learning a transformation from a latent space to the real action space via normalizing flows. More importantly, different from prior work such as PARROT and FIST, which combine all the information within a single deep net, CEIP selects the most useful prior by combining multiple flows in parallel to form a single large flow. To benefit from demonstrations explicitly, CEIP augments the input of the normalizing flow with a likely future state, which is retrieved via a lookup from a database of transitions. For an effective retrieval, we propose a push-forward technique which ensures that the database returns future states that have not yet been referred to, encouraging the agent to complete the whole trajectory even if it fails on a single task.
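Concretely, the explicit prior can be viewed as a nearest-neighbor lookup over demonstration transitions in which the push-forward constraint only admits entries beyond those already returned. The sketch below illustrates this idea; the class and method names, the single-demonstration setup, and the Euclidean distance are illustrative assumptions rather than the exact CEIP implementation.

```python
import numpy as np

# A minimal sketch of the explicit lookup with a push-forward constraint,
# assuming one demonstration stored as aligned arrays (hypothetical names).
class TransitionDatabase:
    def __init__(self, states, future_states):
        # states[i] is a demonstration state; future_states[i] is a state reached shortly after it.
        self.states = np.asarray(states, dtype=np.float64)
        self.future_states = np.asarray(future_states, dtype=np.float64)
        self.last_idx = -1  # index of the most recently returned transition

    def retrieve(self, query_state):
        """Return a likely future state for query_state.

        The push-forward constraint restricts the search to entries strictly after
        the last returned index, so retrieval keeps moving forward along the
        demonstration instead of getting stuck if the agent repeatedly fails at one step.
        """
        candidates = np.arange(self.last_idx + 1, len(self.states))
        if len(candidates) == 0:                     # demonstration exhausted
            return self.future_states[-1]
        dists = np.linalg.norm(self.states[candidates] - np.asarray(query_state), axis=1)
        self.last_idx = int(candidates[np.argmin(dists)])
        return self.future_states[self.last_idx]
```

The retrieved future state is then appended to the current state as conditioning input to the normalizing flow, which is how the explicit prior enters the policy.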
We evaluate the proposed approach on three challenging environments: fetchreach [36], kitchen [11], and office [45]. In each environment, we study the use of both task-specific and task-agnostic demonstrations. We observe that integrating an explicit prior, especially with our proposed push-forward technique, greatly improves results. Notably, the proposed approach works well on sophisticated long-horizon robotics tasks with only a few, or sometimes even a single, task-specific demonstration.
2 Preliminaries
Reinforcement Learning.
Reinforcement learning (RL) aims to train an agent to make the ‘best’ decision towards completing a particular task in a given environment. The environment and the task are often described as a Markov Decision Process (MDP), defined by a tuple $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$. At timestep $t$ of the Markov process, the agent observes the current state $s_t \in \mathcal{S}$ and executes an action $a_t \in \mathcal{A}$ following some probability distribution, i.e., the policy $\pi(a_t|s_t) \in \Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ denotes the probability simplex over elements of the space $\mathcal{A}$. Upon executing action $a_t$, the state of the agent changes to $s_{t+1}$ following the dynamics of the environment, which are governed by the transition function $T: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$. Meanwhile, the agent receives a reward $r(s_t, a_t) \in \mathbb{R}$. The agent aims to maximize the cumulative discounted reward $\sum_t \gamma^t r(s_t, a_t)$, where $\gamma \in [0, 1]$ is the discount factor. One complete run in an environment is called an episode, and the corresponding state-action pairs form a trajectory $\tau = \{(s_1, a_1), (s_2, a_2), \dots\}$.
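As a concrete illustration of this objective, the sketch below rolls out a policy for one episode and accumulates the discounted return $\sum_t \gamma^t r(s_t, a_t)$; the Gym-style environment interface (`env.reset()`, `env.step()` returning a 4-tuple) is an illustrative assumption rather than part of the formulation above.

```python
def discounted_return(rewards, gamma):
    """Compute the cumulative reward sum_t gamma^t * r(s_t, a_t) for one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def rollout(env, policy, gamma=0.99):
    """Run one episode: sample a_t ~ pi(.|s_t), step the environment, record the trajectory."""
    state = env.reset()
    trajectory, rewards, done = [], [], False
    while not done:
        action = policy(state)                        # a_t ~ pi(a_t | s_t)
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, action))
        rewards.append(reward)
        state = next_state
    return trajectory, discounted_return(rewards, gamma)
```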
Normalizing Flows.
A normalizing flow [24] is a generative model that transforms elements $z_0$ drawn from a simple distribution $p_z$, e.g., a Gaussian, into elements $a_0$ drawn from a more complex distribution $p_a$. For this transformation, a bijective function $f$ is used, i.e., $a_0 = f(z_0)$. The use of a bijective function ensures that the log-likelihood of the more complex distribution at any point is tractable, and that samples of this distribution can be easily generated by drawing samples from the simple distribution and pushing them through the flow. Formally, the core idea of a normalizing flow can be summarized via $p_a(a_0) = p_z\!\left(f^{-1}(a_0)\right)\left|\frac{\partial f^{-1}(a)}{\partial a}\Big|_{a=a_0}\right|$, where $|\cdot|$ denotes the determinant (guaranteed to be positive by flow designs), $a$ is a random variable with the desired more complex distribution, and $z$ is a random variable governed by the simple distribution. To efficiently compute the determinant of the Jacobian matrix of $f^{-1}$, special constraints are imposed on the form of $f$. For example, coupling flows like RealNVP [8] and autoregressive flows [31] constrain the Jacobian of $f^{-1}$ to be triangular.
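To make the triangular-Jacobian construction concrete, below is a minimal sketch of a RealNVP-style affine coupling layer and the resulting change-of-variables log-likelihood. The hidden sizes and the even split of dimensions are illustrative assumptions; this is not the exact flow architecture used later in this paper.

```python
import math
import torch
import torch.nn as nn

# Sketch of an affine coupling layer: half of the dimensions pass through unchanged,
# the rest are scaled and shifted conditioned on the first half, so the Jacobian of
# f^{-1} is triangular and its log-determinant is a simple sum of the scales.
class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        self.scale_net = nn.Sequential(nn.Linear(self.d, hidden), nn.ReLU(),
                                       nn.Linear(hidden, dim - self.d))
        self.shift_net = nn.Sequential(nn.Linear(self.d, hidden), nn.ReLU(),
                                       nn.Linear(hidden, dim - self.d))

    def forward(self, z):
        # f: z -> a, mapping the simple distribution to the complex one.
        z1, z2 = z[:, :self.d], z[:, self.d:]
        s, t = self.scale_net(z1), self.shift_net(z1)
        a = torch.cat([z1, z2 * torch.exp(s) + t], dim=1)
        return a, s.sum(dim=1)                     # log|det df/dz| = sum(s)

    def inverse(self, a):
        # f^{-1}: a -> z, used for evaluating log p_a(a).
        a1, a2 = a[:, :self.d], a[:, self.d:]
        s, t = self.scale_net(a1), self.shift_net(a1)
        z = torch.cat([a1, (a2 - t) * torch.exp(-s)], dim=1)
        return z, -s.sum(dim=1)                    # log|det df^{-1}/da| = -sum(s)

def log_prob(flow, a):
    """Change of variables: log p_a(a) = log p_z(f^{-1}(a)) + log|det df^{-1}/da|, p_z standard normal."""
    z, log_det = flow.inverse(a)
    log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * z.shape[1] * math.log(2 * math.pi)
    return log_pz + log_det
```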
3 CEIP: Combining Explicit and Implicit Priors
3.1 Overview
As illustrated in Fig. 1, our goal is to train an autonomous agent to solve challenging tasks despite sparse rewards, such as controlling a robot arm to complete item manipulation tasks (like turning on a switch or opening a cabinet). For this, we aim to benefit from available demonstrations. Formally, we consider a task-specific dataset $D^{\text{TS}} = \{\tau^{\text{TS}}_1, \tau^{\text{TS}}_2, \dots, \tau^{\text{TS}}_m\}$, where $\tau^{\text{TS}}_i$ is the $i$-th trajectory of the task-specific dataset, and a task-agnostic dataset $D^{\text{TA}} = \bigcup\{D_i \mid i \in \{1, 2, 3, \dots, n\}\}$, where $D_i = \{\tau^i_1, \tau^i_2, \dots, \tau^i_{m_i}\}$ subsumes the demonstration trajectories for the $i$-th task in the task-agnostic dataset. Each trajectory $\tau = \{(s_1, a_1), (s_2, a_2), \dots\}$ in the dataset is a state-action pair sequence