Transferring Knowledge for Reinforcement Learning in
Contact-Rich Manipulation
Quantao Yang, Johannes A. Stork, and Todor Stoyanov
I. INTRODUCTION
While humans are adept at transferring a learned skill (that is, the ability to
solve a task) to a new, similar task efficiently, most state-of-the-art
reinforcement learning (RL) methods have to solve every new task from scratch.
Consequently, millions of new interactions with different environments can be
required to solve task variants, which is infeasible for a real robot system.
Training from scratch is resource and time consuming, and sample collection in
a new physical environment is costly and repetitive. Therefore, in order to
apply RL directly on real physical robots, it is imperative to address the
problem of sample inefficiency when solving task variants.
State-of-the-art methods require either policy training in simulation, to
prevent undesired behavior, followed by domain transfer, or guided policy
search for single skills in a family of similar problems [1], [2], [3]. The
successful deployment of simulation-to-reality methods requires that the
simulation is close enough to the physical system. However, for real-world
robotic applications, transition dynamics in deployment are often substantially
different from those encountered during training (in simulation).
In this work, we consider the problem of transferring knowledge within a family
of similar tasks. Our fundamental assumption is that we are presented with a
family of problems, formalized as Markov Decision Processes (MDPs), that all
share the same state and action spaces [4]. Crucially, however, we allow
members of the family to exhibit different transition dynamics. Informally, our
assumption is that while transition probabilities are different, they may be
correlated or overlapping for parts of the state space. We then propose a
method, Multi-Prior Regularized RL (MPR-RL), that leverages prior experience
collected on a subset of the problems in the MDP family to efficiently learn a
policy on a new, previously unseen problem from the same family. Our approach
learns a prior distribution over the task-specific skill for each task and
composes a family of skill priors to guide learning the policy in a new
environment. We evaluate our method on the contact-rich peg-in-hole tasks shown
in Figure 2(a).
∗This work was supported by the Wallenberg AI, Autonomous Systems
and Software Program (WASP) funded by Knut and Alice Wallenberg
Foundation.
Autonomous Mobile Manipulation (AMM) Lab, Örebro University,
Sweden (e-mail: quantao.yang@oru.se; johannes.stork@oru.se;
todor.stoyanov@oru.se).
II. APPROACH
Our approach to transferring knowledge for RL is based on
exploiting prior knowledge from demonstrations for learning
a policy in a new task. The process is composed of two
distinct phases: a prior learning phase and a task learning
phase. In the task learning phase, we guide the policy
learning to initially follow skill priors learned in the prior
learning phase. For this, we regularize the RL objective with
a relative entropy term based on the learned skill priors.
We consider a family of tasks, each formalized as a Markov decision process
(MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \rho, \gamma)$
of states, actions, transition probability, reward, initial state distribution,
and discount factor. A family of MDPs, $\mathcal{M}$, shares the same state
space and action space, while the transition dynamics differ between its
members.
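As a concrete illustration, the following minimal Python sketch (all names
hypothetical, not taken from the paper) represents such a family: the members
share one state space and one action space but each carries its own transition
model, e.g. peg-in-hole variants with different clearances or friction.

# Minimal sketch of an MDP family with shared spaces and member-specific
# dynamics; the dimensions and interfaces below are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class MDP:
    state_dim: int             # shared across the family
    action_dim: int            # shared across the family
    transition: Callable       # T_i(s, a) -> s', member-specific
    reward: Callable           # r(s, a, s')
    init_state_dist: Callable  # rho() -> s_0
    gamma: float = 0.99        # discount factor

def make_family(transition_models, reward, init_state_dist,
                state_dim=12, action_dim=6):
    """One MDP per transition model; everything else is shared."""
    return [MDP(state_dim, action_dim, T, reward, init_state_dist)
            for T in transition_models]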
We assume access to a dataset $\mathcal{D}$ of demonstrated trajectories
$\tau_i = \{(s_0, a_0), \ldots, (s_{T_i}, a_{T_i})\}$ for each robotic task. We
aim to leverage these trajectories to learn a skill prior $p_i(a_t | s_t)$ for
each specific MDP $\mathcal{M}_i$. Our objective is then to learn a policy
$\pi_\theta(a | s)$ with parameters $\theta$ that maximizes the sum of rewards
$G(\theta)$ for a new MDP $\mathcal{M}_{\mathrm{new}}$ by leveraging the prior
experience contained in the dataset $\mathcal{D}$.
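One possible way to obtain each skill prior $p_i(a_t | s_t)$ is to fit a
state-conditioned Gaussian to the demonstrations of task $i$ by maximum
likelihood. The sketch below assumes a diagonal-Gaussian MLP in PyTorch; this
is an illustrative choice, not the architecture used in the paper.

import torch
import torch.nn as nn

class SkillPrior(nn.Module):
    """State-conditioned diagonal Gaussian p_i(a_t | s_t)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, s):
        h = self.body(s)
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Independent(
            torch.distributions.Normal(self.mu(h), std), 1)

def fit_prior(prior, states, actions, steps=1000, lr=3e-4):
    """Maximize the demonstration log-likelihood of p_i(a_t | s_t)."""
    opt = torch.optim.Adam(prior.parameters(), lr=lr)
    for _ in range(steps):
        loss = -prior(states).log_prob(actions).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return prior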
In skill prior RL (SPiRL) [5], the learned skill prior is leveraged to guide
learning a high-level policy $\pi_\theta(z | s)$ by introducing an entropy
term. They propose to replace the entropy term of Soft Actor-Critic (SAC) [6]
with the negated KL divergence between the policy and the prior. Similarly, our
method uses the embedding space $\mathcal{Z}$.
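For illustration, the sketch below shows this core substitution: the SAC
entropy bonus in the actor objective is replaced by the negated KL divergence
to a learned prior. For readability it is written over raw actions rather than
the skill embedding space $\mathcal{Z}$, and all module names are placeholders.

import torch

def kl_regularized_actor_loss(policy, prior, critic, states, alpha=0.1):
    """SAC-style actor loss with -KL(pi || prior) in place of the entropy."""
    dist = policy(states)                 # pi_theta(a | s)
    actions = dist.rsample()              # reparameterized sample
    q_value = critic(states, actions)
    kl = torch.distributions.kl_divergence(dist, prior(states))
    # Maximizing Q - alpha * KL is equivalent to minimizing alpha * KL - Q.
    return (alpha * kl - q_value).mean()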
Using only one skill prior limits the method to policy learning in the same
task in which the skill prior was learned.
For this reason, we extend this approach from one learned
skill prior to several skill priors learned in different tasks.
To this end, we regularize the RL objective with a weighted
sum of relative entropies:
$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^t\, r(s_t, a_t, s_{t'}) + \alpha \Gamma_t\right], \tag{1}
$$
where
$$
\Gamma_t = -\sum_{i=1}^{m} \omega_i\, D_{\mathrm{KL}}\big(\pi_\theta(a_t | s_t),\, p_i(a_t | s_t)\big). \tag{2}
$$
$\omega_i$ is the adaptive weight and $\sum_{i=1}^{m} \omega_i = 1$;
$p_i(a_t | s_t)$ are
skill priors from different tasks of the family. This means
that the policy is initially incentivized to explore according
to a mixture of different skill priors depending on the weight
factors. We learn the adaptive weights $\omega_i$ by training a discriminator
on the most recently observed transition.
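As a sketch of how Eq. (2) and the adaptive weights could be computed in
practice, the snippet below weights the per-prior KL terms with the softmax
output of a task discriminator evaluated on the most recently observed
transition; the discriminator interface and all names are illustrative
assumptions, not a specification of our implementation.

import torch

def gamma_t(policy, priors, discriminator, s, a, s_next):
    """Gamma_t = - sum_i omega_i * KL(pi_theta(.|s), p_i(.|s)), cf. Eq. (2)."""
    with torch.no_grad():
        # omega_i: softmax over the discriminator's m task logits, so the
        # weights are positive and sum to one for every transition.
        logits = discriminator(torch.cat([s, a, s_next], dim=-1))
        omega = torch.softmax(logits, dim=-1)
    pi = policy(s)
    kls = torch.stack(
        [torch.distributions.kl_divergence(pi, p(s)) for p in priors],
        dim=-1)                            # shape: (batch, m)
    return -(omega * kls).sum(dim=-1)      # shape: (batch,)

In an SAC-style learner, the term $\alpha \Gamma_t$ from Eq. (1) would then
enter the objective in the same place as the entropy bonus, mirroring the
substitution described above.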