Transferring Knowledge for Reinforcement Learning in
Contact-Rich Manipulation
Quantao Yang, Johannes A. Stork, and Todor Stoyanov
I. INTRODUCTION
While humans are adept at transferring a learned skill (that is, the ability to
solve a task) to a new, similar task efficiently, most state-of-the-art
reinforcement learning (RL) methods have to solve every new task from scratch.
Consequently, millions of new interactions with different environments can be
required to solve task variants, which is infeasible for a real robot system.
Training from scratch is resource and time consuming, and sample collection in
a new physical environment is costly and repetitive. Therefore, in order to
apply RL directly on real physical robots, it is imperative to address the
problem of sample inefficiency when solving task variants.
State-of-the-art methods require either policy training in simulation, to
prevent undesired behavior, followed by domain transfer, or guided policy
search for single skills in a family of similar problems [1], [2], [3]. The
successful deployment of simulation-to-reality methods requires that the
simulation is close enough to the physical system. However, for real-world
robotic applications, transition dynamics in deployment are often substantially
different from those encountered during training (in simulation).
In this work, we consider the problem of transferring knowledge within a family
of similar tasks. Our fundamental assumption is that we are presented with a
family of problems, formalized as Markov Decision Processes (MDPs), that all
share the same state and action spaces [4]. Crucially, however, we allow
members of the family to exhibit different transition dynamics. Informally, our
assumption is that while transition probabilities are different, they may be
correlated or overlapping for parts of the state space. We then propose a
method, Multi-Prior Regularized RL (MPR-RL), that leverages prior experience
collected on a subset of the problems in the MDP family to efficiently learn a
policy on a new, previously unseen problem from the same family. Our approach
learns a prior distribution over the task-specific skill for each task and
composes a family of skill priors to guide learning the policy in a new
environment. We evaluate our method on the contact-rich peg-in-hole tasks shown
in Figure 2(a).
∗This work was supported by the Wallenberg AI, Autonomous Systems
and Software Program (WASP) funded by Knut and Alice Wallenberg
Foundation.
Autonomous Mobile Manipulation (AMM) Lab, Örebro University,
Sweden (e-mail: quantao.yang@oru.se; johannes.stork@oru.se;
todor.stoyanov@oru.se).
II. APPROACH
Our approach to transferring knowledge for RL is based on
exploiting prior knowledge from demonstrations for learning
a policy in a new task. The process is composed of two
distinct phases: a prior learning phase and a task learning
phase. In the task learning phase, we guide the policy
learning to initially follow skill priors learned in the prior
learning phase. For this, we regularize the RL objective with
a relative entropy term based on the learned skill priors.
We consider a family of tasks, each formalized as a Markov decision process
(MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \rho, \gamma)$
of states, actions, transition probability, reward, initial state distribution,
and discount factor. A family of MDPs, $\mathcal{M}$, shares the same state
space and action space, while the transition dynamics differ between its
members.
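As a concrete illustration, the following minimal Python sketch (all names
hypothetical, not taken from the paper) represents such a family: the members
share one state space and one action space but each carries its own transition
model, e.g. peg-in-hole variants with different clearances or friction.

# Minimal sketch of an MDP family with shared spaces and member-specific
# dynamics; the dimensions and interfaces below are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class MDP:
    state_dim: int             # shared across the family
    action_dim: int            # shared across the family
    transition: Callable       # T_i(s, a) -> s', member-specific
    reward: Callable           # r(s, a, s')
    init_state_dist: Callable  # rho() -> s_0
    gamma: float = 0.99        # discount factor

def make_family(transition_models, reward, init_state_dist,
                state_dim=12, action_dim=6):
    """One MDP per transition model; everything else is shared."""
    return [MDP(state_dim, action_dim, T, reward, init_state_dist)
            for T in transition_models]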
We assume access to a dataset $\mathcal{D}$ of demonstrated trajectories
$\tau_i = \{(s_0, a_0), \ldots, (s_{T_i}, a_{T_i})\}$ for each robotic task. We
aim to leverage these trajectories to learn a skill prior $p_i(a_t | s_t)$ for
each specific MDP $\mathcal{M}_i$. Our objective is then to learn a policy
$\pi_\theta(a | s)$ with parameters $\theta$ that maximizes the sum of rewards
$G(\theta)$ for a new MDP $\mathcal{M}_{\mathrm{new}}$ by leveraging the prior
experience contained in the dataset $\mathcal{D}$.
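One possible way to obtain each skill prior $p_i(a_t | s_t)$ is to fit a
state-conditioned Gaussian to the demonstrations of task $i$ by maximum
likelihood. The sketch below assumes a diagonal-Gaussian MLP in PyTorch; this
is an illustrative choice, not the architecture used in the paper.

import torch
import torch.nn as nn

class SkillPrior(nn.Module):
    """State-conditioned diagonal Gaussian p_i(a_t | s_t)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, s):
        h = self.body(s)
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Independent(
            torch.distributions.Normal(self.mu(h), std), 1)

def fit_prior(prior, states, actions, steps=1000, lr=3e-4):
    """Maximize the demonstration log-likelihood of p_i(a_t | s_t)."""
    opt = torch.optim.Adam(prior.parameters(), lr=lr)
    for _ in range(steps):
        loss = -prior(states).log_prob(actions).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return prior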
In skill prior RL (SPiRL) [5], the learned skill prior is leveraged to guide
learning a high-level policy $\pi_\theta(z | s)$ by introducing an entropy
term. They propose to replace the entropy term of Soft Actor-Critic (SAC) [6]
with the negated KL divergence between the policy and the prior. Similarly, our
method uses the embedding space $\mathcal{Z}$.
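For illustration, the sketch below shows this core substitution: the SAC
entropy bonus in the actor objective is replaced by the negated KL divergence
to a learned prior. For readability it is written over raw actions rather than
the skill embedding space $\mathcal{Z}$, and all module names are placeholders.

import torch

def kl_regularized_actor_loss(policy, prior, critic, states, alpha=0.1):
    """SAC-style actor loss with -KL(pi || prior) in place of the entropy."""
    dist = policy(states)                 # pi_theta(a | s)
    actions = dist.rsample()              # reparameterized sample
    q_value = critic(states, actions)
    kl = torch.distributions.kl_divergence(dist, prior(states))
    # Maximizing Q - alpha * KL is equivalent to minimizing alpha * KL - Q.
    return (alpha * kl - q_value).mean()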
Using only one skill prior limits the method to policy learning in the same
task in which the skill prior was learned.
For this reason, we extend this approach from one learned
skill prior to several skill priors learned in different tasks.
To this end, we regularize the RL objective with a weighted
sum of relative entropies:
$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^t\, r(s_t, a_t, s_{t'}) + \alpha \Gamma_t\right], \tag{1}
$$
where
$$
\Gamma_t = -\sum_{i=1}^{m} \omega_i\, D_{\mathrm{KL}}\big(\pi_\theta(a_t | s_t),\, p_i(a_t | s_t)\big). \tag{2}
$$
$\omega_i$ is the adaptive weight and $\sum_{i=1}^{m} \omega_i = 1$;
$p_i(a_t | s_t)$ are
skill priors from different tasks of the family. This means
that the policy is initially incentivized to explore according
to a mixture of different skill priors depending on the weight
factors. We learn the adaptive weights $\omega_i$ by training a discriminator
on the most recently observed transition.
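As a sketch of how Eq. (2) and the adaptive weights could be computed in
practice, the snippet below weights the per-prior KL terms with the softmax
output of a task discriminator evaluated on the most recently observed
transition; the discriminator interface and all names are illustrative
assumptions, not a specification of our implementation.

import torch

def gamma_t(policy, priors, discriminator, s, a, s_next):
    """Gamma_t = - sum_i omega_i * KL(pi_theta(.|s), p_i(.|s)), cf. Eq. (2)."""
    with torch.no_grad():
        # omega_i: softmax over the discriminator's m task logits, so the
        # weights are positive and sum to one for every transition.
        logits = discriminator(torch.cat([s, a, s_next], dim=-1))
        omega = torch.softmax(logits, dim=-1)
    pi = policy(s)
    kls = torch.stack(
        [torch.distributions.kl_divergence(pi, p(s)) for p in priors],
        dim=-1)                            # shape: (batch, m)
    return -(omega * kls).sum(dim=-1)      # shape: (batch,)

In an SAC-style learner, the term $\alpha \Gamma_t$ from Eq. (1) would then
enter the objective in the same place as the entropy bonus, mirroring the
substitution described above.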