Efficient Learning of Locomotion Skills through the Discovery of Diverse Environmental Trajectory Generator Priors

Shikha Surana*1, Bryan Lim*1, Antoine Cully1

*Equal contribution.
1Imperial College London, United Kingdom. {ss5721, bwl116, a.cully}@ic.ac.uk
Abstract: Data-driven learning-based methods have recently been particularly successful at learning robust locomotion controllers for a variety of unstructured terrains. Prior work has shown that incorporating good locomotion priors in the form of trajectory generators (TGs) is effective for efficiently learning complex locomotion skills. However, defining a single good TG becomes increasingly challenging as tasks and environments grow more complex, since it requires extensive tuning and risks reducing the effectiveness of the prior. In this paper, we present Evolved Environmental Trajectory Generators (EETG), a method that learns a diverse set of specialised locomotion priors using Quality-Diversity algorithms while maintaining a single policy within the Policies Modulating TG (PMTG) architecture. The results demonstrate that EETG enables a quadruped robot to successfully traverse a wide range of environments, such as slopes, stairs, rough terrain, and balance beams. Our experiments show that learning a diverse set of specialised TG priors is significantly (5 times) more efficient than using a single, fixed prior when dealing with a wide range of environments.
I. INTRODUCTION
Legged robots [1], [2], [3] have tremendous potential for societal impact as they can be used for applications involving a wide range of environments, such as rough, cluttered, and unstructured terrain. From search and rescue and inspection work [4] to carrying heavy payloads, legged robots have the potential to undertake many of the physical activities that humans and animals are capable of but that are dangerous or harmful to perform.
However, legged robots are also underactuated, high-dimensional systems with many constraints, which makes them challenging to control. Recently, reinforcement learning (RL) approaches [5], [6], [7], [8], [9], [10] have become competitive with more conventional model-based optimization methods [11], [12], [13], [14], demonstrating state-of-the-art locomotion abilities both in simulation and in the real world [5], [6]. These learnt controllers are especially robust when evaluated across many different environments and perturbations. Despite these significant advances, learning-based approaches in robotics are notorious for being sample inefficient and usually require large amounts of data [15], [16]. Researchers have tried to address this problem in a number of ways, for example by improving the sample efficiency of the underlying RL algorithm [17] or by using fast, highly parallel simulators [18], [8]. Another effective approach is to incorporate useful priors into the learning process.
Policies Modulating Trajectory Generators (PMTG) [19] is one such method, which incorporates a parameterized Trajectory Generator (TG) as a prior, separate from the learnt policy. PMTG makes learning complex locomotion tasks easier and demonstrates that a good locomotion prior can significantly improve the efficiency of RL methods [19]. Lee et al. [6] also used this PMTG framework when demonstrating state-of-the-art locomotion across a wide range of environments in the real world, further demonstrating the effectiveness of the TG prior for locomotion.
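To make the role of the TG prior concrete, the following is a minimal sketch of a PMTG-style control step: the policy observes the robot state together with the TG phase, and outputs both TG modulation parameters (frequency and amplitude) and a residual that is added to the TG's open-loop output. The sinusoidal TG shape, the parameter names, and the dummy policy are illustrative assumptions, not the exact formulation of [19].

```python
import numpy as np

class TrajectoryGenerator:
    """Open-loop trajectory generator: a phase-driven periodic joint pattern."""

    def __init__(self, params):
        # params: the TG prior, e.g. [amplitude, offset] (illustrative parameterization).
        self.params = np.asarray(params, dtype=float)
        self.phase = 0.0

    def step(self, frequency, amplitude_scale, dt=0.01):
        # Advance the phase at the policy-modulated frequency.
        self.phase = (self.phase + 2.0 * np.pi * frequency * dt) % (2.0 * np.pi)
        # Sinusoidal swing around the prior offset (assumed TG shape).
        return self.params[1] + amplitude_scale * self.params[0] * np.sin(self.phase)

def pmtg_step(policy, tg, robot_state, dt=0.01):
    """One PMTG control step: final action = TG output + policy residual."""
    # The policy sees the robot state plus the TG phase (encoded as sin/cos).
    obs = np.concatenate([robot_state, [np.sin(tg.phase), np.cos(tg.phase)]])
    frequency, amplitude_scale, residual = policy(obs)  # learnt modulation + residual
    nominal = tg.step(frequency, amplitude_scale, dt)   # open-loop prior contribution
    return nominal + residual

# Example usage with a dummy policy (fixed frequency/amplitude, zero residual).
tg = TrajectoryGenerator([0.2, 0.0])
policy = lambda obs: (1.5, 1.0, 0.0)
action = pmtg_step(policy, tg, robot_state=np.zeros(4))
```

The key design point is that the policy never has to produce the full gait from scratch; it only modulates and corrects the prior, which is what makes the choice of TG parameters so important.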
However, some questions remain surrounding the PMTG method. How are the parameters of the TG defined? What parameters make a good prior? The parameters of the TGs used in prior work are usually defined manually by engineers based on intuition about the locomotion task of interest. For instance, a forward-swinging TG motion is useful when learning to walk forward [19]. On the other hand, for more complex tasks such as learning to walk across a diversity of difficult environments, a more generic and unbiased TG motion of stepping up and down in place had to be used [6]. While this TG choice proved effective, it could reduce the effectiveness of the prior in aiding learning, and it suggests that the policy still has to do the bulk of the work, since it must cope with the different environments on its own. For example, a good TG prior for ascending steps would differ from one for descending steps. In this paper, we address this by learning good priors for tasks instead of manually crafting them. More importantly, we also learn a diverse set of specialised priors using Quality-Diversity (QD) algorithms rather than a single prior.
The main contribution of our work is a novel framework, Evolved Environmental Trajectory Generators (EETG), for discovering a diverse set of specialised Trajectory Generators (TGs) that act as priors for more efficient learning. We demonstrate in our experiments that our method enables a simulated A1 quadruped robot to learn dynamic locomotion behaviors over diverse environment types such as slopes, uneven terrain, and steps. Our experiments show that EETG matches the performance of learning individual TGs and policies for each environment while being significantly more efficient. Our work demonstrates that learning a diverse set of TG priors is more effective than using a single fixed TG, especially when dealing with many tasks and environments.
Fig. 1. A1 quadruped robot, trained with the presented approach, in a variety of challenging environments. (A) Ascending slope. (B) Descending slope. (C) Ascending stairs. (D) Descending stairs. (E) Uneven/rough terrain. (F) Narrow balance beam.

II. RELATED WORK

Legged Locomotion. Locomotion controllers have traditionally been designed using a modular control framework that breaks the difficult control problem down into smaller sub-problems. Each sub-problem makes approximations such as mass-less limbs and point-mass dynamics [11], [13] and applies heuristics [14], which are used alongside trajectory optimization, footstep planning, and model predictive control (MPC) methods [20], [11], [12].
Alternatively, data-driven learning-based approaches are fast becoming a go-to option due to recent advances showing impressive robustness and performance [5], [6], [9], [10], while at the same time requiring less modeling and expert optimization knowledge. One of the key ideas that makes learning-based methods perform so well, and gives these controllers remarkable robustness especially in the real world, is domain randomization (DR). DR of simulator physics and visual features enabled learning dexterous manipulation to solve a Rubik's cube [15]. Hwangbo et al. [5] then demonstrated that DR is also effective for locomotion in dynamic legged systems. The idea of DR can also be extended to curriculum-learning-based methods and applied beyond just the physical parameters and dynamics of the robot. Multiple separate works [6], [21], [22] show that learning through a curriculum of diverse environment types and terrains can result in learnt skills that are robust and generalize to new environments. Similarly, our work utilizes environment diversity as a strategy to learn robust and generalizable controllers.
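As a minimal illustration of the kind of dynamics randomization discussed above, the sketch below samples a fresh set of physics parameters at the start of each training episode. The specific parameter names, ranges, and the commented-out simulator hook are assumptions for illustration, not the setup used in the cited works.

```python
import numpy as np

def sample_randomized_dynamics(rng):
    """Sample one episode's physics parameters (illustrative names and ranges)."""
    return {
        "ground_friction": rng.uniform(0.4, 1.25),
        "base_mass_scale": rng.uniform(0.8, 1.2),      # +/-20% mass/payload variation
        "motor_strength_scale": rng.uniform(0.9, 1.1),
        "control_latency_s": rng.uniform(0.0, 0.04),   # actuation/observation delay
    }

rng = np.random.default_rng(seed=0)
for episode in range(3):
    dynamics = sample_randomized_dynamics(rng)
    # env.reset(dynamics=dynamics)  # hypothetical simulator hook
    print(f"episode {episode}: {dynamics}")
```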
As mentioned in the introduction, we also build on the PMTG control architecture [19] to utilize the effectiveness of priors for learning. However, instead of manually crafting a single fixed parameter vector of the trajectory generator (TG), we learn a diverse and high-performing set of specialised TGs.
Quality-Diversity. Quality-Diversity (QD) [23], [24], [25] is a growing branch of optimization methods that aims to find a large set of diverse and locally optimal solutions, in contrast to conventional optimization algorithms, which find a single objective-maximising solution. QD algorithms have demonstrated their effectiveness across a range of applications including robotics [26], [27], [28], reinforcement learning [29], [30], video-game design [31], [32], engineering design optimization [33], and more.
In the context of robotics, QD algorithms have commonly been used to learn a diverse repertoire of primitive controllers [34], which can then be used effectively for downstream applications such as damage recovery [26], [35], or combined with planning algorithms to perform longer-horizon tasks such as navigating to a goal [35], [36], [37]. In our work, we learn the parameters of a controller represented as an open-loop TG. However, a critical difference in our method is that the descriptors/cells of the QD algorithm are not defined by the behavior of the controller but by the environment (e.g., variations of stairs, rough terrain) in which the controller is evaluated, which is selected before evaluation. This is akin to multi-task MAP-Elites [38].
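The sketch below illustrates this environment-indexed flavour of QD: each archive cell corresponds to one predefined environment, a candidate TG parameter vector is evaluated in an environment chosen before evaluation, and it replaces that cell's elite only if it performs better there. The Gaussian mutation, the dummy fitness, and the `evaluate` interface are placeholders standing in for whatever variation operator and simulator rollout a real implementation would use.

```python
import numpy as np

def qd_over_environments(environments, evaluate, n_params, iterations, rng):
    """Multi-task MAP-Elites-style loop: one cell per environment, keep the best TG params per cell."""
    archive = {env: (None, -np.inf) for env in environments}  # env -> (tg_params, fitness)
    for _ in range(iterations):
        filled = [p for p, _ in archive.values() if p is not None]
        if filled:
            parent = filled[rng.integers(len(filled))]                # parent from any filled cell
            candidate = parent + rng.normal(0.0, 0.1, n_params)       # Gaussian mutation (placeholder)
        else:
            candidate = rng.normal(0.0, 1.0, n_params)                # random initialisation
        env = environments[rng.integers(len(environments))]           # cell chosen before evaluation
        fitness = evaluate(candidate, env)                            # e.g. distance walked in that environment
        if fitness > archive[env][1]:
            archive[env] = (candidate, fitness)                       # candidate becomes the cell's elite
    return archive

# Example usage with a dummy fitness function.
envs = ["slope_up", "slope_down", "stairs_up", "stairs_down", "rough", "beam"]
dummy_eval = lambda params, env: -float(np.sum(params ** 2))  # placeholder fitness
archive = qd_over_environments(envs, dummy_eval, n_params=8, iterations=200,
                               rng=np.random.default_rng(0))
```

A real QD implementation would typically evaluate candidates in batches and use more sophisticated emitters, but the insertion rule, where a candidate only replaces the elite of the environment it was evaluated in, is the core idea.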
QD-like algorithms such as POET [30], [39] have also been used to evolve environments of increasing complexity while learning specialised paired policies for each environment. Similarly, our work also maintains a diversity of environments while trying to discover specialised parameters for each environment. However, we learn specialised TGs instead of specialised policy networks for each environment, and we then learn a single policy over all the specialised TGs. Our method also differs in that we do not evolve environments but define them beforehand, as we are interested in discovering specialised priors rather than in an auto-curriculum over environments. We nevertheless expect that evolving environments could provide further benefits, at the cost of an additional optimization process and more compute; we leave this for future work.
III. METHODS
Our algorithm is composed of two parts: 1) learning a diverse set of environment-specialised TG priors (Section III-A), and 2) policy optimization within the PMTG architecture (Section III-B). Figure 2 shows an overview of the two phases of the EETG algorithm. We describe these two parts in the following subsections.
A. Discovering Diverse Specialised Trajectory Generators
Our goal is to find a diverse set of priors in the form of TGs, each of which is specialised and high-performing for the corresponding task and environment. This problem can be formalised as a Quality-Diversity (QD) optimization problem.
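To connect the two phases, the outline below sketches how the pieces described so far could fit together: phase one runs the environment-indexed QD search (the `qd_over_environments` sketch from the related-work discussion) to fill an archive of specialised TG parameters, and phase two trains a single policy inside the PMTG architecture, where each training episode is assumed to load the TG prior matching its environment. Names such as `evaluate_tg` and `train_pmtg_policy` are illustrative placeholders, not the paper's exact interfaces.

```python
def train_eetg(environments, evaluate_tg, train_pmtg_policy, n_tg_params, qd_iterations, rng):
    """Two-phase EETG outline (illustrative): QD discovery of TG priors, then single-policy training."""
    # Phase 1: discover one specialised TG parameter vector per environment
    # (qd_over_environments is the QD sketch shown earlier).
    archive = qd_over_environments(environments, evaluate_tg, n_tg_params, qd_iterations, rng)
    tg_priors = {env: params for env, (params, fitness) in archive.items() if params is not None}

    # Phase 2: train one policy within the PMTG architecture; each episode is assumed to
    # use the TG prior of its sampled environment (hypothetical RL training routine).
    policy = train_pmtg_policy(tg_priors)
    return tg_priors, policy
```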