Efficient Learning of Locomotion Skills through the Discovery of Diverse Environmental Trajectory Generator Priors

Shikha Surana*1, Bryan Lim*1, Antoine Cully1

*Equal contribution.
1Imperial College London, United Kingdom. {ss5721, bwl116, a.cully}@ic.ac.uk
Abstract: Data-driven learning-based methods have recently been particularly successful at learning robust locomotion controllers for a variety of unstructured terrains. Prior work has shown that incorporating good locomotion priors in the form of trajectory generators (TGs) is effective for efficiently learning complex locomotion skills. However, defining a single good TG becomes increasingly challenging as tasks and environments grow more complex, since it requires extensive tuning and risks reducing the effectiveness of the prior. In this paper, we present Evolved Environmental Trajectory Generators (EETG), a method that learns a diverse set of specialised locomotion priors using Quality-Diversity algorithms while maintaining a single policy within the Policies Modulating TG (PMTG) architecture. The results demonstrate that EETG enables a quadruped robot to successfully traverse a wide range of environments, such as slopes, stairs, rough terrain, and balance beams. Our experiments show that learning a diverse set of specialised TG priors is significantly (5 times) more efficient than using a single, fixed prior when dealing with a wide range of environments.
I. INTRODUCTION
Legged robots [1], [2], [3] have tremendous potential for societal impact as they can be used for applications involving a wide range of environments, such as rough, cluttered, and unstructured terrain. From search and rescue and inspection work [4] to carrying heavy payloads, legged robots have the potential to undertake many of the physical activities that humans and animals are capable of but that are dangerous or harmful to perform.
However, legged robots are also underactuated, high-dimensional systems with many constraints, which makes them challenging to control. Recently, reinforcement learning (RL) approaches [5], [6], [7], [8], [9], [10] have become competitive with more conventional model-based optimization methods [11], [12], [13], [14], demonstrating state-of-the-art locomotion abilities both in simulation and in the real world [5], [6]. These learnt controllers are especially robust when evaluated across many different environments and perturbations. Despite these significant advances, learning-based approaches in robotics are notorious for being sample inefficient and usually require large amounts of data [15], [16]. Researchers have tried to address this problem in a number of ways, for example by improving the sample efficiency of the underlying RL algorithm [17] or by using fast, highly parallel simulators [18], [8]. Another effective approach is to incorporate useful priors into the learning process.
Policies Modulating Trajectory Generators (PMTG) [19] is one such method, which incorporates a parameterized Trajectory Generator (TG) as a prior, separate from the learnt policy. PMTG makes learning complex locomotion tasks easier and demonstrates that a good locomotion prior can significantly improve the efficiency of RL methods [19]. Lee et al. [6] also used this PMTG framework when demonstrating state-of-the-art locomotion across a wide range of environments in the real world, further demonstrating the effectiveness of the TG prior for locomotion.
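To make the role of the TG prior concrete, the following is a minimal sketch of a PMTG-style control step: the policy observes the robot state together with the TG phase, and outputs both TG modulation parameters (frequency and amplitude) and a residual that is added to the TG's open-loop output. The sinusoidal TG shape, the parameter names, and the dummy policy are illustrative assumptions, not the exact formulation of [19].

```python
import numpy as np

class TrajectoryGenerator:
    """Open-loop trajectory generator: a phase-driven periodic joint pattern."""

    def __init__(self, params):
        # params: the TG prior, e.g. [amplitude, offset] (illustrative parameterization).
        self.params = np.asarray(params, dtype=float)
        self.phase = 0.0

    def step(self, frequency, amplitude_scale, dt=0.01):
        # Advance the phase at the policy-modulated frequency.
        self.phase = (self.phase + 2.0 * np.pi * frequency * dt) % (2.0 * np.pi)
        # Sinusoidal swing around the prior offset (assumed TG shape).
        return self.params[1] + amplitude_scale * self.params[0] * np.sin(self.phase)

def pmtg_step(policy, tg, robot_state, dt=0.01):
    """One PMTG control step: final action = TG output + policy residual."""
    # The policy sees the robot state plus the TG phase (encoded as sin/cos).
    obs = np.concatenate([robot_state, [np.sin(tg.phase), np.cos(tg.phase)]])
    frequency, amplitude_scale, residual = policy(obs)  # learnt modulation + residual
    nominal = tg.step(frequency, amplitude_scale, dt)   # open-loop prior contribution
    return nominal + residual

# Example usage with a dummy policy (fixed frequency/amplitude, zero residual).
tg = TrajectoryGenerator([0.2, 0.0])
policy = lambda obs: (1.5, 1.0, 0.0)
action = pmtg_step(policy, tg, robot_state=np.zeros(4))
```

The key design point is that the policy never has to produce the full gait from scratch; it only modulates and corrects the prior, which is what makes the choice of TG parameters so important.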
However, some questions remain surrounding the PMTG method. How are the parameters of the TG defined? What parameters make a good prior? The parameters of the TGs used in prior work are usually defined manually by engineers based on intuition about the locomotion task of interest. For instance, a forward-swinging TG motion is useful when learning to walk forward [19]. On the other hand, for more complex tasks such as learning to walk across a diversity of difficult environments, a more generic and unbiased TG motion of stepping up and down in place had to be used [6]. While this TG choice proved effective, it could reduce the effectiveness of the prior in aiding learning, and it suggests that the policy still has to do the bulk of the work, since it must cope with the different environments on its own. For example, a good TG prior for ascending steps would differ from one for descending steps. In this paper, we address this by learning good priors for tasks instead of manually crafting them. More importantly, we also learn a diverse set of specialised priors using Quality-Diversity (QD) algorithms rather than a single prior.
The main contribution of our work is a novel framework, Evolved Environmental Trajectory Generators (EETG), for discovering a diverse set of specialised Trajectory Generators (TGs) that act as priors for more efficient learning. We demonstrate in our experiments that our method enables a simulated A1 quadruped robot to learn dynamic locomotion behaviors over diverse environment types such as slopes, uneven terrain, and steps. Our experiments show that EETG matches the performance of learning individual TGs and policies for each environment while being significantly more efficient. Our work demonstrates that learning a diverse set of TG priors is more effective than using a single fixed TG, especially when dealing with many tasks and environments.
Fig. 1. A1 quadruped robot, trained with the presented approach, in a variety of challenging environments. (A) Ascending slope. (B) Descending slope. (C) Ascending stairs. (D) Descending stairs. (E) Uneven/rough terrain. (F) Narrow balance beam.

II. RELATED WORK

Legged Locomotion. Locomotion controllers have traditionally been designed using a modular control framework that breaks the difficult control problem down into smaller sub-problems. Each sub-problem makes approximations such as mass-less limbs and point-mass dynamics [11], [13] and applies heuristics [14], which are used alongside trajectory optimization, footstep planning, and model predictive control (MPC) methods [20], [11], [12].
Alternatively, data-driven learning-based approaches are fast becoming a go-to option due to recent advances showing impressive robustness and performance [5], [6], [9], [10], while at the same time requiring less modeling and expert optimization knowledge. One of the key ideas that makes learning-based methods perform so well, and gives these controllers remarkable robustness especially in the real world, is domain randomization (DR). DR of simulator physics and visual features enabled learning dexterous manipulation to solve a Rubik's cube [15]. Hwangbo et al. [5] then demonstrated that DR is also effective for locomotion in dynamic legged systems. The idea of DR can also be extended to curriculum-learning-based methods and applied beyond just the physical parameters and dynamics of the robot. Multiple separate works [6], [21], [22] show that learning through a curriculum of diverse environment types and terrains can result in learnt skills that are robust and generalize to new environments. Similarly, our work utilizes environment diversity as a strategy to learn robust and generalizable controllers.
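As a minimal illustration of the kind of dynamics randomization discussed above, the sketch below samples a fresh set of physics parameters at the start of each training episode. The specific parameter names, ranges, and the commented-out simulator hook are assumptions for illustration, not the setup used in the cited works.

```python
import numpy as np

def sample_randomized_dynamics(rng):
    """Sample one episode's physics parameters (illustrative names and ranges)."""
    return {
        "ground_friction": rng.uniform(0.4, 1.25),
        "base_mass_scale": rng.uniform(0.8, 1.2),      # +/-20% mass/payload variation
        "motor_strength_scale": rng.uniform(0.9, 1.1),
        "control_latency_s": rng.uniform(0.0, 0.04),   # actuation/observation delay
    }

rng = np.random.default_rng(seed=0)
for episode in range(3):
    dynamics = sample_randomized_dynamics(rng)
    # env.reset(dynamics=dynamics)  # hypothetical simulator hook
    print(f"episode {episode}: {dynamics}")
```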
As mentioned in the introduction, we also build on the PMTG control architecture [19] to utilize the effectiveness of priors for learning. However, instead of manually crafting a single fixed parameter vector of the trajectory generator (TG), we learn a diverse and high-performing set of specialised TGs.
Quality-Diversity. Quality-Diversity (QD) [23], [24], [25] is a growing branch of optimization methods that aims to find a large set of diverse and locally optimal solutions, in contrast to conventional optimization algorithms, which find a single objective-maximising solution. QD algorithms have demonstrated their effectiveness across a range of applications including robotics [26], [27], [28], reinforcement learning [29], [30], video-game design [31], [32], engineering design optimization [33], and more.
In the context of robotics, QD algorithms have commonly been used to learn a diverse repertoire of primitive controllers [34], which can then be used effectively for downstream applications such as damage recovery [26], [35], or combined with planning algorithms to perform longer-horizon tasks such as navigating to a goal [35], [36], [37]. In our work, we learn the parameters of a controller represented as an open-loop TG. However, a critical difference in our method is that the descriptors/cells of the QD algorithm are not defined by the behavior of the controller but by the environment (e.g., variations of stairs, rough terrain) in which the controller is evaluated, which is selected before evaluation. This is akin to multi-task MAP-Elites [38].
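The sketch below illustrates this environment-indexed flavour of QD: each archive cell corresponds to one predefined environment, a candidate TG parameter vector is evaluated in an environment chosen before evaluation, and it replaces that cell's elite only if it performs better there. The Gaussian mutation, the dummy fitness, and the `evaluate` interface are placeholders standing in for whatever variation operator and simulator rollout a real implementation would use.

```python
import numpy as np

def qd_over_environments(environments, evaluate, n_params, iterations, rng):
    """Multi-task MAP-Elites-style loop: one cell per environment, keep the best TG params per cell."""
    archive = {env: (None, -np.inf) for env in environments}  # env -> (tg_params, fitness)
    for _ in range(iterations):
        filled = [p for p, _ in archive.values() if p is not None]
        if filled:
            parent = filled[rng.integers(len(filled))]                # parent from any filled cell
            candidate = parent + rng.normal(0.0, 0.1, n_params)       # Gaussian mutation (placeholder)
        else:
            candidate = rng.normal(0.0, 1.0, n_params)                # random initialisation
        env = environments[rng.integers(len(environments))]           # cell chosen before evaluation
        fitness = evaluate(candidate, env)                            # e.g. distance walked in that environment
        if fitness > archive[env][1]:
            archive[env] = (candidate, fitness)                       # candidate becomes the cell's elite
    return archive

# Example usage with a dummy fitness function.
envs = ["slope_up", "slope_down", "stairs_up", "stairs_down", "rough", "beam"]
dummy_eval = lambda params, env: -float(np.sum(params ** 2))  # placeholder fitness
archive = qd_over_environments(envs, dummy_eval, n_params=8, iterations=200,
                               rng=np.random.default_rng(0))
```

A real QD implementation would typically evaluate candidates in batches and use more sophisticated emitters, but the insertion rule, where a candidate only replaces the elite of the environment it was evaluated in, is the core idea.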
QD-like algorithms such as POET [30], [39] have also been used to evolve environments of increasing complexity while learning specialised paired policies for each environment. Similarly, our work also maintains a diversity of environments while trying to discover specialised parameters for each environment. However, we learn specialised TGs instead of specialised policy networks for each environment, and we then learn a single policy over all the specialised TGs. Our method also differs in that we do not evolve environments but define them beforehand, as we are interested in discovering specialised priors rather than in an auto-curriculum over environments. We nevertheless expect that evolving environments could provide further benefits, at the cost of an additional optimization process and more compute; we leave this for future work.
III. METHODS
Our algorithm is composed of two parts: 1) learning a diverse set of environment-specialised TG priors (Section III-A), and 2) policy optimization within the PMTG architecture (Section III-B). Figure 2 shows an overview of the two phases of the EETG algorithm. We describe these two parts in the following subsections.
A. Discovering Diverse Specialised Trajectory Generators
Our goal is to find a diverse set of priors in the form of TGs, each of which is specialised and high-performing for the corresponding task and environment. This problem can be formalised as a Quality-Diversity (QD) optimization problem.
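To connect the two phases, the outline below sketches how the pieces described so far could fit together: phase one runs the environment-indexed QD search (the `qd_over_environments` sketch from the related-work discussion) to fill an archive of specialised TG parameters, and phase two trains a single policy inside the PMTG architecture, where each training episode is assumed to load the TG prior matching its environment. Names such as `evaluate_tg` and `train_pmtg_policy` are illustrative placeholders, not the paper's exact interfaces.

```python
def train_eetg(environments, evaluate_tg, train_pmtg_policy, n_tg_params, qd_iterations, rng):
    """Two-phase EETG outline (illustrative): QD discovery of TG priors, then single-policy training."""
    # Phase 1: discover one specialised TG parameter vector per environment
    # (qd_over_environments is the QD sketch shown earlier).
    archive = qd_over_environments(environments, evaluate_tg, n_tg_params, qd_iterations, rng)
    tg_priors = {env: params for env, (params, fitness) in archive.items() if params is not None}

    # Phase 2: train one policy within the PMTG architecture; each episode is assumed to
    # use the TG prior of its sampled environment (hypothetical RL training routine).
    policy = train_pmtg_policy(tg_priors)
    return tg_priors, policy
```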