OPT-Mimic: Imitation of Optimized Trajectories for Dynamic Quadruped Behaviors
Yuni Fuchioka1, Zhaoming Xie1,2, and Michiel van de Panne1
Abstract— Reinforcement Learning (RL) has seen many recent successes for quadruped robot control. The imitation of reference motions provides a simple and powerful prior for guiding learning towards desired solutions without the need for meticulous reward design. While much work uses motion capture data or hand-crafted trajectories as the reference motion, relatively little work has explored the use of reference motions coming from model-based trajectory optimization. In this work, we investigate several design considerations that arise with such a framework, as demonstrated through four dynamic behaviours: trot, front hop, 180 backflip, and biped stepping. These are trained in simulation and transferred to a physical Solo 8 quadruped robot without further adaptation. In particular, we explore the space of feed-forward designs afforded by the trajectory optimizer to understand its impact on RL learning efficiency and sim-to-real transfer. These findings contribute to the long-standing goal of producing robot controllers that combine the interpretability and precision of model-based optimization with the robustness that model-free RL-based controllers offer.
I. INTRODUCTION
Quadruped control has seen significant recent advances emerging from trajectory optimization and reinforcement learning approaches. As a model-based method, trajectory optimization offers fast iteration for designing motions. With appropriate simplifications, it can also be used in real-time for model-predictive control. On the other hand, reinforcement learning (RL) is well suited to providing particularly robust and fast-to-compute control policies. This comes at the cost of offline computation and often requires careful tuning of rewards and hyperparameters in order to arrive at meaningful solutions. A combined solution has the potential of providing the best of both worlds, wherein trajectory optimization provides fast and predictable motion design of a reference motion, after which RL can be used to imitate or mimic that motion.
While RL-based motion-imitation policies have recently seen much success, it remains unclear how they can best be used in conjunction with reference trajectories provided by trajectory optimization. Can the combined approach be used to design dynamic motions with minimal tuning? Which components of the optimized trajectory should be leveraged by the RL policy and the PD-controllers used to control the motions? For example, feedforward joint velocities and joint torques are available, but should they be used? How do these different choices impact sim-to-real transfer? We investigate these questions by designing four motions for the Solo 8 robot, across three feedforward configurations, and with consistent hyperparameter settings across these twelve scenarios, and test these using a Solo 8 robot with predominantly proprioceptive sensing.

1Faculty of Computer Science, The University of British Columbia
2Department of Computer Science, Stanford University

Fig. 1. Snapshots showing the 180-backflip motion produced from the motion generation system considered in this work, including the simple-model trajectory optimization (top), full-model reinforcement learning (middle), and transfer to the physical robot (bottom).
II. RELATED WORK
A. Trajectory Optimization for Legged Robots
Trajectory optimization is a process to generate physically feasible trajectories offline, which can then be tracked through online feedback controllers. Trajectory optimization is particularly challenging for legged robots due to the hybrid dynamics arising from various contact modes, in addition to the high dimensionality and nonconvexity of the resulting problem. Various methods have been proposed to solve trajectory optimization efficiently, including collocation methods, e.g., [1], [2], [3], and shooting-based methods, e.g., [4], [5]. Simplified models such as the single rigid body dynamics model or inverted pendulums can also be used to obtain approximate solutions [6], [7], [8].
B. Reinforcement Learning for Quadrupedal Robots
RL has been used with good success to generate robust locomotion behaviors for quadrupedal robots, e.g., [9], [10], [11], [12]. Without prior knowledge of how a quadruped should move, RL often requires tedious
reward tuning to obtain reasonable behaviors. Combining RL and model-based control can help mitigate the reward-tuning issue and generate natural behaviors such as trotting and jumping, e.g., [13], [14], [15]. This line of work often designs the reward based on the locomotion task, i.e., following a desired velocity, and exhibits a limited range of behaviors. In this work, we aim to apply RL to achieve agile behaviors that do not easily emerge from optimizing locomotion rewards.
C. Imitation-based Reinforcement Learning for Legged Robots
It is often hard to generate policies that behave as intended through task rewards alone. To provide the user more control over the behaviors, reference trajectories can be provided to encourage desired motions. One can design a reward function to explicitly track the reference trajectories, e.g., [16], [17], [18]. Inverse reinforcement learning techniques such as adversarial imitation learning can also be used to learn a reward function that encourages the policy to produce motions that look similar to a prescribed motion dataset, e.g., [19], [20], [21]. There are various ways to obtain a reference motion, e.g., trajectory optimization [22], [18], [7], [23], [8], motion capture data from animals [16], [19], or even crude hand-designed motions [17], [24], [21]. In this paper, we use a tracking-based reward to generate highly dynamic behaviors. We demonstrate how a reference motion from trajectory optimization is crucial for learning performance as well as sim-to-real transfer. Furthermore, we explore how different feedforward components from the optimized motion can impact learning performance and sim-to-real transfer.
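For concreteness, the sketch below shows one common form such a tracking-based reward can take: an exponentiated tracking error against the reference frame, in the style of DeepMimic-like imitation objectives. The weights, gains, and choice of error terms are illustrative assumptions, not the exact reward used in this work.

```python
import numpy as np

def tracking_reward(q, q_ref, qdot, qdot_ref, w_pose=0.6, w_vel=0.4,
                    k_pose=5.0, k_vel=0.1):
    """Illustrative imitation reward: exponentiated joint pose and velocity
    tracking errors against the reference-trajectory frame."""
    pose_err = np.sum((q - q_ref) ** 2)       # joint-angle tracking error
    vel_err = np.sum((qdot - qdot_ref) ** 2)  # joint-velocity tracking error
    return w_pose * np.exp(-k_pose * pose_err) + w_vel * np.exp(-k_vel * vel_err)
```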
III. METHOD
A. Overview
An overview of our framework is given in Fig. 2. Trajectory optimization is used to produce a library of open-loop reference motion trajectories that are feasible for the simplified dynamics model used by the optimization. These reference motions are then used by an imitation-based reinforcement learning framework to produce a neural-network closed-loop feedback controller that enables the full-order robot model to mimic the open-loop reference. Finally, the reference motions and network controllers are loaded onto the robot control software for testing on the physical robot. The following sections describe each component in further detail.
B. Trajectory Optimization
Given a robot with configuration space $\mathcal{Q} = \mathbb{R}^3 \times SO(3) \times \mathbb{R}^{n_j}$, where $n_j$ is the number of joints on the robot, we wish to obtain a function $\mathbb{R} \to \mathcal{Q} \times T\mathcal{Q} \times \mathbb{R}^{n_j}: \phi \mapsto [p(\phi), R(\phi), q(\phi)] \times [\dot{p}(\phi), \omega(\phi), \dot{q}(\phi)] \times \tau(\phi)$, where $p$, $R$, $q$, $\omega$ are the linear position, orientation, joint configuration, and angular velocity of the robot, $\tau(\phi)$ denotes the joint torques needed to accomplish the motion, and $\phi \in [0, T]$ is the timing variable.
To avoid the expensive computation needed to optimize the full-order model, a Single Rigid Body (SRB) model is first used to optimize for $p(\phi)$, $R(\phi)$, $\dot{p}(\phi)$, and $\omega(\phi)$, as well as a set of foot positions $p_{\mathrm{foot}}(\phi) = [p_1, p_2, p_3, p_4]$ and ground reaction forces $f(\phi) = [f_1, f_2, f_3, f_4]$. We use the IPOPT interior-point solver [25], interfaced through the CasADi Python library [26], to solve a direct collocation problem that we formulated by combining elements of [27], [1], and [6], designed specifically to quickly and flexibly produce a variety of dynamic motions not limited to locomotion.
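As a rough illustration of this tooling, the sketch below sets up a toy direct-collocation problem with CasADi's Opti interface and solves it with IPOPT. The horizon length, mass, cost weights, and the restriction to translational dynamics with a single net contact force are simplifying assumptions for brevity; they do not reproduce the full problem described here.

```python
import casadi as ca

# Toy direct-collocation setup in CasADi's Opti stack, solved with IPOPT.
N = 100                           # number of collocation intervals (assumed)
dt = 0.02                         # 20 ms timestep, as in the text

opti = ca.Opti()
p = opti.variable(3, N + 1)       # base position over the horizon
dp = opti.variable(3, N + 1)      # base linear velocity
f = opti.variable(3, N)           # net contact force (single force, for brevity)

m = 2.5                           # illustrative robot mass [kg]
g = ca.DM([0.0, 0.0, -9.81])      # gravitational acceleration

opti.subject_to(p[:, 0] == 0)     # illustrative initial conditions
opti.subject_to(dp[:, 0] == 0)

for k in range(N):
    # Discrete SRB translational dynamics, cf. Eqs. (1)-(2)
    opti.subject_to(p[:, k + 1] == p[:, k] + dp[:, k] * dt)
    opti.subject_to(dp[:, k + 1] == dp[:, k] + (f[:, k] / m + g) * dt)

# Quadratic tracking cost toward a kinematic initial guess, plus force regularization
p_guess = ca.DM.zeros(3, N + 1)   # placeholder kinematic initial-guess trajectory
opti.minimize(ca.sumsqr(p - p_guess) + 1e-4 * ca.sumsqr(f))

opti.solver("ipopt")              # IPOPT interior-point solver [25]
sol = opti.solve()
print(sol.value(p[:, N]))         # optimized final base position
```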
The SRB dynamics constraints are given by
$p^+ = p + \dot{p}\,\Delta t$  (1)
$\dot{p}^+ = \dot{p} + \left(\frac{1}{m}\sum_i f_i + g\right)\Delta t$  (2)
$R^+ = R\, e\!\left([\omega\times]\,\Delta t\right)$  (3)
$\omega^+ = \omega + {}^{B}I^{-1}\!\left(R^T \sum_i (p_i - p)\times f_i - [\omega\times]\,{}^{B}I\,\omega\right)\Delta t$,  (4)
and for notational simplicity, we drop the time dependency on the variables $p$, $\dot{p}$, $R$, $\omega$, and use superscript $+$ to denote the variable at the next timestep. Here $\Delta t$ is the fixed length of the timestep, set to 20 ms, $m$ is the total mass of the robot, $g$ is the gravitational acceleration vector, ${}^{B}I$ is the body-frame inertia matrix of the robot body, $e(\cdot)$ denotes the matrix exponential, and $[\omega\times] \in \mathbb{R}^{3\times 3}$ is the skew-symmetric cross-product matrix produced by $\omega \in \mathbb{R}^3$ [6]. The objective is a Linear Quadratic Regulator tracking cost, summed over the fixed trajectory length, for tracking the kinematic initial-guess trajectory, as well as a regularization term smoothing the foot trajectories, given by $\frac{1}{\Delta t^2}\left((p_i)^+ - p_i\right)^T R_{\dot{p}} \left((p_i)^+ - p_i\right)$ for diagonal regularization weight matrix $R_{\dot{p}}$.
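To make the discrete dynamics concrete, here is a minimal NumPy/SciPy sketch of one integration step of Eqs. (1)-(4), using the reconstruction of Eq. (4) given above; it is an illustration of the update, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import expm

def skew(w):
    """Skew-symmetric cross-product matrix [w x]."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def srb_step(p, dp, R, w, foot_pos, forces, m, I_body, dt=0.02,
             g=np.array([0.0, 0.0, -9.81])):
    """One explicit step of the SRB dynamics, Eqs. (1)-(4).
    foot_pos and forces are lists of world-frame 3-vectors (one per foot)."""
    f_total = np.sum(forces, axis=0)
    tau_world = sum(np.cross(pi - p, fi) for pi, fi in zip(foot_pos, forces))

    p_next = p + dp * dt                                   # Eq. (1)
    dp_next = dp + (f_total / m + g) * dt                  # Eq. (2)
    R_next = R @ expm(skew(w) * dt)                        # Eq. (3)
    w_next = w + np.linalg.solve(                          # Eq. (4)
        I_body, R.T @ tau_world - skew(w) @ (I_body @ w)) * dt
    return p_next, dp_next, R_next, w_next
```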
Following [27],
we do not fix the foot contact locations and timings and allow the optimizer to choose foot swing-phase trajectories and contact configurations. Therefore, we impose explicit contact complementarity constraints
$(p_i)_z \geq 0$  (5)
$(f_i)_z\,(p_i)_z = 0$  (6)
$(f_i)_z\left((p_i)^+_x - (p_i)_x\right) = 0$  (7)
$(f_i)_z\left((p_i)^+_y - (p_i)_y\right) = 0$,  (8)
where subscripts $x$, $y$, and $z$ denote the corresponding component of the vector [27]. Similarly to [6], friction cone constraints are approximated with friction pyramid constraints, along with a maximum force limit given by $(f)_{z\,\mathrm{max}}$,

$0 \leq (f_i)_z \leq (f)_{z\,\mathrm{max}}$  (9)
$|(f_i)_x| \leq \mu\,(f_i)_z$  (10)
$|(f_i)_y| \leq \mu\,(f_i)_z$.  (11)
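A sketch of how Eqs. (5)-(11) could be imposed for a single foot at a single timestep within a CasADi Opti problem is given below; the friction coefficient, force limit, and the splitting of the absolute values into linear inequalities are assumptions made for illustration.

```python
import casadi as ca

def add_contact_constraints(opti, p_i, p_i_next, f_i, mu=0.7, fz_max=30.0):
    """Contact complementarity (5)-(8) and friction pyramid (9)-(11) for one
    foot at one timestep; p_i, p_i_next, f_i are 3x1 decision variables."""
    # Contact complementarity: normal force only when the foot is on the
    # ground, and no tangential foot motion while force is applied.
    opti.subject_to(p_i[2] >= 0)                            # Eq. (5)
    opti.subject_to(f_i[2] * p_i[2] == 0)                   # Eq. (6)
    opti.subject_to(f_i[2] * (p_i_next[0] - p_i[0]) == 0)   # Eq. (7)
    opti.subject_to(f_i[2] * (p_i_next[1] - p_i[1]) == 0)   # Eq. (8)

    # Friction pyramid with a maximum normal force, written here as linear
    # inequalities rather than absolute values.
    opti.subject_to(opti.bounded(0, f_i[2], fz_max))        # Eq. (9)
    opti.subject_to(f_i[0] <= mu * f_i[2])                  # Eq. (10)
    opti.subject_to(-f_i[0] <= mu * f_i[2])
    opti.subject_to(f_i[1] <= mu * f_i[2])                  # Eq. (11)
    opti.subject_to(-f_i[1] <= mu * f_i[2])
```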
Inspired by [1], we impose kinematic constraints as $L_1$-norm constraints in the shoulder plane

$\left\| \begin{bmatrix} ({}^{B_i}p_i)_x \\ ({}^{B_i}p_i)_z \end{bmatrix} \right\|_1 \leq l_{\mathrm{leg}}$  (12)
$({}^{B_i}p_i)_y = 0$,  (13)

where ${}^{B_i}p_i$ is the $i$th foot position in its corresponding shoulder frame, and $l_{\mathrm{leg}}$ denotes the maximum allowable leg extension.
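The shoulder-plane kinematic limits of Eqs. (12)-(13) could likewise be added as constraints; the sketch below uses an assumed leg length and expands the $L_1$ norm into linear inequalities for the solver.

```python
import casadi as ca

def add_kinematic_constraints(opti, p_i_shoulder, l_leg=0.25):
    """Shoulder-plane kinematic limits, Eqs. (12)-(13): the foot stays in the
    shoulder's x-z plane within an L1 'diamond' of radius l_leg.
    l_leg = 0.25 m is an illustrative value, not the Solo 8 specification."""
    x, y, z = p_i_shoulder[0], p_i_shoulder[1], p_i_shoulder[2]
    # |x| + |z| <= l_leg, expanded into four linear inequalities, Eq. (12)
    for sx in (1, -1):
        for sz in (1, -1):
            opti.subject_to(sx * x + sz * z <= l_leg)
    opti.subject_to(y == 0)                                 # Eq. (13)
```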