
reward tuning to obtain reasonable behaviors. Combining
RL and model-based control can help mitigate the reward
tuning issue and generate natural behaviors like trotting and
jumping, e.g., [13], [14], [15]. This line of work often designs the reward based on the locomotion task, i.e., following a desired velocity, and thus exhibits a limited range of behaviors. In this
work, we aim to apply RL to achieve agile behaviors that do
not easily emerge from optimizing locomotion rewards.
C. Imitation-based Reinforcement Learning for Legged Robots
It is often hard to generate policies that behave as intended through task rewards alone. To give the user more control over the behaviors, reference trajectories can be provided to
encourage desired motions. One can design a reward func-
tion to explicitly track the reference trajectories, e.g., [16],
[17], [18]. Inverse reinforcement learning techniques such
as adversarial imitation learning can also be used to learn a
reward function to encourage the policy to produce motions
that look similar to a prescribed motion dataset, e.g., [19],
[20], [21]. There are various ways to obtain a reference
motion, e.g., trajectory optimization [22], [18], [7], [23],
[8], motion capture data from animals [16], [19], or even crude hand-designed motions [17], [24], [21]. In this paper,
we use a tracking-based reward to generate highly dynamic
behaviors. We demonstrate that a reference motion from trajectory optimization is crucial for learning performance
as well as sim-to-real transfer. Furthermore, we explore how
different feedforward components from the optimized motion
can impact learning performance and sim-to-real transfer.
III. METHOD
A. Overview
An overview of our framework is given in Fig. 2.
Trajectory Optimization is used to produce a library of
open-loop reference motion trajectories that are feasible for
the simplified dynamics model used by the optimization.
These reference motions are then used by an imitation-based reinforcement learning framework to produce a closed-loop neural-network feedback controller that makes the full-order robot model mimic the open-loop reference. Finally, the reference motions and network controllers are loaded onto the robot control software for testing on the physical robot. The
following sections describe each component in further detail.
B. Trajectory Optimization
Consider a robot with configuration space $\mathcal{Q} = \mathbb{R}^3 \times SO(3) \times \mathbb{R}^{n_j}$, where $n_j$ is the number of joints on the robot. We wish to obtain a function $\mathbb{R} \to \mathcal{Q} \times T\mathcal{Q} \times \mathbb{R}^{n_j}$: $\phi \mapsto [p(\phi), R(\phi), q(\phi)] \times [\dot{p}(\phi), \omega(\phi), \dot{q}(\phi)] \times \tau(\phi)$, where $p$, $R$, $q$, $\omega$ are the linear position, orientation, joint configuration, and angular velocity of the robot, $\tau(\phi)$ denotes the joint torques needed to accomplish the motion, and $\phi \in [0, T]$ is the timing variable.
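Such a reference is typically stored as a table of knot points that is interpolated at the queried timing variable. The snippet below is a minimal illustrative sketch of such a container, not the exact representation used in our system; the class and field names are hypothetical, knots are assumed to be sampled uniformly in $\phi$, and orientation is kept as roll-pitch-yaw purely for brevity (a real implementation would interpolate rotations properly, e.g., with slerp).

```python
# Illustrative sketch of a reference-motion container (hypothetical names).
import numpy as np

class ReferenceMotion:
    """Hypothetical container for one open-loop reference motion."""

    def __init__(self, T, knots):
        # knots: dict mapping names such as "p", "rpy", "q", "p_dot",
        # "omega", "q_dot", "tau" to arrays of shape (n_knots, dim)
        self.T = float(T)
        self.knots = knots
        n_knots = next(iter(knots.values())).shape[0]
        self.phi_grid = np.linspace(0.0, self.T, n_knots)

    def sample(self, phi):
        """Linearly interpolate every stored quantity at timing variable phi."""
        phi = float(np.clip(phi, 0.0, self.T))
        return {name: np.array([np.interp(phi, self.phi_grid, arr[:, d])
                                for d in range(arr.shape[1])])
                for name, arr in self.knots.items()}
```

A tracking reward in the imitation stage can then compare the simulated robot's state at phase $\phi$ against the quantities returned by `sample(phi)`.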
To avoid the expensive computation needed to optimize the full-order model, a Single Rigid Body (SRB) model is first used to optimize for $p(\phi)$, $R(\phi)$, $\dot{p}(\phi)$, and $\omega(\phi)$, as well as a set of foot positions $p_{\text{foot}}(\phi) = [p_1, p_2, p_3, p_4]$ and ground reaction forces $f(\phi) = [f_1, f_2, f_3, f_4]$. We use the IPOPT interior-point solver [25], interfaced through the CasADi Python library [26], to solve a direct collocation problem that combines elements of [27], [1], and [6] and is designed specifically to quickly and flexibly produce a variety of dynamic motions not limited to locomotion.
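As a rough illustration of this setup, the sketch below declares the SRB decision variables over a fixed horizon with CasADi's Opti interface and selects IPOPT as the solver. The horizon length is an assumed value rather than the one we used, the orientation variables are omitted, and the objective and constraints are added separately (a sketch of the constraints follows the equations below).

```python
# Rough setup sketch (assumed horizon, not our exact script): SRB decision
# variables over N knot points declared with CasADi's Opti interface; IPOPT
# is selected as the NLP solver.  Orientation variables, the objective, and
# the constraints are omitted here.
import casadi as ca

N, dt = 100, 0.02                                  # knot points, fixed 20 ms timestep
opti = ca.Opti()

p     = opti.variable(3, N)                        # body position p(phi)
p_dot = opti.variable(3, N)                        # body linear velocity
omega = opti.variable(3, N)                        # body angular velocity
p_ft  = [opti.variable(3, N) for _ in range(4)]    # foot positions p_1..p_4
f_ft  = [opti.variable(3, N) for _ in range(4)]    # ground reaction forces f_1..f_4

opti.solver("ipopt")                               # IPOPT interior-point solver [25]
```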
The SRB dynamics constraints are given by
$$p^+ = p + \dot{p}\,\Delta t, \quad (1)$$
$$\dot{p}^+ = \dot{p} + \Big(\tfrac{1}{m}\textstyle\sum_i f_i + g\Big)\Delta t, \quad (2)$$
$$R^+ = R\, e\big([\omega\times]\Delta t\big), \quad (3)$$
$$\omega^+ = \omega + \Big({}^{B}I^{-1} R^T \Big(\textstyle\sum_i (p_i - p)\times f_i\Big) - [\omega\times]\,{}^{B}I\,\omega\Big)\Delta t, \quad (4)$$
where, for notational simplicity, we drop the time dependency of the variables $p$, $\dot{p}$, $R$, $\omega$, and use the superscript $+$ to denote a variable at the next timestep. Here $\Delta t$ is the fixed timestep length, set to 20 ms, $m$ is the total mass of the robot, $g$ is the gravitational acceleration vector, ${}^{B}I$ is the body-frame inertia tensor of the robot body, $e(\cdot)$ denotes the matrix exponential, and $[\omega\times] \in \mathbb{R}^{3\times 3}$ is the skew-symmetric cross-product matrix produced by $\omega \in \mathbb{R}^3$ [6]. The objective is a Linear Quadratic Regulator tracking cost, summed over the fixed trajectory length, that tracks the kinematic initial-guess trajectory, plus a regularization term that smooths the foot trajectories, given by $\frac{1}{\Delta t^2}\big((p_i)^+ - p_i\big)^T R_{\dot{p}} \big((p_i)^+ - p_i\big)$ for a diagonal regularization weight matrix $R_{\dot{p}}$. Following [27],
we do not fix the foot contact locations and timings and allow
the optimizer to choose foot swing phase trajectories and
contact configurations. Therefore, we impose explicit contact
complementarity constraints
$$(p_i)_z \ge 0, \quad (5)$$
$$(f_i)_z (p_i)_z = 0, \quad (6)$$
$$(f_i)_z \big((p_i)^+_x - (p_i)_x\big) = 0, \quad (7)$$
$$(f_i)_z \big((p_i)^+_y - (p_i)_y\big) = 0, \quad (8)$$
where the subscripts $x$, $y$, and $z$ denote the corresponding components of a vector [27]. Similarly to [6], friction cone constraints are approximated with friction pyramid constraints, along with a maximum force limit $(f)_{z\,\text{max}}$,
$$0 \le (f_i)_z \le (f)_{z\,\text{max}}, \quad (9)$$
$$|(f_i)_x| \le \mu\,(f_i)_z, \quad (10)$$
$$|(f_i)_y| \le \mu\,(f_i)_z. \quad (11)$$
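Continuing the Opti sketch above, the helper below illustrates how the translational dynamics (1)-(2), the complementarity constraints (5)-(8), and the friction pyramid constraints (9)-(11) could be added. It is an assumption-laden sketch rather than our actual implementation: the mass, friction coefficient, and force-limit defaults are placeholders, the rotational dynamics (3)-(4), objective, and kinematic constraints are omitted, and the pyramid is written as pairs of linear inequalities equivalent to (10)-(11).

```python
# Sketch (not our exact implementation) of constraints (1)-(2) and (5)-(11)
# added to the Opti problem declared above; parameter defaults are assumed
# placeholder values.
import casadi as ca

def add_srb_contact_constraints(opti, p, p_dot, p_ft, f_ft,
                                m=12.0, dt=0.02, mu=0.7, fz_max=500.0):
    g = ca.DM([0.0, 0.0, -9.81])                  # gravitational acceleration
    N = p.size2()                                 # number of knot points
    for k in range(N - 1):
        # (1)-(2): discrete translational SRB dynamics
        f_sum = sum(f[:, k] for f in f_ft)
        opti.subject_to(p[:, k + 1] == p[:, k] + p_dot[:, k] * dt)
        opti.subject_to(p_dot[:, k + 1] == p_dot[:, k] + (f_sum / m + g) * dt)

        for p_i, f_i in zip(p_ft, f_ft):
            # (5)-(6): feet stay above the ground, normal force only at contact
            opti.subject_to(p_i[2, k] >= 0)
            opti.subject_to(f_i[2, k] * p_i[2, k] == 0)
            # (7)-(8): no foot sliding while a normal force is applied
            opti.subject_to(f_i[2, k] * (p_i[0, k + 1] - p_i[0, k]) == 0)
            opti.subject_to(f_i[2, k] * (p_i[1, k + 1] - p_i[1, k]) == 0)
            # (9): unilateral normal force with an upper bound
            opti.subject_to(opti.bounded(0, f_i[2, k], fz_max))
            # (10)-(11): friction pyramid as pairs of linear inequalities
            opti.subject_to(f_i[0, k] <=  mu * f_i[2, k])
            opti.subject_to(f_i[0, k] >= -mu * f_i[2, k])
            opti.subject_to(f_i[1, k] <=  mu * f_i[2, k])
            opti.subject_to(f_i[1, k] >= -mu * f_i[2, k])
```

The equality products mirror (6)-(8) as stated; in practice, complementarity constraints of this kind are sometimes relaxed with a small tolerance to ease the interior-point solve.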
Inspired by [1], we impose kinematic constraints as $L_1$-norm constraints in the shoulder plane,
$$\left\| \begin{bmatrix} ({}^{B_i}p_i)_x \\ ({}^{B_i}p_i)_z \end{bmatrix} \right\|_1 \le l_{\text{leg}}, \quad (12)$$
$$({}^{B_i}p_i)_y = 0, \quad (13)$$
where ${}^{B_i}p_i$ is the $i$-th foot position expressed in its corresponding shoulder frame, and $l_{\text{leg}}$ denotes the maximum allowable