
reward tuning to obtain reasonable behaviors. Combining
RL and model-based control can help mitigate the reward
tuning issue and generate natural behaviors like trotting and
jumping, e.g., [13], [14], [15]. This line of work often designs the reward based on the locomotion task, i.e., following a desired velocity, and thus exhibits a limited range of behaviors. In this
work, we aim to apply RL to achieve agile behaviors that do
not easily emerge from optimizing locomotion rewards.
C. Imitation-based Reinforcement Learning for Legged Robots
It is often hard to generate policies that behave as intended through task rewards alone. To give the user more control over the behaviors, reference trajectories can be provided to
encourage desired motions. One can design a reward func-
tion to explicitly track the reference trajectories, e.g., [16],
[17], [18]. Inverse reinforcement learning techniques such
as adversarial imitation learning can also be used to learn a
reward function to encourage the policy to produce motions
that look similar to a prescribed motion dataset, e.g., [19],
[20], [21]. There are various ways to obtain a reference
motion, e.g., trajectory optimization [22], [18], [7], [23],
[8], motion capture data from animals [16], [19], or even crude hand-designed motions [17], [24], [21]. In this paper,
we use a tracking-based reward to generate highly dynamic
behaviors. We demonstrate that a reference motion from trajectory optimization is crucial for learning performance
as well as sim-to-real transfer. Furthermore, we explore how
different feedforward components from the optimized motion
can impact learning performance and sim-to-real transfer.
III. METHOD
A. Overview
An overview of our framework is given in Fig. 2.
Trajectory Optimization is used to produce a library of
open-loop reference motion trajectories that are feasible for
the simplified dynamics model used by the optimization.
These reference motions are then used by an imitation-based reinforcement learning framework to produce a closed-loop neural-network feedback controller that makes the full-order robot model mimic the open-loop reference. Finally, the reference motions and network controllers are loaded onto the robot control software for testing on the physical robot. The
following sections describe each component in further detail.
B. Trajectory Optimization
Consider a robot with configuration space $\mathcal{Q} = \mathbb{R}^3 \times SO(3) \times \mathbb{R}^{n_j}$, where $n_j$ is the number of joints on the robot. We wish to obtain a function $\mathbb{R} \to \mathcal{Q} \times T\mathcal{Q} \times \mathbb{R}^{n_j}$: $\phi \mapsto [p(\phi), R(\phi), q(\phi)] \times [\dot{p}(\phi), \omega(\phi), \dot{q}(\phi)] \times \tau(\phi)$, where $p$, $R$, $q$, $\omega$ are the linear position, orientation, joint configuration, and angular velocity of the robot, $\tau(\phi)$ denotes the joint torques needed to accomplish the motion, and $\phi \in [0, T]$ is the timing variable.
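Such a reference is typically stored as a table of knot points that is interpolated at the queried timing variable. The snippet below is a minimal illustrative sketch of such a container, not the exact representation used in our system; the class and field names are hypothetical, knots are assumed to be sampled uniformly in $\phi$, and orientation is kept as roll-pitch-yaw purely for brevity (a real implementation would interpolate rotations properly, e.g., with slerp).

```python
# Illustrative sketch of a reference-motion container (hypothetical names).
import numpy as np

class ReferenceMotion:
    """Hypothetical container for one open-loop reference motion."""

    def __init__(self, T, knots):
        # knots: dict mapping names such as "p", "rpy", "q", "p_dot",
        # "omega", "q_dot", "tau" to arrays of shape (n_knots, dim)
        self.T = float(T)
        self.knots = knots
        n_knots = next(iter(knots.values())).shape[0]
        self.phi_grid = np.linspace(0.0, self.T, n_knots)

    def sample(self, phi):
        """Linearly interpolate every stored quantity at timing variable phi."""
        phi = float(np.clip(phi, 0.0, self.T))
        return {name: np.array([np.interp(phi, self.phi_grid, arr[:, d])
                                for d in range(arr.shape[1])])
                for name, arr in self.knots.items()}
```

A tracking reward in the imitation stage can then compare the simulated robot's state at phase $\phi$ against the quantities returned by `sample(phi)`.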
To avoid the expensive computation needed to optimize the full-order model, a Single Rigid Body (SRB) model is first used to optimize for $p(\phi)$, $R(\phi)$, $\dot{p}(\phi)$, and $\omega(\phi)$, as well as a set of foot positions $p_{\text{foot}}(\phi) = [p_1, p_2, p_3, p_4]$ and ground reaction forces $f(\phi) = [f_1, f_2, f_3, f_4]$. We use the IPOPT interior-point solver [25], interfaced through the CasADi Python library [26], to solve a direct collocation problem that combines elements of [27], [1], and [6] and is designed specifically to quickly and flexibly produce a variety of dynamic motions not limited to locomotion.
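As a rough illustration of this setup, the sketch below declares the SRB decision variables over a fixed horizon with CasADi's Opti interface and selects IPOPT as the solver. The horizon length is an assumed value rather than the one we used, the orientation variables are omitted, and the objective and constraints are added separately (a sketch of the constraints follows the equations below).

```python
# Rough setup sketch (assumed horizon, not our exact script): SRB decision
# variables over N knot points declared with CasADi's Opti interface; IPOPT
# is selected as the NLP solver.  Orientation variables, the objective, and
# the constraints are omitted here.
import casadi as ca

N, dt = 100, 0.02                                  # knot points, fixed 20 ms timestep
opti = ca.Opti()

p     = opti.variable(3, N)                        # body position p(phi)
p_dot = opti.variable(3, N)                        # body linear velocity
omega = opti.variable(3, N)                        # body angular velocity
p_ft  = [opti.variable(3, N) for _ in range(4)]    # foot positions p_1..p_4
f_ft  = [opti.variable(3, N) for _ in range(4)]    # ground reaction forces f_1..f_4

opti.solver("ipopt")                               # IPOPT interior-point solver [25]
```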
The SRB dynamics constraints are given by
$$p^+ = p + \dot{p}\,\Delta t, \quad (1)$$
$$\dot{p}^+ = \dot{p} + \Big(\tfrac{1}{m}\textstyle\sum_i f_i + g\Big)\Delta t, \quad (2)$$
$$R^+ = R\, e\big([\omega\times]\Delta t\big), \quad (3)$$
$$\omega^+ = \omega + \Big({}^{B}I^{-1} R^T \Big(\textstyle\sum_i (p_i - p)\times f_i\Big) - [\omega\times]\,{}^{B}I\,\omega\Big)\Delta t, \quad (4)$$
where, for notational simplicity, we drop the time dependency of the variables $p$, $\dot{p}$, $R$, $\omega$, and use the superscript $+$ to denote a variable at the next timestep. Here $\Delta t$ is the fixed timestep length, set to 20 ms, $m$ is the total mass of the robot, $g$ is the gravitational acceleration vector, ${}^{B}I$ is the body-frame inertia tensor of the robot body, $e(\cdot)$ denotes the matrix exponential, and $[\omega\times] \in \mathbb{R}^{3\times 3}$ is the skew-symmetric cross-product matrix produced by $\omega \in \mathbb{R}^3$ [6]. The objective is a Linear Quadratic Regulator tracking cost, summed over the fixed trajectory length, that tracks the kinematic initial-guess trajectory, plus a regularization term that smooths the foot trajectories, given by $\frac{1}{\Delta t^2}\big((p_i)^+ - p_i\big)^T R_{\dot{p}} \big((p_i)^+ - p_i\big)$ for a diagonal regularization weight matrix $R_{\dot{p}}$. Following [27],
we do not fix the foot contact locations and timings and allow
the optimizer to choose foot swing phase trajectories and
contact configurations. Therefore, we impose explicit contact
complementarity constraints
$$(p_i)_z \ge 0, \quad (5)$$
$$(f_i)_z (p_i)_z = 0, \quad (6)$$
$$(f_i)_z \big((p_i)^+_x - (p_i)_x\big) = 0, \quad (7)$$
$$(f_i)_z \big((p_i)^+_y - (p_i)_y\big) = 0, \quad (8)$$
where the subscripts $x$, $y$, and $z$ denote the corresponding components of a vector [27]. Similarly to [6], friction cone constraints are approximated with friction pyramid constraints, along with a maximum force limit $(f)_{z\,\text{max}}$,
$$0 \le (f_i)_z \le (f)_{z\,\text{max}}, \quad (9)$$
$$|(f_i)_x| \le \mu\,(f_i)_z, \quad (10)$$
$$|(f_i)_y| \le \mu\,(f_i)_z. \quad (11)$$
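Continuing the Opti sketch above, the helper below illustrates how the translational dynamics (1)-(2), the complementarity constraints (5)-(8), and the friction pyramid constraints (9)-(11) could be added. It is an assumption-laden sketch rather than our actual implementation: the mass, friction coefficient, and force-limit defaults are placeholders, the rotational dynamics (3)-(4), objective, and kinematic constraints are omitted, and the pyramid is written as pairs of linear inequalities equivalent to (10)-(11).

```python
# Sketch (not our exact implementation) of constraints (1)-(2) and (5)-(11)
# added to the Opti problem declared above; parameter defaults are assumed
# placeholder values.
import casadi as ca

def add_srb_contact_constraints(opti, p, p_dot, p_ft, f_ft,
                                m=12.0, dt=0.02, mu=0.7, fz_max=500.0):
    g = ca.DM([0.0, 0.0, -9.81])                  # gravitational acceleration
    N = p.size2()                                 # number of knot points
    for k in range(N - 1):
        # (1)-(2): discrete translational SRB dynamics
        f_sum = sum(f[:, k] for f in f_ft)
        opti.subject_to(p[:, k + 1] == p[:, k] + p_dot[:, k] * dt)
        opti.subject_to(p_dot[:, k + 1] == p_dot[:, k] + (f_sum / m + g) * dt)

        for p_i, f_i in zip(p_ft, f_ft):
            # (5)-(6): feet stay above the ground, normal force only at contact
            opti.subject_to(p_i[2, k] >= 0)
            opti.subject_to(f_i[2, k] * p_i[2, k] == 0)
            # (7)-(8): no foot sliding while a normal force is applied
            opti.subject_to(f_i[2, k] * (p_i[0, k + 1] - p_i[0, k]) == 0)
            opti.subject_to(f_i[2, k] * (p_i[1, k + 1] - p_i[1, k]) == 0)
            # (9): unilateral normal force with an upper bound
            opti.subject_to(opti.bounded(0, f_i[2, k], fz_max))
            # (10)-(11): friction pyramid as pairs of linear inequalities
            opti.subject_to(f_i[0, k] <=  mu * f_i[2, k])
            opti.subject_to(f_i[0, k] >= -mu * f_i[2, k])
            opti.subject_to(f_i[1, k] <=  mu * f_i[2, k])
            opti.subject_to(f_i[1, k] >= -mu * f_i[2, k])
```

The equality products mirror (6)-(8) as stated; in practice, complementarity constraints of this kind are sometimes relaxed with a small tolerance to ease the interior-point solve.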
Inspired by [1], we impose kinematic constraints as $L_1$-norm constraints in the shoulder plane,
$$\left\| \begin{bmatrix} ({}^{B_i}p_i)_x \\ ({}^{B_i}p_i)_z \end{bmatrix} \right\|_1 \le l_{\text{leg}}, \quad (12)$$
$$({}^{B_i}p_i)_y = 0, \quad (13)$$
where ${}^{B_i}p_i$ is the $i$-th foot position expressed in its corresponding shoulder frame, and $l_{\text{leg}}$ denotes the maximum allowable