Deep Whole-Body Control: Learning a Unified Policy
for Manipulation and Locomotion
Zipeng Fu*    Xuxin Cheng*    Deepak Pathak
Carnegie Mellon University
Figure 1: We present a framework for whole-body control of a legged robot with a robot arm attached. The left half shows how whole-body control achieves a larger workspace through leg bending and stretching. The right half shows different real-world tasks, including wiping a whiteboard, picking up a cup, pressing door-open buttons, placing and throwing a cup into a garbage bin, and picking in cluttered environments. Videos are on the project website.
Abstract:
An attached arm can significantly increase the applicability of legged robots to several mobile manipulation tasks that are not possible for their wheeled or tracked counterparts. The standard hierarchical control pipeline for such legged manipulators is to decouple the controller into separate manipulation and locomotion modules. However, this is ineffective: it requires immense engineering to support coordination between the arm and legs, and errors can propagate across modules, causing non-smooth, unnatural motions. It is also biologically implausible given the evidence for strong motor synergies across limbs. In this work, we propose to learn a unified policy for whole-body control of a legged manipulator using reinforcement learning. We propose Regularized Online Adaptation to bridge the Sim2Real gap for high-DoF control, and Advantage Mixing, which exploits the causal dependency in the action space to overcome local minima when training the whole-body system. We also present a simple design for a low-cost legged manipulator, and find that our unified policy can demonstrate dynamic and agile behaviors across several task setups. Videos are at https://maniploco.github.io
Keywords: Mobile Manipulation, Whole-Body Control, Legged Locomotion
1 Introduction
Legged locomotion has seen impressive performance in the last decade, with results on challenging outdoor and indoor terrains otherwise unreachable by wheeled or tracked robots. However, there are strong limitations to what a legged-only robot can achieve, since even the most basic everyday tasks, besides visual inspection, require some form of manipulation. This has led to widespread interest in and progress towards building legged manipulators, i.e., robots with both legs and arms, primarily
*equal contribution; Zipeng Fu is now at Stanford University.
6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand.
arXiv:2210.10044v1 [cs.RO] 18 Oct 2022
[Figure 2 diagram. (A) Training in Simulation: a privileged info encoder µ maps privileged information e_t (body mass, end-effector mass, CoM, friction, arm strength, leg strength) to an environment-extrinsics latent that regularizes and supervises the adaptation module φ; the unified policy π takes x_t, a_{t-1}, commands, and the extrinsics, and outputs arm and leg actions with advantages A_manip and A_loco. (B) Deployment in Real World: the adaptation module φ and the unified policy π run at 50 Hz from onboard observations, with commands from teleoperation, vision, or demonstrations.]
Figure 2: Whole-body control framework. During training, a unified policy is learned conditioned on environment extrinsics. During deployment, the adaptation module is reused without any real-world fine-tuning. The robot can be commanded in various modes, including teleoperation, vision, and demonstration replay.
achieved so far through physical modeling of dynamics [1, 2, 3, 4, 5, 6]. However, modeling a legged robot with an attached arm is a dynamic, high-DoF, and non-smooth control problem, requiring substantial domain expertise and engineering effort on the part of the designer. The control frameworks are often hierarchical, where kinematic constraints are dealt with separately for different control spaces [7], and are thus limited to operating in constrained settings with limited generalization. Learning-based methods, such as reinforcement learning (RL), could help lower the engineering burden while aiding generalization to diverse scenarios.
However, recent learning-based approaches for legged mobile manipulators [8] have also followed their model-based counterparts [9, 10] by using hierarchical models in a semi-coupled fashion to control the legs and arm. This is ineffective for several practical reasons, including a lack of coordination between the arm and legs, error propagation across modules, and slow, non-smooth, unnatural motions. Furthermore, it is far from the whole-body motor control observed in humans, where studies suggest strong coordination among limbs. In fact, the control of hands and legs is so tied together that they form low-dimensional synergies, as outlined over 70 years ago in a seminal series of writings by the Russian physiologist Nikolai Bernstein [11, 12, 13]. Perhaps the simplest example is how hard it is for humans to move one arm and the corresponding leg in different motions while standing. Whole-body control should not only allow coordination but also extend the capabilities of the individual parts. For instance, our robot bends or stretches its legs with the movement of the arm to extend the reach of the end-effector, as shown in Figure 1.
Unlike legged locomotion, it is not straightforward to scale standard sim2real RL to whole-body control, due to several challenges: (a) High-DoF control: Our robot, shown in Figure 3, has a total of 19 degrees of freedom. The problem is exacerbated in legged manipulators because the control is dynamic, continuous, and high-frequency, which leads to an exponentially large search space even over a few seconds of trajectory. (b) Conflicting objectives and local minima: When the arm tilts to the right, the robot needs to change its walking gait to maintain balance. This curbs the locomotion abilities and makes training prone to learning only one mode (manipulation or locomotion) well. (c) Dependency: When picking up an object from the ground, the end-effector of the arm needs support from the torso through leg bending. This means the absolute performance of manipulation is bounded until the legs can adapt.
In this work, we present both a hardware setup for a customized, low-cost, fully untethered legged manipulator and a method for learning one unified policy to control and coordinate both the legs and the arm, compatible with the diverse operating modes shown in Figure 1. We use our unified policy for whole-body control, i.e., to control the joints of the quadruped legs as well as the manipulator, to simultaneously take the arm end-effector to desired poses and command the quadruped to move at desired velocities.
            | Command Following ($r^{following}$) | Energy ($r^{energy}$) | Alive ($r^{alive}$)
$r^{manip}$ | $0.5 \cdot e^{-\|[p,o]-[p^{cmd},o^{cmd}]\|_1}$ | $-0.004 \cdot \sum_{j \in \text{arm joints}} |\tau_j \dot{q}_j|$ | $0$
$r^{loco}$  | $-0.5 \cdot |v_x - v_x^{cmd}| + 0.15 \cdot e^{-|\omega_{yaw} - \omega_{yaw}^{cmd}|}$ | $-0.00005 \cdot \sum_{i \in \text{leg joints}} |\tau_i \dot{q}_i|^2$ | $0.2 + 0.5 \cdot v_x^{cmd}$
Table 1: Both the manipulation and locomotion rewards follow the form $r^{following} + r^{energy} + r^{alive}$, which encourages command following while penalizing positive mechanical energy consumption to enable smooth motion [17]. We denote forward base linear velocity by $v_x$, base yaw angular velocity by $\omega_{yaw}$, joint torque by $\tau$, and joint velocity by $\dot{q}$.
The key insights of the method are that we can exploit the causal structure in the action space with respect to manipulation and locomotion to stabilize and speed up learning, and that adding regularization to domain adaptation bridges the gap between simulation, with full states, and the real world, with only partial observations.
We perform evaluations on our proposed legged manipulator. Despite immense progress, there exists no easy-to-use legged manipulator for academic labs. The most publicized robot is the Spot Arm from Boston Dynamics [14], but it comes with pre-designed controllers that cannot be changed. Another example is the ANYmal robot with a custom arm [8] from ANYbotics. Notably, both of these hardware setups are expensive (more than 100K USD). We implement a simple design based on the low-cost Go1 robot [15] with a low-cost arm on top (hardware cost of 6K USD). Our legged manipulator can run fully untethered with modest on-board compute. We show the effectiveness of our learned whole-body controller for teleoperation, vision-guided control, as well as an open-loop control setup, across tasks such as picking up objects, throwing garbage, and pressing buttons on walls. Our robot exhibits dynamic and agile leg-arm coordinated motions, as shown in videos at https://maniploco.github.io.
2 Method: A Unified Policy for Coordinated Manipulation and Locomotion
We formulate the unified policy $\pi$ as one neural network whose inputs are the current base state $s^{base}_t \in \mathbb{R}^5$ (roll, pitch, and base angular velocities), arm state $s^{arm}_t \in \mathbb{R}^{12}$ (joint position and velocity of each arm joint), leg state $s^{leg}_t \in \mathbb{R}^{28}$ (joint position and velocity of each leg joint, and foot contact indicators), last action $a_{t-1} \in \mathbb{R}^{18}$, end-effector position and orientation command $[p^{cmd}, o^{cmd}] \in SE(3)$, base velocity command $[v^{cmd}_x, \omega^{cmd}_{yaw}]$, and environment extrinsics $z_t \in \mathbb{R}^{20}$ (details in Section 2.2). The policy outputs target arm joint positions $a^{arm}_t \in \mathbb{R}^6$ and target leg joint positions $a^{leg}_t \in \mathbb{R}^{12}$, which are subsequently converted to torques using PD controllers. We use joint-space position control for both the legs and the arm. As opposed to operational-space control of the arm, joint-space control enables learning to avoid self-collisions and reduces the Sim-to-Real gap; it has also been found useful in other setups involving multiple robot parts, such as bimanual manipulation [16].
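To make this interface concrete, here is a minimal sketch of the unified policy as a single network with a joint-space PD conversion, assuming PyTorch; the hidden sizes, the 7-dimensional encoding of the SE(3) pose command (position plus quaternion), and the helper names (UnifiedPolicy, pd_torques) are illustrative assumptions rather than the paper's implementation.

# Minimal sketch of the unified policy interface (PyTorch assumed).
# Input dimensions follow the text: s_base (5) + s_arm (12) + s_leg (28) +
# a_{t-1} (18) + end-effector pose command (7, assumed xyz + quaternion) +
# base velocity command (2) + extrinsics z_t (20); outputs are 6 arm + 12 leg
# joint position targets.
import torch
import torch.nn as nn

OBS_DIM = 5 + 12 + 28 + 18 + 7 + 2 + 20
ACT_DIM = 6 + 12

class UnifiedPolicy(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, ACT_DIM),
        )

    def forward(self, obs):
        a = self.net(obs)
        return a[..., :6], a[..., 6:]  # (a_arm, a_leg) joint position targets

def pd_torques(q_target, q, qdot, kp=40.0, kd=1.0):
    # Joint-space PD controller converting position targets to joint torques.
    return kp * (q_target - q) - kd * qdot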
Command Vars         | Training Ranges        | Test Ranges
$v_x^{cmd}$          | [0, 0.9]               | [0.8, 1.0]
$\omega_{yaw}^{cmd}$ | [-1.0, 1.0]            | [-1.0, -0.7] & [0.7, 1.0]
$l$                  | [$-2\pi/5$, $2\pi/5$] replaced below; see $p$ row | 
$p$                  | [$-2\pi/5$, $2\pi/5$]  | [$-2\pi/5$, $2\pi/5$]
$y$                  | [$-3\pi/5$, $3\pi/5$]  | [$-3\pi/5$, $3\pi/5$]
$T_{traj}$           | [1, 3]                 | [0.5, 1]
Table 2: Ranges for uniform sampling of command variables.
We use RL to train our policy $\pi$ by maximizing the discounted expected return $\mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$, where $r_t$ is the reward at time step $t$, $\gamma$ is the discount factor, and $T$ is the maximum episode length. The reward $r$ is the sum of the manipulation reward $r^{manip}$ and the locomotion reward $r^{loco}$, as shown in Table 1. Notice that we use the second power of the energy consumption at each leg joint to encourage both a lower average and a lower variance across all leg joints. We follow the simple reward design that encourages minimizing energy consumption from [17].
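As a concrete reading of Table 1, the sketch below evaluates the two reward terms with NumPy; the coefficients and signs follow the reconstructed table above, and the pose error is simplified to an elementwise difference of stacked position/orientation vectors.

# Sketch of the reward terms in Table 1 (NumPy; coefficients as reconstructed above).
import numpy as np

def r_manip(ee_pose, ee_pose_cmd, tau_arm, qdot_arm):
    # ee_pose, ee_pose_cmd: stacked [p, o] vectors (orientation error simplified here).
    following = 0.5 * np.exp(-np.linalg.norm(ee_pose - ee_pose_cmd, ord=1))
    energy = -0.004 * np.sum(np.abs(tau_arm * qdot_arm))
    alive = 0.0
    return following + energy + alive

def r_loco(v_x, v_x_cmd, w_yaw, w_yaw_cmd, tau_leg, qdot_leg):
    following = -0.5 * np.abs(v_x - v_x_cmd) + 0.15 * np.exp(-np.abs(w_yaw - w_yaw_cmd))
    energy = -0.00005 * np.sum(np.abs(tau_leg * qdot_leg) ** 2)  # squared per-joint energy
    alive = 0.2 + 0.5 * v_x_cmd
    return following + energy + alive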
Env Params           | Training Ranges | Test Ranges
Base Extra Payload   | [-0.5, 3.0]     | [5.0, 6.0]
End-Effector Payload | [0, 0.1]        | [0.2, 0.3]
Center of Base Mass  | [-0.15, 0.15]   | [-0.20, 0.20]
Arm Motor Strength   | [0.7, 1.3]      | [0.6, 1.4]
Leg Motor Strength   | [0.9, 1.1]      | [0.7, 1.3]
Friction             | [0.25, 1.75]    | [0.05, 2.5]
Table 3: Ranges for uniform sampling of environment parameters.
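For illustration, the environment parameters in Table 3 can be resampled uniformly from the training ranges at the start of each episode; a minimal sketch follows (the dictionary keys and the resampling hook are assumptions, not the paper's code).

# Sketch: uniform sampling of environment parameters from the Table 3 training ranges.
import random

TRAIN_RANGES = {
    "base_extra_payload": (-0.5, 3.0),
    "ee_payload": (0.0, 0.1),
    "base_com_offset": (-0.15, 0.15),
    "arm_motor_strength": (0.7, 1.3),
    "leg_motor_strength": (0.9, 1.1),
    "friction": (0.25, 1.75),
}

def sample_env_params(ranges=TRAIN_RANGES):
    # Draw one value per parameter at the start of each training episode.
    return {k: random.uniform(lo, hi) for k, (lo, hi) in ranges.items()}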
We parameterize the end-effector position command $p^{cmd}$ in spherical coordinates $(l, p, y)$, where $l$ is the radius of the sphere and $p$ and $y$ are the pitch and yaw angles. The origin of the spherical coordinate system is set at the base of the arm, but is independent of the torso's height, roll, and pitch (details in the Supplementary). We set the end-effector position command $p^{cmd}$ by interpolating between the current end-effector position $p$ and a randomly sampled end-effector position $p^{end}$ every $T_{traj}$ seconds:
$$p^{cmd}_t = \frac{t}{T_{traj}}\, p + \left(1 - \frac{t}{T_{traj}}\right) p^{end}, \quad t \in [0, T_{traj}].$$
$p^{end}$ is resampled if any $p^{cmd}_t$ leads to self-collision or collision with the ground. $o^{cmd}$ is uniformly sampled from $SO(3)$. Table 2 lists the ranges for the sampling of all command variables.
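A minimal sketch of this command-generation procedure (spherical sampling over the Table 2 training ranges, then linear interpolation toward a freshly sampled target every $T_{traj}$ seconds); the spherical-to-Cartesian convention is an assumption, and the collision check that triggers resampling is omitted.

# Sketch: sample an end-effector target in spherical coordinates (l, p, y)
# and linearly interpolate the position command over T_traj seconds.
import math
import random

def sample_target():
    l = random.uniform(0.2, 0.7)                           # radius
    p = random.uniform(-2 * math.pi / 5, 2 * math.pi / 5)  # pitch
    y = random.uniform(-3 * math.pi / 5, 3 * math.pi / 5)  # yaw
    # Spherical -> Cartesian with the origin at the arm base (one possible convention).
    return (l * math.cos(p) * math.cos(y),
            l * math.cos(p) * math.sin(y),
            l * math.sin(p))

def position_command(p_cur, p_end, t, T_traj):
    # Interpolation as written in the text: p_cmd = (t/T) * p + (1 - t/T) * p_end.
    w = min(t / T_traj, 1.0)
    return tuple(w * a + (1.0 - w) * b for a, b in zip(p_cur, p_end))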
2.1 Advantage Mixing for Policy Learning
Training a robust policy for a high-DoF robot is hard. In both the manipulation and locomotion learning literature, researchers have used curriculum learning to ease the learning process by gradually increasing the difficulty of tasks, so that the policy can learn to solve simple tasks first and then tackle difficult ones [18, 19, 20]. However, most of these works require extensive manual tuning of a diverse set of curriculum parameters and careful design of the mechanism for automatic curricula. Instead of introducing a large number of curricula on the learning and environment setups, we rely on a single curriculum with only one parameter to expedite policy learning. Since manipulation tasks are mostly related to arm actions and locomotion tasks largely depend on leg actions, we can encode this inductive bias in policy optimization by mixing the advantage functions for manipulation and locomotion to speed up policy learning. Formally, for a policy with diagonal Gaussian noise and a sampled transition batch $\mathcal{D}$, the training objective with respect to the policy's parameters $\theta_\pi$ is
$$J(\theta_\pi) = \frac{1}{|\mathcal{D}|} \sum_{(s_t, a_t) \in \mathcal{D}} \Big[ \log \pi(a^{arm}_t \mid s_t)\,\big(A^{manip} + \beta A^{loco}\big) + \log \pi(a^{leg}_t \mid s_t)\,\big(\beta A^{manip} + A^{loco}\big) \Big],$$
where $\beta$ is the curriculum parameter that linearly increases from 0 to 1 over $T^{mix}$ timesteps: $\beta = \min(t / T^{mix}, 1)$. $A^{manip}$ and $A^{loco}$ are advantage functions based on $r^{manip}$ and $r^{loco}$, respectively. Intuitively, Advantage Mixing reduces the credit-assignment complexity by first attributing differences in manipulation returns to arm actions and differences in locomotion returns to leg actions, and then gradually annealing the weighted advantage sum to encourage learning arm and leg actions that also help locomotion and manipulation, respectively. We optimize this RL objective with PPO [21].
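For clarity, here is a minimal sketch of the mixed-advantage objective in PyTorch, written in plain policy-gradient form; in the actual training it would sit inside the PPO clipped surrogate [21], which is omitted here.

# Sketch: Advantage Mixing surrogate loss (vanilla policy-gradient form).
import torch

def advantage_mixing_loss(logp_arm, logp_leg, adv_manip, adv_loco, step, T_mix):
    # logp_arm, logp_leg: log pi(a_arm | s), log pi(a_leg | s), shape [batch]
    # adv_manip, adv_loco: advantage estimates from r_manip and r_loco, shape [batch]
    beta = min(step / T_mix, 1.0)   # curriculum parameter: 0 -> 1 over T_mix steps
    obj = logp_arm * (adv_manip + beta * adv_loco) \
        + logp_leg * (beta * adv_manip + adv_loco)
    return -obj.mean()              # negate: maximize the objective via gradient descent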
2.2 Regularized Online Adaptation for Sim-to-Real Transfer
Much prior work on Sim-to-Real transfer utilizes a two-phase teacher-student scheme: first, a teacher network is trained by RL using privileged information that is only available in simulation, and then a student network that uses only onboard observation history imitates the teacher policy, either in the explicit action space or in a latent space [22, 23, 24, 25]. Due to the information gap between the full state available to the teacher network and the partial observability of onboard sensors, the teacher network may provide supervision that is impossible for the student network to predict, resulting in a realizability gap. This problem has also been noted in the embodied-agent community [26]. In addition, the second phase can only start after the convergence of the first phase, adding extra burden to both training and deployment.
To tackle the realizability gap and to remove the two-phase pipeline, we propose Regularized Online Adaptation (shown in Figure 2). Concretely, the encoder $\mu$ takes the privileged information $e_t$ as input and predicts an environment extrinsics latent $z^\mu$ for the unified policy to adapt its behavior in different environments. The adaptation module $\phi$ estimates the environment extrinsics latent $z^\phi$ by conditioning only on recent observation history from the robot's onboard sensors. We jointly train $\mu$ with the unified policy $\pi$ end-to-end by RL, and regularize $z^\mu$ to avoid large deviations from the $z^\phi$ estimated by the adaptation module. The adaptation module $\phi$ is trained by imitating $z^\mu$ online. We formulate the loss function of the whole learning pipeline with respect to the policy's parameters $\theta_\pi$, the privileged information encoder's parameters $\theta_\mu$, and the adaptation module's parameters $\theta_\phi$ as
$$\mathcal{L}(\theta_\pi, \theta_\mu, \theta_\phi) = -J(\theta_\pi, \theta_\mu) + \lambda\,\| z^\mu - \mathrm{sg}[z^\phi] \|_2 + \| \mathrm{sg}[z^\mu] - z^\phi \|_2,$$
where $J(\theta_\pi, \theta_\mu)$ is the RL objective discussed in Section 2.1, $\mathrm{sg}[\cdot]$ is the stop-gradient operator, and $\lambda$ is the Lagrangian multiplier acting as the regularization strength. The loss function can be minimized using dual gradient descent: $\theta_\pi, \theta_\mu \leftarrow \arg\min_{\theta_\pi, \theta_\mu} \mathbb{E}_{(s,a)\sim\pi(\cdot \mid \cdot,\, z^\mu)}[\mathcal{L}]$,
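To make the stop-gradient structure explicit, here is a minimal sketch of the loss in PyTorch, assuming latents z_mu and z_phi produced by the privileged encoder $\mu$ and the adaptation module $\phi$; the dual updates of $\theta_\phi$ and $\lambda$ are omitted, and the negative sign on $J$ follows the reconstruction of the loss above.

# Sketch: Regularized Online Adaptation loss. J is the RL objective from
# Section 2.1 (to be maximized), so it enters the minimized loss with a minus sign.
import torch

def roa_loss(rl_objective, z_mu, z_phi, lam):
    # rl_objective: scalar J(theta_pi, theta_mu); z_mu, z_phi: [batch, 20] latents.
    reg_mu   = lam * (z_mu - z_phi.detach()).norm(dim=-1).mean()  # pull z_mu toward sg[z_phi]
    imit_phi = (z_mu.detach() - z_phi).norm(dim=-1).mean()        # phi imitates sg[z_mu] online
    return -rl_objective + reg_mu + imit_phi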