Deep Whole-Body Control: Learning a Unified Policy
for Manipulation and Locomotion
Zipeng Fu*    Xuxin Cheng*    Deepak Pathak
Carnegie Mellon University
Figure 1: We present a framework for whole-body control of a legged robot with a robot arm attached. The left half shows how whole-body control achieves a larger workspace through leg bending and stretching. The right half shows different real-world tasks, including wiping a whiteboard, picking up a cup, pressing door-open buttons, placing and throwing a cup into a garbage bin, and picking in cluttered environments. Videos are on the project website.
Abstract:
An attached arm can significantly increase the applicability of legged robots to several mobile manipulation tasks that are not possible for their wheeled or tracked counterparts. The standard hierarchical control pipeline for such legged manipulators is to decouple the controller into separate manipulation and locomotion modules. However, this is ineffective: it requires immense engineering to support coordination between the arm and legs, and errors can propagate across modules, causing non-smooth, unnatural motions. It is also biologically implausible given the evidence for strong motor synergies across limbs. In this work, we propose to learn a unified policy for whole-body control of a legged manipulator using reinforcement learning. We propose Regularized Online Adaptation to bridge the Sim2Real gap for high-DoF control, and Advantage Mixing, which exploits the causal dependency in the action space to overcome local minima when training the whole-body system. We also present a simple design for a low-cost legged manipulator, and find that our unified policy can demonstrate dynamic and agile behaviors across several task setups. Videos are at https://maniploco.github.io
Keywords: Mobile Manipulation, Whole-Body Control, Legged Locomotion
1 Introduction
Legged locomotion has seen impressive performance in the last decade, with results on challenging outdoor and indoor terrains otherwise unreachable by wheeled or tracked robots. However, there are strong limitations to what a legged-only robot can achieve, since even the most basic everyday tasks, besides visual inspection, require some form of manipulation. This has led to widespread interest in and progress towards building legged manipulators, i.e., robots with both legs and arms, primarily
*equal contribution; Zipeng Fu is now at Stanford University.
6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand.
arXiv:2210.10044v1 [cs.RO] 18 Oct 2022
[Figure 2 diagram. (A) Training in Simulation: a privileged info encoder µ maps privileged information e_t (body mass, end-effector mass, CoM, friction, arm strength, leg strength) to an environment-extrinsics latent that regularizes and supervises the adaptation module φ; the unified policy π takes x_t, a_{t-1}, commands, and the extrinsics, and outputs arm and leg actions with advantages A_manip and A_loco. (B) Deployment in Real World: the adaptation module φ and the unified policy π run at 50 Hz from onboard observations, with commands from teleoperation, vision, or demonstrations.]
Figure 2: Whole-body control framework. During training, a unified policy is learned conditioned on environment extrinsics. During deployment, the adaptation module is reused without any real-world fine-tuning. The robot can be commanded in various modes, including teleoperation, vision, and demonstration replay.
achieved so far through physical modeling of dynamics [1, 2, 3, 4, 5, 6]. However, modeling a legged robot with an attached arm is a dynamic, high-DoF, and non-smooth control problem, requiring substantial domain expertise and engineering effort on the part of the designer. The control frameworks are often hierarchical, where kinematic constraints are dealt with separately for different control spaces [7], and are thus limited to operating in constrained settings with limited generalization. Learning-based methods, such as reinforcement learning (RL), could help lower the engineering burden while aiding generalization to diverse scenarios.
However, recent learning-based approaches for legged mobile manipulators [8] have also followed their model-based counterparts [9, 10] by using hierarchical models in a semi-coupled fashion to control the legs and arm. This is ineffective for several practical reasons, including a lack of coordination between the arm and legs, error propagation across modules, and slow, non-smooth, unnatural motions. Furthermore, it is far from the whole-body motor control observed in humans, where studies suggest strong coordination among limbs. In fact, the control of hands and legs is so tied together that they form low-dimensional synergies, as outlined over 70 years ago in a seminal series of writings by the Russian physiologist Nikolai Bernstein [11, 12, 13]. Perhaps the simplest example is how hard it is for humans to move one arm and the corresponding leg in different motions while standing. Whole-body control should not only allow coordination but also extend the capabilities of the individual parts. For instance, our robot bends or stretches its legs with the movement of the arm to extend the reach of the end-effector, as shown in Figure 1.
Unlike legged locomotion, it is not straightforward to scale standard sim2real RL to whole-body control, due to several challenges: (a) High-DoF control: Our robot, shown in Figure 3, has a total of 19 degrees of freedom. The problem is exacerbated in legged manipulators because the control is dynamic, continuous, and high-frequency, which leads to an exponentially large search space even over a few seconds of trajectory. (b) Conflicting objectives and local minima: When the arm tilts to the right, the robot needs to change its walking gait to maintain balance. This curbs the locomotion abilities and makes training prone to learning only one mode (manipulation or locomotion) well. (c) Dependency: When picking up an object from the ground, the end-effector of the arm needs support from the torso through leg bending. This means the absolute performance of manipulation is bounded until the legs can adapt.
In this work, we present both a hardware setup for a customized, low-cost, fully untethered legged manipulator and a method for learning one unified policy to control and coordinate both the legs and the arm, compatible with the diverse operating modes shown in Figure 1. We use our unified policy for whole-body control, i.e., to control the joints of the quadruped legs as well as the manipulator, to simultaneously take the arm end-effector to desired poses and command the quadruped to move at desired velocities.
            | Command Following ($r^{following}$) | Energy ($r^{energy}$) | Alive ($r^{alive}$)
$r^{manip}$ | $0.5 \cdot e^{-\|[p,o]-[p^{cmd},o^{cmd}]\|_1}$ | $-0.004 \cdot \sum_{j \in \text{arm joints}} |\tau_j \dot{q}_j|$ | $0$
$r^{loco}$  | $-0.5 \cdot |v_x - v_x^{cmd}| + 0.15 \cdot e^{-|\omega_{yaw} - \omega_{yaw}^{cmd}|}$ | $-0.00005 \cdot \sum_{i \in \text{leg joints}} |\tau_i \dot{q}_i|^2$ | $0.2 + 0.5 \cdot v_x^{cmd}$
Table 1: Both the manipulation and locomotion rewards follow the form $r^{following} + r^{energy} + r^{alive}$, which encourages command following while penalizing positive mechanical energy consumption to enable smooth motion [17]. We denote forward base linear velocity by $v_x$, base yaw angular velocity by $\omega_{yaw}$, joint torque by $\tau$, and joint velocity by $\dot{q}$.
The key insights of the method are that we can exploit the causal structure in the action space with respect to manipulation and locomotion to stabilize and speed up learning, and that adding regularization to domain adaptation bridges the gap between simulation, with full states, and the real world, with only partial observations.
We perform evaluations on our proposed legged manipulator. Despite immense progress, there exists no easy-to-use legged manipulator for academic labs. The most publicized robot is the Spot Arm from Boston Dynamics [14], but it comes with pre-designed controllers that cannot be changed. Another example is the ANYmal robot with a custom arm [8] from ANYbotics. Notably, both of these hardware setups are expensive (more than 100K USD). We implement a simple design based on the low-cost Go1 robot [15] with a low-cost arm on top (hardware cost of 6K USD). Our legged manipulator can run fully untethered with modest on-board compute. We show the effectiveness of our learned whole-body controller for teleoperation, vision-guided control, as well as an open-loop control setup, across tasks such as picking up objects, throwing garbage, and pressing buttons on walls. Our robot exhibits dynamic and agile leg-arm coordinated motions, as shown in videos at https://maniploco.github.io.
2 Method: A Unified Policy for Coordinated Manipulation and Locomotion
We formulate the unified policy $\pi$ as one neural network whose inputs are the current base state $s^{base}_t \in \mathbb{R}^5$ (roll, pitch, and base angular velocities), arm state $s^{arm}_t \in \mathbb{R}^{12}$ (joint position and velocity of each arm joint), leg state $s^{leg}_t \in \mathbb{R}^{28}$ (joint position and velocity of each leg joint, and foot contact indicators), last action $a_{t-1} \in \mathbb{R}^{18}$, end-effector position and orientation command $[p^{cmd}, o^{cmd}] \in SE(3)$, base velocity command $[v^{cmd}_x, \omega^{cmd}_{yaw}]$, and environment extrinsics $z_t \in \mathbb{R}^{20}$ (details in Section 2.2). The policy outputs target arm joint positions $a^{arm}_t \in \mathbb{R}^6$ and target leg joint positions $a^{leg}_t \in \mathbb{R}^{12}$, which are subsequently converted to torques using PD controllers. We use joint-space position control for both the legs and the arm. As opposed to operational-space control of the arm, joint-space control enables learning to avoid self-collisions and reduces the Sim-to-Real gap; it has also been found useful in other setups involving multiple robot parts, such as bimanual manipulation [16].
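To make this interface concrete, here is a minimal sketch of the unified policy as a single network with a joint-space PD conversion, assuming PyTorch; the hidden sizes, the 7-dimensional encoding of the SE(3) pose command (position plus quaternion), and the helper names (UnifiedPolicy, pd_torques) are illustrative assumptions rather than the paper's implementation.

# Minimal sketch of the unified policy interface (PyTorch assumed).
# Input dimensions follow the text: s_base (5) + s_arm (12) + s_leg (28) +
# a_{t-1} (18) + end-effector pose command (7, assumed xyz + quaternion) +
# base velocity command (2) + extrinsics z_t (20); outputs are 6 arm + 12 leg
# joint position targets.
import torch
import torch.nn as nn

OBS_DIM = 5 + 12 + 28 + 18 + 7 + 2 + 20
ACT_DIM = 6 + 12

class UnifiedPolicy(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, ACT_DIM),
        )

    def forward(self, obs):
        a = self.net(obs)
        return a[..., :6], a[..., 6:]  # (a_arm, a_leg) joint position targets

def pd_torques(q_target, q, qdot, kp=40.0, kd=1.0):
    # Joint-space PD controller converting position targets to joint torques.
    return kp * (q_target - q) - kd * qdot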
Command Vars         | Training Ranges        | Test Ranges
$v_x^{cmd}$          | [0, 0.9]               | [0.8, 1.0]
$\omega_{yaw}^{cmd}$ | [-1.0, 1.0]            | [-1.0, -0.7] & [0.7, 1.0]
$l$                  | [$-2\pi/5$, $2\pi/5$] replaced below; see $p$ row | 
$p$                  | [$-2\pi/5$, $2\pi/5$]  | [$-2\pi/5$, $2\pi/5$]
$y$                  | [$-3\pi/5$, $3\pi/5$]  | [$-3\pi/5$, $3\pi/5$]
$T_{traj}$           | [1, 3]                 | [0.5, 1]
Table 2: Ranges for uniform sampling of command variables.
We use RL to train our policy $\pi$ by maximizing the discounted expected return $\mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$, where $r_t$ is the reward at time step $t$, $\gamma$ is the discount factor, and $T$ is the maximum episode length. The reward $r$ is the sum of the manipulation reward $r^{manip}$ and the locomotion reward $r^{loco}$, as shown in Table 1. Notice that we use the second power of the energy consumption at each leg joint to encourage both a lower average and a lower variance across all leg joints. We follow the simple reward design that encourages minimizing energy consumption from [17].
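As a concrete reading of Table 1, the sketch below evaluates the two reward terms with NumPy; the coefficients and signs follow the reconstructed table above, and the pose error is simplified to an elementwise difference of stacked position/orientation vectors.

# Sketch of the reward terms in Table 1 (NumPy; coefficients as reconstructed above).
import numpy as np

def r_manip(ee_pose, ee_pose_cmd, tau_arm, qdot_arm):
    # ee_pose, ee_pose_cmd: stacked [p, o] vectors (orientation error simplified here).
    following = 0.5 * np.exp(-np.linalg.norm(ee_pose - ee_pose_cmd, ord=1))
    energy = -0.004 * np.sum(np.abs(tau_arm * qdot_arm))
    alive = 0.0
    return following + energy + alive

def r_loco(v_x, v_x_cmd, w_yaw, w_yaw_cmd, tau_leg, qdot_leg):
    following = -0.5 * np.abs(v_x - v_x_cmd) + 0.15 * np.exp(-np.abs(w_yaw - w_yaw_cmd))
    energy = -0.00005 * np.sum(np.abs(tau_leg * qdot_leg) ** 2)  # squared per-joint energy
    alive = 0.2 + 0.5 * v_x_cmd
    return following + energy + alive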
Env Params           | Training Ranges | Test Ranges
Base Extra Payload   | [-0.5, 3.0]     | [5.0, 6.0]
End-Effector Payload | [0, 0.1]        | [0.2, 0.3]
Center of Base Mass  | [-0.15, 0.15]   | [-0.20, 0.20]
Arm Motor Strength   | [0.7, 1.3]      | [0.6, 1.4]
Leg Motor Strength   | [0.9, 1.1]      | [0.7, 1.3]
Friction             | [0.25, 1.75]    | [0.05, 2.5]
Table 3: Ranges for uniform sampling of environment parameters.
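For illustration, the environment parameters in Table 3 can be resampled uniformly from the training ranges at the start of each episode; a minimal sketch follows (the dictionary keys and the resampling hook are assumptions, not the paper's code).

# Sketch: uniform sampling of environment parameters from the Table 3 training ranges.
import random

TRAIN_RANGES = {
    "base_extra_payload": (-0.5, 3.0),
    "ee_payload": (0.0, 0.1),
    "base_com_offset": (-0.15, 0.15),
    "arm_motor_strength": (0.7, 1.3),
    "leg_motor_strength": (0.9, 1.1),
    "friction": (0.25, 1.75),
}

def sample_env_params(ranges=TRAIN_RANGES):
    # Draw one value per parameter at the start of each training episode.
    return {k: random.uniform(lo, hi) for k, (lo, hi) in ranges.items()}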
We parameterize the end-effector position command $p^{cmd}$ in spherical coordinates $(l, p, y)$, where $l$ is the radius of the sphere and $p$ and $y$ are the pitch and yaw angles. The origin of the spherical coordinate system is set at the base of the arm, but is independent of the torso's height, roll, and pitch (details in the Supplementary). We set the end-effector position command $p^{cmd}$ by interpolating between the current end-effector position $p$ and a randomly sampled end-effector position $p^{end}$ every $T_{traj}$ seconds:
$$p^{cmd}_t = \frac{t}{T_{traj}}\, p + \left(1 - \frac{t}{T_{traj}}\right) p^{end}, \quad t \in [0, T_{traj}].$$
$p^{end}$ is resampled if any $p^{cmd}_t$ leads to self-collision or collision with the ground. $o^{cmd}$ is uniformly sampled from $SO(3)$. Table 2 lists the ranges for the sampling of all command variables.
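A minimal sketch of this command-generation procedure (spherical sampling over the Table 2 training ranges, then linear interpolation toward a freshly sampled target every $T_{traj}$ seconds); the spherical-to-Cartesian convention is an assumption, and the collision check that triggers resampling is omitted.

# Sketch: sample an end-effector target in spherical coordinates (l, p, y)
# and linearly interpolate the position command over T_traj seconds.
import math
import random

def sample_target():
    l = random.uniform(0.2, 0.7)                           # radius
    p = random.uniform(-2 * math.pi / 5, 2 * math.pi / 5)  # pitch
    y = random.uniform(-3 * math.pi / 5, 3 * math.pi / 5)  # yaw
    # Spherical -> Cartesian with the origin at the arm base (one possible convention).
    return (l * math.cos(p) * math.cos(y),
            l * math.cos(p) * math.sin(y),
            l * math.sin(p))

def position_command(p_cur, p_end, t, T_traj):
    # Interpolation as written in the text: p_cmd = (t/T) * p + (1 - t/T) * p_end.
    w = min(t / T_traj, 1.0)
    return tuple(w * a + (1.0 - w) * b for a, b in zip(p_cur, p_end))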
2.1 Advantage Mixing for Policy Learning
Training a robust policy for a high-DoF robot is hard. In both the manipulation and locomotion learning literature, researchers have used curriculum learning to ease the learning process by gradually increasing the difficulty of tasks, so that the policy can learn to solve simple tasks first and then tackle difficult ones [18, 19, 20]. However, most of these works require extensive manual tuning of a diverse set of curriculum parameters and careful design of the mechanism for automatic curricula. Instead of introducing a large number of curricula on the learning and environment setups, we rely on a single curriculum with only one parameter to expedite policy learning. Since manipulation tasks are mostly related to arm actions and locomotion tasks largely depend on leg actions, we can encode this inductive bias in policy optimization by mixing the advantage functions for manipulation and locomotion to speed up policy learning. Formally, for a policy with diagonal Gaussian noise and a sampled transition batch $\mathcal{D}$, the training objective with respect to the policy's parameters $\theta_\pi$ is
$$J(\theta_\pi) = \frac{1}{|\mathcal{D}|} \sum_{(s_t, a_t) \in \mathcal{D}} \Big[ \log \pi(a^{arm}_t \mid s_t)\,\big(A^{manip} + \beta A^{loco}\big) + \log \pi(a^{leg}_t \mid s_t)\,\big(\beta A^{manip} + A^{loco}\big) \Big],$$
where $\beta$ is the curriculum parameter that linearly increases from 0 to 1 over $T^{mix}$ timesteps: $\beta = \min(t / T^{mix}, 1)$. $A^{manip}$ and $A^{loco}$ are advantage functions based on $r^{manip}$ and $r^{loco}$, respectively. Intuitively, Advantage Mixing reduces the credit-assignment complexity by first attributing differences in manipulation returns to arm actions and differences in locomotion returns to leg actions, and then gradually annealing the weighted advantage sum to encourage learning arm and leg actions that also help locomotion and manipulation, respectively. We optimize this RL objective with PPO [21].
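For clarity, here is a minimal sketch of the mixed-advantage objective in PyTorch, written in plain policy-gradient form; in the actual training it would sit inside the PPO clipped surrogate [21], which is omitted here.

# Sketch: Advantage Mixing surrogate loss (vanilla policy-gradient form).
import torch

def advantage_mixing_loss(logp_arm, logp_leg, adv_manip, adv_loco, step, T_mix):
    # logp_arm, logp_leg: log pi(a_arm | s), log pi(a_leg | s), shape [batch]
    # adv_manip, adv_loco: advantage estimates from r_manip and r_loco, shape [batch]
    beta = min(step / T_mix, 1.0)   # curriculum parameter: 0 -> 1 over T_mix steps
    obj = logp_arm * (adv_manip + beta * adv_loco) \
        + logp_leg * (beta * adv_manip + adv_loco)
    return -obj.mean()              # negate: maximize the objective via gradient descent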
2.2 Regularized Online Adaptation for Sim-to-Real Transfer
Much prior work on Sim-to-Real transfer utilizes a two-phase teacher-student scheme: first, a teacher network is trained by RL using privileged information that is only available in simulation, and then a student network that uses only onboard observation history imitates the teacher policy, either in the explicit action space or in a latent space [22, 23, 24, 25]. Due to the information gap between the full state available to the teacher network and the partial observability of onboard sensors, the teacher network may provide supervision that is impossible for the student network to predict, resulting in a realizability gap. This problem has also been noted in the embodied-agent community [26]. In addition, the second phase can only start after the convergence of the first phase, adding extra burden to both training and deployment.
To tackle the realizability gap and to remove the two-phase pipeline, we propose Regularized Online Adaptation (shown in Figure 2). Concretely, the encoder $\mu$ takes the privileged information $e_t$ as input and predicts an environment extrinsics latent $z^\mu$ for the unified policy to adapt its behavior in different environments. The adaptation module $\phi$ estimates the environment extrinsics latent $z^\phi$ by conditioning only on recent observation history from the robot's onboard sensors. We jointly train $\mu$ with the unified policy $\pi$ end-to-end by RL, and regularize $z^\mu$ to avoid large deviations from the $z^\phi$ estimated by the adaptation module. The adaptation module $\phi$ is trained by imitating $z^\mu$ online. We formulate the loss function of the whole learning pipeline with respect to the policy's parameters $\theta_\pi$, the privileged information encoder's parameters $\theta_\mu$, and the adaptation module's parameters $\theta_\phi$ as
$$\mathcal{L}(\theta_\pi, \theta_\mu, \theta_\phi) = -J(\theta_\pi, \theta_\mu) + \lambda\,\| z^\mu - \mathrm{sg}[z^\phi] \|_2 + \| \mathrm{sg}[z^\mu] - z^\phi \|_2,$$
where $J(\theta_\pi, \theta_\mu)$ is the RL objective discussed in Section 2.1, $\mathrm{sg}[\cdot]$ is the stop-gradient operator, and $\lambda$ is the Lagrangian multiplier acting as the regularization strength. The loss function can be minimized using dual gradient descent: $\theta_\pi, \theta_\mu \leftarrow \arg\min_{\theta_\pi, \theta_\mu} \mathbb{E}_{(s,a)\sim\pi(\cdot \mid \cdot,\, z^\mu)}[\mathcal{L}]$,
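To make the stop-gradient structure explicit, here is a minimal sketch of the loss in PyTorch, assuming latents z_mu and z_phi produced by the privileged encoder $\mu$ and the adaptation module $\phi$; the dual updates of $\theta_\phi$ and $\lambda$ are omitted, and the negative sign on $J$ follows the reconstruction of the loss above.

# Sketch: Regularized Online Adaptation loss. J is the RL objective from
# Section 2.1 (to be maximized), so it enters the minimized loss with a minus sign.
import torch

def roa_loss(rl_objective, z_mu, z_phi, lam):
    # rl_objective: scalar J(theta_pi, theta_mu); z_mu, z_phi: [batch, 20] latents.
    reg_mu   = lam * (z_mu - z_phi.detach()).norm(dim=-1).mean()  # pull z_mu toward sg[z_phi]
    imit_phi = (z_mu.detach() - z_phi).norm(dim=-1).mean()        # phi imitates sg[z_mu] online
    return -rl_objective + reg_mu + imit_phi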