STAP: Sequencing Task-Agnostic Policies
Project page: sites.google.com/stanford.edu/stap
Christopher Agia*, Toki Migimatsu*, Jiajun Wu, Jeannette Bohg
Department of Computer Science, Stanford University, California, U.S.A.
Email: {cagia,takatoki,jiajunw,bohg}@stanford.edu
Abstract—Advances in robotic skill acquisition have made it
possible to build general-purpose libraries of learned skills for
downstream manipulation tasks. However, naively executing
these skills one after the other is unlikely to succeed without
accounting for dependencies between actions prevalent in long-
horizon plans. We present Sequencing Task-Agnostic Policies
(STAP), a scalable framework for training manipulation skills
and coordinating their geometric dependencies at planning
time to solve long-horizon tasks never seen by any skill during
training. Given that Q-functions encode a measure of skill
feasibility, we formulate an optimization problem to maximize
the joint success of all skills sequenced in a plan, which we
estimate by the product of their Q-values. Our experiments
indicate that this objective function approximates ground
truth plan feasibility and, when used as a planning objective,
reduces myopic behavior and thereby promotes long-horizon
task success. We further demonstrate how STAP can be used
for task and motion planning by estimating the geometric
feasibility of skill sequences provided by a task planner. We
evaluate our approach in simulation and on a real robot.
I. INTRODUCTION
Performing sequential manipulation tasks requires a robot
to reason about dependencies between actions. Consider
the example in Fig. 1, where the robot needs to retrieve an
object outside of its workspace by first using an L-shaped
hook to pull the target object closer. How the robot picks up
the hook affects whether the target object will be reachable.
Traditionally, planning actions to ensure the geometric fea-
sibility of a sequential manipulation task is handled by mo-
tion planning [1–3], which typically requires full observabil-
ity of the environment state and knowledge of its dynamics.
Learning-based approaches [4–6] can acquire skills without
this privileged information. However, using independently
learned skills to perform unseen long-horizon manipulation
tasks is an unsolved problem. The skills could be myopically
executed one after another to solve a simpler subset of tasks,
but solving more complex tasks requires planning with these
skills to ensure the feasibility of the entire skill sequence.
Prior work focuses on sequencing skills at train time to
solve a small set of sequential manipulation tasks [7, 8].
To contend with long horizons, these methods often learn
skills [9] that consist of a policy and a parameterized manipulation
primitive [10]. The policy predicts the parameters of the
primitive, thereby governing its motion. Such methods are
task-specific in that they need to be trained on skill sequences
that reflect the tasks they might encounter at test time. In our
framework, we assume that a task planner provides a novel
sequence of skills at test time that will then be grounded with
*Authors contributed equally to this work.
Toyota Research Institute provided funds to support this work.
[Fig. 1 panels: "Greedy execution" (top row) vs. "Planning with STAP (Ours)" (bottom row), each executing Pick(hook) followed by Pull(yogurt, hook).]
Fig. 1: Sequential manipulation tasks often contain geometric dependencies
between actions. In this example, the robot needs to use the hook to pull
the block into its kinematic workspace so it is close enough to pick up.
The top row shows how greedy execution of skills results in the robot
picking up the hook in a way that prevents it from reaching the block.
We present a method for planning with skills to maximize long-horizon
success without the need to train the skills on long-horizon tasks.
parameters for manipulation primitives through optimization.
This makes our method task-agnostic, as skills can be se-
quenced to solve long-horizon tasks not seen during training.
At the core of our method, Sequencing Task-Agnostic Poli-
cies (STAP), we use Q-functions to optimize the parameters
of manipulation primitives in a given sequence. Policies and
Q-functions for each skill are acquired through off-the-shelf
Reinforcement Learning. We then define a planning objective
to maximize all Q-functions in a skill sequence, ensuring its
geometric feasibility. To evaluate downstream Q-functions
of future skills, we learn a dynamics model that can predict
future states. We also use Uncertainty Quantification (UQ)
to avoid visiting states that are Out-Of-Distribution (OOD)
for the learned skills. We train all of these components
independently per skill, making it easy to gradually expand
a library of skills without the need to retrain existing ones.
Our contributions are three-fold: we propose 1) a frame-
work to train an extensible library of task-agnostic skills,
2) a planning method that optimizes arbitrary sequences
of skills to solve long-horizon tasks, and 3) a method to
solve Task and Motion Planning (TAMP) problems with
learned skills. In extensive experiments, we demonstrate that
planning with STAP promotes long-horizon success on tasks
with complex geometric dependencies between actions. We
also demonstrate that our framework works on a real robot.
II. RELATED WORK
A. Robot skill learning
How to represent and acquire composable manipulation
skills is a widely studied problem in robotics. A broad class
of methods uses Learning from Demonstration (LfD) [5].
Dynamic Movement Primitives (DMPs) [11–13] are a form
of LfD that learns the parameters of dynamical systems
encoding movements [14–17]. More recent extensions
integrate DMPs with deep neural networks to learn more
flexible policies [18, 19]—for instance, to build a large
library of skills from human video demonstrations [20]. Skill
discovery methods instead identify action patterns in offline
datasets [21] and either distill them into policies [22, 23]
or extract skill priors for use in downstream tasks [24, 25].
Robot skills can also be acquired via active learning [26],
Reinforcement Learning (RL) [27–31], and offline RL [32].
An advantage of our planning framework is that it is
agnostic to the types of skills employed, requiring only that
it is possible to predict the probability of the skill’s success
given the current state and action. Here, we learn skills [9]
that consist of a policy and a parameterized manipulation
primitive [10]. The actions output by the policy are the pa-
rameters of the primitive determining its motion. In STAP, we
will use the Q-functions of the policy to optimize suitable pa-
rameters [20, 28] for a sequence of manipulation primitives.
B. Long-horizon robot planning
Once manipulation skills have been acquired, using
them to perform sequential manipulation tasks remains an
open challenge. [33–36] propose data-driven methods to
determine the symbolic feasibility of skills and only control
their timing, while we seek to ensure the geometric feasibility
of skills by controlling their trajectories. Other techniques
rely on task planning [37, 38], subgoal planning [39], or
meta-adaptation [40, 41] to sequence learned skills to novel
long-horizon goals. However, the tasks considered in these
works do not feature rich geometric dependencies between
actions that necessitate motion planning or skill coordination.
The options framework [42] and the parameterized
action Markov Decision Process (MDP) [43] train a
high-level policy to engage low-level policies [44, 45]
or primitives [8, 46–48] towards long-horizon goals. [49]
proposes a hierarchical RL method that uses the value
functions of lower-level policies as the state space for a
higher-level RL policy. Our work is also related to model-
based RL methods which jointly learn dynamics and reward
models to guide planning [50–52], policy search [53, 54],
or combine both [55, 56]. While these methods demonstrate
that policy hierarchies and model-based planning can enable
RL to solve long-horizon problems, they are typically trained
in the context of a single task. In contrast, we seek to plan
with lower-level skills to solve tasks never seen before.
Closest in spirit to our work is that of Xu et al. [7], Deep
Affordance Foresight (DAF), which proposes to learn a
dynamics model, skill-centric affordances (value functions),
and a skill proposal network that serves as a higher-level
RL policy. We identify several drawbacks with DAF: first,
because DAF relies on multi-task experience for training,
generalizing beyond the distribution of training tasks may
be difficult; second, the dynamics, affordance models, and
skill proposal network need to be trained synchronously,
which complicates expanding the current library of trained
skills; third, their planner samples actions from uniform
random distributions, which prevents DAF from scaling to
high-dimensional action spaces and long horizons. STAP
differs in that our dynamics, policies, and affordances (Q-
functions) are learned independently per skill. Without any
additional training, we combine the skills at planning time
to solve unseen long-horizon tasks. We compare our method
against DAF in the planning experiments (Sec. VII-B).
C. Task and motion planning
TAMP solves problems that require both symbolic and
geometric reasoning [2, 57]. DAF learns a skill proposal
network to replace the typical task planner in TAMP,
akin to [58]. Another prominent line of research learns
components of the TAMP system, often from a dataset of
precomputed solutions [59–64]. The problems we consider
involve complex geometric dependencies between actions
that are typical in TAMP. However, STAP only performs
geometric reasoning and by itself is not a TAMP method. We
demonstrate in experiments (Sec. VII-C) that STAP can be
combined with symbolic planners to solve TAMP problems.
III. PROBLEM SETUP
A. Long-horizon planning
Our objective is to solve long-horizon manipulation tasks
that require sequential execution of learned skills. These
skills come from a skill library $\mathcal{L} = \{\psi^1, \ldots, \psi^K\}$, where each skill $\psi^k$ consists of a parameterized manipulation primitive [10] $\phi^k$ and a learned policy $\pi^k$. A primitive $\phi^k(a^k)$ takes in parameters $a^k$ and executes a series of motor commands on the robot, while a policy $\pi^k(a^k \mid s^k)$ is trained to predict a distribution of suitable parameters $a^k$ from the current state $s^k$. For example, the Pick(a, b) skill may have a primitive which takes as input an end-effector pose and executes a trajectory to pick up object a, where the robot first moves to the commanded pose, closes the gripper to grasp a, and then lifts a off of b. The learned policy $\pi^k$ for this skill will then try to predict end-effector poses to pick up a.
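To make this skill abstraction concrete, below is a minimal Python sketch (not the paper's implementation) of how such a skill library might be represented; the names `Skill`, `primitive`, `policy`, and `q_function` are illustrative placeholders for $\phi^k$, $\pi^k$, and $Q^k$.

```python
from dataclasses import dataclass
from typing import Callable, Dict
import numpy as np

@dataclass
class Skill:
    """A skill psi^k: a parameterized primitive phi^k plus a learned policy pi^k.

    The Q-function Q^k is kept alongside the policy so a planner can score
    candidate primitive parameters (Sec. IV)."""
    name: str                                   # e.g. "Pick(a, b)"
    primitive: Callable[[np.ndarray], None]     # phi^k: executes motor commands for parameters a^k
    policy: Callable[[np.ndarray], np.ndarray]  # pi^k: maps state s^k to primitive parameters a^k
    q_function: Callable[[np.ndarray, np.ndarray], float]  # Q^k(s^k, a^k): estimated success probability

# A skill library L = {psi^1, ..., psi^K}, keyed by skill name.
SkillLibrary = Dict[str, Skill]

def execute_skill(skill: Skill, state: np.ndarray) -> None:
    """Greedy execution of one skill: predict parameters with the policy, then run the primitive."""
    params = skill.policy(state)   # e.g. an end-effector pose for Pick
    skill.primitive(params)        # the primitive handles motion generation and control
```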
We assume access to a high-level planner that computes
plan skeletons (i.e. skill sequences) to achieve a high-level
goal. STAP aims to solve the problem of turning plan skele-
tons into geometrically feasible action plans (i.e. parameters
for each manipulation primitive in the plan skeleton).
STAP is agnostic to the choice of high-level planner. For
instance, it can be used in conjunction with Planning Domain
Definition Language (PDDL) [65] task planners to perform
hierarchical TAMP [66]. In this setup, the task planner and
STAP will be queried numerous times to find multiple plan
skeletons grounded with optimized action plans. STAP will
also evaluate each action plan’s probability of success (i.e.
its geometric feasibility). After some termination criterion is
met, such as a timeout, the candidate plan skeleton and action
plan with the highest probability of success are returned.
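The interplay between the task planner and STAP described above can be sketched as a simple anytime loop. The interfaces `task_planner.propose_skeleton`, `stap.optimize`, and `stap.success_probability` below are hypothetical stand-ins used only for illustration, not the system's actual API.

```python
import time

def plan_hierarchical_tamp(task_planner, stap, s1, goal, timeout_s=10.0):
    """Anytime outer loop: query the task planner for plan skeletons, ground each
    skeleton with STAP, and keep the grounded plan with the highest success estimate."""
    best_plan, best_score = None, float("-inf")
    start = time.time()
    while time.time() - start < timeout_s:                       # termination criterion: timeout
        skeleton = task_planner.propose_skeleton(s1, goal)       # e.g. from a PDDL task planner
        if skeleton is None:                                     # planner has no more candidates
            break
        actions = stap.optimize(skeleton, s1)                    # optimize primitive parameters (Sec. IV)
        score = stap.success_probability(skeleton, actions, s1)  # value of the Eq. 3 objective
        if score > best_score:
            best_plan, best_score = (skeleton, actions), score
    return best_plan, best_score
```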
B. Task-agnostic policies
We aim to learn policies $\{\pi^1, \ldots, \pi^K\}$ for the skill library $\mathcal{L}$ that can be sequenced by a high-level planner in arbitrary ways to solve any long-horizon task. We call these policies task-agnostic because they are not trained to solve a specific long-horizon task. Instead, each policy $\pi^k$ is associated with a skill-specific contextual bandit (i.e. a single-timestep MDP)

$$\mathcal{M}^k = \langle \mathcal{S}^k, \mathcal{A}^k, T^k, R^k, \rho^k \rangle, \qquad (1)$$

where $\mathcal{S}^k$ is the state space, $\mathcal{A}^k$ is the action space, $T^k(s^{k\prime} \mid s^k, a^k)$ is the transition model, $R^k(s^k, a^k, s^{k\prime})$ is the binary reward function, and $\rho^k(s^k)$ is the initial state distribution. Given a state $s^k$, the policy $\pi^k$ produces an action $a^k$, and the state evolves according to the transition model $T^k(s^{k\prime} \mid s^k, a^k)$. Thus, the transition model encapsulates the execution of the manipulation primitive $\phi^k$ (Sec. III-A).
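As an illustration of this single-timestep structure, the sketch below wraps a skill's primitive in a one-step environment. The constructor arguments are hypothetical stand-ins for $\rho^k$, $\phi^k$, and $R^k$, not the paper's training interface.

```python
import numpy as np

class SkillBandit:
    """One-step MDP M^k = (S^k, A^k, T^k, R^k, rho^k) associated with a single skill psi^k."""

    def __init__(self, sample_initial_state, execute_primitive, reward_fn):
        self.sample_initial_state = sample_initial_state  # draws s^k ~ rho^k
        self.execute_primitive = execute_primitive        # runs phi^k(a^k) from s^k, returns s^k'
        self.reward_fn = reward_fn                        # binary R^k(s^k, a^k, s^k')
        self.state = None

    def reset(self) -> np.ndarray:
        self.state = self.sample_initial_state()
        return self.state

    def step(self, action: np.ndarray):
        """A single timestep: the transition model T^k encapsulates primitive execution."""
        next_state = self.execute_primitive(self.state, action)
        reward = float(self.reward_fn(self.state, action, next_state))  # 1.0 on success, else 0.0
        done = True                                       # the episode always ends after one step
        return next_state, reward, done, {}
```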
A long-horizon domain is one in which each timestep involves the execution of a single policy, and it is specified by

$$\mathcal{M} = \langle \mathcal{M}^{1:K}, \mathcal{S}, T^{1:K}, \rho^{1:K}, \Gamma^{1:K} \rangle, \qquad (2)$$

where $\mathcal{M}^{1:K}$ is the set of MDPs whose policies can be executed in the long-horizon domain, $\mathcal{S}$ is the state space of the long-horizon domain, $T^k(s' \mid s, a^k)$ is an extension of the dynamics $T^k(s^{k\prime} \mid s^k, a^k)$ that models how the entire long-horizon state evolves with action $a^k$, $\rho^k(s)$ is an extension of the initial state distribution $\rho^k(s^k)$ over the long-horizon state space, and $\Gamma^k : \mathcal{S} \to \mathcal{S}^k$ is a function that maps from the long-horizon state space to the state space of policy $k$. We assume that the dynamics $T^k(s^{k\prime} \mid s^k, a^k)$, $T^k(s' \mid s, a^k)$ and the initial state distributions $\rho^k(s^k)$, $\rho^k(s)$ are unknown.

Note that while the policies may have different state spaces $\mathcal{S}^k$, policy states $s^k$ must be obtainable from the long-horizon state space $\mathcal{S}$ via $s^k = \Gamma^k(s)$. This is to ensure that the policies can be used together in the same environment to perform long-horizon tasks. In the base case, all the state spaces are identical and $\Gamma^k$ is simply the identity function. Another case is that $s$ is constructed as the concatenation of all $s^{1:K}$ and $\Gamma^k(s)$ extracts the slice in $s$ corresponding to $s^k$.
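For the concatenation case just described, $\Gamma^k$ amounts to extracting a fixed slice of the long-horizon state. Below is a minimal sketch with made-up skill names and slice boundaries, purely for illustration.

```python
import numpy as np

# Hypothetical layout: the long-horizon state s is the concatenation of the per-skill
# states s^1:K, with each skill's slice fixed at construction time (dimensions made up).
STATE_SLICES = {
    "Pick": slice(0, 12),
    "Pull": slice(12, 24),
}

def gamma(skill_name: str, s: np.ndarray) -> np.ndarray:
    """Gamma^k : S -> S^k, extracting the slice of s corresponding to s^k."""
    return s[STATE_SLICES[skill_name]]

def gamma_identity(skill_name: str, s: np.ndarray) -> np.ndarray:
    """Base case: all state spaces are identical, so Gamma^k is the identity."""
    return s
```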
IV. SEQUENCING TASK-AGNOSTIC POLICIES
Given a task in the form of a sequence of skills to execute,
our planning framework constructs an optimization problem
with the policies, Q-functions, and dynamics models of each
skill. Solving the optimization problem results in parameters
for all manipulation primitives in the skill sequence such that
the entire sequence’s probability of success is maximized.
We formalize our planning methodology in this section
and outline its implementation in Sec. V. Lastly, we describe
our procedure for training modular skill libraries in Sec. VI.
A. Grounding skill sequences with action plans
We assume that we are given a plan skeleton of skills $\tau = [\psi_1, \ldots, \psi_H] \in \mathcal{L}^H$ (hereafter denoted by $\tau = \psi_{1:H}$) that should be successfully executed to solve a long-horizon task. Let $\mathcal{M}_h$ with subscript $h$ denote the MDP corresponding to the $h$-th skill in the sequence, in contrast to $\mathcal{M}^k$ with superscript $k$, which denotes the $k$-th MDP in the skill library. A long-horizon task is considered successful if every skill reward $r_1, \ldots, r_H$ received during execution is 1.
[Fig. 2 panels: (a) Place: $Q_{\text{Place}}(s, a)$; (b) Push: $Q_{\text{Push}}(s, a)$; (c) Objective: $Q_{\text{Place}} \cdot Q_{\text{Push}}$; (d) Action: $\arg\max_a Q_{\text{Place}} \cdot Q_{\text{Push}}$.]
Fig. 2: Planning in a 2D toy domain. The agent needs to get the green block under the brown receptacle with two skills, Place() and Push(), that operate on the horizontal position $x$ of the green block. Plots (a) and (b) show the Q-functions across $(x, \theta)$ for each skill. Place() is only trained to get the green block on the ground, so the planner must determine $a = x$ s.t. Push() is unobstructed. The optimal action maximizes the probability of long-horizon task success (Eq. 3), approximated by the product of Q-functions in plot (c).
Given an initial state $s_1 \in \mathcal{S}$, our problem is to ground the plan skeleton $\tau = \psi_{1:H}$ with an action plan $\xi = [a_1, \ldots, a_H] \in \mathcal{A}_1 \times \cdots \times \mathcal{A}_H$ that maximizes the probability of succeeding at the long-horizon task. This is framed as an optimization problem $\arg\max_{a_{1:H}} J$, where the maximization objective $J$ is the task success probability

$$J(a_{1:H}; s_1) = p(r_1 = 1, \ldots, r_H = 1 \mid s_1, a_{1:H}).$$

Here, $r_{1:H}$ are the skill rewards received at each timestep. With the long-horizon dynamics models $T^k(s' \mid s, a^k)$, the objective can be cast as the expectation

$$J = \mathbb{E}_{s_{2:H} \sim T_{1:H-1}} \left[ p(r_1 = 1, \ldots, r_H = 1 \mid s_{1:H}, a_{1:H}) \right].$$

By the Markov assumption, rewards are conditionally independent given states and actions. We can express the probability of task success as the product of reward probabilities

$$J = \mathbb{E}_{s_{2:H} \sim T_{1:H-1}} \left[ \prod_{h=1}^{H} p(r_h = 1 \mid s_h, a_h) \right].$$

Because the skill rewards are binary, the skill success probabilities are equivalent to Q-values:

$$p(r_h = 1 \mid s_h, a_h) = \mathbb{E}_{s_{h+1} \sim T_h}\left[ r_h \mid s_h, a_h \right] = Q_h(\Gamma_h(s_h), a_h).$$

The final objective is expressed in terms of Q-values:

$$J = \mathbb{E}_{s_{2:H} \sim T_{1:H-1}} \left[ \prod_{h=1}^{H} Q_h(\Gamma_h(s_h), a_h) \right]. \qquad (3)$$

This planning objective is simply the product of Q-values evaluated along the trajectory $(s_1, a_1, \ldots, s_H, a_H)$, where the states are predicted by the long-horizon dynamics model: $s_2 \sim T_1(\cdot \mid s_1, a_1), \ldots, s_H \sim T_{H-1}(\cdot \mid s_{H-1}, a_{H-1})$.¹
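To make Eq. 3 concrete, the sketch below evaluates the planning objective for one sampled rollout of a candidate action plan. The `dynamics`, `q_functions`, and `gammas` arguments are assumed interfaces corresponding to $T_h$, $Q_h$, and $\Gamma_h$; this is a minimal illustration rather than the paper's implementation.

```python
import numpy as np

def stap_objective(s1, actions, dynamics, q_functions, gammas):
    """Single-rollout estimate of Eq. 3: J(a_1:H; s_1) ~= prod_h Q_h(Gamma_h(s_h), a_h).

    Args:
        s1:          initial long-horizon state.
        actions:     action plan [a_1, ..., a_H] of primitive parameters.
        dynamics:    callables T_h(s, a) -> predicted next long-horizon state s_{h+1}.
        q_functions: callables Q_h(s^h, a) -> estimated success probability in [0, 1].
        gammas:      callables Gamma_h(s) -> s^h mapping to each skill's state space.
    """
    s = s1
    log_j = 0.0
    for h, a in enumerate(actions):
        q = np.clip(q_functions[h](gammas[h](s), a), 1e-6, 1.0)
        log_j += np.log(q)              # accumulate the product of Q-values in log space
        if h < len(actions) - 1:
            s = dynamics[h](s, a)       # predict s_{h+1} ~ T_h(. | s_h, a_h)
    return float(np.exp(log_j))         # approximate probability that all H skills succeed
```

An outer optimizer would then maximize this estimate over $a_{1:H}$, for example with a sampling-based method or by gradient ascent when the Q-functions and dynamics are differentiable; averaging over several dynamics samples gives a lower-variance estimate of the expectation.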
B. Ensuring action plan feasibility
A plan skeleton $\tau = \psi_{1:H}$ is feasible only if, for every pair of consecutive skills $\psi_i$ and $\psi_j$, there is a non-zero overlap between the terminal state distribution of $i$ and the initial state distribution of $j$. More formally,

$$\mathbb{E}_{s_i \sim \rho_i,\, a_i \sim \mathcal{A}_i,\, s_j \sim \rho_j} \left[ T_i(s_j \mid s_i, a_i) \right] > 0, \qquad (4)$$
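The overlap condition in Eq. 4 could, for instance, be checked by Monte Carlo sampling as sketched below. The samplers and the dynamics density are hypothetical stand-ins for $\rho_i$, $\mathcal{A}_i$, $\rho_j$, and $T_i$; this is not the feasibility procedure used by STAP, only an illustration of the expectation being estimated.

```python
def estimate_overlap(sample_rho_i, sample_action_i, sample_rho_j, dynamics_density_i,
                     num_samples=1000):
    """Monte Carlo estimate of the left-hand side of Eq. 4.

    Samples initial states and actions of skill i and initial states of skill j, and
    averages the dynamics density T_i(s_j | s_i, a_i). A strictly positive estimate
    suggests the two consecutive skills have overlapping state distributions."""
    total = 0.0
    for _ in range(num_samples):
        s_i = sample_rho_i()                          # s_i ~ rho_i
        a_i = sample_action_i()                       # a_i ~ A_i (e.g. uniform over parameters)
        s_j = sample_rho_j()                          # s_j ~ rho_j
        total += dynamics_density_i(s_j, s_i, a_i)    # T_i(s_j | s_i, a_i)
    return total / num_samples
```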
¹One might consider maximizing the sum of Q-values instead of the product, but this may not reflect the probability of task success. For example, if we want to optimize a sequence of ten skills, consider a plan that results in nine Q-values of 1 and one Q-value of 0, for a total sum of 9. One Q-value of 0 would indicate just one skill failure, but this is enough to cause a failure for the entire task. Compare this to a plan with ten Q-values of 0.9. This plan has an equivalent sum of 9, but it is preferable because it has a non-zero probability of succeeding.