STAP: Sequencing Task-Agnostic Policies
Project page: sites.google.com/stanford.edu/stap
Christopher Agia*, Toki Migimatsu*, Jiajun Wu, Jeannette Bohg
Department of Computer Science, Stanford University, California, U.S.A.
Email: {cagia,takatoki,jiajunw,bohg}@stanford.edu
Abstract—Advances in robotic skill acquisition have made it
possible to build general-purpose libraries of learned skills for
downstream manipulation tasks. However, naively executing
these skills one after the other is unlikely to succeed without
accounting for dependencies between actions prevalent in long-
horizon plans. We present Sequencing Task-Agnostic Policies
(STAP), a scalable framework for training manipulation skills
and coordinating their geometric dependencies at planning
time to solve long-horizon tasks never seen by any skill during
training. Given that Q-functions encode a measure of skill
feasibility, we formulate an optimization problem to maximize
the joint success of all skills sequenced in a plan, which we
estimate by the product of their Q-values. Our experiments
indicate that this objective function approximates ground
truth plan feasibility and, when used as a planning objective,
reduces myopic behavior and thereby promotes long-horizon
task success. We further demonstrate how STAP can be used
for task and motion planning by estimating the geometric
feasibility of skill sequences provided by a task planner. We
evaluate our approach in simulation and on a real robot.
I. INTRODUCTION
Performing sequential manipulation tasks requires a robot
to reason about dependencies between actions. Consider
the example in Fig. 1, where the robot needs to retrieve an
object outside of its workspace by first using an L-shaped
hook to pull the target object closer. How the robot picks up
the hook affects whether the target object will be reachable.
Traditionally, planning actions to ensure the geometric fea-
sibility of a sequential manipulation task is handled by mo-
tion planning [1–3], which typically requires full observabil-
ity of the environment state and knowledge of its dynamics.
Learning-based approaches [4–6] can acquire skills without
this privileged information. However, using independently
learned skills to perform unseen long-horizon manipulation
tasks is an unsolved problem. The skills could be myopically
executed one after another to solve a simpler subset of tasks,
but solving more complex tasks requires planning with these
skills to ensure the feasibility of the entire skill sequence.
Prior work focuses on sequencing skills at train time to
solve a small set of sequential manipulation tasks [7, 8].
To contend with long horizons, these methods often learn
skills [9] that consist of a policy and a parameterized manipulation
primitive [10]. The policy predicts the parameters of the
primitive, thereby governing its motion. Such methods are
task-specific in that they need to be trained on skill sequences
that reflect the tasks they might encounter at test time. In our
framework, we assume that a task planner provides a novel
sequence of skills at test time that will then be grounded with
*Authors contributed equally to this work.
Toyota Research Institute provided funds to support this work.
[Fig. 1 panels: "Greedy execution" (top row) vs. "Planning with STAP (Ours)" (bottom row), each executing Pick(hook) followed by Pull(yogurt, hook).]
Fig. 1: Sequential manipulation tasks often contain geometric dependencies
between actions. In this example, the robot needs to use the hook to pull
the block into its kinematic workspace so it is close enough to pick up.
The top row shows how greedy execution of skills results in the robot
picking up the hook in a way that prevents it from reaching the block.
We present a method for planning with skills to maximize long-horizon
success without the need to train the skills on long-horizon tasks.
parameters for manipulation primitives through optimization.
This makes our method task-agnostic, as skills can be se-
quenced to solve long-horizon tasks not seen during training.
At the core of our method, Sequencing Task-Agnostic Poli-
cies (STAP), we use Q-functions to optimize the parameters
of manipulation primitives in a given sequence. Policies and
Q-functions for each skill are acquired through off-the-shelf
Reinforcement Learning. We then define a planning objective
to maximize all Q-functions in a skill sequence, ensuring its
geometric feasibility. To evaluate downstream Q-functions
of future skills, we learn a dynamics model that can predict
future states. We also use Uncertainty Quantification (UQ)
to avoid visiting states that are Out-Of-Distribution (OOD)
for the learned skills. We train all of these components
independently per skill, making it easy to gradually expand
a library of skills without the need to retrain existing ones.
Our contributions are three-fold: we propose 1) a frame-
work to train an extensible library of task-agnostic skills,
2) a planning method that optimizes arbitrary sequences
of skills to solve long-horizon tasks, and 3) a method to
solve Task and Motion Planning (TAMP) problems with
learned skills. In extensive experiments, we demonstrate that
planning with STAP promotes long-horizon success on tasks
with complex geometric dependencies between actions. We
also demonstrate that our framework works on a real robot.
II. RELATED WORK
A. Robot skill learning
How to represent and acquire composable manipulation
skills is a widely studied problem in robotics. A broad class
of methods uses Learning from Demonstration (LfD) [5].
Dynamic Movement Primitives (DMPs) [11–13] are a form
of LfD that learns the parameters of dynamical systems
encoding movements [14–17]. More recent extensions
integrate DMPs with deep neural networks to learn more
flexible policies [18, 19]—for instance, to build a large
library of skills from human video demonstrations [20]. Skill
discovery methods instead identify action patterns in offline
datasets [21] and either distill them into policies [22, 23]
or extract skill priors for use in downstream tasks [24, 25].
Robot skills can also be acquired via active learning [26],
Reinforcement Learning (RL) [27–31], and offline RL [32].
An advantage of our planning framework is that it is
agnostic to the types of skills employed, requiring only that
it is possible to predict the probability of the skill’s success
given the current state and action. Here, we learn skills [9]
that consist of a policy and a parameterized manipulation
primitive [10]. The actions output by the policy are the pa-
rameters of the primitive determining its motion. In STAP, we
will use the Q-functions of the policy to optimize suitable pa-
rameters [20, 28] for a sequence of manipulation primitives.
B. Long-horizon robot planning
Once manipulation skills have been acquired, using
them to perform sequential manipulation tasks remains an
open challenge. [33–36] propose data-driven methods to
determine the symbolic feasibility of skills and only control
their timing, while we seek to ensure the geometric feasibility
of skills by controlling their trajectories. Other techniques
rely on task planning [37, 38], subgoal planning [39], or
meta-adaptation [40, 41] to sequence learned skills to novel
long-horizon goals. However, the tasks considered in these
works do not feature rich geometric dependencies between
actions that necessitate motion planning or skill coordination.
The options framework [42] and the parameterized
action Markov Decision Process (MDP) [43] train a
high-level policy to engage low-level policies [44, 45]
or primitives [8, 46–48] towards long-horizon goals. [49]
proposes a hierarchical RL method that uses the value
functions of lower-level policies as the state space for a
higher-level RL policy. Our work is also related to model-
based RL methods which jointly learn dynamics and reward
models to guide planning [50–52], policy search [53, 54],
or combine both [55, 56]. While these methods demonstrate
that policy hierarchies and model-based planning can enable
RL to solve long-horizon problems, they are typically trained
in the context of a single task. In contrast, we seek to plan
with lower-level skills to solve tasks never seen before.
Closest in spirit to our work is that of Xu et al. [7], Deep
Affordance Foresight (DAF), which proposes to learn a
dynamics model, skill-centric affordances (value functions),
and a skill proposal network that serves as a higher-level
RL policy. We identify several drawbacks with DAF: first,
because DAF relies on multi-task experience for training,
generalizing beyond the distribution of training tasks may
be difficult; second, the dynamics, affordance models, and
skill proposal network need to be trained synchronously,
which complicates expanding the current library of trained
skills; third, their planner samples actions from uniform
random distributions, which prevents DAF from scaling to
high-dimensional action spaces and long horizons. STAP
differs in that our dynamics, policies, and affordances (Q-
functions) are learned independently per skill. Without any
additional training, we combine the skills at planning time
to solve unseen long-horizon tasks. We compare our method
against DAF in the planning experiments (Sec. VII-B).
C. Task and motion planning
TAMP solves problems that require both symbolic and
geometric reasoning [2, 57]. DAF learns a skill proposal
network to replace the typical task planner in TAMP,
akin to [58]. Another prominent line of research learns
components of the TAMP system, often from a dataset of
precomputed solutions [59–64]. The problems we consider
involve complex geometric dependencies between actions
that are typical in TAMP. However, STAP only performs
geometric reasoning and by itself is not a TAMP method. We
demonstrate in experiments (Sec. VII-C) that STAP can be
combined with symbolic planners to solve TAMP problems.
III. PROBLEM SETUP
A. Long-horizon planning
Our objective is to solve long-horizon manipulation tasks
that require sequential execution of learned skills. These
skills come from a skill library $\mathcal{L} = \{\psi^1, \ldots, \psi^K\}$, where each skill $\psi^k$ consists of a parameterized manipulation primitive [10] $\phi^k$ and a learned policy $\pi^k$. A primitive $\phi^k(a^k)$ takes in parameters $a^k$ and executes a series of motor commands on the robot, while a policy $\pi^k(a^k \mid s^k)$ is trained to predict a distribution of suitable parameters $a^k$ from the current state $s^k$. For example, the Pick(a, b) skill may have a primitive which takes as input an end-effector pose and executes a trajectory to pick up object a, where the robot first moves to the commanded pose, closes the gripper to grasp a, and then lifts a off of b. The learned policy $\pi^k$ for this skill will then try to predict end-effector poses to pick up a.
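To make this skill abstraction concrete, below is a minimal Python sketch (not the paper's implementation) of how such a skill library might be represented; the names `Skill`, `primitive`, `policy`, and `q_function` are illustrative placeholders for $\phi^k$, $\pi^k$, and $Q^k$.

```python
from dataclasses import dataclass
from typing import Callable, Dict
import numpy as np

@dataclass
class Skill:
    """A skill psi^k: a parameterized primitive phi^k plus a learned policy pi^k.

    The Q-function Q^k is kept alongside the policy so a planner can score
    candidate primitive parameters (Sec. IV)."""
    name: str                                   # e.g. "Pick(a, b)"
    primitive: Callable[[np.ndarray], None]     # phi^k: executes motor commands for parameters a^k
    policy: Callable[[np.ndarray], np.ndarray]  # pi^k: maps state s^k to primitive parameters a^k
    q_function: Callable[[np.ndarray, np.ndarray], float]  # Q^k(s^k, a^k): estimated success probability

# A skill library L = {psi^1, ..., psi^K}, keyed by skill name.
SkillLibrary = Dict[str, Skill]

def execute_skill(skill: Skill, state: np.ndarray) -> None:
    """Greedy execution of one skill: predict parameters with the policy, then run the primitive."""
    params = skill.policy(state)   # e.g. an end-effector pose for Pick
    skill.primitive(params)        # the primitive handles motion generation and control
```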
We assume access to a high-level planner that computes
plan skeletons (i.e. skill sequences) to achieve a high-level
goal. STAP aims to solve the problem of turning plan skele-
tons into geometrically feasible action plans (i.e. parameters
for each manipulation primitive in the plan skeleton).
STAP is agnostic to the choice of high-level planner. For
instance, it can be used in conjunction with Planning Domain
Definition Language (PDDL) [65] task planners to perform
hierarchical TAMP [66]. In this setup, the task planner and
STAP will be queried numerous times to find multiple plan
skeletons grounded with optimized action plans. STAP will
also evaluate each action plan’s probability of success (i.e.
its geometric feasibility). After some termination criterion is
met, such as a timeout, the candidate plan skeleton and action
plan with the highest probability of success are returned.
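The interplay between the task planner and STAP described above can be sketched as a simple anytime loop. The interfaces `task_planner.propose_skeleton`, `stap.optimize`, and `stap.success_probability` below are hypothetical stand-ins used only for illustration, not the system's actual API.

```python
import time

def plan_hierarchical_tamp(task_planner, stap, s1, goal, timeout_s=10.0):
    """Anytime outer loop: query the task planner for plan skeletons, ground each
    skeleton with STAP, and keep the grounded plan with the highest success estimate."""
    best_plan, best_score = None, float("-inf")
    start = time.time()
    while time.time() - start < timeout_s:                       # termination criterion: timeout
        skeleton = task_planner.propose_skeleton(s1, goal)       # e.g. from a PDDL task planner
        if skeleton is None:                                     # planner has no more candidates
            break
        actions = stap.optimize(skeleton, s1)                    # optimize primitive parameters (Sec. IV)
        score = stap.success_probability(skeleton, actions, s1)  # value of the Eq. 3 objective
        if score > best_score:
            best_plan, best_score = (skeleton, actions), score
    return best_plan, best_score
```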
B. Task-agnostic policies
We aim to learn policies $\{\pi^1, \ldots, \pi^K\}$ for the skill library $\mathcal{L}$ that can be sequenced by a high-level planner in arbitrary ways to solve any long-horizon task. We call these policies task-agnostic because they are not trained to solve a specific long-horizon task. Instead, each policy $\pi^k$ is associated with a skill-specific contextual bandit (i.e. a single-timestep MDP)

$$\mathcal{M}^k = \langle \mathcal{S}^k, \mathcal{A}^k, T^k, R^k, \rho^k \rangle, \qquad (1)$$

where $\mathcal{S}^k$ is the state space, $\mathcal{A}^k$ is the action space, $T^k(s^{k\prime} \mid s^k, a^k)$ is the transition model, $R^k(s^k, a^k, s^{k\prime})$ is the binary reward function, and $\rho^k(s^k)$ is the initial state distribution. Given a state $s^k$, the policy $\pi^k$ produces an action $a^k$, and the state evolves according to the transition model $T^k(s^{k\prime} \mid s^k, a^k)$. Thus, the transition model encapsulates the execution of the manipulation primitive $\phi^k$ (Sec. III-A).
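As an illustration of this single-timestep structure, the sketch below wraps a skill's primitive in a one-step environment. The constructor arguments are hypothetical stand-ins for $\rho^k$, $\phi^k$, and $R^k$, not the paper's training interface.

```python
import numpy as np

class SkillBandit:
    """One-step MDP M^k = (S^k, A^k, T^k, R^k, rho^k) associated with a single skill psi^k."""

    def __init__(self, sample_initial_state, execute_primitive, reward_fn):
        self.sample_initial_state = sample_initial_state  # draws s^k ~ rho^k
        self.execute_primitive = execute_primitive        # runs phi^k(a^k) from s^k, returns s^k'
        self.reward_fn = reward_fn                        # binary R^k(s^k, a^k, s^k')
        self.state = None

    def reset(self) -> np.ndarray:
        self.state = self.sample_initial_state()
        return self.state

    def step(self, action: np.ndarray):
        """A single timestep: the transition model T^k encapsulates primitive execution."""
        next_state = self.execute_primitive(self.state, action)
        reward = float(self.reward_fn(self.state, action, next_state))  # 1.0 on success, else 0.0
        done = True                                       # the episode always ends after one step
        return next_state, reward, done, {}
```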
A long-horizon domain is one in which each timestep involves the execution of a single policy, and it is specified by

$$\mathcal{M} = \langle \mathcal{M}^{1:K}, \mathcal{S}, T^{1:K}, \rho^{1:K}, \Gamma^{1:K} \rangle, \qquad (2)$$

where $\mathcal{M}^{1:K}$ is the set of MDPs whose policies can be executed in the long-horizon domain, $\mathcal{S}$ is the state space of the long-horizon domain, $T^k(s' \mid s, a^k)$ is an extension of the dynamics $T^k(s^{k\prime} \mid s^k, a^k)$ that models how the entire long-horizon state evolves with action $a^k$, $\rho^k(s)$ is an extension of the initial state distribution $\rho^k(s^k)$ over the long-horizon state space, and $\Gamma^k : \mathcal{S} \to \mathcal{S}^k$ is a function that maps from the long-horizon state space to the state space of policy $k$. We assume that the dynamics $T^k(s^{k\prime} \mid s^k, a^k)$, $T^k(s' \mid s, a^k)$ and the initial state distributions $\rho^k(s^k)$, $\rho^k(s)$ are unknown.

Note that while the policies may have different state spaces $\mathcal{S}^k$, policy states $s^k$ must be obtainable from the long-horizon state space $\mathcal{S}$ via $s^k = \Gamma^k(s)$. This is to ensure that the policies can be used together in the same environment to perform long-horizon tasks. In the base case, all the state spaces are identical and $\Gamma^k$ is simply the identity function. Another case is that $s$ is constructed as the concatenation of all $s^{1:K}$ and $\Gamma^k(s)$ extracts the slice in $s$ corresponding to $s^k$.
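For the concatenation case just described, $\Gamma^k$ amounts to extracting a fixed slice of the long-horizon state. Below is a minimal sketch with made-up skill names and slice boundaries, purely for illustration.

```python
import numpy as np

# Hypothetical layout: the long-horizon state s is the concatenation of the per-skill
# states s^1:K, with each skill's slice fixed at construction time (dimensions made up).
STATE_SLICES = {
    "Pick": slice(0, 12),
    "Pull": slice(12, 24),
}

def gamma(skill_name: str, s: np.ndarray) -> np.ndarray:
    """Gamma^k : S -> S^k, extracting the slice of s corresponding to s^k."""
    return s[STATE_SLICES[skill_name]]

def gamma_identity(skill_name: str, s: np.ndarray) -> np.ndarray:
    """Base case: all state spaces are identical, so Gamma^k is the identity."""
    return s
```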
IV. SEQUENCING TASK-AGNOSTIC POLICIES
Given a task in the form of a sequence of skills to execute,
our planning framework constructs an optimization problem
with the policies, Q-functions, and dynamics models of each
skill. Solving the optimization problem results in parameters
for all manipulation primitives in the skill sequence such that
the entire sequence’s probability of success is maximized.
We formalize our planning methodology in this section
and outline its implementation in Sec. V. Lastly, we describe
our procedure for training modular skill libraries in Sec. VI.
A. Grounding skill sequences with action plans
We assume that we are given a plan skeleton of skills $\tau = [\psi_1, \ldots, \psi_H] \in \mathcal{L}^H$ (hereafter denoted by $\tau = \psi_{1:H}$) that should be successfully executed to solve a long-horizon task. Let $\mathcal{M}_h$ with subscript $h$ denote the MDP corresponding to the $h$-th skill in the sequence, in contrast to $\mathcal{M}^k$ with superscript $k$, which denotes the $k$-th MDP in the skill library. A long-horizon task is considered successful if every skill reward $r_1, \ldots, r_H$ received during execution is 1.
[Fig. 2 panels: (a) Place: $Q_{\text{Place}}(s, a)$; (b) Push: $Q_{\text{Push}}(s, a)$; (c) Objective: $Q_{\text{Place}} \cdot Q_{\text{Push}}$; (d) Action: $\arg\max_a Q_{\text{Place}} \cdot Q_{\text{Push}}$.]
Fig. 2: Planning in a 2D toy domain. The agent needs to get the green block under the brown receptacle with two skills, Place() and Push(), that operate on the horizontal position $x$ of the green block. Plots (a) and (b) show the Q-functions across $(x, \theta)$ for each skill. Place() is only trained to get the green block on the ground, so the planner must determine $a = x$ s.t. Push() is unobstructed. The optimal action maximizes the probability of long-horizon task success (Eq. 3), approximated by the product of Q-functions in plot (c).
Given an initial state $s_1 \in \mathcal{S}$, our problem is to ground the plan skeleton $\tau = \psi_{1:H}$ with an action plan $\xi = [a_1, \ldots, a_H] \in \mathcal{A}_1 \times \cdots \times \mathcal{A}_H$ that maximizes the probability of succeeding at the long-horizon task. This is framed as an optimization problem $\arg\max_{a_{1:H}} J$, where the maximization objective $J$ is the task success probability

$$J(a_{1:H}; s_1) = p(r_1 = 1, \ldots, r_H = 1 \mid s_1, a_{1:H}).$$

Here, $r_{1:H}$ are the skill rewards received at each timestep. With the long-horizon dynamics models $T^k(s' \mid s, a^k)$, the objective can be cast as the expectation

$$J = \mathbb{E}_{s_{2:H} \sim T_{1:H-1}} \left[ p(r_1 = 1, \ldots, r_H = 1 \mid s_{1:H}, a_{1:H}) \right].$$

By the Markov assumption, rewards are conditionally independent given states and actions. We can express the probability of task success as the product of reward probabilities

$$J = \mathbb{E}_{s_{2:H} \sim T_{1:H-1}} \left[ \prod_{h=1}^{H} p(r_h = 1 \mid s_h, a_h) \right].$$

Because the skill rewards are binary, the skill success probabilities are equivalent to Q-values:

$$p(r_h = 1 \mid s_h, a_h) = \mathbb{E}_{s_{h+1} \sim T_h}\left[ r_h \mid s_h, a_h \right] = Q_h(\Gamma_h(s_h), a_h).$$

The final objective is expressed in terms of Q-values:

$$J = \mathbb{E}_{s_{2:H} \sim T_{1:H-1}} \left[ \prod_{h=1}^{H} Q_h(\Gamma_h(s_h), a_h) \right]. \qquad (3)$$

This planning objective is simply the product of Q-values evaluated along the trajectory $(s_1, a_1, \ldots, s_H, a_H)$, where the states are predicted by the long-horizon dynamics model: $s_2 \sim T_1(\cdot \mid s_1, a_1), \ldots, s_H \sim T_{H-1}(\cdot \mid s_{H-1}, a_{H-1})$.¹
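To make Eq. 3 concrete, the sketch below evaluates the planning objective for one sampled rollout of a candidate action plan. The `dynamics`, `q_functions`, and `gammas` arguments are assumed interfaces corresponding to $T_h$, $Q_h$, and $\Gamma_h$; this is a minimal illustration rather than the paper's implementation.

```python
import numpy as np

def stap_objective(s1, actions, dynamics, q_functions, gammas):
    """Single-rollout estimate of Eq. 3: J(a_1:H; s_1) ~= prod_h Q_h(Gamma_h(s_h), a_h).

    Args:
        s1:          initial long-horizon state.
        actions:     action plan [a_1, ..., a_H] of primitive parameters.
        dynamics:    callables T_h(s, a) -> predicted next long-horizon state s_{h+1}.
        q_functions: callables Q_h(s^h, a) -> estimated success probability in [0, 1].
        gammas:      callables Gamma_h(s) -> s^h mapping to each skill's state space.
    """
    s = s1
    log_j = 0.0
    for h, a in enumerate(actions):
        q = np.clip(q_functions[h](gammas[h](s), a), 1e-6, 1.0)
        log_j += np.log(q)              # accumulate the product of Q-values in log space
        if h < len(actions) - 1:
            s = dynamics[h](s, a)       # predict s_{h+1} ~ T_h(. | s_h, a_h)
    return float(np.exp(log_j))         # approximate probability that all H skills succeed
```

An outer optimizer would then maximize this estimate over $a_{1:H}$, for example with a sampling-based method or by gradient ascent when the Q-functions and dynamics are differentiable; averaging over several dynamics samples gives a lower-variance estimate of the expectation.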
B. Ensuring action plan feasibility
A plan skeleton $\tau = \psi_{1:H}$ is feasible only if, for every pair of consecutive skills $\psi_i$ and $\psi_j$, there is a non-zero overlap between the terminal state distribution of $i$ and the initial state distribution of $j$. More formally,

$$\mathbb{E}_{s_i \sim \rho_i,\, a_i \sim \mathcal{A}_i,\, s_j \sim \rho_j} \left[ T_i(s_j \mid s_i, a_i) \right] > 0, \qquad (4)$$
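The overlap condition in Eq. 4 could, for instance, be checked by Monte Carlo sampling as sketched below. The samplers and the dynamics density are hypothetical stand-ins for $\rho_i$, $\mathcal{A}_i$, $\rho_j$, and $T_i$; this is not the feasibility procedure used by STAP, only an illustration of the expectation being estimated.

```python
def estimate_overlap(sample_rho_i, sample_action_i, sample_rho_j, dynamics_density_i,
                     num_samples=1000):
    """Monte Carlo estimate of the left-hand side of Eq. 4.

    Samples initial states and actions of skill i and initial states of skill j, and
    averages the dynamics density T_i(s_j | s_i, a_i). A strictly positive estimate
    suggests the two consecutive skills have overlapping state distributions."""
    total = 0.0
    for _ in range(num_samples):
        s_i = sample_rho_i()                          # s_i ~ rho_i
        a_i = sample_action_i()                       # a_i ~ A_i (e.g. uniform over parameters)
        s_j = sample_rho_j()                          # s_j ~ rho_j
        total += dynamics_density_i(s_j, s_i, a_i)    # T_i(s_j | s_i, a_i)
    return total / num_samples
```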
¹One might consider maximizing the sum of Q-values instead of the product, but this may not reflect the probability of task success. For example, if we want to optimize a sequence of ten skills, consider a plan that results in nine Q-values of 1 and one Q-value of 0, for a total sum of 9. One Q-value of 0 would indicate just one skill failure, but this is enough to cause a failure for the entire task. Compare this to a plan with ten Q-values of 0.9. This plan has an equivalent sum of 9, but it is preferable because it has a non-zero probability of succeeding.