Dynamic Movement Primitives (DMPs) [11–13] are a form
of LfD that learns the parameters of dynamical systems
encoding movements [14–17]. More recent extensions
integrate DMPs with deep neural networks to learn more
flexible policies [18, 19]—for instance, to build a large
library of skills from human video demonstrations [20]. Skill
discovery methods instead identify action patterns in offline
datasets [21] and either distill them into policies [22, 23]
or extract skill priors for use in downstream tasks [24, 25].
Robot skills can also be acquired via active learning [26],
Reinforcement Learning (RL) [27–31], and offline RL [32].
An advantage of our planning framework is that it is
agnostic to the types of skills employed, requiring only that
each skill's probability of success can be predicted from
the current state and action. Here, we learn skills [9]
that consist of a policy and a parameterized manipulation
primitive [10]. The actions output by the policy are the pa-
rameters of the primitive determining its motion. In STAP, we
will use the Q-functions of the policy to optimize suitable pa-
rameters [20, 28] for a sequence of manipulation primitives.
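To make this concrete, the following is a minimal sketch of how a skill's Q-function could be used to select primitive parameters by scoring sampled candidates; the function names, the random-shooting sampling scheme, and the placeholder policy and Q-function are illustrative assumptions, not our actual implementation.

```python
import numpy as np

def optimize_primitive_parameters(policy, q_function, state,
                                  num_samples=128, noise_std=0.1):
    """Select primitive parameters by sampling candidates around the policy
    output and keeping the one that the Q-function scores highest."""
    mean = policy(state)                                   # policy's suggested parameters
    candidates = mean + noise_std * np.random.randn(num_samples, mean.shape[-1])
    scores = np.array([q_function(state, a) for a in candidates])
    return candidates[np.argmax(scores)]

# Usage with placeholder policy and Q-function (stand-ins for learned models):
best_params = optimize_primitive_parameters(
    policy=lambda s: np.zeros(4),                          # dummy 4-D primitive parameters
    q_function=lambda s, a: -np.linalg.norm(a - 0.5),      # dummy success score
    state=np.zeros(3),
)
```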
B. Long-horizon robot planning
Once manipulation skills have been acquired, using
them to perform sequential manipulation tasks remains an
open challenge. Data-driven methods [33–36] determine
the symbolic feasibility of skills and only control
their timing, whereas we seek to ensure the geometric feasibility
of skills by controlling their trajectories. Other techniques
rely on task planning [37, 38], subgoal planning [39], or
meta-adaptation [40, 41] to sequence learned skills to novel
long-horizon goals. However, the tasks considered in these
works do not feature rich geometric dependencies between
actions that necessitate motion planning or skill coordination.
Methods based on the options framework [42] and the
parameterized action Markov Decision Process (MDP) [43]
train a high-level policy to invoke low-level policies [44, 45]
or primitives [8, 46–48] towards long-horizon goals. The
hierarchical RL method of [49] uses the value functions
of lower-level policies as the state space of a higher-level
RL policy. Our work is also related to model-based RL
methods that jointly learn dynamics and reward models to
guide planning [50–52], policy search [53, 54], or a
combination of the two [55, 56]. While these methods demonstrate
that policy hierarchies and model-based planning can enable
RL to solve long-horizon problems, they are typically trained
in the context of a single task. In contrast, we seek to plan
with lower-level skills to solve tasks never seen before.
Closest in spirit to our work is that of Xu et al. [7], Deep
Affordance Foresight (DAF), which proposes to learn a
dynamics model, skill-centric affordances (value functions),
and a skill proposal network that serves as a higher-level
RL policy. We identify several drawbacks of DAF: first,
because DAF relies on multi-task experience for training,
generalizing beyond the distribution of training tasks may
be difficult; second, the dynamics, affordance models, and
skill proposal network need to be trained synchronously,
which complicates expanding the current library of trained
skills; third, their planner samples actions from uniform
random distributions, which prevents DAF from scaling to
high-dimensional action spaces and long horizons. STAP
differs in that our dynamics, policies, and affordances (Q-
functions) are learned independently per skill. Without any
additional training, we combine the skills at planning time
to solve unseen long-horizon tasks. We compare our method
against DAF in the planning experiments (Sec. VII-B).
C. Task and motion planning
TAMP solves problems that require both symbolic and
geometric reasoning [2, 57]. DAF learns a skill proposal
network to replace the typical task planner in TAMP,
akin to [58]. Another prominent line of research learns
components of the TAMP system, often from a dataset of
precomputed solutions [59–64]. The problems we consider
involve complex geometric dependencies between actions
that are typical in TAMP. However, STAP only performs
geometric reasoning and by itself is not a TAMP method. We
demonstrate in experiments (Sec. VII-C) that STAP can be
combined with symbolic planners to solve TAMP problems.
III. PROBLEM SETUP
A. Long-horizon planning
Our objective is to solve long-horizon manipulation tasks
that require sequential execution of learned skills. These
skills come from a skill library $\mathcal{L} = \{\psi_1, \dots, \psi_K\}$, where
each skill $\psi_k$ consists of a parameterized manipulation primitive [10] $\phi_k$
and a learned policy $\pi_k$. A primitive $\phi_k(a_k)$ takes in parameters $a_k$
and executes a series of motor commands on the robot, while a policy
$\pi_k(a_k \mid s_k)$ is trained to predict a distribution of suitable
parameters $a_k$ from the current state $s_k$. For example, the
$\texttt{Pick}(a, b)$ skill may have a primitive which takes as input an
end-effector pose and executes a trajectory to pick up object $a$, where the
robot first moves to the commanded pose, closes the gripper to grasp $a$,
and then lifts $a$ off of $b$. The learned policy $\pi_k$ for this skill
will then try to predict end-effector poses to pick up $a$.
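As a rough illustration of this structure, a skill library could be represented as follows; the `Skill` container, the `pick_primitive` stub, and all placeholder values are hypothetical and shown only to clarify the roles of the primitive $\phi_k$, the policy $\pi_k$, and its Q-function.

```python
from dataclasses import dataclass
from typing import Callable, Dict
import numpy as np

@dataclass
class Skill:
    """One entry psi_k of the library L: a parameterized primitive phi_k,
    a learned policy pi_k(a_k | s_k), and its Q-function."""
    primitive: Callable[[np.ndarray], None]                 # phi_k(a_k): executes motor commands
    policy: Callable[[np.ndarray], np.ndarray]              # pi_k(s_k): proposes parameters a_k
    q_function: Callable[[np.ndarray, np.ndarray], float]   # Q_k(s_k, a_k): success estimate

def pick_primitive(end_effector_pose: np.ndarray) -> None:
    """Illustrative Pick(a, b) primitive: move to the commanded pose,
    close the gripper to grasp a, then lift a off of b."""
    # In a real system these would be calls to a motion controller.
    print("move to", end_effector_pose, "; close gripper; lift")

# A skill library L = {psi_1, ..., psi_K}, keyed by skill name.
library: Dict[str, Skill] = {
    "Pick": Skill(
        primitive=pick_primitive,
        policy=lambda s: np.zeros(7),        # placeholder pose (position + quaternion)
        q_function=lambda s, a: 0.0,         # placeholder success estimate
    ),
}
```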
We assume access to a high-level planner that computes
plan skeletons (i.e. skill sequences) to achieve a high-level
goal. STAP aims to solve the problem of turning plan skele-
tons into geometrically feasible action plans (i.e. parameters
for each manipulation primitive in the plan skeleton).
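For concreteness, a plan skeleton and the action plan that grounds it could be represented with structures like the following; the skill names, object arguments, and parameter dimensions are hypothetical placeholders.

```python
from typing import List, Tuple
import numpy as np

# A plan skeleton is an ordered sequence of skills with their object arguments,
# as produced by a high-level task planner.
PlanSkeleton = List[Tuple[str, Tuple[str, ...]]]

skeleton: PlanSkeleton = [
    ("Pick", ("a", "b")),      # pick object a off of b
    ("Place", ("a", "c")),     # hypothetical second skill: place a onto c
]

# The corresponding action plan assigns parameters to each primitive in the
# skeleton; here each entry is a placeholder parameter vector.
action_plan: List[np.ndarray] = [np.zeros(4) for _ in skeleton]
```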
STAP is agnostic to the choice of high-level planner. For
instance, it can be used in conjunction with Planning Domain
Definition Language (PDDL) [65] task planners to perform
hierarchical TAMP [66]. In this setup, the task planner and
STAP will be queried numerous times to find multiple plan
skeletons grounded with optimized action plans. STAP will
also evaluate each action plan’s probability of success (i.e.
its geometric feasibility). Once a termination criterion, such
as a timeout, is met, the candidate plan skeleton and action
plan with the highest probability of success are returned.
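The outer loop just described might look roughly like the sketch below, where `task_planner.sample_skeleton`, `stap.optimize`, and `stap.success_probability` are hypothetical interfaces standing in for the PDDL task planner and STAP; it is a sketch of the control flow, not our actual implementation.

```python
import time

def hierarchical_tamp(task_planner, stap, state, goal, timeout_s=30.0):
    """Repeatedly query the task planner for plan skeletons, ground each one
    with optimized primitive parameters, and return the candidate with the
    highest estimated probability of success once a timeout is reached."""
    best_plan, best_prob = None, -1.0
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        skeleton = task_planner.sample_skeleton(state, goal)    # skill sequence
        if skeleton is None:                                    # planner exhausted
            break
        actions = stap.optimize(skeleton, state)                # primitive parameters
        prob = stap.success_probability(skeleton, actions, state)
        if prob > best_prob:
            best_plan, best_prob = (skeleton, actions), prob
    return best_plan, best_prob
```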
B. Task-agnostic policies
We aim to learn policies $\{\pi_1, \dots, \pi_K\}$ for the skill library
$\mathcal{L}$ that can be sequenced by a high-level planner in arbitrary