task space [26,27]. In a closed-loop setting with a moving target, the traditional process of using
IK to map from task space to configuration space can produce highly variable configurations, especially around
obstacles. Motion Optimization [2,3,28], on the other hand, generates paths via non-linear optimization
and can consider multiple objectives such as smoothness of the motion, obstacle avoidance,
and convergence to an end-effector pose. These algorithms require careful tuning of the respective
cost functions to ensure convergence to a desirable path and are prone to local minima. Furthermore,
non-linear optimization is computationally complex and can be slow for difficult planning problems.
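As a rough illustration, such optimizers minimize a weighted combination of per-objective costs over a discretized trajectory. The sketch below is schematic only; the weights, clearance margin, and the helper functions `obstacle_dist` and `goal_pose_err` are placeholders rather than any specific method's formulation.

```python
import numpy as np

def trajectory_cost(xi, obstacle_dist, goal_pose_err,
                    w_smooth=1.0, w_obs=10.0, w_goal=5.0, margin=0.05):
    """Schematic weighted trajectory objective.

    xi: (T, d) array of configuration-space waypoints.
    obstacle_dist(q): hypothetical clearance function (distance to nearest obstacle).
    goal_pose_err(q): hypothetical end-effector pose error at configuration q.
    """
    smoothness = np.sum(np.diff(xi, axis=0) ** 2)                     # penalize jerky motion
    collision = sum(max(0.0, margin - obstacle_dist(q)) for q in xi)  # penalize low clearance
    goal = goal_pose_err(xi[-1])                                      # converge to the target pose
    return w_smooth * smoothness + w_obs * collision + w_goal * goal
```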
Local Control: In contrast to global planners, local controllers have long been applied to create
collision-free motions [29,30,4,11]. While they prioritize speed and smoothness, they are highly
local and may fail to find a valid path in complex environments. We demonstrate in our experiments
that MπNets are more effective at producing convergent motions in these types of environments,
including in dynamic and partially observed settings.
Imitation Learning: Imitation Learning [31] can train a policy from expert demonstrations with
limited knowledge of the expert’s internal model. For motion planning problems, we can apply im-
itation learning and leverage a traditional planner as the expert demonstrator, one with a perfect model
of the scene during training, and learn a policy that forgoes the need for an explicit scene model at
test time. Popular imitation learning methods include Inverse Reinforcement Learning [32,33,34]
and Behavior Cloning [35,36]. The former typically assumes expert optimality and learns a cost
function accordingly, whereas the latter directly learns the state-action mapping from demonstra-
tions, regardless of the expert’s optimality. We thus employ behavior cloning because producing
optimal plans for continuous manipulation problems is challenging. Recent work demonstrates
behavior cloning’s efficacy for fine-grained manipulation tasks, such as chopstick use [37] and pick-
and-place [38]. For long-horizon tasks like ours, however, distributional shift and data variance can
hinder behavior cloning performance. Distribution shift during execution can lead to states unseen
in training data [37]. Complex tasks often have a long tail of possible action states that are under-
represented in the data, leading to high data variance [39]. There are many techniques to address
these challenges through randomization, noise injection, regret optimization, and expert correction
[37,40,41,42,43]. These techniques, however, have not been demonstrated on a problem of our
scale and complexity (see Appendix D for details on the range of data). Our proposal seeks to over-
come these issues by specifically designing a learnable expert, increasing the scale and variation of
the data, and using a sufficiently expressive policy model.
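In its simplest form, behavior cloning reduces to supervised regression of the expert's action at each state visited in the demonstrations. The following is a minimal sketch of one such update, assuming a generic feed-forward policy and a mean-squared-error loss; the network sizes and state encoding are illustrative and are not the MπNets architecture or training objective.

```python
import torch
import torch.nn as nn

# Hypothetical policy: maps a state encoding (e.g., a configuration plus an
# observation embedding) to a configuration-space action.
policy = nn.Sequential(
    nn.Linear(7 + 1024, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 7),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_update(state, expert_action):
    """One behavior-cloning step: regress the expert's action for the given state."""
    loss = nn.functional.mse_loss(policy(state), expert_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```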
Neural Motion Planning: Many deep planning methods [13,44,45,46] seek to learn efficient sam-
plers to speed up traditional planners. Motion Planning Networks (MPNets) [12] learns to directly
plan through imitation of a standard sampling-based RRT* planner [6] and is used in conjunction
with a traditional planner for stronger guarantees. While these works greatly improve the speed of
the planning search, they have the same requirements as a standard planning system: targets in con-
figuration space and an explicit collision checker to connect the path. Our work operates based on
task-space targets and perceptual observations from a depth sensor without explicit state estimation.
Novel architectures have been proposed, such as differentiable planning modules in Value Iteration
Networks [20], transformers by Chaplot et al. [47], and goal-conditioned RL policies [48]. These
methods struggle to generalize to unknown environments or have only been shown in simple
2D [19] or 3D [21] settings. In contrast, we illustrate our approach in the challenging domain of
controlling a 7 degree-of-freedom (DOF) manipulator in unknown, dynamic environments.
3 Learning from Motion Planning
3.1 Problem Formulation
MπNets expect two inputs, a robot configuration $q_t$ and a segmented, calibrated point cloud $z_t$.
Before passing $q_t$ through the network, we normalize each element to be within $[-1, 1]$ according
to the limits for the corresponding joint. We call this $q_t^{\|\cdot\|}$. The point cloud is always assumed to be
calibrated in the robot’s base frame, and it encodes three segmentation classes: the robot’s current
geometry, the scene geometry, and the target pose. Targets are inserted into the point cloud via
points sampled from the mesh of a floating end effector placed at the target pose.
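To make the input representation concrete, a sketch of how these two inputs might be assembled is shown below. The joint limits and the per-class one-hot encoding are illustrative assumptions, not the paper's exact values or point-cloud format; the target points are assumed to come from sampling a floating end-effector mesh posed at the goal.

```python
import numpy as np

# Placeholder joint limits for a 7-DOF arm (illustrative values only).
JOINT_LOWER = np.array([-2.9, -1.8, -2.9, -3.1, -2.9, -0.1, -2.9])
JOINT_UPPER = np.array([ 2.9,  1.8,  2.9,  0.0,  2.9,  3.8,  2.9])

def normalize_config(q):
    """Map each joint value from [lower, upper] to [-1, 1]."""
    return 2.0 * (q - JOINT_LOWER) / (JOINT_UPPER - JOINT_LOWER) - 1.0

def build_segmented_cloud(robot_pts, scene_pts, target_pts):
    """Stack xyz points (in the robot base frame) with one-hot segmentation labels:
    class 0 = robot geometry, class 1 = scene geometry, class 2 = target points
    sampled from a floating end-effector mesh placed at the goal pose."""
    def with_label(pts, cls):
        onehot = np.zeros((len(pts), 3))
        onehot[:, cls] = 1.0
        return np.hstack([pts, onehot])
    return np.vstack([
        with_label(robot_pts, 0),
        with_label(scene_pts, 1),
        with_label(target_pts, 2),
    ])
```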