In this work, we seek to address the problem of controlling the extrinsic contact between a grasped
compliant tool (e.g., a spatula) and the environment. In general, the robot cannot expect to know the
full geometry and physical properties (e.g., mass, friction, stiffness) of every tool it must use, or
the geometry of every environment in which it must manipulate. Instead, the robot must utilize multimodal
sensory observations, such as point clouds and tactile feedback, to act on the environment.
In recent years, learning-based methods have become increasingly popular for addressing the
complexities of robotic manipulation, including contact-rich tasks [5]. These methods can be loosely
grouped into model-free methods, which directly learn a policy [3,2,6], and model-based methods,
which learn system dynamics [7,8,9]. By focusing on modeling system dynamics, model-based
methods can plan to reach new goals without retraining, and are often more data-efficient [9]. Therefore,
we propose learning the dynamics of our system to solve the extrinsic contact servoing task.
It is not obvious which representation to use for these dynamics. Fully recovering tool and
environment geometries from visual data [10,11] and tactile feedback [12] has been widely explored, with
recent extensions to compliant geometries [13]; however, even if the system can be fully identified,
contact models to resolve interactions can have limited fidelity [14]. On the other hand, learned
dynamics representations can be difficult to interpret and require demonstrations or observations from
the desired state to specify goals [7,15]. Instead, we propose a novel contact feature
representation for our learning method that focuses on tool-environment interaction and bypasses explicitly
modeling the whole system. We represent the contact configuration as 1) a binary contact mode
(indicating if the system is in contact); 2) a contact geometry (as a line in 3D space); and 3) an
end-effector wrench.
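As an illustration only, the three-part contact representation above can be written as a small data structure. The sketch below is not the authors' implementation: the class name `ContactState`, the choice to parameterize the contact line by two 3D endpoints, and the force-then-torque ordering of the wrench are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]
Wrench = Tuple[float, float, float, float, float, float]


@dataclass(frozen=True)
class ContactState:
    """Hypothetical container for the three contact features named in the text."""

    # 1) Binary contact mode: is the tool touching the environment?
    in_contact: bool
    # 2) Contact geometry: a line in 3D space, stored here (by assumption)
    #    as its two endpoints.
    line_start: Vec3
    line_end: Vec3
    # 3) End-effector wrench, assumed ordered as (fx, fy, fz, tx, ty, tz).
    wrench: Wrench

    def line_length(self) -> float:
        """Euclidean length of the contact line segment."""
        squared = sum((b - a) ** 2 for a, b in zip(self.line_start, self.line_end))
        return squared ** 0.5
```

One reason to keep the representation this compact is that a goal can then be specified directly as a desired contact line, without needing a demonstration or an observation of the goal state.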
We propose a learning architecture that models the dynamics of this contact representation
from raw sensory observations over candidate action trajectories. We structure the model
as a latent space dynamics model with a decoder that recovers the contact state. We also introduce
an action offset term in the dynamics that allows us to accurately propagate robot poses despite
controller errors (e.g., from robot impedance). To provide labels for our model, we collect
self-supervised data on a 7-DoF Franka Emika Panda, using sensor data to automatically label contact
state.
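To make the structure of such a model concrete, the following minimal sketch rolls a latent state forward over one candidate action sequence and decodes a contact prediction at each step. It is a sketch under stated assumptions, not the architecture used in this work: the function `rollout`, the additive form of the action offset, the treatment of actions as position deltas, and the toy stand-in functions are all hypothetical, and in practice the three callables would be learned networks.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def rollout(
    z0: Sequence[float],
    pose0: Sequence[float],
    actions: Sequence[Sequence[float]],
    dynamics: Callable[[Vector, Vector], Vector],
    offset: Callable[[Vector, Vector], Vector],
    decode: Callable[[Vector], bool],
) -> List[Tuple[Vector, bool]]:
    """Predict (robot position, contact flag) after each action in a sequence."""
    z, pose = list(z0), list(pose0)
    predictions = []
    for action in actions:
        a = list(action)
        # Assumed form of the offset term: the position actually reached is
        # the commanded delta plus a predicted correction for controller error.
        delta = offset(z, a)
        pose = [p + ai + di for p, ai, di in zip(pose, a, delta)]
        # Advance the latent state, then decode the contact prediction from it.
        z = dynamics(z, a)
        predictions.append((pose, decode(z)))
    return predictions


# Toy stand-ins for the learned components (purely illustrative).
def toy_dynamics(z: Vector, a: Vector) -> Vector:
    return [zi + sum(a) for zi in z]


def toy_offset(z: Vector, a: Vector) -> Vector:
    # Pretend the controller undershoots every commanded motion by 10%.
    return [-0.1 * ai for ai in a]


def toy_decode(z: Vector) -> bool:
    return z[0] > 0.5


preds = rollout(
    [0.0], [0.0, 0.0, 0.0], [[1.0, 0.0, 0.0]] * 2,
    toy_dynamics, toy_offset, toy_decode,
)
```

A planner could call a function like this once per candidate action trajectory and score each set of predictions against the desired contact trajectory.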
We validate our proposed method by completing various desired contact trajectories on the real robot
system. We first show that our method can track diverse desired contact trajectories in the absence
of obstacles. Next, we demonstrate that we can utilize extrinsic contact servoing to scrape a target
object from the table, while handling occlusions and avoiding contact with obstacles (Fig. 1).
2 Related Work
Existing research has investigated the task of recovering contact locations. Manuelli et al. [16]
localize point contacts on a rigid robot with known geometry by employing a particle filtering approach
to update a set of candidate contact locations based on force-torque sensing. Kim et al. [4] and Ma
et al. [17] model contact between a grasped rigid object and the environment by assuming stationary
line contacts and modeling the deformation of a GelSlim gripper. The estimated line contact is then
used in a Reinforcement Learning (RL) policy. Neither of these methods extends to compliant tools
and neither models the dynamics of the contact configuration.
Other works explore tactile servoing methods, where contact at the sensor is driven to a desired
configuration. Li et al. [18] use a large tactile pad and define contact configuration features of
objects pressed against the sensor. They manually construct a feedback controller based on these
features and use it to drive contacts to desired configurations. Sutanto et al. [19] use a smaller profile
tactile sensor and learn the dynamics of a learned latent space. They then employ a Model Predictive
Control (MPC) scheme to drive contacts to desired configurations on the sensor. Both of these works
assume contact is happening at the sensing location. We, on the other hand, seek to servo extrinsic
contacts, where we do not get direct sensing at the point of contact.
Other work focuses on maintaining contact between a tool and the environment. Sakaino [20] uses
imitation learning to learn a controller able to maintain contact between a mop and a tabletop. In
contrast, we wish not only to maintain contact but also to control the extrinsic contact geometry.