For robots to perform these insertion tasks in industrial
and warehouse settings with less human supervision, or in
unstructured environments such as homes, they must rely on
highly accurate state information of the external world (e.g.,
socket position and in-hand connector pose). But such state
estimation, using either machine learning or computer vision
approaches, is brittle on unseen connectors. To solve the
general problem of inserting a novel connector, one promising approach is to generalize from previously collected insertion experience in order to learn a vision-based insertion policy. Among these tasks, there is enough variability
to require generalization and adaptation, but also enough
internal structural regularity that we expect transfer between
connectors. We first collected a large offline dataset with
insertion data of 50 connectors across 2 robots and diverse
backgrounds with actions, images, and sparse reward labels.
Offline RL on this data alone generalizes to connectors very
similar to those in the training dataset, but we will also expect
robots to be able to perform tasks in new domains, perhaps
after some practice. How can a robot insert test connectors
from vision in this setting, utilizing offline RL from offline
data to enable active online finetuning on a new connector?
The key insight is that we need to (1) adapt to new tasks
quickly with online finetuning if the zero-shot solution is not
sufficient and (2) generalize to new domains by finding com-
mon structure between domains while preserving important
domain-specific information. Ideally, a policy trained offline
can generalize from vision to new tasks. But if it does not, we
can still finetune in a new domain with minimal supervision
as long as we have a reward function that generalizes instead.
For training policies and reward functions that generalize to
test domains, we propose a split representation that combines
domain adversarial neural networks [1] for domain invariance
and a variational information bottleneck [2] for controlling
the flow of domain-specific information. This representation,
which we call domain adversarial information bottleneck
(DAIB), is first used to learn a robust reward function that detects successful insertions for an unseen connector. Next,
we modify implicit Q-learning (IQL), an offline RL algo-
rithm amenable to online finetuning, to use DAIB. During
online finetuning, DAIB can be used in combination with
online RL to enable fast learning of novel connectors.
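To make the structure of this split representation concrete, the following is a minimal sketch of how a DANN-style adversarial loss [1] and a variational information bottleneck [2] could be combined on top of an image encoder. It is an illustration under assumptions, not the paper's exact architecture: the module and hyperparameter names (DAIBEncoder, beta, lam) are hypothetical, and the real system may use different backbones, heads, and loss weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer used for domain adversarial training [1]."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing back into the encoder.
        return -ctx.lam * grad_output, None

class DAIBEncoder(nn.Module):
    """Stochastic latent z with a VIB [2], made domain-invariant by an
    adversarial domain classifier. Task head stands in for, e.g., the
    success/reward classifier; analogous heads can be attached for IQL."""
    def __init__(self, feat_dim, z_dim, num_domains):
        super().__init__()
        self.to_gauss = nn.Linear(feat_dim, 2 * z_dim)    # mu and log-variance
        self.domain_head = nn.Linear(z_dim, num_domains)  # adversarial head
        self.task_head = nn.Linear(z_dim, 1)              # e.g. success logit

    def forward(self, feats, domain_labels, beta=1e-3, lam=1.0):
        mu, logvar = self.to_gauss(feats).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        # Information bottleneck: KL(q(z|x) || N(0, I)) limits how many
        # domain-specific bits pass through the latent.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
        # Domain adversarial term: the domain head is trained to predict the
        # domain, while reversed gradients push the encoder to remove it.
        dom_logits = self.domain_head(GradReverse.apply(z, lam))
        dom_loss = F.cross_entropy(dom_logits, domain_labels)
        task_logit = self.task_head(z)
        return task_logit, dom_loss + beta * kl
```

In such a sketch, the reward classifier and the IQL critic and policy would operate on z; the beta and lam terms trade off how much domain-specific information is preserved against domain invariance, and the same auxiliary losses can in principle be kept during online finetuning.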
We present two main contributions. We demonstrate a
system for finetuning under realistic real-world constraints
with minimal human supervision, and apply it to insert connectors robustly from vision without the need for accurate socket localization, both for observations and rewards. To
accomplish this, we propose a novel representation learn-
ing method that allows better generalization of policies
and reward functions to unseen domains. We outperform regression-based baselines trained on the same dataset, which combine socket localization with hand-designed control policies, as well as prior RL methods. We show that new tasks can be finetuned within 200 trials (about 50 minutes of real-world interaction), given our dataset of 70,000 off-policy trajectories from 50 prior domains. This system allows us
to finetune IQL to a test connector, increasing performance
significantly over the offline performance. Project videos
and our dataset of robotic insertion will be made public at
sites.google.com/view/learningonthejob
II. RELATED WORK
Reinforcement learning has been applied to a variety of
robotics tasks [3]–[11]. To utilize offline datasets with diverse
data in robotics, algorithms developed for offline RL [12]–
[15] have been studied in the robotics setting [16]–[20]. A
subset of offline RL algorithms are amenable to finetuning
[14], [21]–[25]. Our work builds on the direction of offline
pretraining followed by online finetuning in robotics. But
beyond this line of work, we focus on finetuning from visual
input in realistic settings with multiple domains and without
ground truth reward functions for the new task.
In this respect, our work is closest to prior work on
self-supervised RL that does not assume an external reward
function and instead learns it from data. One class of self-
supervised RL methods uses goal-conditioned RL with self-
supervised rewards [26]–[36]. While general, this class of
methods is a poor fit for industrial insertion, as high precision
is required both in the policy and in evaluating rewards. In-
stead, we train a domain generalizing reward classifier from
prior data. Prior methods have used learned rewards [37], and classifier-based rewards have previously been proposed as a scalable solution for robotics tasks [38], [39]. However, learned rewards have not been shown to be useful for finetuning in novel real-world robotic domains. Because we
focus on applying offline RL and finetuning from vision in
the industrial insertion setting, domain generalization of the
reward function is vital for our method to work in practice.
Many aspects of robotic insertion, or peg-in-hole assembly,
have been studied in prior work [40]–[45], often utilizing
geometry and dynamic analysis, force control, tactile sens-
ing, and search, but these methods can be brittle to state
estimation errors. Learning-based methods, including RL,
have also been applied, usually for a single connector from
ground-truth state information [46]–[48]. In these cases, the
RL algorithm must learn to navigate the specific dynamics of
the single connector, but does not generalize across connec-
tors. More recent work has considered using meta-learning
to generalize and improve few-shot adaptation between domains [49].
Zhao et al. use offline RL and finetuning combined with
meta-learning to adapt to a new connector [50]. This work
assumes a known position of the socket and consistent
grasping of the connector, and is robust to a small amount
(±1mm) of noise. With known socket position and small
error, the learning algorithm can learn a structured noise
or exploration strategy that can overcome these errors. In
contrast, we initialize connectors within ±20mm of the
socket (20× the variance), which requires the robot to rely on
visual feedback since blind exploration will rarely succeed.
Closest to our approach is prior work that also uses pixel input
for robotic insertion. Luo et al. incorporate vision alongside
proprioception, using a VAE to embed pixel input [51].
InsertionNet uses a vision system to localize the object and
socket, operating on a "residual policy" which is learned