
79% of the time while BC trained on successful
demonstrations has a success rate of 66%.
II. RELATED WORK
Our work falls into the broader category of learning from
demonstrations [5]. However, rather than taking the more traditional
approach of BC from demonstrations of success [1], [2], [3],
[6], we propose a novel method for collecting demonstrations
of failure and recovery as well as success. The inclusion
of failures in our dataset leads us to use OffRL [7], [8] to
ensure that we do not imitate the failures. Specifically, we
use the IQL [9] and AWAC [10] algorithms for OffRL. Some
theoretical motivation for the use of OffRL rather than BC
when learning from suboptimal data can be found in [11].
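To make the contrast with BC concrete, the following is a minimal sketch of the losses behind IQL (expectile regression of $V$ toward $Q$ and a one-step TD fit of $Q$) and of AWAC-style advantage-weighted policy extraction: because the policy loss weights imitation by estimated advantage, demonstrated failures receive little imitation weight rather than being copied. The tensor interfaces, hyperparameters, and clipping below are illustrative assumptions, not the exact implementation used in this paper.

```python
# Minimal sketch of IQL losses and AWAC-style advantage-weighted policy
# extraction, assuming PyTorch tensors produced by user-defined Q, V, and
# policy networks. Hyperparameters (tau, temperature) are illustrative.
import torch

def iql_value_loss(q_target, v_pred, tau=0.7):
    # Expectile regression of V toward Q: with tau > 0.5, under-estimating
    # large Q values is penalized more heavily than over-estimating them.
    diff = q_target - v_pred
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def iql_q_loss(q_pred, reward, v_next, done, gamma=0.99):
    # One-step TD fit: regress Q(s, a) toward r + gamma * V(s').
    target = reward + gamma * (1.0 - done) * v_next
    return (q_pred - target.detach()).pow(2).mean()

def awr_policy_loss(log_prob, q_value, v_value, temperature=3.0):
    # Advantage-weighted regression: imitate logged actions in proportion to
    # exp(A / temperature), so low-advantage (failure) actions are down-weighted.
    advantage = (q_value - v_value).detach()
    weight = torch.clamp(torch.exp(advantage / temperature), max=100.0)
    return -(weight * log_prob).mean()
```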
The insight that observations of failures are useful for policy
learning has been made before in work on using success detectors
to compute rewards in RL [12]. In contrast, VBT does
not attempt to learn an explicit success detector, but provides
a mechanism for collecting data that is especially useful for
learning policies by capturing the salient differences between
failure and success in visually similar observations.
While we consider a setting where training happens en-
tirely offline (i.e., the data collection step is separated from
the policy learning), there is a related line of work that
collects examples of failures and recoveries by moving to
an interactive setting where actions from partially trained
policies are executed on the robot and judged by the human
teleoperator. In particular, the DAgger line of work exemplifies
this pattern [4], [13], [14], [15], [16]. In contrast,
VBT operates completely offline and does not require the
teleoperator to interact with the learning process or to deploy
learned policies on the robot during training.
VBT is also particularly well suited to efficiently learning image-
based value functions for sparse reward tasks. This capability
could be especially useful in architectures like SayCan [2]
that require image-based affordances for a variety of manip-
ulation tasks, and it is an interesting direction for future work
to use VBT as a component in training such larger systems.
III. MOTIVATION
VBT is designed to solve two issues that arise from
simpler data collection methods: (1) lack of coverage of
failure and recovery behaviors, and (2) overfitting issues
that arise in the low-data and high-dimensional observation
setting. Here we describe both of these issues in more detail
before explaining how VBT resolves them.
A. Lack of coverage of failure and recovery
Any offline learning algorithm will be fundamentally
limited by the quality of the dataset. We can only expect
the algorithm to reproduce the best behaviors in the dataset,
not to reliably extrapolate beyond them. So, when collecting
datasets for offline learning of value functions and policies,
we want to ensure sufficient coverage of the relevant states
and actions. We argue that for sparse reward robotics tasks
this requires including failures and recoveries in the dataset.
Fig. 2. An illustration of each of the four types of datasets that we consider
in a gridworld environment with a sparse reward for reaching the green
goal. The Success Only dataset contains two successful trajectories. The
Coverage+Success and LfP+Success datasets each contain one success and
one failure. Our VBT dataset contains one trajectory that fails, then recovers,
and then succeeds. Full descriptions of the datasets are in Section V.
Value functions require coverage. Most OffRL algorithms involve estimating the $Q$ and $V$ value functions of a learned policy $\pi$ that is different from the policy (in our case the teleoperator) that collected the dataset. Explicitly, letting $\gamma \in [0, 1)$ be a discount factor and $r$ be the reward function, they estimate
$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right], \qquad V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[Q^\pi(s, a)\right],$$
where expectations are taken over actions sampled from $\pi$. For the rest of the paper we will omit the superscript $\pi$ when clear from context.
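To concretize these definitions, a Monte Carlo reading of $Q^\pi$ and $V^\pi$ is sketched below. This is purely illustrative: the `env` and `pi` interfaces are assumptions for the sketch, and in the offline setting considered here such rollouts are exactly what is unavailable, which is why OffRL must instead estimate these quantities from the fixed dataset.

```python
# Illustrative Monte Carlo estimates of Q^pi and V^pi, assuming a simple
# env.step(state, action) -> (next_state, reward, done) interface and a
# policy pi(state) -> action. Both interfaces are assumptions for the sketch.
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """sum_t gamma^t * r_t for a single rollout."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def mc_q_estimate(env, pi, s, a, gamma=0.99, n_rollouts=100):
    """Q^pi(s, a): average return over rollouts that start in s, take a,
    and then follow pi until termination."""
    returns = []
    for _ in range(n_rollouts):
        state, action, rewards, done = s, a, [], False
        while not done:
            state, reward, done = env.step(state, action)  # assumed interface
            rewards.append(reward)
            action = pi(state)
        returns.append(discounted_return(rewards, gamma))
    return float(np.mean(returns))

def mc_v_estimate(env, pi, s, gamma=0.99, n_rollouts=100):
    """V^pi(s) = E_{a ~ pi(.|s)}[Q^pi(s, a)], estimated by sampling a from pi."""
    return float(np.mean([mc_q_estimate(env, pi, s, pi(s), gamma)
                          for _ in range(n_rollouts)]))
```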
The key issue with learning value functions for a policy $\pi$ that did not
collect the data is that we can only reliably estimate value functions at
states and actions that are similar to those seen in the training set [7].
This is borne out by theoretical work, which often requires strong coverage
assumptions to learn accurate value functions, such as assuming that the
data distribution covers all reachable states and actions [17].
While this sort of assumption is too strong to satisfy
in practice, we argue that for the sparse reward robotic
tasks that we consider, the relevant notion of coverage is to
include failures and recoveries, as well as successes, in the
dataset. These failures and recoveries can provide coverage
of the task-relevant states and behaviors necessary to learn
useful value functions. Without failures the learned value
functions will not be able to identify the “decision boundary”
between failure and success that is necessary for reliably
accomplishing the specified task. Much as in supervised
learning it is difficult to train a binary classifier without
any examples of the negative class, we conjecture that it
is difficult to learn accurate $Q$ functions for sparse reward
tasks without any failures.
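As a small numeric illustration of this conjecture (our own toy construction, not an experiment from the paper), Monte Carlo $Q$ targets computed from success-only trajectories in a sparse reward chain are all powers of $\gamma$ close to one, so a regressor fit to them never sees a low-value target and gets no signal about which actions lead toward failure.

```python
# Toy illustration (hypothetical, not from the paper): Monte Carlo Q targets
# from success-only trajectories in a 5-state chain with a sparse reward of 1
# for reaching the goal state 4. Every logged (state, action) pair receives a
# target of gamma^(steps remaining until success), so the data contains no
# examples of the "negative class".
gamma = 0.99

success_only_trajectories = [
    [(0, "right"), (1, "right"), (2, "right"), (3, "right")],  # reaches goal
    [(1, "right"), (2, "right"), (3, "right")],
]

for trajectory in success_only_trajectories:
    horizon = len(trajectory)
    targets = [gamma ** (horizon - 1 - t) for t in range(horizon)]
    print([(sa, round(y, 3)) for sa, y in zip(trajectory, targets)])
# Every target lies in [0.97, 1.0]; nothing in the data indicates what moving
# away from the goal would be worth.
```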
Consider the example datasets shown in Fig. 2. If we use
the Success Only dataset to train a $Q$ function and then
query the $Q$ function at the location of the red agent during