Visual Backtracking Teleoperation: A Data Collection Protocol
for Offline Image-Based Reinforcement Learning
David Brandfonbrener1, Stephen Tu2, Avi Singh2,
Stefan Welker2, Chad Boodoo2, Nikolai Matni2,3, and Jake Varley2
Abstract: We consider how to most efficiently leverage
teleoperator time to collect data for learning robust image-based
value functions and policies for sparse reward robotic tasks. To
accomplish this goal, we modify the process of data collection to
include more than just successful demonstrations of the desired
task. Instead we develop a novel protocol that we call Visual
Backtracking Teleoperation (VBT), which deliberately collects
a dataset of visually similar failures, recoveries, and successes.
VBT data collection is particularly useful for efficiently learning
accurate value functions from small datasets of image-based
observations. We demonstrate VBT on a real robot to perform
continuous control from image observations for the deformable
manipulation task of T-shirt grasping. We find that by adjusting
the data collection process we improve the quality of both the
learned value functions and policies over a variety of baseline
methods for data collection. Specifically, we find that offline
reinforcement learning on VBT data outperforms standard
behavior cloning on successful demonstration data by 13%
when both methods are given equal-sized datasets of 60 minutes
of data from the real robot.
I. INTRODUCTION
A common approach to control from images is to collect
demonstrations of task success and train a behavioral cloning
(BC) agent [1], [2], [3]. This can lead to policies that are
able to succeed on many tasks, particularly when mistakes
do not cause the policy to move too far from the distribution
of state-action transitions seen in the dataset of successful
demonstrations. But, for many tasks (like T-shirt grasping) a
mistake will take the policy out of this distribution and then
the learned BC policy will fail to recover [4].
Such failures occur in part because a dataset of only
successes does not contain enough information to recognize
failures or learn recovery behaviors. To remedy this issue, we
propose a novel data collection method called Visual Backtracking Teleoperation (VBT). Specifically, VBT leverages
the teleoperator to collect visually similar failures, recoveries,
and successes as seen in Fig. 1.
VBT data collection is designed to combine well with
offline reinforcement learning (OffRL) rather than BC. VBT
data contains the necessary coverage of the state-action space to learn accurate value functions as well as recovery behaviors. Then OffRL can leverage sparse rewards to
automatically emulate the advantageous actions (including
recovery behaviors) while avoiding the sub-optimal actions
that led to the initial task failure.
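As a concrete illustration (not the authors' code), the sketch below shows one way a VBT episode containing a failure, a recovery, and a final success could be stored and grounded with a single sparse terminal reward before being handed to OffRL. The names VBTStep and label_vbt_episode are hypothetical placeholders.

```python
# Minimal sketch (assumed representation, not the paper's implementation) of
# storing one VBT episode and grounding it with a sparse terminal reward.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class VBTStep:
    """One teleoperated timestep: image observation, action, sparse reward."""
    observation: np.ndarray   # e.g. an RGB image of shape (H, W, 3)
    action: np.ndarray        # e.g. a relative end-effector pose command
    reward: float = 0.0
    done: bool = False


def label_vbt_episode(steps: List[VBTStep], success: bool) -> List[VBTStep]:
    """Ground the whole episode with a single terminal sparse reward.

    The failure and recovery segments receive reward 0; only the final
    timestep of a successful episode receives reward 1.  Offline RL then
    propagates value backwards through the episode, so failure actions are
    assigned low value and recovery actions high value without any
    per-step labels from the teleoperator.
    """
    for step in steps:
        step.reward = 0.0
        step.done = False
    steps[-1].done = True
    steps[-1].reward = 1.0 if success else 0.0
    return steps
```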
1New York University david.brandfonbrener@nyu.edu
2Robotics at Google
3University of Pennsylvania
Website: https://sites.google.com/view/vbt-paper
[Fig. 1: a teleoperated T-shirt grasping episode shown as six panels: Approach, Fail-Grasp, Fail-Lift, Recover, Successful-Grasp, Successful-Lift. Overlay annotations: every episode is grounded by a terminal reward; the consistent background forces visual attention to relevant features; the teleoperator demonstrates failure and recovery behavior.]
Fig. 1. An illustration of our VBT method on a T-shirt grasping task. The
method asks the teleoperator to demonstrate failure (red), recovery (blue),
and success (green) within each trajectory. This provides the necessary
coverage of failure and recovery to learn accurate value functions and
robust policies while also preventing overfitting by ensuring that failures
and successes are visually similar except for the task-relevant details.
It is crucial to the VBT method that the failures, recov-
eries, and successes are visually similar. Without this visual
similarity, the OffRL learner can overfit to non-task-relevant
elements of the image such as background clutter, leading to
useless value functions. To easily maintain visual similarity,
VBT collects each of failure, recovery, and success within the
same trajectory (Fig. 1). Thus, VBT avoids overfitting, even
on small, image-based datasets, by ensuring that differences
between observations of failure and success are task-relevant.
Concretely, our contributions are:
1) We propose the novel VBT protocol for data collection
to leverage human teleoperation to collect image-based
datasets for use with OffRL. VBT resolves the two
main issues with naive methods: (a) lack of coverage
of failures and recoveries and (b) overfitting caused by
visually dissimilar failures and successes.
2) We discuss how and why VBT enables better policy learning via more accurate Q functions, and we present empirical evidence of the improvement in Q functions learned from VBT data compared to several baselines for data collection.
3) We present real robot results on a deformable grasping
task to demonstrate the effectiveness of VBT data.
When training from scratch with image-based observations on just one hour of robot time for data collection, a policy trained with OffRL on VBT data succeeds
79% of the time while BC trained on successful
demonstrations has a success rate of 66%.
II. RELATED WORK
Our work falls into the broader category of learning from
demonstrations [5]. But, rather than taking a more traditional
approach of BC from demonstrations of success [1], [2], [3],
[6], we propose a novel method for collecting demonstrations
of failure and recovery as well as success. The inclusion
of failures in our dataset leads us to use OffRL [7], [8] to
ensure that we do not imitate the failures. Specifically, we
use the IQL [9] and AWAC [10] algorithms for OffRL. Some
theoretical motivation for the use of OffRL rather than BC
when learning from suboptimal data can be found in [11].
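For reference, below is a minimal sketch of IQL-style losses that OffRL could apply to such data: expectile regression for V, a one-step TD backup for Q, and advantage-weighted regression for the policy. The module interfaces (q_net, v_net, policy.log_prob) and the hyperparameter values are illustrative assumptions, not the configuration used in this paper.

```python
# Sketch of IQL-style offline RL losses on a batch of offline transitions.
# Network modules and hyperparameters are placeholders, not the paper's setup.
import torch
import torch.nn.functional as F


def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff^2, averaged."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()


def iql_losses(q_net, v_net, policy, target_q_net, batch,
               gamma: float = 0.99, beta: float = 3.0):
    """Compute the value, Q, and policy losses on one offline batch.

    `policy` is assumed to expose a log_prob(obs, action) method.
    """
    obs, action, reward, next_obs, done = batch

    # (1) Fit V toward an upper expectile of the target Q values.
    with torch.no_grad():
        target_q = target_q_net(obs, action)
    v_loss = expectile_loss(target_q - v_net(obs))

    # (2) Fit Q with a one-step TD backup through V at the next state;
    #     with VBT's sparse reward, reward is 1 only at a successful
    #     terminal step and 0 elsewhere.
    with torch.no_grad():
        td_target = reward + gamma * (1.0 - done) * v_net(next_obs)
    q_loss = F.mse_loss(q_net(obs, action), td_target)

    # (3) Advantage-weighted regression onto dataset actions: recovery and
    #     success actions receive large weights, failure actions small ones.
    with torch.no_grad():
        weights = torch.clamp(torch.exp(beta * (target_q - v_net(obs))),
                              max=100.0)
    pi_loss = -(weights * policy.log_prob(obs, action)).mean()

    return q_loss, v_loss, pi_loss
```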
The insight that observations of failures are useful for policy learning has been made before in work on using success detectors to compute rewards in RL [12]. In contrast, VBT does
not attempt to learn an explicit success detector, but provides
a mechanism for collecting data that is especially useful for
learning policies by capturing the salient differences between
failure and success in visually similar observations.
While we consider a setting where training happens entirely offline (i.e., the data collection step is separated from
the policy learning), there is a related line of work that
collects examples of failures and recoveries by moving to
an interactive setting where actions from partially trained
policies are executed on the robot and judged by the human
teleoperator. In particular, the DAgger line of work exemplifies this pattern [4], [13], [14], [15], [16]. In contrast,
VBT operates completely offline and does not require the
teleoperator to interact with the learning process or to deploy
learned policies on the robot during training.
VBT is also particularly well suited to efficiently learning image-based value functions for sparse reward tasks. This capability could be especially useful in architectures like SayCan [2] that require image-based affordances for a variety of manipulation tasks, and it is an interesting direction for future work to use VBT as a component in training such larger systems.
III. MOTIVATION
VBT is designed to solve two issues that arise from
simpler data collection methods: (1) lack of coverage of
failure and recovery behaviors, and (2) overfitting issues
that arise in the low-data and high-dimensional observation
setting. Here we describe both of these issues in more detail
before explaining how VBT resolves them.
A. Lack of coverage of failure and recovery
Any offline learning algorithm will be fundamentally
limited by the quality of the dataset. We can only expect
the algorithm to reproduce the best behaviors in the dataset,
not to reliably extrapolate beyond them. So, when collecting
datasets for offline learning of value functions and policies,
we want to ensure sufficient coverage of the relevant states
and actions. We argue that for sparse reward robotic tasks
this requires including failures and recoveries in the dataset.
Fig. 2. An illustration of each of the four types of datasets that we consider
in a gridworld environment with a sparse reward for reaching the green
goal. The Success Only dataset contains two successful trajectories. The
Coverage+Success and LfP+Success datasets each contain one success and
one failure. Our VBT dataset contains one trajectory that fails, then recovers,
and then succeeds. Full descriptions of the datasets are in Section V.
Value functions require coverage. Most OffRL algorithms involve estimating the Q and V value functions of a learned policy π that is different from the policy (in our case, the teleoperator) that collected the dataset. Explicitly, letting γ ∈ [0, 1) be a discount factor and r be the reward function, they estimate
$$Q^\pi(s,a) = \mathbb{E}_\pi\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\Big|\; s_0 = s,\, a_0 = a\Big], \qquad V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q^\pi(s,a)\big],$$
where expectations are taken over actions sampled from π. For the rest of the paper we omit the superscript π when it is clear from context.
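For a concrete numeric instance of these definitions under a sparse reward (1 only on a successful terminal step, 0 otherwise), the short sketch below computes a Monte Carlo sample of Q^π(s_0, a_0) from a single recorded episode; the function name discounted_return is illustrative, not from the paper.

```python
# With a sparse terminal reward, the return from timestep t of a successful
# T-step episode is simply gamma**(T - 1 - t).
import numpy as np


def discounted_return(rewards: np.ndarray, gamma: float = 0.99) -> float:
    """Return sum_t gamma^t * r_t, an unbiased sample of Q^pi(s_0, a_0)."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))


# Example: a 50-step episode that fails, recovers, and finally succeeds.
rewards = np.zeros(50)
rewards[-1] = 1.0                      # sparse terminal reward
print(discounted_return(rewards))      # 0.99**49 ≈ 0.61
```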
The key issue with learning value functions for a policy π that did not collect the data is that we can only reliably estimate value functions at states and actions that are similar to those seen in the training set [7]. This is borne out by theoretical work, which often requires strong
coverage assumptions to learn accurate value functions, such
as assuming that the data distribution covers all reachable
states and actions [17].
While this sort of assumption is too strong to satisfy
in practice, we argue that for the sparse reward robotic
tasks that we consider, the relevant notion of coverage is to
include failures and recoveries, as well as successes, in the
dataset. These failures and recoveries can provide coverage
of the task-relevant states and behaviors necessary to learn
useful value functions. Without failures the learned value
functions will not be able to identify the “decision boundary”
between failure and success that is necessary for reliably
accomplishing the specified task. Much as in supervised
learning it is difficult to train a binary classifier without
any examples of the negative class, we conjecture that it
is difficult to learn accurate Q functions for sparse reward
tasks without any failures.
Consider the example datasets shown in Fig. 2. If we use
the Success Only dataset to train a Q function and then query the Q function at the location of the red agent during