
79% of the time while BC trained on successful
demonstrations has a success rate of 66%.
II. RELATED WORK
Our work falls into the broader category of learning from
demonstrations [5]. However, rather than taking the more traditional
approach of BC from demonstrations of success [1], [2], [3],
[6], we propose a novel method for collecting demonstrations
of failure and recovery as well as success. The inclusion
of failures in our dataset leads us to use OffRL [7], [8] to
ensure that we do not imitate the failures. Specifically, we
use the IQL [9] and AWAC [10] algorithms for OffRL. Some
theoretical motivation for the use of OffRL rather than BC
when learning from suboptimal data can be found in [11].
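To make the contrast with BC concrete, the following is a minimal sketch of the losses behind IQL (expectile regression of $V$ toward $Q$ and a one-step TD fit of $Q$) and of AWAC-style advantage-weighted policy extraction: because the policy loss weights imitation by estimated advantage, demonstrated failures receive little imitation weight rather than being copied. The tensor interfaces, hyperparameters, and clipping below are illustrative assumptions, not the exact implementation used in this paper.

```python
# Minimal sketch of IQL losses and AWAC-style advantage-weighted policy
# extraction, assuming PyTorch tensors produced by user-defined Q, V, and
# policy networks. Hyperparameters (tau, temperature) are illustrative.
import torch

def iql_value_loss(q_target, v_pred, tau=0.7):
    # Expectile regression of V toward Q: with tau > 0.5, under-estimating
    # large Q values is penalized more heavily than over-estimating them.
    diff = q_target - v_pred
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def iql_q_loss(q_pred, reward, v_next, done, gamma=0.99):
    # One-step TD fit: regress Q(s, a) toward r + gamma * V(s').
    target = reward + gamma * (1.0 - done) * v_next
    return (q_pred - target.detach()).pow(2).mean()

def awr_policy_loss(log_prob, q_value, v_value, temperature=3.0):
    # Advantage-weighted regression: imitate logged actions in proportion to
    # exp(A / temperature), so low-advantage (failure) actions are down-weighted.
    advantage = (q_value - v_value).detach()
    weight = torch.clamp(torch.exp(advantage / temperature), max=100.0)
    return -(weight * log_prob).mean()
```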
The insight that observations of failures are useful for policy
learning has been made before in work on using success detectors
to compute rewards in RL [12]. In contrast, VBT does
not attempt to learn an explicit success detector, but provides
a mechanism for collecting data that is especially useful for
learning policies by capturing the salient differences between
failure and success in visually similar observations.
While we consider a setting where training happens en-
tirely offline (i.e., the data collection step is separated from
the policy learning), there is a related line of work that
collects examples of failures and recoveries by moving to
an interactive setting where actions from partially trained
policies are executed on the robot and judged by the human
teleoperator. In particular, the DAgger line of work exemplifies
this pattern [4], [13], [14], [15], [16]. In contrast,
VBT operates completely offline and does not require the
teleoperator to interact with the learning process or to deploy
learned policies on the robot during training.
VBT is also particularly well suited to efficiently learning image-
based value functions for sparse reward tasks. This capability
could be especially useful in architectures like SayCan [2]
that require image-based affordances for a variety of manip-
ulation tasks, and it is an interesting direction for future work
to use VBT as a component in training such larger systems.
III. MOTIVATION
VBT is designed to solve two issues that arise from
simpler data collection methods: (1) lack of coverage of
failure and recovery behaviors, and (2) overfitting issues
that arise in the low-data and high-dimensional observation
setting. Here we describe both of these issues in more detail
before explaining how VBT resolves them.
A. Lack of coverage of failure and recovery
Any offline learning algorithm will be fundamentally
limited by the quality of the dataset. We can only expect
the algorithm to reproduce the best behaviors in the dataset,
not to reliably extrapolate beyond them. So, when collecting
datasets for offline learning of value functions and policies,
we want to ensure sufficient coverage of the relevant states
and actions. We argue that for sparse reward robotics tasks
this requires including failures and recoveries in the dataset.
Fig. 2. An illustration of each of the four types of datasets that we consider
in a gridworld environment with a sparse reward for reaching the green
goal. The Success Only dataset contains two successful trajectories. The
Coverage+Success and LfP+Success datasets each contain one success and
one failure. Our VBT dataset contains one trajectory that fails, then recovers,
and then succeeds. Full descriptions of the datasets are in Section V.
Value functions require coverage. Most OffRL algorithms involve estimating the $Q$ and $V$ value functions of a learned policy $\pi$ that is different from the policy (in our case the teleoperator) that collected the dataset. Explicitly, letting $\gamma \in [0, 1)$ be a discount factor and $r$ be the reward function, they estimate
$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right], \qquad V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[Q^\pi(s, a)\right],$$
where expectations are taken over actions sampled from $\pi$. For the rest of the paper we will omit the superscript $\pi$ when clear from context.
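To concretize these definitions, a Monte Carlo reading of $Q^\pi$ and $V^\pi$ is sketched below. This is purely illustrative: the `env` and `pi` interfaces are assumptions for the sketch, and in the offline setting considered here such rollouts are exactly what is unavailable, which is why OffRL must instead estimate these quantities from the fixed dataset.

```python
# Illustrative Monte Carlo estimates of Q^pi and V^pi, assuming a simple
# env.step(state, action) -> (next_state, reward, done) interface and a
# policy pi(state) -> action. Both interfaces are assumptions for the sketch.
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """sum_t gamma^t * r_t for a single rollout."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def mc_q_estimate(env, pi, s, a, gamma=0.99, n_rollouts=100):
    """Q^pi(s, a): average return over rollouts that start in s, take a,
    and then follow pi until termination."""
    returns = []
    for _ in range(n_rollouts):
        state, action, rewards, done = s, a, [], False
        while not done:
            state, reward, done = env.step(state, action)  # assumed interface
            rewards.append(reward)
            action = pi(state)
        returns.append(discounted_return(rewards, gamma))
    return float(np.mean(returns))

def mc_v_estimate(env, pi, s, gamma=0.99, n_rollouts=100):
    """V^pi(s) = E_{a ~ pi(.|s)}[Q^pi(s, a)], estimated by sampling a from pi."""
    return float(np.mean([mc_q_estimate(env, pi, s, pi(s), gamma)
                          for _ in range(n_rollouts)]))
```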
The key issue with learning value functions for a policy $\pi$ that did not
collect the data is that we can only reliably estimate value functions at
states and actions that are similar to those seen in the training set [7].
This is borne out by theoretical work, which often requires strong coverage
assumptions to learn accurate value functions, such as assuming that the
data distribution covers all reachable states and actions [17].
While this sort of assumption is too strong to satisfy
in practice, we argue that for the sparse reward robotic
tasks that we consider, the relevant notion of coverage is to
include failures and recoveries, as well as successes, in the
dataset. These failures and recoveries can provide coverage
of the task-relevant states and behaviors necessary to learn
useful value functions. Without failures the learned value
functions will not be able to identify the “decision boundary”
between failure and success that is necessary for reliably
accomplishing the specified task. Much as in supervised
learning it is difficult to train a binary classifier without
any examples of the negative class, we conjecture that it
is difficult to learn accurate $Q$ functions for sparse reward
tasks without any failures.
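As a small numeric illustration of this conjecture (our own toy construction, not an experiment from the paper), Monte Carlo $Q$ targets computed from success-only trajectories in a sparse reward chain are all powers of $\gamma$ close to one, so a regressor fit to them never sees a low-value target and gets no signal about which actions lead toward failure.

```python
# Toy illustration (hypothetical, not from the paper): Monte Carlo Q targets
# from success-only trajectories in a 5-state chain with a sparse reward of 1
# for reaching the goal state 4. Every logged (state, action) pair receives a
# target of gamma^(steps remaining until success), so the data contains no
# examples of the "negative class".
gamma = 0.99

success_only_trajectories = [
    [(0, "right"), (1, "right"), (2, "right"), (3, "right")],  # reaches goal
    [(1, "right"), (2, "right"), (3, "right")],
]

for trajectory in success_only_trajectories:
    horizon = len(trajectory)
    targets = [gamma ** (horizon - 1 - t) for t in range(horizon)]
    print([(sa, round(y, 3)) for sa, y in zip(trajectory, targets)])
# Every target lies in [0.97, 1.0]; nothing in the data indicates what moving
# away from the goal would be worth.
```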
Consider the example datasets shown in Fig. 2. If we use
the Success Only dataset to train a $Q$ function and then
query the $Q$ function at the location of the red agent during