Real-World Robot Learning with
Masked Visual Pre-training
Ilija Radosavovic  Tete Xiao  Stephen James  Pieter Abbeel  Jitendra Malik  Trevor Darrell
University of California, Berkeley
Abstract: In this work, we explore self-supervised visual pre-training on images
from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our
visual representations are pre-trained via a masked autoencoder (MAE), frozen,
and then passed into a learnable control module. Unlike prior work, we show that
the pre-trained representations are effective across a range of real-world robotic
tasks and embodiments. We find that our encoder consistently outperforms CLIP
(up to 75%), supervised ImageNet pre-training (up to 81%), and training from
scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a
massive collection of 4.5M images from the Internet and egocentric videos, and
demonstrate clearly the benefits of scaling visual pre-training for robot learning.
Keywords: Self-supervised Learning, Visual Representations, Robot Learning
1 Introduction
Learning representations with large neural networks is the workhorse of modern deep learning.
This has enabled impressive results in computer vision [1,2], natural language processing [3,4,5],
and audio generation [6,7]. How can we transfer the success stories of representation learning to
robotics? We can approach this from two ends: shared representations on the perception side or
shared representations on the action side. Our focus in this paper is on shared visual representations.
Of course, the devil is in the details. Recent developments in the field of visual learning have made
this more feasible: (1) the use of diverse, real-world data from the Internet and egocentric videos,
(2) self-supervised objectives that do not overly rely on data augmentations or other forms of strong
human-designed priors, (3) scalable and high-capacity transformer models [8,9], and (4) training of
control policies on top of frozen visual representations. In our recent work [10], we have shown that
this recipe for self-supervised visual pre-training is effective for motor control in simulation.
In this paper, we show that this framework is effective for real-world robotic tasks as well (Figure 1).
We build on our prior work, but make significant advances in terms of data scale and diversity (7×
larger), model size (15× bigger), and real-world experiments (extensive real robot evaluations).
In particular, we train self-supervised visual representations on real-world images and videos from
the Internet [11,12,13] and egocentric video datasets [14,15]. We leverage the masked autoen-
coders [16] that learn visual representations by masked prediction. The hope is that, by learning to
predict the missing content in real-world images, the model will learn useful properties of the visual
world that will enable it to learn to perform real-world robotic tasks. Given the pre-trained vision
encoder, we freeze the encoder and learn control policies on top. The same visual representations are
used for all downstream robotic tasks and embodiments. We focus on efficient real-world learning
through behavior cloning with a handful of human-provided demonstrations per task (20-80).
*Equal contribution. Code, pre-trained models, and videos are available on our project page.
[Figure 1 overview: (a) masking and (b) autoencoder panels; in-the-wild data (over 4.5 million images from five diverse sources); real-world robotic tasks (two robots: xArm and Allegro hand; eight tasks across scenes and objects).]
Figure 1: Real-world robot learning with masked visual pre-training. We learn visual represen-
tations from a massive collection of Internet and egocentric data. We pre-train representations with
masked image modeling, freeze the encoder, and learn control policies for robotic tasks on top.
We evaluate our approach in an extensive real-world study and report results from 981 real-world
experiments. We consider basic motor control tasks (reach, push, pick), as well as tasks with varia-
tions in scenes and objects (Figure 1, right). We find that our approach achieves considerably higher
performance than CLIP (up to 75%), supervised pre-training (up to 81%), and training from scratch
(up to 81%). Furthermore, we observe that our representations lead to large improvements in sample
complexity, reaching the strongest baseline performance with half the number of demonstrations.
In addition, we demonstrate the benefits of scaling visual pre-training for robotics by training a
307M parameter vision encoder [9] on a massive collection of 4.5M images from ImageNet [11],
Epic Kitchens [17], Something Something [12], 100 Days of Hands [13], and Ego4D [15] datasets.
Importantly, we observe that it is not sufficient to scale the model alone and that larger models
require bigger datasets. To the best of our knowledge, ours is the largest vision model deployed for robotics, and it clearly demonstrates the benefits of scaling visual pre-training for robot learning.
2 Related Work
End-to-end control is concerned with learning to predict robot actions (e.g., joint velocities or end-effector poses) directly from observations [18,19,20], without the need to perform explicit 3D
pose estimation [21], grasp planning [22], and motion planning [23]. However, these end-to-end
approaches tend to be too sample inefficient for real-world training. Some works have tried to find
a balance between these explicitly pipelined approaches and end-to-end approaches [24,25,26].
Supervised pre-training for robotics learns one or more pretext tasks through strong supervision
and then transfers the representations to downstream robotic tasks. Lin et al. [27] show that representations learned from semantic tasks such as detection and segmentation correlate with affordance maps for object manipulation. Shridhar et al. [28] use the language-supervised CLIP model [29] to learn a language-conditioned imitation policy. In concurrent work, Nair et al. [30] explore
pre-training visual representations using time contrastive learning and language descriptions from
human annotators. These methods all require expert labels or cross-domain supervision.
Self-supervised learning in robotics has been explored as a means of improving sample efficiency.
Examples include: learning a dynamic model from interaction with environments [31]; learning
visual representation from interaction with environments [32]; learning vision-based policies on
self-collected trajectories [33,34]; learning visual autoencoders on trajectories [35]; learning spa-
tiotemporal representations through videos [36,37]; learning visual correspondence [38]; utilizing
non-parametric nearest-neighbor retrieval [39]; and conducting visual self-supervised learning on
pre-collected demonstrations [40]. These methods require in-domain data collection, and thus may
be difficult to extend beyond the training environment and task. In contrast, our approach uses a
large-scale and diverse collection of real-world images and videos, making it more generalizable.
[Figure 2 schematic: separate per-task control policies on top of a shared pre-trained vision encoder.]
Figure 2: One encoder for all robots and tasks. We train control policies per task, on top of the
frozen encoder. The same vision encoder is used for all downstream robotic tasks and embodiments.
3 Real-World Robot Learning with Masked Visual Pre-training
3.1 Masked Visual Pre-training
Data collection. We first compile a large-scale dataset for learning visual representations. We
primarily use Ego4D [15], a massive-scale egocentric dataset recorded with portable devices across nine countries and covering over 3,670 hours of daily-life activities. We combine the Ego4D data with ImageNet [11], as well as the Hand-object Interaction (HoI) data used in [10], which comprises the egocentric Epic Kitchens dataset [17], the YouTube 100 Days of Hands dataset [13], and the crowd-sourced Something-Something dataset [12]. Our training data totals 4.5 million images, 6.5× the size of the HoI data. We find that a pre-training dataset sufficiently large and diverse for the masked image modeling self-supervisory task is critical for scaling the vision backbone up to real-robot tasks.
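To make the data setup concrete, the sketch below shows one way such a mixed corpus could be assembled with standard PyTorch utilities, assuming frames from the video datasets have already been extracted to per-dataset folders; the directory names, file pattern, and loader settings are illustrative assumptions rather than the authors' pipeline.

```python
# A minimal sketch (not the authors' pipeline) of assembling the mixed pre-training
# corpus, assuming video frames were already extracted into the folders listed below.
import glob
import os
from PIL import Image
from torch.utils.data import ConcatDataset, DataLoader, Dataset
from torchvision import transforms

class FrameDataset(Dataset):
    """Folder of RGB frames; no labels are needed for self-supervised pre-training."""
    def __init__(self, root, transform):
        self.paths = sorted(glob.glob(os.path.join(root, "**", "*.jpg"), recursive=True))
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        return self.transform(Image.open(self.paths[i]).convert("RGB"))

# Standard MAE-style augmentation: random resized crop plus horizontal flip.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Hypothetical directory layout for the five data sources.
sources = ["data/imagenet", "data/ego4d", "data/epic_kitchens",
           "data/100doh", "data/something_something"]
dataset = ConcatDataset([FrameDataset(p, transform) for p in sources])
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=16, drop_last=True)
```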
Self-supervised objective. At the core of our self-supervised representation learning approach is
masked image modeling via the masked autoencoders (MAE) [16]. MAE masks out random patches
in an image and reconstructs the missing content with a vision transformer (ViT) [9]. A high masking
ratio (e.g., 75%) and an asymmetric heavy-encoder, light-decoder design are important for learning good visual representations efficiently. Simple and free from dataset- or task-specific augmentations [41], MAE is the state-of-the-art self-supervised framework in computer vision [42,43,44,45],
and has been demonstrated to work well for motor control tasks in simulation as well [10].
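To make the objective concrete, the following sketch shows random patch masking at a 75% ratio and a reconstruction loss computed only over masked patches, following the MAE formulation [16]; the encoder and decoder arguments are placeholders for the asymmetric ViT encoder and lightweight decoder, so this is a schematic rather than the exact implementation used here.

```python
# Schematic sketch of the MAE objective; `encoder` and `decoder` are placeholders
# for the asymmetric heavy ViT encoder and lightweight transformer decoder of [16].
import torch

def random_masking(x, mask_ratio=0.75):
    """Keep a random 25% of patch tokens; return kept tokens, binary mask, restore ids."""
    B, N, D = x.shape                                   # batch, num patches, embed dim
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)           # random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)           # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)     # inverse permutation
    ids_keep = ids_shuffle[:, :len_keep]
    x_kept = torch.gather(x, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))
    mask = torch.ones(B, N, device=x.device)            # 1 = masked, 0 = visible
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)           # back to original patch order
    return x_kept, mask, ids_restore

def mae_loss(patch_tokens, target_patches, encoder, decoder, mask_ratio=0.75):
    """Reconstruct masked patches and average the MSE over masked positions only."""
    x_kept, mask, ids_restore = random_masking(patch_tokens, mask_ratio)
    latent = encoder(x_kept)                            # heavy encoder sees ~25% of patches
    pred = decoder(latent, ids_restore)                 # light decoder predicts all patches
    loss = ((pred - target_patches) ** 2).mean(dim=-1)  # per-patch MSE
    return (loss * mask).sum() / mask.sum()             # only masked patches contribute
```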
Architecture. We use ViT models as our vision encoders. While MAE-trained ViT models yield improving performance on vision tasks as model size grows [9,16,46], previous work [10] does not show an improvement from switching a ViT-Small model to its ViT-Base counterpart with 4× as many parameters. In this work, we scale the model up to ViT-Large and deploy it on the real robot. The model contains 307M parameters and runs at 64 gigaflops at input size 224×224, approximately 15× as many as the commonly adopted ResNet-50 [47], making it the largest vision model deployed for robotics. As we will show in the experiments, scaling model size while training on
sufficiently large data leads to consistent performance improvement on downstream robotic tasks.
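For reference, a ViT-L/16 encoder of this size can be instantiated and frozen as sketched below; the timm model name is a standard one, while the commented checkpoint path is a hypothetical placeholder for pre-trained weights, not an official artifact.

```python
# A minimal sketch of instantiating and freezing a ViT-L/16 feature extractor.
# The checkpoint path below is a hypothetical placeholder.
import timm
import torch

encoder = timm.create_model("vit_large_patch16_224", num_classes=0)  # ~307M params
# state = torch.load("mvp_vitl_pretrained.pth", map_location="cpu")  # hypothetical path
# encoder.load_state_dict(state, strict=False)

encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)          # frozen: no gradients flow into the encoder

with torch.no_grad():
    feats = encoder(torch.randn(1, 3, 224, 224))   # (1, 1024) pooled image features
```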
3.2 Real-World Robot Learning
We learn to perform real-robot tasks through behavior cloning (BC). We collect demonstrations
containing trajectories of RGB images from a wrist-mounted camera and the robot’s joint state at
each time step. For most of the tasks, we use the motion-tracked HTC Vive VR system to control the
end-effector. For some tasks that are difficult to demonstrate via the motion controller, e.g., closing a fridge door, we use kinesthetic teaching. Given the recorded demonstrations, we train a control policy that takes the input image features and proprioceptive states (joint positions) at time step t and predicts the action at time step t+1. We perform joint position control; we do not use any
end-effector information (e.g., the 6-DoF pose). We build on our MVP pipeline [10] and freeze
the image encoder throughout the policy learning, which prevents large pre-trained encoders from
overfitting to a specific setting or task, and greatly reduces GPU memory footprint and training time.
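A minimal sketch of this behavior cloning setup is given below: frozen image features and joint positions at step t are concatenated and regressed onto the joint-position action at step t+1. The network sizes, feature dimension, and joint count are illustrative assumptions rather than the exact configuration used on the robots.

```python
# Illustrative behavior cloning sketch; dimensions and network sizes are assumptions.
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """MLP mapping frozen image features + joint positions to the next joint positions."""
    def __init__(self, feat_dim=1024, num_joints=7, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + num_joints, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_joints),             # action: joint positions at t+1
        )

    def forward(self, image_feats, joint_pos):
        return self.net(torch.cat([image_feats, joint_pos], dim=-1))

policy = BCPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_step(frozen_encoder, images, joint_pos, next_joint_pos):
    """One behavior-cloning update; the vision encoder stays frozen throughout."""
    with torch.no_grad():
        image_feats = frozen_encoder(images)           # (B, feat_dim) features, no gradient
    pred = policy(image_feats, joint_pos)
    loss = nn.functional.mse_loss(pred, next_joint_pos)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```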