
Real-World Robot Learning with
Masked Visual Pre-training
Ilija Radosavovic∗, Tete Xiao∗, Stephen James, Pieter Abbeel, Jitendra Malik†, Trevor Darrell†
University of California, Berkeley
Abstract: In this work, we explore self-supervised visual pre-training on images
from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our
visual representations are pre-trained via a masked autoencoder (MAE), frozen,
and then passed into a learnable control module. Unlike prior work, we show that
the pre-trained representations are effective across a range of real-world robotic
tasks and embodiments. We find that our encoder consistently outperforms CLIP
(up to 75%), supervised ImageNet pre-training (up to 81%), and training from
scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a
massive collection of 4.5M images from the Internet and egocentric videos, and
clearly demonstrate the benefits of scaling visual pre-training for robot learning.
Keywords: Self-supervised Learning, Visual Representations, Robot Learning
1 Introduction
Learning representations with large neural networks is the workhorse of modern deep learning.
This has enabled impressive results in computer vision [1,2], natural language processing [3,4,5],
and audio generation [6,7]. How can we transfer the success stories of representation learning to
robotics? We can approach this from two ends: shared representations on the perception side or
shared representations on the action side. Our focus in this paper is on shared visual representations.
Of course, the devil is in the details. Recent developments in the field of visual learning have made
this more feasible: (1) the use of diverse, real-world data from the Internet and egocentric videos,
(2) self-supervised objectives that do not overly rely on data augmentations or other forms of strong
human-designed priors, (3) scalable and high-capacity transformer models [8,9], and (4) training of
control policies on top of frozen visual representations. In our recent work [10], we have shown that
this recipe for self-supervised visual pre-training is effective for motor control in simulation.
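To make the masked-prediction objective behind this recipe concrete, the sketch below shows one MAE-style pre-training step: visible patches are encoded, mask tokens are decoded, and the loss is computed only on the masked patches. The patch size, masking ratio, and module sizes are illustrative assumptions (positional embeddings and other details are omitted), not the exact configuration used in this work.

```python
# Minimal MAE-style masked prediction sketch (illustrative assumptions throughout;
# not the exact architecture or hyperparameters of this paper).
import torch
import torch.nn as nn

patch, dim, mask_ratio = 16, 192, 0.75           # assumed patch size / width / mask ratio

def patchify(imgs):
    # (B, 3, H, W) -> (B, N, patch*patch*3) non-overlapping patches
    B, C, H, W = imgs.shape
    h, w = H // patch, W // patch
    x = imgs.reshape(B, C, h, patch, w, patch)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, patch * patch * C)

encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 4, 4 * dim, batch_first=True), 4)
decoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 4, 4 * dim, batch_first=True), 2)
embed = nn.Linear(patch * patch * 3, dim)        # patch embedding
to_pixels = nn.Linear(dim, patch * patch * 3)    # pixel reconstruction head
mask_token = nn.Parameter(torch.zeros(1, 1, dim))

def mae_loss(imgs):
    tokens = embed(patchify(imgs))               # (B, N, dim); positional embeddings omitted
    B, N, _ = tokens.shape
    keep = int(N * (1 - mask_ratio))
    perm = torch.rand(B, N).argsort(dim=1)       # random patch order per image
    visible_idx, masked_idx = perm[:, :keep], perm[:, keep:]
    visible = torch.gather(tokens, 1, visible_idx[..., None].expand(-1, -1, dim))
    latent = encoder(visible)                    # encode visible patches only
    # append mask tokens and decode to predict the missing patches
    dec_in = torch.cat([latent, mask_token.expand(B, N - keep, dim)], dim=1)
    pred = to_pixels(decoder(dec_in))[:, keep:]  # predictions at the masked slots
    target = torch.gather(patchify(imgs), 1,
                          masked_idx[..., None].expand(-1, -1, patch * patch * 3))
    return ((pred - target) ** 2).mean()         # MSE on masked patches only

loss = mae_loss(torch.randn(2, 3, 224, 224))
loss.backward()
```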
In this paper, we show that this framework is effective for real-world robotic tasks as well (Figure 1).
We build on our prior work, but make significant advances in terms of data scale and diversity (7×
larger), model size (15× bigger), and real-world experiments (extensive real robot evaluations).
In particular, we train self-supervised visual representations on real-world images and videos from
the Internet [11,12,13] and egocentric video datasets [14,15]. We leverage masked autoencoders [16],
which learn visual representations by masked prediction. The hope is that, by learning to predict the
missing content in real-world images, the model will acquire useful properties of the visual world
that enable it to perform real-world robotic tasks. We then freeze the pre-trained vision encoder and
learn control policies on top of it; the same visual representations are used for all downstream robotic
tasks and embodiments. We focus on efficient real-world learning through behavior cloning with a
handful of human-provided demonstrations per task (20 to 80).
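As a concrete picture of this downstream setup, the sketch below trains a small control head by behavior cloning on top of a frozen encoder. The stand-in encoder, head sizes, proprioception, and action dimensionality are assumptions for illustration, not the exact controller described in this paper.

```python
# Behavior cloning on top of a frozen pre-trained encoder (illustrative sketch;
# the encoder stand-in, head sizes, and action dimension are assumptions).
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    def __init__(self, encoder, feat_dim=768, proprio_dim=7, action_dim=7):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():      # freeze the visual representation
            p.requires_grad_(False)
        self.head = nn.Sequential(               # small learnable control module
            nn.Linear(feat_dim + proprio_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image, proprio):
        with torch.no_grad():                    # encoder stays frozen
            feat = self.encoder(image)           # (B, feat_dim) image features
        return self.head(torch.cat([feat, proprio], dim=-1))

# Hypothetical stand-in for the frozen pre-trained vision encoder.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
policy = BCPolicy(encoder)
opt = torch.optim.Adam(policy.head.parameters(), lr=1e-3)

# One behavior-cloning step on a dummy batch of (image, proprioception, expert action).
images, proprio, actions = torch.randn(8, 3, 224, 224), torch.randn(8, 7), torch.randn(8, 7)
loss = nn.functional.mse_loss(policy(images, proprio), actions)
opt.zero_grad(); loss.backward(); opt.step()
```

Because only the head receives gradients, the same frozen representation can be reused across tasks and embodiments, and each new task requires only the handful of demonstrations noted above.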
∗,† Equal contribution. Code, pre-trained models, and videos available on our project page.