Practically, we develop a self-supervised contrastive framework which learns to locate the most
stable and distinctive elements in temporally displaced video frames, and maximizes their invariance.
Secondly, we find the statistics of standard video datasets to have a detrimental effect on the quality
of the resulting representations, as measured by their performance on canonical scene understanding
tasks. We therefore introduce a simple yet powerful video curation procedure, VideoNet, which aligns the class distribution of video datasets with that of ImageNet and redresses the imbalance between image and video learning. In concert, this paradigm constitutes a new methodology for distilling the knowledge of videos into visual representations: VITO.
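To make this concrete, a minimal sketch of such a temporal contrastive objective is an InfoNCE-style loss over embeddings of temporally displaced frames; the symbols below ($z_t$, $z_{t+\delta}$, $\tau$, $\mathcal{N}$) are illustrative assumptions rather than the exact formulation used by VITO:
$$
\mathcal{L}_{t,\delta} = -\log \frac{\exp\big(\mathrm{sim}(z_t, z_{t+\delta})/\tau\big)}{\exp\big(\mathrm{sim}(z_t, z_{t+\delta})/\tau\big) + \sum_{z^- \in \mathcal{N}} \exp\big(\mathrm{sim}(z_t, z^-)/\tau\big)},
$$
where $z_t$ and $z_{t+\delta}$ are embeddings of the attended regions in two frames of the same video separated by a temporal offset $\delta$, $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, $\tau$ is a temperature, and $\mathcal{N}$ contains embeddings drawn from other videos. Minimizing $\mathcal{L}_{t,\delta}$ encourages the attended content to remain invariant across the temporal displacement while staying distinctive from other videos.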
VITO yields task-general representations that perform well across both spatial and temporal under-
standing tasks. In particular, VITO shows large gains over prior video pretraining efforts in scene
understanding tasks, while achieving similarly large performance gains over image pretraining on
video understanding tasks. Furthermore, VITO significantly outperforms the default ImageNet pre-
training as well as adversarial pretraining on image classification tasks subject to natural distribution
shifts. Finally, we find that even without a significant expansion in model size, VITO not only is task-general and robust in performance, but also quantitatively captures multiple aspects of human perceptual judgements, surpassing models specifically trained for that purpose.
2 Related work
Learning general visual representations from videos. Many prior works have considered self-
supervised representation learning for capturing spatio-temporal invariances, beginning with methods
that leveraged temporal coherence, optical flow, and object tracking [23–31]. More recently, many
successful approaches have leveraged contrastive learning, masked autoencoding, and other self-
supervised pretext tasks to learn strong video representations [32–38]. However, most of these
methods employ specialized video architectures and only transfer to video-based tasks such as action
recognition and motion segmentation.
Yet natural motion-induced deformations are powerful learning signals that should allow for learning
better image representations as well. Indeed, human infants can form a complex understanding of objects and shape within months, specifically driven by their observations of how these objects move [39, 40].
Given this inspiration, some works have demonstrated that self-supervised contrastive learning in
videos can lead to aspects of efficient human learning and robust recognition [41–43]. In computer
vision, cycle-consistency [44, 45] and optical flow [46, 47] have been used to learn correspondences
between temporally ordered image patches. The works most similar to ours utilize video-based contrastive learning [7–9] to improve performance on temporal understanding tasks; however, they do so at the cost of spatial scene understanding.
Robustness to distribution shifts. As standard benchmarks have been progressively saturated [48], the community has turned to measuring robustness to adversarial attacks [49], corruptions [50], and out-of-distribution datasets [4, 51–53]. We focus on a subset of these benchmarks that are as “natural”
as possible, to evaluate generalization with respect to shifts that are most likely to appear in the real
world. While there have been many efforts to specifically regularize models for these kinds of robustness [54–57], we instead investigate the complementary question of whether image
and video pretraining differ in this respect.
Human-aligned representations. Most recent progress in achieving more behaviorally-matched representations has come from scaling existing approaches. Indeed, recent examples [10, 11, 58] show
that as data and model sizes grow by orders of magnitude, generality and robustness of representations
tend to emerge. Moreover, some aspects of human perception, such as an increased shape bias and consistency with human perceptual behavior [11, 59], can be captured reasonably well by certain large
models. However, this scaling property tends to be brittle, with some large-scale models displaying significantly worse consistency with human perception [11, 12]. Additionally, more recent work
on alignment has found that scaling and architecture are less important for alignment on specific benchmarks than the training dataset and objective function [60]. Therefore, while
scaling may continue to lead to task-performance gains, it is unclear whether scaling image-based pretraining alone will close the gap with general human behavior. We therefore explore the complementary
and potentially synergistic question of whether video pretraining can improve the task-generality,
robustness, and behavioral similarity of learned visual representations.