Self-supervised video pretraining yields robust and
more human-aligned visual representations
Nikhil Parthasarathy†, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff
Google DeepMind
Abstract
Humans learn powerful representations of objects and scenes by observing how
they evolve over time. Yet, outside of specific tasks that require explicit temporal
understanding, static image pretraining remains the dominant paradigm for learn-
ing visual foundation models. We question this mismatch, and ask whether video
pretraining can yield visual representations that bear the hallmarks of human percep-
tion: generalisation across tasks, robustness to perturbations, and consistency with
human judgements. To that end we propose a novel procedure for curating videos,
and develop a contrastive framework which learns from the complex transforma-
tions therein. This simple paradigm for distilling knowledge from videos, called
VITO, yields general representations that far outperform prior video pretraining
methods on image understanding tasks, and image pretraining methods on video
understanding tasks. Moreover, VITO representations are significantly more robust
to natural and synthetic deformations than image-, video-, and adversarially-trained
ones. Finally, VITO’s predictions are strongly aligned with human judgements,
surpassing models that were specifically trained for that purpose. Together, these
results suggest that video pretraining could be a simple way of learning unified,
robust, and human-aligned representations of the visual world.
1 Introduction
With the explosion of recent AI breakthroughs, humans now interact with and depend on the outputs of these models at an unprecedented rate. It is therefore increasingly important that these models be aligned with human abilities, judgements, and preferences. In the context of computer vision systems, human alignment can be quantified by accurate generalization across a wide range of tasks [1-3], robustness to various input deformations [4], and consistency with human perceptual judgements [5]. While each of these challenges has been tackled separately, progress along one axis has often come at the expense of the others. For example, gains in robustness [6] or temporal understanding [7-9] have thus far come at the cost of spatial understanding, and scaling the model and dataset size, while improving task-generality and robustness [10, 11], can be detrimental to their consistency with human perception [11, 12].
In this work we question this trend, and ask whether improvements to all aspects of human alignment can be made with the appropriate pretraining methodology. Specifically, humans and animals have long been thought to learn from the dynamic evolution of natural scenes [13-15], and we hypothesize that artificial visual systems will be more aligned by appropriately leveraging natural video pretraining. In particular, while many current self-supervised methods [16-19] learn representations that are invariant to synthetic augmentations capturing important image priors such as scale-, color-, and translation-invariance, these augmentations represent only a small part of the complex (and signal-rich) changes in pose, viewpoint, and motion that natural videos capture. Predicting the evolution of videos is also a natural means of learning intuitive physics and model-based reasoning [20-22].
†Current affiliation: NYU Center for Neural Science; work done while interning at DeepMind.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
arXiv:2210.06433v3 [cs.CV] 10 Jan 2025
Practically, we develop a self-supervised contrastive framework which learns to locate the most
stable and distinctive elements in temporally displaced video frames, and maximizes their invariance.
Secondly, we find the statistics of standard video datasets to have a detrimental effect on the quality
of the resulting representations, as measured by their performance on canonical scene understanding
tasks. We therefore introduce a simple, yet powerful video curation procedure—VideoNet—which
aligns their class distribution with that of ImageNet, and which redresses the imbalance between
image and video learning. In concert, this paradigm constitutes a new methodology for distilling the
knowledge of videos into visual representations: VITO.
VITO yields task-general representations that perform well across both spatial and temporal under-
standing tasks. Particularly, VITO shows large gains over prior video pretraining efforts in scene
understanding tasks, while achieving similarly large performance gains over image pretraining on
video understanding tasks. Furthermore, VITO significantly outperforms the default ImageNet pre-
training as well as adversarial pretraining on image classification tasks subject to natural distribution
shifts. Finally, we find that even without a significant expansion in model size, VITO is not only
task-general and robust in performance, but also quantitatively captures multiple aspects of human
perceptual judgements, surpassing models specifically trained for that purpose.
2 Related work
Learning general visual representations from videos. Many prior works have considered self-supervised representation learning for capturing spatio-temporal invariances, beginning with methods that leveraged temporal coherence, optical flow, and object tracking [23-31]. More recently, many successful approaches have leveraged contrastive learning, masked autoencoding, and other self-supervised pretext tasks to learn strong video representations [32-38]. However, most of these methods employ specialized video architectures and only transfer to video-based tasks such as action recognition and motion segmentation.
Yet natural motion-induced deformations are powerful learning signals that should allow for learning better image representations as well. Indeed, human infants form a complex understanding of objects and shape within months, driven specifically by their observations of how objects move [39, 40]. Given this inspiration, some works have demonstrated that self-supervised contrastive learning in videos can lead to aspects of efficient human learning and robust recognition [41-43]. In computer vision, cycle-consistency [44, 45] and optical flow [46, 47] have been used to learn correspondences between temporally ordered image patches. The works most similar to ours use video-based contrastive learning [7-9] to improve performance on temporal understanding tasks; however, they do so at the cost of spatial scene understanding.
Robustness to distribution shifts. As standard benchmarks have been progressively saturated [48], the community has turned to measuring robustness to adversarial attacks [49], corruptions [50], and out-of-distribution datasets [4, 51-53]. We focus on a subset of these benchmarks that are as "natural" as possible, to evaluate generalization with respect to shifts that are most likely to appear in the real world. While there have been many efforts to explicitly regularize models for these kinds of robustness [54-57], we instead investigate the complementary question of whether image and video pretraining differ in this respect.
Human-aligned representations. Most recent progress in achieving more behaviorally-matched representations has come from scaling existing approaches. Indeed, recent examples [10, 11, 58] show that as data and model sizes grow by orders of magnitude, generality and robustness of representations tend to emerge. Moreover, some aspects of human perception, such as an increased shape-bias and consistency with human perceptual behavior [11, 59], can be captured reasonably well by certain large models. However, this scaling property tends to be brittle, with some large-scale models displaying significantly worse consistency with human perception [11, 12]. Additionally, more recent work on alignment has found that scaling and architecture are not as important for alignment on specific benchmarks as the training dataset and objective function [60]. Therefore, while scaling may continue to yield task-performance gains, it is unclear whether scaling image-based pretraining alone will close the gap with general human behavior. We therefore explore the complementary and potentially synergistic question of whether video pretraining can improve the task-generality, robustness, and behavioral similarity of learned visual representations.
Figure 1: Learning to attend to related video content. Each augmented frame is encoded by the network $f$ as a spatial array of hidden vectors. The attention module $a$ takes as input features from one view and produces a mask that isolates features that are likely to be predictive of the other, temporally-displaced view. The attention-gated features are pooled accordingly, and both the feature extractor and attention module are trained to satisfy the contrastive objective. Subscripts $\theta$ and $\xi$ refer to the online and target (EMA) networks respectively.
3 Method
We pretrain image representations using video datasets, then transfer them to a range of downstream
tasks that test image, video, and robust understanding. We adopt the ResNet-50 architecture for our
initial exploration, then validate our results with Swin transformers (see Sec. B.4).
3.1 Self-supervised pretraining
Our method for distilling videos into image representations, VITO, builds robust visual representa-
tions by learning to track stable and distinctive content in videos while they evolve over time.
Natural video pipeline. The key to our method is to distill the natural transformations present in videos into image-based representations. Given a video clip, we sample frames according to a distribution $\mathcal{T}$ and further transform each frame with image-based augmentations:

$$v_1 \sim \mathcal{A}_1(x_1), \qquad v_2 \sim \mathcal{A}_2(x_2), \qquad x_1, x_2 \sim \mathcal{T}(\{x_t\}_{t=1,\dots,T}) \qquad (1)$$

where the distribution $\mathcal{T}$ samples frames uniformly from a video clip of length $T = 2.56\,$s, and the image transformations $\mathcal{A}_l$ include random cropping, flipping, blurring, and point-wise color transformations [61] (see Appendices A.1 and B.2, and Figure B.3 for an ablation).
We note that video frames (or even uncurated image data) typically differ from the statistics of (object-centered) ImageNet images, with more variable viewpoints and a larger field-of-view that can cover multiple objects in complex scenes. As a result, the aggressive random cropping from [61] (whose smallest crops cover only 8% of the original image) can result in "positive" pairs with very different semantic content (e.g. entirely different objects). We therefore suggest, and empirically validate, that larger crop sizes (e.g. increasing the minimum crop size to 40%) are beneficial when learning from real-world video frames (see Figure B.2).
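To make the sampling pipeline concrete, the sketch below instantiates Eq. (1) in PyTorch/torchvision. The 2.56 s clip length and the 0.4 minimum crop area follow the text; the specific jitter strengths, blur kernel, grayscale step, output resolution, and frame rate are assumptions in the spirit of BYOL-style recipes [61], not the exact values used here.

```python
# Minimal sketch of the two-view sampling pipeline (Eq. 1). Augmentation
# hyperparameters other than the 0.4 minimum crop area are assumptions.
import random
import torchvision.transforms as T

CLIP_SECONDS = 2.56  # length of the sampled video clip (from the text)

augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.4, 1.0)),  # larger minimum crop than the usual 0.08
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
    T.ToTensor(),
])

def sample_views(frames, fps=25.0):
    """Sample two temporally displaced frames from one clip and augment each.

    `frames` is a list of PIL images covering at least CLIP_SECONDS of video.
    """
    n = min(len(frames), int(CLIP_SECONDS * fps))
    i, j = random.randrange(n), random.randrange(n)   # x1, x2 ~ T({x_t})
    v1, v2 = augment(frames[i]), augment(frames[j])   # v_l ~ A_l(x_l)
    return v1, v2
```

The only change relative to a standard image-pretraining pipeline is that the two views come from different frames of the same clip, so the network must also absorb the natural deformations between them.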
Multi-scale contrastive attention pooling. Standard contrastive frameworks use global average pooling of hidden vectors to obtain a single representation of each view. It has been shown that dense contrastive losses can lead to significant improvements [62-65], but these methods require establishing correspondences across views. Whereas correspondences can easily be obtained from static images, when temporal deformations are introduced they require some form of object or point tracking [46]. Furthermore, with the larger field-of-view of video frames, correspondence learning becomes an increasingly difficult task. In this work, we propose a more general, adaptive method for learning correspondences at multiple scales. Our method learns what features should be attended to in order to solve the contrastive learning problem across temporally displaced views.
As shown in Figure 1, given a view $v_l$, the feature extractor outputs a spatial map of feature vectors $h^{l,s}_\theta \in \mathbb{R}^{h \times w \times c}$ at a given scale $s$, where different scales correspond, for example, to the outputs of different blocks of a ResNet. At each scale, we introduce a 2-layer attention MLP $a^s_\theta$ which outputs a mask $m^{l,s} = \mathrm{softmax}(a_\theta(h^{l,s}_\theta))$ that we use to spatially weight and pool hidden vectors:

$$\hat{h}^{l,s}_\theta = \sum_{i,j} m^{l,s}[i,j]\, h^{l,s}_\theta[i,j] \qquad (2)$$
which we concatenate across scales: $\hat{h}^l_\theta = [\hat{h}^{l,s}_\theta,\ s \in 1 \dots S]$. In our experiments, we find that for the canonical ResNet-50 architecture, attending over the outputs of the last two ResNet blocks (i.e. $S = 2$) is optimal for our evaluations. The concatenated hidden vectors are then transformed with a standard two-layer MLP projector $g_\theta$, yielding projections $z^l_\theta = g_\theta(\hat{h}^l_\theta)$. We enforce invariance across views using the standard InfoNCE loss [66], encoding targets with slowly-varying target networks $f_\xi$ and $g_\xi$ that are exponential moving averages of the online network [61]:

$$\mathcal{L}_{ij}(\theta; \xi) = -\log \frac{\exp(z^i_\theta \cdot z^j_\xi)}{\exp(z^i_\theta \cdot z^j_\xi) + \sum_n \exp(z^i_\theta \cdot z^n_\xi)} \qquad (3)$$
Here $\{z^n_\xi\}_n$ are negative features computed from frames of other videos in the batch. The final, multi-view loss is evaluated over all pairs: $\mathcal{L}(\theta; \xi) = \sum_{i \neq j} \mathcal{L}_{ij}(\theta; \xi)$.
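The sketch below summarizes how the attention pooling (Eq. 2) and contrastive objective (Eq. 3) fit together. Layer widths, the feature-map layout, the L2-normalization, and the temperature are assumptions for illustration (Eq. 3 is written with a plain dot product), and the EMA target update and multi-view symmetrization are only indicated in comments; this is a sketch, not the released implementation.

```python
# Simplified sketch of multi-scale attention pooling and the InfoNCE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """2-layer MLP a_theta producing a spatial mask m = softmax(a(h)) (Eq. 2)."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, h):                               # h: (B, H, W, C)
        logits = self.mlp(h).flatten(1)                 # (B, H*W)
        mask = logits.softmax(dim=-1)                   # attention over spatial locations
        return (mask.unsqueeze(-1) * h.flatten(1, 2)).sum(dim=1)  # (B, C)

def project(features, pools, projector):
    """Pool each scale, concatenate, and project: z = g([h_hat^{l,s}, s = 1..S])."""
    pooled = [pool(h) for pool, h in zip(pools, features)]
    return projector(torch.cat(pooled, dim=-1))

def info_nce(z_online, z_target, temperature=0.1):
    """Contrastive loss across views; other clips in the batch supply negatives (Eq. 3)."""
    z1 = F.normalize(z_online, dim=-1)                  # online projections (one view)
    z2 = F.normalize(z_target, dim=-1)                  # EMA-target projections (other view)
    logits = z1 @ z2.t() / temperature                  # (B, B); diagonal entries are positives
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```

In training, `z_online` comes from the online networks $f_\theta, g_\theta$ applied to one view and `z_target` from the EMA copies $f_\xi, g_\xi$ applied to the other (with gradients stopped), and the loss is summed over all ordered pairs of views.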
3.2 Addressing dataset domain mismatch
We began by investigating the potential for learning general representations from videos using standard datasets, including Kinetics, AudioSet, and YouTube-8M. However, Kinetics is quite small and limited in scope to human actions. On the other hand, AudioSet and YouTube-8M are noisy and have very imbalanced class distributions. Additionally, prior work has shown that even self-supervised methods are quite sensitive to the pretraining distribution [67]. Yet over the last decade, it has been shown that ImageNet can be used to learn image representations that transfer well to many downstream tasks. As a result, we hypothesized that collecting a minimally-curated video dataset matched to the rough properties of ImageNet would be beneficial for learning a more general visual model from videos.
To test this hypothesis, we developed a data curation pipeline, VideoNet, to filter online videos such that our training data more closely matches the distribution of ImageNet categories. For each of the 1,000 ImageNet categories, we retrieved 5,000 video clips whose title included the category's name or a synonym. We then filtered these videos by applying an image classifier (a ResNet-50 pretrained on ImageNet) to verify that the videos contained the intended object category. We classified the first 100 frames of each video and discarded videos for which the query category was never the ResNet's top-1 prediction on any of these frames. We also discarded videos less than 10 s in length.
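As a rough illustration of this filtering step, the sketch below keeps a retrieved video only if an ImageNet-pretrained ResNet-50 predicts the query category as top-1 for at least one of its first 100 frames and the video is at least 10 s long. The torchvision weights API and preprocessing are stand-ins for whatever classifier the authors used, the helper names (`keep_video`, `query_class_idx`) are hypothetical, and frame decoding and synonym-based retrieval are omitted.

```python
# Rough sketch of VideoNet-style filtering, assuming frames are already
# decoded as PIL images and the video duration is known.
import torch
from torchvision import models

weights = models.ResNet50_Weights.IMAGENET1K_V1
classifier = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()            # standard ImageNet preprocessing

@torch.no_grad()
def keep_video(frames, query_class_idx, duration_s, min_duration_s=10.0, n_check=100):
    """Keep a retrieved video if the query category is the top-1 prediction
    for at least one of its first `n_check` frames and it is long enough."""
    if duration_s < min_duration_s:
        return False
    for frame in frames[:n_check]:
        logits = classifier(preprocess(frame).unsqueeze(0))
        if logits.argmax(dim=-1).item() == query_class_idx:
            return True
    return False
```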
While the VideoNet procedure is conceptually close to the method used to create the R2V2 dataset proposed by Gordon et al. [7], it differs in a few ways. First, we utilize full video clips, which allows us to uniformly sample frames at any time point rather than the fixed sampling of frames 5 s apart in R2V2. Second, by using the ImageNet classifier to filter videos, we reduce the mismatch with the ImageNet distribution that can arise from incorrect tagging and noisy labeling of online videos; indeed, only 1.18M of the 5M retrieved videos met our filtering criteria. We also note that classification-based filtering is just one possible curation method. While we demonstrate in Sec. 4.3 that this curation provides large benefits for video pretraining compared with existing datasets, there remains great potential for improvement by utilizing larger target datasets (such as ImageNet-22K) and alternative curation strategies, such as the nearest-neighbor retrieval proposed by [10] in creating the LVD-142M image dataset.
4 Results
Humans are able to solve a range of visual tasks that require complex spatial and temporal reasoning,
including generalizing to noisy or out-of-distribution (OOD) scenarios. Therefore, we first benchmark
VITO against image and video pretrained models on a variety of tasks to demonstrate sufficient
generality and robustness in task performance. We then assess whether VITO not only captures these
task-based properties, but also displays strong quantitative alignment with human behavior.
4.1 VITO generalizes across diverse visual tasks
We present in Table 1 the transfer performance of VITO compared to strong supervised and self-supervised baselines on dense scene understanding (semantic segmentation and object detection), video understanding (video segmentation and action recognition), and out-of-distribution (OOD) object recognition. On every benchmark, VITO either outperforms or is competitive with the best baselines for that specific task.
Pretraining         Dataset    ADE20K   COCO    DAVIS        UCF101    IN-A     IN-Vid
                               (mIoU)   (mAP)   (J&F mean)   (top-1)   (top-1)  (pm0/pm10)

Random              -          27.9     39.0    -            -         -        -

Standard image pretraining
Supervised          ImageNet   33.5     44.2    66.1         83.4      2.2      67.7/52.4
BYOL [61]           ImageNet   38.8     43.7    66.6         85.6      -        -
MoCLR [68]          ImageNet   39.2     43.9    65.5         85.5      3.7      64.7/50.0
DINO [19]           ImageNet   39.0     44.3    65.3         85.4      5.0      65.2/52.0

Robust image pretraining
Stylized-IN [56]    SIN+IN     -        -       -            83.3      2.0      68.4/51.7
L2-Robust [54]      ImageNet   -        -       -            83.7      2.1      65.2/51.6

Video pretraining
VIVI [69]           YT8M       34.2     41.3    -            -         0.5      57.9/36.5
MMV-VA [70]         AS+HT      32.5     41.3    -            -         -        -
VINCE [7]           R2V2       35.7     42.4    66.1         -         -        -
VFS [8]             K400       31.4     41.6    67.8         -         -        -
CycleCon [9]        R2V2       35.6     42.8    -            82.8      0.4      50.4/30.1

VITO                VideoNet   39.4     44.0    68.2         87.4      5.4      70.6/57.2

Table 1: VITO representations generalize to a variety of tasks in both image and video modalities, surpassing models specialized for each task. Column groups: scene understanding (ADE20K, COCO), video understanding (DAVIS, UCF101), and OOD recognition (IN-A, IN-Vid). For external models, we finetune publicly available checkpoints.
Scene understanding. We first note that VITO provides large gains over all prior video pretraining methods on scene understanding and robust object recognition. We further validate these comparisons on three additional benchmarks and find that VITO strongly outperforms the prior work across all 5 datasets (PASCAL/ADE20K/COCO/LVIS/IN-1K, see Table B.3). For example, VITO improves over VIVI [69] by 2-10%, highlighting the importance of data curation and our contrastive formulation. VITO improves over VINCE [7] by 1-12%, highlighting the importance of fine-grained temporal deformations. Finally, VITO improves even over MMV [70] by 2-15%, despite their use of large-scale text supervision, highlighting the relevance of video-only learning.
Compared with the best supervised and self-supervised image-pretraining methods, VITO achieves competitive performance on these same benchmarks (Table 1 and Table B.3). To our knowledge, VITO is the first video-pretrained method to close the gap with ImageNet pretraining on large-scale scene understanding benchmarks such as these.
Video understanding. We next ask whether this increased spatial understanding comes at the cost of the traditional benefits of video pretraining on video tasks. Evaluating on DAVIS segmentation and UCF-101 action recognition, we find that this is not the case. On DAVIS, which tests the ability to segment an object over its dynamic temporal evolution, VITO features capture fine-grained temporal deformations of objects far better than ImageNet pretraining methods, as well as the best video pretraining methods (see Table B.4 for additional comparisons). On UCF-101, which tests the ability to classify global spatio-temporal features, we find that a simple average pooling of VITO frame representations again significantly outperforms all image pretraining and prior frame-based video pretraining. VITO even outperforms a number of recent methods that use specialized video architectures (see Table B.5). While VITO under-performs relative to the best video models, we note that these methods either cannot be tested or under-perform on spatial understanding. Additionally, as shown in Table B.5 and Sec. A.5, simple learned temporal pooling strategies on top of VITO representations further close the gap with the best video architectures.
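For concreteness, the frame-averaging evaluation described above amounts to something like the sketch below: per-frame embeddings from the pretrained image backbone are averaged over time and passed to a classification head. The `backbone` and `classifier` interfaces are placeholders, and whether the backbone is frozen or fine-tuned on UCF-101 is left out; this is an illustration, not the authors' evaluation code.

```python
# Illustrative frame-averaged action classification with an image backbone.
import torch
import torch.nn as nn

def action_logits(backbone: nn.Module, classifier: nn.Module, clip: torch.Tensor) -> torch.Tensor:
    """clip: (T, 3, H, W) frames from one video; returns (1, num_classes) logits."""
    feats = backbone(clip)                     # (T, D) per-frame embeddings from the image model
    pooled = feats.mean(dim=0, keepdim=True)   # simple average pooling over time -> (1, D)
    return classifier(pooled)                  # linear or MLP head over the pooled feature
```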