Self-supervised video pretraining yields robust and
more human-aligned visual representations
Nikhil Parthasarathy†, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff
Google DeepMind
Abstract
Humans learn powerful representations of objects and scenes by observing how
they evolve over time. Yet, outside of specific tasks that require explicit temporal
understanding, static image pretraining remains the dominant paradigm for learn-
ing visual foundation models. We question this mismatch, and ask whether video
pretraining can yield visual representations that bear the hallmarks of human percep-
tion: generalisation across tasks, robustness to perturbations, and consistency with
human judgements. To that end we propose a novel procedure for curating videos,
and develop a contrastive framework which learns from the complex transforma-
tions therein. This simple paradigm for distilling knowledge from videos, called
VITO, yields general representations that far outperform prior video pretraining
methods on image understanding tasks, and image pretraining methods on video
understanding tasks. Moreover, VITO representations are significantly more robust
to natural and synthetic deformations than image-, video-, and adversarially-trained
ones. Finally, VITO’s predictions are strongly aligned with human judgements,
surpassing models that were specifically trained for that purpose. Together, these
results suggest that video pretraining could be a simple way of learning unified,
robust, and human-aligned representations of the visual world.
1 Introduction
With the explosion of recent AI breakthroughs, humans now interact with and depend on the outputs of these models at an unprecedented rate. It is therefore increasingly important that these models be aligned with human abilities, judgements, and preferences. In the context of computer vision systems, human alignment can be quantified by accurate generalization across a wide range of tasks [1-3], robustness to various input deformations [4], and consistency with human perceptual judgements [5]. While each of these challenges has been tackled separately, progress along one axis has often come at the expense of the others. For example, gains in robustness [6] or temporal understanding [7-9] have thus far come at the cost of spatial understanding, and scaling the model and dataset size, while improving task-generality and robustness [10, 11], can be detrimental to their consistency with human perception [11, 12].
In this work we question this trend, and ask whether improvements to all aspects of human alignment can be made with the appropriate pretraining methodology. Specifically, humans and animals have long been thought to learn from the dynamic evolution of natural scenes [13-15], and we hypothesize that artificial visual systems will be more aligned by appropriately leveraging natural video pretraining. In particular, while many current self-supervised methods [16-19] learn representations that are invariant to synthetic augmentations capturing important image priors such as scale-, color-, and translation-invariance, these augmentations represent only a small part of the complex (and signal-rich) changes in pose, viewpoint, and motion that natural videos capture. Predicting the evolution of videos is also a natural means of learning intuitive physics and model-based reasoning [20-22].
†Current affiliation: NYU Center for Neural Science; work done while interning at DeepMind.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
arXiv:2210.06433v3 [cs.CV] 10 Jan 2025
Practically, we develop a self-supervised contrastive framework which learns to locate the most
stable and distinctive elements in temporally displaced video frames, and maximizes their invariance.
Secondly, we find the statistics of standard video datasets to have a detrimental effect on the quality
of the resulting representations, as measured by their performance on canonical scene understanding
tasks. We therefore introduce a simple, yet powerful video curation procedure—VideoNet—which
aligns their class distribution with that of ImageNet, and which redresses the imbalance between
image and video learning. In concert, this paradigm constitutes a new methodology for distilling the
knowledge of videos into visual representations: VITO.
VITO yields task-general representations that perform well across both spatial and temporal under-
standing tasks. Particularly, VITO shows large gains over prior video pretraining efforts in scene
understanding tasks, while achieving similarly large performance gains over image pretraining on
video understanding tasks. Furthermore, VITO significantly outperforms the default ImageNet pre-
training as well as adversarial pretraining on image classification tasks subject to natural distribution
shifts. Finally, we find that even without a significant expansion in model size, VITO is not only
task-general and robust in performance, but also quantitatively captures multiple aspects of human
perceptual judgements, surpassing models specifically trained for that purpose.
2 Related work
Learning general visual representations from videos. Many prior works have considered self-supervised representation learning for capturing spatio-temporal invariances, beginning with methods that leveraged temporal coherence, optical flow, and object tracking [23-31]. More recently, many successful approaches have leveraged contrastive learning, masked autoencoding, and other self-supervised pretext tasks to learn strong video representations [32-38]. However, most of these methods employ specialized video architectures and only transfer to video-based tasks such as action recognition and motion segmentation.
Yet natural motion-induced deformations are powerful learning signals that should allow for learning better image representations as well. Indeed, human infants form a complex understanding of objects and shape within months, driven specifically by their observations of how objects move [39, 40]. Given this inspiration, some works have demonstrated that self-supervised contrastive learning in videos can lead to aspects of efficient human learning and robust recognition [41-43]. In computer vision, cycle-consistency [44, 45] and optical flow [46, 47] have been used to learn correspondences between temporally ordered image patches. The works most similar to ours use video-based contrastive learning [7-9] to improve performance on temporal understanding tasks; however, they do so at the cost of spatial scene understanding.
Robustness to distribution shifts. As standard benchmarks have been progressively saturated [48], the community has turned to measuring robustness to adversarial attacks [49], corruptions [50], and out-of-distribution datasets [4, 51-53]. We focus on a subset of these benchmarks that are as "natural" as possible, to evaluate generalization with respect to shifts that are most likely to appear in the real world. While there have been many efforts to explicitly regularize models for these kinds of robustness [54-57], we instead investigate the complementary question of whether image and video pretraining differ in this respect.
Human-aligned representations. Most recent progress in achieving more behaviorally-matched representations has come from scaling existing approaches. Indeed, recent examples [10, 11, 58] show that as data and model sizes grow by orders of magnitude, generality and robustness of representations tend to emerge. Moreover, some aspects of human perception, such as an increased shape-bias and consistency with human perceptual behavior [11, 59], can be captured reasonably well by certain large models. However, this scaling property tends to be brittle, with some large-scale models displaying significantly worse consistency with human perception [11, 12]. Additionally, more recent work on alignment has found that scaling and architecture are not as important for alignment on specific benchmarks as the training dataset and objective function [60]. Therefore, while scaling may continue to yield task-performance gains, it is unclear whether scaling image-based pretraining alone will close the gap with general human behavior. We therefore explore the complementary and potentially synergistic question of whether video pretraining can improve the task-generality, robustness, and behavioral similarity of learned visual representations.
Figure 1: Learning to attend to related video content. Each augmented frame is encoded by the network $f$ as a spatial array of hidden vectors. The attention module $a$ takes as input features from one view and produces a mask that isolates features that are likely to be predictive of the other, temporally-displaced view. The attention-gated features are pooled accordingly, and both the feature extractor and attention module are trained to satisfy the contrastive objective. Subscripts $\theta$ and $\xi$ refer to the online and target (EMA) networks respectively.
3 Method
We pretrain image representations using video datasets, then transfer them to a range of downstream
tasks that test image, video, and robust understanding. We adopt the ResNet-50 architecture for our
initial exploration, then validate our results with Swin transformers (see Sec. B.4).
3.1 Self-supervised pretraining
Our method for distilling videos into image representations, VITO, builds robust visual representa-
tions by learning to track stable and distinctive content in videos while they evolve over time.
Natural video pipeline. The key to our method is to distill the natural transformations present in videos into image-based representations. Given a video clip, we sample frames according to a distribution $\mathcal{T}$ and further transform each frame with image-based augmentations:

$$v_1 \sim \mathcal{A}_1(x_1), \qquad v_2 \sim \mathcal{A}_2(x_2), \qquad x_1, x_2 \sim \mathcal{T}(\{x_t\}_{t=1,\dots,T}) \qquad (1)$$

where the distribution $\mathcal{T}$ samples frames uniformly from a video clip of length $T = 2.56\,$s, and the image transformations $\mathcal{A}_l$ include random cropping, flipping, blurring, and point-wise color transformations [61] (see Appendices A.1 and B.2, and Figure B.3 for an ablation).
We note that video frames (or even uncurated image data) typically differ from the statistics of (object-centered) ImageNet images, with more variable viewpoints and a larger field-of-view that can cover multiple objects in complex scenes. As a result, the aggressive random cropping from [61] (whose smallest crops cover only 8% of the original image) can result in "positive" pairs with very different semantic content (e.g. entirely different objects). We therefore suggest, and empirically validate, that larger crop sizes (e.g. increasing the minimum crop size to 40%) are beneficial when learning from real-world video frames (see Figure B.2).
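To make the sampling pipeline concrete, the sketch below instantiates Eq. (1) in PyTorch/torchvision. The 2.56 s clip length and the 0.4 minimum crop area follow the text; the specific jitter strengths, blur kernel, grayscale step, output resolution, and frame rate are assumptions in the spirit of BYOL-style recipes [61], not the exact values used here.

```python
# Minimal sketch of the two-view sampling pipeline (Eq. 1). Augmentation
# hyperparameters other than the 0.4 minimum crop area are assumptions.
import random
import torchvision.transforms as T

CLIP_SECONDS = 2.56  # length of the sampled video clip (from the text)

augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.4, 1.0)),  # larger minimum crop than the usual 0.08
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
    T.ToTensor(),
])

def sample_views(frames, fps=25.0):
    """Sample two temporally displaced frames from one clip and augment each.

    `frames` is a list of PIL images covering at least CLIP_SECONDS of video.
    """
    n = min(len(frames), int(CLIP_SECONDS * fps))
    i, j = random.randrange(n), random.randrange(n)   # x1, x2 ~ T({x_t})
    v1, v2 = augment(frames[i]), augment(frames[j])   # v_l ~ A_l(x_l)
    return v1, v2
```

The only change relative to a standard image-pretraining pipeline is that the two views come from different frames of the same clip, so the network must also absorb the natural deformations between them.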
Multi-scale contrastive attention pooling. Standard contrastive frameworks use global average pooling of hidden vectors to obtain a single representation of each view. It has been shown that dense contrastive losses can lead to significant improvements [62-65], but these methods require establishing correspondences across views. Whereas correspondences can easily be obtained from static images, when temporal deformations are introduced they require some form of object or point tracking [46]. Furthermore, with the larger field-of-view of video frames, correspondence learning becomes an increasingly difficult task. In this work, we propose a more general, adaptive method for learning correspondences at multiple scales. Our method learns what features should be attended to in order to solve the contrastive learning problem across temporally displaced views.
As shown in Figure 1, given a view $v_l$, the feature extractor outputs a spatial map of feature vectors $h^{l,s}_\theta \in \mathbb{R}^{h \times w \times c}$ at a given scale $s$, where different scales correspond, for example, to the outputs of different blocks of a ResNet. At each scale, we introduce a 2-layer attention MLP $a^s_\theta$ which outputs a mask $m^{l,s} = \mathrm{softmax}(a_\theta(h^{l,s}_\theta))$ that we use to spatially weight and pool hidden vectors:

$$\hat{h}^{l,s}_\theta = \sum_{i,j} m^{l,s}[i,j]\, h^{l,s}_\theta[i,j] \qquad (2)$$
which we concatenate across scales: $\hat{h}^l_\theta = [\hat{h}^{l,s}_\theta,\ s \in 1 \dots S]$. In our experiments, we find that for the canonical ResNet-50 architecture, attending over the outputs of the last two ResNet blocks (i.e. $S = 2$) is optimal for our evaluations. The concatenated hidden vectors are then transformed with a standard two-layer MLP projector $g_\theta$, yielding projections $z^l_\theta = g_\theta(\hat{h}^l_\theta)$. We enforce invariance across views using the standard InfoNCE loss [66], encoding targets with slowly-varying target networks $f_\xi$ and $g_\xi$ that are exponential moving averages of the online network [61]:

$$\mathcal{L}_{ij}(\theta; \xi) = -\log \frac{\exp(z^i_\theta \cdot z^j_\xi)}{\exp(z^i_\theta \cdot z^j_\xi) + \sum_n \exp(z^i_\theta \cdot z^n_\xi)} \qquad (3)$$
Here $\{z^n_\xi\}_n$ are negative features computed from frames of other videos in the batch. The final, multi-view loss is evaluated over all pairs: $\mathcal{L}(\theta; \xi) = \sum_{i \neq j} \mathcal{L}_{ij}(\theta; \xi)$.
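The sketch below summarizes how the attention pooling (Eq. 2) and contrastive objective (Eq. 3) fit together. Layer widths, the feature-map layout, the L2-normalization, and the temperature are assumptions for illustration (Eq. 3 is written with a plain dot product), and the EMA target update and multi-view symmetrization are only indicated in comments; this is a sketch, not the released implementation.

```python
# Simplified sketch of multi-scale attention pooling and the InfoNCE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """2-layer MLP a_theta producing a spatial mask m = softmax(a(h)) (Eq. 2)."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, h):                               # h: (B, H, W, C)
        logits = self.mlp(h).flatten(1)                 # (B, H*W)
        mask = logits.softmax(dim=-1)                   # attention over spatial locations
        return (mask.unsqueeze(-1) * h.flatten(1, 2)).sum(dim=1)  # (B, C)

def project(features, pools, projector):
    """Pool each scale, concatenate, and project: z = g([h_hat^{l,s}, s = 1..S])."""
    pooled = [pool(h) for pool, h in zip(pools, features)]
    return projector(torch.cat(pooled, dim=-1))

def info_nce(z_online, z_target, temperature=0.1):
    """Contrastive loss across views; other clips in the batch supply negatives (Eq. 3)."""
    z1 = F.normalize(z_online, dim=-1)                  # online projections (one view)
    z2 = F.normalize(z_target, dim=-1)                  # EMA-target projections (other view)
    logits = z1 @ z2.t() / temperature                  # (B, B); diagonal entries are positives
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```

In training, `z_online` comes from the online networks $f_\theta, g_\theta$ applied to one view and `z_target` from the EMA copies $f_\xi, g_\xi$ applied to the other (with gradients stopped), and the loss is summed over all ordered pairs of views.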
3.2 Addressing dataset domain mismatch
We began by investigating the potential for learning general representations from videos using standard datasets, including Kinetics, AudioSet, and YouTube-8M. However, Kinetics is quite small and limited in scope to human actions. On the other hand, AudioSet and YouTube-8M are noisy and have very imbalanced class distributions. Additionally, prior work has shown that even self-supervised methods are quite sensitive to the pretraining distribution [67]. Yet over the last decade, it has been shown that ImageNet can be used to learn image representations that transfer well to many downstream tasks. As a result, we hypothesized that collecting a minimally-curated video dataset matched to the rough properties of ImageNet would be beneficial for learning a more general visual model from videos.
To test this hypothesis, we developed a data curation pipeline, VideoNet, to filter online videos such that our training data more closely matches the distribution of ImageNet categories. For each of the 1,000 ImageNet categories, we retrieved 5,000 video clips whose title included the category's name or a synonym. We then filtered these videos by applying an image classifier (a ResNet-50 pretrained on ImageNet) to verify that the videos contained the intended object category. We classified the first 100 frames of each video and discarded videos for which the query category was never the ResNet's top-1 prediction on any of these frames. We also discarded videos less than 10 s in length.
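As a rough illustration of this filtering step, the sketch below keeps a retrieved video only if an ImageNet-pretrained ResNet-50 predicts the query category as top-1 for at least one of its first 100 frames and the video is at least 10 s long. The torchvision weights API and preprocessing are stand-ins for whatever classifier the authors used, the helper names (`keep_video`, `query_class_idx`) are hypothetical, and frame decoding and synonym-based retrieval are omitted.

```python
# Rough sketch of VideoNet-style filtering, assuming frames are already
# decoded as PIL images and the video duration is known.
import torch
from torchvision import models

weights = models.ResNet50_Weights.IMAGENET1K_V1
classifier = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()            # standard ImageNet preprocessing

@torch.no_grad()
def keep_video(frames, query_class_idx, duration_s, min_duration_s=10.0, n_check=100):
    """Keep a retrieved video if the query category is the top-1 prediction
    for at least one of its first `n_check` frames and it is long enough."""
    if duration_s < min_duration_s:
        return False
    for frame in frames[:n_check]:
        logits = classifier(preprocess(frame).unsqueeze(0))
        if logits.argmax(dim=-1).item() == query_class_idx:
            return True
    return False
```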
While the VideoNet procedure is conceptually close to the method used to create the R2V2 dataset proposed by Gordon et al. [7], it differs in a few ways. First, we utilize full video clips, which allows us to uniformly sample frames at any time point rather than the fixed sampling of frames 5 s apart in R2V2. Second, by using the ImageNet classifier to filter videos, we reduce the mismatch with the ImageNet distribution that can arise from incorrect tagging and noisy labeling of online videos; indeed, only 1.18M of the 5M retrieved videos met our filtering criteria. We also note that classification-based filtering is just one possible curation method. While we demonstrate in Sec. 4.3 that this curation provides large benefits for video pretraining compared with existing datasets, there remains great potential for improvement by utilizing larger target datasets (such as ImageNet-22K) and alternative curation strategies, such as the nearest-neighbor retrieval proposed by [10] in creating the LVD-142M image dataset.
4 Results
Humans are able to solve a range of visual tasks that require complex spatial and temporal reasoning,
including generalizing to noisy or out-of-distribution (OOD) scenarios. Therefore, we first benchmark
VITO against image and video pretrained models on a variety of tasks to demonstrate sufficient
generality and robustness in task performance. We then assess whether VITO not only captures these
task-based properties, but also displays strong quantitative alignment with human behavior.
4.1 VITO generalizes across diverse visual tasks
We present in Table 1 the transfer performance of VITO compared to strong supervised and self-supervised baselines on dense scene understanding (semantic segmentation and object detection), video understanding (video segmentation and action recognition), and out-of-distribution (OOD) object recognition. On every benchmark, VITO either outperforms or is competitive with the best baselines for that specific task.
Pretraining         Dataset    ADE20K   COCO    DAVIS        UCF101    IN-A     IN-Vid
                               (mIoU)   (mAP)   (J&F mean)   (top-1)   (top-1)  (pm0/pm10)

Random              -          27.9     39.0    -            -         -        -

Standard image pretraining
Supervised          ImageNet   33.5     44.2    66.1         83.4      2.2      67.7/52.4
BYOL [61]           ImageNet   38.8     43.7    66.6         85.6      -        -
MoCLR [68]          ImageNet   39.2     43.9    65.5         85.5      3.7      64.7/50.0
DINO [19]           ImageNet   39.0     44.3    65.3         85.4      5.0      65.2/52.0

Robust image pretraining
Stylized-IN [56]    SIN+IN     -        -       -            83.3      2.0      68.4/51.7
L2-Robust [54]      ImageNet   -        -       -            83.7      2.1      65.2/51.6

Video pretraining
VIVI [69]           YT8M       34.2     41.3    -            -         0.5      57.9/36.5
MMV-VA [70]         AS+HT      32.5     41.3    -            -         -        -
VINCE [7]           R2V2       35.7     42.4    66.1         -         -        -
VFS [8]             K400       31.4     41.6    67.8         -         -        -
CycleCon [9]        R2V2       35.6     42.8    -            82.8      0.4      50.4/30.1

VITO                VideoNet   39.4     44.0    68.2         87.4      5.4      70.6/57.2

Table 1: VITO representations generalize to a variety of tasks in both image and video modalities, surpassing models specialized for each task. Column groups: scene understanding (ADE20K, COCO), video understanding (DAVIS, UCF101), and OOD recognition (IN-A, IN-Vid). For external models, we finetune publicly available checkpoints.
Scene understanding. We first note that VITO provides large gains over all prior video pretraining methods on scene understanding and robust object recognition. We further validate these comparisons on three additional benchmarks and find that VITO strongly outperforms the prior work across all 5 datasets (PASCAL/ADE20K/COCO/LVIS/IN-1K, see Table B.3). For example, VITO improves over VIVI [69] by 2-10%, highlighting the importance of data curation and our contrastive formulation. VITO improves over VINCE [7] by 1-12%, highlighting the importance of fine-grained temporal deformations. Finally, VITO improves even over MMV [70] by 2-15%, despite their use of large-scale text supervision, highlighting the relevance of video-only learning.
Compared with the best supervised and self-supervised image-pretraining methods, VITO achieves competitive performance on these same benchmarks (Table 1 and Table B.3). To our knowledge, VITO is the first video-pretrained method to close the gap with ImageNet pretraining on large-scale scene understanding benchmarks such as these.
Video understanding. We next ask whether this increased spatial understanding comes at the cost of the traditional benefits of video pretraining on video tasks. Evaluating on DAVIS segmentation and UCF-101 action recognition, we find that this is not the case. On DAVIS, which tests the ability to segment an object over its dynamic temporal evolution, VITO features capture fine-grained temporal deformations of objects far better than ImageNet pretraining methods, as well as the best video pretraining methods (see Table B.4 for additional comparisons). On UCF-101, which tests the ability to classify global spatio-temporal features, we find that a simple average pooling of VITO frame representations again significantly outperforms all image pretraining and prior frame-based video pretraining. VITO even outperforms a number of recent methods that use specialized video architectures (see Table B.5). While VITO under-performs relative to the best video models, we note that these methods either cannot be tested or under-perform on spatial understanding. Additionally, as shown in Table B.5 and Sec. A.5, simple learned temporal pooling strategies on top of VITO representations further close the gap with the best video architectures.
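For concreteness, the frame-averaging evaluation described above amounts to something like the sketch below: per-frame embeddings from the pretrained image backbone are averaged over time and passed to a classification head. The `backbone` and `classifier` interfaces are placeholders, and whether the backbone is frozen or fine-tuned on UCF-101 is left out; this is an illustration, not the authors' evaluation code.

```python
# Illustrative frame-averaged action classification with an image backbone.
import torch
import torch.nn as nn

def action_logits(backbone: nn.Module, classifier: nn.Module, clip: torch.Tensor) -> torch.Tensor:
    """clip: (T, 3, H, W) frames from one video; returns (1, num_classes) logits."""
    feats = backbone(clip)                     # (T, D) per-frame embeddings from the image model
    pooled = feats.mean(dim=0, keepdim=True)   # simple average pooling over time -> (1, D)
    return classifier(pooled)                  # linear or MLP head over the pooled feature
```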