Masked Motion Encoding for Self-Supervised Video Representation Learning
Xinyu Sun1,2*  Peihao Chen1*  Liangwei Chen1  Changhao Li1
Thomas H. Li6  Mingkui Tan1,5†  Chuang Gan3,4
1South China University of Technology, 2Information Technology R&D Innovation Center of Peking University
3UMass Amherst, 4MIT-IBM Watson AI Lab, 5Key Laboratory of Big Data and Intelligent Robot, Ministry of Education
6Peking University Shenzhen Graduate School
*Equal contribution. Email: {csxinyusun, phchencs}@gmail.com
†Corresponding author. Email: mingkuitan@scut.edu.cn
Abstract
How to learn discriminative video representation from
unlabeled videos is challenging but crucial for video anal-
ysis. The latest attempts seek to learn a representation
model by predicting the appearance contents in the masked
regions. However, simply masking and recovering ap-
pearance contents may not be sufficient to model tempo-
ral clues as the appearance contents can be easily recon-
structed from a single frame. To overcome this limitation,
we present Masked Motion Encoding (MME), a new pre-
training paradigm that reconstructs both appearance and
motion information to explore temporal clues. In MME,
we focus on addressing two critical challenges to improve
the representation performance: 1) how to well represent
the possible long-term motion across multiple frames; and
2) how to obtain fine-grained temporal clues from sparsely
sampled videos. Motivated by the fact that humans are able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions. Besides, given the sparse video input, we require the model to reconstruct dense motion trajectories in both spatial and temporal dimensions. Pre-trained with
our MME paradigm, the model is able to anticipate long-
term and fine-grained motion details. Code is available at
https://github.com/XinyuSun/MME.
1. Introduction
Video representation learning plays a critical role in
video analysis like action recognition [15,32,79], action lo-
calization [12,81], video retrieval [4,82], videoQA [40], etc.
Learning video representation is very difficult for two reasons. First, it is extremely difficult and labor-intensive to annotate videos, so relying on annotated data to learn video representations is not scalable. Second, the complex spatial-temporal contents with a large data volume are difficult to represent simultaneously. How to perform self-supervised video representation learning using only unlabeled videos has therefore been a prominent research topic [7,13,49].
Taking advantage of spatial-temporal modeling using a
flexible attention mechanism, vision transformers [3,8,25,
26,53] have shown their superiority in representing video.
Prior works [5,37,84] have successfully introduced the mask-and-predict scheme from NLP [9,23] to pre-train image transformers. These methods vary in their reconstruction objectives, including raw RGB pixels [37], hand-crafted local patterns [75], and VQ-VAE embeddings [5], all of which describe static appearance information in images. Building on these successes, some researchers [64,72,75] extend this scheme to the video domain, where they mask 3D video regions and reconstruct appearance information.
However, these methods suffer from two limitations. First, since appearance information can be well reconstructed from a single image even under an extremely high masking ratio (85% in MAE [37]), it can likewise be reconstructed in a tube-masked video frame by frame, allowing the model to neglect important temporal clues. Our ablation study supports this observation (cf. Section 4.2.1). Second, existing works [64,75] often sample frames sparsely with a fixed stride and then mask some regions in these sampled frames. The reconstruction objectives only contain information from the sparsely sampled frames and thus can hardly provide supervision signals for learning fine-grained motion details, which are critical for distinguishing different actions [3,8].
In this paper, we aim to design a new mask-and-predict
paradigm to tackle these two issues. Fig. 1(a) shows two key
factors to model an action, i.e., position change and shape
change. By observing the position change of the person, we realize that he is jumping in the air; by observing the shape change, i.e., his head falling back and then tucking to his chest, we are aware that he is adjusting his posture to cross the bar.
We believe that anticipating these changes helps the model
better understand an action.
[Figure 1: (a) Two key factors to recognize a high jump action: position change and shape change. (b) Appearance reconstruction vs. motion trajectory reconstruction with a ViT-based autoencoder (f_enc, f_dec).]
Figure 1. Illustration of motion trajectory reconstruction for Masked Motion Encoding. (a) Position change and shape change over time are two key factors in recognizing an action; we leverage them to represent the motion trajectory. (b) Compared with the current appearance reconstruction task, our motion trajectory reconstruction takes into account both appearance and motion information.
Based on this observation, instead of predicting the ap-
pearance contents, we propose to predict motion trajectory,
which represents impending position and shape changes,
for the mask-and-predict task. Specifically, we use a dense
grid to sample points as different object parts, and then track
these points using optical flow in adjacent frames to gener-
ate trajectories, as shown in Fig. 1(b). The motion trajectory
contains information in two aspects: the position features
that describe relative movement; and the shape features that
describe shape changes of the tracked object along the tra-
jectory. To predict this motion trajectory, the model has to reason about the semantics of the masked objects based on the visible patches, learn the correlation of objects across different frames, and estimate their motion accurately. We name the proposed mask-and-predict task as
Masked Motion Encoding (MME).
Moreover, to help the model learn fine-grained motion
details, we further propose to interpolate the motion trajec-
tory. Taking sparsely sampled video as input, the model is
asked to reconstruct spatially and temporally dense motion
trajectories. This is inspired by the video frame interpo-
lation task [77] where a deep model can reconstruct dense
video at the pixel level from sparse video input. Different from that task, we aim to reconstruct the fine-grained motion details of moving objects, which carry higher-level motion information and are helpful for understanding actions. Our main contributions are as follows:
• The existing mask-and-predict task based on appearance reconstruction can hardly learn important temporal clues, which are critical for representing video content. Our Masked Motion Encoding (MME) paradigm overcomes this limitation by asking the model to reconstruct motion trajectories.
• Our motion interpolation scheme takes a sparsely sampled video as input and predicts dense motion trajectories in both spatial and temporal dimensions. This scheme enables the model to capture long-term and fine-grained motion clues from sparse video input.
Extensive experimental results on multiple standard video recognition benchmarks show that the representations learned from the proposed mask-and-predict task achieve state-of-the-art performance on downstream action recognition tasks. Specifically, pre-trained on Kinetics-400 [10], our MME brings gains of 2.3% on Something-Something V2 [34], 0.9% on Kinetics-400, 0.4% on UCF101 [59], and 4.7% on HMDB51 [44].
2. Related Work
Self-supervised Video Representation Learning. Self-
supervised video representation learning aims to learn dis-
criminative video features for various downstream tasks in
the absence of accurate video labels. To this end, most
of the existing methods try to design an advanced pretext
task like predicting the temporal order of shuffled video
crops [78], perceiving the video speediness [7,13] or solv-
ing puzzles [43,49]. In addition, contrastive learning is also
widely used in this domain [14,16,36,41,46,54,57,68,69],
which constrains the consistency between different aug-
mentation views and brings significant improvement. In
particular, ASCNet and CoCLR [36,41] focus on mining hard positive samples from different perspectives. Optical flow has also been proven effective for capturing motion information [70,76]. Besides, tracking the movement of video objects has also been used in self-supervised learning [17,18,65,73,74]. Among them, Wang et al. [74] only utilize a spatial encoder to extract frame appearance information, while the CtP framework [65] and the Siamese-triplet network [73] only require the model to figure out the position and size changes of a specific video patch. Different from these methods, our proposed MME traces the fine-grained movement and shape changes of different object parts in the video, resulting in a superior video representation. Tokmakov et al. [63] utilize Dense Trajectories to provide initial pseudo labels for video clustering, but their model does not predict trajectory motion features explicitly. In contrast, we consider long-term and fine-grained trajectory motion features as explicit reconstruction targets.
[Figure 2 diagram: a raw video clip is masked in the spatial and temporal domains; the ViT encoder f_enc maps the visible patches to a video representation, and the ViT decoder f_dec reconstructs the motion trajectory (shape and position) in the masked regions.]
Figure 2. Overview of Masked Motion Encoding (MME). Given a sparsely sampled video, we first divide it into several patches and randomly mask out some of them. Then, we feed the remaining patches to a ViT encoder to extract the video representation. Finally, a lightweight ViT decoder predicts the content in the masked regions, i.e., a motion trajectory containing the position changes and shape changes of moving objects.
Mask Modeling for Vision Transformer. Recently, BEiT and MAE [5,37] have presented two effective masked image modeling paradigms. BEVT [72] and VideoMAE [64] extend these two paradigms to the video domain. To learn visual representations, BEVT [72] predicts the discrete tokens generated by a pre-trained VQ-VAE tokenizer [56]. Nevertheless, pre-training such a tokenizer requires a prohibitive amount of data and computation. In contrast, VideoMAE [64] pre-trains the Vision Transformer by regressing the RGB pixels located in the masked tubes of videos. Owing to its asymmetric encoder-decoder architecture and extremely high masking ratio, pre-training with VideoMAE is more efficient. Besides, MaskFeat [75] finds that predicting Histograms of Oriented Gradients (HOG [21]) of masked video contents is a strong objective for the mask-and-predict paradigm. These existing methods only consider static information in each video frame, so the model can infer the masked areas from the visible areas in each frame independently and fails to learn important temporal clues (cf. Section 3.1). Different from video prediction methods [35,51,58,60] that predict future frames in pixel or latent space, our Masked Motion Encoding paradigm predicts upcoming fine-grained motion in masked video regions, including position changes and shape changes.
3. Proposed Method
We first revisit the current masked video modeling task
for video representation learning (cf. Section 3.1). Then,
we introduce our masked motion encoding (MME), where
we change the task from recovering appearance to recover-
ing motion trajectory (cf. Section 3.2).
3.1. Rethinking Masked Video Modeling
Given a video clip sampled from a video, self-supervised video representation learning aims to learn a feature encoder f_enc(·) that maps the clip to a feature that best describes the video. Existing masked video modeling methods [64,75] attempt to learn such a feature encoder through a mask-and-predict task. Specifically, the input clip is first divided into multiple non-overlapping 3D patches. Some of these patches are randomly masked, and the remaining patches are fed into the feature encoder, followed by a decoder f_dec(·) that reconstructs the information in the masked patches. Different works reconstruct different information (e.g., raw pixels in VideoMAE [64] and HOG in MaskFeat [75]).
However, existing works share a common characteristic: they all attempt to recover the static appearance information of the masked patches. Since an image with a high masking ratio (85% in MAE [37]) can be well reconstructed [37,75], we conjecture that the masked appearance information of a video can also be reconstructed frame by frame independently. In this sense, the model may focus more on the contents within the same frame. This may hinder the model from learning important temporal clues, which are critical for video representation. We empirically study this conjecture in the ablation study (cf. Section 4.2.1).
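For concreteness, below is a minimal sketch of the patch-and-mask step described above, assuming a clip tensor of shape (C, T, H, W), a 3D patch size of 2×16×16, and a 90% masking ratio; these values and the helper names (patchify, random_mask) are illustrative, not the released implementation.

```python
# Minimal sketch of the generic mask-and-predict setup (not the authors' code).
import torch

def patchify(clip: torch.Tensor, t: int = 2, h: int = 16, w: int = 16) -> torch.Tensor:
    """Split a (C, T, H, W) clip into non-overlapping 3D patches of size t*h*w*C."""
    C, T, H, W = clip.shape
    patches = clip.reshape(C, T // t, t, H // h, h, W // w, w)
    patches = patches.permute(1, 3, 5, 0, 2, 4, 6)      # (T/t, H/h, W/w, C, t, h, w)
    return patches.reshape(-1, C * t * h * w)           # (N_patches, patch_dim)

def random_mask(num_patches: int, mask_ratio: float = 0.9) -> torch.Tensor:
    """Return a boolean mask (True = masked) covering `mask_ratio` of the patches."""
    num_mask = int(num_patches * mask_ratio)
    perm = torch.randperm(num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[perm[:num_mask]] = True
    return mask

clip = torch.randn(3, 16, 224, 224)      # dummy video clip
patches = patchify(clip)                 # (1568, 1536) for these sizes
mask = random_mask(patches.shape[0])
visible = patches[~mask]                 # only visible patches are fed to the encoder
```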
3.2. General Scheme of MME
To better learn temporal clues in a video, our MME changes the reconstruction content from static appearance information to object motion information, including the position and shape changes of objects. As shown in Fig. 2, a video clip is sparsely sampled from a video and divided into a number of non-overlapping 3D patches of size t×h×w, corresponding to time, height, and width. We follow VideoMAE [64] and use the tube masking strategy, where the masking map is the same for all frames, to mask a subset of patches. For computational efficiency, we follow MAE [37] and only feed the unmasked patches (and their positions) to the encoder. The output representation, together with learnable [MASK] tokens, is fed to a decoder to reconstruct the motion trajectory z in the masked patches.
training loss for MME is
L=X
i∈I
|ziˆ
zi|2,(1)
where ˆ
zis the predicted motion trajectory, and Iis the index
set of motion trajectories in all masked patches.
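The training objective can be sketched as follows. This is a hedged outline of Eq. (1), assuming tube masking so every sample has the same number of visible patches; the module names (MMEPretrainer, encoder, decoder, head) and tensor shapes are placeholders rather than the authors' actual implementation, and positional embeddings are omitted for brevity.

```python
# Sketch of one MME training step implied by Eq. (1).
import torch
import torch.nn as nn

class MMEPretrainer(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module, embed_dim: int, traj_dim: int):
        super().__init__()
        self.encoder = encoder                      # ViT encoder f_enc on visible tokens
        self.decoder = decoder                      # lightweight ViT decoder f_dec
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, traj_dim)  # regress the motion trajectory per patch

    def forward(self, tokens, mask, target_traj):
        # tokens: (B, N, D) patch embeddings; mask: (B, N) bool, True = masked
        B, N, D = tokens.shape
        visible = tokens[~mask].reshape(B, -1, D)   # tube masking: equal count per sample
        latent = self.encoder(visible)              # representation of visible patches

        # Scatter encoder outputs and [MASK] tokens back into the full sequence.
        full = self.mask_token.expand(B, N, D).clone()
        full[~mask] = latent.reshape(-1, D)
        pred = self.head(self.decoder(full))        # (B, N, traj_dim)

        # Eq. (1): L2 loss between predicted and target motion trajectories,
        # evaluated only on the masked patches.
        loss = ((pred[mask] - target_traj[mask]) ** 2).mean()
        return loss
```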
Motivated by the fact that we humans recognize actions
by perceiving position changes and shape changes of mov-
ing objects, we leverage these two types of information to
represent the motion trajectory. Through pre-training on the MME task, the model is endowed with the ability to explore important temporal clues. Another important characteristic of the motion trajectory is that it contains fine-grained motion information extracted at the raw video frame rate. This fine-grained motion information provides the model with a supervision signal for anticipating fine-grained actions from sparse video input. In the following, we introduce the proposed motion trajectory in detail.
3.3. Motion Trajectory for MME
The motion of moving objects can be represented in
various ways such as optical flow [27], histograms of op-
tical flow (HOF) [45], and motion boundary histograms
(MBH) [45]. However, these descriptors can only represent
short-term motion between two adjacent frames. We hope
our motion trajectory represents long-term motion, which
is critical for video representation. To this end, inspired by
DT [66], we first track the moving object through the following L frames to cover a longer range of motion, resulting in a trajectory T, i.e.,
T = (p_t, p_{t+1}, \ldots, p_{t+L}),   (2)
where p_t = (x_t, y_t) represents a point located at (x_t, y_t) in frame t, and (·, ·) indicates the concatenation operation.
Along this trajectory, we fetch the position features z_p and shape features z_s of this object to compose a motion trajectory z, i.e.,
z = (z_p, z_s).   (3)
The position features are represented by the position transition relative to the previous time step, while the shape features are the HOG descriptors of the tracked object at different time steps.
Tracking objects using spatially and temporally dense trajectories. Some previous works [2,18,19,24] try to use one trajectory to represent the motion of an individual object. In contrast, DT [66] points out that tracking spatially dense feature points sampled on a regular grid performs better, since this ensures better coverage of the different objects in a video. Following DT [66], we use spatially dense grid points as the initial positions of the trajectories. Specifically, we uniformly sample K points in a masked patch of size t×h×w, where each point indicates a part of an object. For each point, we track it through L temporally dense frames according to the dense optical flow, resulting in K trajectories. In this way, the model is able to capture spatially and temporally dense motion information of objects through the mask-and-predict task.
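A minimal sketch of this tracking step is given below, assuming Farneback dense optical flow from OpenCV, a square grid of K points (K a perfect square), and illustrative patch and trajectory sizes; the paper's exact flow estimator and hyperparameters may differ.

```python
# Illustrative trajectory generation: sample K grid points inside a masked
# patch and follow them through L consecutive frames using dense optical flow.
import cv2
import numpy as np

def track_patch_points(gray_frames, top, left, h=16, w=16, K=4, L=8):
    """gray_frames: list of L+1 HxW uint8 frames starting at the patch's time step.
    Returns trajectories of shape (K, L + 1, 2) holding (x, y) per step."""
    side = int(np.sqrt(K))                               # assumes K is a perfect square
    ys = np.linspace(top, top + h - 1, side)
    xs = np.linspace(left, left + w - 1, side)
    pts = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2).astype(np.float32)  # (K, 2)

    traj = [pts.copy()]
    for i in range(L):
        flow = cv2.calcOpticalFlowFarneback(gray_frames[i], gray_frames[i + 1], None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)   # (H, W, 2)
        # Move each point by the flow at its (rounded) current location.
        xi = np.clip(np.round(pts[:, 0]).astype(int), 0, flow.shape[1] - 1)
        yi = np.clip(np.round(pts[:, 1]).astype(int), 0, flow.shape[0] - 1)
        pts = pts + flow[yi, xi]
        traj.append(pts.copy())
    return np.stack(traj, axis=1)   # (K, L + 1, 2), i.e. (p_t, ..., p_{t+L}) per point
```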
As a comparison, the reconstruction targets in existing works [64,75] are often extracted from temporally sparse videos sampled with a large stride s > 1. The model takes a sparse video as input and predicts these sparse contents to learn the video representation. Different from these works, our model also takes a sparse video as input, but we push the model to interpolate motion trajectories containing fine-grained motion information. This simple trajectory interpolation task does not increase the computational cost of the video encoder, yet it helps the model learn more fine-grained action information even from sparse video input. More details about dense flow calculation and trajectory tracking can be found in the Appendix.
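The following fragment sketches this sparse-input, dense-target setup under assumed values of the sampling stride s, clip length, and trajectory length L; it only illustrates which frames the encoder sees versus which frames the trajectory targets are tracked over.

```python
# Sketch of the sparse-input / dense-target frame indexing described above.
import numpy as np

s, num_input, L = 4, 16, 8
input_ids = s * np.arange(num_input)          # frames the encoder actually sees
raw_ids = np.arange(input_ids[-1] + 1)        # every frame at the raw video rate

# A trajectory starting at input frame t is tracked over the *raw* frames
# t, t+1, ..., t+L, most of which the encoder never observes, so predicting the
# trajectory amounts to temporal interpolation of fine-grained motion.
t = input_ids[3]
trajectory_frames = np.arange(t, t + L + 1)
unseen = np.setdiff1d(trajectory_frames, input_ids)   # frames supervised but never input
```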
Representing position features. Given a trajectory T consisting of the tracked object positions at each frame, we are more interested in the relative movement of objects than in their absolute locations. Consequently, we represent the position features by the relative movement between two adjacent points, \Delta p_t = p_{t+1} - p_t, i.e.,
z_p = (\Delta p_t, \ldots, \Delta p_{t+L-1}),   (4)
where z_p is an L×2-dimensional feature. As each patch contains K such position features, we concatenate and normalize them to form the position-feature part of the motion trajectory.
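A possible implementation of Eq. (4) is sketched below, reusing the (K, L+1, 2) trajectory array from the earlier tracking sketch; the normalization by the patch size is an assumption for illustration, since the exact normalization is not spelled out here.

```python
# Position features of Eq. (4): per-step displacements along each trajectory,
# concatenated over the K trajectories of a patch.
import numpy as np

def position_features(traj, patch_hw=16.0):
    """traj: (K, L + 1, 2) tracked points.  Returns a flat (K * L * 2,) vector."""
    disp = traj[:, 1:, :] - traj[:, :-1, :]      # Δp_t = p_{t+1} - p_t, shape (K, L, 2)
    z_p = disp / patch_hw                        # scale displacements (illustrative choice)
    return z_p.reshape(-1)
```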
Representing shape features. Besides embedding the
movement, the model also needs to be aware of the shape
changes of objects to recognize actions. Inspired by
HOG [21], we use histograms of oriented gradients (HOG)
with 9 bins to describe the shape of objects.
Compared with existing works [64,75] that reconstruct
HOG in every single frame, we are more interested in the
dynamic shape changes of an object, which can better rep-
resent action in a video. To this end, we follow DT [66] to calculate trajectory-aligned HOG, consisting of the HOG features around all tracked points in a trajectory, i.e.,
z_s = (HOG(p_t), \ldots, HOG(p_{t+L-1})),   (5)
where HOG(·) is the HOG descriptor and z_s is an L×9-dimensional feature. Also, as one patch contains K trajectories, we concatenate the K trajectory-aligned HOG features and normalize them to the standard normal distribution to form the shape-feature part of the motion trajectory.
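A sketch of the trajectory-aligned HOG computation in Eq. (5) follows, using scikit-image's HOG with 9 orientation bins on a small window around each tracked point; the 8×8 window size and the use of skimage are assumptions, not the paper's exact descriptor settings.

```python
# Trajectory-aligned HOG of Eq. (5): a 9-bin HOG descriptor in a window around
# each tracked point, concatenated over the L steps and K trajectories of a patch.
import numpy as np
from skimage.feature import hog

def shape_features(gray_frames, traj, win=8):
    """gray_frames: list of HxW frames; traj: (K, L + 1, 2) tracked (x, y) points."""
    H, W = gray_frames[0].shape
    feats = []
    for k in range(traj.shape[0]):
        for t in range(traj.shape[1] - 1):                 # HOG(p_t), ..., HOG(p_{t+L-1})
            x, y = traj[k, t]
            x0 = int(np.clip(x - win // 2, 0, W - win))
            y0 = int(np.clip(y - win // 2, 0, H - win))
            window = gray_frames[t][y0:y0 + win, x0:x0 + win]
            feats.append(hog(window, orientations=9,
                             pixels_per_cell=(win, win), cells_per_block=(1, 1)))
    z_s = np.concatenate(feats)                            # (K * L * 9,)
    return (z_s - z_s.mean()) / (z_s.std() + 1e-6)          # normalize to ~N(0, 1)
```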
4. Experiments
Implementation details. We conduct experiments on
Kinetics-400 (K400), Something-Something V2 (SSV2),
UCF101, and HMDB51 datasets. Unless otherwise stated, we follow previous trials [64] and feed the model a