Masked Motion Encoding for Self-Supervised Video Representation Learning
Xinyu Sun1,2*  Peihao Chen1*  Liangwei Chen1  Changhao Li1
Thomas H. Li6  Mingkui Tan1,5†  Chuang Gan3,4
1South China University of Technology, 2Information Technology R&D Innovation Center of Peking University
3UMass Amherst, 4MIT-IBM Watson AI Lab, 5Key Laboratory of Big Data and Intelligent Robot, Ministry of Education
6Peking University Shenzhen Graduate School
*Equal contribution. Email: {csxinyusun, phchencs}@gmail.com
†Corresponding author. Email: mingkuitan@scut.edu.cn
Abstract
How to learn discriminative video representation from
unlabeled videos is challenging but crucial for video anal-
ysis. The latest attempts seek to learn a representation
model by predicting the appearance contents in the masked
regions. However, simply masking and recovering ap-
pearance contents may not be sufficient to model tempo-
ral clues as the appearance contents can be easily recon-
structed from a single frame. To overcome this limitation,
we present Masked Motion Encoding (MME), a new pre-
training paradigm that reconstructs both appearance and
motion information to explore temporal clues. In MME,
we focus on addressing two critical challenges to improve
the representation performance: 1) how to well represent
the possible long-term motion across multiple frames; and
2) how to obtain fine-grained temporal clues from sparsely
sampled videos. Motivated by the fact that humans are able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions. Besides, given the sparse video input, we require the model to reconstruct dense motion trajectories in both spatial and temporal dimensions. Pre-trained with
our MME paradigm, the model is able to anticipate long-
term and fine-grained motion details. Code is available at
https://github.com/XinyuSun/MME.
1. Introduction
Video representation learning plays a critical role in
video analysis like action recognition [15,32,79], action lo-
calization [12,81], video retrieval [4,82], videoQA [40], etc.
Learning video representation is very difficult for two reasons. First, it is extremely difficult and labor-intensive to annotate videos, so relying on annotated data to learn video representations is not scalable. Second, the complex spatial-temporal contents with a large data volume are difficult to represent simultaneously. How to perform self-supervised video representation learning using only unlabeled videos has therefore been a prominent research topic [7,13,49].
Taking advantage of spatial-temporal modeling using a
flexible attention mechanism, vision transformers [3,8,25,
26,53] have shown their superiority in representing video.
Prior works [5,37,84] have successfully introduced the mask-and-predict scheme from NLP [9,23] to pre-train image transformers. These methods vary in their reconstruction objectives, including raw RGB pixels [37], hand-crafted local patterns [75], and VQ-VAE embeddings [5], all of which describe static appearance information in images. Building on these successes, some researchers [64,72,75] extend this scheme to the video domain, where they mask 3D video regions and reconstruct appearance information.
However, these methods suffer from two limitations. First, since appearance information can be well reconstructed from a single image even under an extremely high masking ratio (85% in MAE [37]), it can likewise be reconstructed in a tube-masked video frame by frame, allowing the model to neglect important temporal clues. Our ablation study supports this observation (cf. Section 4.2.1). Second, existing works [64,75] often sample frames sparsely with a fixed stride and then mask some regions in these sampled frames. The reconstruction objectives only contain information from the sparsely sampled frames and thus can hardly provide supervision signals for learning fine-grained motion details, which are critical for distinguishing different actions [3,8].
In this paper, we aim to design a new mask-and-predict
paradigm to tackle these two issues. Fig. 1(a) shows two key
factors to model an action, i.e., position change and shape
change. By observing the position change of the person, we realize that he is jumping in the air; by observing the shape change, i.e., his head falling back and then tucking to his chest, we are aware that he is adjusting his posture to cross the bar.
We believe that anticipating these changes helps the model
better understand an action.
[Figure 1: (a) Two key factors to recognize a high jump action: position change and shape change. (b) Appearance reconstruction vs. motion trajectory reconstruction with a ViT-based autoencoder (f_enc, f_dec).]
Figure 1. Illustration of motion trajectory reconstruction for Masked Motion Encoding. (a) Position change and shape change over time are two key factors in recognizing an action; we leverage them to represent the motion trajectory. (b) Compared with the current appearance reconstruction task, our motion trajectory reconstruction takes into account both appearance and motion information.
Based on this observation, instead of predicting the ap-
pearance contents, we propose to predict motion trajectory,
which represents impending position and shape changes,
for the mask-and-predict task. Specifically, we use a dense
grid to sample points as different object parts, and then track
these points using optical flow in adjacent frames to gener-
ate trajectories, as shown in Fig. 1(b). The motion trajectory
contains information in two aspects: the position features
that describe relative movement; and the shape features that
describe shape changes of the tracked object along the tra-
jectory. To predict this motion trajectory, the model has to reason about the semantics of the masked objects based on the visible patches, learn the correlation of objects across different frames, and estimate their motion accurately. We name the proposed mask-and-predict task as
Masked Motion Encoding (MME).
Moreover, to help the model learn fine-grained motion
details, we further propose to interpolate the motion trajec-
tory. Taking sparsely sampled video as input, the model is
asked to reconstruct spatially and temporally dense motion
trajectories. This is inspired by the video frame interpo-
lation task [77] where a deep model can reconstruct dense
video at the pixel level from sparse video input. Different from that task, we aim to reconstruct the fine-grained motion details of moving objects, which carry higher-level motion information and are helpful for understanding actions. Our main contributions are as follows:
• The existing mask-and-predict task based on appearance reconstruction can hardly learn important temporal clues, which are critical for representing video content. Our Masked Motion Encoding (MME) paradigm overcomes this limitation by asking the model to reconstruct motion trajectories.
• Our motion interpolation scheme takes a sparsely sampled video as input and predicts dense motion trajectories in both spatial and temporal dimensions. This scheme enables the model to capture long-term and fine-grained motion clues from sparse video input.
Extensive experimental results on multiple standard video recognition benchmarks show that the representations learned from the proposed mask-and-predict task achieve state-of-the-art performance on downstream action recognition tasks. Specifically, pre-trained on Kinetics-400 [10], our MME brings gains of 2.3% on Something-Something V2 [34], 0.9% on Kinetics-400, 0.4% on UCF101 [59], and 4.7% on HMDB51 [44].
2. Related Work
Self-supervised Video Representation Learning. Self-
supervised video representation learning aims to learn dis-
criminative video features for various downstream tasks in
the absence of accurate video labels. To this end, most
of the existing methods try to design an advanced pretext
task like predicting the temporal order of shuffled video
crops [78], perceiving the video speediness [7,13] or solv-
ing puzzles [43,49]. In addition, contrastive learning is also
widely used in this domain [14,16,36,41,46,54,57,68,69],
which constrains the consistency between different aug-
mentation views and brings significant improvement. In
particular, ASCNet and CoCLR [36,41] focus on mining hard positive samples from different perspectives. Optical flow has also been proven effective for capturing motion information [70,76]. Besides, tracking the movement of video objects has also been used in self-supervised learning [17,18,65,73,74]. Among them, Wang et al. [74] only utilize a spatial encoder to extract frame appearance information, while the CtP framework [65] and the Siamese-triplet network [73] only require the model to figure out the position and size changes of a specific video patch. Different from these methods, our proposed MME traces the fine-grained movement and shape changes of different object parts in the video, resulting in a superior video representation. Tokmakov et al. [63] utilize Dense Trajectories to provide initial pseudo labels for video clustering, but their model does not predict trajectory motion features explicitly. In contrast, we consider long-term and fine-grained trajectory motion features as explicit reconstruction targets.
[Figure 2 diagram: a raw video clip is masked in the spatial and temporal domains; the ViT encoder f_enc maps the visible patches to a video representation, and the ViT decoder f_dec reconstructs the motion trajectory (shape and position) in the masked regions.]
Figure 2. Overview of Masked Motion Encoding (MME). Given a sparsely sampled video, we first divide it into several patches and randomly mask out some of them. Then, we feed the remaining patches to a ViT encoder to extract the video representation. Finally, a lightweight ViT decoder predicts the content in the masked regions, i.e., a motion trajectory containing the position changes and shape changes of moving objects.
Mask Modeling for Vision Transformer. Recently, BEiT and MAE [5,37] have presented two effective masked image modeling paradigms. BEVT [72] and VideoMAE [64] extend these two paradigms to the video domain. To learn visual representations, BEVT [72] predicts the discrete tokens generated by a pre-trained VQ-VAE tokenizer [56]. Nevertheless, pre-training such a tokenizer requires a prohibitive amount of data and computation. In contrast, VideoMAE [64] pre-trains the Vision Transformer by regressing the RGB pixels located in the masked tubes of videos. Owing to its asymmetric encoder-decoder architecture and extremely high masking ratio, pre-training with VideoMAE is more efficient. Besides, MaskFeat [75] finds that predicting Histograms of Oriented Gradients (HOG [21]) of masked video contents is a strong objective for the mask-and-predict paradigm. These existing methods only consider static information in each video frame, so the model can infer the masked areas from the visible areas in each frame independently and fails to learn important temporal clues (cf. Section 3.1). Different from video prediction methods [35,51,58,60] that predict future frames in pixel or latent space, our Masked Motion Encoding paradigm predicts upcoming fine-grained motion in masked video regions, including position changes and shape changes.
3. Proposed Method
We first revisit the current masked video modeling task
for video representation learning (cf. Section 3.1). Then,
we introduce our masked motion encoding (MME), where
we change the task from recovering appearance to recover-
ing motion trajectory (cf. Section 3.2).
3.1. Rethinking Masked Video Modeling
Given a video clip sampled from a video, self-supervised video representation learning aims to learn a feature encoder f_enc(·) that maps the clip to a feature that best describes the video. Existing masked video modeling methods [64,75] attempt to learn such a feature encoder through a mask-and-predict task. Specifically, the input clip is first divided into multiple non-overlapping 3D patches. Some of these patches are randomly masked, and the remaining patches are fed into the feature encoder, followed by a decoder f_dec(·) that reconstructs the information in the masked patches. Different works reconstruct different information (e.g., raw pixels in VideoMAE [64] and HOG in MaskFeat [75]).
However, existing works share a common characteristic: they all attempt to recover the static appearance information of the masked patches. Since an image with a high masking ratio (85% in MAE [37]) can be well reconstructed [37,75], we conjecture that the masked appearance information of a video can also be reconstructed frame by frame independently. In this sense, the model may focus more on the contents within the same frame. This may hinder the model from learning important temporal clues, which are critical for video representation. We empirically study this conjecture in the ablation study (cf. Section 4.2.1).
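For concreteness, below is a minimal sketch of the patch-and-mask step described above, assuming a clip tensor of shape (C, T, H, W), a 3D patch size of 2×16×16, and a 90% masking ratio; these values and the helper names (patchify, random_mask) are illustrative, not the released implementation.

```python
# Minimal sketch of the generic mask-and-predict setup (not the authors' code).
import torch

def patchify(clip: torch.Tensor, t: int = 2, h: int = 16, w: int = 16) -> torch.Tensor:
    """Split a (C, T, H, W) clip into non-overlapping 3D patches of size t*h*w*C."""
    C, T, H, W = clip.shape
    patches = clip.reshape(C, T // t, t, H // h, h, W // w, w)
    patches = patches.permute(1, 3, 5, 0, 2, 4, 6)      # (T/t, H/h, W/w, C, t, h, w)
    return patches.reshape(-1, C * t * h * w)           # (N_patches, patch_dim)

def random_mask(num_patches: int, mask_ratio: float = 0.9) -> torch.Tensor:
    """Return a boolean mask (True = masked) covering `mask_ratio` of the patches."""
    num_mask = int(num_patches * mask_ratio)
    perm = torch.randperm(num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[perm[:num_mask]] = True
    return mask

clip = torch.randn(3, 16, 224, 224)      # dummy video clip
patches = patchify(clip)                 # (1568, 1536) for these sizes
mask = random_mask(patches.shape[0])
visible = patches[~mask]                 # only visible patches are fed to the encoder
```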
3.2. General Scheme of MME
To better learn temporal clues in a video, our MME changes the reconstruction content from static appearance information to object motion information, including the position and shape changes of objects. As shown in Fig. 2, a video clip is sparsely sampled from a video and divided into a number of non-overlapping 3D patches of size t×h×w, corresponding to time, height, and width. We follow VideoMAE [64] and use the tube masking strategy, where the masking map is the same for all frames, to mask a subset of patches. For computational efficiency, we follow MAE [37] and only feed the unmasked patches (and their positions) to the encoder. The output representation, together with learnable [MASK] tokens, is fed to a decoder to reconstruct the motion trajectory z in the masked patches.
training loss for MME is
L=X
i∈I
|ziˆ
zi|2,(1)
where ˆ
zis the predicted motion trajectory, and Iis the index
set of motion trajectories in all masked patches.
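The training objective can be sketched as follows. This is a hedged outline of Eq. (1), assuming tube masking so every sample has the same number of visible patches; the module names (MMEPretrainer, encoder, decoder, head) and tensor shapes are placeholders rather than the authors' actual implementation, and positional embeddings are omitted for brevity.

```python
# Sketch of one MME training step implied by Eq. (1).
import torch
import torch.nn as nn

class MMEPretrainer(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module, embed_dim: int, traj_dim: int):
        super().__init__()
        self.encoder = encoder                      # ViT encoder f_enc on visible tokens
        self.decoder = decoder                      # lightweight ViT decoder f_dec
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, traj_dim)  # regress the motion trajectory per patch

    def forward(self, tokens, mask, target_traj):
        # tokens: (B, N, D) patch embeddings; mask: (B, N) bool, True = masked
        B, N, D = tokens.shape
        visible = tokens[~mask].reshape(B, -1, D)   # tube masking: equal count per sample
        latent = self.encoder(visible)              # representation of visible patches

        # Scatter encoder outputs and [MASK] tokens back into the full sequence.
        full = self.mask_token.expand(B, N, D).clone()
        full[~mask] = latent.reshape(-1, D)
        pred = self.head(self.decoder(full))        # (B, N, traj_dim)

        # Eq. (1): L2 loss between predicted and target motion trajectories,
        # evaluated only on the masked patches.
        loss = ((pred[mask] - target_traj[mask]) ** 2).mean()
        return loss
```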
Motivated by the fact that we humans recognize actions
by perceiving position changes and shape changes of mov-
ing objects, we leverage these two types of information to
represent the motion trajectory. Through pre-training on the MME task, the model is endowed with the ability to explore important temporal clues. Another important characteristic of the motion trajectory is that it contains fine-grained motion information extracted at the raw video frame rate. This fine-grained motion information provides the model with a supervision signal for anticipating fine-grained actions from sparse video input. In the following, we introduce the proposed motion trajectory in detail.
3.3. Motion Trajectory for MME
The motion of moving objects can be represented in
various ways such as optical flow [27], histograms of op-
tical flow (HOF) [45], and motion boundary histograms
(MBH) [45]. However, these descriptors can only represent
short-term motion between two adjacent frames. We hope
our motion trajectory represents long-term motion, which
is critical for video representation. To this end, inspired by
DT [66], we first track the moving object through the following L frames to cover a longer range of motion, resulting in a trajectory T, i.e.,
T = (p_t, p_{t+1}, \ldots, p_{t+L}),   (2)
where p_t = (x_t, y_t) represents a point located at (x_t, y_t) in frame t, and (·, ·) indicates the concatenation operation.
Along this trajectory, we fetch the position features z_p and shape features z_s of this object to compose a motion trajectory z, i.e.,
z = (z_p, z_s).   (3)
The position features are represented by the position transition relative to the previous time step, while the shape features are the HOG descriptors of the tracked object at different time steps.
Tracking objects using spatially and temporally dense trajectories. Some previous works [2,18,19,24] try to use one trajectory to represent the motion of an individual object. In contrast, DT [66] points out that tracking spatially dense feature points sampled on a regular grid performs better, since this ensures better coverage of the different objects in a video. Following DT [66], we use spatially dense grid points as the initial positions of the trajectories. Specifically, we uniformly sample K points in a masked patch of size t×h×w, where each point indicates a part of an object. For each point, we track it through L temporally dense frames according to the dense optical flow, resulting in K trajectories. In this way, the model is able to capture spatially and temporally dense motion information of objects through the mask-and-predict task.
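A minimal sketch of this tracking step is given below, assuming Farneback dense optical flow from OpenCV, a square grid of K points (K a perfect square), and illustrative patch and trajectory sizes; the paper's exact flow estimator and hyperparameters may differ.

```python
# Illustrative trajectory generation: sample K grid points inside a masked
# patch and follow them through L consecutive frames using dense optical flow.
import cv2
import numpy as np

def track_patch_points(gray_frames, top, left, h=16, w=16, K=4, L=8):
    """gray_frames: list of L+1 HxW uint8 frames starting at the patch's time step.
    Returns trajectories of shape (K, L + 1, 2) holding (x, y) per step."""
    side = int(np.sqrt(K))                               # assumes K is a perfect square
    ys = np.linspace(top, top + h - 1, side)
    xs = np.linspace(left, left + w - 1, side)
    pts = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2).astype(np.float32)  # (K, 2)

    traj = [pts.copy()]
    for i in range(L):
        flow = cv2.calcOpticalFlowFarneback(gray_frames[i], gray_frames[i + 1], None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)   # (H, W, 2)
        # Move each point by the flow at its (rounded) current location.
        xi = np.clip(np.round(pts[:, 0]).astype(int), 0, flow.shape[1] - 1)
        yi = np.clip(np.round(pts[:, 1]).astype(int), 0, flow.shape[0] - 1)
        pts = pts + flow[yi, xi]
        traj.append(pts.copy())
    return np.stack(traj, axis=1)   # (K, L + 1, 2), i.e. (p_t, ..., p_{t+L}) per point
```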
As a comparison, the reconstruction targets in existing works [64,75] are often extracted from temporally sparse videos sampled with a large stride s > 1. The model takes a sparse video as input and predicts these sparse contents to learn the video representation. Different from these works, our model also takes a sparse video as input, but we push the model to interpolate motion trajectories containing fine-grained motion information. This simple trajectory interpolation task does not increase the computational cost of the video encoder, yet it helps the model learn more fine-grained action information even from sparse video input. More details about dense flow calculation and trajectory tracking can be found in the Appendix.
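The following fragment sketches this sparse-input, dense-target setup under assumed values of the sampling stride s, clip length, and trajectory length L; it only illustrates which frames the encoder sees versus which frames the trajectory targets are tracked over.

```python
# Sketch of the sparse-input / dense-target frame indexing described above.
import numpy as np

s, num_input, L = 4, 16, 8
input_ids = s * np.arange(num_input)          # frames the encoder actually sees
raw_ids = np.arange(input_ids[-1] + 1)        # every frame at the raw video rate

# A trajectory starting at input frame t is tracked over the *raw* frames
# t, t+1, ..., t+L, most of which the encoder never observes, so predicting the
# trajectory amounts to temporal interpolation of fine-grained motion.
t = input_ids[3]
trajectory_frames = np.arange(t, t + L + 1)
unseen = np.setdiff1d(trajectory_frames, input_ids)   # frames supervised but never input
```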
Representing position features. Given a trajectory T consisting of the tracked object positions at each frame, we are more interested in the relative movement of objects than in their absolute locations. Consequently, we represent the position features by the relative movement between two adjacent points, \Delta p_t = p_{t+1} - p_t, i.e.,
z_p = (\Delta p_t, \ldots, \Delta p_{t+L-1}),   (4)
where z_p is an L×2-dimensional feature. As each patch contains K such position features, we concatenate and normalize them to form the position-feature part of the motion trajectory.
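A possible implementation of Eq. (4) is sketched below, reusing the (K, L+1, 2) trajectory array from the earlier tracking sketch; the normalization by the patch size is an assumption for illustration, since the exact normalization is not spelled out here.

```python
# Position features of Eq. (4): per-step displacements along each trajectory,
# concatenated over the K trajectories of a patch.
import numpy as np

def position_features(traj, patch_hw=16.0):
    """traj: (K, L + 1, 2) tracked points.  Returns a flat (K * L * 2,) vector."""
    disp = traj[:, 1:, :] - traj[:, :-1, :]      # Δp_t = p_{t+1} - p_t, shape (K, L, 2)
    z_p = disp / patch_hw                        # scale displacements (illustrative choice)
    return z_p.reshape(-1)
```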
Representing shape features. Besides embedding the
movement, the model also needs to be aware of the shape
changes of objects to recognize actions. Inspired by
HOG [21], we use histograms of oriented gradients (HOG)
with 9 bins to describe the shape of objects.
Compared with existing works [64,75] that reconstruct
HOG in every single frame, we are more interested in the
dynamic shape changes of an object, which can better rep-
resent action in a video. To this end, we follow DT [66] to calculate trajectory-aligned HOG, consisting of the HOG features around all tracked points in a trajectory, i.e.,
z_s = (HOG(p_t), \ldots, HOG(p_{t+L-1})),   (5)
where HOG(·) is the HOG descriptor and z_s is an L×9-dimensional feature. Also, as one patch contains K trajectories, we concatenate the K trajectory-aligned HOG features and normalize them to the standard normal distribution to form the shape-feature part of the motion trajectory.
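A sketch of the trajectory-aligned HOG computation in Eq. (5) follows, using scikit-image's HOG with 9 orientation bins on a small window around each tracked point; the 8×8 window size and the use of skimage are assumptions, not the paper's exact descriptor settings.

```python
# Trajectory-aligned HOG of Eq. (5): a 9-bin HOG descriptor in a window around
# each tracked point, concatenated over the L steps and K trajectories of a patch.
import numpy as np
from skimage.feature import hog

def shape_features(gray_frames, traj, win=8):
    """gray_frames: list of HxW frames; traj: (K, L + 1, 2) tracked (x, y) points."""
    H, W = gray_frames[0].shape
    feats = []
    for k in range(traj.shape[0]):
        for t in range(traj.shape[1] - 1):                 # HOG(p_t), ..., HOG(p_{t+L-1})
            x, y = traj[k, t]
            x0 = int(np.clip(x - win // 2, 0, W - win))
            y0 = int(np.clip(y - win // 2, 0, H - win))
            window = gray_frames[t][y0:y0 + win, x0:x0 + win]
            feats.append(hog(window, orientations=9,
                             pixels_per_cell=(win, win), cells_per_block=(1, 1)))
    z_s = np.concatenate(feats)                            # (K * L * 9,)
    return (z_s - z_s.mean()) / (z_s.std() + 1e-6)          # normalize to ~N(0, 1)
```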
4. Experiments
Implementation details. We conduct experiments on
Kinetics-400 (K400), Something-Something V2 (SSV2),
UCF101, and HMDB51 datasets. Unless otherwise stated, we follow previous trials [64] and feed the model a