where $\hat{z}$ is the predicted motion trajectory, and $I$ is the index set of motion trajectories in all masked patches.
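To make this objective concrete, below is a minimal sketch (not the authors' code) of such a masked trajectory regression loss, assuming an elementwise $\ell_2$ error averaged over the index set $I$ of masked patches; the tensor names and shapes are illustrative.

```python
# A minimal sketch of a masked trajectory regression objective,
# assuming an l2 loss averaged over the index set I of masked patches.
import torch

def mme_loss(pred_traj, target_traj, masked_idx):
    """pred_traj, target_traj: (B, N, D) motion trajectories per patch;
    masked_idx: boolean mask (B, N) marking masked patches (the set I)."""
    diff = pred_traj - target_traj                  # element-wise error
    per_patch = (diff ** 2).mean(dim=-1)            # l2 over the feature dim D
    return per_patch[masked_idx].mean()             # average over |I| masked patches
```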
Motivated by the fact that humans recognize actions by perceiving the position changes and shape changes of moving objects, we leverage these two types of information to represent the motion trajectory. Through pre-training on the MME task, the model is endowed with the ability to explore important temporal clues. Another important characteristic of the motion trajectory is that it contains fine-grained motion information extracted at the raw video frame rate. This fine-grained motion information provides the model with a supervision signal to anticipate fine-grained actions from sparse video input. In the following, we introduce the proposed motion trajectory in detail.
3.3. Motion Trajectory for MME
The motion of moving objects can be represented in various ways, such as optical flow [27], histograms of optical flow (HOF) [45], and motion boundary histograms (MBH) [45]. However, these descriptors can only represent short-term motion between two adjacent frames. We hope our motion trajectory represents long-term motion, which is critical for video representation. To this end, inspired by DT [66], we first track the moving object in the following $L$ frames to cover a longer range of motion, resulting in a trajectory $T$, i.e.,
$$T = (p_t, p_{t+1}, \cdots, p_{t+L}), \tag{2}$$
where $p_t = (x_t, y_t)$ represents a point located at $(x_t, y_t)$ in frame $t$, and $(\cdot, \cdot)$ indicates the concatenation operation.
Along this trajectory, we fetch the position features $z_p$ and shape features $z_s$ of this object to compose a motion trajectory $z$, i.e.,
$$z = (z_p, z_s). \tag{3}$$
The position features are represented by the displacement of the tracked point relative to the previous time step, while the shape features are the HOG descriptors of the tracked object at different time steps.
Tracking objects using spatially and temporally dense trajectories. Some previous works [2,18,19,24] try to use one trajectory to represent the motion of an individual object. In contrast, DT [66] points out that tracking spatially dense feature points sampled on a grid performs better, since it ensures better coverage of the different objects in a video. Following DT [66], we use spatially dense grid points as the initial positions of the trajectories. Specifically, we uniformly sample $K$ points in a masked patch of size $t \times h \times w$, where each point indicates a part of an object. We track each point through $L$ temporally dense frames according to the dense optical flow, resulting in $K$ trajectories. In this way, the model is able to capture spatially and temporally dense motion information of objects through the mask-and-predict task.
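As an illustration of this tracking step, the following sketch tracks $K$ grid points through $L$ frames with a dense optical flow field. OpenCV's Farneback flow is used here only as a stand-in for the flow estimator, and the nearest-neighbor flow lookup is a simplification rather than the paper's exact procedure.

```python
# Illustrative sketch: track K grid points through L frames with dense optical flow.
# Farneback flow and the nearest-neighbor lookup are stand-ins for exposition.
import cv2
import numpy as np

def track_grid_points(frames, points, L):
    """frames: list of >= L+1 grayscale frames (H, W) at the raw frame rate;
    points: (K, 2) initial (x, y) grid positions inside a masked patch.
    Returns (K, L+1, 2) trajectories T = (p_t, ..., p_{t+L})."""
    traj = [points.astype(np.float32)]
    for t in range(L):
        flow = cv2.calcOpticalFlowFarneback(
            frames[t], frames[t + 1], None,
            0.5, 3, 15, 3, 5, 1.2, 0)               # dense flow field of shape (H, W, 2)
        p = traj[-1]
        xi = np.clip(p[:, 0].round().astype(int), 0, flow.shape[1] - 1)
        yi = np.clip(p[:, 1].round().astype(int), 0, flow.shape[0] - 1)
        traj.append(p + flow[yi, xi])               # move each point by its local flow
    return np.stack(traj, axis=1)
```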
As a comparison, the reconstruction targets in existing works [64,75] are often extracted from temporally sparse videos sampled with a large stride $s > 1$. The model takes a sparse video as input and predicts these sparse contents to learn video representations. Different from these works, our model also takes a sparse video as input, but we push the model to interpolate the motion trajectory containing fine-grained motion information. This simple trajectory-interpolation task does not increase the computational cost of the video encoder, yet it helps the model learn more fine-grained action information even given sparse video as input. More details about dense optical flow computation and trajectory tracking can be found in the Appendix.
Representing position features. Given a trajectory $T$ consisting of the tracked object position at each frame, we are more interested in the relative movement of objects than in their absolute locations. Consequently, we represent the position features with the relative movement between two adjacent points, $\Delta p_t = p_{t+1} - p_t$, i.e.,
$$z_p = (\Delta p_t, \ldots, \Delta p_{t+L-1}), \tag{4}$$
where $z_p$ is an $L \times 2$ dimensional feature. As each patch contains $K$ position features, we concatenate and normalize them to form the position part of the motion trajectory.
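A small sketch of Eq. (4) under these definitions is given below; the concatenation order and the exact normalization scheme are assumptions for illustration.

```python
# Sketch of Eq. (4): relative displacements along each trajectory, then
# concatenation and normalization over the K trajectories of one patch.
import numpy as np

def position_features(trajs):
    """trajs: (K, L+1, 2) tracked points of one masked patch.
    Returns a flat, normalized position feature of size K * L * 2."""
    disp = trajs[:, 1:] - trajs[:, :-1]             # delta p_t = p_{t+1} - p_t, shape (K, L, 2)
    z_p = disp.reshape(-1)                          # concatenate the K position features
    return (z_p - z_p.mean()) / (z_p.std() + 1e-6)  # normalize to zero mean, unit std
```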
Representing shape features. Besides embedding the
movement, the model also needs to be aware of the shape
changes of objects to recognize actions. Inspired by
HOG [21], we use histograms of oriented gradients (HOG)
with 9 bins to describe the shape of objects.
Compared with existing works [64,75] that reconstruct HOG in every single frame, we are more interested in the dynamic shape changes of an object, which better represent the action in a video. To this end, we follow DT [66] to calculate trajectory-aligned HOG, consisting of HOG features around all tracked points in a trajectory, i.e.,
$$z_s = (\mathrm{HOG}(p_t), \ldots, \mathrm{HOG}(p_{t+L-1})), \tag{5}$$
where $\mathrm{HOG}(\cdot)$ is the HOG descriptor and $z_s$ is an $L \times 9$ dimensional feature. Also, as one patch contains $K$ trajectories, we concatenate the $K$ trajectory-aligned HOG features and normalize them to the standard normal distribution to form the shape part of the motion trajectory.
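The trajectory-aligned HOG of Eq. (5) can be sketched as follows, here with scikit-image's 9-bin HOG computed on a small window around every tracked point; the window size and the scikit-image backend are assumptions rather than the paper's exact setup.

```python
# Sketch of Eq. (5): trajectory-aligned HOG with 9 orientation bins,
# computed on a small window around each tracked point.
import numpy as np
from skimage.feature import hog

def shape_features(frames, trajs, win=16):
    """frames: grayscale frames (H, W) aligned with the trajectory time steps;
    trajs: (K, L+1, 2) tracked points. Returns a flat, normalized shape feature."""
    K, T, _ = trajs.shape
    feats = []
    for k in range(K):
        for t in range(T - 1):                      # HOG(p_t), ..., HOG(p_{t+L-1})
            x, y = trajs[k, t].round().astype(int)
            patch = frames[t][max(y - win // 2, 0): y + win // 2,
                              max(x - win // 2, 0): x + win // 2]
            if patch.shape[0] < 2 or patch.shape[1] < 2:
                feats.append(np.zeros(9))           # point drifted out of the frame
                continue
            feats.append(hog(patch, orientations=9,
                             pixels_per_cell=patch.shape,
                             cells_per_block=(1, 1)))
    z_s = np.concatenate(feats)                     # concatenate the K trajectory-aligned HOGs
    return (z_s - z_s.mean()) / (z_s.std() + 1e-6)  # map to ~standard normal distribution
```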
4. Experiments
Implementation details. We conduct experiments on
Kinetics-400 (K400), Something-Something V2 (SSV2),
UCF101, and HMDB51 datasets. Unless otherwise stated, we follow the practice of previous work [64] and feed the model a