most existing studies focus on a specific task using a single
type of human motion data and thus cannot benefit from
other data resources.
In this work, we provide a new perspective on learning
human motion representations. The key idea is that we can
learn a versatile human motion representation from hetero-
geneous data resources in a unified manner, and utilize the
representation to handle different downstream tasks in a
unified way. We present a two-stage framework, consist-
ing of pretraining and finetuning, as depicted in Figure 1.
In the pretraining stage, we extract 2D skeleton sequences
from diverse motion data sources and corrupt them with
random masks and noise. We then train the motion encoder
to recover the 3D motion from the corrupted 2D skeletons.
This challenging pretext task intrinsically requires the motion
encoder to i) infer the underlying 3D human structure from
the temporal movement of the 2D skeletons and ii) recover
erroneous and missing observations. In this way, the motion encoder
implicitly captures human motion commonsense such as
joint linkages, anatomical constraints, and temporal dynam-
ics. In practice, we propose Dual-stream Spatio-temporal
Transformer (DSTformer) as the motion encoder to capture
the long-range relationship among skeleton keypoints. We
suppose that the motion representations learned from large-
scale and diverse data resources can be shared across
different downstream tasks and benefit their performance.
Therefore, for each downstream task, we attach a simple
regression head and finetune the pretrained motion
representations with task-specific training data and supervision.
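To make the corruption step concrete, the following NumPy sketch masks random joints and perturbs the remaining 2D coordinates. It is our own illustration, not the paper's implementation; the function name, mask probability, and noise scale are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_2d_skeletons(seq, mask_prob=0.15, noise_std=0.02):
    """Randomly mask joints and add Gaussian noise to a 2D skeleton sequence.

    seq: array of shape (T, J, 2) -- T frames, J joints, (x, y) coordinates.
    Returns the corrupted sequence (masked joints zeroed) and the keep-mask.
    Hyperparameters here are illustrative, not the paper's values.
    """
    mask = rng.random(seq.shape[:2]) > mask_prob          # (T, J) keep-mask
    noisy = seq + rng.normal(0.0, noise_std, seq.shape)   # perturb coordinates
    corrupted = noisy * mask[..., None]                   # zero out masked joints
    return corrupted, mask

# Toy sequence: 8 frames, 17 joints
seq = rng.random((8, 17, 2))
corrupted, mask = corrupt_2d_skeletons(seq)
```

The motion encoder would then be trained to regress the clean 3D sequence from `corrupted`, with the reconstruction loss supplying the supervisory signal.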
In summary, the contributions of this work are three-fold:
1) We provide a new perspective on solving various human-
centric video tasks through a shared framework of learning
human motion representations. 2) We propose a pretraining
method to leverage the large-scale yet heterogeneous human
motion resources and learn generalizable human motion
representations. Our approach simultaneously exploits the
precision of 3D mocap data and the diversity of in-the-wild
RGB videos. 3) We design a dual-stream
Transformer network with cascaded spatio-temporal self-
attention blocks that could serve as a general backbone for
human motion modeling. The experiments demonstrate that
the above designs enable a versatile human motion represen-
tation that can be transferred to multiple downstream tasks,
outperforming the task-specific state-of-the-art methods.
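The dual-stream idea can be sketched as follows: one attention stream operates over the joint axis within each frame (spatial), the other over the frame axis for each joint (temporal). The toy NumPy version below is our own simplification that only illustrates this axis-swapping idea; the actual DSTformer cascades spatio-temporal self-attention blocks in both orders and fuses the streams with learned attentive weights rather than a plain average.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Toy single-head self-attention using x as query, key, and value.

    x: (..., N, C) -- attends over the N tokens along the second-to-last axis.
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def dual_stream_block(x):
    """x: (T, J, C) skeleton features -- T frames, J joints, C channels.

    Spatial stream: attend over the J joints within each frame.
    Temporal stream: attend over the T frames for each joint.
    A plain average stands in for the learned fusion of the real model.
    """
    spatial = self_attention(x)                                      # (T, J, C)
    temporal = np.swapaxes(self_attention(np.swapaxes(x, 0, 1)), 0, 1)
    return 0.5 * (spatial + temporal)

x = np.random.default_rng(0).standard_normal((16, 17, 32))
y = dual_stream_block(x)   # output keeps the input shape (16, 17, 32)
```

The key design point is that each stream sees a different token axis of the same tensor, so spatial structure and temporal dynamics are modeled with the same attention primitive.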
2. Related Work
Learning Human Motion Representations. Early works
formulate human motion with Hidden Markov Models [53,108]
and graphical models [51,99]. Kanazawa et al. [42]
design a temporal encoder and a hallucinator to learn rep-
resentations of 3D human dynamics. Zhang et al. [132]
predict future 3D dynamics in a self-supervised manner.
Sun et al. [102] further incorporate action labels with an
action memory bank. From the action recognition perspec-
tive, a variety of pretext tasks are designed to learn mo-
tion representations in a self-supervised manner, includ-
ing future prediction [100], jigsaw puzzle [60], skeleton-
contrastive [107], speed change [101], cross-view con-
sistency [62], and contrast-reconstruction [117]. Similar
techniques are also explored in tasks like motion assess-
ment [33,85] and motion retargeting [126,139]. These meth-
ods leverage homogeneous motion data, design correspond-
ing pretext tasks, and apply them to a specific downstream
task. In this work, we propose a unified pretrain-finetune
framework to incorporate heterogeneous data resources and
demonstrate its versatility in various downstream tasks.
3D Human Pose Estimation. Recovering 3D human
poses from monocular RGB videos is a classical problem,
and existing methods fall into two categories.
The first estimates 3D poses directly from images with
CNNs [82,104,136]. However, these approaches face a
trade-off between 3D pose precision and appearance diversity
due to current data collection techniques. The second
category first extracts the 2D pose and then lifts it to 3D
with a separate neural network. The lifting can be achieved via
Fully Connected Network [29,78], Temporal Convolutional
Network (TCN) [22,89], GCN [13,28,116], and Trans-
former [56,94,134,135]. Our framework is built upon the
second category as we use the proposed DSTformer to ac-
complish 2D-to-3D lifting.
Skeleton-based Action Recognition. The pioneering
works [74,115,127] point out the inherent connection be-
tween action recognition and human pose estimation. To-
wards modeling the spatio-temporal relationship among hu-
man joints, previous studies mainly employ LSTM [98,138]
and GCN [21,55,68,96,123]. Most recently, PoseC-
onv3D [32] proposes to apply 3D-CNN on the stacked 2D
joint heatmaps and achieves improved results. In addition to
the fully-supervised action recognition task, NTU-RGB+D-
120 [64] brings attention to the challenging one-shot action
recognition problem. To this end, SL-DML [81] applies deep
metric learning to multi-modal signals. Sabater et al. [92]
explore one-shot recognition in therapy scenarios with a TCN.
We demonstrate that the pretrained motion representations
generalize well to action recognition tasks and that the
pretrain-finetune framework offers a natural solution to the
one-shot challenge.
Human Mesh Recovery. Based on the parametric human
models such as SMPL [71], many research works [41,75,83,
122,133] focus on regressing the human mesh from a single
image. SPIN [48] additionally incorporates fitting the body