MotionBERT: A Unified Perspective on Learning Human Motion Representations
Wentao Zhu¹   Xiaoxuan Ma¹   Zhaoyang Liu²   Libin Liu¹   Wayne Wu²   Yizhou Wang¹†
¹Peking University   ²Shanghai AI Laboratory
{wtzhu,maxiaoxuan,libin.liu,yizhou.wang}@pku.edu.cn
{zyliumy,wuwenyan0503}@gmail.com
Abstract
We present a unified perspective on tackling various
human-centric video tasks by learning human motion rep-
resentations from large-scale and heterogeneous data re-
sources. Specifically, we propose a pretraining stage in
which a motion encoder is trained to recover the underly-
ing 3D motion from noisy partial 2D observations. The
motion representations acquired in this way incorporate
geometric, kinematic, and physical knowledge about hu-
man motion, which can be easily transferred to multiple
downstream tasks. We implement the motion encoder with
a Dual-stream Spatio-temporal Transformer (DSTformer)
neural network. It captures long-range spatio-temporal relationships among the skeletal joints comprehensively and adaptively, as exemplified by the lowest 3D pose estimation error to date when trained from scratch. Furthermore, our
proposed framework achieves state-of-the-art performance
on all three downstream tasks by finetuning the pre-
trained motion encoder with a simple regression head (1-2
layers), which demonstrates the versatility of the learned
motion representations. Code and models are available at
https://motionbert.github.io/
1. Introduction
Perceiving and understanding human activities have long
been a core pursuit of machine intelligence. To this end,
researchers define various tasks to estimate human-centric
semantic labels from videos, e.g. skeleton keypoints [14,35],
action classes [64,123], and surface meshes [46,71]. While
significant progress has been made in each of these tasks,
they tend to be modeled in isolation, rather than as intercon-
nected problems.

†Yizhou Wang is with the Center on Frontiers of Computing Studies, School of Computer Science, Peking University, and the Institute for Artificial Intelligence, Peking University.
Figure 1. Framework overview. (Panels: I. Pretrain — video, 2D skeletons, and mask/noise-corrupted 2D skeletons → motion encoder → 3D motion; II. Finetune — 2D skeletons from annotated videos, unannotated videos, and motion capture → motion encoder → FC head for the downstream tasks of 3D pose estimation, action recognition, e.g. “tennis bat swing”, and mesh recovery.) We utilize a motion encoder to learn human motion representations via recovering 3D human motion from corrupted 2D skeleton sequences. To adapt to different downstream tasks, we finetune the pretrained motion representations with a linear layer or a simple MLP.
For example, Spatial Temporal Graph Convolutional Networks (ST-GCN) have been applied to modeling the spatio-temporal relationships of human joints in both 3D
pose estimation [13,116] and action recognition [96,123],
but their connections have not been fully explored. Intu-
itively, these models should all have learned to identify typi-
cal human motion patterns, despite being designed for dif-
ferent problems. Nonetheless, current methods fail to mine
and utilize such commonalities across the tasks. Ideally, we
could develop a unified human-centric video representation
that can be shared across all relevant tasks.
One significant challenge to developing such a represen-
tation is the heterogeneity of available data resources. Mo-
tion capture (Mocap) systems [38,76] provide high-fidelity
3D motion data obtained with markers and sensors, but the
appearances of captured videos are usually constrained to
simple indoor scenes. Action recognition datasets provide
annotations of the action semantics, but they either contain
no human pose labels [16,95] or feature limited motion
of daily activities [63,64,93]. In contrast, in-the-wild hu-
man videos offer a vast and diverse range of appearance and
motion. However, obtaining precise 2D pose annotations
requires considerable effort [3], and acquiring ground-truth
(GT) 3D joint locations is almost impossible. Consequently,
most existing studies focus on a specific task using a single
type of human motion data, and thus cannot exploit the advantages of other data resources.
In this work, we provide a new perspective on learning
human motion representations. The key idea is that we can
learn a versatile human motion representation from hetero-
geneous data resources in a unified manner, and utilize the
representation to handle different downstream tasks in a
unified way. We present a two-stage framework, consist-
ing of pretraining and finetuning, as depicted in Figure 1.
In the pretraining stage, we extract 2D skeleton sequences
from diverse motion data sources and corrupt them with ran-
dom masks and noise. Subsequently, we train the motion
encoder to recover the 3D motion from the corrupted 2D
skeletons. This challenging pretext task intrinsically requires
the motion encoder to i) infer the underlying 3D human structure from its temporal movements and ii) recover erroneous and missing observations. In this way, the motion encoder
implicitly captures human motion commonsense such as
joint linkages, anatomical constraints, and temporal dynam-
ics. In practice, we propose Dual-stream Spatio-temporal
Transformer (DSTformer) as the motion encoder to capture
the long-range relationship among skeleton keypoints. We
suppose that the motion representations learned from large-
scale and diversified data resources could be shared across
different downstream tasks and benefit their performance.
Therefore, for each downstream task, we adapt the pretrained
motion representations using task-specific training data and
supervisory signals with a simple regression head.
In summary, the contributions of this work are three-fold:
1) We provide a new perspective on solving various human-
centric video tasks through a shared framework of learning
human motion representations. 2) We propose a pretraining
method to leverage the large-scale yet heterogeneous human
motion resources and learn generalizable human motion
representations. Our approach could take advantage of the
precision of 3D mocap data and the diversity of in-the-wild
RGB videos at the same time. 3) We design a dual-stream
Transformer network with cascaded spatio-temporal self-
attention blocks that could serve as a general backbone for
human motion modeling. The experiments demonstrate that
the above designs enable a versatile human motion represen-
tation that can be transferred to multiple downstream tasks,
outperforming the task-specific state-of-the-art methods.
2. Related Work
Learning Human Motion Representations. Early works
formulate human motion with Hidden Markov Models [53,
108] and graphical models [51,99]. Kanazawa et al. [42]
design a temporal encoder and a hallucinator to learn rep-
resentations of 3D human dynamics. Zhang et al. [132]
predict future 3D dynamics in a self-supervised manner.
Sun et al. [102] further incorporate action labels with an
action memory bank. From the action recognition perspec-
tive, a variety of pretext tasks are designed to learn mo-
tion representations in a self-supervised manner, includ-
ing future prediction [100], jigsaw puzzle [60], skeleton-
contrastive [107], speed change [101], cross-view con-
sistency [62], and contrast-reconstruction [117]. Similar
techniques are also explored in tasks like motion assess-
ment [33,85] and motion retargeting [126,139]. These meth-
ods leverage homogeneous motion data, design correspond-
ing pretext tasks, and apply them to a specific downstream
task. In this work, we propose a unified pretrain-finetune
framework to incorporate heterogeneous data resources and
demonstrate its versatility in various downstream tasks.
3D Human Pose Estimation. Recovering 3D human
poses from monocular RGB videos is a classical problem,
and the methods can be categorized into two categories.
The first is to estimate 3D poses with CNN directly from
images [82,104,136]. However, one limitation of these
approaches is that there is a trade-off between 3D pose
precision and appearance diversity due to current data col-
lection techniques. The second category is to extract the
2D pose first, then lift the estimated 2D pose to 3D with
a separate neural network. The lifting can be achieved via
Fully Connected Network [29,78], Temporal Convolutional
Network (TCN) [22,89], GCN [13,28,116], and Trans-
former [56,94,134,135]. Our framework is built upon the
second category as we use the proposed DSTformer to ac-
complish 2D-to-3D lifting.
Skeleton-based Action Recognition. The pioneering
works [74,115,127] point out the inherent connection be-
tween action recognition and human pose estimation. To-
wards modeling the spatio-temporal relationship among hu-
man joints, previous studies mainly employ LSTM [98,138]
and GCN [21,55,68,96,123]. Most recently, PoseC-
onv3D [32] proposes to apply 3D-CNN on the stacked 2D
joint heatmaps and achieves improved results. In addition to
the fully-supervised action recognition task, NTU-RGB+D-
120 [64] brings attention to the challenging one-shot action
recognition problem. To this end, SL-DML [81] applies deep
metric learning to multi-modal signals. Sabater et al. [92]
explore one-shot recognition in therapy scenarios with a TCN.
We demonstrate that the pretrained motion representations
could generalize well to action recognition tasks, and the
pretrain-finetune framework is a suitable solution for the
one-shot challenges.
Human Mesh Recovery. Based on the parametric human
models such as SMPL [71], many research works [41,75,83,
122,133] focus on regressing the human mesh from a single
image.
Figure 2. Model architecture. (Block diagram: 2D skeletons → FC with spatial and temporal positional encoding → N× dual-stream-fusion modules, each with Spatial/Temporal MHSA branches S1→T1 and T2→S2, Add + Norm + MLP, and Adaptive Fusion with weights αST, αTS → motion representation E → FC → 3D motion X̂.) We propose the Dual-stream Spatio-temporal Transformer (DSTformer) as a general backbone for human motion modeling. DSTformer consists of N dual-stream-fusion modules. Each module contains two branches of spatial or temporal MHSA and MLP. The Spatial MHSA models the connection among different joints within a timestep, while the Temporal MHSA models the movement of one joint.
SPIN [48] additionally incorporates fitting the body model to 2D joints in the training loop. Despite their promis-
ing per-frame results, these methods yield jittery and unsta-
ble results [46,130] when applied to videos. To improve their
temporal coherence, PoseBERT [8] and SmoothNet [130]
propose to apply a denoising and smoothing module to the
single-frame predictions. Several works [24,42,46,106] take
video clips as input to exploit the temporal cues. Another
common problem is that paired images and GT meshes are
mostly captured in constrained scenarios, which limits the
generalization ability of the above methods. To that end,
Pose2Mesh [25] proposes to first extract 2D skeletons using
an off-the-shelf pose estimator, then lift them to 3D mesh
vertices. Our approach is complementary to state-of-the-art
human mesh recovery methods and could further improve
their temporal coherence with the pretrained motion repre-
sentations.
3. Method
3.1. Overview
As discussed in Section 1, our approach consists of two
stages, namely unified pretraining and task-specific fine-
tuning. In the first stage, we train a motion encoder to
accomplish the 2D-to-3D lifting task, where we use the pro-
posed DSTformer as the backbone. In the second stage,
we finetune the pretrained motion encoder and a few new
layers on the downstream tasks. We use 2D skeleton se-
quences as input for both pretraining and finetuning because
they could be reliably extracted from all kinds of motion
sources [3,10,76,86,103], and are more robust to varia-
tions [19,32]. Existing studies have shown the effectiveness
of using 2D skeleton sequences for different downstream
tasks [25,32,89,109]. We will first introduce the architec-
ture of DSTformer, and then describe the training scheme in
detail.
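To make the finetuning recipe concrete, the following sketch shows one way a pretrained motion encoder could be wrapped with a shallow task-specific head. It is a hypothetical PyTorch illustration, not the released implementation; the `FinetuneModel` class, the pooling choice, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class FinetuneModel(nn.Module):
    """Hypothetical wrapper: pretrained motion encoder + shallow task-specific head."""
    def __init__(self, encoder: nn.Module, feat_dim: int, out_dim: int, pool: bool = False):
        super().__init__()
        self.encoder = encoder      # pretrained backbone: (B, T, J, C_in) -> (B, T, J, feat_dim)
        self.pool = pool            # pool over frames/joints for clip-level tasks (e.g. action recognition)
        self.head = nn.Linear(feat_dim, out_dim)  # the simple regression/classification head

    def forward(self, skel_2d: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(skel_2d)        # motion representation: (B, T, J, feat_dim)
        if self.pool:
            feat = feat.mean(dim=(1, 2))    # (B, feat_dim) for one label per clip
        return self.head(feat)              # per-joint outputs or clip-level logits
```

For dense tasks such as 3D pose estimation the per-frame, per-joint features would be kept, while clip-level tasks such as action recognition would pool them before the head.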
3.2. Network Architecture
Figure 2 shows the network architecture for 2D-to-3D lifting. Given an input 2D skeleton sequence $\mathbf{x} \in \mathbb{R}^{T \times J \times C_{\text{in}}}$, we first project it to a high-dimensional feature $\mathbf{F}^{0} \in \mathbb{R}^{T \times J \times C_{f}}$, then add learnable spatial positional encoding $\mathbf{P}^{S}_{\text{pos}} \in \mathbb{R}^{1 \times J \times C_{f}}$ and temporal positional encoding $\mathbf{P}^{T}_{\text{pos}} \in \mathbb{R}^{T \times 1 \times C_{f}}$ to it. We then use the sequence-to-sequence model DSTformer to calculate $\mathbf{F}^{i} \in \mathbb{R}^{T \times J \times C_{f}}$ ($i = 1, \dots, N$), where $N$ is the network depth. We apply a linear layer with tanh activation [30] to $\mathbf{F}^{N}$ to compute the motion representation $\mathbf{E} \in \mathbb{R}^{T \times J \times C_{e}}$. Finally, we apply a linear transformation to $\mathbf{E}$ to estimate the 3D motion $\hat{\mathbf{X}} \in \mathbb{R}^{T \times J \times C_{\text{out}}}$. Here, $T$ denotes the sequence length and $J$ denotes the number of body joints. $C_{\text{in}}$, $C_{f}$, $C_{e}$, and $C_{\text{out}}$ denote the channel numbers of the input, feature, embedding, and output, respectively. We first introduce the basic building blocks of DSTformer, i.e. the Spatial and Temporal Blocks with Multi-Head Self-Attention (MHSA), and then explain the DSTformer architecture design.
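The shape bookkeeping above can be summarized as a short forward pass. The sketch below is illustrative only: `DualStreamBlock` is stubbed with `nn.Identity` (a fuller sketch follows Eq. (5)), and the default channel sizes, sequence length, and depth are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DSTformer(nn.Module):
    """Shape-level skeleton of the 2D-to-3D lifting network (illustrative dimensions)."""
    def __init__(self, c_in=2, c_f=256, c_e=512, c_out=3, T=243, J=17, depth=5):
        super().__init__()
        self.joint_embed = nn.Linear(c_in, c_f)                       # x: (B,T,J,C_in) -> F^0: (B,T,J,C_f)
        self.pos_spatial = nn.Parameter(torch.zeros(1, 1, J, c_f))    # P^S_pos, shared over time
        self.pos_temporal = nn.Parameter(torch.zeros(1, T, 1, c_f))   # P^T_pos, shared over joints
        # Stand-ins for the N dual-stream-fusion modules (a fuller sketch follows Eq. (5)).
        self.blocks = nn.ModuleList(nn.Identity() for _ in range(depth))
        self.to_embedding = nn.Sequential(nn.Linear(c_f, c_e), nn.Tanh())  # motion representation E
        self.head = nn.Linear(c_e, c_out)                             # estimated 3D motion X_hat

    def forward(self, x):                    # x: (B, T, J, C_in), fixed-length sketch
        f = self.joint_embed(x) + self.pos_spatial + self.pos_temporal
        for blk in self.blocks:
            f = blk(f)                       # F^i: (B, T, J, C_f) at every depth
        e = self.to_embedding(f)             # E: (B, T, J, C_e)
        return self.head(e)                  # X_hat: (B, T, J, C_out)
```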
Spatial Block. Spatial MHSA (S-MHSA) aims at modeling the relationship among the joints within the same time step. It is defined as

$$\text{S-MHSA}(\mathbf{Q}_S, \mathbf{K}_S, \mathbf{V}_S) = [\text{head}_1; \dots; \text{head}_h]\,\mathbf{W}_S^{P}, \quad \text{head}_i = \text{softmax}\!\left(\frac{\mathbf{Q}_S^{i} (\mathbf{K}_S^{i})^{\top}}{\sqrt{d_K}}\right)\mathbf{V}_S^{i}, \qquad (1)$$

where $\mathbf{W}_S^{P}$ is a projection parameter matrix, $h$ is the number of heads, $i \in \{1, \dots, h\}$, and $\top$ denotes matrix transpose. We utilize self-attention to get the query $\mathbf{Q}_S$, key $\mathbf{K}_S$, and value $\mathbf{V}_S$ from the input per-frame spatial feature $\mathbf{F}_S \in \mathbb{R}^{J \times C_e}$ for each $\text{head}_i$:

$$\mathbf{Q}_S^{i} = \mathbf{F}_S \mathbf{W}_S^{(Q,i)}, \quad \mathbf{K}_S^{i} = \mathbf{F}_S \mathbf{W}_S^{(K,i)}, \quad \mathbf{V}_S^{i} = \mathbf{F}_S \mathbf{W}_S^{(V,i)}, \qquad (2)$$

where $\mathbf{W}_S^{(Q,i)}$, $\mathbf{W}_S^{(K,i)}$, $\mathbf{W}_S^{(V,i)}$ are projection matrices, and $d_K$ is the feature dimension of $\mathbf{K}_S$. We apply S-MHSA to
features of different time steps in parallel. A residual connection and layer normalization (LayerNorm) are applied to the S-MHSA result, which is further fed into a multilayer perceptron (MLP) followed by another residual connection and LayerNorm, following [112]. We denote the entire spatial block with MHSA, LayerNorm, MLP, and residual connections by $\mathcal{S}$.
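As a concrete reading of Eqs. (1)-(2), the sketch below computes multi-head attention over the $J$ joints of every frame, folding the time axis into the batch so all frames are processed in parallel. It is a hedged PyTorch sketch (packed QKV projection, head dimension $d_K = C/h$), not the authors' code; the residual connections, LayerNorm, and MLP of the full spatial block are omitted here.

```python
import torch
import torch.nn as nn

class SpatialMHSA(nn.Module):
    """Multi-head self-attention over the joints within each frame (Eqs. 1-2); illustrative only."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_k = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)   # packs W_S^(Q,i), W_S^(K,i), W_S^(V,i) for all heads
        self.proj = nn.Linear(dim, dim)      # output projection W_S^P

    def forward(self, f):                    # f: (B, T, J, C)
        B, T, J, C = f.shape
        qkv = self.qkv(f.reshape(B * T, J, C))                       # frames processed in parallel
        q, k, v = qkv.reshape(B * T, J, 3, self.heads, self.d_k).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) / self.d_k ** 0.5           # (B*T, heads, J, J)
        out = attn.softmax(dim=-1) @ v                               # (B*T, heads, J, d_k)
        return self.proj(out.transpose(1, 2).reshape(B, T, J, C))
```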
Temporal Block. Temporal MHSA (T-MHSA) aims at modeling the relationship across time steps for a body joint. Its computation is similar to S-MHSA, except that the MHSA is applied to the per-joint temporal feature $\mathbf{F}_T \in \mathbb{R}^{T \times C_e}$ and parallelized over the spatial dimension:

$$\text{T-MHSA}(\mathbf{Q}_T, \mathbf{K}_T, \mathbf{V}_T) = [\text{head}_1; \dots; \text{head}_h]\,\mathbf{W}_T^{P}, \quad \text{head}_i = \text{softmax}\!\left(\frac{\mathbf{Q}_T^{i} (\mathbf{K}_T^{i})^{\top}}{\sqrt{d_K}}\right)\mathbf{V}_T^{i}, \qquad (3)$$

where $i \in \{1, \dots, h\}$, and $\mathbf{Q}_T$, $\mathbf{K}_T$, $\mathbf{V}_T$ are computed as in Eq. (2). We denote the entire temporal block by $\mathcal{T}$.
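T-MHSA reuses the same attention computation; only the axis changes, with joints folded into the batch and attention running over the $T$ time steps. A minimal sketch, reusing the hypothetical `SpatialMHSA` from the previous snippet:

```python
class TemporalMHSA(SpatialMHSA):
    """Same attention as SpatialMHSA, but applied along time for each joint (Eq. 3)."""
    def forward(self, f):                      # f: (B, T, J, C)
        f_t = f.permute(0, 2, 1, 3)            # (B, J, T, C): joints become the parallel axis
        out = super().forward(f_t)             # attention now mixes the T time steps per joint
        return out.permute(0, 2, 1, 3)         # back to (B, T, J, C)
```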
Dual-stream Spatio-temporal Transformer. Given spatial and temporal MHSA blocks that capture intra-frame and inter-frame body-joint interactions respectively, we assemble these basic building blocks to fuse the spatial and temporal information. We design a dual-stream architec-
information in the flow. We design a dual-stream architec-
ture with the following assumptions: 1) Both streams should
be capable of modeling the comprehensive spatio-temporal
context. 2) Each stream should be specialized in different
spatio-temporal aspects. 3) The two streams should be fused
together, with the fusion weights dynamically balanced de-
pending on the input spatio-temporal characteristics.
Hence, we stack the spatial and temporal MHSA blocks in
different orders, forming two parallel computation branches.
The output features of the two branches are fused using
adaptive weights predicted by an attention regressor. The
dual-stream-fusion module is then repeated $N$ times:

$$\mathbf{F}^{i} = \boldsymbol{\alpha}_{ST}^{i} \circ \mathcal{T}_1^{i}(\mathcal{S}_1^{i}(\mathbf{F}^{i-1})) + \boldsymbol{\alpha}_{TS}^{i} \circ \mathcal{S}_2^{i}(\mathcal{T}_2^{i}(\mathbf{F}^{i-1})), \quad i \in \{1, \dots, N\}, \qquad (4)$$

where $\mathbf{F}^{i}$ denotes the feature embedding at depth $i$, and $\circ$ denotes element-wise multiplication. The orders of the $\mathcal{S}$ and $\mathcal{T}$ blocks are shown in Figure 2, and different blocks do not share weights. The adaptive fusion weights $\boldsymbol{\alpha}_{ST}, \boldsymbol{\alpha}_{TS} \in \mathbb{R}^{N \times T \times J}$ are given by

$$\boldsymbol{\alpha}_{ST}^{i}, \boldsymbol{\alpha}_{TS}^{i} = \text{softmax}\big(\mathbf{W}([\mathcal{T}_1^{i}(\mathcal{S}_1^{i}(\mathbf{F}^{i-1})), \mathcal{S}_2^{i}(\mathcal{T}_2^{i}(\mathbf{F}^{i-1}))])\big), \qquad (5)$$

where $\mathbf{W}$ is a learnable linear transformation and $[\cdot,\cdot]$ denotes concatenation.
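One plausible implementation of Eqs. (4)-(5) is sketched below: two parallel S→T and T→S branches followed by a per-position softmax fusion. It builds on the hypothetical `SpatialMHSA`/`TemporalMHSA` modules above; the `STBlock` wrapper (attention plus MLP with residuals and LayerNorm) and the MLP expansion ratio are assumptions, not the authors' exact block.

```python
import torch
import torch.nn as nn

class STBlock(nn.Module):
    """One S or T block: MHSA + MLP, each with a residual connection and LayerNorm."""
    def __init__(self, dim, attn):
        super().__init__()
        self.attn, self.norm1 = attn, nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, f):
        f = self.norm1(f + self.attn(f))
        return self.norm2(f + self.mlp(f))

class DualStreamBlock(nn.Module):
    """One dual-stream-fusion module (Eqs. 4-5): S->T and T->S branches with adaptive fusion."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.s1, self.t1 = STBlock(dim, SpatialMHSA(dim, heads)), STBlock(dim, TemporalMHSA(dim, heads))
        self.t2, self.s2 = STBlock(dim, TemporalMHSA(dim, heads)), STBlock(dim, SpatialMHSA(dim, heads))
        self.fuse = nn.Linear(dim * 2, 2)      # attention regressor W predicting alpha_ST, alpha_TS

    def forward(self, f):                      # f: (B, T, J, C)
        branch_st = self.t1(self.s1(f))        # spatial-then-temporal stream
        branch_ts = self.s2(self.t2(f))        # temporal-then-spatial stream
        alpha = self.fuse(torch.cat([branch_st, branch_ts], dim=-1)).softmax(dim=-1)  # (B, T, J, 2)
        return alpha[..., :1] * branch_st + alpha[..., 1:] * branch_ts
```

Stacking `DualStreamBlock` $N$ times and adding the embedding and output layers recovers the DSTformer skeleton shown earlier.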
3.3. Unified Pretraining
We address two key challenges when designing the uni-
fied pretraining framework: 1) How to learn a powerful
motion representation with a universal pretext task. 2) How
to utilize large-scale but heterogeneous human motion data
in all kinds of formats.
For the first challenge, we follow the successful practices
in language [12,30,90] and vision [7,36] modeling to con-
struct the supervision signals, i.e. mask part of the input and
use the encoded representations to reconstruct the whole
input. Note that such a “cloze” task naturally exists in human motion analysis: recovering the lost depth information from 2D visual observations, i.e. 3D human pose estimation. Inspired by this, we leverage the large-scale 3D mocap
data [76] and design a 2D-to-3D lifting pretext task. We first
extract the 2D skeleton sequences $\mathbf{x}$ by projecting the 3D motion orthographically. Then, we corrupt $\mathbf{x}$ by randomly masking and adding noise to produce the corrupted 2D skeleton sequences, which also resemble 2D detection results as they contain occlusions, detection failures, and errors. Both joint-level and frame-level masks are applied with certain
probabilities. We use the aforementioned motion encoder to get the motion representation $\mathbf{E}$ and reconstruct the 3D motion $\hat{\mathbf{X}}$. We then compute the joint loss $\mathcal{L}_{\text{3D}}$ between $\hat{\mathbf{X}}$ and the GT 3D motion $\mathbf{X}$. We also add the velocity loss $\mathcal{L}_{\text{O}}$ following previous works [89,134]. The 3D reconstruction losses are thus given by

$$\mathcal{L}_{\text{3D}} = \sum_{t=1}^{T} \sum_{j=1}^{J} \lVert \hat{\mathbf{X}}_{t,j} - \mathbf{X}_{t,j} \rVert_{2}, \qquad \mathcal{L}_{\text{O}} = \sum_{t=2}^{T} \sum_{j=1}^{J} \lVert \hat{\mathbf{O}}_{t,j} - \mathbf{O}_{t,j} \rVert_{2}, \qquad (6)$$

where $\hat{\mathbf{O}}_{t} = \hat{\mathbf{X}}_{t} - \hat{\mathbf{X}}_{t-1}$ and $\mathbf{O}_{t} = \mathbf{X}_{t} - \mathbf{X}_{t-1}$.
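The corruption step and the 3D losses can be sketched as follows. The masking probabilities, noise scale, and the use of a mean instead of the sums in Eq. (6) are illustrative choices, not the paper's hyperparameters.

```python
import torch

def corrupt_2d(x2d, p_joint=0.15, p_frame=0.05, noise_std=0.02):
    """Randomly mask joints/frames and add Gaussian noise to a 2D sequence x2d of shape (T, J, 2)."""
    x = x2d + noise_std * torch.randn_like(x2d)          # simulated detection noise
    T, J, _ = x2d.shape
    joint_mask = torch.rand(T, J, 1) < p_joint           # joint-level masking
    frame_mask = torch.rand(T, 1, 1) < p_frame           # frame-level masking
    return x.masked_fill(joint_mask | frame_mask, 0.0)   # masked joints are zeroed out here

def loss_3d(x3d_hat, x3d):
    """Eq. (6): per-joint position loss and velocity loss on frame differences (mean-reduced)."""
    pos = (x3d_hat - x3d).norm(dim=-1).mean()
    vel = ((x3d_hat[1:] - x3d_hat[:-1]) - (x3d[1:] - x3d[:-1])).norm(dim=-1).mean()
    return pos, vel
```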
For the second challenge, we notice that 2D skeletons
could serve as a universal medium as they can be extracted
from all sorts of motion data sources. We further incorporate
in-the-wild RGB videos into the 2D-to-3D lifting framework
for unified pretraining. For RGB videos, the 2D skeletons $\mathbf{x}$ could be given by manual annotation [3] or 2D pose estimators [14,103], and the depth channel of the extracted 2D skeletons is intrinsically “masked”. Similarly, we add extra masks and noise to degrade $\mathbf{x}$ (if $\mathbf{x}$ already contains detection noise, only masking is applied). As the 3D motion GT $\mathbf{X}$ is not available for these data, we apply a weighted 2D re-projection loss, calculated as

$$\mathcal{L}_{\text{2D}} = \sum_{t=1}^{T} \sum_{j=1}^{J} \delta_{t,j} \lVert \hat{\mathbf{x}}_{t,j} - \mathbf{x}_{t,j} \rVert_{2}, \qquad (7)$$

where $\hat{\mathbf{x}}$ is the 2D orthographic projection of the estimated 3D motion $\hat{\mathbf{X}}$, and $\delta \in \mathbb{R}^{T \times J}$ is given by the visibility annotation or the 2D detection confidence.
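A minimal sketch of Eq. (7), assuming the orthographic projection simply drops the depth coordinate of the estimated 3D motion; the function name and tensor layout are hypothetical.

```python
def loss_2d_reprojection(x3d_hat, x2d, conf):
    """Eq. (7): confidence-weighted 2D reprojection loss (mean-reduced).
    x3d_hat: (T, J, 3) estimated 3D motion; x2d: (T, J, 2) input 2D skeletons;
    conf: (T, J) visibility annotation or 2D detector confidence (delta)."""
    x2d_hat = x3d_hat[..., :2]                           # orthographic projection: drop depth
    return (conf * (x2d_hat - x2d).norm(dim=-1)).mean()
```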
The total pretraining loss is computed by

$$\mathcal{L} = \underbrace{\mathcal{L}_{\text{3D}} + \lambda_{\text{O}} \mathcal{L}_{\text{O}}}_{\text{for 3D data}} + \underbrace{\mathcal{L}_{\text{2D}}}_{\text{for 2D data}}, \qquad (8)$$

where $\lambda_{\text{O}}$ is a constant coefficient to balance the losses.
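Putting the pieces together, one training step could mix a mocap sample (supervised with Eq. (6)) and an in-the-wild sample (supervised with Eq. (7)), reusing the hypothetical helpers sketched above. The batch routing and the value of lambda_o below are assumptions, not the paper's settings.

```python
def pretrain_step(model, sample_3d, sample_2d, lambda_o=20.0):
    """One mixed update combining the terms of Eq. (8); batching and lambda_o are illustrative."""
    # Mocap branch: the GT 3D motion is projected to 2D beforehand, then corrupted and lifted back.
    x_in = corrupt_2d(sample_3d["x2d"]).unsqueeze(0)                  # (1, T, J, 2)
    pos, vel = loss_3d(model(x_in)[0], sample_3d["x3d"])
    # RGB branch: detected 2D skeletons are already noisy, so only masking is applied here.
    y_in = corrupt_2d(sample_2d["x2d"], noise_std=0.0).unsqueeze(0)
    l2d = loss_2d_reprojection(model(y_in)[0], sample_2d["x2d"], sample_2d["conf"])
    return pos + lambda_o * vel + l2d                                 # Eq. (8)
```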