MotionBERT: A Unified Perspective on Learning Human Motion Representations
Wentao Zhu¹   Xiaoxuan Ma¹   Zhaoyang Liu²   Libin Liu¹   Wayne Wu²   Yizhou Wang¹†
¹Peking University   ²Shanghai AI Laboratory
{wtzhu,maxiaoxuan,libin.liu,yizhou.wang}@pku.edu.cn
{zyliumy,wuwenyan0503}@gmail.com
Abstract
We present a unified perspective on tackling various
human-centric video tasks by learning human motion rep-
resentations from large-scale and heterogeneous data re-
sources. Specifically, we propose a pretraining stage in
which a motion encoder is trained to recover the underly-
ing 3D motion from noisy partial 2D observations. The
motion representations acquired in this way incorporate
geometric, kinematic, and physical knowledge about hu-
man motion, which can be easily transferred to multiple
downstream tasks. We implement the motion encoder with
a Dual-stream Spatio-temporal Transformer (DSTformer)
neural network. It captures long-range spatio-temporal relationships among the skeletal joints comprehensively and adaptively, as exemplified by the lowest 3D pose estimation error to date when trained from scratch. Furthermore, our
proposed framework achieves state-of-the-art performance
on all three downstream tasks by finetuning the pre-
trained motion encoder with a simple regression head (1-2
layers), which demonstrates the versatility of the learned
motion representations. Code and models are available at
https://motionbert.github.io/
1. Introduction
Perceiving and understanding human activities have long
been a core pursuit of machine intelligence. To this end,
researchers define various tasks to estimate human-centric
semantic labels from videos, e.g. skeleton keypoints [14,35],
action classes [64,123], and surface meshes [46,71]. While
significant progress has been made in each of these tasks,
they tend to be modeled in isolation, rather than as intercon-
nected problems.

†Yizhou Wang is with the Center on Frontiers of Computing Studies, School of Computer Science, Peking University, and the Institute for Artificial Intelligence, Peking University.
Figure 1. Framework overview. (Panels: I. Pretrain — video, 2D skeletons, and mask/noise-corrupted 2D skeletons → motion encoder → 3D motion; II. Finetune — 2D skeletons from annotated videos, unannotated videos, and motion capture → motion encoder → FC head for the downstream tasks of 3D pose estimation, action recognition, e.g. “tennis bat swing”, and mesh recovery.) We utilize a motion encoder to learn human motion representations via recovering 3D human motion from corrupted 2D skeleton sequences. To adapt to different downstream tasks, we finetune the pretrained motion representations with a linear layer or a simple MLP.
For example, Spatial Temporal Graph Convolutional Networks (ST-GCN) have been applied to modeling the spatio-temporal relationships of human joints in both 3D
pose estimation [13,116] and action recognition [96,123],
but their connections have not been fully explored. Intu-
itively, these models should all have learned to identify typi-
cal human motion patterns, despite being designed for dif-
ferent problems. Nonetheless, current methods fail to mine
and utilize such commonalities across the tasks. Ideally, we
could develop a unified human-centric video representation
that can be shared across all relevant tasks.
One significant challenge to developing such a represen-
tation is the heterogeneity of available data resources. Mo-
tion capture (Mocap) systems [38,76] provide high-fidelity
3D motion data obtained with markers and sensors, but the
appearances of captured videos are usually constrained to
simple indoor scenes. Action recognition datasets provide
annotations of the action semantics, but they either contain
no human pose labels [16,95] or feature limited motion
of daily activities [63,64,93]. In contrast, in-the-wild hu-
man videos offer a vast and diverse range of appearance and
motion. However, obtaining precise 2D pose annotations
requires considerable effort [3], and acquiring ground-truth
(GT) 3D joint locations is almost impossible. Consequently,
most existing studies focus on a specific task using a single
type of human motion data, and thus cannot exploit the advantages of other data resources.
In this work, we provide a new perspective on learning
human motion representations. The key idea is that we can
learn a versatile human motion representation from hetero-
geneous data resources in a unified manner, and utilize the
representation to handle different downstream tasks in a
unified way. We present a two-stage framework, consist-
ing of pretraining and finetuning, as depicted in Figure 1.
In the pretraining stage, we extract 2D skeleton sequences
from diverse motion data sources and corrupt them with ran-
dom masks and noise. Subsequently, we train the motion
encoder to recover the 3D motion from the corrupted 2D
skeletons. This challenging pretext task intrinsically requires
the motion encoder to i) infer the underlying 3D human structure from its temporal movements and ii) recover erroneous and missing observations. In this way, the motion encoder
implicitly captures human motion commonsense such as
joint linkages, anatomical constraints, and temporal dynam-
ics. In practice, we propose Dual-stream Spatio-temporal
Transformer (DSTformer) as the motion encoder to capture
the long-range relationship among skeleton keypoints. We
suppose that the motion representations learned from large-
scale and diversified data resources could be shared across
different downstream tasks and benefit their performance.
Therefore, for each downstream task, we adapt the pretrained
motion representations using task-specific training data and
supervisory signals with a simple regression head.
In summary, the contributions of this work are three-fold:
1) We provide a new perspective on solving various human-
centric video tasks through a shared framework of learning
human motion representations. 2) We propose a pretraining
method to leverage the large-scale yet heterogeneous human
motion resources and learn generalizable human motion
representations. Our approach could take advantage of the
precision of 3D mocap data and the diversity of in-the-wild
RGB videos at the same time. 3) We design a dual-stream
Transformer network with cascaded spatio-temporal self-
attention blocks that could serve as a general backbone for
human motion modeling. The experiments demonstrate that
the above designs enable a versatile human motion represen-
tation that can be transferred to multiple downstream tasks,
outperforming the task-specific state-of-the-art methods.
2. Related Work
Learning Human Motion Representations. Early works
formulate human motion with Hidden Markov Models [53,
108] and graphical models [51,99]. Kanazawa et al. [42]
design a temporal encoder and a hallucinator to learn rep-
resentations of 3D human dynamics. Zhang et al. [132]
predict future 3D dynamics in a self-supervised manner.
Sun et al. [102] further incorporate action labels with an
action memory bank. From the action recognition perspec-
tive, a variety of pretext tasks are designed to learn mo-
tion representations in a self-supervised manner, includ-
ing future prediction [100], jigsaw puzzle [60], skeleton-
contrastive [107], speed change [101], cross-view con-
sistency [62], and contrast-reconstruction [117]. Similar
techniques are also explored in tasks like motion assess-
ment [33,85] and motion retargeting [126,139]. These meth-
ods leverage homogeneous motion data, design correspond-
ing pretext tasks, and apply them to a specific downstream
task. In this work, we propose a unified pretrain-finetune
framework to incorporate heterogeneous data resources and
demonstrate its versatility in various downstream tasks.
3D Human Pose Estimation. Recovering 3D human
poses from monocular RGB videos is a classical problem,
and the methods can be categorized into two categories.
The first is to estimate 3D poses with CNN directly from
images [82,104,136]. However, one limitation of these
approaches is that there is a trade-off between 3D pose
precision and appearance diversity due to current data col-
lection techniques. The second category is to extract the
2D pose first, then lift the estimated 2D pose to 3D with
a separate neural network. The lifting can be achieved via
Fully Connected Network [29,78], Temporal Convolutional
Network (TCN) [22,89], GCN [13,28,116], and Trans-
former [56,94,134,135]. Our framework is built upon the
second category as we use the proposed DSTformer to ac-
complish 2D-to-3D lifting.
Skeleton-based Action Recognition. The pioneering
works [74,115,127] point out the inherent connection be-
tween action recognition and human pose estimation. To-
wards modeling the spatio-temporal relationship among hu-
man joints, previous studies mainly employ LSTM [98,138]
and GCN [21,55,68,96,123]. Most recently, PoseC-
onv3D [32] proposes to apply 3D-CNN on the stacked 2D
joint heatmaps and achieves improved results. In addition to
the fully-supervised action recognition task, NTU-RGB+D-
120 [64] brings attention to the challenging one-shot action
recognition problem. To this end, SL-DML [81] applies deep
metric learning to multi-modal signals. Sabater et al. [92]
explore one-shot recognition in therapy scenarios with a TCN.
We demonstrate that the pretrained motion representations
could generalize well to action recognition tasks, and the
pretrain-finetune framework is a suitable solution for the
one-shot challenges.
Human Mesh Recovery. Based on the parametric human
models such as SMPL [71], many research works [41,75,83,
122,133] focus on regressing the human mesh from a single
image.
Figure 2. Model architecture. (Block diagram: 2D skeletons → FC with spatial and temporal positional encoding → N× dual-stream-fusion modules, each with Spatial/Temporal MHSA branches S1→T1 and T2→S2, Add + Norm + MLP, and Adaptive Fusion with weights αST, αTS → motion representation E → FC → 3D motion X̂.) We propose the Dual-stream Spatio-temporal Transformer (DSTformer) as a general backbone for human motion modeling. DSTformer consists of N dual-stream-fusion modules. Each module contains two branches of spatial or temporal MHSA and MLP. The Spatial MHSA models the connection among different joints within a timestep, while the Temporal MHSA models the movement of one joint.
SPIN [48] additionally incorporates fitting the body model to 2D joints in the training loop. Despite their promis-
ing per-frame results, these methods yield jittery and unsta-
ble results [46,130] when applied to videos. To improve their
temporal coherence, PoseBERT [8] and SmoothNet [130]
propose to apply a denoising and smoothing module to the
single-frame predictions. Several works [24,42,46,106] take
video clips as input to exploit the temporal cues. Another
common problem is that paired images and GT meshes are
mostly captured in constrained scenarios, which limits the
generalization ability of the above methods. To that end,
Pose2Mesh [25] proposes to first extract 2D skeletons using
an off-the-shelf pose estimator, then lift them to 3D mesh
vertices. Our approach is complementary to state-of-the-art
human mesh recovery methods and could further improve
their temporal coherence with the pretrained motion repre-
sentations.
3. Method
3.1. Overview
As discussed in Section 1, our approach consists of two
stages, namely unified pretraining and task-specific fine-
tuning. In the first stage, we train a motion encoder to
accomplish the 2D-to-3D lifting task, where we use the pro-
posed DSTformer as the backbone. In the second stage,
we finetune the pretrained motion encoder and a few new
layers on the downstream tasks. We use 2D skeleton se-
quences as input for both pretraining and finetuning because
they could be reliably extracted from all kinds of motion
sources [3,10,76,86,103], and are more robust to varia-
tions [19,32]. Existing studies have shown the effectiveness
of using 2D skeleton sequences for different downstream
tasks [25,32,89,109]. We will first introduce the architec-
ture of DSTformer, and then describe the training scheme in
detail.
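To make the finetuning recipe concrete, the following sketch shows one way a pretrained motion encoder could be wrapped with a shallow task-specific head. It is a hypothetical PyTorch illustration, not the released implementation; the `FinetuneModel` class, the pooling choice, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class FinetuneModel(nn.Module):
    """Hypothetical wrapper: pretrained motion encoder + shallow task-specific head."""
    def __init__(self, encoder: nn.Module, feat_dim: int, out_dim: int, pool: bool = False):
        super().__init__()
        self.encoder = encoder      # pretrained backbone: (B, T, J, C_in) -> (B, T, J, feat_dim)
        self.pool = pool            # pool over frames/joints for clip-level tasks (e.g. action recognition)
        self.head = nn.Linear(feat_dim, out_dim)  # the simple regression/classification head

    def forward(self, skel_2d: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(skel_2d)        # motion representation: (B, T, J, feat_dim)
        if self.pool:
            feat = feat.mean(dim=(1, 2))    # (B, feat_dim) for one label per clip
        return self.head(feat)              # per-joint outputs or clip-level logits
```

For dense tasks such as 3D pose estimation the per-frame, per-joint features would be kept, while clip-level tasks such as action recognition would pool them before the head.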
3.2. Network Architecture
Figure 2 shows the network architecture for 2D-to-3D lifting. Given an input 2D skeleton sequence $\mathbf{x} \in \mathbb{R}^{T \times J \times C_{\text{in}}}$, we first project it to a high-dimensional feature $\mathbf{F}^{0} \in \mathbb{R}^{T \times J \times C_{f}}$, then add learnable spatial positional encoding $\mathbf{P}^{S}_{\text{pos}} \in \mathbb{R}^{1 \times J \times C_{f}}$ and temporal positional encoding $\mathbf{P}^{T}_{\text{pos}} \in \mathbb{R}^{T \times 1 \times C_{f}}$ to it. We then use the sequence-to-sequence model DSTformer to calculate $\mathbf{F}^{i} \in \mathbb{R}^{T \times J \times C_{f}}$ ($i = 1, \dots, N$), where $N$ is the network depth. We apply a linear layer with tanh activation [30] to $\mathbf{F}^{N}$ to compute the motion representation $\mathbf{E} \in \mathbb{R}^{T \times J \times C_{e}}$. Finally, we apply a linear transformation to $\mathbf{E}$ to estimate the 3D motion $\hat{\mathbf{X}} \in \mathbb{R}^{T \times J \times C_{\text{out}}}$. Here, $T$ denotes the sequence length and $J$ denotes the number of body joints. $C_{\text{in}}$, $C_{f}$, $C_{e}$, and $C_{\text{out}}$ denote the channel numbers of the input, feature, embedding, and output, respectively. We first introduce the basic building blocks of DSTformer, i.e. the Spatial and Temporal Blocks with Multi-Head Self-Attention (MHSA), and then explain the DSTformer architecture design.
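The shape bookkeeping above can be summarized as a short forward pass. The sketch below is illustrative only: `DualStreamBlock` is stubbed with `nn.Identity` (a fuller sketch follows Eq. (5)), and the default channel sizes, sequence length, and depth are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DSTformer(nn.Module):
    """Shape-level skeleton of the 2D-to-3D lifting network (illustrative dimensions)."""
    def __init__(self, c_in=2, c_f=256, c_e=512, c_out=3, T=243, J=17, depth=5):
        super().__init__()
        self.joint_embed = nn.Linear(c_in, c_f)                       # x: (B,T,J,C_in) -> F^0: (B,T,J,C_f)
        self.pos_spatial = nn.Parameter(torch.zeros(1, 1, J, c_f))    # P^S_pos, shared over time
        self.pos_temporal = nn.Parameter(torch.zeros(1, T, 1, c_f))   # P^T_pos, shared over joints
        # Stand-ins for the N dual-stream-fusion modules (a fuller sketch follows Eq. (5)).
        self.blocks = nn.ModuleList(nn.Identity() for _ in range(depth))
        self.to_embedding = nn.Sequential(nn.Linear(c_f, c_e), nn.Tanh())  # motion representation E
        self.head = nn.Linear(c_e, c_out)                             # estimated 3D motion X_hat

    def forward(self, x):                    # x: (B, T, J, C_in), fixed-length sketch
        f = self.joint_embed(x) + self.pos_spatial + self.pos_temporal
        for blk in self.blocks:
            f = blk(f)                       # F^i: (B, T, J, C_f) at every depth
        e = self.to_embedding(f)             # E: (B, T, J, C_e)
        return self.head(e)                  # X_hat: (B, T, J, C_out)
```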
Spatial Block. Spatial MHSA (S-MHSA) aims at modeling the relationship among the joints within the same time step. It is defined as

$$\text{S-MHSA}(\mathbf{Q}_S, \mathbf{K}_S, \mathbf{V}_S) = [\text{head}_1; \dots; \text{head}_h]\,\mathbf{W}_S^{P}, \quad \text{head}_i = \text{softmax}\!\left(\frac{\mathbf{Q}_S^{i} (\mathbf{K}_S^{i})^{\top}}{\sqrt{d_K}}\right)\mathbf{V}_S^{i}, \qquad (1)$$

where $\mathbf{W}_S^{P}$ is a projection parameter matrix, $h$ is the number of heads, $i \in \{1, \dots, h\}$, and $\top$ denotes matrix transpose. We utilize self-attention to get the query $\mathbf{Q}_S$, key $\mathbf{K}_S$, and value $\mathbf{V}_S$ from the input per-frame spatial feature $\mathbf{F}_S \in \mathbb{R}^{J \times C_e}$ for each $\text{head}_i$:

$$\mathbf{Q}_S^{i} = \mathbf{F}_S \mathbf{W}_S^{(Q,i)}, \quad \mathbf{K}_S^{i} = \mathbf{F}_S \mathbf{W}_S^{(K,i)}, \quad \mathbf{V}_S^{i} = \mathbf{F}_S \mathbf{W}_S^{(V,i)}, \qquad (2)$$

where $\mathbf{W}_S^{(Q,i)}$, $\mathbf{W}_S^{(K,i)}$, $\mathbf{W}_S^{(V,i)}$ are projection matrices, and $d_K$ is the feature dimension of $\mathbf{K}_S$. We apply S-MHSA to
features of different time steps in parallel. A residual connection and layer normalization (LayerNorm) are applied to the S-MHSA result, which is further fed into a multilayer perceptron (MLP) followed by another residual connection and LayerNorm, following [112]. We denote the entire spatial block with MHSA, LayerNorm, MLP, and residual connections by $\mathcal{S}$.
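As a concrete reading of Eqs. (1)-(2), the sketch below computes multi-head attention over the $J$ joints of every frame, folding the time axis into the batch so all frames are processed in parallel. It is a hedged PyTorch sketch (packed QKV projection, head dimension $d_K = C/h$), not the authors' code; the residual connections, LayerNorm, and MLP of the full spatial block are omitted here.

```python
import torch
import torch.nn as nn

class SpatialMHSA(nn.Module):
    """Multi-head self-attention over the joints within each frame (Eqs. 1-2); illustrative only."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_k = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)   # packs W_S^(Q,i), W_S^(K,i), W_S^(V,i) for all heads
        self.proj = nn.Linear(dim, dim)      # output projection W_S^P

    def forward(self, f):                    # f: (B, T, J, C)
        B, T, J, C = f.shape
        qkv = self.qkv(f.reshape(B * T, J, C))                       # frames processed in parallel
        q, k, v = qkv.reshape(B * T, J, 3, self.heads, self.d_k).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) / self.d_k ** 0.5           # (B*T, heads, J, J)
        out = attn.softmax(dim=-1) @ v                               # (B*T, heads, J, d_k)
        return self.proj(out.transpose(1, 2).reshape(B, T, J, C))
```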
Temporal Block. Temporal MHSA (T-MHSA) aims at modeling the relationship across time steps for a body joint. Its computation is similar to S-MHSA, except that the MHSA is applied to the per-joint temporal feature $\mathbf{F}_T \in \mathbb{R}^{T \times C_e}$ and parallelized over the spatial dimension:

$$\text{T-MHSA}(\mathbf{Q}_T, \mathbf{K}_T, \mathbf{V}_T) = [\text{head}_1; \dots; \text{head}_h]\,\mathbf{W}_T^{P}, \quad \text{head}_i = \text{softmax}\!\left(\frac{\mathbf{Q}_T^{i} (\mathbf{K}_T^{i})^{\top}}{\sqrt{d_K}}\right)\mathbf{V}_T^{i}, \qquad (3)$$

where $i \in \{1, \dots, h\}$, and $\mathbf{Q}_T$, $\mathbf{K}_T$, $\mathbf{V}_T$ are computed as in Eq. (2). We denote the entire temporal block by $\mathcal{T}$.
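T-MHSA reuses the same attention computation; only the axis changes, with joints folded into the batch and attention running over the $T$ time steps. A minimal sketch, reusing the hypothetical `SpatialMHSA` from the previous snippet:

```python
class TemporalMHSA(SpatialMHSA):
    """Same attention as SpatialMHSA, but applied along time for each joint (Eq. 3)."""
    def forward(self, f):                      # f: (B, T, J, C)
        f_t = f.permute(0, 2, 1, 3)            # (B, J, T, C): joints become the parallel axis
        out = super().forward(f_t)             # attention now mixes the T time steps per joint
        return out.permute(0, 2, 1, 3)         # back to (B, T, J, C)
```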
Dual-stream Spatio-temporal Transformer. Given spatial and temporal MHSA blocks that capture intra-frame and inter-frame body-joint interactions respectively, we assemble these basic building blocks to fuse the spatial and temporal information. We design a dual-stream architec-
information in the flow. We design a dual-stream architec-
ture with the following assumptions: 1) Both streams should
be capable of modeling the comprehensive spatio-temporal
context. 2) Each stream should be specialized in different
spatio-temporal aspects. 3) The two streams should be fused
together, with the fusion weights dynamically balanced de-
pending on the input spatio-temporal characteristics.
Hence, we stack the spatial and temporal MHSA blocks in
different orders, forming two parallel computation branches.
The output features of the two branches are fused using
adaptive weights predicted by an attention regressor. The
dual-stream-fusion module is then repeated $N$ times:

$$\mathbf{F}^{i} = \boldsymbol{\alpha}_{ST}^{i} \circ \mathcal{T}_1^{i}(\mathcal{S}_1^{i}(\mathbf{F}^{i-1})) + \boldsymbol{\alpha}_{TS}^{i} \circ \mathcal{S}_2^{i}(\mathcal{T}_2^{i}(\mathbf{F}^{i-1})), \quad i \in \{1, \dots, N\}, \qquad (4)$$

where $\mathbf{F}^{i}$ denotes the feature embedding at depth $i$, and $\circ$ denotes element-wise multiplication. The orders of the $\mathcal{S}$ and $\mathcal{T}$ blocks are shown in Figure 2, and different blocks do not share weights. The adaptive fusion weights $\boldsymbol{\alpha}_{ST}, \boldsymbol{\alpha}_{TS} \in \mathbb{R}^{N \times T \times J}$ are given by

$$\boldsymbol{\alpha}_{ST}^{i}, \boldsymbol{\alpha}_{TS}^{i} = \text{softmax}\big(\mathbf{W}([\mathcal{T}_1^{i}(\mathcal{S}_1^{i}(\mathbf{F}^{i-1})), \mathcal{S}_2^{i}(\mathcal{T}_2^{i}(\mathbf{F}^{i-1}))])\big), \qquad (5)$$

where $\mathbf{W}$ is a learnable linear transformation and $[\cdot,\cdot]$ denotes concatenation.
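One plausible implementation of Eqs. (4)-(5) is sketched below: two parallel S→T and T→S branches followed by a per-position softmax fusion. It builds on the hypothetical `SpatialMHSA`/`TemporalMHSA` modules above; the `STBlock` wrapper (attention plus MLP with residuals and LayerNorm) and the MLP expansion ratio are assumptions, not the authors' exact block.

```python
import torch
import torch.nn as nn

class STBlock(nn.Module):
    """One S or T block: MHSA + MLP, each with a residual connection and LayerNorm."""
    def __init__(self, dim, attn):
        super().__init__()
        self.attn, self.norm1 = attn, nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, f):
        f = self.norm1(f + self.attn(f))
        return self.norm2(f + self.mlp(f))

class DualStreamBlock(nn.Module):
    """One dual-stream-fusion module (Eqs. 4-5): S->T and T->S branches with adaptive fusion."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.s1, self.t1 = STBlock(dim, SpatialMHSA(dim, heads)), STBlock(dim, TemporalMHSA(dim, heads))
        self.t2, self.s2 = STBlock(dim, TemporalMHSA(dim, heads)), STBlock(dim, SpatialMHSA(dim, heads))
        self.fuse = nn.Linear(dim * 2, 2)      # attention regressor W predicting alpha_ST, alpha_TS

    def forward(self, f):                      # f: (B, T, J, C)
        branch_st = self.t1(self.s1(f))        # spatial-then-temporal stream
        branch_ts = self.s2(self.t2(f))        # temporal-then-spatial stream
        alpha = self.fuse(torch.cat([branch_st, branch_ts], dim=-1)).softmax(dim=-1)  # (B, T, J, 2)
        return alpha[..., :1] * branch_st + alpha[..., 1:] * branch_ts
```

Stacking `DualStreamBlock` $N$ times and adding the embedding and output layers recovers the DSTformer skeleton shown earlier.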
3.3. Unified Pretraining
We address two key challenges when designing the uni-
fied pretraining framework: 1) How to learn a powerful
motion representation with a universal pretext task. 2) How
to utilize large-scale but heterogeneous human motion data
in all kinds of formats.
For the first challenge, we follow the successful practices
in language [12,30,90] and vision [7,36] modeling to con-
struct the supervision signals, i.e. mask part of the input and
use the encoded representations to reconstruct the whole
input. Note that such a “cloze” task naturally exists in human motion analysis: recovering the lost depth information from 2D visual observations, i.e. 3D human pose estimation. Inspired by this, we leverage the large-scale 3D mocap
data [76] and design a 2D-to-3D lifting pretext task. We first
extract the 2D skeleton sequences $\mathbf{x}$ by projecting the 3D motion orthographically. Then, we corrupt $\mathbf{x}$ by randomly masking and adding noise to produce the corrupted 2D skeleton sequences, which also resemble 2D detection results as they contain occlusions, detection failures, and errors. Both joint-level and frame-level masks are applied with certain
probabilities. We use the aforementioned motion encoder to get the motion representation $\mathbf{E}$ and reconstruct the 3D motion $\hat{\mathbf{X}}$. We then compute the joint loss $\mathcal{L}_{\text{3D}}$ between $\hat{\mathbf{X}}$ and the GT 3D motion $\mathbf{X}$. We also add the velocity loss $\mathcal{L}_{\text{O}}$ following previous works [89,134]. The 3D reconstruction losses are thus given by

$$\mathcal{L}_{\text{3D}} = \sum_{t=1}^{T} \sum_{j=1}^{J} \lVert \hat{\mathbf{X}}_{t,j} - \mathbf{X}_{t,j} \rVert_{2}, \qquad \mathcal{L}_{\text{O}} = \sum_{t=2}^{T} \sum_{j=1}^{J} \lVert \hat{\mathbf{O}}_{t,j} - \mathbf{O}_{t,j} \rVert_{2}, \qquad (6)$$

where $\hat{\mathbf{O}}_{t} = \hat{\mathbf{X}}_{t} - \hat{\mathbf{X}}_{t-1}$ and $\mathbf{O}_{t} = \mathbf{X}_{t} - \mathbf{X}_{t-1}$.
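The corruption step and the 3D losses can be sketched as follows. The masking probabilities, noise scale, and the use of a mean instead of the sums in Eq. (6) are illustrative choices, not the paper's hyperparameters.

```python
import torch

def corrupt_2d(x2d, p_joint=0.15, p_frame=0.05, noise_std=0.02):
    """Randomly mask joints/frames and add Gaussian noise to a 2D sequence x2d of shape (T, J, 2)."""
    x = x2d + noise_std * torch.randn_like(x2d)          # simulated detection noise
    T, J, _ = x2d.shape
    joint_mask = torch.rand(T, J, 1) < p_joint           # joint-level masking
    frame_mask = torch.rand(T, 1, 1) < p_frame           # frame-level masking
    return x.masked_fill(joint_mask | frame_mask, 0.0)   # masked joints are zeroed out here

def loss_3d(x3d_hat, x3d):
    """Eq. (6): per-joint position loss and velocity loss on frame differences (mean-reduced)."""
    pos = (x3d_hat - x3d).norm(dim=-1).mean()
    vel = ((x3d_hat[1:] - x3d_hat[:-1]) - (x3d[1:] - x3d[:-1])).norm(dim=-1).mean()
    return pos, vel
```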
For the second challenge, we notice that 2D skeletons
could serve as a universal medium as they can be extracted
from all sorts of motion data sources. We further incorporate
in-the-wild RGB videos into the 2D-to-3D lifting framework
for unified pretraining. For RGB videos, the 2D skeletons $\mathbf{x}$ could be given by manual annotation [3] or 2D pose estimators [14,103], and the depth channel of the extracted 2D skeletons is intrinsically “masked”. Similarly, we add extra masks and noise to degrade $\mathbf{x}$ (if $\mathbf{x}$ already contains detection noise, only masking is applied). As the 3D motion GT $\mathbf{X}$ is not available for these data, we apply a weighted 2D re-projection loss, calculated as

$$\mathcal{L}_{\text{2D}} = \sum_{t=1}^{T} \sum_{j=1}^{J} \delta_{t,j} \lVert \hat{\mathbf{x}}_{t,j} - \mathbf{x}_{t,j} \rVert_{2}, \qquad (7)$$

where $\hat{\mathbf{x}}$ is the 2D orthographic projection of the estimated 3D motion $\hat{\mathbf{X}}$, and $\delta \in \mathbb{R}^{T \times J}$ is given by the visibility annotation or the 2D detection confidence.
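A minimal sketch of Eq. (7), assuming the orthographic projection simply drops the depth coordinate of the estimated 3D motion; the function name and tensor layout are hypothetical.

```python
def loss_2d_reprojection(x3d_hat, x2d, conf):
    """Eq. (7): confidence-weighted 2D reprojection loss (mean-reduced).
    x3d_hat: (T, J, 3) estimated 3D motion; x2d: (T, J, 2) input 2D skeletons;
    conf: (T, J) visibility annotation or 2D detector confidence (delta)."""
    x2d_hat = x3d_hat[..., :2]                           # orthographic projection: drop depth
    return (conf * (x2d_hat - x2d).norm(dim=-1)).mean()
```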
The total pretraining loss is computed by

$$\mathcal{L} = \underbrace{\mathcal{L}_{\text{3D}} + \lambda_{\text{O}} \mathcal{L}_{\text{O}}}_{\text{for 3D data}} + \underbrace{\mathcal{L}_{\text{2D}}}_{\text{for 2D data}}, \qquad (8)$$

where $\lambda_{\text{O}}$ is a constant coefficient to balance the losses.
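Putting the pieces together, one training step could mix a mocap sample (supervised with Eq. (6)) and an in-the-wild sample (supervised with Eq. (7)), reusing the hypothetical helpers sketched above. The batch routing and the value of lambda_o below are assumptions, not the paper's settings.

```python
def pretrain_step(model, sample_3d, sample_2d, lambda_o=20.0):
    """One mixed update combining the terms of Eq. (8); batching and lambda_o are illustrative."""
    # Mocap branch: the GT 3D motion is projected to 2D beforehand, then corrupted and lifted back.
    x_in = corrupt_2d(sample_3d["x2d"]).unsqueeze(0)                  # (1, T, J, 2)
    pos, vel = loss_3d(model(x_in)[0], sample_3d["x3d"])
    # RGB branch: detected 2D skeletons are already noisy, so only masking is applied here.
    y_in = corrupt_2d(sample_2d["x2d"], noise_std=0.0).unsqueeze(0)
    l2d = loss_2d_reprojection(model(y_in)[0], sample_2d["x2d"], sample_2d["conf"])
    return pos + lambda_o * vel + l2d                                 # Eq. (8)
```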