suffer from the above inherent limitation and might restrict
the flexibility of the model. An ideal solution would circumvent
this problem by starting directly from the latents without any
training, and would use appropriate constraints to explore more
plausible motions.
In this paper, we propose such a solution, named MotionDiff,
based on Denoising Diffusion Probabilistic Models (DDPM),
which are inspired by the diffusion process in nonequilibrium
thermodynamics. As a future human pose sequence is com-
posed of a set of 3D joint locations that satisfy the kinematic
constraints, we regard these locations as particles in a
thermodynamic system in contact with a heat bath. In this light,
the particles evolve stochastically, progressively diffusing from
their original states (i.e., the kinematics of human joints) to a
noise distribution (i.e., chaotic positions).
This offers an alternative way to obtain the “whitened” la-
tents without any training process, which naturally avoids
posterior collapse. Meanwhile, in contrast to previous methods
that require extra sampling encoders to obtain diversity, a
unique strength of MotionDiff is that it is inherently diverse,
because the diffusion process injects fresh noise into the human
motion at each time step.
Our high-level idea is to learn the reverse diffusion process,
which recovers the target realistic pose sequences from the
noisy distribution conditioned on the observed past motion
(see Figure 1). This process can be formulated as a Markov
chain, and allows us to use a simple mean squared error loss
function to optimize the variational lower bound. Nonethe-
less, directly extending the diffusion model to stochastic hu-
man motion prediction results in two key challenges that
arise from the following observations: First, since the kinematic
relations between local joint coordinates are completely
destroyed in the diffusion process, and a certain number of
reverse diffusion steps is required, it is necessary to devise an
expressive yet efficient decoder
to construct such relations; Second, since we do not explicitly
guide the future motion generation with any loss, MotionDiff may
produce realistic predictions that deviate substantially from the
ground truth, which makes quantitative evaluation challenging
(Lugmayr et al. 2022).
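The forward corruption and the simplified MSE objective described above can be sketched as follows. This is only a minimal illustration of the standard DDPM closed-form forward process on joint coordinates; the linear schedule, step count, and tensor shapes (25 frames, 17 joints, 3D) are placeholder assumptions, not the paper's actual settings:

```python
import numpy as np

# Linear noise schedule (illustrative values, not the paper's hyperparameters).
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # decreases toward 0 as t grows


def q_sample(x0, t, noise):
    """Forward diffusion: corrupt clean joint coordinates x0 to step t
    using the closed form q(x_t | x_0) = N(sqrt(a_bar_t) x_0, (1 - a_bar_t) I)."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise


def mse_loss(eps_pred, noise):
    """Simple mean squared error on the predicted noise, i.e. the
    simplified variational-lower-bound objective used to train DDPMs."""
    return np.mean((eps_pred - noise) ** 2)


# Toy future pose sequence: 25 frames x 17 joints x 3D coordinates.
x0 = np.random.randn(25, 17, 3)
noise = np.random.randn(*x0.shape)
x_t = q_sample(x0, t=50, noise=noise)
```

In training, a network would predict the noise from `x_t` (conditioned on the observed past motion) and `mse_loss` would compare that prediction to `noise`; sampling then runs the learned reverse Markov chain from pure noise back to a pose sequence.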
To tackle the first problem, we carefully design an efficient
spatial-temporal transformer-based architecture as the core
decoder of MotionDiff. Instead of performing simple pose
embedding (Bouazizi et al. 2022), we devise a spatial transformer
module to encode joint embeddings, in which the local
relationships between the 3D joints in
each frame can be better investigated. We then capture the global
dependencies across frames with a temporal transformer module.
This architecture differs from the autoregressive model (Aksan
et al. 2021), which interleaves spatial and temporal modeling at
a tremendous computational cost. For
the second issue, we further employ a Graph Convolutional
Network (GCN) to refine diverse pose sequences generated
from the decoder with the help of the observed past motion.
By introducing the GCN losses, our refined predictions become
significantly closer to the ground truth while remaining diverse
and realistic. The contributions of our work are summarized as
follows:
• We propose a novel stochastic human motion predic-
tion framework with human joint kinematics diffusion-
refinement, which incorporates a new noise at each dif-
fusion step to get inherent diversity.
• We design a spatial-temporal transformer-based architec-
ture for the proposed framework to encode local kine-
matic information in each frame as well as global tempo-
ral dependencies across frames.
• Extensive experiments show that our model achieves
state-of-the-art performance on both Human3.6M and
HumanEva-I datasets.
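As a rough illustration of the GCN-based refinement idea above, the sketch below shows a single graph-convolution layer over a joint graph. The chain graph, feature sizes, normalization, and ReLU are illustrative assumptions, not the paper's actual refinement network:

```python
import numpy as np


def gcn_layer(x, adj, w):
    """One graph-convolution layer over the joint graph: aggregate each
    joint's neighbors via a row-normalized adjacency, then apply a shared
    linear map and ReLU.
    x: (joints, d_in), adj: (joints, joints), w: (d_in, d_out)."""
    deg = adj.sum(axis=1, keepdims=True)     # node degrees
    a_norm = adj / np.maximum(deg, 1e-8)     # row-normalized adjacency
    return np.maximum(a_norm @ x @ w, 0.0)   # aggregate, transform, ReLU


# Toy skeleton: 4 joints on a chain (0-1-2-3), with self-loops.
adj = np.eye(4)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

x = np.random.randn(4, 8)    # per-joint input features
w = np.random.randn(8, 16)   # shared weight matrix
y = gcn_layer(x, adj, w)
```

Stacking such layers lets information about the observed past motion propagate along the skeleton, nudging each diverse sample toward kinematically consistent, ground-truth-like poses.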
Related Work
Deterministic Human Motion Prediction
Given the observed past motion, deterministic HMP aims at
producing only one output, and thus can be regarded as a
regression task. Most existing methods (Fragkiadaki et al.
2015a; Martinez, Black, and Romero 2017; Liu et al. 2022)
exploit Recurrent Neural Networks (RNNs) to address this
problem due to their strength in modeling sequential data.
However, these methods usually suffer from first-frame
discontinuity and error accumulation, especially
for long-term prediction. Recent works (Mao et al. 2019; Li
et al. 2020; Dang et al. 2021) propose Graph Convolutional
Networks (GCN) to model the joint dependencies of human motion.
Motivated by the significant success of the Transformer (Vaswani
et al. 2017), (Cai et al. 2020) adapt it to the discrete cosine
transform coefficients extracted from the observed motion. To
learn more expressive representations, (Aksan et al. 2021)
propose to aggregate spatial and temporal
information directly from the data by leveraging the recursive
nature of human motion. However, this autoregressive and
computationally heavy design is not appropriate as the decoder of
MotionDiff, because the diffusion model is non-autoregressive and
requires a certain number of reverse diffusion steps (i.e.,
decoding), which are essential for generating high-quality
predictions. Therefore, we present an efficient
architecture that separates spatial and temporal information
like (Sofianos et al. 2021; Zheng et al. 2021).
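A minimal sketch of such factorized spatial-temporal attention is shown below, with a single head and no learned projections; all names and shapes are illustrative assumptions rather than the actual architecture:

```python
import numpy as np


def attention(x):
    """Plain scaled dot-product self-attention over the tokens in x.
    x: (n_tokens, d). Query/key/value projections are omitted for brevity."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # softmax rows
    return w @ x


def spatial_temporal_block(x):
    """Factorized attention: joints attend within each frame (spatial),
    then each joint attends across frames (temporal). x: (frames, joints, d)."""
    frames, joints, _ = x.shape
    # Spatial: one attention per frame, over its joints.
    x = np.stack([attention(x[f]) for f in range(frames)])
    # Temporal: one attention per joint, over all frames.
    x = np.stack([attention(x[:, j]) for j in range(joints)], axis=1)
    return x


x = np.random.randn(25, 17, 32)  # 25 frames, 17 joints, feature dim 32
y = spatial_temporal_block(x)
```

The payoff of the factorization is cost: with F frames and J joints, attending over all frame-joint tokens at once scales as O((F·J)^2), whereas the separated spatial and temporal passes scale as O(F·J^2 + J·F^2).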
Stochastic Human Motion Prediction
Due to the diversity of human behaviors, many stochastic HMP
methods have been proposed to model the multi-modal data
distribution. These methods are mainly based on deep
generative models (Walker et al. 2017; Kundu, Gor, and
Babu 2019; Mao, Liu, and Salzmann 2021; Ma et al. 2022),
such as GAN (Goodfellow et al. 2014) and VAE (Kingma
and Welling 2013). For GAN-based methods, (Barsoum,
Kender, and Liu 2018) develop an HP-GAN framework that models
diversity by combining a random vector with the embedding state
at test time; (Kundu, Gor, and Babu 2019) exploit the
discriminator to regress the random vector, which is then fed
into the generator to obtain diversity. However, these methods
involve complex adversarial learning between the generator and
the discriminator, resulting in unstable training. For VAE-based
methods, although such likelihood-based methods can estimate the
data distribution well, they require additional networks to
sample a set of latent variables and fail to capture some minor
modes. To alleviate