
Human Joint Kinematics Diffusion-Refinement for Stochastic Motion Prediction
Dong Wei,1 Huaijiang Sun,1 Bin Li,2 Jianfeng Lu,1 Weiqing Li,1 Xiaoning Sun,1 Shengxiang Hu,1
1Nanjing University of Science and Technology
2Tianjin AiForward Science and Technology Co., Ltd.
Abstract
Stochastic human motion prediction aims to forecast multiple plausible future motions given a single pose sequence from the past. Most previous works focus on designing elaborate losses to improve accuracy, while diversity is typically obtained by randomly sampling a set of latent variables from the latent prior, which are then decoded into possible motions. This joint training of sampling and decoding, however, suffers from posterior collapse, as the learned latent variables tend to be ignored by a strong decoder, leading to limited diversity. Alternatively, inspired by the diffusion process in nonequilibrium thermodynamics, we propose MotionDiff, a diffusion probabilistic model that treats the kinematics of human joints as heated particles, which diffuse from their original states to a noise distribution. This process offers a natural way to obtain the “whitened” latents without any trainable parameters, and human motion prediction can be regarded as the reverse diffusion process that converts the noise distribution into realistic future motions conditioned on the observed sequence. Specifically, MotionDiff consists of two parts: a spatial-temporal transformer-based diffusion network that generates diverse yet plausible motions, and a graph convolutional network that further refines the outputs. Experimental results on two datasets demonstrate that our model yields competitive performance in terms of both accuracy and diversity.
Introduction
Human Motion Prediction (HMP) has received increasing attention due to its broad applications such as human-robot interaction (Bajcsy et al. 2021), autonomous driving (Paden et al. 2016) and animation production (Park et al. 2019). The ability to perform such predictions allows robots to understand the future plans of human beings, which is critical for cooperating safely and reasonably with people. While encouraging results have been achieved in previous works (Aksan, Kaufmann, and Hilliges 2019; Mao et al. 2019; Zhong et al. 2022), they neglect the fact that uncertainty and stochasticity are intrinsic properties of human motion. Given a single past observation, predicting multiple possible future sequences rather than only one output is gaining in popularity. Deterministic HMP, which produces a single output and is mostly based on recurrent neural networks or graph convolutional networks, cannot capture such stochastic behaviors. How to generate accurate human motion predictions while fully accounting for diversity remains a challenging problem.
[Figure 1 panel labels: Past Motion, Future Motion, Condition, Diffusion Process, Reverse Diffusion Process]
Figure 1: Visualization of the diffusion and reverse diffusion processes of MotionDiff. In the diffusion process, new noise is gradually added until the kinematic information is completely destroyed. In contrast, the reverse diffusion process recovers the desired realistic future motion from the noise distribution, conditioned on the observation, via a Markov chain.
Recently, deep generative networks such as Generative Adversarial Networks (GAN) and Variational AutoEncoders (VAE) have made significant progress in modeling the multi-modal data distribution (Barsoum, Kender, and Liu 2018; Yan et al. 2018; Aliakbarian et al. 2020). Most of them obtain diversity by randomly sampling a set of latent variables from the latent prior, which requires additional neural networks for training (i.e., the discriminator in GAN or the sampling encoder in VAE). This process, however, brings about training instability or posterior collapse when jointly trained with a powerful decoder (McCarthy et al. 2020). Unfortunately, in the particular case of human motion prediction, a sufficiently high-capacity decoder is indispensable to keep the predictions physically plausible. As a consequence, such a decoder tends to model the conditional density directly, which allows the network to learn to ignore the stochastic latent variables and thus limits the diversity of future motions. To increase diversity, recent works on stochastic human motion prediction (Aliakbarian et al. 2020; Yuan and Kitani 2020; Zhang, Black, and Tang 2021; Ma et al. 2022) add constraints, such as stochastic conditioning schemes or new losses, to force the model to take the noise into account. While these methods indeed yield high diversity, they still suffer from the above inherent limitation and might restrict the flexibility of the model. An ideal solution would be to start directly from latents that require no training, thereby circumventing this problem, and to use appropriate constraints to explore more plausible motions.
arXiv:2210.05976v2 [cs.CV] 28 Nov 2022
In this paper, we propose such a solution, named MotionDiff, based on Denoising Diffusion Probabilistic Models (DDPM), which are inspired by the diffusion process in nonequilibrium thermodynamics. As a future human pose sequence is composed of a set of 3D joint locations that satisfy kinematic constraints, we regard these locations as particles in a thermodynamic system in contact with a heat bath. In this light, the particles evolve stochastically: they progressively diffuse from their original states (i.e., kinematics of human joints) to a noise distribution (i.e., chaotic positions). This offers an alternative way to obtain the “whitened” latents without any training process, which naturally avoids posterior collapse. Meanwhile, contrary to previous methods that require extra sampling encoders to obtain diversity, a unique strength of MotionDiff is that it is inherently diverse, because the diffusion process adds new noise to the human motion at each time step. Our high-level idea is to learn the reverse diffusion process, which recovers the target realistic pose sequences from the noise distribution conditioned on the observed past motion (see Figure 1). This process can be formulated as a Markov chain, which allows us to use a simple mean squared error loss function to optimize the variational lower bound. Nonetheless, directly extending the diffusion model to stochastic human motion prediction raises two key challenges. First, since the kinematic information between local joint coordinates is completely destroyed in the diffusion process, and the reverse diffusion process requires a certain number of steps, it is necessary to devise an expressive yet efficient decoder to reconstruct such relations. Second, as we do not explicitly guide the future motion generation with any loss, MotionDiff produces realistic predictions that can differ substantially from the ground truth, which makes quantitative evaluation challenging (Lugmayr et al. 2022).
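The forward diffusion described above can be illustrated with a minimal sketch in plain Python. It uses the standard DDPM closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, which needs no trainable parameters. The linear noise schedule, the number of steps, the 17-joint pose and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule, not the paper's)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def diffuse(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) for a flat list of joint coordinates.

    Returns the noisy coordinates and the Gaussian noise that was added.
    In DDPM-style training the noise is the regression target of the
    denoiser, which is why a mean squared error loss suffices.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


def mse(pred, target):
    """Mean squared error between two equally long lists."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_betas(1000))
pose = [0.1 * i for i in range(51)]  # hypothetical pose: 17 joints x 3 coordinates
x_early, _ = diffuse(pose, 0, alpha_bars, rng)    # almost unchanged
x_late, eps = diffuse(pose, 999, alpha_bars, rng)  # close to pure noise
```

At the first step the pose is barely perturbed, while at the last step alpha_bar is close to zero, so the sample is dominated by noise. This mirrors the "whitened" latents the text refers to. A learned reverse process would predict `eps` from `x_late`, the step index and the observed past motion.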
To this end, we design an efficient spatial-temporal transformer-based architecture as the core decoder of MotionDiff to tackle the first problem. Instead of performing simple pose embedding (Bouazizi et al. 2022), we devise a spatial transformer module to encode joint embeddings, in which local relationships between the 3D joints in each frame can be better captured. We then capture the global dependencies across frames with a temporal transformer module. This architecture differs from the autoregressive model (Aksan et al. 2021), which interleaves spatial and temporal modeling at a heavy computational cost. For the second issue, we further employ a Graph Convolutional Network (GCN) to refine the diverse pose sequences generated by the decoder with the help of the observed past motion. By introducing the losses of the GCN, the refined predictions move significantly closer to the ground truth while remaining diverse and realistic. The contributions of our work are summarized as follows:
• We propose a novel stochastic human motion prediction framework with human joint kinematics diffusion-refinement, which adds new noise at each diffusion step to obtain inherent diversity.
• We design a spatial-temporal transformer-based architecture for the proposed framework to encode local kinematic information in each frame as well as global temporal dependencies across frames.
• Extensive experiments show that our model achieves state-of-the-art performance on both the Human3.6M and HumanEva-I datasets.
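The separation of spatial and temporal modeling described above can be sketched in plain Python as two attention stages applied one after the other. This is only an illustration of the data flow, not the authors' architecture: the attention here has no learned projections, heads, or positional encodings, and the function names and tensor layout are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Parameter-free scaled dot-product self-attention over a list of vectors."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out


def spatial_then_temporal(motion):
    """motion[t][j] is the feature vector of joint j in frame t (assumed layout)."""
    # Spatial stage: joints attend to each other within each frame.
    spatial = [self_attention(frame) for frame in motion]
    # Temporal stage: each frame becomes one token, and frames attend to each other.
    frame_tokens = [[c for joint in frame for c in joint] for frame in spatial]
    return self_attention(frame_tokens)


# Hypothetical input: 4 frames, 3 joints, 3D coordinates per joint.
motion = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]] for _ in range(4)]
features = spatial_then_temporal(motion)
```

In this sketch, with T frames and J joints the spatial stage compares J x J joint pairs in each of the T frames and the temporal stage compares T x T frame pairs, rather than attending over all T*J tokens jointly.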
Related Work
Deterministic Human Motion Prediction
Given the observed past motion, deterministic HMP aims at producing only one output, and thus can be regarded as a regression task. Most existing methods (Fragkiadaki et al. 2015a; Martinez, Black, and Romero 2017; Liu et al. 2022) exploit Recurrent Neural Networks (RNN) to address this problem due to their strength in modeling sequential data. However, these methods usually suffer from first-frame discontinuity and error accumulation, especially for long-term prediction. Recent works (Mao et al. 2019; Li et al. 2020; Dang et al. 2021) propose Graph Convolutional Networks (GCN) to model the joint dependencies of human motion. Motivated by the significant success of the Transformer (Vaswani et al. 2017), (Cai et al. 2020) apply it to the discrete cosine transform coefficients extracted from the observed motion. To learn better representations, (Aksan et al. 2021) propose to aggregate spatial and temporal information directly from the data by leveraging the recursive nature of human motion. However, this autoregressive and computationally heavy design is not appropriate for the decoder of MotionDiff, because the diffusion model is non-autoregressive and requires a certain number of reverse diffusion steps (i.e., decoding), which are valuable for generating high-quality predictions. Therefore, we present an efficient architecture that separates spatial and temporal information, like (Sofianos et al. 2021; Zheng et al. 2021).
Stochastic Human Motion Prediction
Due to the diversity of human behaviors, many stochastic HMP methods have been proposed to model the multi-modal data distribution. These methods are mainly based on deep generative models (Walker et al. 2017; Kundu, Gor, and Babu 2019; Mao, Liu, and Salzmann 2021; Ma et al. 2022), such as GAN (Goodfellow et al. 2014) and VAE (Kingma and Welling 2013). For GAN-based methods, (Barsoum, Kender, and Liu 2018) develop the HP-GAN framework, which models diversity by combining a random vector with the embedding state at test time; (Kundu, Gor, and Babu 2019) exploit the discriminator to regress the random vector and then feed it into the generator to obtain diversity. However, these methods involve complex adversarial learning between the generator and discriminator, resulting in unstable training. For VAE-based methods, although such likelihood-based methods can estimate the data well, they require additional networks to sample a set of latent variables and fail to sample some minor modes. To alleviate