suffer from the above inherent limitation and might restrict
the flexibility of the model. An ideal solution would circumvent
this problem by starting directly from the latents without any
training, and would use appropriate constraints to explore more
plausible motions.
In this paper, we propose such a solution, named MotionDiff,
based on Denoising Diffusion Probabilistic Models (DDPM),
which are inspired by the diffusion process in nonequilibrium
thermodynamics. As a future human pose sequence is com-
posed of a set of 3D joint locations that satisfy the kinematic
constraints, we regard these locations as particles in a
thermodynamic system in contact with a heat bath. In this light,
the particles evolve stochastically, progressively diffusing from
their original states (i.e., the kinematics of human joints) to a
noise distribution (i.e., chaotic positions).
This offers an alternative way to obtain the “whitened” la-
tents without any training process, which naturally avoids
posterior collapse. Meanwhile, in contrast to previous methods
that require extra sampling encoders to obtain diversity, a
unique strength of MotionDiff is that it is inherently diverse,
because the diffusion process injects fresh noise into the human
motion at each time step.
Our high-level idea is to learn the reverse diffusion process,
which recovers the target realistic pose sequences from the
noisy distribution conditioned on the observed past motion
(see Figure 1). This process can be formulated as a Markov
chain, and allows us to use a simple mean squared error loss
function to optimize the variational lower bound. Nonethe-
less, directly extending the diffusion model to stochastic hu-
man motion prediction results in two key challenges that
arise from the following observations: First, since the kinematic
relations between local joint coordinates are completely
destroyed in the diffusion process, and a certain number of
reverse diffusion steps is required, it is necessary to devise an
expressive yet efficient decoder
to construct such relations; Second, since we do not explicitly
guide the future motion generation with any loss, MotionDiff may
produce realistic predictions that deviate substantially from the
ground truth, which makes quantitative evaluation challenging
(Lugmayr et al. 2022).
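The forward corruption and the simplified MSE objective described above can be sketched as follows. This is only a minimal illustration of the standard DDPM closed-form forward process on joint coordinates; the linear schedule, step count, and tensor shapes (25 frames, 17 joints, 3D) are placeholder assumptions, not the paper's actual settings:

```python
import numpy as np

# Linear noise schedule (illustrative values, not the paper's hyperparameters).
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # decreases toward 0 as t grows


def q_sample(x0, t, noise):
    """Forward diffusion: corrupt clean joint coordinates x0 to step t
    using the closed form q(x_t | x_0) = N(sqrt(a_bar_t) x_0, (1 - a_bar_t) I)."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise


def mse_loss(eps_pred, noise):
    """Simple mean squared error on the predicted noise, i.e. the
    simplified variational-lower-bound objective used to train DDPMs."""
    return np.mean((eps_pred - noise) ** 2)


# Toy future pose sequence: 25 frames x 17 joints x 3D coordinates.
x0 = np.random.randn(25, 17, 3)
noise = np.random.randn(*x0.shape)
x_t = q_sample(x0, t=50, noise=noise)
```

In training, a network would predict the noise from `x_t` (conditioned on the observed past motion) and `mse_loss` would compare that prediction to `noise`; sampling then runs the learned reverse Markov chain from pure noise back to a pose sequence.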
To tackle the first problem, we carefully design an efficient
spatial-temporal transformer-based architecture as the core
decoder of MotionDiff. Instead of performing simple pose
embedding (Bouazizi et al. 2022), we devise a spatial transformer
module to encode joint embeddings, in which the local
relationships between the 3D joints in
each frame can be better investigated. We then capture the global
dependencies across frames with a temporal transformer module.
This architecture differs from the autoregressive model (Aksan
et al. 2021), which interleaves spatial and temporal modeling at
a tremendous computational cost. For
the second issue, we further employ a Graph Convolutional
Network (GCN) to refine diverse pose sequences generated
from the decoder with the help of the observed past motion.
By introducing the GCN losses, our refined predictions become
significantly closer to the ground truth while remaining diverse
and realistic. The contributions of our work are summarized as
follows:
• We propose a novel stochastic human motion predic-
tion framework with human joint kinematics diffusion-
refinement, which incorporates a new noise at each dif-
fusion step to get inherent diversity.
• We design a spatial-temporal transformer-based architec-
ture for the proposed framework to encode local kine-
matic information in each frame as well as global tempo-
ral dependencies across frames.
• Extensive experiments show that our model achieves
state-of-the-art performance on both Human3.6M and
HumanEva-I datasets.
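As a rough illustration of the GCN-based refinement idea above, the sketch below shows a single graph-convolution layer over a joint graph. The chain graph, feature sizes, normalization, and ReLU are illustrative assumptions, not the paper's actual refinement network:

```python
import numpy as np


def gcn_layer(x, adj, w):
    """One graph-convolution layer over the joint graph: aggregate each
    joint's neighbors via a row-normalized adjacency, then apply a shared
    linear map and ReLU.
    x: (joints, d_in), adj: (joints, joints), w: (d_in, d_out)."""
    deg = adj.sum(axis=1, keepdims=True)     # node degrees
    a_norm = adj / np.maximum(deg, 1e-8)     # row-normalized adjacency
    return np.maximum(a_norm @ x @ w, 0.0)   # aggregate, transform, ReLU


# Toy skeleton: 4 joints on a chain (0-1-2-3), with self-loops.
adj = np.eye(4)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

x = np.random.randn(4, 8)    # per-joint input features
w = np.random.randn(8, 16)   # shared weight matrix
y = gcn_layer(x, adj, w)
```

Stacking such layers lets information about the observed past motion propagate along the skeleton, nudging each diverse sample toward kinematically consistent, ground-truth-like poses.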
Related Work
Deterministic Human Motion Prediction
Given the observed past motion, deterministic HMP aims at
producing only one output, and thus can be regarded as a
regression task. Most existing methods (Fragkiadaki et al.
2015a; Martinez, Black, and Romero 2017; Liu et al. 2022)
exploit Recurrent Neural Networks (RNNs) to address this
problem due to their strength in modeling sequential data.
However, these methods usually suffer from first-frame
discontinuity and error accumulation, especially
for long-term prediction. Recent works (Mao et al. 2019; Li
et al. 2020; Dang et al. 2021) propose Graph Convolutional
Networks (GCN) to model the joint dependencies of human motion.
Motivated by the significant success of the Transformer (Vaswani
et al. 2017), (Cai et al. 2020) adapt it to the discrete cosine
transform coefficients extracted from the observed motion. To
learn more expressive representations, (Aksan et al. 2021)
propose to aggregate spatial and temporal
information directly from the data by leveraging the recursive
nature of human motion. However, this autoregressive and
computationally heavy design is not appropriate as the decoder of
MotionDiff, because the diffusion model is non-autoregressive and
requires a certain number of reverse diffusion steps (i.e.,
decoding), which are essential for generating high-quality
predictions. Therefore, we present an efficient
architecture that separates spatial and temporal information
like (Sofianos et al. 2021; Zheng et al. 2021).
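A minimal sketch of such factorized spatial-temporal attention is shown below, with a single head and no learned projections; all names and shapes are illustrative assumptions rather than the actual architecture:

```python
import numpy as np


def attention(x):
    """Plain scaled dot-product self-attention over the tokens in x.
    x: (n_tokens, d). Query/key/value projections are omitted for brevity."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # softmax rows
    return w @ x


def spatial_temporal_block(x):
    """Factorized attention: joints attend within each frame (spatial),
    then each joint attends across frames (temporal). x: (frames, joints, d)."""
    frames, joints, _ = x.shape
    # Spatial: one attention per frame, over its joints.
    x = np.stack([attention(x[f]) for f in range(frames)])
    # Temporal: one attention per joint, over all frames.
    x = np.stack([attention(x[:, j]) for j in range(joints)], axis=1)
    return x


x = np.random.randn(25, 17, 32)  # 25 frames, 17 joints, feature dim 32
y = spatial_temporal_block(x)
```

The payoff of the factorization is cost: with F frames and J joints, attending over all frame-joint tokens at once scales as O((F·J)^2), whereas the separated spatial and temporal passes scale as O(F·J^2 + J·F^2).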
Stochastic Human Motion Prediction
Due to the diversity of human behaviors, many stochastic HMP
methods have been proposed to model the multi-modal data
distribution. These methods are mainly based on deep
generative models (Walker et al. 2017; Kundu, Gor, and
Babu 2019; Mao, Liu, and Salzmann 2021; Ma et al. 2022),
such as GAN (Goodfellow et al. 2014) and VAE (Kingma
and Welling 2013). For GAN-based methods, (Barsoum,
Kender, and Liu 2018) develop an HP-GAN framework that models
diversity by combining a random vector with the embedding state
at test time; (Kundu, Gor, and Babu 2019) exploit the
discriminator to regress the random vector, which is then fed
into the generator to obtain diversity. However, these methods
involve complex adversarial learning between the generator and
the discriminator, resulting in unstable training. For VAE-based
methods, although such likelihood-based methods can estimate the
data distribution well, they require additional networks to
sample a set of latent variables and fail to capture some minor
modes. To alleviate