DIFFUSION MOTION: GENERATE TEXT-GUIDED 3D HUMAN MOTION BY DIFFUSION
MODEL
Zhiyuan Ren
Michigan State University
East Lansing, MI 48824, USA
Zhihong Pan, Xin Zhou, Le Kang
Baidu Research (USA)
Sunnyvale, CA, 94089, USA
ABSTRACT
We propose a simple and novel method for generating 3D hu-
man motion from complex natural language sentences, which
describe different velocities, directions and compositions of
all kinds of actions. Different from existing methods that use
classical generative architecture, we apply the Denoising Dif-
fusion Probabilistic Model to this task, synthesizing diverse
motion results under the guidance of texts. The diffusion
model converts white noise into structured 3D motion by a
Markov process with a series of denoising steps and is effi-
ciently trained by optimizing a variational lower bound. To
achieve the goal of text-conditioned motion synthesis, we use
the classifier-free guidance strategy to inject text embeddings
into the model during training. Our experiments demonstrate
that our model achieves competitive results on HumanML3D
test set quantitatively and can generate more visually natural
and diverse examples. We also show experimentally that
our model is capable of zero-shot generation of motions for
unseen text guidance.
Index Terms—Diffusion Model, 3D motion generation,
Multi-modalities
1. INTRODUCTION
Generating 3D human motion from natural language sen-
tences is an interesting and useful task. It has extensive
applications across virtual avatar controlling, robot motion
planning, virtual assistants and movie script visualization.
The task has two major challenges. First, since natural
language can have very fine-grained representation, generat-
ing visually natural and semantically relevant motions from
texts is difficult. Specifically, the text inputs can contain a lot
of subtleties. For instance, given different verbs and adverbs
in the text, the model needs to generate different motions.
The input may indicate different velocities or directions, e.g.,
“a person is running fast forward then walking slowly back-
ward”. The input may also describe a diverse set of motions,
e.g., “a man is playing golf”, “a person is playing the violin”,
“a person walks steadily along a path while holding onto
rails to keep balance”. The second challenge is that one tex-
tual description could map to multiple motions. This requires
the generative model to be probabilistic. For instance, for the
description “a person is walking”, the model should produce
multiple output samples with different velocities and direc-
tions.
Early methods [1, 2, 3, 4] for generating 3D human mo-
tions are based on very simple textual descriptions, such as
an action category, e.g., jump, throw or
run. This type of setup has two limitations. First, the feature
space of input texts is too sparse. Therefore, the solutions do
not generalize to texts outside the distribution of the dataset.
Second, category-based texts have very limited applications
in real-world scenarios. With the emergence of the KIT-ML
dataset [5], which contains 6,278 long sentence descriptions
and 3,911 complex motions, a series of works [6, 7] started
to translate complex sentences into motions. These works
usually design a sequence-to-sequence architecture that gen-
erates a single result. However, this is inconsistent with
the nature of the motion generation task because every lan-
guage modality corresponds to a very diverse set of 3D mo-
tions. Most recently, a new dataset, HumanML3D, and a new
model have been proposed in [8] to solve the above problems.
The dataset consists of 14,616 motion clips and 44,970 text
descriptions and provides the basis for training models that
can generate multiple results. The new model proposed in
[8] is able to generate multiple high-fidelity samples and
achieves state-of-the-art results quantitatively. However, the
generated samples have very limited diversity, and the model
cannot handle zero-shot generation. In addition, it consists of
several sub-models, which cannot be trained end-to-end, and
the inference process is very complex.
A new paradigm for image and video generation, named
denoising diffusion probabilistic models, has recently emerged
and achieved remarkable results [9, 10]. The diffusion model
learns an iterative denoising process that gradually recovers
the target output from Gaussian noise at inference time.
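As a brief reminder of the standard DDPM formulation [9]
(the notation below is generic and not necessarily the exact
parameterization used in this work), the forward process grad-
ually corrupts the data x_0 with Gaussian noise, and a denois-
ing network \epsilon_\theta is trained to reverse it by minimiz-
ing a simplified variational bound:

q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\big), \qquad
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \sigma_t^2 \mathbf{I}\big),

L_{\mathrm{simple}} = \mathbb{E}_{x_0,\, t,\, \epsilon \sim \mathcal{N}(0,\mathbf{I})}
\big[\lVert \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\big) \rVert^2\big],
\qquad \bar{\alpha}_t = \textstyle\prod_{s=1}^{t}(1-\beta_s).

At inference time, sampling starts from x_T \sim \mathcal{N}(0, \mathbf{I})
and applies p_\theta for t = T, \dots, 1.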
Many recent papers aim to generate images based on tex-
tual descriptions. They blend text into the input and guide
the generation of images using techniques such as classifier
guidance [11] and classifier-free guidance [12], synthesizing
impressive samples, e.g., [13].
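For reference, a minimal sketch of the classifier-free guidance
rule of [12] (generic notation; not necessarily the exact config-
uration used here): the condition c is randomly dropped during
training so that a single network learns both conditional and
unconditional denoising, and the two predictions are combined
at sampling time with a guidance scale w,

\tilde{\epsilon}_\theta(x_t, c) = (1 + w)\, \epsilon_\theta(x_t, c) - w\, \epsilon_\theta(x_t, \varnothing).

Larger w trades sample diversity for closer adherence to the
condition.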
Generation with diffusion models has also been applied to
other modalities such as speech