DIFFUSION MOTION: GENERATE TEXT-GUIDED 3D HUMAN MOTION BY DIFFUSION
MODEL
Zhiyuan Ren
Michigan State University
East Lansing, MI 48824, USA
Zhihong Pan, Xin Zhou, Le Kang
Baidu Research (USA)
Sunnyvale, CA, 94089, USA
ABSTRACT
We propose a simple and novel method for generating 3D hu-
man motion from complex natural language sentences, which
describe different velocities, directions and compositions of
all kinds of actions. Different from existing methods that use
classical generative architecture, we apply the Denoising Dif-
fusion Probabilistic Model to this task, synthesizing diverse
motion results under the guidance of texts. The diffusion
model converts white noise into structured 3D motion by a
Markov process with a series of denoising steps and is effi-
ciently trained by optimizing a variational lower bound. To
achieve the goal of text-conditioned motion synthesis, we use
the classifier-free guidance strategy to inject text embeddings
into the model during training. Our experiments demonstrate
that our model achieves competitive results on HumanML3D
test set quantitatively and can generate more visually natural
and diverse examples. We also show experimentally that
our model is capable of zero-shot generation of motions for
unseen text guidance.
Index Terms—Diffusion Model, 3D motion generation,
Multi-modalities
1. INTRODUCTION
Generating 3D human motion from natural language sen-
tences is an interesting and useful task. It has extensive
applications across virtual avatar controlling, robot motion
planning, virtual assistants and movie script visualization.
The task has two major challenges. First, since natural
language can have very fine-grained representation, generat-
ing visually natural and semantically relevant motions from
texts is difficult. Specifically, the text inputs can contain a lot
of subtleties. For instance, given different verbs and adverbs
in the text, the model needs to generate different motions.
The input may indicate different velocities or directions, e.g.,
“a person is running fast forward then walking slowly back-
ward”. The input may also describe a diverse set of motions,
e.g., “a man is playing golf”, “a person is playing the violin”,
“a person walks steadily along a path while holding onto
rails to keep balance”. The second challenge is that one tex-
tual description could map to multiple motions. This requires
the generative model to be probabilistic. For instance, for the
description “a person is walking”, the model should produce
multiple output samples with different velocities and direc-
tions.
Early methods [1, 2, 3, 4] for generating 3D human mo-
tions are based on very simple textual descriptions, such as
an action category, e.g., jump, throw or
run. This type of setup has two limitations. First, the feature
space of input texts is too sparse. Therefore, the solutions do
not generalize to texts outside the distribution of the dataset.
Second, category-based texts have very limited applications
in real-world scenarios. With the emergence of the KIT-ML
dataset [5], which contains 6,278 long sentence descriptions
and 3,911 complex motions, a series of works [6, 7] started
to translate complex sentences into motions. These works
usually design a sequence-to-sequence architecture that gen-
erates a single result. However, this is inconsistent with
the nature of the motion generation task because every lan-
guage modality corresponds to a very diverse set of 3D mo-
tions. Most recently, a new dataset, HumanML3D, and a new
model have been proposed in [8] to solve the above problems.
The dataset consists of 14,616 motion clips and 44,970 text
descriptions and provides the basis for training models that
can generate multiple results. The new model proposed in
[8] is able to generate multiple high-fidelity samples and
achieves state-of-the-art results quantitatively. However, the
generated samples have very limited diversity, and the model
cannot handle zero-shot generation. In addition, it consists of
several sub-models, which cannot be trained end-to-end, and
the inference process is very complex.
A new paradigm for image and video generation, named
denoising diffusion probabilistic models, has recently emerged
and achieved remarkable results [9, 10]. The diffusion model
learns an iterative denoising process that gradually recovers
the target output from Gaussian noise at inference time.
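As a brief reminder of the standard DDPM formulation [9]
(the notation below is generic and not necessarily the exact
parameterization used in this work), the forward process grad-
ually corrupts the data x_0 with Gaussian noise, and a denois-
ing network \epsilon_\theta is trained to reverse it by minimiz-
ing a simplified variational bound:

q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\big), \qquad
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \sigma_t^2 \mathbf{I}\big),

L_{\mathrm{simple}} = \mathbb{E}_{x_0,\, t,\, \epsilon \sim \mathcal{N}(0,\mathbf{I})}
\big[\lVert \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\big) \rVert^2\big],
\qquad \bar{\alpha}_t = \textstyle\prod_{s=1}^{t}(1-\beta_s).

At inference time, sampling starts from x_T \sim \mathcal{N}(0, \mathbf{I})
and applies p_\theta for t = T, \dots, 1.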
Many recent papers aim to generate images based on tex-
tual descriptions. They blend text into the input and guide
the generation of images using techniques such as classifier
guidance [11] and classifier-free guidance [12], synthesizing
impressive samples, e.g., [13].
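For reference, a minimal sketch of the classifier-free guidance
rule of [12] (generic notation; not necessarily the exact config-
uration used here): the condition c is randomly dropped during
training so that a single network learns both conditional and
unconditional denoising, and the two predictions are combined
at sampling time with a guidance scale w,

\tilde{\epsilon}_\theta(x_t, c) = (1 + w)\, \epsilon_\theta(x_t, c) - w\, \epsilon_\theta(x_t, \varnothing).

Larger w trades sample diversity for closer adherence to the
condition.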
Generation with diffusion models has also been applied to
other modalities such as speech