Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis
with Hierarchical Neural Embeddings
TENGLONG AO, Peking University, China
QINGZHE GAO, Shandong University and Peking University, China
YUKE LOU, Peking University, China
BAOQUAN CHEN, SIST & KLMP (MOE), Peking University, China
LIBIN LIU∗, SIST & KLMP (MOE), Peking University, China
Fig. 1. Gesture results automatically synthesized by our system for a beat-rich TED talk clip ("Well, for example, (pause) if you played an F and please do not play another F higher"). The red words represent beats, and the red arrows indicate the movements of corresponding beat gestures.
Automatic synthesis of realistic co-speech gestures is an increasingly important yet challenging task in artificial embodied agent creation. Previous systems mainly focus on generating gestures in an end-to-end manner, which leads to difficulties in mining the clear rhythm and semantics due to the complex yet subtle harmony between speech and gestures. We present a novel co-speech gesture synthesis method that achieves convincing results both on the rhythm and semantics. For the rhythm, our system contains a robust rhythm-based segmentation pipeline to explicitly ensure temporal coherence between the vocalization and gestures. For the gesture semantics, we devise a mechanism to effectively disentangle both low- and high-level neural embeddings of speech and motion based on linguistic theory. The high-level embedding corresponds to semantics, while the low-level embedding relates to subtle variations. Lastly, we build correspondence between the hierarchical embeddings of the speech and the motion, resulting in rhythm- and semantics-aware gesture synthesis. Evaluations with existing objective metrics, a newly proposed rhythmic metric, and human feedback show that our method outperforms state-of-the-art systems by a clear margin.

∗ Corresponding author.
Authors' addresses: Tenglong Ao, aubrey.tenglong.ao@gmail.com, Peking University, No.5 Yiheyuan Road, Haidian District, Beijing, China, 100871; Qingzhe Gao, gaoqingzhe97@gmail.com, Shandong University and Peking University, China; Yuke Lou, louyuke@pku.edu.cn, Peking University, No.5 Yiheyuan Road, Haidian District, Beijing, China, 100871; Baoquan Chen, baoquan@pku.edu.cn, SIST & KLMP (MOE), Peking University, No.5 Yiheyuan Road, Haidian District, Beijing, China, 100871; Libin Liu, libin.liu@pku.edu.cn, SIST & KLMP (MOE), Peking University, No.5 Yiheyuan Road, Haidian District, Beijing, China, 100871.
© 2022 Association for Computing Machinery.
This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Graphics, https://doi.org/10.1145/3550454.3555435.
CCS Concepts: • Computing methodologies → Animation; Natural language processing; Neural networks.
Additional Key Words and Phrases: non-verbal behavior, co-speech gesture
synthesis, character animation, neural generative model, multi-modality,
virtual agents
ACM Reference Format:
Tenglong Ao, Qingzhe Gao, Yuke Lou, Baoquan Chen, and Libin Liu. 2022.
Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with
Hierarchical Neural Embeddings. ACM Trans. Graph. 41, 6, Article 209 (De-
cember 2022), 19 pages. https://doi.org/10.1145/3550454.3555435
1 INTRODUCTION
Gesturing is an important part of speaking. It adds emphasis and clarity to a speech and conveys essential non-verbal information that makes the speech lively and persuasive [Burgoon et al. 1990]. There are rich demands for high-quality 3D gesture animation in many industries, such as games, films, and digital humans. However, the difficulty of reproducing the complex yet subtle harmony between vocalization and body movement makes synthesizing natural-looking co-speech gestures a long-standing and challenging task.
arXiv:2210.01448v3 [cs.SD] 4 May 2023
Gestures are grouped into six categories by linguists [Ekman and Friesen 1969; McNeill 1992]: adaptors, emblems, deictics, iconics, metaphorics, and beats. Among them, beat gestures are rhythmic movements that bear no apparent relation to speech semantics [Kipp 2004] but serve meta-narrative functions [McNeill 1992] that are crucial to the rhythmic harmony between speech and gestures. Generating realistic beat gestures requires modelling the relation between gestural beats and verbal stresses. However, it has been observed that these two modalities are not synchronized in a strict rhythmic sense [McClave 1994], making it difficult to learn their temporal connection directly from data using an end-to-end method [Bhattacharya et al. 2021a; Kucherenko et al. 2020; Yoon et al. 2020].
Gestures are associated with different levels of speech information [McNeill 1992]. For example, an emblem gesture such as a thumbs-up usually accompanies high-level semantics like good or great, while a beat gesture commonly comes with low-level acoustic emphasis. Many previous studies use only the features extracted at the last layer of an audio encoder to synthesize gestures [Alexanderson et al. 2020; Bhattacharya et al. 2021a; Kucherenko et al. 2020; Qian et al. 2021; Yoon et al. 2020]. This setup, however, may in effect encourage the encoder to mix speech information from multiple levels into the same feature, causing ambiguity and increasing the difficulty of mining clear rhythmic and semantic cues.
In this paper, we focus on generating co-speech upper-body gestures that can accompany a broad range of speech content, from a single sentence to a public speech, aiming at achieving convincing results both on the rhythm and semantics. Our first observation is that gesturing can be considered a special form of dancing under changing beats. We develop a rhythm-based canonicalization and generation framework to address the challenge of generating gestures synchronized to the speech: it segments the speech into short clips at audio beats, normalizes these clips into canonical blocks of the same length, generates gestures for every block, and aligns the generated motion to the rhythm of the speech. This framework, which is partially inspired by recent research in dance generation [Aristidou et al. 2022], provides the gesture model with an explicit hint of the rhythm, allowing the model to learn the pattern of gestural beats within a rhythmic block efficiently. Both the quantitative evaluation with a novel rhythmic metric and the qualitative evaluation with user studies show that the gestures generated by this pipeline exhibit natural synchronization to the speech.
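To make the segmentation-and-canonicalization idea concrete, the following is a minimal sketch of one plausible implementation, not the authors' code: it detects audio onsets as beat boundaries, cuts the speech at those boundaries, and time-scales each clip's features to a fixed canonical length. The feature extractor `feature_fn`, the constant `CANONICAL_LEN`, and all names are illustrative assumptions.

```python
# Minimal sketch of beat-based segmentation and canonicalization (assumed, not the paper's code).
import librosa
import numpy as np

CANONICAL_LEN = 64  # assumed number of frames per normalized block

def segment_and_normalize(audio_path, feature_fn, hop_length=512):
    """Split speech at detected onsets and time-scale each clip to a canonical length."""
    y, sr = librosa.load(audio_path, sr=None)
    # Audio onsets serve as beat boundaries (in seconds).
    beats = librosa.onset.onset_detect(y=y, sr=sr, hop_length=hop_length, units='time')
    boundaries = np.concatenate([[0.0], beats, [len(y) / sr]])

    blocks = []
    for t0, t1 in zip(boundaries[:-1], boundaries[1:]):
        clip = y[int(t0 * sr):int(t1 * sr)]
        feats = feature_fn(clip, sr)               # (T, D) frame-wise features, assumed
        # Time-scale the clip's features to the canonical block length.
        src = np.linspace(0.0, 1.0, num=len(feats))
        dst = np.linspace(0.0, 1.0, num=CANONICAL_LEN)
        block = np.stack([np.interp(dst, src, feats[:, d])
                          for d in range(feats.shape[1])], axis=1)
        blocks.append((block, (t0, t1)))           # keep the timing for later alignment
    return blocks
```

The stored `(t0, t1)` intervals are what would later allow the generated canonical-length motion blocks to be scaled back to the real speech rhythm.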
As indicated in the linguistics literature [Kipp 2004; Neff et al. 2008; Webb 1996], gestures used in everyday conversation can be broken down into a limited number of semantic units with different motion variations. We assume that these semantic units, usually referred to as lexemes, relate to the high-level features of speech audio, while the motion variations are determined by the low-level audio features. We thus disentangle high- and low-level features from different layers of an audio encoder and learn the mappings between them and the gesture lexemes and the motion variations, respectively. Experiments demonstrate that this mechanism successfully disentangles multi-level features of both the speech and the motion and synthesizes semantics-matching and stylized gestures.
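The sketch below illustrates the general idea of tapping an audio encoder at two depths and routing the deeper feature toward a discrete lexeme prediction and the shallower feature toward a continuous variation (style) code. The layer choices, module types, and head designs are assumptions for illustration only and do not reproduce the paper's exact architecture.

```python
# Hedged PyTorch sketch of multi-level feature disentanglement (assumed architecture).
import torch
import torch.nn as nn

class HierarchicalAudioEncoder(nn.Module):
    def __init__(self, in_dim=80, hid=256, n_lexemes=50, style_dim=8):
        super().__init__()
        self.low_net = nn.GRU(in_dim, hid, batch_first=True)   # shallow layers: low-level features
        self.high_net = nn.GRU(hid, hid, batch_first=True)     # deeper layers: high-level features
        self.lexeme_head = nn.Linear(hid, n_lexemes)            # high-level feature -> lexeme logits
        self.style_head = nn.Linear(hid, style_dim)             # low-level feature -> style code

    def forward(self, audio_feats):                  # audio_feats: (B, T, in_dim)
        low, _ = self.low_net(audio_feats)           # low-level embedding per frame
        high, _ = self.high_net(low)                 # high-level embedding per frame
        lexeme_logits = self.lexeme_head(high[:, -1])    # one lexeme per block
        style_code = self.style_head(low.mean(dim=1))    # subtle variation per block
        return lexeme_logits, style_code
```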
In summary, our main contributions in this paper are:
• We present a novel rhythm- and semantics-aware co-speech gesture synthesis system that generates natural-looking gestures. To the best of our knowledge, this is the first neural system that explicitly models both the rhythmic and semantic relations between speech and gestures.
• We develop a robust rhythm-based segmentation pipeline to ensure temporal coherence between speech and gestures, which we find is crucial to achieving rhythmic gestures.
• We devise an effective mechanism to relate the disentangled multi-level features of both speech and motion, which enables generating gestures with convincing semantics.
2 RELATED WORK
2.1 Data-driven Human Motion Synthesis
Traditional human motion synthesis frameworks often rely on concatenative approaches such as motion graphs [Kovar et al. 2002]. Recently, learning-based methods with neural networks have been widely applied in this area to generate high-quality and interactive motions, using models ranging from feed-forward networks [Holden et al. 2017; Starke et al. 2022] to dedicated generative models [Henter et al. 2020; Ling et al. 2020]. Dealing with the one-to-many issue, where a variety of motions can correspond to the same input or control signal, is often a challenge for these learning-based approaches. Previous systems often employ additional conditions, such as contacts [Starke et al. 2020] or phase indices [Holden et al. 2017; Starke et al. 2022], to deal with this problem. Closer to the gesture domain is speech-driven head motion synthesis, where conditional GANs [Sadoughi and Busso 2018] and conditional VAEs [Greenwood et al. 2017] have been used.
2.1.1 Music-driven Dance Synthesis. Among general motion synthesis tasks, music-driven dance generation addresses a problem similar to co-speech gesture synthesis, where the complex temporal relation between two different modalities needs to be modeled accurately. Both motion graph-based methods [Chen et al. 2021; Kim et al. 2006] and learning-based approaches [Li et al. 2021b; Siyao et al. 2022; Valle-Pérez et al. 2021] have been adopted and have achieved impressive generation results. To deal with the synchronization between the dance and the music, Chen et al. [2021] develop a manually labeled rhythm signature to represent beat patterns and ensure that the rhythm signatures of the generated dance match the music. Aristidou et al. [2022] segment the dance into blocks at music onsets, convert each block into a motion motif [Aristidou et al. 2018] that defines a specific cluster of motions, and use the motion motif to guide the synthesis of dance at the block level. Siyao et al. [2022] employ a reinforcement learning scheme to improve the rhythmic performance of the generator using a reward function that encourages beat alignment. Our rhythm-based segmentation and canonicalization framework is partially inspired by [Aristidou et al. 2022]. Similar to [Aristidou et al. 2022], we also segment the gestures into clips at audio beats but learn a high-level representation for each clip via a vector quantization scheme [Oord et al. 2017] instead of K-means clustering. Moreover, our framework generates gestures in blocks of motion and denormalizes the generated motion blocks to match the rhythm of the speech.
In contrast, Aristidou et al. [2022] synthesize dance sequences frame by frame, conditioned on the corresponding motion motifs.
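The block denormalization mentioned above can be pictured with a short sketch: a gesture block generated at a canonical length is simply time-scaled back to the duration of its speech segment. The frame rate and function names are assumptions, not the paper's implementation.

```python
# Hedged sketch of block "denormalization" (assumed, for illustration).
import numpy as np

def denormalize_block(motion_block, t0, t1, fps=30):
    """motion_block: (CANONICAL_LEN, J) canonical-length poses.
    Returns poses resampled to cover the real beat interval [t0, t1]."""
    n_out = max(2, int(round((t1 - t0) * fps)))
    src = np.linspace(0.0, 1.0, num=len(motion_block))
    dst = np.linspace(0.0, 1.0, num=n_out)
    return np.stack([np.interp(dst, src, motion_block[:, j])
                     for j in range(motion_block.shape[1])], axis=1)
```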
2.2 Co-speech Gesture Synthesis
The most primitive approach to generating human non-verbal behaviors is to animate an artificial agent using retargeted motion capture data. This kind of approach is widely used in commercial systems (e.g., films and games) because of its high-quality motion performance. However, it is not suitable for creating interactive content that cannot be prepared beforehand. Generating co-speech gestures from arbitrary input has been a long-standing research topic. Previous studies can be roughly categorized into two groups: rule-based and data-driven methods.
2.2.1 Rule-based Method. The idea of the rule-based approach is to collect a set of gesture units and design specific rules that map a speech to a sequence of gesture units [Cassell et al. 2004; Huang and Mutlu 2012; Kipp 2004; Softbank 2018]. Wagner et al. [2014] provide an excellent review of these methods. The results of rule-based methods are generally highly explainable and controllable. However, the gesture units and rules typically have to be created manually, which can be costly and inefficient for complex systems.
2.2.2 Data-driven Method. Early research on data-driven methods learns the rules embedded in data and combines them with predefined animation units to generate new gestures. For example, Kopp et al. [2006] and Levine et al. [2010] use probabilistic models to build correspondence between speech and gestures. Neff et al. [2008] build a statistical model to learn the personal style of each speaker. The model is combined with the input text tagged with the theme, utterance focus, and rheme to generate gesture scripts, which are then mapped to a sequence of gestures selected from an animation lexicon. Chiu et al. [2015] train a neural classification model to select a proper gesture unit based on the speech input. More recent research has started to take advantage of deep learning and trains end-to-end models directly on raw gesture data, which removes the manual effort of designing the gesture lexicon and mapping rules. Gestures can be synthesized using deterministic models such as multilayer perceptrons (MLPs) [Kucherenko et al. 2020], recurrent neural networks [Bhattacharya et al. 2021a; Hasegawa et al. 2018; Liu et al. 2022; Yoon et al. 2020, 2019], convolutional networks [Habibie et al. 2021], and transformers [Bhattacharya et al. 2021b], or by learning generative models such as normalizing flows [Alexanderson et al. 2020], VAEs [Li et al. 2021a; Xu et al. 2022], and learnable noise codes [Qian et al. 2021]. Our method is also a data-driven framework. We learn the motion generator and the mapping between the speech and gestures from data using a combined network structure of the vector quantized variational autoencoder (VQ-VAE) [Oord et al. 2017] and LSTM. To capture the rhythmic and semantic correspondences between the speech and gestures, we propose a multi-stage architecture that explicitly models the rhythm and semantics in different stages. An earlier system proposed by Kucherenko et al. [2021b] shares a similar high-level architectural design with our framework. However, there are two key differences: (a) our method is essentially an unsupervised learning approach, which learns the gesture lexemes, style codes, and the generator directly from the data without detailed annotations; and (b) our system employs an explicit beat-based segmentation scheme, which is shown to be effective in ensuring temporal coherence between the speech and the gesture.
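For readers unfamiliar with the vector-quantization step, the sketch below shows the standard VQ-VAE codebook lookup [Oord et al. 2017] applied to a motion encoding, which is the general mechanism by which a discrete gesture lexicon can be learned. Hyperparameters and the straight-through gradient trick follow the common recipe rather than the paper's exact settings.

```python
# Hedged sketch of a VQ codebook acting as a gesture lexicon (standard VQ-VAE recipe, assumed settings).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GestureLexicon(nn.Module):
    def __init__(self, n_lexemes=50, dim=96, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_lexemes, dim)   # each codebook row is one lexeme
        self.beta = beta

    def forward(self, z_e):                            # z_e: (B, dim) motion-block encoding
        d = torch.cdist(z_e, self.codebook.weight)     # distances to all lexemes
        idx = d.argmin(dim=1)                          # nearest lexeme index
        z_q = self.codebook(idx)                       # quantized embedding
        # Codebook and commitment losses; straight-through gradient for z_q.
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, loss
```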
2.3 Multi-Modal Data Processing
Co-speech gesture generation is a cross-modal process involving audio, text, motion, and other information related to the speaker and the content of the speech. The representation and alignment of each modality are essential for high-quality results [Baltrušaitis et al. 2019]. Mel-spectrogram and MFCC acoustic features are commonly used as audio features [Alexanderson et al. 2020; Kucherenko et al. 2020; Qian et al. 2021], typically resampled to the same framerate as the motion. For the text features, pre-trained language models like BERT [Devlin et al. 2019; Kucherenko et al. 2020] and FastText [Bojanowski et al. 2017; Yoon et al. 2020] have been used to encode text transcripts into frame-wise latent codes, where paddings, fillers, or empty words are inserted into a sentence to make the word sequence the same length as the motion [Kucherenko et al. 2020; Yoon et al. 2020]. Speaker style and emotions can also be encoded by learnable latent codes [Bhattacharya et al. 2021a; Yoon et al. 2020] and are resampled or padded to match the length of the speech. In this work, we employ a pre-trained speech model to extract audio features and fine-tune it using a contrastive learning strategy. We also utilize a BERT-based model to vectorize the text. These multi-modal data are then aligned explicitly using the standard approaches discussed above. Notably, a concurrent study [Liu et al. 2022] also extracts audio features using contrastive learning. Their framework considers the learning of the audio features as part of the training of the gesture generator. In contrast, our framework trains the audio encoder in a separate pre-training stage using only the audio data.
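The standard alignment step described above, resampling frame-wise audio features to the motion framerate so that audio, text, and pose sequences share one timeline, can be sketched as follows. The parameter values and hop-length assumption are illustrative only.

```python
# Hedged sketch: resample MFCC features to the motion frame rate (assumed parameters).
import numpy as np
import librosa

def audio_features_at_motion_rate(y, sr, motion_fps=30, n_mfcc=13):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (T_audio, n_mfcc)
    t_audio = np.arange(len(mfcc)) * (512 / sr)                # librosa's default hop length is 512
    t_motion = np.arange(0.0, len(y) / sr, 1.0 / motion_fps)   # motion-frame timestamps
    # Linearly interpolate each feature dimension onto the motion timeline.
    return np.stack([np.interp(t_motion, t_audio, mfcc[:, d])
                     for d in range(n_mfcc)], axis=1)          # (T_motion, n_mfcc)
```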
2.4 Evaluation of Motion Synthesis Models
Evaluating the generated co-speech gestures is often difficult because motion quality is a highly subjective concept. Previous works have proposed several evaluation criteria; Wolfert et al. [2022] provide a comprehensive review of them. User studies are widely adopted to evaluate different aspects of motion quality, such as human-likeness and speech-gesture matching [Alexanderson et al. 2020; Kucherenko et al. 2020; Yoon et al. 2020], but they can be expensive, and it is hard to exclude uncontrolled factors. The absolute difference of joint positions or other motion features, such as velocity and acceleration, between a reconstructed motion and the ground truth is used by several works as an objective metric [Ginosar et al. 2019; Joo et al. 2019; Kucherenko et al. 2019]. However, this metric is not suitable for evaluating motions that are natural but not the same as the reference. The Fréchet Inception Distance (FID) [Heusel et al. 2017] is a widely used criterion in image generation tasks that measures the difference between the distributions of the dataset and the generated samples in a latent space, and it successfully reflects the perceptual quality of generated samples. Similarly, Yoon et al. [2020] and Qian et al. [2021] propose the Fréchet Gesture Distance (FGD) and Fréchet Template Distance (FTD) metrics, respectively, which measure the perceptual quality of generated gestures. In this paper, we compare our framework with several baseline methods
Fig. 2. Our system is composed of three core components: (a) the data module preprocesses a speech, segments it into normalized blocks based on the beats,
and extracts speech features from these blocks; (b) the training module learns a gesture lexicon from the normalized motion blocks and trains the generator to
synthesize gesture sequences, conditioned on the gesture lexemes, the style codes, as well as the features of previous motion blocks and adjacent speech
blocks; and (c) the inference module employs interpreters to transfer the speech features to gesture lexemes and style codes, which are then used by the
learned generator to predict future gestures.
using both user studies and objective metrics like FGD. We further propose a simple but effective rhythmic metric that measures the percentage of matched beats under a dynamically adjusted matching threshold, which provides a more informative picture of the rhythm performance.
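For intuition, a beat-matching measure in this spirit can be sketched as below: for each audio beat, check whether a gestural beat falls within a tolerance, and sweep the tolerance to obtain a curve rather than a single number. The exact definition used in the paper may differ; this is an assumed reading for illustration only.

```python
# Hedged sketch of a beat-matching ratio with an adjustable threshold (assumed definition).
import numpy as np

def matched_beat_ratio(audio_beats, gesture_beats, threshold):
    """Fraction of audio beats that have a gestural beat within `threshold` seconds."""
    audio_beats = np.asarray(audio_beats)
    gesture_beats = np.asarray(gesture_beats)
    if len(audio_beats) == 0 or len(gesture_beats) == 0:
        return 0.0
    dists = np.abs(audio_beats[:, None] - gesture_beats[None, :]).min(axis=1)
    return float((dists <= threshold).mean())

# Sweeping the threshold yields a curve that characterizes rhythm performance,
# e.g. (with hypothetical beat lists a_beats and g_beats):
# thresholds = np.linspace(0.05, 0.5, 10)
# curve = [matched_beat_ratio(a_beats, g_beats, t) for t in thresholds]
```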
3 SYSTEM OVERVIEW
Our goal is to synthesize realistic co-speech upper-body gestures
that match a given speech context both temporally and semanti-
cally. To achieve this goal, we build a system using neural networks
that takes speech audio as input and generates gesture sequences
accordingly. Additional speech modalities, such as text and speaker
identity, will also be considered by the system when available to
enhance semantic coherence and generate stylized gestures.
A gesture motion consists of a sequence of gesture units, which can be further broken down into a number of gesture phases that align with intonational units, such as pitch accents or stressed syllables [Kendon 2004; Loehr 2012]. The action in each of these gesture phases is typically a specific movement such as lifting a hand, holding an arm at a position, or moving both arms down together, which is often referred to as a gesture lexeme by linguists [Kipp 2004; Neff et al. 2008; Webb 1996]. The literature also indicates that only a limited number of lexemes are used in everyday conversation. These lexemes form a gesture lexicon. A typical speaker may use only a subset of this lexicon and apply slight variations to the motion. We assume such variations cannot be inferred directly from the speech but can be characterized by some latent variables, which we refer to as gesture style codes. Our system thus generates gestures in a hierarchical order: it first determines the sequence of gesture lexemes and style codes and then generates gestural moves based on these motion-related features and other speech modalities.
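The hierarchical order described above can be summarized by a short inference sketch: for each speech block, a lexeme and a style code are inferred first, and a generator then produces the poses conditioned on them. All module and attribute names are placeholders standing in for the paper's interpreters and generator.

```python
# Hedged sketch of hierarchical, block-wise gesture inference (all names are placeholders).
def synthesize_gestures(speech_blocks, lexeme_interpreter, style_interpreter,
                        generator, prev_motion=None):
    motion_blocks = []
    for block in speech_blocks:
        lexeme = lexeme_interpreter(block.high_level_feats)    # discrete gesture lexeme
        style = style_interpreter(block.low_level_feats)       # continuous style code
        motion = generator(lexeme, style, block, prev_motion)  # canonical-length poses
        motion_blocks.append(motion)
        prev_motion = motion                                   # condition the next block
    return motion_blocks
```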
Our system processes the input speech in a block-wise manner. Considering the temporal and structural synchrony between the gesture and the speech, we leverage a segmentation that aligns with the rhythm of the speech to ensure temporal coherence between the two modalities. Specifically, our system extracts beats from the input speech based on audio onsets and segments the speech into short clips at every beat. These clips are then time-scaled and converted into normalized blocks of the same length. We extract features at multiple levels for each block, where the high-level features are translated into a gesture lexeme, and the low-level features