Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis
with Hierarchical Neural Embeddings
TENGLONG AO, Peking University, China
QINGZHE GAO, Shandong University and Peking University, China
YUKE LOU, Peking University, China
BAOQUAN CHEN, SIST & KLMP (MOE), Peking University, China
LIBIN LIU∗, SIST & KLMP (MOE), Peking University, China
Fig. 1. Gesture results automatically synthesized by our system for a beat-rich TED talk clip ("Well, for example, (pause) if you played an F and please do not play another F higher"). The red words represent beats, and the red arrows indicate the movements of corresponding beat gestures.
Automatic synthesis of realistic co-speech gestures is an increasingly important yet challenging task in artificial embodied agent creation. Previous systems mainly focus on generating gestures in an end-to-end manner, which leads to difficulties in mining the clear rhythm and semantics due to the complex yet subtle harmony between speech and gestures. We present a novel co-speech gesture synthesis method that achieves convincing results both on the rhythm and semantics. For the rhythm, our system contains a robust rhythm-based segmentation pipeline to explicitly ensure temporal coherence between the vocalization and gestures. For the gesture semantics, we devise a mechanism to effectively disentangle both low- and high-level neural embeddings of speech and motion based on linguistic theory. The high-level embedding corresponds to semantics, while the low-level embedding relates to subtle variations. Lastly, we build correspondence between the hierarchical embeddings of the speech and the motion, resulting in rhythm- and semantics-aware gesture synthesis. Evaluations with existing objective metrics, a newly proposed rhythmic metric, and human feedback show that our method outperforms state-of-the-art systems by a clear margin.

∗ Corresponding author.
Authors' addresses: Tenglong Ao, aubrey.tenglong.ao@gmail.com, Peking University, No.5 Yiheyuan Road, Haidian District, Beijing, China, 100871; Qingzhe Gao, gaoqingzhe97@gmail.com, Shandong University and Peking University, China; Yuke Lou, louyuke@pku.edu.cn, Peking University, No.5 Yiheyuan Road, Haidian District, Beijing, China, 100871; Baoquan Chen, baoquan@pku.edu.cn, SIST & KLMP (MOE), Peking University, No.5 Yiheyuan Road, Haidian District, Beijing, China, 100871; Libin Liu, libin.liu@pku.edu.cn, SIST & KLMP (MOE), Peking University, No.5 Yiheyuan Road, Haidian District, Beijing, China, 100871.
© 2022 Association for Computing Machinery.
This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Graphics, https://doi.org/10.1145/3550454.3555435.
CCS Concepts: • Computing methodologies → Animation; Natural language processing; Neural networks.
Additional Key Words and Phrases: non-verbal behavior, co-speech gesture
synthesis, character animation, neural generative model, multi-modality,
virtual agents
ACM Reference Format:
Tenglong Ao, Qingzhe Gao, Yuke Lou, Baoquan Chen, and Libin Liu. 2022.
Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with
Hierarchical Neural Embeddings. ACM Trans. Graph. 41, 6, Article 209 (De-
cember 2022), 19 pages. https://doi.org/10.1145/3550454.3555435
1 INTRODUCTION
Gesturing is an important part of speaking. It adds emphasis and clarity to a speech and conveys essential non-verbal information that makes the speech lively and persuasive [Burgoon et al. 1990]. There are rich demands for high-quality 3D gesture animation in many industries, such as games, films, and digital humans. However, the difficulty of reproducing the complex yet subtle harmony between vocalization and body movement makes synthesizing natural-looking co-speech gestures a long-standing and challenging task.
arXiv:2210.01448v3 [cs.SD] 4 May 2023
Gestures are grouped into six categories by linguists [Ekman and Friesen 1969; McNeill 1992]: adaptors, emblems, deictics, iconics, metaphorics, and beats. Among them, beat gestures are rhythmic movements that bear no apparent relation to speech semantics [Kipp 2004] but serve meta-narrative functions [McNeill 1992] that are crucial to the rhythmic harmony between speech and gestures. Generating realistic beat gestures requires modelling the relation between gestural beats and verbal stresses. However, it has been observed that these two modalities are not synchronized in a strict rhythmic sense [McClave 1994], making it difficult to learn their temporal connection directly from data using an end-to-end method [Bhattacharya et al. 2021a; Kucherenko et al. 2020; Yoon et al. 2020].
Gestures are associated with different levels of speech information [McNeill 1992]. For example, an emblem gesture such as a thumbs-up usually accompanies high-level semantics like good or great, while a beat gesture commonly comes with low-level acoustic emphasis. Many previous studies use only the features extracted at the last layer of an audio encoder to synthesize gestures [Alexanderson et al. 2020; Bhattacharya et al. 2021a; Kucherenko et al. 2020; Qian et al. 2021; Yoon et al. 2020]. This setup, however, may in effect encourage the encoder to mix speech information from multiple levels into the same feature, causing ambiguity and increasing the difficulty of mining clear rhythmic and semantic cues.
In this paper, we focus on generating co-speech upper-body gestures that can accompany a broad range of speech content, from a single sentence to a public speech, aiming at achieving convincing results both on the rhythm and semantics. Our first observation is that gesturing can be considered a special form of dancing under changing beats. We develop a rhythm-based canonicalization and generation framework to address the challenge of generating gestures synchronized to the speech: it segments the speech into short clips at audio beats, normalizes these clips into canonical blocks of the same length, generates gestures for every block, and aligns the generated motion to the rhythm of the speech. This framework, which is partially inspired by recent research in dance generation [Aristidou et al. 2022], provides the gesture model with an explicit hint of the rhythm, allowing the model to learn the pattern of gestural beats within a rhythmic block efficiently. Both the quantitative evaluation with a novel rhythmic metric and the qualitative evaluation with user studies show that the gestures generated by this pipeline exhibit natural synchronization to the speech.
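To make the segmentation-and-canonicalization idea concrete, the following is a minimal sketch of one plausible implementation, not the authors' code: it detects audio onsets as beat boundaries, cuts the speech at those boundaries, and time-scales each clip's features to a fixed canonical length. The feature extractor `feature_fn`, the constant `CANONICAL_LEN`, and all names are illustrative assumptions.

```python
# Minimal sketch of beat-based segmentation and canonicalization (assumed, not the paper's code).
import librosa
import numpy as np

CANONICAL_LEN = 64  # assumed number of frames per normalized block

def segment_and_normalize(audio_path, feature_fn, hop_length=512):
    """Split speech at detected onsets and time-scale each clip to a canonical length."""
    y, sr = librosa.load(audio_path, sr=None)
    # Audio onsets serve as beat boundaries (in seconds).
    beats = librosa.onset.onset_detect(y=y, sr=sr, hop_length=hop_length, units='time')
    boundaries = np.concatenate([[0.0], beats, [len(y) / sr]])

    blocks = []
    for t0, t1 in zip(boundaries[:-1], boundaries[1:]):
        clip = y[int(t0 * sr):int(t1 * sr)]
        feats = feature_fn(clip, sr)               # (T, D) frame-wise features, assumed
        # Time-scale the clip's features to the canonical block length.
        src = np.linspace(0.0, 1.0, num=len(feats))
        dst = np.linspace(0.0, 1.0, num=CANONICAL_LEN)
        block = np.stack([np.interp(dst, src, feats[:, d])
                          for d in range(feats.shape[1])], axis=1)
        blocks.append((block, (t0, t1)))           # keep the timing for later alignment
    return blocks
```

The stored `(t0, t1)` intervals are what would later allow the generated canonical-length motion blocks to be scaled back to the real speech rhythm.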
As indicated in the linguistics literature [Kipp 2004; Neff et al. 2008; Webb 1996], gestures used in everyday conversation can be broken down into a limited number of semantic units with different motion variations. We assume that these semantic units, usually referred to as lexemes, relate to the high-level features of speech audio, while the motion variations are determined by the low-level audio features. We thus disentangle high- and low-level features from different layers of an audio encoder and learn the mappings between them and the gesture lexemes and the motion variations, respectively. Experiments demonstrate that this mechanism successfully disentangles multi-level features of both the speech and the motion and synthesizes semantics-matching and stylized gestures.
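The sketch below illustrates the general idea of tapping an audio encoder at two depths and routing the deeper feature toward a discrete lexeme prediction and the shallower feature toward a continuous variation (style) code. The layer choices, module types, and head designs are assumptions for illustration only and do not reproduce the paper's exact architecture.

```python
# Hedged PyTorch sketch of multi-level feature disentanglement (assumed architecture).
import torch
import torch.nn as nn

class HierarchicalAudioEncoder(nn.Module):
    def __init__(self, in_dim=80, hid=256, n_lexemes=50, style_dim=8):
        super().__init__()
        self.low_net = nn.GRU(in_dim, hid, batch_first=True)   # shallow layers: low-level features
        self.high_net = nn.GRU(hid, hid, batch_first=True)     # deeper layers: high-level features
        self.lexeme_head = nn.Linear(hid, n_lexemes)            # high-level feature -> lexeme logits
        self.style_head = nn.Linear(hid, style_dim)             # low-level feature -> style code

    def forward(self, audio_feats):                  # audio_feats: (B, T, in_dim)
        low, _ = self.low_net(audio_feats)           # low-level embedding per frame
        high, _ = self.high_net(low)                 # high-level embedding per frame
        lexeme_logits = self.lexeme_head(high[:, -1])    # one lexeme per block
        style_code = self.style_head(low.mean(dim=1))    # subtle variation per block
        return lexeme_logits, style_code
```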
In summary, our main contributions in this paper are:
• We present a novel rhythm- and semantics-aware co-speech gesture synthesis system that generates natural-looking gestures. To the best of our knowledge, this is the first neural system that explicitly models both the rhythmic and semantic relations between speech and gestures.
• We develop a robust rhythm-based segmentation pipeline to ensure temporal coherence between speech and gestures, which we find is crucial to achieving rhythmic gestures.
• We devise an effective mechanism to relate the disentangled multi-level features of both speech and motion, which enables generating gestures with convincing semantics.
2 RELATED WORK
2.1 Data-driven Human Motion Synthesis
Traditional human motion synthesis frameworks often rely on concatenative approaches such as motion graphs [Kovar et al. 2002]. Recently, learning-based methods with neural networks have been widely applied in this area to generate high-quality and interactive motions, using models ranging from feed-forward networks [Holden et al. 2017; Starke et al. 2022] to dedicated generative models [Henter et al. 2020; Ling et al. 2020]. Dealing with the one-to-many issue, where a variety of motions can correspond to the same input or control signal, is often a challenge for these learning-based approaches. Previous systems often employ additional conditions, such as contacts [Starke et al. 2020] or phase indices [Holden et al. 2017; Starke et al. 2022], to deal with this problem. Closer to the gesture domain is speech-driven head motion synthesis, where conditional GANs [Sadoughi and Busso 2018] and conditional VAEs [Greenwood et al. 2017] have been used.
2.1.1 Music-driven Dance Synthesis. Among general motion synthesis tasks, music-driven dance generation addresses a problem similar to co-speech gesture synthesis, where the complex temporal relation between two different modalities needs to be modeled accurately. Both motion graph-based methods [Chen et al. 2021; Kim et al. 2006] and learning-based approaches [Li et al. 2021b; Siyao et al. 2022; Valle-Pérez et al. 2021] have been adopted and have achieved impressive generation results. To deal with the synchronization between the dance and the music, Chen et al. [2021] develop a manually labeled rhythm signature to represent beat patterns and ensure that the rhythm signatures of the generated dance match the music. Aristidou et al. [2022] segment the dance into blocks at music onsets, convert each block into a motion motif [Aristidou et al. 2018] that defines a specific cluster of motions, and use the motion motif to guide the synthesis of dance at the block level. Siyao et al. [2022] employ a reinforcement learning scheme to improve the rhythmic performance of the generator using a reward function that encourages beat alignment. Our rhythm-based segmentation and canonicalization framework is partially inspired by [Aristidou et al. 2022]. Similar to [Aristidou et al. 2022], we also segment the gestures into clips at audio beats but learn a high-level representation for each clip via a vector quantization scheme [Oord et al. 2017] instead of K-means clustering. Moreover, our framework generates gestures in blocks of motion and denormalizes the generated motion blocks to match the rhythm of the speech.
In contrast, Aristidou et al. [2022] synthesize dance sequences frame by frame, conditioned on the corresponding motion motifs.
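The block denormalization mentioned above can be pictured with a short sketch: a gesture block generated at a canonical length is simply time-scaled back to the duration of its speech segment. The frame rate and function names are assumptions, not the paper's implementation.

```python
# Hedged sketch of block "denormalization" (assumed, for illustration).
import numpy as np

def denormalize_block(motion_block, t0, t1, fps=30):
    """motion_block: (CANONICAL_LEN, J) canonical-length poses.
    Returns poses resampled to cover the real beat interval [t0, t1]."""
    n_out = max(2, int(round((t1 - t0) * fps)))
    src = np.linspace(0.0, 1.0, num=len(motion_block))
    dst = np.linspace(0.0, 1.0, num=n_out)
    return np.stack([np.interp(dst, src, motion_block[:, j])
                     for j in range(motion_block.shape[1])], axis=1)
```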
2.2 Co-speech Gesture Synthesis
The most primitive approach to generating human non-verbal behaviors is to animate an artificial agent using retargeted motion capture data. This kind of approach is widely used in commercial systems (e.g., films and games) because of its high-quality motion performance. However, it is not suitable for creating interactive content that cannot be prepared beforehand. Generating co-speech gestures from arbitrary input has been a long-standing research topic. Previous studies can be roughly categorized into two groups: rule-based and data-driven methods.
2.2.1 Rule-based Method. The idea of the rule-based approach is to collect a set of gesture units and design specific rules that map a speech to a sequence of gesture units [Cassell et al. 2004; Huang and Mutlu 2012; Kipp 2004; Softbank 2018]. Wagner et al. [2014] provide an excellent review of these methods. The results of rule-based methods are generally highly explainable and controllable. However, the gesture units and rules typically have to be created manually, which can be costly and inefficient for complex systems.
2.2.2 Data-driven Method. Early research on data-driven methods learns the rules embedded in data and combines them with predefined animation units to generate new gestures. For example, Kopp et al. [2006] and Levine et al. [2010] use probabilistic models to build correspondence between speech and gestures. Neff et al. [2008] build a statistical model to learn the personal style of each speaker. The model is combined with the input text tagged with the theme, utterance focus, and rheme to generate gesture scripts, which are then mapped to a sequence of gestures selected from an animation lexicon. Chiu et al. [2015] train a neural classification model to select a proper gesture unit based on the speech input. More recent research has started to take advantage of deep learning and trains end-to-end models directly on raw gesture data, which removes the manual effort of designing the gesture lexicon and mapping rules. Gestures can be synthesized using deterministic models such as multilayer perceptrons (MLPs) [Kucherenko et al. 2020], recurrent neural networks [Bhattacharya et al. 2021a; Hasegawa et al. 2018; Liu et al. 2022; Yoon et al. 2020, 2019], convolutional networks [Habibie et al. 2021], and transformers [Bhattacharya et al. 2021b], or by learning generative models such as normalizing flows [Alexanderson et al. 2020], VAEs [Li et al. 2021a; Xu et al. 2022], and learnable noise codes [Qian et al. 2021]. Our method is also a data-driven framework. We learn the motion generator and the mapping between the speech and gestures from data using a combined network structure of the vector quantized variational autoencoder (VQ-VAE) [Oord et al. 2017] and LSTM. To capture the rhythmic and semantic correspondences between the speech and gestures, we propose a multi-stage architecture that explicitly models the rhythm and semantics in different stages. An earlier system proposed by Kucherenko et al. [2021b] shares a similar high-level architectural design with our framework. However, there are two key differences: (a) our method is essentially an unsupervised learning approach, which learns the gesture lexemes, style codes, and the generator directly from the data without detailed annotations; and (b) our system employs an explicit beat-based segmentation scheme, which is shown to be effective in ensuring temporal coherence between the speech and the gesture.
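For readers unfamiliar with the vector-quantization step, the sketch below shows the standard VQ-VAE codebook lookup [Oord et al. 2017] applied to a motion encoding, which is the general mechanism by which a discrete gesture lexicon can be learned. Hyperparameters and the straight-through gradient trick follow the common recipe rather than the paper's exact settings.

```python
# Hedged sketch of a VQ codebook acting as a gesture lexicon (standard VQ-VAE recipe, assumed settings).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GestureLexicon(nn.Module):
    def __init__(self, n_lexemes=50, dim=96, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_lexemes, dim)   # each codebook row is one lexeme
        self.beta = beta

    def forward(self, z_e):                            # z_e: (B, dim) motion-block encoding
        d = torch.cdist(z_e, self.codebook.weight)     # distances to all lexemes
        idx = d.argmin(dim=1)                          # nearest lexeme index
        z_q = self.codebook(idx)                       # quantized embedding
        # Codebook and commitment losses; straight-through gradient for z_q.
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, loss
```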
2.3 Multi-Modal Data Processing
Co-speech gesture generation is a cross-modal process involving audio, text, motion, and other information related to the speaker and the content of the speech. The representation and alignment of each modality are essential for high-quality results [Baltrušaitis et al. 2019]. Mel-spectrogram and MFCC acoustic features are commonly used as audio features [Alexanderson et al. 2020; Kucherenko et al. 2020; Qian et al. 2021], typically resampled to the same framerate as the motion. For the text features, pre-trained language models like BERT [Devlin et al. 2019; Kucherenko et al. 2020] and FastText [Bojanowski et al. 2017; Yoon et al. 2020] have been used to encode text transcripts into frame-wise latent codes, where paddings, fillers, or empty words are inserted into a sentence to make the word sequence the same length as the motion [Kucherenko et al. 2020; Yoon et al. 2020]. Speaker style and emotions can also be encoded by learnable latent codes [Bhattacharya et al. 2021a; Yoon et al. 2020] and are resampled or padded to match the length of the speech. In this work, we employ a pre-trained speech model to extract audio features and fine-tune it using a contrastive learning strategy. We also utilize a BERT-based model to vectorize the text. These multi-modal data are then aligned explicitly using the standard approaches discussed above. Notably, a concurrent study [Liu et al. 2022] also extracts audio features using contrastive learning. Their framework considers the learning of the audio features as part of the training of the gesture generator. In contrast, our framework trains the audio encoder in a separate pre-training stage using only the audio data.
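The standard alignment step described above, resampling frame-wise audio features to the motion framerate so that audio, text, and pose sequences share one timeline, can be sketched as follows. The parameter values and hop-length assumption are illustrative only.

```python
# Hedged sketch: resample MFCC features to the motion frame rate (assumed parameters).
import numpy as np
import librosa

def audio_features_at_motion_rate(y, sr, motion_fps=30, n_mfcc=13):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (T_audio, n_mfcc)
    t_audio = np.arange(len(mfcc)) * (512 / sr)                # librosa's default hop length is 512
    t_motion = np.arange(0.0, len(y) / sr, 1.0 / motion_fps)   # motion-frame timestamps
    # Linearly interpolate each feature dimension onto the motion timeline.
    return np.stack([np.interp(t_motion, t_audio, mfcc[:, d])
                     for d in range(n_mfcc)], axis=1)          # (T_motion, n_mfcc)
```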
2.4 Evaluation of Motion Synthesis Models
Evaluating the generated co-speech gestures is often difficult because motion quality is a highly subjective concept. Previous works have proposed several evaluation criteria; Wolfert et al. [2022] provide a comprehensive review of them. User studies are widely adopted to evaluate different aspects of motion quality, such as human-likeness and speech-gesture matching [Alexanderson et al. 2020; Kucherenko et al. 2020; Yoon et al. 2020], but they can be expensive, and it is hard to exclude uncontrolled factors. The absolute difference of joint positions or other motion features, such as velocity and acceleration, between a reconstructed motion and the ground truth is used by several works as an objective metric [Ginosar et al. 2019; Joo et al. 2019; Kucherenko et al. 2019]. However, this metric is not suitable for evaluating motions that are natural but not the same as the reference. The Fréchet Inception Distance (FID) [Heusel et al. 2017] is a widely used criterion in image generation tasks that measures the difference between the distributions of the dataset and the generated samples in a latent space, and it successfully reflects the perceptual quality of generated samples. Similarly, Yoon et al. [2020] and Qian et al. [2021] propose the Fréchet Gesture Distance (FGD) and Fréchet Template Distance (FTD) metrics, respectively, which measure the perceptual quality of generated gestures. In this paper, we compare our framework with several baseline methods
Fig. 2. Our system is composed of three core components: (a) the data module preprocesses a speech, segments it into normalized blocks based on the beats,
and extracts speech features from these blocks; (b) the training module learns a gesture lexicon from the normalized motion blocks and trains the generator to
synthesize gesture sequences, conditioned on the gesture lexemes, the style codes, as well as the features of previous motion blocks and adjacent speech
blocks; and (c) the inference module employs interpreters to transfer the speech features to gesture lexemes and style codes, which are then used by the
learned generator to predict future gestures.
using both user studies and objective metrics like FGD. We further propose a simple but effective rhythmic metric that measures the percentage of matched beats under a dynamically adjusted matching threshold, which provides a more informative picture of the rhythm performance.
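For intuition, a beat-matching measure in this spirit can be sketched as below: for each audio beat, check whether a gestural beat falls within a tolerance, and sweep the tolerance to obtain a curve rather than a single number. The exact definition used in the paper may differ; this is an assumed reading for illustration only.

```python
# Hedged sketch of a beat-matching ratio with an adjustable threshold (assumed definition).
import numpy as np

def matched_beat_ratio(audio_beats, gesture_beats, threshold):
    """Fraction of audio beats that have a gestural beat within `threshold` seconds."""
    audio_beats = np.asarray(audio_beats)
    gesture_beats = np.asarray(gesture_beats)
    if len(audio_beats) == 0 or len(gesture_beats) == 0:
        return 0.0
    dists = np.abs(audio_beats[:, None] - gesture_beats[None, :]).min(axis=1)
    return float((dists <= threshold).mean())

# Sweeping the threshold yields a curve that characterizes rhythm performance,
# e.g. (with hypothetical beat lists a_beats and g_beats):
# thresholds = np.linspace(0.05, 0.5, 10)
# curve = [matched_beat_ratio(a_beats, g_beats, t) for t in thresholds]
```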
3 SYSTEM OVERVIEW
Our goal is to synthesize realistic co-speech upper-body gestures
that match a given speech context both temporally and semanti-
cally. To achieve this goal, we build a system using neural networks
that takes speech audio as input and generates gesture sequences
accordingly. Additional speech modalities, such as text and speaker
identity, will also be considered by the system when available to
enhance semantic coherence and generate stylized gestures.
A gesture motion consists of a sequence of gesture units, which can be further broken down into a number of gesture phases that align with intonational units, such as pitch accents or stressed syllables [Kendon 2004; Loehr 2012]. The action in each of these gesture phases is typically a specific movement such as lifting a hand, holding an arm at a position, or moving both arms down together, which is often referred to as a gesture lexeme by linguists [Kipp 2004; Neff et al. 2008; Webb 1996]. The literature also indicates that only a limited number of lexemes are used in everyday conversation. These lexemes form a gesture lexicon. A typical speaker may use only a subset of this lexicon and apply slight variations to the motion. We assume such variations cannot be inferred directly from the speech but can be characterized by some latent variables, which we refer to as gesture style codes. Our system thus generates gestures in a hierarchical order: it first determines the sequence of gesture lexemes and style codes and then generates gestural moves based on these motion-related features and other speech modalities.
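The hierarchical order described above can be summarized by a short inference sketch: for each speech block, a lexeme and a style code are inferred first, and a generator then produces the poses conditioned on them. All module and attribute names are placeholders standing in for the paper's interpreters and generator.

```python
# Hedged sketch of hierarchical, block-wise gesture inference (all names are placeholders).
def synthesize_gestures(speech_blocks, lexeme_interpreter, style_interpreter,
                        generator, prev_motion=None):
    motion_blocks = []
    for block in speech_blocks:
        lexeme = lexeme_interpreter(block.high_level_feats)    # discrete gesture lexeme
        style = style_interpreter(block.low_level_feats)       # continuous style code
        motion = generator(lexeme, style, block, prev_motion)  # canonical-length poses
        motion_blocks.append(motion)
        prev_motion = motion                                   # condition the next block
    return motion_blocks
```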
Our system processes the input speech in a block-wise manner. Considering the temporal and structural synchrony between the gesture and the speech, we leverage a segmentation that aligns with the rhythm of the speech to ensure temporal coherence between the two modalities. Specifically, our system extracts beats from the input speech based on audio onsets and segments the speech into short clips at every beat. These clips are then time-scaled and converted into normalized blocks of the same length. We extract features at multiple levels for each block, where the high-level features are translated into a gesture lexeme, and the low-level features