Gestures are grouped into six categories by linguists [Ekman and
Friesen 1969; McNeill 1992]—adaptors, emblems, deictics, iconics,
metaphorics, and beats. Among them, the beat gestures are rhythmic
movements that bear no apparent relation to speech semantics [Kipp
2004] but serve meta-narrative functions [McNeill 1992] that are
crucial to rhythmic harmony between speech and gestures. Gener-
ating realistic beat gestures requires modelling the relation between
the gestural beats and the verbal stresses. However, it has been
observed that these two modalities are not synchronized in a strict
rhythmic sense [McClave 1994], making it difficult to learn their temporal connection directly from data using an end-to-end method [Bhattacharya et al. 2021a; Kucherenko et al. 2020; Yoon et al. 2020].
Gestures are associated with different levels of speech informa-
tion [McNeill 1992]. For example, an emblem gesture such as thumbs-
up usually accompanies high-level semantics like good or great,
while a beat gesture commonly comes with low-level acoustic em-
phasis. Many previous studies use only the features extracted at the
last layer of an audio encoder to synthesize gestures [Alexanderson et al. 2020; Bhattacharya et al. 2021a; Kucherenko et al. 2020; Qian et al. 2021; Yoon et al. 2020]. This setup, however, may in effect encourage the encoder to mix the speech information from multiple levels into the same feature, causing ambiguity and increasing the difficulty in mining clear rhythmic and semantic cues.
In this paper, we focus on generating co-speech upper-body ges-
tures that can accompany a broad range of speech content—from a
single sentence to a public speech, aiming to achieve convincing results in both rhythm and semantics. Our first observation is that gesturing can be considered a special form of dancing under changing beats. We develop a rhythm-based canonicalization and generation framework to address the challenge of generating gestures synchronized with the speech: it segments the speech into short clips at audio beats, normalizes these clips into canonical blocks of the same length, generates gestures for every block, and aligns the generated motion to the rhythm of the speech. This framework, which is partially inspired by recent research in dance generation [Aristidou et al. 2022], provides the gesture model with an explicit hint of the rhythm, allowing the model to learn the pattern of gestural beats within a rhythmic block efficiently. Both the quantitative evaluation with a novel rhythmic metric and the qualitative evaluation with user studies show that the gestures generated by this pipeline exhibit natural synchronization to the speech.
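To make this pipeline concrete, the sketch below illustrates the segmentation, canonicalization, and denormalization steps on a speech waveform and a pose-feature sequence. It is a minimal conceptual example only: the librosa beat tracker, the per-channel linear resampling, and the 32-frame canonical block length are placeholder choices for illustration, not the components or settings used in our system.

# Minimal sketch of rhythm-based canonicalization (illustrative placeholders,
# not our actual implementation).
import librosa
import numpy as np

CANONICAL_FRAMES = 32  # hypothetical fixed length of a canonical block

def segment_at_beats(audio_path):
    """Return (start, end) times of clips delimited by detected audio beats."""
    y, sr = librosa.load(audio_path, sr=None)
    _, beat_times = librosa.beat.beat_track(y=y, sr=sr, units='time')
    bounds = np.concatenate([[0.0], beat_times, [len(y) / sr]])
    return list(zip(bounds[:-1], bounds[1:]))

def resample_clip(features, n_frames):
    """Linearly resample a (T, D) feature clip to n_frames frames."""
    t_src = np.linspace(0.0, 1.0, len(features))
    t_dst = np.linspace(0.0, 1.0, n_frames)
    return np.stack([np.interp(t_dst, t_src, features[:, d])
                     for d in range(features.shape[1])], axis=1)

def canonicalize(clip_features):
    """Normalize a clip to the canonical block length."""
    return resample_clip(clip_features, CANONICAL_FRAMES)

def denormalize(motion_block, n_frames_out):
    """Stretch a generated canonical block back to the clip's true duration,
    realigning the motion with the rhythm of the speech."""
    return resample_clip(motion_block, n_frames_out)

In this scheme, a learned generator would produce one motion block per canonical clip before the final denormalization step.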
As indicated in linguistics literature [Kipp 2004; Neff et al. 2008; Webb 1996], gestures used in everyday conversation can be broken down into a limited number of semantic units with different
motion variations. We assume that these semantic units, usually
referred to as lexemes, relate to the high-level features of speech
audio, while the motion variations are determined by the low-level
audio features. We thus disentangle high- and low-level features from different layers of an audio encoder and learn the mappings from them to the gesture lexemes and the motion variations,
respectively. Experiments demonstrate that this mechanism suc-
cessfully disentangles multi-level features of both the speech and
motion and synthesizes semantics-matching and stylized gestures.
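As a conceptual illustration of this multi-level design, the toy encoder below exposes features from an early layer and a final layer and maps them to a discrete lexeme prediction and a continuous variation code, respectively. The architecture shown (GRU layers, dimensions, output heads) is purely illustrative and does not reflect our actual network.

# Toy example of tapping different encoder layers for multi-level speech
# features (all sizes and module choices are illustrative assumptions).
import torch
import torch.nn as nn

class MultiLevelAudioEncoder(nn.Module):
    def __init__(self, in_dim=80, hid=256, n_lexemes=64, variation_dim=8):
        super().__init__()
        self.low = nn.GRU(in_dim, hid, batch_first=True)      # early, low-level layer
        self.high = nn.GRU(hid, hid, batch_first=True)        # final, high-level layer
        self.lexeme_head = nn.Linear(hid, n_lexemes)          # high-level -> lexeme
        self.variation_head = nn.Linear(hid, variation_dim)   # low-level -> variation

    def forward(self, mel):                       # mel: (batch, frames, in_dim)
        low_feat, _ = self.low(mel)               # low-level acoustic features
        high_feat, _ = self.high(low_feat)        # high-level semantic features
        lexeme_logits = self.lexeme_head(high_feat.mean(dim=1))  # one lexeme per clip
        variation = self.variation_head(low_feat)                # per-frame variation
        return lexeme_logits, variation

# Usage: lexeme_logits, variation = MultiLevelAudioEncoder()(torch.randn(2, 120, 80))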
In summary, our main contributions in this paper are:
• We present a novel rhythm- and semantics-aware co-speech gesture synthesis system that generates natural-looking gestures. To the best of our knowledge, this is the first neural system that explicitly models both the rhythmic and semantic relations between speech and gestures.
• We develop a robust rhythm-based segmentation pipeline to ensure the temporal coherence between speech and gestures, which we find is crucial to achieving rhythmic gestures.
• We devise an effective mechanism to relate the disentangled multi-level features of both speech and motion, which enables generating gestures with convincing semantics.
2 RELATED WORK
2.1 Data-driven Human Motion Synthesis
Traditional human motion synthesis frameworks often rely on concatenative approaches such as motion graphs [Kovar et al. 2002]. Recently, learning-based methods with neural networks have been widely applied to this area to generate high-quality and interactive motions, using models ranging from feed-forward networks [Holden et al. 2017; Starke et al. 2022] to dedicated generative models [Henter et al. 2020; Ling et al. 2020]. Dealing with the one-to-many issue
where a variety of motions can correspond to the same input or con-
trol signal is often a challenge for these learning-based approaches.
Previous systems often employ additional conditions, such as contacts [Starke et al. 2020] or phase indices [Holden et al. 2017; Starke et al. 2022], to deal with this problem. Closer to the gesture domain is speech-driven head motion synthesis, where conditional GANs [Sadoughi and Busso 2018] and conditional VAEs [Greenwood et al. 2017] have been used.
2.1.1 Music-driven Dance Synthesis. Among the general motion
synthesis tasks, music-driven dance generation addresses a similar
problem to the co-speech gesture synthesis, where the complex
temporal relation between two different modalities needs to be modeled accurately. Both motion graph-based methods [Chen et al. 2021; Kim et al. 2006] and learning-based approaches [Li et al. 2021b; Siyao et al. 2022; Valle-Pérez et al. 2021] have been adopted and have achieved impressive generation results. To deal with the synchronization between dance and music, Chen et al. [2021] develop a manually labeled rhythm signature to represent beat patterns and ensure that the rhythm signatures of the generated dance match the music. Aristidou et al. [2022] segment the dance into blocks at music onsets, convert each block into a motion motif [Aristidou et al. 2018] that defines a specific cluster of motions, and use the motion motif to guide the synthesis of dance at the block level. Siyao et al. [2022] employ a reinforcement learning scheme to improve the rhythmic performance of the generator using a reward function encouraging beat alignment. Our rhythm-based segmentation and canonicalization framework is partially inspired by [Aristidou et al. 2022]. Similar to [Aristidou et al. 2022], we also segment the gestures into clips at audio beats but learn a high-level representation for each clip via the vector quantization scheme [Oord et al. 2017] instead of K-means clustering. Moreover, our framework generates gestures in blocks of motion and denormalizes the generated motion blocks to match the rhythm of the speech.
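For readers unfamiliar with the distinction, the snippet below sketches a standard vector-quantization bottleneck in the spirit of [Oord et al. 2017]; the codebook size, embedding dimension, and commitment weight are illustrative placeholders rather than our actual configuration.

# Minimal vector-quantization bottleneck (illustrative placeholders only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=512, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / n_codes, 1.0 / n_codes)
        self.beta = beta

    def forward(self, z):                             # z: (batch, dim) clip embedding
        dists = torch.cdist(z, self.codebook.weight)  # distance to every code
        idx = dists.argmin(dim=1)                     # nearest-code assignment
        z_q = self.codebook(idx)
        # Codebook loss pulls codes toward encoder outputs; commitment loss
        # keeps encoder outputs close to their assigned codes.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                  # straight-through estimator
        return z_q, idx, loss

Unlike a fixed K-means assignment, the codebook here is trained jointly with the encoder through the straight-through gradient.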