
NATURALISTIC HEAD MOTION GENERATION FROM SPEECH
Trisha Mittal∗†
Department of Computer Science
University of Maryland, College Park
Zakaria Aldeneh†, Masha Fedzechkina†,
Anurag Ranjan, Barry-John Theobald
Apple
∗Work done during an internship at Apple.
†Authors contributed equally.
ABSTRACT
Synthesizing natural head motion to accompany speech for an embodied conversational agent is necessary for providing a rich interactive experience. Most prior works assess the quality of generated head motions by comparing them against a single ground truth using an objective metric. Yet there are many plausible head motion sequences that can accompany a given speech utterance. In this work, we study the variation in the perceptual quality of head motions sampled from a generative model. We show that, despite providing more diverse head motions, the generative model produces motions with varying degrees of perceptual quality. Finally, we show that objective metrics commonly used in previous research do not accurately reflect the perceptual quality of generated head motions. These results open an interesting avenue for future work on objective metrics that better correlate with human perception of quality.
Index Terms—head motion synthesis, speech animation, audio-visual speech, perceptual study, human-computer interaction
1. INTRODUCTION
Head motion provides a rich source of non-verbal cues in human communication and social interaction. Imbuing AI-based characters and embodied conversational agents with natural head motion to accompany their speech can lead to a more engaging and immersive interactive experience and an improvement in the intelligibility of the agents' speech [1]. Studies on human interaction suggest that there is a quantifiable relationship between head motion and acoustic attributes [2]. Consequently, much work has focused on using machine learning to drive head motion from speech.
Ding et al. [3] were the first to successfully use a fully-connected deep neural network to predict head motion from acoustic features. In subsequent work [4], they improved on this model by incorporating context using bi-directional long short-term memory (BLSTM) networks. Haag and Shimodaira [5] showed that additional improvement in head motion synthesis can be obtained by appending bottleneck features to the input speech features. While these approaches differ in terms of the proposed model architectures, they are deterministic, i.e., they generate only one head motion sequence for a given speech signal.
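For concreteness, the following is a minimal sketch (assuming PyTorch; the layer sizes and feature dimensions are illustrative assumptions, not the published configurations of [3–5]) of the kind of deterministic sequence regressor these works describe:

```python
import torch
import torch.nn as nn

class BLSTMHeadMotionRegressor(nn.Module):
    """Deterministic speech-to-head-motion model (illustrative sketch).

    Maps a sequence of acoustic feature frames to a sequence of head
    pose parameters (e.g., pitch/yaw/roll Euler angles per frame).
    Sizes are assumptions, not the configurations of [3-5].
    """

    def __init__(self, n_audio_feats=40, hidden_size=128, n_pose_params=3):
        super().__init__()
        self.blstm = nn.LSTM(
            input_size=n_audio_feats,
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            bidirectional=True,  # context from past and future frames
        )
        self.head = nn.Linear(2 * hidden_size, n_pose_params)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, n_audio_feats)
        context, _ = self.blstm(audio_feats)
        return self.head(context)  # (batch, time, n_pose_params)

# The mapping has no sampling step: the same input always yields
# the identical head motion trajectory.
model = BLSTMHeadMotionRegressor()
mfccs = torch.randn(1, 200, 40)  # 200 frames of 40-dim features
poses = model(mfccs)             # (1, 200, 3)
```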
However, the correspondence between speech and head motion is a one-to-many problem; while there is a correlation between the speech signal and head motion, a speaker repeating the same utterance produces different head movements. Thus, deterministic models are unlikely to generate suitably expressive head motions for conversational agents.
Recent work has focused on non-deterministic models that can generate more than one head motion trajectory for the same speech. For example, Sadoughi and Busso [6] proposed a conditional generative adversarial network (GAN) that learns a distribution of head motions conditioned on the speech sample and generates a variety of trajectories by sampling from this distribution based on different noise values. Greenwood et al. [7] proposed a conditional variational autoencoder (CVAE) that generates a range of head motion trajectories for the same speech signal by sampling from a Gaussian distribution. While these proposals can produce more varied head motions, to the best of our knowledge, no study has evaluated the quality of the varied head motions that non-deterministic models produce for the same speech signal. Instead, previous studies either performed subjective evaluations of sample head motion sequences [6] or informally inspected the variety in the predicted values rather than conducting a more formal evaluation [7].
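To illustrate how such sampling yields diversity, here is a hedged sketch in the spirit of a speech-conditioned CVAE: several latent codes drawn from the Gaussian prior, each decoded alongside the same speech features, give several distinct trajectories. The decoder interface and sizes are hypothetical, not the architectures of [6] or [7]:

```python
import torch
import torch.nn as nn

class CVAEDecoder(nn.Module):
    """Hypothetical speech-conditioned CVAE decoder (sketch only).

    Broadcasts a latent code over time, concatenates it with the
    speech features, and decodes a head pose trajectory. All sizes
    are assumptions, not the published architectures.
    """

    def __init__(self, n_audio_feats=40, latent_dim=16, n_pose_params=3):
        super().__init__()
        self.rnn = nn.GRU(n_audio_feats + latent_dim, 128, batch_first=True)
        self.out = nn.Linear(128, n_pose_params)

    def forward(self, audio_feats, z):
        # audio_feats: (batch, time, n_audio_feats); z: (batch, latent_dim)
        z_seq = z.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        hidden, _ = self.rnn(torch.cat([audio_feats, z_seq], dim=-1))
        return self.out(hidden)  # (batch, time, n_pose_params)

decoder = CVAEDecoder()
speech = torch.randn(1, 200, 40)  # one utterance, 200 feature frames

# Different draws from the Gaussian prior give different head motion
# trajectories for the exact same speech input.
trajectories = [decoder(speech, torch.randn(1, 16)) for _ in range(5)]
```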
Additionally, objective evaluation of generated head motions is complicated by the fact that the objective measures used, e.g., mean absolute error (MAE), dynamic time warping (DTW) distance, the Pearson correlation coefficient, and the Fréchet distance (FD) [8–10], do not capture what matters for human perception of the naturalness of head motion, as they treat all errors equally [11–13]. Thus, what remains unknown from prior work is whether the varied head motion sequences produced by non-deterministic models are of consistent perceptual quality.
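To make these measures concrete, the sketch below computes each of them between a generated and a reference head pose trajectory, assumed here to be (time, 3) arrays of Euler angles; fastdtw is one common DTW implementation, and the FD shown is the distribution-level, FID-style variant. Note that every score penalizes deviation from the single reference, regardless of whether that deviation would look natural to a viewer:

```python
import numpy as np
from scipy.stats import pearsonr
from scipy.linalg import sqrtm
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw  # pip install fastdtw

# Illustrative trajectories: (time, 3) head poses (e.g., Euler angles).
rng = np.random.default_rng(0)
gen = rng.standard_normal((200, 3))  # generated head motion
ref = rng.standard_normal((200, 3))  # single ground-truth reference

# Mean absolute error: frame-wise, treats all deviations equally.
mae = np.abs(gen - ref).mean()

# DTW distance: allows temporal misalignment before comparing frames.
dtw_dist, _ = fastdtw(gen, ref, dist=euclidean)

# Pearson correlation, computed per pose dimension and averaged.
corr = np.mean([pearsonr(gen[:, d], ref[:, d])[0] for d in range(3)])

# Frechet distance between Gaussians fitted to each trajectory
# (the distribution-level variant, analogous to FID).
mu_g, mu_r = gen.mean(axis=0), ref.mean(axis=0)
cov_g = np.cov(gen, rowvar=False)
cov_r = np.cov(ref, rowvar=False)
covmean = sqrtm(cov_g @ cov_r).real  # matrix square root
fd = np.sum((mu_g - mu_r) ** 2) + np.trace(cov_g + cov_r - 2 * covmean)

print(f"MAE={mae:.3f} DTW={dtw_dist:.1f} r={corr:.3f} FD={fd:.3f}")
```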
This work aims to investigate the perceptual quality of head motion sampled from a non-deterministic generative model. We first demonstrate (both qualitatively and quantitatively) that our model generates diverse outputs that contain natural variation in head motion for the same utterance. We