NATURALISTIC HEAD MOTION GENERATION FROM SPEECH
Trisha Mittal∗†
Department of Computer Science
University of Maryland, College Park
Zakaria Aldeneh†, Masha Fedzechkina†,
Anurag Ranjan, Barry-John Theobald
Apple
ABSTRACT
Synthesizing natural head motion to accompany speech for an embodied conversational agent is necessary for providing a rich interactive experience. Most prior works assess the quality of generated head motion by comparing it against a single ground truth using an objective metric. Yet there are many plausible head motion sequences that can accompany a speech utterance. In this work, we study the variation in the perceptual quality of head motions sampled from a generative model. We show that, despite providing more diverse head motions, the generative model produces motions with varying degrees of perceptual quality. We finally show that objective metrics commonly used in previous research do not accurately reflect the perceptual quality of generated head motions. These results open an interesting avenue for future work to investigate better objective metrics that correlate with human perception of quality.
Index Terms— head motion synthesis, speech animation, audio-visual speech, perceptual study, human-computer interaction
1. INTRODUCTION
Head motion provides a rich source of non-verbal cues in human communication and social interaction. Imbuing AI-based characters and embodied conversational agents with natural head motion to accompany their speech can lead to a more engaging and immersive interactive experience and an improvement in the intelligibility of the agents' speech [1]. Studies on human interaction suggest that there is a quantifiable relationship between head motion and acoustic attributes [2]. Consequently, much work has focused on using machine learning to drive head motion from speech.
Ding et al. [3] were the first to successfully use a fully-connected deep neural network to predict head motion from acoustic features. In subsequent work [4], they improved on the previous model by incorporating context using bi-directional long short-term memory (BLSTM) networks. Haag and Shimodaira [5] showed that additional improvement in head motion synthesis can be obtained by appending bottleneck features to the input speech features. While these approaches differ in terms of the proposed model architectures, they are deterministic, i.e., they generate only one head motion sequence for a given speech signal.

∗Work done during an internship at Apple.
†Authors contributed equally.
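As an illustration of this deterministic setup, the speech-to-motion regression can be sketched as follows. This is a minimal stand-in, not the architecture of [4]: the class name, layer sizes, and feature dimensions are all invented for illustration.

```python
import torch
import torch.nn as nn

class SpeechToHeadMotion(nn.Module):
    """Toy deterministic BLSTM regressor: maps a sequence of acoustic
    features to head rotation angles (pitch/yaw/roll). All sizes are
    illustrative, not those of any published model."""
    def __init__(self, n_acoustic=26, n_hidden=64, n_pose=3):
        super().__init__()
        # Bi-directional LSTM provides left and right context per frame.
        self.blstm = nn.LSTM(n_acoustic, n_hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * n_hidden, n_pose)

    def forward(self, x):            # x: (batch, frames, n_acoustic)
        h, _ = self.blstm(x)         # h: (batch, frames, 2 * n_hidden)
        return self.head(h)          # (batch, frames, n_pose)

model = SpeechToHeadMotion()
acoustic = torch.randn(1, 200, 26)   # e.g. 200 frames of spectral features
pose = model(acoustic)               # one, and only one, trajectory
```

The key property is that repeated calls on the same input return the same trajectory, which is exactly the limitation discussed next.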
However, the correspondence between speech and head motion is a one-to-many problem; while there is a correlation between the speech signal and head motion, a speaker repeating the same utterance produces different head movements. Thus, deterministic models are unlikely to generate suitably expressive head motions for conversational agents. Recent work has focused on non-deterministic models that can generate more than one head motion trajectory for the same speech. For example, Sadoughi and Busso [6] proposed a conditional generative adversarial network (GAN) that learns a distribution of head motions conditioned on the speech sample and generates a variety of trajectories by sampling from this distribution based on different noise values. Greenwood et al. [7] proposed a conditional variational autoencoder (CVAE) that generates a range of head motion trajectories for the same speech signal by sampling from a Gaussian distribution. While these proposals can produce more varied head motions, to the best of our knowledge, no study has evaluated the quality of the variety of head motions produced by non-deterministic models for the same speech signal. Instead, previous studies either performed subjective evaluations of sample head motion sequences [6] or an informal inspection of the variety in the predicted values rather than a more formal evaluation [7].
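To make the contrast with the deterministic case concrete, the sampling step common to such models can be sketched as follows. The linear "decoder" is a hypothetical stand-in for a trained CVAE decoder, and every dimension and weight here is invented; only the sampling pattern (one speech input, many latent draws, many trajectories) reflects the approach described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented sizes: T frames, latent/speech feature dims, 3 pose angles.
T, LATENT_DIM, SPEECH_DIM, POSE_DIM = 100, 8, 16, 3

# Stand-in decoder weights (a real CVAE decoder would be a trained network).
W_z = rng.normal(scale=0.1, size=(LATENT_DIM, POSE_DIM))
W_s = rng.normal(scale=0.1, size=(SPEECH_DIM, POSE_DIM))

def decode(z, speech_feats):
    """Toy decoder: maps a latent vector z and per-frame speech features
    to a head-pose trajectory of shape (T, POSE_DIM)."""
    return speech_feats @ W_s + z @ W_z  # z is broadcast over frames

speech_feats = rng.normal(size=(T, SPEECH_DIM))  # one fixed utterance

# Non-determinism comes entirely from the latent prior: each draw of
# z ~ N(0, I) yields a different trajectory for the SAME speech input.
samples = [decode(rng.normal(size=LATENT_DIM), speech_feats)
           for _ in range(5)]
```

The open question the paper raises is precisely whether all five of these samples would be perceived as equally natural.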
Additionally, objective evaluation of the generated head motions is complicated by the fact that the objective measures used, e.g., mean absolute error (MAE), dynamic time warping (DTW) distance, Pearson correlation coefficient, and the Fréchet distance (FD) [8–10], do not consider what is important in the sense of human perception of naturalness of head motion, as they treat all errors equally [11–13]. Thus, what remains unknown from prior work is whether the variety of head motion sequences produced by non-deterministic models is of consistent perceptual quality.
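For reference, these standard objective measures can be written down directly. The sketch below (plain NumPy, treating a trajectory as a frames × pose-dimensions array) is one straightforward reading of each metric; in particular, the discrete Fréchet distance between the two curves is used here as one common interpretation of FD. Note that, as the text argues, each of these weights all frame-wise errors equally.

```python
import numpy as np

def mae(pred, target):
    """Mean absolute error between two trajectories of shape (T, D)."""
    return float(np.mean(np.abs(pred - target)))

def pearson(pred, target):
    """Mean per-dimension Pearson correlation coefficient."""
    corrs = [np.corrcoef(pred[:, d], target[:, d])[0, 1]
             for d in range(pred.shape[1])]
    return float(np.mean(corrs))

def dtw_distance(pred, target):
    """DTW distance with Euclidean local cost (standard DP recursion)."""
    n, m = len(pred), len(target)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(pred[i - 1] - target[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def discrete_frechet(pred, target):
    """Discrete Fréchet distance between the two curves."""
    n, m = len(pred), len(target)
    D = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            d = np.linalg.norm(pred[i] - target[j])
            if i == 0 and j == 0:
                D[i, j] = d
            elif i == 0:
                D[i, j] = max(D[i, j - 1], d)
            elif j == 0:
                D[i, j] = max(D[i - 1, j], d)
            else:
                D[i, j] = max(min(D[i - 1, j], D[i, j - 1],
                                  D[i - 1, j - 1]), d)
    return float(D[-1, -1])
```

Because each metric compares a generated trajectory to a single reference, a perceptually natural sample that simply differs from the reference can score poorly, which is the mismatch with human judgments discussed above.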
This work aims to investigate the perceptual quality of head motion sampled from a non-deterministic generative model. We first demonstrate (both qualitatively and quantitatively) that our model generates diverse outputs that contain natural variation in head motion for the same utterance. We
arXiv:2210.14800v1 [eess.AS] 26 Oct 2022