
NATURALISTIC HEAD MOTION GENERATION FROM SPEECH
Trisha Mittal∗†
Department of Computer Science
University of Maryland, College Park
Zakaria Aldeneh†, Masha Fedzechkina†,
Anurag Ranjan, Barry-John Theobald
Apple
∗Work done during an internship at Apple.
†Authors contributed equally.
ABSTRACT
Synthesizing natural head motion to accompany speech for an embodied conversational agent is necessary for providing a rich interactive experience. Most prior works assess the quality of generated head motions by comparing them against a single ground truth using an objective metric. Yet there are many plausible head motion sequences that can accompany a given speech utterance. In this work, we study the variation in the perceptual quality of head motions sampled from a generative model. We show that, despite providing more diverse head motions, the generative model produces motions with varying degrees of perceptual quality. Finally, we show that objective metrics commonly used in previous research do not accurately reflect the perceptual quality of generated head motions. These results open an interesting avenue for future work on objective metrics that better correlate with human perception of quality.
Index Terms—head motion synthesis, speech animation, audio-visual speech, perceptual study, human-computer interaction
1. INTRODUCTION
Head motion provides a rich source of non-verbal cues in human communication and social interaction. Imbuing AI-based characters and embodied conversational agents with natural head motion to accompany their speech can lead to a more engaging and immersive interactive experience and an improvement in the intelligibility of the agents' speech [1]. Studies on human interaction suggest that there is a quantifiable relationship between head motion and acoustic attributes [2]. Consequently, much work has focused on using machine learning to drive head motion from speech.
Ding et al. [3] were the first to successfully use a fully-connected deep neural network to predict head motion from acoustic features. In subsequent work [4], they improved on this model by incorporating context using bi-directional long short-term memory (BLSTM) networks. Haag and Shimodaira [5] showed that additional improvement in head motion synthesis can be obtained by appending bottleneck features to the input speech features. While these approaches differ in terms of the proposed model architectures, they are deterministic, i.e., they generate only one head motion sequence for a given speech signal.
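For concreteness, the following is a minimal sketch (assuming PyTorch; the layer sizes and feature dimensions are illustrative assumptions, not the published configurations of [3–5]) of the kind of deterministic sequence regressor these works describe:

```python
import torch
import torch.nn as nn

class BLSTMHeadMotionRegressor(nn.Module):
    """Deterministic speech-to-head-motion model (illustrative sketch).

    Maps a sequence of acoustic feature frames to a sequence of head
    pose parameters (e.g., pitch/yaw/roll Euler angles per frame).
    Sizes are assumptions, not the configurations of [3-5].
    """

    def __init__(self, n_audio_feats=40, hidden_size=128, n_pose_params=3):
        super().__init__()
        self.blstm = nn.LSTM(
            input_size=n_audio_feats,
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            bidirectional=True,  # context from past and future frames
        )
        self.head = nn.Linear(2 * hidden_size, n_pose_params)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, n_audio_feats)
        context, _ = self.blstm(audio_feats)
        return self.head(context)  # (batch, time, n_pose_params)

# The mapping has no sampling step: the same input always yields
# the identical head motion trajectory.
model = BLSTMHeadMotionRegressor()
mfccs = torch.randn(1, 200, 40)  # 200 frames of 40-dim features
poses = model(mfccs)             # (1, 200, 3)
```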
However, the correspondence between speech and head motion is a one-to-many problem; while there is a correlation between the speech signal and head motion, a speaker repeating the same utterance produces different head movements. Thus, deterministic models are unlikely to generate suitably expressive head motions for conversational agents.
Recent work has focused on non-deterministic models that can generate more than one head motion trajectory for the same speech. For example, Sadoughi and Busso [6] proposed a conditional generative adversarial network (GAN) that learns a distribution of head motions conditioned on the speech sample and generates a variety of trajectories by sampling from this distribution based on different noise values. Greenwood et al. [7] proposed a conditional variational autoencoder (CVAE) that generates a range of head motion trajectories for the same speech signal by sampling from a Gaussian distribution. While these proposals can produce more varied head motions, to the best of our knowledge, no study has evaluated the quality of the varied head motions that non-deterministic models produce for the same speech signal. Instead, previous studies either performed subjective evaluations of sample head motion sequences [6] or informally inspected the variety in the predicted values rather than conducting a more formal evaluation [7].
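To illustrate how such sampling yields diversity, here is a hedged sketch in the spirit of a speech-conditioned CVAE: several latent codes drawn from the Gaussian prior, each decoded alongside the same speech features, give several distinct trajectories. The decoder interface and sizes are hypothetical, not the architectures of [6] or [7]:

```python
import torch
import torch.nn as nn

class CVAEDecoder(nn.Module):
    """Hypothetical speech-conditioned CVAE decoder (sketch only).

    Broadcasts a latent code over time, concatenates it with the
    speech features, and decodes a head pose trajectory. All sizes
    are assumptions, not the published architectures.
    """

    def __init__(self, n_audio_feats=40, latent_dim=16, n_pose_params=3):
        super().__init__()
        self.rnn = nn.GRU(n_audio_feats + latent_dim, 128, batch_first=True)
        self.out = nn.Linear(128, n_pose_params)

    def forward(self, audio_feats, z):
        # audio_feats: (batch, time, n_audio_feats); z: (batch, latent_dim)
        z_seq = z.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        hidden, _ = self.rnn(torch.cat([audio_feats, z_seq], dim=-1))
        return self.out(hidden)  # (batch, time, n_pose_params)

decoder = CVAEDecoder()
speech = torch.randn(1, 200, 40)  # one utterance, 200 feature frames

# Different draws from the Gaussian prior give different head motion
# trajectories for the exact same speech input.
trajectories = [decoder(speech, torch.randn(1, 16)) for _ in range(5)]
```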
Additionally, objective evaluation of generated head motions is complicated by the fact that the objective measures used, e.g., mean absolute error (MAE), dynamic time warping (DTW) distance, the Pearson correlation coefficient, and the Fréchet distance (FD) [8–10], do not capture what matters for human perception of the naturalness of head motion, as they treat all errors equally [11–13]. Thus, what remains unknown from prior work is whether the varied head motion sequences produced by non-deterministic models are of consistent perceptual quality.
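To make these measures concrete, the sketch below computes each of them between a generated and a reference head pose trajectory, assumed here to be (time, 3) arrays of Euler angles; fastdtw is one common DTW implementation, and the FD shown is the distribution-level, FID-style variant. Note that every score penalizes deviation from the single reference, regardless of whether that deviation would look natural to a viewer:

```python
import numpy as np
from scipy.stats import pearsonr
from scipy.linalg import sqrtm
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw  # pip install fastdtw

# Illustrative trajectories: (time, 3) head poses (e.g., Euler angles).
rng = np.random.default_rng(0)
gen = rng.standard_normal((200, 3))  # generated head motion
ref = rng.standard_normal((200, 3))  # single ground-truth reference

# Mean absolute error: frame-wise, treats all deviations equally.
mae = np.abs(gen - ref).mean()

# DTW distance: allows temporal misalignment before comparing frames.
dtw_dist, _ = fastdtw(gen, ref, dist=euclidean)

# Pearson correlation, computed per pose dimension and averaged.
corr = np.mean([pearsonr(gen[:, d], ref[:, d])[0] for d in range(3)])

# Frechet distance between Gaussians fitted to each trajectory
# (the distribution-level variant, analogous to FID).
mu_g, mu_r = gen.mean(axis=0), ref.mean(axis=0)
cov_g = np.cov(gen, rowvar=False)
cov_r = np.cov(ref, rowvar=False)
covmean = sqrtm(cov_g @ cov_r).real  # matrix square root
fd = np.sum((mu_g - mu_r) ** 2) + np.trace(cov_g + cov_r - 2 * covmean)

print(f"MAE={mae:.3f} DTW={dtw_dist:.1f} r={corr:.3f} FD={fd:.3f}")
```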
This work aims to investigate the perceptual quality of head motion sampled from a non-deterministic generative model. We first demonstrate (both qualitatively and quantitatively) that our model generates diverse outputs that contain natural variation in head motion for the same utterance. We