
A Keypoint Based Enhancement Method for Audio
Driven Free View Talking Head Synthesis
Yichen Han, Ya Li, Yingming Gao, Jinlong Xue
School of Artificial Intelligence
Beijing University of Posts and Telecommunications
Beijing, China
adelacvgaoiro@bupt.edu.cn, yli01@bupt.edu.cn,
yingming.gao@outlook.com, jinlong_xue@bupt.edu.cn
Songpo Wang, Lei Yang
DeepScience Tech Ltd.
Beijing, China
wangsongpo@deepscience.cn, yanglei@deepscience.cn
Abstract—Audio-driven talking head synthesis is a challenging task that has attracted increasing attention in recent years. Although existing methods based on 2D landmarks or 3D face models can synthesize accurate lip synchronization and rhythmic head poses for arbitrary identities, they still have limitations, such as a visible seam around the mapped mouth region and the loss of skin highlights: the morphed region appears blurry compared with the surrounding face. We propose a Keypoint Based Enhancement (KPBE) method for audio-driven free view talking head synthesis that improves the naturalness of the generated video. First, an existing method is used as the backend to synthesize intermediate results. Then, keypoint decomposition extracts the video synthesis control parameters from the backend output and the source image. After that, the control parameters are composited into the source keypoints and the driving keypoints. Finally, a motion-field-based method generates the final image from the keypoint representation. With the keypoint representation, we overcome the seam around the mapped mouth region and the loss of skin highlights. Experiments show that our proposed enhancement method improves the quality of talking-head videos in terms of mean opinion score.
Index Terms—talking head generation, speech driven animation
I. INTRODUCTION
In many applications, such as virtual reality, digital humans, video conferencing, and visual dubbing, one-shot audio-driven talking head synthesis is an important component. Early research relied on motion capture performed by artists, which was labor-intensive and time-consuming and restricted its use to film and games [5], [23]. In recent years, significant progress has been made in this area, and a number of deep learning methods [17], [4], [26], [11], [18], [8], [6], [12] have been proposed to learn the mapping from audio to expression. For example, Wav2Lip [12] uses an end-to-end framework and synthesizes the lower half of the face. Many methods use 2D facial landmarks [2], [16], [4] or a 3D head model [1], [18], [15], [3], [21], [13], [8] as an intermediate representation. Because 2D facial landmarks carry no depth information, most methods based on them [26], [16] do not support viewpoint editing.
To achieve more flexible manipulation, neural rendering methods have been proposed. They are identity-independent and can change the viewpoint in latent space. Neural radiance fields (NeRF) [6] have been introduced to avoid additional intermediate representations. However, controlling the head pose and the expression at the same time remains difficult. To overcome this limitation, keypoint-based methods have been proposed [19], [14]. They can precisely render details such as hair and sunglasses, which is not possible for 3D-based methods, and they can change the viewpoint, which is not possible for 2D-based methods.
To overcome the seam around the mapped mouth region and the loss of skin highlights, we propose a Keypoint Based Enhancement (KPBE) method for audio-driven free view talking head synthesis. Our approach consists of a backend and a frontend. The backend is model-free: any existing audio-driven method can be used, taking audio and a source image as input and synthesizing intermediate results. The frontend contains five modules: a canonical keypoint estimator, an appearance feature estimator, a head pose and expression estimator, a motion field estimator, and a generator. The canonical keypoint estimator estimates identity-specific keypoints for different images. The appearance feature estimator extracts appearance features such as skin texture and eye color. The head pose and expression estimator extracts the head pose and the expression from the backend output; specifically, the head pose is represented by a rotation matrix and a translation vector, and the expression is parameterized by deformation vectors, one per canonical keypoint. The motion field estimator composites the motion vectors in 3D space, where each motion vector is obtained from a pair of keypoints extracted from the source image and the driving image. The generator produces the final video from the appearance features and the composited motion field. With this workflow, the appearance features preserve the skin highlights, and the generator avoids the seam around the mapped mouth region. In addition, the viewpoint can be changed with a user-defined head pose matrix.
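To make the keypoint decomposition above concrete, the following minimal sketch composes posed, expressive keypoints from canonical keypoints, a head-pose rotation matrix, a translation vector, and per-keypoint expression deformations, following the decomposition popularized by keypoint-based methods such as [19]. All names and shapes are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def compose_keypoints(canonical_kp, R, t, expression_delta):
    # canonical_kp:     (K, 3) identity-specific canonical 3D keypoints
    # R:                (3, 3) head-pose rotation matrix
    # t:                (3,)   head-pose translation vector
    # expression_delta: (K, 3) per-keypoint expression deformations
    # Each keypoint is rotated, translated, and deformed: x = R x_c + t + delta.
    return canonical_kp @ R.T + t + expression_delta

# Source keypoints take pose/expression from the source image; driving
# keypoints reuse the same canonical keypoints but take pose and expression
# from the backend output (or from a user-defined pose for free-view control).
K = 20
x_canonical = np.random.randn(K, 3)
R_src, t_src, d_src = np.eye(3), np.zeros(3), np.zeros((K, 3))
R_drv, t_drv, d_drv = np.eye(3), np.array([0.0, 0.0, 0.1]), 0.05 * np.random.randn(K, 3)

x_source = compose_keypoints(x_canonical, R_src, t_src, d_src)
x_driving = compose_keypoints(x_canonical, R_drv, t_drv, d_drv)
# The (x_source, x_driving) pairs parameterize the motion field that warps the
# appearance features before the generator renders the final frame.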
The contributions of our work are twofold:
• A keypoint-based, model-free enhancement method for audio-driven talking head synthesis is proposed, which composites the head pose and the lip motion naturally.
• The head pose can be manipulated with a user-defined rotation matrix and translation vector (see the sketch after this list).
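As a small illustration of the second contribution, a user-defined head pose is simply a rotation matrix and a translation vector substituted for the estimated driving pose before the keypoints are composed. The helper below builds the rotation from Euler angles; the Z-Y-X axis convention is an assumption made for illustration, as the paper does not specify one.

import numpy as np

def rotation_from_euler(yaw, pitch, roll):
    # Build a rotation matrix from Euler angles (radians), applied in Z-Y-X order.
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return Rz @ Ry @ Rx

# Rotate the head by 20 degrees of yaw while keeping the driven expression:
R_user = rotation_from_euler(np.deg2rad(20.0), 0.0, 0.0)
t_user = np.zeros(3)
# x_view = compose_keypoints(x_canonical, R_user, t_user, d_drv)  # reuses the earlier sketch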
The rest of the paper is organized as follows. Section II