A Keypoint Based Enhancement Method for Audio
Driven Free View Talking Head Synthesis
Yichen Han, Ya Li, Yingming Gao, Jinlong Xue
School of Artificial Intelligence
Beijing University of Posts and Telecommunications
Beijing, China
adelacvgaoiro@bupt.edu.cn, yli01@bupt.edu.cn,
yingming.gao@outlook.com, jinlong xue@bupt.edu.cn
Songpo Wang, Lei Yang
DeepScience Tech Ltd.
Beijing, China
wangsongpo@deepscience.cn, yanglei@deepscience.cn
Abstract—Audio-driven talking head synthesis is a challenging task that has attracted increasing attention in recent years. Although existing methods based on 2D landmarks or 3D face models can synthesize accurate lip synchronization and rhythmic head poses for arbitrary identities, they still have limitations, such as a cut feeling in the mouth mapping and a lack of skin highlights, and the morphed region is blurry compared with the surrounding face. We propose a Keypoint Based Enhancement (KPBE) method for audio-driven free-view talking head synthesis to improve the naturalness of the generated video. First, an existing method is used as the backend to synthesize intermediate results. Keypoint decomposition then extracts video-synthesis control parameters from the backend output and the source image, and these control parameters are combined into source keypoints and driving keypoints. Finally, a motion-field-based method generates the final image from the keypoint representation. With the keypoint representation, we overcome the cut feeling in the mouth mapping and the lack of skin highlights. Experiments show that our enhancement method improves the quality of talking-head videos in terms of mean opinion score.
Index Terms—talking head generation, speech-driven animation
I. INTRODUCTION
In many applications, such as virtual reality, digital humans, video conferencing, and visual dubbing, one-shot audio-driven talking head synthesis is an important component. Early research relied on motion capture performed by artists and could only be used in film and games, which was labor-intensive and time-consuming [5] [23]. In recent years, significant progress has been made in this area, and a number of deep learning methods have been proposed [17] [4] [26] [11] [18] [8] [6] [12] to learn the mapping from audio to expression. For example, Wav2Lip [12] uses an end-to-end framework and synthesizes the lower half of the face. Many methods use 2D facial landmarks [2] [16] [4] or a 3D head model [1] [18] [15] [3] [21] [13] [8] as an intermediate representation. Because 2D facial landmarks do not contain depth information, most methods that use them [26] [16] do not support viewpoint editing.
To achieve more flexible manipulation, neural rendering methods have been proposed. They are identity-independent and can change the viewpoint in latent space. Neural radiance fields (NeRF) [6] have been proposed to avoid additional intermediate representations. However, controlling the head pose and the expression at the same time remains difficult. To overcome this limitation, keypoint-based methods have been proposed [19] [14]. They can precisely render hair and sunglasses, which is not possible for 3D-based methods, and can change the viewpoint, which is not possible for 2D-based methods.
To overcome the cut feeling in the mouth mapping and the lack of skin highlights, we propose a Keypoint Based Enhancement (KPBE) method for audio-driven free-view talking head synthesis. Our approach consists of a backend and a frontend. The backend is model-free: taking audio and a source image as input, an existing backend method synthesizes intermediate results. The frontend contains five modules: a canonical keypoint estimator, an appearance feature estimator, a head pose and expression estimator, a motion field estimator, and a generator. The canonical keypoint estimator estimates customized keypoints for different images. The appearance feature estimator extracts appearance features such as skin and eye color. The head pose and expression estimator extracts the head pose and expression from the backend output; specifically, the head pose is determined by a rotation matrix and a translation vector, and the expression is parameterized by one vector per canonical keypoint. The motion field estimator composites the motion vectors in 3D space, where each motion vector is obtained from a pair of keypoints extracted from the source image and the driving image. The generator produces the final video from the appearance features and the composited motion field. With this workflow, the appearance features preserve the skin highlights, and the generator avoids the cut feeling in the mouth mapping. In addition, we can change the viewpoint through a user-defined head pose matrix.
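To make the keypoint parameterization concrete, the NumPy sketch below shows one way the source and driving keypoints can be composed from the canonical keypoints, a head pose, and per-keypoint expression deformations. The additive form follows the 3D keypoint decomposition of [19] and is our assumption about the exact composition, not a quote of this paper's equations.

```python
import numpy as np

def compose_keypoints(k_canonical, R, t, expression):
    """Apply a head pose (R, t) and per-keypoint expression deformations
    to the identity-specific canonical keypoints.

    k_canonical: (K, 3) canonical keypoints (identity only)
    R:           (3, 3) head-pose rotation matrix
    t:           (3,)   head-pose translation vector
    expression:  (K, 3) one deformation vector per canonical keypoint
    """
    return k_canonical @ R.T + t + expression

K = 20
k_c = np.random.randn(K, 3)              # canonical keypoints from the source image

# Source keypoints: pose and expression estimated from the source image.
R_s, t_s = np.eye(3), np.zeros(3)
exp_s = 0.05 * np.random.randn(K, 3)
k_src = compose_keypoints(k_c, R_s, t_s, exp_s)

# Driving keypoints: expression from the backend output, pose either from a
# pose-driving video or a user-defined rotation/translation (free-view control).
R_d, t_d = np.eye(3), np.array([0.0, 0.0, 0.1])
exp_d = 0.05 * np.random.randn(K, 3)
k_drv = compose_keypoints(k_c, R_d, t_d, exp_d)

# Each (k_src[i], k_drv[i]) pair drives one component of the 3D motion field
# (shown as a simple difference here; the actual motion field estimator is learned).
motion_vectors = k_drv - k_src
```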
The contributions of our work are twofold:
• A keypoint-based, model-free enhancement method for audio-driven talking head synthesis is proposed, which composites the head pose and the lip motion naturally.
• The head pose can be manipulated by a user-defined rotation matrix and translation vector.
The rest of the paper is organized as follows. Section II
[Fig. 1 block diagram: a model-free backend method plus the frontend modules (canonical keypoint estimator, appearance feature extractor, and head pose & expression estimators) applied to the source image, the backend output, and the head-pose driving video, producing k_{c,i}, f_s, (R_s, t_s, exp_s), (R_d, t_d, exp_b), and the transformed keypoints k_{s,i}, k_{d,i}.]
Fig. 1. Keypoint decomposition. The appearance features, 3D canonical keypoints, head pose, and expression are extracted from the source image. Meanwhile, the expression is extracted from the backend output, and the head pose is extracted from the head-pose driving video. Applying the corresponding head pose and expression to the canonical keypoints, we obtain the source keypoints and the driving keypoints.
introduces related work on talking head synthesis. Section III presents the architecture of our method and the details of each part. Experiments and results are shown in Section IV. Section V concludes the paper.
II. RELATED WORKS
Audio-driven talking head synthesis. Driving a talking head with audio is the task of synchronizing the image frames of a video with arbitrary audio. There are two lines of work: 3D-based and 2D-based. Early 3D-based methods attempted to build the relationship between audio features and lip motions by hand [23] [1], so they required a domain expert. A well-known person-specific 3D-based work that does not rely on an expert is [16]. The authors generated talking-head videos of President Obama and focused on synthesizing natural head poses and accurate lip motion. However, the method needs a large number of videos of a specific person. 2D-based methods [7] [12] can achieve identity-independent expression synthesis by replacing part of the face. Recent 2D-based methods can also synthesize head pose using facial landmarks [24] [22], and some achieve real-time synthesis [20]. The latest 3D-based methods can perform identity-independent synthesis [3]; however, they cannot generate realistic hair, because it is difficult for a network to learn such high-polygon models. All these methods have only limited capabilities for manipulating head pose and viewpoint.
Neural rendering based talking head synthesis. Sitzmann et al. [25] used neural networks to represent the 3D shape or appearance of scenes, sampling point sets in space to represent the appearance of an object. Siarohin et al. [14] originally presented a method that estimates the warping between sparse keypoints and derives dense motion fields from it. Wang et al. [19] used 3D keypoint warping to overcome the shortcomings of the previous method. We use a similar idea for keypoint decomposition in our method.
Recently, Neural Radiance Fields (NeRF) [9] have achieved impressive results in neural rendering tasks by transforming 3D appearance features into ray-sampling results over a volume. AD-NeRF [6] applied this idea to talking head synthesis. Neural rendering methods achieve free-view head pose synthesis and can learn depth information from 2D images. We use a similar framework for extracting 3D appearance features.
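For context, the ray-sampling step underlying such renderers is the standard discrete volume-rendering quadrature. The NumPy sketch below is a generic illustration of that step, not the AD-NeRF implementation:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite one camera ray from N samples (NeRF-style volume rendering).

    sigmas: (N,)   volume densities at the sampled points
    colors: (N, 3) RGB values predicted at the sampled points
    deltas: (N,)   distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                           # per-segment opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))    # transmittance to each sample
    weights = trans * alphas                                          # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                    # composited pixel color

pixel = render_ray(np.full(64, 0.5), np.random.rand(64, 3), np.full(64, 0.03))
```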
III. METHOD
We propose a Keypoint Based Enhancement (KPBE) method for audio-driven free-view talking head synthesis. The synthesis framework consists of two steps: keypoint decomposition and generation from the keypoint representation, as illustrated in Fig. 1 and Fig. 2, respectively. Specifically, keypoint decomposition contains two parts. One is the backend, which synthesizes the intermediate video from the audio. The other is the keypoint-based frontend, which enhances the backend output; using keypoint decomposition, it extracts source keypoints and driving keypoints from the inputs. In the generation step, we take the keypoints obtained above as input and use the motion-field-based generator to synthesize the final image.
A. Audio driven backend
In this paper, we use PC-AVS and Wav2Lip as our backend methods (indicated by the yellow block in Fig. 1). The driving audio and the source image are used as the input of the backend model. The output of the backend is the intermediate video, which contains the lip motion information. We extract the head pose and expression from this video for the enhancement step.
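As a rough illustration of how an off-the-shelf backend slots in, the wrapper below shells out to a lip-sync inference script to produce the intermediate video. The flags shown mirror the public Wav2Lip repository and may differ across versions (PC-AVS ships its own demo script with different arguments), so treat this as a sketch rather than our exact tooling.

```python
import subprocess

def run_backend(source_image: str, driving_audio: str,
                out_path: str = "results/intermediate.mp4") -> str:
    """Run an existing audio-driven backend (here: a Wav2Lip-style script)
    to obtain the intermediate, lip-synced video used by the frontend."""
    subprocess.run(
        ["python", "inference.py",
         "--checkpoint_path", "checkpoints/wav2lip.pth",  # pretrained backend weights
         "--face", source_image,                          # source identity image
         "--audio", driving_audio,                        # driving speech
         "--outfile", out_path],
        check=True)
    # The frontend then runs the head pose & expression estimator on each
    # frame of this video to obtain the driving expression parameters.
    return out_path
```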
B. Keypoint-based enhancement frontend
The frontend contains five modules (indicated by the blue blocks in Fig. 1): the canonical keypoint estimator, the appearance feature estimator, the head pose and expression estimator, the motion field estimator, and the generator.
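To suggest how the first two estimators can be interfaced, the PyTorch stubs below sketch plausible input/output shapes; the backbone layers, keypoint count, and the unconstrained rotation output are our assumptions, not the authors' architecture. The appearance feature extractor, motion field estimator, and generator follow the same pattern with image- or volume-shaped outputs.

```python
import torch
import torch.nn as nn

NUM_KP = 20  # number of canonical keypoints; an illustrative choice

class CanonicalKeypointEstimator(nn.Module):
    """Source image -> K identity-only 3D canonical keypoints k_{c,i}."""
    def __init__(self, num_kp: int = NUM_KP):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(64, num_kp * 3)
        self.num_kp = num_kp

    def forward(self, img: torch.Tensor) -> torch.Tensor:  # img: (B, 3, H, W)
        return self.head(self.backbone(img)).view(-1, self.num_kp, 3)

class HeadPoseExpressionEstimator(nn.Module):
    """Image or backend frame -> head pose (R, t) and per-keypoint expression."""
    def __init__(self, num_kp: int = NUM_KP):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rot = nn.Linear(64, 9)             # rotation matrix entries (a real model
                                                # would constrain the output to SO(3))
        self.trans = nn.Linear(64, 3)           # translation vector
        self.expr = nn.Linear(64, num_kp * 3)   # one deformation vector per keypoint
        self.num_kp = num_kp

    def forward(self, img: torch.Tensor):
        h = self.backbone(img)
        R = self.rot(h).view(-1, 3, 3)
        t = self.trans(h)
        expr = self.expr(h).view(-1, self.num_kp, 3)
        return R, t, expr

img = torch.randn(1, 3, 256, 256)
k_c = CanonicalKeypointEstimator()(img)            # (1, NUM_KP, 3)
R, t, expr = HeadPoseExpressionEstimator()(img)    # (1, 3, 3), (1, 3), (1, NUM_KP, 3)
```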
1) Canonical keypoint estimator: Using the canonical keypoint estimator, we extract canonical keypoints from the source image. Note that the canonical keypoints are associated only with the identity and are not affected by head pose or expression. These extracted keypoints should encode a person's head geometry in a neutral expression and pose.