
A Keypoint Based Enhancement Method for Audio
Driven Free View Talking Head Synthesis
Yichen Han, Ya Li, Yingming Gao, Jinlong Xue
School of Artificial Intelligence
Beijing University of Posts and Telecommunications
Beijing, China
adelacvgaoiro@bupt.edu.cn, yli01@bupt.edu.cn,
yingming.gao@outlook.com, jinlong_xue@bupt.edu.cn
Songpo Wang, Lei Yang
DeepScience Tech Ltd.
Beijing, China
wangsongpo@deepscience.cn, yanglei@deepscience.cn
Abstract—Audio-driven talking head synthesis is a challenging task that has attracted increasing attention in recent years. Although existing methods based on 2D landmarks or 3D face models can synthesize accurate lip synchronization and rhythmic head poses for arbitrary identities, they still have limitations, such as a visible seam around the mapped mouth region and the loss of skin highlights: the morphed region appears blurry compared with the surrounding face. We propose a Keypoint Based Enhancement (KPBE) method for audio-driven free view talking head synthesis that improves the naturalness of the generated video. First, an existing method is used as the backend to synthesize intermediate results. Then, keypoint decomposition extracts the video synthesis control parameters from the backend output and the source image. After that, the control parameters are composited into the source keypoints and the driving keypoints. Finally, a motion-field-based method generates the final image from the keypoint representation. With the keypoint representation, we overcome the seam around the mapped mouth region and the loss of skin highlights. Experiments show that our proposed enhancement method improves the quality of talking-head videos in terms of mean opinion score.
Index Terms—talking head generation, speech driven animation
I. INTRODUCTION
In many applications, such as virtual reality, digital humans, video conferencing, and visual dubbing, one-shot audio-driven talking head synthesis is an important component. Early research relied on motion capture performed by artists, which was labor-intensive and time-consuming and restricted its use to film and games [5], [23]. In recent years, significant progress has been made in this area, and a number of deep learning methods [17], [4], [26], [11], [18], [8], [6], [12] have been proposed to learn the mapping from audio to expression. For example, Wav2Lip [12] uses an end-to-end framework and synthesizes the lower half of the face. Many methods use 2D facial landmarks [2], [16], [4] or a 3D head model [1], [18], [15], [3], [21], [13], [8] as an intermediate representation. Because 2D facial landmarks carry no depth information, most methods based on them [26], [16] do not support viewpoint editing.
To achieve more flexible manipulation, neural rendering methods have been proposed. They are identity-independent and can change the viewpoint in latent space. Neural radiance fields (NeRF) [6] have been introduced to avoid additional intermediate representations. However, controlling the head pose and the expression at the same time remains difficult. To overcome this limitation, keypoint-based methods have been proposed [19], [14]. They can precisely render details such as hair and sunglasses, which is not possible for 3D-based methods, and they can change the viewpoint, which is not possible for 2D-based methods.
To overcome the seam around the mapped mouth region and the loss of skin highlights, we propose a Keypoint Based Enhancement (KPBE) method for audio-driven free view talking head synthesis. Our approach consists of a backend and a frontend. The backend is model-free: any existing audio-driven method can be used, taking audio and a source image as input and synthesizing intermediate results. The frontend contains five modules: a canonical keypoint estimator, an appearance feature estimator, a head pose and expression estimator, a motion field estimator, and a generator. The canonical keypoint estimator estimates identity-specific keypoints for different images. The appearance feature estimator extracts appearance features such as skin texture and eye color. The head pose and expression estimator extracts the head pose and the expression from the backend output; specifically, the head pose is represented by a rotation matrix and a translation vector, and the expression is parameterized by deformation vectors, one per canonical keypoint. The motion field estimator composites the motion vectors in 3D space, where each motion vector is obtained from a pair of keypoints extracted from the source image and the driving image. The generator produces the final video from the appearance features and the composited motion field. With this workflow, the appearance features preserve the skin highlights, and the generator avoids the seam around the mapped mouth region. In addition, the viewpoint can be changed with a user-defined head pose matrix.
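To make the keypoint decomposition above concrete, the following minimal sketch composes posed, expressive keypoints from canonical keypoints, a head-pose rotation matrix, a translation vector, and per-keypoint expression deformations, following the decomposition popularized by keypoint-based methods such as [19]. All names and shapes are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def compose_keypoints(canonical_kp, R, t, expression_delta):
    # canonical_kp:     (K, 3) identity-specific canonical 3D keypoints
    # R:                (3, 3) head-pose rotation matrix
    # t:                (3,)   head-pose translation vector
    # expression_delta: (K, 3) per-keypoint expression deformations
    # Each keypoint is rotated, translated, and deformed: x = R x_c + t + delta.
    return canonical_kp @ R.T + t + expression_delta

# Source keypoints take pose/expression from the source image; driving
# keypoints reuse the same canonical keypoints but take pose and expression
# from the backend output (or from a user-defined pose for free-view control).
K = 20
x_canonical = np.random.randn(K, 3)
R_src, t_src, d_src = np.eye(3), np.zeros(3), np.zeros((K, 3))
R_drv, t_drv, d_drv = np.eye(3), np.array([0.0, 0.0, 0.1]), 0.05 * np.random.randn(K, 3)

x_source = compose_keypoints(x_canonical, R_src, t_src, d_src)
x_driving = compose_keypoints(x_canonical, R_drv, t_drv, d_drv)
# The (x_source, x_driving) pairs parameterize the motion field that warps the
# appearance features before the generator renders the final frame.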
The contributions of our work are twofold:
• A keypoint-based, model-free enhancement method for audio-driven talking head synthesis is proposed, which composites the head pose and the lip motion naturally.
• The head pose can be manipulated with a user-defined rotation matrix and translation vector (see the sketch after this list).
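As a small illustration of the second contribution, a user-defined head pose is simply a rotation matrix and a translation vector substituted for the estimated driving pose before the keypoints are composed. The helper below builds the rotation from Euler angles; the Z-Y-X axis convention is an assumption made for illustration, as the paper does not specify one.

import numpy as np

def rotation_from_euler(yaw, pitch, roll):
    # Build a rotation matrix from Euler angles (radians), applied in Z-Y-X order.
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return Rz @ Ry @ Rx

# Rotate the head by 20 degrees of yaw while keeping the driven expression:
R_user = rotation_from_euler(np.deg2rad(20.0), 0.0, 0.0)
t_user = np.zeros(3)
# x_view = compose_keypoints(x_canonical, R_user, t_user, d_drv)  # reuses the earlier sketch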
The rest of the paper is organized as follows. Section II