head generation models are good at generating high-quality lip-sync; however, they struggle to handle non-verbal cues. Video-driven methods rely heavily on disentangling motion from appearance [17].
These methods generally use keypoints as an intermediate representation [12, 29, 39] and try to align the detected keypoints of the source and driving frames. They learn keypoints in an unsupervised manner and fail to focus on specific regions of the face, which stems from inadequate priors regarding the face structure or the uttered speech. The final generation quality also suffers from the use of a basic CNN-based decoder that fails to preserve the sharpness of the source image and produces blurry output video.
video. As a part of this work, we provide a detailed review
of different approaches in Section 2.
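To make the keypoint alignment described above concrete, the following minimal sketch illustrates a common relative transfer scheme: the displacement of the driving keypoints from an initial driving frame is re-applied on top of the source keypoints before warping. The function name, tensor shapes, and toy usage are illustrative assumptions, not the exact formulation of any cited work.

```python
import torch

def relative_keypoint_transfer(kp_source, kp_driving, kp_driving_initial):
    """Re-apply driving-frame motion on top of the source keypoints.

    All inputs are (B, K, 2) tensors holding K normalized 2-D keypoints.
    Motion is measured as the displacement of the driving keypoints from
    an initial driving frame, so identity-specific geometry stays with
    the source while pose and expression follow the driving video.
    """
    return kp_source + (kp_driving - kp_driving_initial)


# Toy usage with random keypoints standing in for detector outputs.
kp_s = torch.rand(1, 10, 2)
kp_d = torch.rand(1, 10, 2)
kp_d0 = torch.rand(1, 10, 2)
aligned = relative_keypoint_transfer(kp_s, kp_d, kp_d0)  # (1, 10, 2)
```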
In this paper, we analyze the shortcomings of current works and design key modules in our network to address them. We introduce
Audio-Visual Face Reenactment GAN (AVFR-GAN), a
novel architecture that uses both audio and visual cues to
generate highly realistic face reenactments. We start by providing additional priors about the structure of the face in the form of a face segmentation mask and a face mesh. We
also provide corresponding speech to our algorithm to help
it attend to the mouth region and improve lip synchroniza-
tion. Finally, our pipeline uses a novel identity-aware face
generator to improve the final outputs. Our approach gener-
ates superior results compared to the current state-of-the-art
works, as shown in Section 4. We comprehensively evaluate
our method against several baselines and report the quanti-
tative performance on multiple standard metrics. We also conduct human evaluations to assess qualitative results in the same section. Our proposed method opens up a host of applications, as discussed in Section 6, including the compression of video calls: our work achieves more than a 7× improvement in visual quality when tested at the same compression level as the recently released H.266 [7] codec.
Our contributions are summarized as follows:
1. We use additional priors in the form of a face mesh and a face segmentation mask to preserve the geometry of the face.
2. We utilize additional input in the form of audio to im-
prove the generation quality of the mouth region. Audio
also helps to preserve lip synchronization, enhancing the
viewing experience.
3. We design a novel, carefully crafted identity-aware face generator that produces high-quality talking head videos, in contrast to the heavily blurred outputs of previous works.
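Purely for illustration, the sketch below shows one way the conditioning inputs listed above (face mesh, segmentation mask, and speech) could be packaged before being passed to visual and audio encoders. The tensor layout, the 80-bin log-mel representation, and the helper itself are assumptions made for exposition and do not describe the exact AVFR-GAN pipeline; the mesh landmarks and the mask are assumed to come from off-the-shelf detectors.

```python
import numpy as np
import torch
import librosa

def build_conditioning(image, seg_mask, mesh_xy, wav, sr=16000):
    """Assemble illustrative audio-visual conditioning inputs.

    image    : (3, H, W) float tensor in [0, 1], the source frame
    seg_mask : (H, W) float tensor, face segmentation mask in [0, 1]
               (assumed to come from any off-the-shelf face parser)
    mesh_xy  : (K, 2) array of face-mesh landmarks in pixel coordinates
               (assumed to come from any off-the-shelf mesh detector)
    wav, sr  : driving speech window and its sample rate
    """
    _, H, W = image.shape

    # Rasterize the face mesh into a sparse single-channel heatmap.
    xs = torch.as_tensor(mesh_xy[:, 0]).round().long().clamp(0, W - 1)
    ys = torch.as_tensor(mesh_xy[:, 1]).round().long().clamp(0, H - 1)
    mesh_map = torch.zeros(1, H, W)
    mesh_map[0, ys, xs] = 1.0

    # Visual branch input: RGB frame + segmentation mask + mesh heatmap.
    visual = torch.cat([image, seg_mask.unsqueeze(0), mesh_map], dim=0)  # (5, H, W)

    # Audio branch input: 80-bin log-mel spectrogram of the speech window.
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)
    audio = torch.from_numpy(np.log(mel + 1e-6)).float()  # (80, T)

    return visual, audio
```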
2. Related Work
Talking head generation works can be broadly classified into three categories based on the type of input they use to drive the generation: Text-driven [16, 33, 36], Audio-driven [9, 13, 18, 31, 37, 43, 45], and Video-driven [12, 27, 29, 39, 44] talking head generation.
Text-driven Talking-head Generation Text-driven nat-
ural image generation [25, 26] has recently seen a lot of
progress in the computer vision community. Inspired by
the recent success of GANs in generating static faces from
text [38], Li et al. [16] proposed a method that uses text to drive the animation parameters of the mouth, upper face, and head. Txt2Vid [33] converts spoken language and facial webcam data into text and transmits only the text, achieving low-bandwidth video conferencing through talking head generation. However, this method relies heavily on re-synthesized speech, which alters the original speaker's voice, prosody, and head movements in the video call. It also depends on the quality of the speech-to-text module, which introduces grammatical errors and language dependence. Text as a medium carries
very little information about the head and lip movements;
thus, we consider the problem ill-posed.
Audio-driven Talking-head Generation Since text-driven methods suffer from a lack of adequate priors, we now move on to audio, a much more expressive and informative form of input. As the name suggests,
audio-driven methods [9, 13, 18, 31, 37, 43, 45] use only
audio to animate a static face image. Early works such as You-said-that? [9], LipGAN [15], and Wav2Lip [24] achieved lip synchronization with the given audio but failed to generate head movements in sync with the speech. These works used fully convolutional architectures and generated a single frame at a time without considering temporal constraints. Later, a different class of works, starting with Song et al. [31] in 2018 and Zhou et al. [43] in 2019, used conditional recurrent neural networks to model the temporal characteristics of a talking face. In
2020, Zhou et al. [45] published a landmark work that pre-
dicted dense flow from audio instead of directly generating
the output video. The dense flow was then used to warp
the source image to generate the final output. Other well-known works such as Emotional Video Portraits [13] take an additional emotion label as input to generate the talking head with the desired emotion. However, all of these works lack fine-grained control over the talking head and often produce repetitive, loopy head motion, and thus cannot be used directly in many applications.
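The dense-flow warping step mentioned above is conceptually simple: the predicted flow offsets a sampling grid that is then used to resample the source image. A minimal PyTorch sketch, with assumed tensor shapes rather than the cited authors' exact implementation, is shown below.

```python
import torch
import torch.nn.functional as F

def warp_with_dense_flow(source, flow):
    """Warp a source image with a predicted dense flow field.

    source : (B, 3, H, W) source image
    flow   : (B, H, W, 2) per-pixel displacements in normalized [-1, 1] coords
    """
    B, _, H, W = source.shape
    # Identity sampling grid in normalized coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=source.device),
        torch.linspace(-1, 1, W, device=source.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    # Offset the grid by the predicted flow and bilinearly resample the source.
    return F.grid_sample(source, grid + flow, align_corners=True)
```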
Video-driven Talking-head Generation Finally, we
move to video-driven methods, which use a driving video
to get the motion and other facial features required to reen-
act a source image. Note that the driving video and the source image may not share the same identity. Owing to the significant priors available in the driving video, the final generation quality of video-driven methods surpasses that of