Audio-Visual Face Reenactment
Madhav Agarwal
IIIT, Hyderabad
Rudrabha Mukhopadhyay
IIIT, Hyderabad
Vinay Namboodiri
University of Bath
C V Jawahar
IIIT, Hyderabad
{madhav.agarwal,radrabha.m}@research.iiit.ac.in, vpn22@bath.ac.uk, jawahar@iiit.ac.in
Figure 1: We propose AVFR-GAN, a novel method for face reenactment. Our network takes a source identity, a driving frame, and a small audio chunk associated with the driving frame, and animates the source identity according to the driving frame. Our network generates highly realistic outputs compared to previous works such as [29] and [30]: its results contain significantly fewer artifacts and handle mouth and eye movements more faithfully.
Abstract
This work proposes a novel method to generate realistic talking head videos using audio and visual streams. We animate a source image by transferring head motion from a driving video using a dense motion field generated from learnable keypoints. We improve the quality of lip sync using audio as an additional input, helping the network attend to the mouth region. We add further priors in the form of face segmentation and a face mesh to improve the structure of the reconstructed faces. Finally, we improve the visual quality of the generations by incorporating a carefully designed identity-aware generator module, which takes the source image and the warped motion features as input to generate a high-quality output with fine-grained details. Our method produces state-of-the-art results and generalizes well to unseen faces, languages, and voices. We comprehensively evaluate our approach using multiple metrics and outperform the current techniques both qualitatively and quantitatively. Our work opens up several applications, including low-bandwidth video calls. We release a demo video and additional information at http://cvit.iiit.ac.in/research/projects/cvit-projects/avfr.
1. Introduction
Imagine your favorite celebrity giving daily news updates, motivating you to work out, or interacting with you on your mobile phone! What if a movie director could reenact an actor's image without actually recording the actor? Or, how about skilled content creators animating avatars in a metaverse to follow an actor's head movements and expressions in great detail? We can also reduce zoom fatigue [11] by animating a well-dressed image of ourselves in a video call without transmitting a live video stream! These ideas seem fictitious, infeasible, and not scalable. But how about animating, or "reenacting", a single image of any person according to a driving video of someone else? Face reenactment thus opens up many opportunities in a world that is becoming increasingly digital with each passing day.
Face reenactment aims to animate a source image using a driving video's motion while preserving the source identity. Multiple publications have improved the quality of the generations. Existing works on talking head generation generally use a single modality, i.e., either visual [12, 29, 39, 40] or audio features [13, 37, 31]. Audio-driven talking head generation models are good at generating quality lip-sync; however, they have a serious drawback in handling non-verbal cues. The video-driven methods rely heavily on the disentanglement of motion from appearance [17]. These methods generally use keypoints as an intermediate representation [29, 12, 39] and try to align the detected keypoints of the source and driving frames. They learn keypoints in an unsupervised manner and fail to focus on specific regions of the face, which stems from inadequate priors regarding the face structure or the uttered speech. The final quality of the generations also suffers from a basic CNN-based decoder that fails to capture the sharpness present in the source image and produces blurred output video. As a part of this work, we provide a detailed review of different approaches in Section 2.
arXiv:2210.02755v1 [cs.CV] 6 Oct 2022
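As a concrete illustration of the keypoint-alignment idea described above, the sketch below shows a relative-motion transfer in the style of first-order keypoint methods; the function name, array shapes, and toy values are our own assumptions, not code from any of the cited works.

```python
import numpy as np

def transfer_motion(src_kp, drv_kp, drv_init_kp):
    """Move source keypoints by the motion the driving face has made
    relative to its own first frame (a common relative formulation).

    src_kp:      (K, 2) keypoints detected on the source image
    drv_kp:      (K, 2) keypoints on the current driving frame
    drv_init_kp: (K, 2) keypoints on the first driving frame
    """
    # The source keeps its own face shape; only the driver's displacement
    # is transferred, which helps preserve the source identity.
    return src_kp + (drv_kp - drv_init_kp)

# Toy example: the driver moved every keypoint 0.1 to the right.
src = np.zeros((5, 2))
drv0 = np.full((5, 2), 0.5)
drv = drv0 + np.array([0.1, 0.0])
aligned = transfer_motion(src, drv, drv0)   # every row becomes [0.1, 0.0]
```

In a full system these aligned keypoints would condition a dense motion network; the point here is only that absolute driver positions are never copied onto the source.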
In this paper, we analyze the shortcomings of the current works and add key modules to our network. We introduce Audio-Visual Face Reenactment GAN (AVFR-GAN), a novel architecture that uses both audio and visual cues to generate highly realistic face reenactments. We start by providing additional priors about the structure of the face in the form of a face segmentation mask and a face mesh. We also provide the corresponding speech to our algorithm to help it attend to the mouth region and improve lip synchronization. Finally, our pipeline uses a novel identity-aware face generator to improve the final outputs. Our approach generates superior results compared to the current state-of-the-art works, as shown in Section 4. We comprehensively evaluate our method against several baselines, report quantitative performance on multiple standard metrics, and perform human evaluations of the qualitative results in the same section. Our proposed method opens a host of applications, as discussed in Section 6, including video-call compression: our work achieves more than 7× improvement in visual quality when tested at the same compression levels as the recently released H.266 [7] codec.
Our contributions are summarized as follows:
1. We use additional priors in the form of a face mesh and a face segmentation mask to preserve the geometry of the face.
2. We utilize additional input in the form of audio to improve the generation quality of the mouth region. Audio also helps to preserve lip synchronization, enhancing the viewing experience.
3. We build a novel, carefully designed identity-aware face generator to produce high-quality talking head videos, in contrast to the high levels of blur present in previous works.
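The three contributions above combine into a single pipeline. The sketch below is a hypothetical, heavily simplified rendering of that data flow: every function is a placeholder for a learned module, and none of the names come from the paper's actual implementation.

```python
# Placeholder modules standing in for learned networks (illustrative only).
def face_parser(img):            # structural prior 1: segmentation mask
    return {"mask_of": img}

def mesh_extractor(img):         # structural prior 2: face mesh
    return {"mesh_of": img}

def keypoint_detector(img):      # unsupervised learnable keypoints
    return {"kp_of": img}

def dense_motion(kp_src, kp_drv, audio):
    # Audio guides the flow around the mouth region (contribution 2).
    return {"flow": (kp_src, kp_drv, audio)}

def identity_aware_generator(src_img, flow, mask, mesh):
    # Warps source features by the flow; the priors constrain face
    # geometry while the generator restores fine-grained detail.
    return {"frame": (src_img, flow, mask, mesh)}

def avfr_gan_step(src_img, drv_frame, audio_chunk):
    mask = face_parser(src_img)
    mesh = mesh_extractor(src_img)
    kp_s = keypoint_detector(src_img)
    kp_d = keypoint_detector(drv_frame)
    flow = dense_motion(kp_s, kp_d, audio_chunk)
    return identity_aware_generator(src_img, flow, mask, mesh)

out = avfr_gan_step("source.png", "driving_f0.png", "audio_chunk_0")
```

The design point this sketch tries to capture is that the structural priors and the audio enter the pipeline before generation, shaping the motion field, rather than being applied as a post-hoc correction.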
2. Related Work
Talking head generation works can be broadly classified into three categories based on the type of input they use: Text-driven [16, 33, 36], Audio-driven [9, 13, 18, 31, 37, 43, 45], and Video-driven [12, 27, 29, 39, 44] talking head generation.
Text-driven Talking-head Generation. Text-driven natural image generation [25, 26] has recently seen a lot of progress in the computer vision community. Inspired by the recent success of GANs in generating static faces from text [38], Li et al. [16] proposed a method that uses text to drive the animation parameters of the mouth, upper face, and head. Txt2Vid [33] converts the spoken language and facial webcam data into text and transmits it to achieve low-bandwidth video conferencing using talking head generation. However, this method relies heavily on the generated speech, altering the original speaker's voice, prosody, and head movements in the video call. It also depends on the quality of the Speech-to-Text module, which introduces grammatical errors and language dependency. Text as a medium carries very little information about head and lip movements; thus, we consider the problem ill-posed.
Audio-driven Talking-head Generation. While text-driven methods suffer from a significant lack of adequate priors, audio is a much more expressive and informative form of input. As the name suggests, audio-driven methods [9, 13, 18, 31, 37, 43, 45] use only audio to animate a static face image. The first set of works, like You-said-that? [9], LipGAN [15], and Wav2Lip [24], achieved lip synchronization with the given audio but failed to generate head movements in sync with the speech. These works used fully convolutional architectures and generated a single frame at a time without considering temporal constraints. Eventually, a different class of works, starting from Song et al. [31] in 2018 and Zhou et al. [43] in 2019, started using conditional Recurrent Neural Networks to model the temporal characteristics of a talking face. In 2020, Zhou et al. [45] published a landmark work that predicted dense flow from audio instead of directly generating the output video; the dense flow was then used to warp the source image into the final output. Other well-known works, like Emotional Video Portraits [13], add an emotion label as input to create the talking head with the desired emotion. However, all of these works lack fine-grained control of the talking head and often contain loopy head motion, and thus cannot be directly used in many applications.
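The "predict a dense flow, then warp the source image" step attributed to [45] above can be sketched as follows. This toy version uses nearest-neighbour sampling on a grayscale array for brevity, whereas real systems use differentiable bilinear sampling on feature maps; the function name and conventions here are illustrative, not any paper's API.

```python
import numpy as np

def warp_image(img, flow):
    """Warp an image with a dense backward flow field.

    img:  (H, W) grayscale image
    flow: (H, W, 2) per-output-pixel (row, col) offset into img
    """
    H, W = img.shape
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # For each output pixel, look up the source pixel it should copy
    # from, clipping at the image border (nearest-neighbour sampling).
    src_r = np.clip(np.round(rows + flow[..., 0]).astype(int), 0, H - 1)
    src_c = np.clip(np.round(cols + flow[..., 1]).astype(int), 0, W - 1)
    return img[src_r, src_c]

# Toy example: shift the whole image one pixel to the left by sampling
# each output pixel from one column to its right.
img = np.arange(16, dtype=float).reshape(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 1] = 1.0
out = warp_image(img, flow)
```

Because the warp only rearranges pixels already present in the source, a generator stage is still needed afterwards to synthesize disoccluded regions and restore sharpness.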
Video-driven Talking-head Generation. Finally, we move to video-driven methods, which use a driving video to obtain the motion and other facial features required to reenact a source image. Note that the driving video and the source image may not share the same identity. Owing to the significant priors in the driving video, the final generation quality of video-driven methods surpasses those of