head generation models are good at generating high-quality lip-sync; however, they struggle to handle non-verbal cues. Video-driven methods rely heavily on disentangling motion from appearance [17].
These methods generally use keypoints as an intermediate representation [12, 29, 39] and try to align the detected keypoints of the source and driving frames. They learn keypoints in an unsupervised manner and fail to focus on specific regions of the face, which stems from inadequate priors regarding the face structure or the uttered speech. The final generation quality also suffers from the use of a basic CNN-based decoder that fails to preserve the sharpness of the source image and produces blurry output video.
video. As a part of this work, we provide a detailed review
of different approaches in Section 2.
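To make the keypoint alignment described above concrete, the following minimal sketch illustrates a common relative transfer scheme: the displacement of the driving keypoints from an initial driving frame is re-applied on top of the source keypoints before warping. The function name, tensor shapes, and toy usage are illustrative assumptions, not the exact formulation of any cited work.

```python
import torch

def relative_keypoint_transfer(kp_source, kp_driving, kp_driving_initial):
    """Re-apply driving-frame motion on top of the source keypoints.

    All inputs are (B, K, 2) tensors holding K normalized 2-D keypoints.
    Motion is measured as the displacement of the driving keypoints from
    an initial driving frame, so identity-specific geometry stays with
    the source while pose and expression follow the driving video.
    """
    return kp_source + (kp_driving - kp_driving_initial)


# Toy usage with random keypoints standing in for detector outputs.
kp_s = torch.rand(1, 10, 2)
kp_d = torch.rand(1, 10, 2)
kp_d0 = torch.rand(1, 10, 2)
aligned = relative_keypoint_transfer(kp_s, kp_d, kp_d0)  # (1, 10, 2)
```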
In this paper, we analyze the shortcomings of current works and design key modules in our network to address them. We introduce
Audio-Visual Face Reenactment GAN (AVFR-GAN), a
novel architecture that uses both audio and visual cues to
generate highly realistic face reenactments. We start by providing additional priors about the structure of the face in the form of a face segmentation mask and a face mesh. We
also provide corresponding speech to our algorithm to help
it attend to the mouth region and improve lip synchroniza-
tion. Finally, our pipeline uses a novel identity-aware face
generator to improve the final outputs. Our approach gener-
ates superior results compared to the current state-of-the-art
works, as shown in Section 4. We comprehensively evaluate
our method against several baselines and report the quanti-
tative performance on multiple standard metrics. We also conduct human evaluations to assess qualitative results in the same section. Our proposed method opens up a host of applications, as discussed in Section 6, including the compression of video calls: our work achieves more than a 7× improvement in visual quality when tested at the same compression level as the recently released H.266 [7] codec.
Our contributions are summarized as follows:
1. We use additional priors in the form of a face mesh and a face segmentation mask to preserve the geometry of the face.
2. We utilize additional input in the form of audio to im-
prove the generation quality of the mouth region. Audio
also helps to preserve lip synchronization, enhancing the
viewing experience.
3. We design a novel, carefully crafted identity-aware face generator that produces high-quality talking head videos, in contrast to the heavily blurred outputs of previous works.
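Purely for illustration, the sketch below shows one way the conditioning inputs listed above (face mesh, segmentation mask, and speech) could be packaged before being passed to visual and audio encoders. The tensor layout, the 80-bin log-mel representation, and the helper itself are assumptions made for exposition and do not describe the exact AVFR-GAN pipeline; the mesh landmarks and the mask are assumed to come from off-the-shelf detectors.

```python
import numpy as np
import torch
import librosa

def build_conditioning(image, seg_mask, mesh_xy, wav, sr=16000):
    """Assemble illustrative audio-visual conditioning inputs.

    image    : (3, H, W) float tensor in [0, 1], the source frame
    seg_mask : (H, W) float tensor, face segmentation mask in [0, 1]
               (assumed to come from any off-the-shelf face parser)
    mesh_xy  : (K, 2) array of face-mesh landmarks in pixel coordinates
               (assumed to come from any off-the-shelf mesh detector)
    wav, sr  : driving speech window and its sample rate
    """
    _, H, W = image.shape

    # Rasterize the face mesh into a sparse single-channel heatmap.
    xs = torch.as_tensor(mesh_xy[:, 0]).round().long().clamp(0, W - 1)
    ys = torch.as_tensor(mesh_xy[:, 1]).round().long().clamp(0, H - 1)
    mesh_map = torch.zeros(1, H, W)
    mesh_map[0, ys, xs] = 1.0

    # Visual branch input: RGB frame + segmentation mask + mesh heatmap.
    visual = torch.cat([image, seg_mask.unsqueeze(0), mesh_map], dim=0)  # (5, H, W)

    # Audio branch input: 80-bin log-mel spectrogram of the speech window.
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)
    audio = torch.from_numpy(np.log(mel + 1e-6)).float()  # (80, T)

    return visual, audio
```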
2. Related Work
Talking head generation works can be broadly classified into three categories based on the type of input they use to drive the generation: Text-driven [16, 33, 36], Audio-driven [9, 13, 18, 31, 37, 43, 45], and Video-driven [12, 27, 29, 39, 44] talking head generation.
Text-driven Talking-head Generation Text-driven nat-
ural image generation [25, 26] has recently seen a lot of
progress in the computer vision community. Inspired by
the recent success of GANs in generating static faces from
text [38], Li et al. [16] proposed a method that uses text to drive the animation parameters of the mouth, upper face, and head. Txt2Vid [33] converts spoken language and facial webcam data into text and transmits only the text, achieving low-bandwidth video conferencing through talking head generation. However, this method relies heavily on re-synthesized speech, which alters the original speaker's voice, prosody, and head movements in the video call. It also depends on the quality of the speech-to-text module, which introduces grammatical errors and language dependence. Text as a medium carries
very little information about the head and lip movements;
thus, we consider the problem ill-posed.
Audio-driven Talking-head Generation Since text-driven methods suffer from a lack of adequate priors, we now move on to audio, a much more expressive and informative form of input. As the name suggests,
audio-driven methods [9, 13, 18, 31, 37, 43, 45] use only
audio to animate a static face image. Early works such as You-said-that? [9], LipGAN [15], and Wav2Lip [24] achieved lip synchronization with the given audio but failed to generate head movements in sync with the speech. These works used fully convolutional architectures and generated a single frame at a time without considering temporal constraints. Later, a different class of works, starting with Song et al. [31] in 2018 and Zhou et al. [43] in 2019, used conditional recurrent neural networks to model the temporal characteristics of a talking face. In
2020, Zhou et al. [45] published a landmark work that pre-
dicted dense flow from audio instead of directly generating
the output video. The dense flow was then used to warp
the source image to generate the final output. Other well-known works such as Emotional Video Portraits [13] take an additional emotion label as input to generate the talking head with the desired emotion. However, all of these works lack fine-grained control over the talking head and often produce repetitive, loopy head motion, and thus cannot be used directly in many applications.
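The dense-flow warping step mentioned above is conceptually simple: the predicted flow offsets a sampling grid that is then used to resample the source image. A minimal PyTorch sketch, with assumed tensor shapes rather than the cited authors' exact implementation, is shown below.

```python
import torch
import torch.nn.functional as F

def warp_with_dense_flow(source, flow):
    """Warp a source image with a predicted dense flow field.

    source : (B, 3, H, W) source image
    flow   : (B, H, W, 2) per-pixel displacements in normalized [-1, 1] coords
    """
    B, _, H, W = source.shape
    # Identity sampling grid in normalized coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=source.device),
        torch.linspace(-1, 1, W, device=source.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    # Offset the grid by the predicted flow and bilinearly resample the source.
    return F.grid_sample(source, grid + flow, align_corners=True)
```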
Video-driven Talking-head Generation Finally, we
move to video-driven methods, which use a driving video
to get the motion and other facial features required to reen-
act a source image. Note that the driving video and the source image may not share the same identity. Owing to the significant priors available in the driving video, the final generation quality of video-driven methods surpasses that of