The contributions of our work can be summarized as follows:
• We present an animatable 3D-aware face GAN with expression variation disentangled through our proposed expression-driven 3D deformation field.
• We introduce a set of novel 3D-space imitative losses that align our expression generation with a prior 3D parametric face model, yielding accurate and semantically meaningful control of facial expression.
• We show that our method can generate realistic face videos with high visual quality. We believe it opens a new door to face video synthesis and photorealistic virtual avatar creation.
2 Related work
Neural implicit scene representations.
Neural implicit functions have been used in numerous works [42, 34, 60, 59, 35, 39] to represent 3D scenes in a continuous domain. Among them, NeRF [35, 3] stands out for modeling complex scenes with detailed appearance and for synthesizing multi-view images with strict 3D consistency; we recall its rendering formulation below. The original NeRF and most of its successors [43, 33, 45, 49, 48, 40, 32, 73] focus on learning scene-specific representations from a set of posed images or a video sequence of a static or dynamic scene. A few methods [55, 10, 38, 76, 22, 12, 9] have explored the generative modeling task using unstructured 2D images as training data. A very recent method, GRAM [12], constrains radiance field learning to a set of 2D manifolds and shows promising results for high-quality, multi-view-consistent image generation. Our method is also built upon the radiance manifolds representation [12] for high-quality face image generation and animation.
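As background for the rendering schemes discussed below, NeRF [35] represents a scene as a continuous function mapping a 3D point $\mathbf{x}$ and a viewing direction $\mathbf{d}$ to a density $\sigma$ and a color $\mathbf{c}$, and renders the pixel for a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ with the standard volume rendering integral (restated here from [35] for completeness):
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right),$$
where $T(t)$ is the accumulated transmittance and $[t_n, t_f]$ are the near and far bounds; in practice the integral is approximated by compositing densely sampled points along each ray.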
3D-aware generative model for image synthesis.
Generative adversarial networks [21, 26, 27] are widely used for realistic 2D image synthesis. Recent methods [36, 62, 37, 31, 38, 10, 14, 22, 12, 9, 41, 72] extend GANs to 3D-aware image synthesis by incorporating 3D knowledge into their generators. This enables them to learn multiview image generation given only unconstrained 2D image collections as supervision. For example, HoloGAN [36] utilizes a 3D CNN to generate low-resolution voxel features and projects them to 2D feature maps for further neural rendering. GRAF [55] adopts a generative radiance field as the scene representation and generates images via volumetric rendering [35]. GRAM [12] further sparsifies the radiance field into a set of radiance manifolds and leverages manifold rendering [78, 12] for more realistic image generation; we sketch this compositing scheme below. Although 3D-aware GANs are able to control camera viewpoints, they cannot achieve fine-grained control over the generated instances, such as the shapes and expressions of human faces. This prevents them from being used for multiview face animation generation tasks. In this work, we introduce imitative learning of 3DMM into a 3D-aware GAN framework to achieve expression-controllable face image generation.
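To make the contrast between volumetric and manifold rendering concrete, below is a minimal sketch of the compositing step used when radiance is restricted to K surface manifolds. It assumes the ray-manifold intersection depths have already been found by a separate root-finding step, and radiance_net is a hypothetical stand-in for the generator's radiance field; this illustrates the idea, and is not the actual GRAM [12] implementation.

```python
import torch

def composite_on_manifolds(ray_o, ray_d, ts, radiance_net):
    """Front-to-back alpha compositing at K ray-manifold intersections.

    ray_o, ray_d: (B, 3) ray origins and directions.
    ts:           (B, K) intersection depths, sorted front to back
                  (assumed precomputed by ray-manifold root finding).
    radiance_net: callable mapping (B, K, 3) points to a color (B, K, 3)
                  and an opacity (B, K) -- a hypothetical stand-in.
    """
    # Evaluate radiance only at the K intersection points, instead of
    # at dense stratified samples as in volumetric rendering.
    pts = ray_o[:, None, :] + ts[..., None] * ray_d[:, None, :]  # (B, K, 3)
    rgb, alpha = radiance_net(pts)
    # Transmittance: fraction of light reaching intersection i unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha[:, :-1]], dim=1),
        dim=1)
    weights = trans * alpha                        # (B, K) compositing weights
    return (weights[..., None] * rgb).sum(dim=1)   # (B, 3) pixel colors
```

Compared with the dense sampling required by the volume integral above, radiance is evaluated only at the K intersection points, confining radiance field learning to a small set of surfaces while preserving strict multi-view consistency.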
Face image synthesis with 3D morphable model guidance.
3D Morphable Models (3DMMs) [5, 47, 7] play an important role in face image and animation synthesis [66, 28, 75, 54]. Earlier works [65, 66] directly render textured meshes represented by a 3DMM using a traditional rendering pipeline for face reenactment. However, these methods often produce over-smooth textures and cannot generate non-face regions due to the limitations of the 3DMM. Later works [28, 18, 15, 75, 8] apply refinement networks on top of the rendered 3DMM images to generate more realistic texture details and fill in the missing regions. Nevertheless, these methods still rely on a graphics rendering procedure to synthesize coarse images at inference time, which complicates the whole system. Recently, DiscoFaceGAN [11] proposed an imitative-contrastive learning scheme that enforces the generative network to mimic the rendering process of a 3DMM. After training, it can directly generate realistic face images and control multiple face attributes in a disentangled manner via the learned 2D GAN. Some concurrent and follow-up works [64, 30, 20, 51] share a similar spirit. However, these methods often encounter 3D-inconsistency issues when controlling camera poses due to the non-physical image rendering process of 2D GANs. Some recent methods [16, 24] incorporate 3DMM knowledge into NeRF's training process to achieve animatable face image synthesis. However, they do not leverage a generative training paradigm [21] and require video sequences or multiview images as supervision. By contrast, our method can be trained with only unstructured 2D images.
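For reference, the 3DMM used by the above methods describes facial geometry as a linear statistical model; in standard notation (ours, for illustration), the shape of a face is
$$\mathbf{S} = \bar{\mathbf{S}} + \mathbf{B}_{id}\,\boldsymbol{\alpha} + \mathbf{B}_{exp}\,\boldsymbol{\beta},$$
where $\bar{\mathbf{S}}$ is the mean shape, $\mathbf{B}_{id}$ and $\mathbf{B}_{exp}$ are identity and expression bases, and the low-dimensional coefficients $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ provide the disentangled, semantically meaningful control signals that these methods, and ours, transfer to a generative model.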
Face editing and animation with GANs.
A large number of works adopt GANs for face image manipulation and animation synthesis [50, 2, 19, 57, 52, 69, 58, 56, 63, 1, 67, 70, 46, 17, 54, 71, 77].