AniFaceGAN: Animatable 3D-Aware Face Image
Generation for Video Avatars
Yue Wu1  Yu Deng2  Jiaolong Yang3†  Fangyun Wei3  Qifeng Chen1  Xin Tong3
1HKUST 2Tsinghua University 3Microsoft Research
Abstract
Although 2D generative models have made great progress in face image generation and animation, they often suffer from undesirable artifacts such as 3D inconsistency when rendering images from different camera viewpoints. This prevents them from synthesizing video animations indistinguishable from real ones. Recently, 3D-aware GANs extend 2D GANs for explicit disentanglement of camera pose by leveraging 3D scene representations. These methods can preserve the 3D consistency of the generated images well across different views, yet they cannot achieve fine-grained control over other attributes, among which facial expression control is arguably the most useful and desirable for face animation. In this paper, we propose an animatable 3D-aware GAN for multiview-consistent face animation generation. The key idea is to decompose the 3D representation of the 3D-aware GAN into a template field and a deformation field, where the former represents different identities with a canonical expression, and the latter characterizes expression variations of each identity. To achieve meaningful control over facial expressions via deformation, we propose a 3D-level imitative learning scheme between the generator and a parametric 3D face model during adversarial training of the 3D-aware GAN. This helps our method achieve high-quality animatable face image generation with strong visual 3D consistency, even though it is trained with only unstructured 2D images. Extensive experiments demonstrate our superior performance over prior works. Project page: https://yuewuhkust.github.io/AniFaceGAN/
1 Introduction
Face image synthesis and animation have been a longstanding task in computer vision and computer
graphics with a wide range of applications such as virtual avatars and video conferencing. Remarkable
progress has been achieved in recent years with a large volume of methods proposed [50, 2, 19, 57, 52, 69, 58, 56, 63, 1, 67, 70, 46, 77]. This progress hinges on a number of advances in machine learning, among which generative adversarial networks (GANs) are arguably the core underpinning.
However, most existing face GANs are based on 2D convolutional neural networks (CNNs) and do
not model the underlying 3D facial geometry. When synthesizing faces under different poses and
expressions, the results cannot maintain strict 3D consistency. Consequently, these methods can
be used for interactive face manipulation but are not suitable for high-quality face video generation
and animation. To alleviate this problem, some methods incorporate 3D priors into the generation process [11, 64, 30, 20, 51] for better 3D rigging over the 2D images. Perhaps the most relevant
work to ours is DiscoFaceGAN [11], which considers an unconditional and disentangled generative
modeling setup as we do. Still, exact 3D consistency cannot be guaranteed by the aforementioned
Work done when YW and YD were interns at Microsoft Research.
†JY is the corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.06465v1 [cs.CV] 12 Oct 2022
Figure 1: Random virtual persons generated by our method under different expressions and head
poses. Note the high texture consistency across poses and expressions (see more animated examples
in our accompanying video).
approaches, and their 2D CNN generators often lead to temporal artifacts such as flickering and
texture sticking [25], which are undesirable for realistic video avatars.
Recently, a number of 3D-aware GANs have been proposed that incorporate 3D representations [37, 31, 38, 10, 14, 22, 12, 9, 41] to achieve disentangled control of 3D pose. Among them, methods that generate
neural radiance fields (NeRF) [35] have demonstrated striking image synthesis results [10, 22, 12, 9].
Owing to its 3D field modeling and volumetric rendering scheme, the NeRF representation is capable of producing realistic images with strong 3D consistency across different views, making it suitable for high-quality 3D scene synthesis. Nevertheless, these methods lack control over attributes beyond camera pose and thus cannot be directly applied to face animation synthesis tasks.
In this paper, we present AniFaceGAN, an animatable 3D-aware face generation method for synthesizing realistic face images with controllable pose and expression sequences. Our method is
an unconditional generative model that can generate novel, non-existing identities and is trained on
unstructured 2D face image collections without any 3D or multiview data. To achieve animation, we
leverage separate latent representations for identity and expression in the generator, and attain explicit
controllability by incorporating priors of a 3D parametric face model. To ensure geometry and texture
consistency under expression change, we leverage 3D deformation to derive the desired expressions.
Although deformation has been used in some recent NeRF methods [53, 44, 16, 45], these works mostly focused on modeling single dynamic scenes from videos. How to learn deformation in a generative setting from unstructured 2D images, and how to achieve explicit and accurate expression control through deformation with unsupervised learning, remain underexplored.
Our AniFaceGAN generates two 3D fields for face image rendering: a template radiance field for
modeling the geometry and appearance of a generated identity and an expression-driven deformation
field for animation. The former is based on the recent generative radiance manifold (GRAM) approach
that shows state-of-the-art 3D-aware image generation quality [12]. To learn the desired deformation, we incorporate a 3D face morphable model (3DMM) [5, 47] into adversarial training and enforce our deformation field to imitate it under expression variations. In contrast to previous methods such as [11], which impose imitation on 2D images, we propose a set of 3D-space losses defined on both facial geometry and expression deformation. We train our method on the FFHQ dataset [26] and show that it can generate high-quality and 3D-consistent images of virtual subjects across different poses and expressions (an example is shown in Fig. 1).
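To make the two-field design concrete, the following is a minimal PyTorch sketch of how per-ray sample points could be routed through an expression-driven deformation field and a canonical template field before rendering. The module names, MLP sizes, and latent dimensions are illustrative assumptions for exposition only; the actual generator builds on GRAM's radiance manifold representation [12], so this is a simplified stand-in rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Expression-driven deformation field: predicts a 3D offset for each sample point."""
    def __init__(self, code_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 2 * code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # offset that warps the point toward the canonical template space
        )

    def forward(self, x, z_id, z_exp):
        # x: (N, 3) observation-space points; z_id, z_exp: (N, code_dim) latent codes
        return self.mlp(torch.cat([x, z_id, z_exp], dim=-1))


class TemplateRadianceField(nn.Module):
    """Template field: radiance of an identity under the canonical expression."""
    def __init__(self, code_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB color + volume density
        )

    def forward(self, x_canonical, z_id):
        out = self.mlp(torch.cat([x_canonical, z_id], dim=-1))
        rgb = torch.sigmoid(out[..., :3])
        sigma = torch.relu(out[..., 3:])
        return rgb, sigma


def query_radiance(x, z_id, z_exp, deform_field, template_field):
    """Warp observation-space points into the canonical space, then query the template."""
    delta = deform_field(x, z_id, z_exp)          # expression-dependent deformation
    rgb, sigma = template_field(x + delta, z_id)  # template radiance at the warped points
    return rgb, sigma
```

In a full pipeline, the returned colors and densities would then be composited along each ray (e.g., with the manifold rendering of [12]) to form the image passed to the discriminator.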
The contributions of our work can be summarized as follows:
• We present an animatable 3D-aware face GAN with expression variation disentangled through our proposed expression-driven 3D deformation field.
• We introduce a set of novel 3D-space imitative losses, which align our expression generation with a prior 3D parametric face model to gain accurate and semantically meaningful control of facial expression (see the schematic sketch after this list).
• We show that our method can generate realistic face videos with high visual quality. We believe it opens a new door to face video synthesis and photorealistic virtual avatar creation.
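As a rough illustration of the second point, a 3D-space imitative loss could penalize the discrepancy between the deformation predicted by the generator and the vertex displacements implied by a 3DMM for the target expression. The sketch below matches the interface of the deformation field sketched above; the vertex sampling and the exact penalty are assumptions for exposition and do not reproduce the paper's actual loss definitions.

```python
import torch

def imitative_deformation_loss(deform, z_id, z_exp, mm_verts_neutral, mm_verts_expr):
    """Hypothetical 3D-space imitation loss (illustrative only): push the generator's
    deformation field to reproduce the vertex displacements that a 3DMM predicts
    between a target expression and the neutral (canonical) shape.

    mm_verts_neutral: (V, 3) 3DMM vertices under the canonical (neutral) expression
    mm_verts_expr:    (V, 3) 3DMM vertices under the expression encoded by z_exp
    z_id, z_exp:      (code_dim,) latent codes of the generated identity / expression
    """
    v = mm_verts_expr                                  # sample points on the expressive 3DMM surface
    z_id_v = z_id.expand(v.shape[0], -1)               # broadcast latent codes to every vertex
    z_exp_v = z_exp.expand(v.shape[0], -1)

    pred_offset = deform(v, z_id_v, z_exp_v)           # generator's warp toward canonical space
    target_offset = mm_verts_neutral - mm_verts_expr   # 3DMM-derived warp toward the neutral shape

    return torch.mean(torch.abs(pred_offset - target_offset))
```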
2 Related work
Neural implicit scene representations.
Neural implicit functions have been used in numerous
works [42, 34, 60, 59, 35, 39] to represent 3D scenes in a continuous domain. Among them,
NeRF [35, 3] shows its superiority at modeling complex scenes with detailed appearance and synthesizing multi-view images with strict 3D consistency. The original NeRF and most of its
successors [43, 33, 45, 49, 48, 40, 32, 73] focus on learning scene-specific representation using a set
of posed images or a video sequence of a static or dynamic scene. A few methods [55, 10, 38, 76, 22, 12, 9] have explored the generative modeling task using unstructured 2D images as training data.
A very recent method, GRAM [12], constrains radiance field learning on a set of 2D manifolds and shows promising results for high-quality and multi-view consistent image generation. Our method is also built upon the radiance manifolds representation [12] for high-quality face image generation and animation.
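For reference, NeRF [35] composites the color of a camera ray $\mathbf{r}(t)=\mathbf{o}+t\mathbf{d}$ from the field's densities $\sigma$ and view-dependent colors $\mathbf{c}$ via the standard volume rendering integral
\[
\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma\big(\mathbf{r}(t)\big)\,\mathbf{c}\big(\mathbf{r}(t),\mathbf{d}\big)\,dt,
\qquad
T(t) = \exp\!\Big(-\int_{t_n}^{t}\sigma\big(\mathbf{r}(s)\big)\,ds\Big).
\]
GRAM [12] keeps this compositing principle but evaluates radiance only where each ray intersects a set of learned 2D manifolds, which is the representation our method inherits.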
3D-aware generative model for image synthesis.
Generative adversarial networks [21, 26, 27] are widely used for realistic 2D image synthesis. Recent methods [36, 62, 37, 31, 38, 10, 14, 22, 12, 9, 41, 72] extend GANs to 3D-aware image synthesis by incorporating 3D knowledge into
their generators. This enables them to learn multiview image generation given only unconstrained 2D image collections as supervision. For example, HoloGAN [36] utilizes a 3D CNN to generate low-resolution voxel features and projects them into 2D feature maps for further neural rendering. GRAF [55] adopts a generative radiance field as the scene representation and generates images via volumetric rendering [35]. GRAM [12] further sparsifies the radiance field into a set of radiance manifolds and leverages manifold rendering [78, 12] for more realistic image generation. Although
3D-aware GANs are able to control camera viewpoints, they cannot achieve fine-grained control
over the generated instances such as shapes and expressions for human faces. This prevents them
from being used for multiview face animation generation tasks. In this work, we introduce imitative
learning of 3DMM into a 3D-aware GAN framework to achieve expression-controllable face image
generation.
Face image synthesis with 3D morphable model guidance.
3D Morphable Models (3DMMs) [5, 47, 7] play an important role in face image and animation synthesis [66, 28, 75, 54]. Earlier
works [65, 66] directly render textured meshes represented by a 3DMM using a traditional rendering pipeline for face reenactment. However, these methods often produce over-smooth textures and cannot generate non-face regions due to the limitations of the 3DMM. Later works [28, 18, 15, 75, 8] apply
refinement networks on top of the rendered 3DMM images to generate more realistic texture details
and fill the missing regions. Nevertheless, these methods still rely on a graphics rendering procedure
to synthesize coarse images at inference time, which complicates the whole system. Recently,
DiscoFaceGAN [11] proposes an imitative-contrastive learning scheme that enforces the generative
network to mimic the rendering process of 3DMM. After training, it can directly generate realistic
face images and control multiple face attributes in a disentangled manner via the learned 2D GAN.
Some concurrent and follow-up works [64, 30, 20, 51] share a similar spirit. However, these
methods often encounter 3D-inconsistency issues when controlling camera poses due to the non-
physical image rendering process of 2D GANs. Some recent methods [16, 24] incorporate 3DMM
knowledge into NeRF’s training process to achieve animatable face image synthesis. However, they
do not leverage a generative training paradigm [21] and require video sequences or multiview images as supervision. By contrast, our method can be trained with only unstructured 2D images.
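For context, the 3DMM priors used by these methods typically express a face shape as a linear combination of identity and expression bases; a commonly used formulation (with generic symbols, not necessarily the exact convention of [5, 47]) is
\[
\mathbf{S}(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \bar{\mathbf{S}} + \mathbf{B}_{\mathrm{id}}\,\boldsymbol{\alpha} + \mathbf{B}_{\mathrm{exp}}\,\boldsymbol{\beta},
\]
where $\bar{\mathbf{S}}$ is the mean shape, $\mathbf{B}_{\mathrm{id}}$ and $\mathbf{B}_{\mathrm{exp}}$ are identity and expression bases, and $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$ are coefficient vectors. Imitative approaches such as [11] typically use the expression coefficients $\boldsymbol{\beta}$ as an explicit, semantically meaningful handle for the generated expression.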
Face editing and animation with GANs.
A large number of works adopt GANs for face image manipulation and animation synthesis [50, 2, 19, 57, 52, 69, 58, 56, 63, 1, 67, 70, 46, 17, 54, 71, 77].