The contributions of our work can be summarized as follows:
• We present an animatable 3D-aware face GAN with expression variation disentangled through our proposed expression-driven 3D deformation field.
• We introduce a set of novel 3D-space imitative losses that align our expression generation with a prior 3D parametric face model, yielding accurate and semantically meaningful control of facial expression.
• We show that our method can generate realistic face videos with high visual quality. We believe it opens a new door to face video synthesis and photorealistic virtual avatar creation.
2 Related work
Neural implicit scene representations.
Neural implicit functions have been used in numerous works [42, 34, 60, 59, 35, 39] to represent 3D scenes in a continuous domain. Among them, NeRF [35, 3] stands out for modeling complex scenes with detailed appearance and for synthesizing multi-view images with strict 3D consistency; we recall its rendering formulation below. The original NeRF and most of its successors [43, 33, 45, 49, 48, 40, 32, 73] focus on learning scene-specific representations from a set of posed images or a video sequence of a static or dynamic scene. A few methods [55, 10, 38, 76, 22, 12, 9] have explored the generative modeling task using unstructured 2D images as training data. A very recent method, GRAM [12], constrains radiance field learning to a set of 2D manifolds and shows promising results for high-quality, multi-view-consistent image generation. Our method is also built upon the radiance manifolds representation [12] for high-quality face image generation and animation.
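As background for the rendering schemes discussed below, NeRF [35] represents a scene as a continuous function mapping a 3D point $\mathbf{x}$ and a viewing direction $\mathbf{d}$ to a density $\sigma$ and a color $\mathbf{c}$, and renders the pixel for a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ with the standard volume rendering integral (restated here from [35] for completeness):
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right),$$
where $T(t)$ is the accumulated transmittance and $[t_n, t_f]$ are the near and far bounds; in practice the integral is approximated by compositing densely sampled points along each ray.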
3D-aware generative model for image synthesis.
Generative adversarial networks [21, 26, 27] are widely used for realistic 2D image synthesis. Recent methods [36, 62, 37, 31, 38, 10, 14, 22, 12, 9, 41, 72] extend GANs to 3D-aware image synthesis by incorporating 3D knowledge into their generators. This enables them to learn multiview image generation given only unconstrained 2D image collections as supervision. For example, HoloGAN [36] utilizes a 3D CNN to generate low-resolution voxel features and projects them to 2D feature maps for further neural rendering. GRAF [55] adopts a generative radiance field as the scene representation and generates images via volumetric rendering [35]. GRAM [12] further sparsifies the radiance field into a set of radiance manifolds and leverages manifold rendering [78, 12] for more realistic image generation; we sketch this compositing scheme below. Although 3D-aware GANs are able to control camera viewpoints, they cannot achieve fine-grained control over the generated instances, such as the shapes and expressions of human faces. This prevents them from being used for multiview face animation generation tasks. In this work, we introduce imitative learning of 3DMM into a 3D-aware GAN framework to achieve expression-controllable face image generation.
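To make the contrast between volumetric and manifold rendering concrete, below is a minimal sketch of the compositing step used when radiance is restricted to K surface manifolds. It assumes the ray-manifold intersection depths have already been found by a separate root-finding step, and radiance_net is a hypothetical stand-in for the generator's radiance field; this illustrates the idea, and is not the actual GRAM [12] implementation.

```python
import torch

def composite_on_manifolds(ray_o, ray_d, ts, radiance_net):
    """Front-to-back alpha compositing at K ray-manifold intersections.

    ray_o, ray_d: (B, 3) ray origins and directions.
    ts:           (B, K) intersection depths, sorted front to back
                  (assumed precomputed by ray-manifold root finding).
    radiance_net: callable mapping (B, K, 3) points to a color (B, K, 3)
                  and an opacity (B, K) -- a hypothetical stand-in.
    """
    # Evaluate radiance only at the K intersection points, instead of
    # at dense stratified samples as in volumetric rendering.
    pts = ray_o[:, None, :] + ts[..., None] * ray_d[:, None, :]  # (B, K, 3)
    rgb, alpha = radiance_net(pts)
    # Transmittance: fraction of light reaching intersection i unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha[:, :-1]], dim=1),
        dim=1)
    weights = trans * alpha                        # (B, K) compositing weights
    return (weights[..., None] * rgb).sum(dim=1)   # (B, 3) pixel colors
```

Compared with the dense sampling required by the volume integral above, radiance is evaluated only at the K intersection points, confining radiance field learning to a small set of surfaces while preserving strict multi-view consistency.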
Face image synthesis with 3D morphable model guidance.
3D Morphable Models (3DMMs) [5, 47, 7] play an important role in face image and animation synthesis [66, 28, 75, 54]. Earlier works [65, 66] directly render textured meshes represented by a 3DMM using a traditional rendering pipeline for face reenactment. However, these methods often produce over-smooth textures and cannot generate non-face regions due to the limitations of the 3DMM. Later works [28, 18, 15, 75, 8] apply refinement networks on top of the rendered 3DMM images to generate more realistic texture details and fill in the missing regions. Nevertheless, these methods still rely on a graphics rendering procedure to synthesize coarse images at inference time, which complicates the whole system. Recently, DiscoFaceGAN [11] proposed an imitative-contrastive learning scheme that enforces the generative network to mimic the rendering process of a 3DMM. After training, it can directly generate realistic face images and control multiple face attributes in a disentangled manner via the learned 2D GAN. Some concurrent and follow-up works [64, 30, 20, 51] share a similar spirit. However, these methods often encounter 3D-inconsistency issues when controlling camera poses due to the non-physical image rendering process of 2D GANs. Some recent methods [16, 24] incorporate 3DMM knowledge into NeRF's training process to achieve animatable face image synthesis. However, they do not leverage a generative training paradigm [21] and require video sequences or multiview images as supervision. By contrast, our method can be trained with only unstructured 2D images.
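For reference, the 3DMM used by the above methods describes facial geometry as a linear statistical model; in standard notation (ours, for illustration), the shape of a face is
$$\mathbf{S} = \bar{\mathbf{S}} + \mathbf{B}_{id}\,\boldsymbol{\alpha} + \mathbf{B}_{exp}\,\boldsymbol{\beta},$$
where $\bar{\mathbf{S}}$ is the mean shape, $\mathbf{B}_{id}$ and $\mathbf{B}_{exp}$ are identity and expression bases, and the low-dimensional coefficients $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ provide the disentangled, semantically meaningful control signals that these methods, and ours, transfer to a generative model.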
Face editing and animation with GANs.
A large number of works adopt GANs for face image manipulation and animation synthesis [50, 2, 19, 57, 52, 69, 58, 56, 63, 1, 67, 70, 46, 17, 54, 71, 77].