Controllable Radiance Fields for Dynamic Face Synthesis
Peiye Zhuang1*†, Liqian Ma2†, Sanmi Koyejo1,3,4, Alexander Schwing1
1University of Illinois Urbana-Champaign, 2ZMO AI Inc., 3Stanford University, 4Google Inc.
{peiye, sanmi, aschwing}@illinois.edu, liqianma.scholar@outlook.com
Figure 1: Controllable free-view dynamic head synthesis. Each row presents an identity sampled from a prior distribution
and two expressions guided by a reference image (bottom left), viewed from multiple directions (columns 1-3 and 4-6).
Abstract
Recent work on 3D-aware image synthesis has achieved compelling results using advances in neural rendering. However, 3D-aware synthesis of face dynamics has not received much attention. Here, we study how to explicitly control generative model synthesis of face dynamics exhibiting non-rigid motion (e.g., facial expression change), while simultaneously ensuring 3D-awareness. For this we propose a Controllable Radiance Field (CoRF): 1) Motion control is achieved by embedding motion features within the layered latent motion space of a style-based generator; 2) To ensure consistency of background, motion features, and subject-specific attributes such as lighting, texture, shapes, albedo, and identity, a face parsing net, a head regressor, and an identity encoder are incorporated. On head image/video data we show that CoRFs are 3D-aware while enabling editing of identity, viewing directions, and motion.
*Some of the work was completed while P.Z. was at Google.
†Corresponding author.
1. Introduction
Face synthesis is an important task with applications in
digital content creation, film-making and Virtual Reality
(VR). Generative Adversarial Nets (GANs), as a powerful class of generative models, have demonstrated remarkable success in generating high-quality faces. However, despite the impressive performance of GANs in both 2D [15, 22, 2, 24, 25, 23] and 3D image synthesis [35, 46, 7, 16], their extension to synthesis that is both 3D-aware and motion-controllable has not been fully explored. This is largely due to the complexity of physical motion and the challenge of modeling appearance changes that remain consistent with a subject's identity. For example, dynamic head synthesis requires maintaining some 3D consistency over time to preserve facial identity while permitting other 3D deformations due to expression changes. It is even more challenging to enable interactive manipulation of identity, motion, and viewing directions.
Addressing this task of generalizable, 3D-aware, and motion-controllable synthesis of face dynamics makes it possible to generate never-before-seen faces, including their dynamic motion, while controlling the viewpoint, as shown in Fig. 1.
Most related to this task are motion transfer methods
such as face reenactment [69, 64, 47, 61, 70, 67, 49, 48].
Specifically, prior work animates a subject shown in a
source image given target motion from a driving video. For
this, prior methods commonly use 2D GANs to produce dy-
namic subject images with additional guidance such as ref-
erence keypoints [69, 64, 47, 61, 70, 67] and self-learned
motion representations [48, 49]. However, due to intrinsic limitations of 2D convolutions, these approaches lack 3D-awareness. This leads to two main failure modes: 1) spurious identity changes: when a source head is rotated by a large angle, the subject, head pose, or facial expression of the result no longer matches the source; 2) severe distortions when transferring motion across identities based on keypoints, which carry identity-specific information such as face shape. To address these failure modes, recent face reenactment works [40, 13] estimate parameters of 3D morphable face models, e.g., pose and expression, as additional guidance. Notably, Guy et al. [13] explicitly represent heads in a 3D radiance field, showing impressive 3D consistency. However, one model needs to be trained per face identity [13].
Different from this direction, we develop a generalizable method for motion-controllable synthesis of novel, never-before-seen identities. This differs from prior works, which either consider motion control but drop 3D-awareness [43, 56, 44, 63, 69, 64, 48, 47, 61, 70, 67], are 3D-aware but do not consider dynamics [46, 7, 16], or cannot generate never-before-seen identities [40, 13].
To achieve our goal, we propose controllable radiance
fields (CoRFs). We learn CoRFs from RGB images/videos
with unknown camera poses. This requires addressing two main challenges: 1) how to effectively represent and control identity and motion in 3D; and 2) how to ensure spatio-temporal consistency across views and time.
To control identity and motion, CoRFs use a style-
based radiance field generator which takes low-dimensional
identity and motion representations as input, as shown in
Fig. 1. Unlike prior head reconstruction work [40, 13] that
uses ground-truth images for self-supervision, there is no
ground-truth target image for a generated image of a never-
before-seen person. Thus, we propose an additional motion
reconstruction loss to supervise motion control.
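One plausible instantiation of such a loss, assuming the motion regressor R of Fig. 2 is reused to re-estimate motion from the synthesized frame (see Sec. 3 for the exact formulation), is an L2 penalty:

\[
\mathcal{L}_{\text{motion}} = \mathbb{E}_{\mathbf{z},\mathbf{m},\xi}\left[\,\big\lVert R\big(G(\mathbf{z},\mathbf{m},\xi)\big) - \mathbf{m} \big\rVert_2^2\,\right],
\]

where G renders an image from an identity code z, a motion code m, and a camera pose ξ, and R maps the rendered image back to an estimated motion code.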
To ensure spatio-temporal consistency, we propose three
consistency constraints on face attributes, identities, and
background. Specifically, at each training step, we generate
two images given the same identity representation yet dif-
ferent motion representations. We encourage the two images to share the same environment and subject-specific attributes. For this, we apply a head regressor and an identity encoder. The head regressor decomposes the images into parameters of a statistical face model, including lighting, shape, texture, and albedo; we then compute a consistency loss that compares the predicted attribute parameters of the paired synthetic images. The identity encoder ensures that the paired synthetic images depict the same identity. Further, to encourage a consistent background across time, we recognize background regions with a face parsing net and employ a background consistency loss between paired synthetic images that share the same identity representation.
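As an illustration, the three constraints above can be sketched as one paired-image training step. Everything below (the module names head_regressor, id_encoder, face_parser, the shared camera pose, the L1 and cosine distances, and the background label convention) is an assumption for exposition, not the paper's exact implementation:

```python
import torch.nn.functional as F

def consistency_losses(G, z, m1, m2, pose, head_regressor, id_encoder, face_parser):
    """Paired-image consistency: same identity code z, two motion codes m1 / m2."""
    img1 = G(z, m1, pose)  # frame rendered with motion m1
    img2 = G(z, m2, pose)  # frame rendered with motion m2 (same identity and pose assumed)

    # Attribute consistency: lighting / shape / texture / albedo parameters predicted
    # by a frozen head regressor (a statistical face model fitter) should match.
    attrs1, attrs2 = head_regressor(img1), head_regressor(img2)
    loss_attr = sum(F.l1_loss(a1, a2) for a1, a2 in zip(attrs1, attrs2))

    # Identity consistency: embeddings from a face-recognition encoder should agree.
    e1, e2 = id_encoder(img1), id_encoder(img2)
    loss_id = 1.0 - F.cosine_similarity(e1, e2, dim=-1).mean()

    # Background consistency: pixels labeled background by a face parsing net should
    # not change when only the motion code changes (label 0 = background is assumed).
    bg_mask = (face_parser(img1) == 0).float().unsqueeze(1)      # (B, 1, H, W)
    loss_bg = (F.l1_loss(img1, img2, reduction="none") * bg_mask).mean()

    return loss_attr, loss_id, loss_bg
```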
As there is no direct baseline, we compare CoRFs to
two types of work that are most related: 1) face reenact-
ment work [48, 49] that can control facial expression trans-
fer; and 2) video synthesis work [56, 44, 63, 55] that can
produce videos for novel identities. We evaluate the im-
provements of our method using the Fréchet Inception Distance (FID) [19] for videos, motion control scores, and 3D-aware consistency metrics on FFHQ [22] and two face video benchmarks [42, 9] at a resolution of 256×256. Moreover,
we show additional applications such as novel view synthe-
sis and motion interpolation in the latent motion space.
Contributions. 1) We study the new task of generalizable,
3D-aware, and motion-controllable face generation. 2) We
develop CoRFs which enable editing of motion, and view-
points of never-before-seen subjects during synthesis. For
this, we study techniques that aid motion control and spatio-
temporal consistency. 3) CoRFs improve visual quality and
spatio-temporal consistency of free-view motion synthesis
compared to multiple baselines [56, 44, 63, 48, 69, 49, 55]
on popular face benchmarks [22, 42, 9] using multiple met-
rics for image quality, temporal consistency, identity preser-
vation, and expression transfer.
2. Related work
3D-aware image synthesis. GANs [15] have significantly
advanced 2D image synthesis capabilities in recent years,
addressing early concerns regarding diversity, resolution,
and photo-realism [22, 2, 24, 25, 23, 17]. More recent
GANs aim to extend 2D image synthesis to 3D-aware im-
age generation [37, 53, 34, 35, 11], while permitting explicit
camera control. For this, methods [37, 53, 34, 35, 11] use
an implicit neural representation.
Implicit neural representations [32, 8, 31, 66, 28, 51, 50, 38, 5] have been introduced to encode 3D objects (i.e., their shape and/or appearance) via a parameterized neural net. Compared to voxel-based [21, 41, 65] and mesh-based [36, 59] methods, implicit neural functions represent 3D objects in continuous space and are not restricted to a particular object topology. A recent variant, Neural Radiance Fields (NeRFs) [32], represents the appearance and geometry of a static real-world scene with a multi-layer perceptron (MLP), enabling impressive novel-view synthesis with multi-view consistency via volume rendering. Different from the method studied here, these classical approaches cannot generate novel scenes.
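For reference, volume rendering in NeRF [32] composites the MLP's density σ and view-dependent color c along a camera ray r(t) = o + t·d:

\[
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t}\sigma(\mathbf{r}(s))\,ds\right),
\]

approximated in practice by quadrature over sampled points along each ray. A single trained MLP therefore encodes one fixed scene, which is why such models cannot synthesize novel scenes on their own.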
To address this limitation, recent work [46, 7, 16, 72, 71] studies 3D-aware generative models that use NeRFs as the generator.
[Figure 2 diagram: a noise code z and a motion code m are mapped by networks fz and fm of the generator G = {fm, fz, Gs}; per-layer codes w + d1, ..., w + dN from the layered latent motion space modulate the StyleFC/ModFC blocks, which output features and density (F, σ) rendered (toRGB) at sampled positions x; a motion regressor R predicts motion from real/fake images for Lmotion; the discriminator D, conditioned on pose ξ and motion m, provides Ladv.]
Figure 2: Overview of Controllable Radiance Fields (CoRFs). In the offline step, we collect motion representations m using a regressor R. At training time, the generator G renders an image given position input x, a noise z, and a motion representation m. The motion input m is embedded via a mapping network fm(m) and combined with a style vector w via a summation before being used to generate the face. The discriminator D compares real and generated images, conditioned on a camera pose ξ and a motion representation m. To ensure visual quality and motion control, we employ a discriminator loss Ladv and a motion reconstruction loss Lmotion.
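To make the data flow of Fig. 2 concrete, the following is a minimal training-step sketch; the module interfaces, the non-saturating GAN loss, and the unit loss weights are assumptions rather than the paper's exact choices:

```python
import torch.nn.functional as F

def corf_training_step(G, D, R, z, m, pose, real_img, real_pose, real_m):
    """Illustrative generator/discriminator update following Fig. 2 (assumed interfaces)."""
    fake_img = G(z, m, pose)                       # render a frame from noise z, motion m, pose

    # Adversarial loss: D scores images conditioned on camera pose and motion code.
    loss_adv = F.softplus(-D(fake_img, pose, m)).mean()   # non-saturating GAN loss

    # Motion reconstruction loss: re-estimate motion from the synthesized frame with
    # the (frozen) regressor R and compare it to the input motion code m.
    loss_motion = F.mse_loss(R(fake_img), m)

    loss_G = loss_adv + loss_motion                # loss weighting omitted for brevity

    # Discriminator update: real frames come with offline-regressed pose / motion labels.
    loss_D = (F.softplus(-D(real_img, real_pose, real_m)).mean()
              + F.softplus(D(fake_img.detach(), pose, m)).mean())
    return loss_G, loss_D
```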
For instance, GRAF [46] and π-GAN [7] operate on a randomly sampled camera pose and latent vectors from prior distributions to produce a radiance field via an MLP.
StyleNeRF and CIPS-3D [16, 72] combine a shallow NeRF network, which provides low-resolution radiance fields, with a 2D rendering network that produces high-resolution images with fine details. LolNeRF [39] learns 3D objects by optimizing
foreground and background NeRFs together with a learn-
able per-image table of latent codes. Zhao et al. [71] de-
velop a generative multi-plane image (GMPI) representa-
tion to ensure view-consistency. These works only generate
multi-view images for static scenes. In contrast, the proposed CoRF targets generation of face dynamics while enabling motion control. We also note concurrent NeRF-
based generative models [52, 20] aiming for 3D-aware se-
mantic appearance or expression editing.
Dynamic object synthesis. Dynamic object synthesis has
been studied using 2D- and 3D-based methods. Some 2D-
based methods unconditionally generate dynamic instances
via convolutional neural nets such as GANs [58, 43, 44, 56,
63], but struggle with motion control. Follow-up work in-
corporates additional reference information during synthe-
sis [69, 64, 48, 47, 61, 70, 62, 40], e.g., for tasks like face
reenactment. Our method is related but differs in that CoRF
synthesizes images and controls motion for never-before-
seen identities from free viewpoints. 3D-based methods re-
cover face geometry and control the expression by changing
geometry parameters [54, 26, 14, 33]. While they can con-
trol facial motion effectively, they struggle to produce real-
istic hair, teeth, and accessories [14, 33]. Recent work [13]
adapts neural implicit methods, e.g., NeRFs, to learn head
reconstruction from images. However, one model is trained
per identity, i.e., these methods don’t generalize across dif-
ferent identities. Unlike prior work [13], CoRF generalizes
across identities.
Motion conditioning. We draw inspiration from GAN-
based image inversion and editing work [1, 73]. To be con-
crete, Abdal et al. [1] proposed to embed images back to an
extended latent identity space of a StyleGAN [24], i.e., the
W+ space. This extended latent identity space was shown
to be remarkably useful for image editing and expression
transfer in follow-up work [73]. In our case, we embed a
motion representation into a layered latent motion space,
which we then broadcast to every synthesis layer in the gen-
erator. In Section 4, we illustrate that such a motion embed-
ding strategy benefits motion control.
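A minimal sketch of this layered latent motion space is given below; the dimensions, MLP depth, and module name are assumptions, and how each per-layer code modulates the radiance-field generator is omitted:

```python
import torch.nn as nn

class LayeredMotionEmbedding(nn.Module):
    """Embed a motion code m into one offset d_i per synthesis layer, a motion
    analogue of the W+ space (dimensions and depth are illustrative)."""
    def __init__(self, m_dim=64, w_dim=512, n_layers=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(m_dim, w_dim), nn.LeakyReLU(0.2),
                                 nn.Linear(w_dim, w_dim * n_layers))
        self.n_layers, self.w_dim = n_layers, w_dim

    def forward(self, w, m):
        # d has shape (batch, n_layers, w_dim): one motion offset per synthesis layer.
        d = self.net(m).view(-1, self.n_layers, self.w_dim)
        # Broadcast-add to the identity style vector w, yielding per-layer codes
        # w + d_1, ..., w + d_N that condition the corresponding generator layers.
        return w.unsqueeze(1) + d
```

Intuitively, per-layer offsets let coarse and fine synthesis layers respond to motion differently, mirroring how the W+ space enables finer-grained editing than a single style vector w.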
3D consistency preservation. To ensure 3D consistency
over time, a 3D convolutional discriminator is commonly
used to distinguish the source of input video clips [55]. Re-
cent work [68] finds that when employing an implicit neural rendering net as a generator, a 2D discriminator applied to two synthetic frames of a video is sufficient. Inspired by this result,
we propose consistency losses on paired synthetic frames
without using a 3D convolutional discriminator.
3. Method
We aim for generalizable, 3D-aware and motion-
controllable synthesis of face dynamics. Concretely, we