Specifically, prior work animates a subject shown in a
source image given target motion from a driving video. For
this, prior methods commonly use 2D GANs to produce dy-
namic subject images with additional guidance such as ref-
erence keypoints [69, 64, 47, 61, 70, 67] and self-learned
motion representations [48, 49]. However, due to intrin-
sic limitations of 2D convolutions, these approaches lack
3D-awareness. This leads to two main failure modes: 1) spurious identity changes, i.e., the subject identity, head pose, or facial expression becomes distinct from the target when the source head is rotated by a large angle; 2) severe distortions when transferring motion across identities based on keypoints, which contain identity-specific information such as face shape. To address these failure modes, recent face
reenactment works [40, 13] estimate parameters of 3D mor-
phable face models, e.g., poses and expression, as addi-
tional guidance. Notably, Guy et al. [13] explicitly repre-
sent heads in a 3D radiance field, achieving impressive 3D consistency. However, this approach requires training one model per face identity [13].
Different from this direction, we develop a generalizable method for motion-controllable synthesis of novel, never-before-seen identities. This differs from prior
works which either consider motion control but drop 3D-
awareness [43, 56, 44, 63, 69, 64, 48, 47, 61, 70, 67], or are
3D-aware but don’t consider dynamics [46, 7, 16] or can’t
generate never-before-seen identities [40, 13].
To achieve our goal, we propose controllable radiance
fields (CoRFs). We learn CoRFs from RGB images/videos
with unknown camera poses. This requires addressing two
main challenges: 1) how to effectively represent and con-
trol identity and motion in 3D; and 2) how to ensure spatio-
temporal consistency across views and time.
To control identity and motion, CoRFs use a style-
based radiance field generator which takes low-dimensional
identity and motion representations as input, as shown in
Fig. 1. Unlike prior head reconstruction work [40, 13], which uses ground-truth images for self-supervision, our setting provides no ground-truth target image for a generated image of a never-before-seen person. Thus, we propose an additional motion reconstruction loss to supervise motion control.
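For illustration only, the sketch below shows one way such conditioning and a motion reconstruction loss could be implemented; it assumes a plain MLP-conditioned radiance field rather than the style-based generator, and ConditionalRadianceField as well as motion_encoder are placeholder names, not the actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch: a radiance field conditioned on identity and motion codes.
class ConditionalRadianceField(nn.Module):
    def __init__(self, id_dim=256, motion_dim=64, hidden=256):
        super().__init__()
        # 3D point (3 dims) + identity code + motion code -> shared features
        self.mlp = nn.Sequential(
            nn.Linear(3 + id_dim + motion_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)  # volume density sigma
        self.color_head = nn.Linear(hidden, 3)    # RGB color

    def forward(self, points, z_id, z_motion):
        # points: (N, 3); z_id: (id_dim,); z_motion: (motion_dim,)
        cond = torch.cat([z_id, z_motion]).expand(points.shape[0], -1)
        h = self.mlp(torch.cat([points, cond], dim=-1))
        return self.density_head(h), torch.sigmoid(self.color_head(h))

# Motion reconstruction loss (a sketch of the idea): re-estimate the motion
# code from the rendered frame and compare it with the code fed to the
# generator, so no ground-truth target image is required.
def motion_reconstruction_loss(motion_encoder, rendered_image, z_motion):
    z_hat = motion_encoder(rendered_image)  # motion_encoder is an assumed module
    return F.mse_loss(z_hat, z_motion)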
To ensure spatio-temporal consistency, we propose three
consistency constraints on face attributes, identities, and
background. Specifically, at each training step, we generate
two images given the same identity representation yet dif-
ferent motion representations. We encourage the two images to share identical environment and subject-specific attributes. For this, we apply a head regressor and an identity encoder. The head regressor decomposes the images into the representations of a statistical face model, including lighting, shape, texture, and albedo. We then compute a consistency loss that compares the predicted attribute parameters of the paired synthetic images. Moreover, we use an
identity encoder to ensure the paired synthetic images share
the same identity. Further, to encourage a consistent back-
ground across time, we recognize background regions using
a face parsing network and employ a background consistency
loss between paired synthetic images that share the same
identity representation.
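A compact sketch of these paired-image consistency terms is given below; it assumes off-the-shelf head_regressor, id_encoder, and face_parser modules, a background label of 0, and L1/cosine distances, all of which are illustrative choices rather than the exact losses used in our implementation.

import torch.nn.functional as F

def consistency_losses(img_a, img_b, head_regressor, id_encoder, face_parser):
    # img_a, img_b: frames rendered with the SAME identity code but
    # DIFFERENT motion codes, shape (B, 3, H, W).

    # 1) Face-attribute consistency: motion-independent parameters of a
    #    statistical face model should agree between the two renderings.
    attrs_a, attrs_b = head_regressor(img_a), head_regressor(img_b)
    loss_attr = sum(F.l1_loss(attrs_a[k], attrs_b[k])
                    for k in ("lighting", "shape", "texture", "albedo"))

    # 2) Identity consistency: face-identity embeddings should stay close.
    emb_a, emb_b = id_encoder(img_a), id_encoder(img_b)
    loss_id = 1.0 - F.cosine_similarity(emb_a, emb_b, dim=-1).mean()

    # 3) Background consistency: pixels labelled as background by the face
    #    parser should not change when only the motion code changes.
    mask = ((face_parser(img_a) == 0) & (face_parser(img_b) == 0)).unsqueeze(1)
    loss_bg = F.l1_loss(img_a * mask, img_b * mask)

    return loss_attr, loss_id, loss_bg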
As there is no direct baseline, we compare CoRFs to
two types of work that are most related: 1) face reenact-
ment work [48, 49] that can control facial expression trans-
fer; and 2) video synthesis work [56, 44, 63, 55] that can
produce videos for novel identities. We evaluate the im-
provements of our method using the Fréchet Inception Distance (FID) [19] for videos, motion control scores, and 3D-aware consistency metrics on FFHQ [22] and two face video benchmarks [42, 9] at a resolution of 256×256. Moreover,
we show additional applications such as novel view synthe-
sis and motion interpolation in the latent motion space.
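As a simple illustration of the latter, motion interpolation amounts to blending two motion codes while keeping the identity code and camera fixed; the generator signature below is assumed for illustration only.

import torch

def interpolate_motion(generator, z_id, z_m0, z_m1, camera, steps=8):
    # Linearly blend two motion codes with a fixed identity code and camera,
    # rendering one frame per interpolation step.
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z_m = (1.0 - alpha) * z_m0 + alpha * z_m1
        frames.append(generator(z_id, z_m, camera))
    return frames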
Contributions. 1) We study the new task of generalizable,
3D-aware, and motion-controllable face generation. 2) We
develop CoRFs, which enable editing of the motion and viewpoints of never-before-seen subjects during synthesis. For
this, we study techniques that aid motion control and spatio-
temporal consistency. 3) CoRFs improve visual quality and
spatio-temporal consistency of free-view motion synthesis
compared to multiple baselines [56, 44, 63, 48, 69, 49, 55]
on popular face benchmarks [22, 42, 9] using multiple met-
rics for image quality, temporal consistency, identity preser-
vation, and expression transfer.
2. Related work
3D-aware image synthesis. GANs [15] have significantly
advanced 2D image synthesis capabilities in recent years,
addressing early concerns regarding diversity, resolution,
and photo-realism [22, 2, 24, 25, 23, 17]. More recent
GANs aim to extend 2D image synthesis to 3D-aware im-
age generation [37, 53, 34, 35, 11], while permitting explicit
camera control. For this, these methods use an implicit neural representation.
Implicit neural representations [32, 8, 31, 66, 28, 51,
50, 38, 5] have been introduced to encode 3D objects (i.e.,
their shape and/or appearance) via a parameterized neural
net. Compared to voxel-based [21, 41, 65] and mesh-
based [36, 59] methods, implicit neural functions represent
3D objects in continuous space and are not restricted to a
particular object topology. A recent variation, Neural Ra-
diance Fields (NeRFs) [32], represents the appearance and
geometry of a static real-world scene with a multi-layer per-
ceptron (MLP), enabling impressive novel-view synthesis
with multi-view consistency via volume rendering. Different from the method studied in this work, these classical approaches can't generate novel scenes.
To address this, recent work [46, 7, 16, 72, 71] stud-
ies 3D-aware generative models using NeRFs as a gener-