Specifically, prior work animates a subject shown in a
source image given target motion from a driving video. For
this, prior methods commonly use 2D GANs to produce dy-
namic subject images with additional guidance such as ref-
erence keypoints [69, 64, 47, 61, 70, 67] and self-learned
motion representations [48, 49]. However, due to intrin-
sic limitations of 2D convolutions, these approaches lack
3D-awareness. This leads to two main failure modes: 1) spurious identity changes, i.e., the subject identity, head pose, or facial expression becomes distinct from the target when the source head is rotated by a large angle; 2) severe distortions when transferring motion across identities based on keypoints, which contain identity-specific information such as face shape. To address these failure modes, recent face
reenactment works [40, 13] estimate parameters of 3D mor-
phable face models, e.g., poses and expression, as addi-
tional guidance. Notably, Guy et al. [13] explicitly repre-
sent heads in a 3D radiance field, achieving impressive 3D consistency. However, this approach requires training one model per face identity [13].
Different from this direction, we develop a generalizable method for motion-controllable synthesis of novel, never-before-seen identities. This differs from prior
works which either consider motion control but drop 3D-
awareness [43, 56, 44, 63, 69, 64, 48, 47, 61, 70, 67], or are
3D-aware but don’t consider dynamics [46, 7, 16] or can’t
generate never-before-seen identities [40, 13].
To achieve our goal, we propose controllable radiance
fields (CoRFs). We learn CoRFs from RGB images/videos
with unknown camera poses. This requires addressing two
main challenges: 1) how to effectively represent and con-
trol identity and motion in 3D; and 2) how to ensure spatio-
temporal consistency across views and time.
To control identity and motion, CoRFs use a style-
based radiance field generator which takes low-dimensional
identity and motion representations as input, as shown in
Fig. 1. Unlike prior head reconstruction work [40, 13], which uses ground-truth images for self-supervision, our setting provides no ground-truth target image for a generated image of a never-before-seen person. Thus, we propose an additional motion reconstruction loss to supervise motion control.
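For illustration only, the sketch below shows one way such conditioning and a motion reconstruction loss could be implemented; it assumes a plain MLP-conditioned radiance field rather than the style-based generator, and ConditionalRadianceField as well as motion_encoder are placeholder names, not the actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch: a radiance field conditioned on identity and motion codes.
class ConditionalRadianceField(nn.Module):
    def __init__(self, id_dim=256, motion_dim=64, hidden=256):
        super().__init__()
        # 3D point (3 dims) + identity code + motion code -> shared features
        self.mlp = nn.Sequential(
            nn.Linear(3 + id_dim + motion_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)  # volume density sigma
        self.color_head = nn.Linear(hidden, 3)    # RGB color

    def forward(self, points, z_id, z_motion):
        # points: (N, 3); z_id: (id_dim,); z_motion: (motion_dim,)
        cond = torch.cat([z_id, z_motion]).expand(points.shape[0], -1)
        h = self.mlp(torch.cat([points, cond], dim=-1))
        return self.density_head(h), torch.sigmoid(self.color_head(h))

# Motion reconstruction loss (a sketch of the idea): re-estimate the motion
# code from the rendered frame and compare it with the code fed to the
# generator, so no ground-truth target image is required.
def motion_reconstruction_loss(motion_encoder, rendered_image, z_motion):
    z_hat = motion_encoder(rendered_image)  # motion_encoder is an assumed module
    return F.mse_loss(z_hat, z_motion)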
To ensure spatio-temporal consistency, we propose three
consistency constraints on face attributes, identities, and
background. Specifically, at each training step, we generate
two images given the same identity representation yet dif-
ferent motion representations. We encourage the two images to share identical environment and subject-specific attributes. For this, we apply a head regressor and an identity encoder. The head regressor decomposes the images into the representations of a statistical face model, including lighting, shape, texture, and albedo. We then compute a consistency loss that compares the predicted attribute parameters of the paired synthetic images. Moreover, we use an
identity encoder to ensure the paired synthetic images share
the same identity. Further, to encourage a consistent back-
ground across time, we recognize background regions using
a face parsing network and employ a background consistency
loss between paired synthetic images that share the same
identity representation.
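A compact sketch of these paired-image consistency terms is given below; it assumes off-the-shelf head_regressor, id_encoder, and face_parser modules, a background label of 0, and L1/cosine distances, all of which are illustrative choices rather than the exact losses used in our implementation.

import torch.nn.functional as F

def consistency_losses(img_a, img_b, head_regressor, id_encoder, face_parser):
    # img_a, img_b: frames rendered with the SAME identity code but
    # DIFFERENT motion codes, shape (B, 3, H, W).

    # 1) Face-attribute consistency: motion-independent parameters of a
    #    statistical face model should agree between the two renderings.
    attrs_a, attrs_b = head_regressor(img_a), head_regressor(img_b)
    loss_attr = sum(F.l1_loss(attrs_a[k], attrs_b[k])
                    for k in ("lighting", "shape", "texture", "albedo"))

    # 2) Identity consistency: face-identity embeddings should stay close.
    emb_a, emb_b = id_encoder(img_a), id_encoder(img_b)
    loss_id = 1.0 - F.cosine_similarity(emb_a, emb_b, dim=-1).mean()

    # 3) Background consistency: pixels labelled as background by the face
    #    parser should not change when only the motion code changes.
    mask = ((face_parser(img_a) == 0) & (face_parser(img_b) == 0)).unsqueeze(1)
    loss_bg = F.l1_loss(img_a * mask, img_b * mask)

    return loss_attr, loss_id, loss_bg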
As there is no direct baseline, we compare CoRFs to
two types of work that are most related: 1) face reenact-
ment work [48, 49] that can control facial expression trans-
fer; and 2) video synthesis work [56, 44, 63, 55] that can
produce videos for novel identities. We evaluate the im-
provements of our method using the Fréchet Inception Distance (FID) [19] for videos, motion control scores, and 3D-aware consistency metrics on FFHQ [22] and two face video benchmarks [42, 9] at a resolution of 256×256. Moreover,
we show additional applications such as novel view synthe-
sis and motion interpolation in the latent motion space.
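As a simple illustration of the latter, motion interpolation amounts to blending two motion codes while keeping the identity code and camera fixed; the generator signature below is assumed for illustration only.

import torch

def interpolate_motion(generator, z_id, z_m0, z_m1, camera, steps=8):
    # Linearly blend two motion codes with a fixed identity code and camera,
    # rendering one frame per interpolation step.
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z_m = (1.0 - alpha) * z_m0 + alpha * z_m1
        frames.append(generator(z_id, z_m, camera))
    return frames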
Contributions. 1) We study the new task of generalizable,
3D-aware, and motion-controllable face generation. 2) We
develop CoRFs, which enable editing of the motion and viewpoints of never-before-seen subjects during synthesis. For
this, we study techniques that aid motion control and spatio-
temporal consistency. 3) CoRFs improve visual quality and
spatio-temporal consistency of free-view motion synthesis
compared to multiple baselines [56, 44, 63, 48, 69, 49, 55]
on popular face benchmarks [22, 42, 9] using multiple met-
rics for image quality, temporal consistency, identity preser-
vation, and expression transfer.
2. Related work
3D-aware image synthesis. GANs [15] have significantly
advanced 2D image synthesis capabilities in recent years,
addressing early concerns regarding diversity, resolution,
and photo-realism [22, 2, 24, 25, 23, 17]. More recent
GANs aim to extend 2D image synthesis to 3D-aware im-
age generation [37, 53, 34, 35, 11], while permitting explicit
camera control. For this, these methods use an implicit neural representation.
Implicit neural representations [32, 8, 31, 66, 28, 51,
50, 38, 5] have been introduced to encode 3D objects (i.e.,
their shape and/or appearance) via a parameterized neural
net. Compared to voxel-based [21, 41, 65] and mesh-
based [36, 59] methods, implicit neural functions represent
3D objects in continuous space and are not restricted to a
particular object topology. A recent variation, Neural Ra-
diance Fields (NeRFs) [32], represents the appearance and
geometry of a static real-world scene with a multi-layer per-
ceptron (MLP), enabling impressive novel-view synthesis
with multi-view consistency via volume rendering. Different from the method studied in this work, these classical approaches can't generate novel scenes.
To address this, recent work [46, 7, 16, 72, 71] stud-
ies 3D-aware generative models using NeRFs as a gener-