time. The blendshape head model, in the form of a linear/bilinear combination of different facial expressions, has the following advantages. It is a semantic parameterization: the combination coefficients have an intuitive meaning for users as the strength or influence of specific facial expressions. Meanwhile, the blendshapes span a reasonable shape space in which the user can freely control and edit the face.
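As background, the core operation is simply a weighted sum of expression bases. The following is a minimal sketch of a linear blendshape evaluation; the vertex count, the 46 bases, and the random geometry are placeholders for illustration, not data from this work:

```python
import torch

def blend_expression(neutral, deltas, weights):
    """neutral: (V, 3) rest-pose vertices.
    deltas:  (K, V, 3) per-blendshape vertex offsets from the neutral face.
    weights: (K,) semantic expression coefficients, typically in [0, 1]."""
    return neutral + torch.einsum('k,kvc->vc', weights, deltas)

# Toy example: activate a single expression basis at 80% strength.
V, K = 5000, 46                          # 46 expression bases, as in FaceWarehouse-style rigs
neutral = torch.zeros(V, 3)
deltas = 0.01 * torch.randn(K, V, 3)
weights = torch.zeros(K)
weights[3] = 0.8
mesh = blend_expression(neutral, deltas, weights)   # (V, 3) blended vertices
```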
Generalized semantic head models like FaceWarehouse [Cao et al. 2013] aim to model different subjects with different expressions, and thus may ignore personalized geometry and texture details. To construct a personalized blendshape model, traditional mesh-based methods usually adopt deformation transfer [Garrido et al. 2016; Li et al. 2010; Sumner and Popović 2004] or multilinear tensor-based 3DMMs [Cao et al. 2018, 2014; Vlasic et al. 2006]. However, these methods usually have the following disadvantages. First, mesh-based parametric models can hardly represent personalized non-face parts such as hair and teeth. Second, to use RGB supervision, they have to rely on approximate differentiable rendering techniques to alleviate the non-differentiability problem. Third, deformation transfer cannot reconstruct expressions realistically due to its limited representation ability. Last, facial expressions are characterized by many factors such as age and muscle movements, and these factors are hard to express accurately with predefined blendshapes.
Recently, NeRF-based methods have made it possible to synthesize photorealistic images. Some works integrate NeRF with GANs [Chan et al. 2022, 2021; Deng et al. 2022; Gu et al. 2022; Niemeyer and Geiger 2021; Schwarz et al. 2020; Zhou et al. 2021]. However, these generative models couple expression, identity, and appearance together, so that expressions cannot be easily controlled.
HeadNeRF [Hong et al. 2022] proposes to disentangle different semantic attributes, but it cannot represent personalized facial dynamics and facial details due to its generic model capacity. AD-NeRF [Guo et al. 2021b] and NerFACE [Gafni et al. 2021] can generate highly personalized facial animation; their user-specific training makes the model learn more personalized facial details. However, they need a long time to train a reasonable dynamic head field. According to our experiments in Section 4.3, this is because they concatenate the expression condition with the Fourier positional encoding and directly feed it to the MLP. Neither the Fourier positional encoding nor the "concatenate" strategy is ideal for fast training. The Fourier encoding converges slowly when processed by an MLP, and the concatenation operation does not encode any combination law that could reveal the relation between local and global features (in the NeRF case, positional information and the expression condition). Therefore, it takes a long time for the MLP to learn how to use the expression condition to predict RGB and density.
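For concreteness, the conditioning criticized here can be sketched as follows. This is a hedged re-implementation of the common "concatenate" strategy, not the code of AD-NeRF or NerFACE; the 76-dimensional expression code and the network sizes are assumptions:

```python
import math
import torch
import torch.nn as nn

def fourier_encode(x, num_freqs=10):
    """Standard NeRF positional encoding: sin/cos of the input at increasing frequencies."""
    freqs = (2.0 ** torch.arange(num_freqs, dtype=torch.float32)) * math.pi
    ang = x[..., None] * freqs                       # (..., 3, num_freqs)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)

class ConcatConditionedNeRF(nn.Module):
    """Baseline conditioning: the expression code is simply concatenated with the
    positional encoding, and the MLP must discover how the two interact on its own."""
    def __init__(self, expr_dim=76, num_freqs=10, hidden=256):
        super().__init__()
        in_dim = 3 * 2 * num_freqs + expr_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                    # density + RGB per sample
        )

    def forward(self, xyz, expr):                    # xyz: (N, 3), expr: (N, expr_dim)
        return self.mlp(torch.cat([fourier_encode(xyz), expr], dim=-1))
```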
Recently, local features have been explored to improve NeRF's quality and efficiency. The original NeRF's local feature is the Fourier positional encoding, which takes a long time to converge. Follow-up works design different kinds of local features to improve NeRF. Some methods adopt a voxel field to accelerate the training process [Sara Fridovich-Keil and Alex Yu et al. 2022; Sun et al. 2022]. Other works use the voxel field to accelerate ray marching and volume rendering [Garbin et al. 2021; Liu et al. 2020; Yu et al. 2021a]. EG3D [Chan et al. 2022] adopts a compact and efficient tri-plane architecture enabling geometry-aware synthesis. TensoRF [Chen et al. 2022] factorizes the 4D scene tensor into multiple compact low-rank tensor components to separate local features. Among these methods, instant neural graphics primitives (INGP) [Müller et al. 2022] demonstrate a remarkable performance improvement in both training and rendering. INGP uses a highly compressed, compact data structure, a multi-level hash table, to make it possible to store a multi-level voxel field. A novel design of INGP is that feature query collisions are resolved in an adaptive way, so features at different levels can be trained together. With the help of a high-performance ray-marching implementation, it can train a static NeRF scene in less than one minute and render a frame in tens of milliseconds.
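A toy version of the multi-level hash-grid query illustrates the idea: each level stores features in a small hash table, the corners of the enclosing grid cell are hashed (collisions are tolerated and left for training to resolve), and the per-level features are trilinearly interpolated and concatenated. The sketch below is an illustrative Python approximation with assumed sizes, not the fused CUDA implementation of Müller et al. [2022]:

```python
import torch
import torch.nn as nn

class HashGridEncoding(nn.Module):
    """Toy multi-level hash-grid feature query in the spirit of INGP (illustrative only;
    the real method uses far larger tables and a fused CUDA implementation)."""
    def __init__(self, levels=8, table_size=2**14, feat_dim=2, base_res=16, growth=1.5):
        super().__init__()
        self.res = [int(base_res * growth ** l) for l in range(levels)]
        self.table_size = table_size
        self.tables = nn.ParameterList(
            [nn.Parameter(1e-4 * torch.randn(table_size, feat_dim)) for _ in range(levels)])

    def _hash(self, ijk):                            # ijk: (N, 3) integer grid coordinates
        # Spatial hash with per-axis primes, wrapped into the table (collisions allowed).
        h = (ijk[..., 0] * 1) ^ (ijk[..., 1] * 2654435761) ^ (ijk[..., 2] * 805459861)
        return h % self.table_size

    def forward(self, x):                            # x: (N, 3) points scaled to [0, 1]^3
        feats = []
        for level, res in enumerate(self.res):
            pos = x * res
            base, frac = pos.floor().long(), pos - pos.floor()
            level_feat = 0.0
            for corner in range(8):                  # trilinear blend over the 8 cell corners
                offset = torch.tensor([(corner >> d) & 1 for d in range(3)])
                weight = torch.where(offset.bool(), frac, 1.0 - frac).prod(-1, keepdim=True)
                level_feat = level_feat + weight * self.tables[level][self._hash(base + offset)]
            feats.append(level_feat)
        return torch.cat(feats, dim=-1)              # (N, levels * feat_dim)
```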
Although many methods have been proposed to speed up the training and inference of a static NeRF field, fast training of a dynamic scene, such as the complicated deformation of a head, remains an open problem. As our baseline shows, directly concatenating local features and the expression code as the input of the MLP, which is common in NeRF-based head applications, is neither efficient nor sufficient for modeling dynamic head motions.
In this paper, we present a personalized semantic facial model architecture defined on a multi-level voxel field. It not only inherits the semantic meaning of the mesh blendshapes used for tracking, but also captures more personalized facial attributes, especially for non-face parts. Each basis of our model is a radiance field of a specific expression, represented by a multi-level voxel field. We adopt multi-resolution hash tables to store the multi-level features for performance reasons. Any novel expression can then be expressed as the weighted combination of the voxel bases with the expression coefficients. We adopt an MLP to interpret the voxel field as a radiance field for volume rendering. To further accelerate the ray marching in volume rendering and make the optimization focus on the region possibly occupied by the head, we design an expression-aware density grid update strategy. Thanks to the powerful representation ability and fast convergence of our implicit model, our method outperforms other similar head construction methods in both modeling quality and construction speed. Our method can construct a photo-realistic personalized semantic facial model in around 10-20 minutes, which is remarkably faster than related NeRF-based head techniques. As our model is trained from a video of a specific person and combines the features in a latent space, it can capture personalized details, including non-linear deformations (cheek folds, forehead wrinkles) and user-specific attributes (moles, beard). Compared with traditional mesh-based blendshape models, our model can be constructed from a short RGB video and generates high-fidelity, view-consistent head images with different expressions.
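Assuming tracked expression coefficients are available per frame and each basis is stored as a hash-grid feature field, the representation described above could be wired up roughly as follows. This is a minimal sketch with assumed module names and sizes, reusing the toy HashGridEncoding from the previous listing; the expression-aware density grid update and the high-performance ray-marching implementation are omitted:

```python
import torch
import torch.nn as nn

class VoxelBlendshapeField(nn.Module):
    """Sketch of the idea: one multi-level hash-grid feature field per expression basis,
    blended linearly by the expression coefficients in latent (feature) space before a
    small MLP decodes density and color. All sizes are illustrative."""
    def __init__(self, num_bases=46, hidden=64):
        super().__init__()
        self.bases = nn.ModuleList([HashGridEncoding() for _ in range(num_bases)])
        feat_dim = len(self.bases[0].res) * self.bases[0].tables[0].shape[1]
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                    # density + RGB per sample
        )

    def forward(self, xyz, expr):                    # xyz: (N, 3), expr: (num_bases,) coefficients
        # Blend per-basis local features in latent space, then decode once per sample.
        feat = sum(w * basis(xyz) for w, basis in zip(expr, self.bases))
        return self.decoder(feat)
```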
In summary, the contributions include the following aspects:
• We present a novel semantic model for the human head defined with a neural radiance field. Our constructed NeRF basis not only has a disentangled semantic meaning, but also embodies more personalized facial attributes, including muscle actions and detailed texture. Therefore, the constructed digital avatars can model facial motions well and generate photo-realistic results.
• Our representation combines multi-level voxel fields with expression coefficients in the latent space. The multi-resolution