Reconstructing Personalized Semantic Facial NeRF Models From
Monocular Video
XUAN GAO, University of Science and Technology of China, China
CHENGLAI ZHONG, University of Science and Technology of China, China
JUN XIANG, University of Science and Technology of China, China
YANG HONG, University of Science and Technology of China, China
YUDONG GUO, Image Derivative Inc, China
JUYONG ZHANG, University of Science and Technology of China, China
[Figure 1 (teaser image omitted in this extraction); panel titles: "Mesh Blendshape", "Semantic Facial NeRF Model".]
Fig. 1. A semantic model for the human head defined with a neural radiance field is presented. In this model, multi-level voxel fields are adopted as bases with corresponding expression coefficients, which enables strong representation ability for rendering and fast training.
We present a novel semantic model for the human head defined with a neural radiance field. The 3D-consistent head model consists of a set of disentangled and interpretable bases, and can be driven by low-dimensional expression coefficients. Thanks to the powerful representation ability of the neural radiance field, the constructed model can represent complex facial attributes, including hair and worn accessories, which cannot be represented by traditional mesh blendshapes. To construct the personalized semantic facial model, we propose to define the bases as several multi-level voxel fields. With a short monocular RGB video as input, our method can construct the subject's semantic facial NeRF model in only ten to twenty minutes, and can render a photo-realistic human head image in tens of milliseconds given an expression coefficient and view direction. With this novel representation, we apply it to many tasks such as facial retargeting and expression editing. Experimental results demonstrate its strong representation ability and training/inference speed. Demo videos and released code are provided on our project page: https://ustc3dv.github.io/NeRFBlendShape/

This work was done when Xuan Gao, Chenglai Zhong and Jun Xiang were interns at Image Derivative Inc.
Corresponding author (juyong@ustc.edu.cn).
Authors' addresses: Xuan Gao, Chenglai Zhong, Jun Xiang, Yang Hong, Juyong Zhang, University of Science and Technology of China, 96 Jinzhai Road, Hefei 230026, Anhui, China, gx2017@mail.ustc.edu.cn, zcl2017@mail.ustc.edu.cn, xiangjunxjkd1@mail.ustc.edu.cn, hymath@mail.ustc.edu.cn, juyong@ustc.edu.cn; Yudong Guo, Image Derivative Inc, 998 Wenyi West Road, Hangzhou, Zhejiang, China, guoyudong@idr.ai.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2022 Association for Computing Machinery.
0730-0301/2022/12-ART200 $15.00
https://doi.org/10.1145/3550454.3555501
CCS Concepts: • Computing methodologies → Reconstruction.
Additional Key Words and Phrases: Blendshape, Neural Radiance Field, Facial Retargeting
ACM Reference Format:
Xuan Gao, Chenglai Zhong, Jun Xiang, Yang Hong, Yudong Guo, and Juyong Zhang. 2022. Reconstructing Personalized Semantic Facial NeRF Models From Monocular Video. ACM Trans. Graph. 41, 6, Article 200 (December 2022), 12 pages. https://doi.org/10.1145/3550454.3555501
1 INTRODUCTION
3D face/head representation is an important research problem in computer vision and computer graphics, with wide applications in AR/VR, digital games, and the movie industry. How to represent a dynamic head and faithfully reconstruct a personalized human head model from a monocular RGB video is an important and challenging research topic. Under the hypothesis that the human head can be embedded into a low-dimensional space, parametric semantic head models, such as blendshapes, have been studied and improved for a long time.
The blendshape head model, in the form of a linear/bilinear combination of different facial expressions, has the following advantages. It is a semantic parameterization: the combination coefficients have an intuitive meaning for users as the strength or influence of specific facial expressions. Meanwhile, the blendshape spans a reasonable shape space in which the user can freely control and edit the face.
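For reference, a standard delta-blendshape model (a generic textbook formulation rather than an equation taken from this paper) writes a face as

  B(w) = B_0 + \sum_{i=1}^{n} w_i (B_i - B_0),

where B_0 is the neutral face, the B_i are expression bases, and the coefficients w_i (typically constrained to [0, 1]) carry the semantic meaning described above.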
Generalized semantic head models like FaceWarehouse [Cao et al. 2013] aim to model different subjects with different expressions, and thus may ignore personalized geometry and texture details. To construct a personalized blendshape model, traditional mesh-based methods usually adopt deformation transfer [Garrido et al. 2016; Li et al. 2010; Sumner and Popović 2004] or multilinear tensor-based 3DMMs [Cao et al. 2018, 2014; Vlasic et al. 2006]. However, these methods usually have the following disadvantages. First, mesh-based parametric models struggle to represent personalized non-face parts like hair and teeth. Second, to use RGB supervision, one has to rely on approximate differentiable rendering techniques to alleviate non-differentiability. Third, deformation transfer cannot reconstruct expressions realistically due to its limited representation ability. Last, facial expressions are characterized by many factors such as age and muscle movements, and these factors are hard to express accurately with predefined blendshapes.
Recently, NeRF-based methods have made it possible to synthesize photorealistic images. Some works integrate NeRF with GANs [Chan et al. 2022, 2021; Deng et al. 2022; Gu et al. 2022; Niemeyer and Geiger 2021; Schwarz et al. 2020; Zhou et al. 2021]. However, this kind of generative model couples expression, identity, and appearance together, so the expressions cannot be easily controlled. HeadNeRF [Hong et al. 2022] proposes to disentangle different semantic attributes, but it cannot represent personalized facial dynamics and facial details due to its generic model capacity. AD-NeRF [Guo et al. 2021b] and NerFACE [Gafni et al. 2021] can generate highly personalized facial animation; their subject-specific training makes the model learn more personalized facial details. However, they need a long time to train a reasonable dynamic head field. According to our experiments in Section 4.3, this is because they concatenate the expression condition with Fourier positional information and feed it directly to the MLP. Neither the Fourier positional encoding nor the "concatenate" strategy is ideal for fast training. The Fourier encoding is slow for the MLP to converge on, and the concatenation operation imposes no combination structure that relates local and global features (in the NeRF case, positional information and the expression condition). Therefore, it takes the MLP a long time to learn how to use the expression condition to predict RGB and density.
Recently, local features have been explored to improve NeRF's quality and efficiency. The original NeRF's local feature is the Fourier positional encoding, which takes a long time to converge. Follow-up works design different kinds of local features to improve NeRF. Some methods adopt a voxel field to accelerate the training process [Sara Fridovich-Keil and Alex Yu et al. 2022; Sun et al. 2022]. Other works use the voxel field to accelerate ray marching and volume rendering [Garbin et al. 2021; Liu et al. 2020; Yu et al. 2021a]. EG3D [Chan et al. 2022] adopts a compact and efficient tri-plane architecture enabling geometry-aware synthesis. TensoRF [Chen et al. 2022] factorizes the 4D scene tensor into multiple compact low-rank tensor components to separate local features. Among these methods, instant neural graphics primitives (INGP) [Müller et al. 2022] demonstrated a remarkable performance improvement in both training and rendering. It uses a highly compressed, compact data structure, the multi-level hash table, to make it possible to store a multi-level voxel field. A novel design of INGP is that feature query collisions are resolved in an adaptive way, so features at different levels can be trained together. Together with a high-performance ray-marching implementation, it can train a static NeRF scene in less than a minute and render one frame in tens of milliseconds.
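As a concrete illustration of a multi-level voxel field stored in hash tables, the snippet below sketches a highly simplified feature lookup in the spirit of INGP. The level count, table size, hashing constants, and the nearest-vertex lookup (INGP actually interpolates the surrounding grid vertices and learns the tables jointly with the MLP) are illustrative assumptions, not the exact configuration of INGP or of our model.

```python
# Minimal sketch of a multi-resolution hash-grid feature query (illustrative only).
import numpy as np

L, T, F = 4, 2**14, 2            # levels, hash-table size, features per entry (assumed values)
tables = [np.random.randn(T, F).astype(np.float32) * 1e-4 for _ in range(L)]
resolutions = [16 * 2**l for l in range(L)]           # coarse-to-fine grid resolutions
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_index(ijk):
    """Spatial hash of an integer grid vertex into [0, T)."""
    return int(np.bitwise_xor.reduce(ijk.astype(np.uint64) * PRIMES) % T)

def encode(x):
    """Concatenate (nearest-vertex) features from every level for a point in [0,1)^3."""
    feats = []
    for level, res in enumerate(resolutions):
        ijk = np.floor(x * res).astype(np.int64)       # trilinear interpolation omitted for brevity
        feats.append(tables[level][hash_index(ijk)])
    return np.concatenate(feats)                       # shape: (L * F,)

print(encode(np.array([0.3, 0.5, 0.7])).shape)         # -> (8,)
```

The point of this structure is that each level captures detail at a different spatial scale, while the hash tables keep the memory footprint bounded even for fine resolutions.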
Although many methods have been proposed to speed up the training and inference of a static NeRF field, it remains a problem to train a dynamic scene, such as complicated head deformation, quickly. As our baseline shows, the direct "concatenate" strategy that combines local features and the expression code as the input of the MLP, which is common in NeRF-based head applications, is neither efficient nor sufficient to model dynamic head motions.
In this paper, we present a personalized semantic facial model architecture defined on multi-level voxel fields. It not only inherits the semantic meaning of the mesh blendshape used for tracking, but also captures more personalized facial attributes, especially for non-face parts. Each basis of our model is the radiance field of a specific expression, represented by a multi-level voxel field. We adopt multi-resolution hash tables to store the multi-level features for performance reasons. Any novel expression can be expressed as the weighted combination of the voxel bases with the expression coefficients. We adopt an MLP to interpret the blended voxel field as a radiance field for volume rendering. To further accelerate the ray marching in volume rendering and make the optimization focus on the region possibly occupied by the head, we design an expression-aware density grid update strategy. Thanks to the powerful representation ability and fast convergence of our implicit model, our method outperforms other similar head construction methods in both modeling quality and construction speed. Our method can construct a photo-realistic personalized semantic facial model in around 10-20 minutes, which is remarkably faster than related NeRF-based head techniques. As our model is trained from a video of a specific person and combines the features in a latent space, it can capture personalized details including non-linear deformations (cheek folds, forehead wrinkles) and user-specific attributes (moles, beards). Compared with traditional mesh-based blendshape models, our model can be constructed from a short RGB video and generate high-fidelity, view-consistent head images with different expressions.
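To make the weighted-combination design concrete, here is a minimal sketch of one field query under this scheme. Dense voxel arrays and a tiny random-weight MLP stand in for the hash-table bases and trained decoder described above; all sizes and activations are illustrative assumptions rather than our actual architecture.

```python
# Minimal sketch: expression coefficients blend per-expression voxel features
# BEFORE a small MLP decodes density and color (illustrative stand-in only).
import numpy as np

K, F, H, R = 8, 16, 64, 32        # expression bases, feature dim, MLP width, grid resolution (assumed)
rng = np.random.default_rng(0)
basis_fields = rng.normal(0.0, 1e-2, size=(K, R, R, R, F))  # K dense voxel bases (stand-in for hash grids)
W1 = rng.normal(size=(F + 3, H))  # tiny 2-layer MLP; the view direction is appended to the feature
W2 = rng.normal(size=(H, 4))      # outputs (density, r, g, b)

def query(x, d, expr):
    """x: point in [0,1)^3, d: unit view direction, expr: K expression coefficients."""
    i, j, k = np.floor(x * R).astype(int)          # nearest voxel (interpolation omitted)
    feats = basis_fields[:, i, j, k]               # (K, F): one local feature per expression basis
    blended = expr @ feats                         # linear blend of the bases before the MLP
    h = np.maximum(np.concatenate([blended, d]) @ W1, 0.0)   # ReLU hidden layer
    out = h @ W2
    return np.exp(out[0]), 1.0 / (1.0 + np.exp(-out[1:]))    # density > 0, rgb in (0, 1)

sigma, rgb = query(np.array([0.4, 0.5, 0.6]), np.array([0.0, 0.0, 1.0]),
                   rng.dirichlet(np.ones(K)))      # coefficients of a hypothetical expression
```

Note how the expression coefficients modulate the local features before the MLP, in contrast to the baseline "concatenate" strategy discussed earlier, which appends the raw expression code to a single positional feature and leaves the network to discover the interaction on its own.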
In summary, our contributions include the following aspects:
• We present a novel semantic model for the human head defined with a neural radiance field. Our constructed NeRF bases not only carry disentangled semantic meanings, but also embody more personalized facial attributes, including muscle actions and detailed texture. Therefore, the constructed digital avatars can model facial motions well and generate photo-realistic results.
• Our representation combines multi-level voxel fields with expression coefficients in the latent space.
features could eciently learn head details in dierent scales.
The linear blending design could modulate the local features
in advance to adapt to MLP’s input distribution, which makes
our model cost much less time to construct and express more
realistic facial details.
With this novel representation, digital human head related
applications like facial reenactment can be easily achieved
and have remarkable performance, which implies its potential
usage in photo-realistic animation industry.
2 RELATED WORK
Parametric Head Model. Under the hypothesis that the human head shape space can be well disentangled into identity, expression, and appearance, Blanz and Vetter proposed the 3DMM [Blanz and Vetter 1999] to embed 3D head shape into several low-dimensional PCA spaces. Mesh-based parametric head models have been further studied by many following works. To improve the representation ability, some works extend the model to multilinear models [Cao et al. 2013; Vlasic et al. 2006], non-linear models [Guo et al. 2021a; Ranjan et al. 2018; Tran and Liu 2018], and articulated models with corrective blendshapes [Li et al. 2017]. Both mesh-based methods and deep learning based methods have been widely used in many related applications. However, mesh-based parametric models usually cannot represent personalized facial details due to their limited representation ability. Meanwhile, existing mesh-based parametric models cannot represent non-face parts, especially hair. Some works handle this problem using deformation transfer [Cao et al. 2016; Garrido et al. 2016; Hu et al. 2017; Ichim et al. 2015; Sumner and Popović 2004] or neural networks [Bai et al. 2021; Chaudhuri et al. 2020; Yang et al. 2020] to obtain user-specific blendshape bases.
To break through the limited representation ability of explicit mesh-based digital human representations, many works adopt implicit representations to improve model capacity and visual quality [Gafni et al. 2021; Hong et al. 2022; Jiang et al. 2022; Wang et al. 2022; Yenamandra et al. 2021; Zheng et al. 2022; Zhuang et al. 2021]. i3DMM [Yenamandra et al. 2021] is the first neural implicit function based 3D morphable model of full heads. HeadNeRF [Hong et al. 2022] proposes a generic head parametric model based on a neural radiance field. Although neural implicit function based representations have demonstrated strong representation ability, a generic model often still lacks personalized facial details. NerFACE [Gafni et al. 2021] presents a personalized NeRF-based human head model. However, their method requires a long time for training and inference for each subject. IM Avatar [Zheng et al. 2022] presents an implicit LBS model. Note that both our method and IM Avatar have an implicit blendshape architecture. The main difference is that IM Avatar focuses on detailed geometry and appearance, while our model focuses more on photorealistic rendering and efficient training/inference. Another difference is that IM Avatar uses backward non-rigid ray marching to find the canonical surface point for each ray, whereas our ray marching is performed in the deformed space.
Human Portrait Synthesis. Many methods have been proposed for facial reenactment and novel view synthesis. Image-based methods [Pumarola et al. 2019; Siarohin et al. 2019; Zakharov et al. 2020] adopt warping fields or encoder-decoder architectures to synthesize the images. As these methods represent the 3D deformation in the 2D space, artifacts may appear for large pose and expression changes. Morphable-model-based methods [Kim et al. 2018; Thies et al. 2020a, 2019, 2016; Zhang et al. 2022] use a parametric 3D model to synthesize a digital portrait. Deep Video Portraits [Kim et al. 2018] uses rendered correspondence maps together with an image-to-image translation network to output photo-realistic imagery. Deferred Neural Rendering [Thies et al. 2020a, 2019] proposes object-specific neural textures which can be interpreted by a neural renderer.
Neural Radiance eld. NeRF[Mildenhall et al
.
2020] proposes to
represent a scene with an MLP and utilize the volume rendering for
novel view synthesis task. As NeRF is dierentiable, its inputs can
be only multi-view images. Due to the above listed characteristics,
NeRF has been widely used to 3D geometry reconstruction[Wang
et al
.
2021; Yariv et al
.
2021], 4D scene synthesis [Li et al
.
2022; Park
et al
.
2021a,b] and digital human modeling[Peng et al
.
2021a,b; Weng
et al
.
2022],etc. Besides, a lot of research focus on improving NeRF’s
representation ability[Barron et al
.
2021] and reducing the number
of inputs[Chibane et al
.
2021; Niemeyer et al
.
2022; Yu et al
.
2021b].
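For context, the volume rendering referred to above accumulates the colors of samples along each camera ray with the standard quadrature (again a textbook formulation rather than an equation quoted from this paper):

  \hat{C}(r) = \sum_{i=1}^{N} T_i (1 - \exp(-\sigma_i \delta_i)) c_i,   where   T_i = \exp(-\sum_{j<i} \sigma_j \delta_j),

with \sigma_i and c_i the density and color predicted at sample i and \delta_i the spacing between adjacent samples. Since this sum is differentiable with respect to the network outputs, posed images alone provide sufficient supervision, which is the property the works cited above and below exploit.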
Recently, NeRF has also demonstrated its strong representation ability in human head modeling. Many works adopt NeRF to represent dynamic human head scenes and synthesize high-fidelity, 3D-consistent results. Generative head models [Chan et al. 2022, 2021; Deng et al. 2022; Gu et al. 2022; Niemeyer and Geiger 2021; Schwarz et al. 2020; Zhou et al. 2021] use a latent code to generate the rendering result. Although they usually offer good pose control over the result, they do not support expression editing due to their generative adversarial training strategy. Generic parametric head models [Hong et al. 2022; Zhuang et al. 2021] disentangle the latent space of the human head into identity, expression, and appearance spaces, and to some extent realize semantic control over head transformation. However, generic head models often ignore personalized facial details and user-specific facial muscle movements due to limited MLP capacity. AD-NeRF [Guo et al. 2021b] and NerFACE [Gafni et al. 2021] are subject-specific models that can generate high-fidelity human head animation controlled by voices or expressions. However, both AD-NeRF and NerFACE need days for training and seconds for inference, and we found that both of them tend to learn a smooth head scene and sometimes ignore high-frequency facial attributes.
Voxel Representation for NeRF Acceleration. With the help of a voxel field, NeRF can distribute its training burden across local features, which significantly improves the training speed [Sara Fridovich-Keil and Alex Yu et al. 2022; Sun et al. 2022]. A voxel field can also store spatial information, such as the density distribution, in advance to accelerate inference [Garbin et al. 2021; Liu et al. 2020; Lombardi et al. 2021; Yu et al. 2021a]. Recently, instant neural graphics primitives [Müller et al. 2022] adopts a multi-level hash table to augment a shallow MLP and achieves a combined speedup of several orders of magnitude. It can train a static scene with NeRF in only several seconds, and render the scene in tens of milliseconds. However, these methods cannot be directly used for dynamic scenes due to their complex non-rigid deformations. Meanwhile, it is hard to perform a "pruning" operation for voxel grid