time. The blendshape head model, in the form of a linear/bilinear combination of different facial expressions, has the following advantages. It is a semantic parameterization: the combination coefficients have an intuitive meaning for users as the strength or influence of specific facial expressions. Meanwhile, the blendshapes span a reasonable shape space in which the user can freely control and edit the face.
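As background, the core operation is simply a weighted sum of expression bases. The following is a minimal sketch of a linear blendshape evaluation; the vertex count, the 46 bases, and the random geometry are placeholders for illustration, not data from this work:

```python
import torch

def blend_expression(neutral, deltas, weights):
    """neutral: (V, 3) rest-pose vertices.
    deltas:  (K, V, 3) per-blendshape vertex offsets from the neutral face.
    weights: (K,) semantic expression coefficients, typically in [0, 1]."""
    return neutral + torch.einsum('k,kvc->vc', weights, deltas)

# Toy example: activate a single expression basis at 80% strength.
V, K = 5000, 46                          # 46 expression bases, as in FaceWarehouse-style rigs
neutral = torch.zeros(V, 3)
deltas = 0.01 * torch.randn(K, V, 3)
weights = torch.zeros(K)
weights[3] = 0.8
mesh = blend_expression(neutral, deltas, weights)   # (V, 3) blended vertices
```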
Generalized semantic head models like FaceWarehouse [Cao et al. 2013] aim to model different subjects with different expressions, and thus may ignore personalized geometry and texture details. To construct a personalized blendshape model, traditional mesh-based methods usually adopt deformation transfer [Garrido et al. 2016; Li et al. 2010; Sumner and Popović 2004] or multilinear tensor-based 3DMMs [Cao et al. 2018, 2014; Vlasic et al. 2006]. However, these methods usually have the following disadvantages. First, mesh-based parametric models can hardly represent personalized non-face parts such as hair and teeth. Second, to use RGB supervision, they have to rely on approximate differentiable rendering techniques to alleviate the non-differentiability problem. Third, deformation transfer cannot reconstruct expressions realistically due to its limited representation ability. Last, facial expressions are characterized by many factors such as age and muscle movements, and these factors are hard to express accurately with predefined blendshapes.
Recently, NeRF-based methods have made it possible to synthesize photorealistic images. Some works integrate NeRF with GANs [Chan et al. 2022, 2021; Deng et al. 2022; Gu et al. 2022; Niemeyer and Geiger 2021; Schwarz et al. 2020; Zhou et al. 2021]. However, these generative models couple expression, identity, and appearance together, so that expressions cannot be easily controlled.
HeadNeRF [Hong et al. 2022] proposes to disentangle different semantic attributes, but it cannot represent personalized facial dynamics and facial details due to its generic model capacity. AD-NeRF [Guo et al. 2021b] and NerFACE [Gafni et al. 2021] can generate highly personalized facial animation; their user-specific training makes the model learn more personalized facial details. However, they need a long time to train a reasonable dynamic head field. According to our experiments in Section 4.3, this is because they concatenate the expression condition with the Fourier positional encoding and directly feed it to the MLP. Neither the Fourier positional encoding nor the "concatenate" strategy is ideal for fast training. The Fourier encoding converges slowly when processed by an MLP, and the concatenation operation does not encode any combination law that could reveal the relation between local and global features (in the NeRF case, positional information and the expression condition). Therefore, it takes a long time for the MLP to learn how to use the expression condition to predict RGB and density.
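For concreteness, the conditioning criticized here can be sketched as follows. This is a hedged re-implementation of the common "concatenate" strategy, not the code of AD-NeRF or NerFACE; the 76-dimensional expression code and the network sizes are assumptions:

```python
import math
import torch
import torch.nn as nn

def fourier_encode(x, num_freqs=10):
    """Standard NeRF positional encoding: sin/cos of the input at increasing frequencies."""
    freqs = (2.0 ** torch.arange(num_freqs, dtype=torch.float32)) * math.pi
    ang = x[..., None] * freqs                       # (..., 3, num_freqs)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)

class ConcatConditionedNeRF(nn.Module):
    """Baseline conditioning: the expression code is simply concatenated with the
    positional encoding, and the MLP must discover how the two interact on its own."""
    def __init__(self, expr_dim=76, num_freqs=10, hidden=256):
        super().__init__()
        in_dim = 3 * 2 * num_freqs + expr_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                    # density + RGB per sample
        )

    def forward(self, xyz, expr):                    # xyz: (N, 3), expr: (N, expr_dim)
        return self.mlp(torch.cat([fourier_encode(xyz), expr], dim=-1))
```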
Recently, local features have been explored to improve NeRF's quality and efficiency. The original NeRF's local feature is the Fourier positional encoding, which takes a long time to converge. Follow-up works design different kinds of local features to improve NeRF. Some methods adopt a voxel field to accelerate the training process [Sara Fridovich-Keil and Alex Yu et al. 2022; Sun et al. 2022]. Other works use the voxel field to accelerate ray marching and volume rendering [Garbin et al. 2021; Liu et al. 2020; Yu et al. 2021a]. EG3D [Chan et al. 2022] adopts a compact and efficient tri-plane architecture enabling geometry-aware synthesis. TensoRF [Chen et al. 2022] factorizes the 4D scene tensor into multiple compact low-rank tensor components to separate local features. Among these methods, instant neural graphics primitives (INGP) [Müller et al. 2022] demonstrate a remarkable performance improvement in both training and rendering. INGP uses a highly compressed, compact data structure, a multi-level hash table, to make it possible to store a multi-level voxel field. A novel design of INGP is that feature query collisions are resolved in an adaptive way, so features at different levels can be trained together. With the help of a high-performance ray-marching implementation, it can train a static NeRF scene in less than one minute and render a frame in tens of milliseconds.
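A toy version of the multi-level hash-grid query illustrates the idea: each level stores features in a small hash table, the corners of the enclosing grid cell are hashed (collisions are tolerated and left for training to resolve), and the per-level features are trilinearly interpolated and concatenated. The sketch below is an illustrative Python approximation with assumed sizes, not the fused CUDA implementation of Müller et al. [2022]:

```python
import torch
import torch.nn as nn

class HashGridEncoding(nn.Module):
    """Toy multi-level hash-grid feature query in the spirit of INGP (illustrative only;
    the real method uses far larger tables and a fused CUDA implementation)."""
    def __init__(self, levels=8, table_size=2**14, feat_dim=2, base_res=16, growth=1.5):
        super().__init__()
        self.res = [int(base_res * growth ** l) for l in range(levels)]
        self.table_size = table_size
        self.tables = nn.ParameterList(
            [nn.Parameter(1e-4 * torch.randn(table_size, feat_dim)) for _ in range(levels)])

    def _hash(self, ijk):                            # ijk: (N, 3) integer grid coordinates
        # Spatial hash with per-axis primes, wrapped into the table (collisions allowed).
        h = (ijk[..., 0] * 1) ^ (ijk[..., 1] * 2654435761) ^ (ijk[..., 2] * 805459861)
        return h % self.table_size

    def forward(self, x):                            # x: (N, 3) points scaled to [0, 1]^3
        feats = []
        for level, res in enumerate(self.res):
            pos = x * res
            base, frac = pos.floor().long(), pos - pos.floor()
            level_feat = 0.0
            for corner in range(8):                  # trilinear blend over the 8 cell corners
                offset = torch.tensor([(corner >> d) & 1 for d in range(3)])
                weight = torch.where(offset.bool(), frac, 1.0 - frac).prod(-1, keepdim=True)
                level_feat = level_feat + weight * self.tables[level][self._hash(base + offset)]
            feats.append(level_feat)
        return torch.cat(feats, dim=-1)              # (N, levels * feat_dim)
```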
Although many methods have been proposed to speed up the training and inference of a static NeRF field, fast training of a dynamic scene, such as the complicated deformation of a head, remains an open problem. As our baseline shows, directly concatenating local features and the expression code as the input of the MLP, which is common in NeRF-based head applications, is neither efficient nor sufficient for modeling dynamic head motions.
In this paper, we present a personalized semantic facial model architecture defined on a multi-level voxel field. It not only inherits the semantic meaning of the mesh blendshapes used for tracking, but also captures more personalized facial attributes, especially for non-face parts. Each basis of our model is a radiance field of a specific expression, represented by a multi-level voxel field. We adopt multi-resolution hash tables to store the multi-level features for performance reasons. Any novel expression can then be expressed as the weighted combination of the voxel bases with the expression coefficients. We adopt an MLP to interpret the voxel field as a radiance field for volume rendering. To further accelerate the ray marching in volume rendering and make the optimization focus on the region possibly occupied by the head, we design an expression-aware density grid update strategy. Thanks to the powerful representation ability and fast convergence of our implicit model, our method outperforms other similar head construction methods in both modeling quality and construction speed. Our method can construct a photo-realistic personalized semantic facial model in around 10-20 minutes, which is remarkably faster than related NeRF-based head techniques. As our model is trained from a video of a specific person and combines the features in a latent space, it can capture personalized details, including non-linear deformations (cheek folds, forehead wrinkles) and user-specific attributes (moles, beard). Compared with traditional mesh-based blendshape models, our model can be constructed from a short RGB video and generates high-fidelity, view-consistent head images with different expressions.
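Assuming tracked expression coefficients are available per frame and each basis is stored as a hash-grid feature field, the representation described above could be wired up roughly as follows. This is a minimal sketch with assumed module names and sizes, reusing the toy HashGridEncoding from the previous listing; the expression-aware density grid update and the high-performance ray-marching implementation are omitted:

```python
import torch
import torch.nn as nn

class VoxelBlendshapeField(nn.Module):
    """Sketch of the idea: one multi-level hash-grid feature field per expression basis,
    blended linearly by the expression coefficients in latent (feature) space before a
    small MLP decodes density and color. All sizes are illustrative."""
    def __init__(self, num_bases=46, hidden=64):
        super().__init__()
        self.bases = nn.ModuleList([HashGridEncoding() for _ in range(num_bases)])
        feat_dim = len(self.bases[0].res) * self.bases[0].tables[0].shape[1]
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                    # density + RGB per sample
        )

    def forward(self, xyz, expr):                    # xyz: (N, 3), expr: (num_bases,) coefficients
        # Blend per-basis local features in latent space, then decode once per sample.
        feat = sum(w * basis(xyz) for w, basis in zip(expr, self.bases))
        return self.decoder(feat)
```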
In summary, the contributions include the following aspects:
• We present a novel semantic model for the human head defined with a neural radiance field. Our constructed NeRF basis not only has a disentangled semantic meaning, but also embodies more personalized facial attributes, including muscle actions and detailed texture. Therefore, the constructed digital avatars can model facial motions well and generate photo-realistic results.
• Our representation combines multi-level voxel fields with expression coefficients in the latent space. The multi-resolution