sparse-view videos of the human body by combining human body priors with the NeRF model. However, most of these works require a long training time for each subject. HumanNeRF [41] does not require training each subject from scratch but still takes around an hour of fine-tuning to achieve better results, which still hinders practical use. The long training time of these methods is caused by the expensive computation of NeRF. Moreover, most of these works still need a calibrated multi-view camera system to integrate multi-frame information into a consistent registration sequence, making them hard to deploy. Recently, with the well-designed multi-resolution hash encoding [22], the training speed of NeRF has been improved by several orders of magnitude. However, the current strategy of INGP [22] only works for static scenes with multi-view inputs, and how to extend it to dynamic scenes with monocular inputs has not yet been explored.
In this paper, we propose SelfNeRF, a view synthesis method for the human body that can synthesize high-fidelity novel views of human performance from a monocular camera and converges within tens of minutes. These characteristics make SelfNeRF practical for ordinary users. We achieve these goals via a novel surface-relative hash encoding, which extends multi-resolution hash encoding [22] to dynamic objects while aggregating information across frames. Specifically, given a monocular self-rotation video of a human performer, we recover the surface shape of each frame with existing reconstruction methods such as VideoAvatar [1] and SelfRecon [11]. For each query point, we then compute its K-nearest neighbor points on the current frame's point cloud and its signed distance to the surface, and take the canonical-space correspondences of these neighbors together with the signed distance as its relative representation. For a sampled point at a specific frame, we first compute this relative representation and then apply hash encoding to obtain a high-dimensional feature, which is fed to the NeRF MLP to regress color and density. We adopt volume rendering [17] to obtain the color of each pixel and train our model with a photometric loss and a geometric guidance loss.
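To make the per-point query concrete, the following is a minimal PyTorch-style sketch of the surface-relative encoding described above; the helpers hash_encode and nerf_mlp are placeholders standing in for a multi-resolution hash encoder [22] and the NeRF MLP, and the exact feature layout is an illustrative assumption rather than the released implementation.

```python
# Minimal sketch of the surface-relative encoding (illustrative; helper
# names and the feature layout are assumptions, not released code).
import torch

def surface_relative_repr(query, frame_pts, canon_pts, signed_dist, K=4):
    """query: (N,3) sample points at the current frame.
    frame_pts: (M,3) current frame's surface point cloud.
    canon_pts: (M,3) canonical-space correspondences of frame_pts.
    signed_dist: (N,1) signed distance of each query point to the surface."""
    dists = torch.cdist(query, frame_pts)             # (N, M) pairwise distances
    knn_dist, knn_idx = dists.topk(K, largest=False)  # K nearest surface points
    # Express the neighbors in canonical space so that the same body point
    # produces the same hash-table lookups in every frame.
    canon_nn = canon_pts[knn_idx]                     # (N, K, 3)
    return canon_nn, knn_dist, signed_dist

def radiance(query, view_dir, frame_pts, canon_pts, signed_dist,
             hash_encode, nerf_mlp):
    canon_nn, knn_dist, sd = surface_relative_repr(
        query, frame_pts, canon_pts, signed_dist)
    feat = hash_encode(canon_nn.reshape(len(query), -1))   # high-dim feature
    rgb, sigma = nerf_mlp(torch.cat([feat, knn_dist, sd], dim=-1), view_dir)
    return rgb, sigma                                 # color and density
```

Because the hash-table lookups are keyed on canonical coordinates shared by all frames, observations from different frames update the same table entries, which is how the encoding aggregates inter-frame information.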
Extensive experimental results demonstrate the effectiveness of our proposed method. In summary, the contributions of this paper include the following aspects:
• To the best of our knowledge, SelfNeRF is the first work that applies hash encoding to dynamic objects, and it can reconstruct a dynamic neural radiance field of a human in tens of minutes.
• A surface-relative hash encoding is proposed to aggregate inter-frame information and significantly speed up the training of the neural radiance field for humans.
• Combined with a state-of-the-art clothed human body reconstruction method, we can synthesize high-fidelity novel views of human performance with a monocular camera.
2. Related Work
Neural Radiance Field based Human Reconstruction
NeRF (neural radiance field) [20] represents a static scene as a learnable 5D function and adopts volume rendering to render the image from any given view direction.
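Concretely, for a camera ray \(\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}\), the pixel color is obtained by compositing the per-point colors \(\mathbf{c}\) and densities \(\sigma\) along the ray, following [20]:
\[
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t}\sigma(\mathbf{r}(s))\,ds\right).
\]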
Vanilla NeRF only fits static scenes, requires dense view inputs, and is slow to train and render, so much work has been done to extend it to dynamic scenes [26] and sparse view inputs [23] and to increase its training and rendering speed [6]. Recently, some researchers have focused on applying the neural radiance field to human reconstruction. Neural Body [25] utilizes a set of latent codes anchored to a deformable mesh that is shared across frames. H-NeRF [38] employs a structured implicit human body model to reconstruct the temporal motion of humans. AnimatableNeRF [24] introduces deformation fields based on neural blend weight fields to generate observation-to-canonical correspondences. Surface-Aligned NeRF [39] defines the neural scene representation on mesh surface points and signed distances from the surface of a human body mesh. Neural Actor [16] integrates texture map features to refine volume rendering. HumanNeRF [41] employs an aggregated pixel-alignment feature and a pose-embedded non-rigid deformation field to tackle dynamic motions. A-NeRF [30] proposes a skeleton embedding that serves as a common reference linking constraints across time. Neural Human Performer [15] introduces a temporal transformer and a multi-view transformer to aggregate corresponding features across space and time. Weng et al. [35] optimize a NeRF representation of the person in a canonical T-pose together with a motion field that maps the canonical representation to every frame of the video via backward warping, so that only monocular inputs are required. AD-NeRF [7] employs a conditional NeRF to generate audio-driven talking heads. HeadNeRF [9] adds controllable codes to NeRF to obtain a parametric representation of the human head. Although these methods can generate novel view synthesis results for humans, they still require multi-view video inputs or are costly to train and evaluate.
Acceleration of Neural Radiance Field Training
Although NeRF [20] can generate high-fidelity novel view synthesis results, its long training time is unacceptable for practical use. Therefore, improving the training speed of NeRF has been widely studied since its emergence. DS-NeRF [4] utilizes the depth information supplied by 3D point clouds to speed up convergence and synthesize better results from fewer training views. KiloNeRF [27] adopts thousands of tiny MLPs instead of one single large MLP, which achieves real-time rendering and trains 2∼3×