SelfNeRF: Fast Training NeRF for Human from Monocular Self-rotating Video
Bo Peng Jun Hu Jingtao Zhou Juyong Zhang*
University of Science and Technology of China
{pb15881461858@mail.,hu997372@mail.,ustc zjt@mail.,juyong@}ustc.edu.cn
[Figure 1 graphic: a monocular video and a template are converted via neural rendering into novel view synthesis, with training snapshots shown at ~5 s, ~1 min, ~5 min, and ~20 min.]
Figure 1. Given a monocular self-rotating video of a human performer, SelfNeRF is able to train from scratch, converge in about twenty minutes, and then generate free-viewpoint videos.
Abstract
In this paper, we propose SelfNeRF, an efficient neural
radiance field based novel view synthesis method for hu-
man performance. Given monocular self-rotating videos
of human performers, SelfNeRF can train from scratch and
achieve high-fidelity results in about twenty minutes. Some
recent works have utilized the neural radiance field for
dynamic human reconstruction. However, most of these
methods need multi-view inputs and require hours of train-
ing, making it still difficult for practical use. To address
this challenging problem, we introduce a surface-relative
representation based on multi-resolution hash encoding
that can greatly improve the training speed and aggre-
gate inter-frame information. Extensive experimental results on several different datasets demonstrate the effectiveness and efficiency of SelfNeRF on challenging monocular
videos. Our code and video results will be available at
https://ustc3dv.github.io/SelfNeRF.
*Corresponding Author
1. Introduction
Novel view synthesis of human performance is an im-
portant research problem in computer vision and computer
graphics, and has wide applications in many areas such as
sports event broadcasts, video conferences, and VR/AR. Al-
though this problem has been widely studied for a long time,
existing methods still require multi-camera systems and
quite a long computation time. These shortcomings keep the technology out of reach for ordinary users. Therefore, a method that synthesizes high-fidelity novel views of human performance from a monocular camera and trains within tens of minutes would have significant value for practical use.
Traditional novel view synthesis methods need dense in-
puts for 2D image-based methods [8] or require depth cam-
eras for high-fidelity 3D reconstruction [5] to render real-
istic results. Some model-based methods [1,3,14] can reconstruct explicit 3D meshes from sparse RGB videos, but they lack geometric detail and tend to look unrealistic. Recently, several works have applied NeRF [20] to synthesize novel view images of dynamic human bodies. NeuralBody [25], AnimatableNeRF [24], H-NeRF [38], and other
works [15,16,30,41] are able to synthesize high-quality
rendering images and extract rough body geometry from
arXiv:2210.01651v1 [cs.CV] 4 Oct 2022
sparse-view videos of the human body by combining hu-
man body priors with the NeRF model. However, most of
these works require quite a long time to train for each sub-
ject. HumanNeRF [41] does not require training for each
subject from scratch but still takes around an hour to fine-
tune the model to achieve better results, making it still dif-
ficult to put into practical use. The long training time of
these methods is caused by the expensive computation cost
of NeRF. Moreover, most of these works still need a calibrated multi-view camera system to integrate multi-frame
information to produce a consistent registration sequence,
making it hard to deploy. Recently, with the well-designed
multi-resolution hash encoding [22], the training speed of
NeRF has been improved by several orders of magnitude. However, the
current strategy of INGP [22] only works for static scenes
with multi-view inputs, and how to extend it to dynamic
scenes with monocular inputs has not yet been explored.
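Concretely, the multi-resolution hash encoding of INGP [22] stores learnable feature vectors in a hash table per resolution level and trilinearly interpolates them at each query position. Below is a minimal single-level sketch in numpy; the XOR-of-primes spatial hash follows the INGP paper, while the table size, feature width, and the use of numpy rather than CUDA kernels are illustrative simplifications:

```python
import numpy as np

# Per-axis primes for the spatial hash used by Instant-NGP (Mueller et al. 2022).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_encode(points, table, resolution):
    """Trilinearly interpolated feature lookup at one resolution level.

    points     : (N, 3) query positions in [0, 1]^3
    table      : (T, F) learnable hash table of feature vectors
    resolution : grid resolution of this level
    """
    T = np.uint64(table.shape[0])
    scaled = points * resolution
    base = np.floor(scaled).astype(np.uint64)   # lower corner of the voxel
    frac = scaled - base                        # trilinear weights in [0, 1)
    feats = np.zeros((points.shape[0], table.shape[1]))
    for corner in range(8):                     # 8 voxel corners
        offset = np.array([(corner >> d) & 1 for d in range(3)], dtype=np.uint64)
        # XOR-of-primes spatial hash, modulo the table size.
        h = np.bitwise_xor.reduce((base + offset) * PRIMES, axis=1) % T
        # Trilinear weight of this corner for every query point.
        w = np.prod(np.where(offset == 1, frac, 1.0 - frac), axis=1)
        feats += w[:, None] * table[h.astype(np.int64)]
    return feats
```

The full encoding concatenates the features from several such levels with increasing resolution before feeding them to a small MLP; in SelfNeRF the encoded inputs are not raw positions but the surface-relative quantities described below.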
In this paper, we propose SelfNeRF, a view synthesis method for the human body, which can synthesize high-
fidelity novel view results of human performance with a
monocular camera and can converge within tens of minutes.
These characteristics make SelfNeRF practical for ordinary
users. We achieve these targets via a novel surface-relative
hash encoding by extending multi-resolution hash encod-
ing [22] to dynamic objects while aggregating informa-
tion across frames. Specifically, given the monocular self-
rotation video of a human performer, we recover the surface
shape for each frame with existing reconstruction methods
like VideoAvatar [1] and SelfRecon [11]. For each query point, we then find its K nearest neighbors on the current frame's point cloud and compute its signed distance to the surface; the canonical-space positions corresponding to those neighbors, together with the signed distance, form the surface-relative representation. For a sample point at a specific frame, we first calculate this relative representation and then apply hash encoding to obtain a high-dimensional feature that is fed to the NeRF MLP
to regress its color and density. We adopt volume render-
ing [17] to get the color of each pixel and then train our
model with photometric loss and geometric guidance loss.
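The surface-relative mapping described above can be sketched as follows, assuming each frame's reconstructed surface points, their canonical-space counterparts, and outward normals are available; the inverse-distance aggregation over the K neighbors is an illustrative choice, not necessarily the paper's exact formulation:

```python
import numpy as np
from scipy.spatial import cKDTree

def surface_relative_repr(query, frame_pts, canonical_pts, normals, k=4):
    """Surface-relative representation of query points for one frame.

    query         : (N, 3) sample points in the current frame's space
    frame_pts     : (M, 3) reconstructed surface points of this frame
    canonical_pts : (M, 3) the same surface points in canonical space
    normals       : (M, 3) outward surface normals (to sign the distance)
    k             : number of nearest neighbors (assumed >= 2)
    """
    tree = cKDTree(frame_pts)
    dist, idx = tree.query(query, k=k)            # each of shape (N, k)
    # Signed distance: magnitude from the nearest neighbor, sign from
    # which side of the surface the query point lies on.
    nearest = idx[:, 0]
    side = np.sum((query - frame_pts[nearest]) * normals[nearest], axis=1)
    signed_dist = np.sign(side) * dist[:, 0]
    # Aggregate the K neighbors' canonical positions with inverse-distance
    # weights, yielding a frame-independent anchor for hash encoding.
    w = 1.0 / np.maximum(dist, 1e-8)
    w /= w.sum(axis=1, keepdims=True)
    canonical_anchor = np.einsum('nk,nkc->nc', w, canonical_pts[idx])
    return canonical_anchor, signed_dist
```

Because the anchor lives in canonical space, sample points from different frames that lie near the same body-surface location hash to nearby table entries, which is what lets the encoding aggregate inter-frame information.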
Extensive experimental results demonstrate the effective-
ness of our proposed method. In summary, the contributions
of this paper include the following aspects:
• To the best of our knowledge, SelfNeRF is the first work that applies hash encoding to dynamic objects, and it can reconstruct a dynamic neural radiance field of a human in tens of minutes.
• A surface-relative hash encoding is proposed to aggregate inter-frame information and significantly speed up the training of the neural radiance field for humans.
• Combined with a state-of-the-art clothed human body reconstruction method, we achieve high-fidelity novel view synthesis of human performance with a monocular camera.
2. Related Work
Neural Radiance Field based Human Reconstruction
NeRF (neural radiance field) [20] represents a static scene
as a learnable 5D function and adopts volume rendering to
render the image from any given view direction. Though
vanilla NeRF fits only static scenes, requires dense view inputs, and is slow to train and render, much work has been done to extend NeRF to dynamic scenes [26] and sparse-view inputs [23] and to increase training and rendering speed [6]. Recently, some researchers have focused
on applying the neural radiance field to human reconstruc-
tion. Neuralbody [25] utilizes a set of latent codes an-
chored to a deformable mesh that is shared across different frames. H-NeRF [38] employs a structured implicit hu-
man body model to reconstruct the temporal motion of hu-
mans. AnimatableNeRF [24] introduces deformation fields
based on neural blend weight fields to generate observation-
to-canonical correspondences. Surface-Aligned NeRF [39]
defines the neural scene representation on the mesh surface
points and signed distances from the surface of a human
body mesh. Neural Actor [16] integrates texture map fea-
tures to refine volume rendering. HumanNeRF [41] em-
ploys an aggregated pixel-alignment feature and a pose em-
bedded non-rigid deformation field for tackling dynamic
motions. A-NeRF [30] proposes a skeleton embedding that serves as a common reference linking constraints across time.
Neural Human Performer [15] introduces a temporal trans-
former and a multi-view transformer to aggregate corre-
sponding features across space and time. Weng et al. [35]
optimize for NeRF representation of the person in a canoni-
cal T-pose and a motion field that maps the estimated canon-
ical representation to every frame of the video via backward warps, so that it requires only monocular inputs. AD-NeRF [7] employs a conditional NeRF to generate audio-driven talking heads. HeadNeRF [9] adds controllable codes
to NeRF to obtain the parametric representation of the hu-
man head. Although these methods can generate novel view
synthesis results for humans, they still need multi-view videos or are costly to train and evaluate.
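All of the NeRF variants above share the same volume-rendering backbone [20], which composites per-sample densities and colors along each camera ray. A minimal numpy sketch of the standard quadrature for a single ray (densities, colors, and sample spacings assumed precomputed):

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Standard NeRF volume-rendering quadrature along one ray.

    sigmas : (S,)   densities at the S samples
    colors : (S, 3) RGB at the samples
    deltas : (S,)   distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)          # opacity per segment
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = alphas * trans
    return (weights[:, None] * colors).sum(axis=0)   # composited RGB
```

Because every pixel requires many such samples through an MLP, this integral dominates NeRF's cost, which motivates the acceleration techniques surveyed next.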
Acceleration of Neural Radiance Field Training Al-
though NeRF [20] could generate high-fidelity novel view
synthesis, its long training time is unacceptable for practical use. Therefore, how to improve the training speed of NeRF has been widely studied since its emergence.
DS-NeRF [4] utilizes the depth information supplied by 3D
point clouds to speed up convergence and synthesize better
results from fewer training views. KiloNeRF [27] adopts
thousands of tiny MLPs instead of a single large MLP, which achieves real-time rendering and trains 2~3x