sparse-view videos of the human body by combining human body priors with the NeRF model. However, most of these works require a long training time for each subject. HumanNeRF [41] does not require training each subject from scratch but still takes around an hour of fine-tuning to achieve better results, which still hinders practical use. The long training time of these methods is caused by the expensive computation of NeRF. Moreover, most of these works still need a calibrated multi-view camera system to integrate multi-frame information into a consistent registration sequence, making them hard to deploy. Recently, with the well-designed multi-resolution hash encoding [22], the training speed of NeRF has been improved by several orders of magnitude. However, the current strategy of INGP [22] only works for static scenes with multi-view inputs, and how to extend it to dynamic scenes with monocular inputs has not yet been explored.
In this paper, we propose SelfNeRF, a view synthesis method for the human body that can synthesize high-fidelity novel views of human performance from a monocular camera and converges within tens of minutes. These characteristics make SelfNeRF practical for ordinary users. We achieve these goals via a novel surface-relative hash encoding, which extends multi-resolution hash encoding [22] to dynamic objects while aggregating information across frames. Specifically, given a monocular self-rotation video of a human performer, we recover the surface shape of each frame with existing reconstruction methods such as VideoAvatar [1] and SelfRecon [11]. For each query point, we then compute its K-nearest neighbor points on the current frame's point cloud and its signed distance to the surface, and take the canonical-space correspondences of these neighbors together with the signed distance as its relative representation. For a sampled point at a specific frame, we first compute this relative representation and then apply hash encoding to obtain a high-dimensional feature, which is fed to the NeRF MLP to regress color and density. We adopt volume rendering [17] to obtain the color of each pixel and train our model with a photometric loss and a geometric guidance loss.
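To make the per-point query concrete, the following is a minimal PyTorch-style sketch of the surface-relative encoding described above; the helpers hash_encode and nerf_mlp are placeholders standing in for a multi-resolution hash encoder [22] and the NeRF MLP, and the exact feature layout is an illustrative assumption rather than the released implementation.

```python
# Minimal sketch of the surface-relative encoding (illustrative; helper
# names and the feature layout are assumptions, not released code).
import torch

def surface_relative_repr(query, frame_pts, canon_pts, signed_dist, K=4):
    """query: (N,3) sample points at the current frame.
    frame_pts: (M,3) current frame's surface point cloud.
    canon_pts: (M,3) canonical-space correspondences of frame_pts.
    signed_dist: (N,1) signed distance of each query point to the surface."""
    dists = torch.cdist(query, frame_pts)             # (N, M) pairwise distances
    knn_dist, knn_idx = dists.topk(K, largest=False)  # K nearest surface points
    # Express the neighbors in canonical space so that the same body point
    # produces the same hash-table lookups in every frame.
    canon_nn = canon_pts[knn_idx]                     # (N, K, 3)
    return canon_nn, knn_dist, signed_dist

def radiance(query, view_dir, frame_pts, canon_pts, signed_dist,
             hash_encode, nerf_mlp):
    canon_nn, knn_dist, sd = surface_relative_repr(
        query, frame_pts, canon_pts, signed_dist)
    feat = hash_encode(canon_nn.reshape(len(query), -1))   # high-dim feature
    rgb, sigma = nerf_mlp(torch.cat([feat, knn_dist, sd], dim=-1), view_dir)
    return rgb, sigma                                 # color and density
```

Because the hash-table lookups are keyed on canonical coordinates shared by all frames, observations from different frames update the same table entries, which is how the encoding aggregates inter-frame information.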
Extensive experimental results demonstrate the effectiveness of our proposed method. In summary, the contributions of this paper include the following aspects:
• To the best of our knowledge, SelfNeRF is the first work that applies hash encoding to dynamic objects, and it can reconstruct a dynamic neural radiance field of a human in tens of minutes.
• A surface-relative hash encoding is proposed to aggregate inter-frame information and significantly speed up the training of the neural radiance field for humans.
• Combined with a state-of-the-art clothed human body reconstruction method, we can synthesize high-fidelity novel views of human performance with a monocular camera.
2. Related Work
Neural Radiance Field based Human Reconstruction
NeRF (neural radiance field) [20] represents a static scene as a learnable 5D function and adopts volume rendering to render the image from any given view direction.
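Concretely, for a camera ray \(\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}\), the pixel color is obtained by compositing the per-point colors \(\mathbf{c}\) and densities \(\sigma\) along the ray, following [20]:
\[
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t}\sigma(\mathbf{r}(s))\,ds\right).
\]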
Vanilla NeRF only fits static scenes, requires dense view inputs, and is slow to train and render, so much work has been done to extend it to dynamic scenes [26] and sparse view inputs [23] and to increase its training and rendering speed [6]. Recently, some researchers have focused on applying the neural radiance field to human reconstruction. Neural Body [25] utilizes a set of latent codes anchored to a deformable mesh that is shared across frames. H-NeRF [38] employs a structured implicit human body model to reconstruct the temporal motion of humans. AnimatableNeRF [24] introduces deformation fields based on neural blend weight fields to generate observation-to-canonical correspondences. Surface-Aligned NeRF [39] defines the neural scene representation on mesh surface points and signed distances from the surface of a human body mesh. Neural Actor [16] integrates texture map features to refine volume rendering. HumanNeRF [41] employs an aggregated pixel-alignment feature and a pose-embedded non-rigid deformation field to tackle dynamic motions. A-NeRF [30] proposes a skeleton embedding that serves as a common reference linking constraints across time. Neural Human Performer [15] introduces a temporal transformer and a multi-view transformer to aggregate corresponding features across space and time. Weng et al. [35] optimize a NeRF representation of the person in a canonical T-pose together with a motion field that maps the canonical representation to every frame of the video via backward warping, so that only monocular inputs are required. AD-NeRF [7] employs a conditional NeRF to generate audio-driven talking heads. HeadNeRF [9] adds controllable codes to NeRF to obtain a parametric representation of the human head. Although these methods can generate novel view synthesis results for humans, they still require multi-view video inputs or are costly to train and evaluate.
Acceleration of Neural Radiance Field Training
Although NeRF [20] can generate high-fidelity novel view synthesis results, its long training time is unacceptable for practical use. Therefore, improving the training speed of NeRF has been widely studied since its emergence. DS-NeRF [4] utilizes the depth information supplied by 3D point clouds to speed up convergence and synthesize better results from fewer training views. KiloNeRF [27] adopts thousands of tiny MLPs instead of one single large MLP, which achieves real-time rendering and trains 2∼3×