pearances without relying on explicit geometry formulation
using only a single input view.
NeRF-based human rendering methods implicitly encode
a dense scene radiance field in the form of a density and
a color for a given 3D query point and viewing direction
via a neural network. One of their advantages is that they
do not require 3D supervision, and instead rely on 2D su-
pervision from multi-view images. The main drawback of
the original formulation, however, is that a NeRF model has to be optimized anew for each scene; it is, in essence, a per-scene optimization scheme rather than a learning approach.
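For reference, the original formulation maps a 3D point \mathbf{x} and a viewing direction \mathbf{d} to a density and a color, and synthesizes a pixel by volume rendering along its camera ray \mathbf{r}(t) = \mathbf{o} + t\mathbf{d}:

F_\Theta(\mathbf{x}, \mathbf{d}) = (\sigma, \mathbf{c}), \qquad \hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\Big(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\Big),

so that training only requires a photometric loss between \hat{C}(\mathbf{r}) and the observed pixel colors of multi-view images.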
One stream of NeRF-based approaches for human render-
ing is building a subject-specific representation. NHR [52]
takes a sequence of point clouds as input and renders novel images conditioned on 80 input viewpoints.
NB [36] utilizes a 3D human mesh model (i.e. SMPL [26])
and subject-specific latent codes, to construct the 3D latent
code volume, which is used for density and color regres-
sion of any 3D point bounded by a given 3D human mesh.
Other works [2,25,35] deform observation-space 3D points
to the canonical 3D space using inverse bone transforma-
tions, and learn the neural radiance fields. The canonical
3D space represents the pose normalized space around the
template human mesh. NARF [33] learns neural radiance
fields per human part, and trains an autoencoder to encode
human appearances using a synthetic human dataset.
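The canonicalization in [2,25,35] typically follows inverse linear blend skinning; as a rough sketch (exact weighting schemes vary per method), an observation-space point \mathbf{x}, written in homogeneous coordinates, is mapped to the canonical space via

\mathbf{x}_{\text{can}} = \Big(\sum_{k=1}^{K} w_k(\mathbf{x})\, G_k\Big)^{-1} \mathbf{x},

where G_k denotes the rigid transformation of bone k under the observed pose and w_k(\mathbf{x}) are skinning weights, commonly borrowed from the nearest SMPL vertex.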
Recently, pixelNeRF [57], IBRNet [49], and SRF [3]
proposed to combine an image-based feature encoding and
NeRF. Instead of memorizing the scene radiance in a 3D
space, their networks estimate it based on pixel-aligned im-
age features. We chose to adopt this strategy, as it allows a
single network to be trained across multiple scenes to learn
a scene prior, enabling it to generalize to unseen scenes in a
feed-forward manner from a sparse set of views. PVA [37],
Wang et al. [50], and NHP [20] also apply this approach
to human rendering. In particular, NHP targets human body rendering, which is also our interest, using sparse multi-view videos. Given a GT SMPL mesh, it exploits pixel-aligned image features to construct a 3D latent volume.
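To make this pixel-aligned conditioning concrete, the sketch below (PyTorch-style; the function and tensor names are illustrative, and a skew-free pinhole camera is assumed) projects 3D query points onto the input image and bilinearly samples the backbone feature map, yielding one feature vector per query point that conditions density and color regression instead of a memorized per-scene field.

```python
import torch
import torch.nn.functional as F

def pixel_aligned_features(points, feat_map, K, R, t):
    """Sample image features at the 2D projections of 3D query points.

    points:   (N, 3) 3D query points in world coordinates.
    feat_map: (1, C, H, W) feature map from the image backbone.
    K, R, t:  camera intrinsics (3, 3), rotation (3, 3), translation (3,).
    Returns:  (N, C) pixel-aligned features (simplified sketch).
    """
    # Transform to camera coordinates, then apply a pinhole projection.
    cam = points @ R.T + t                          # (N, 3)
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-6)   # normalized image plane
    uv = uv @ K[:2, :2].T + K[:2, 2]                # pixel coordinates (N, 2)

    # Normalize to [-1, 1] for grid_sample (x along width, y along height).
    H, W = feat_map.shape[-2:]
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)                   # (1, 1, N, 2)

    # Bilinear sampling gives one feature vector per query point.
    sampled = F.grid_sample(feat_map, grid, align_corners=True)  # (1, C, 1, N)
    return sampled[0, :, 0].T                       # (N, C)
```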
Among the above generalizable NeRF extensions [20,
37,50], no work has addressed the challenges inherent to
single monocular images, such as occlusions and depth am-
biguity. Instead, they attempted to resolve the issues by in-
creasing the input information, such as using more views
and temporal information, which in practice can be a lim-
iting or even prohibitive factor for real-world applications.
In this paper, we demonstrate the robustness of MonoNHR
on single images by comparing it with pixelNeRF [57] and
NHP [20], which are generalizable NeRF extensions and
are applicable to monocular images.
Neural surface fields-based human rendering methods
are closely related to NeRF representations, since they also
aim at learning an implicit function, in this case, an indica-
tor of the interior and exterior of the observed shape. They
allow for accurate and detailed 3D geometry reconstruc-
tions, but they require strong 3D supervision, such as 3D
scans. 3D scans are highly costly to obtain at scale; consequently, such methods are trained on small scan datasets and tend to generalize poorly to unseen human poses and appearances. PIFu [38] and its ex-
tensions [13,39] propose to estimate the 3D surface of a
human using an implicit function based on pixel-aligned
image features. The pixel-aligned image features are ob-
tained by projecting 3D points onto the image plane. Similar to more classical approaches, they first reconstruct 3D surfaces and then condition texture inference on surface reconstruction features, as we do. However, as DoubleField [44] pointed out, the learning space of texture is highly limited around the surface and discontinuous, which hinders optimization. Zins et al. [61] improve PIFu in a
multi-view setting by introducing an attention-based view
fusion layer and a context encoding module using 3D con-
volutions. POSEFusion [24] takes monocular RGBD video frames as input and learns to fuse surface estimates from different time steps. DoubleField jointly learns
neural radiance and surface fields, and uses raw RGB pixel
values to render high-resolution images.
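For contrast with the radiance formulation above, the pixel-aligned implicit function of PIFu can be summarized as predicting an inside/outside indicator rather than density and color,

f\big(\Phi(\pi(\mathbf{x}), I),\, z(\mathbf{x})\big) = \hat{o}(\mathbf{x}) \in [0, 1],

where \pi(\mathbf{x}) is the 2D projection of a 3D point \mathbf{x} onto the image I, \Phi the image feature sampled at that pixel, and z(\mathbf{x}) the depth along the camera axis; the surface is extracted as the 0.5 level set of \hat{o}. Supervising \hat{o} requires ground-truth occupancy labels sampled from 3D scans, which is precisely the dependency discussed above.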
Compared to the above neural surface fields-based hu-
man rendering methods, MonoNHR has two clear differ-
ences. First, it does not require 3D scans for training and
follows the NeRF-based human rendering pipeline, i.e. using
a weak supervision signal from multi-view images. Al-
though NeRF-based human rendering methods, including
ours, use SMPL fits, these are much easier to obtain than 3D scans, thanks to existing powerful 3D human pose and shape estimation methods [19,31]. On the other hand, special and expensive equipment, e.g. over 100 synchronized multi-view cameras, is necessary to generate accurate 3D scans, making them difficult to obtain at a large scale.
Please note that 3D scans obtained from a small number of
cameras, e.g. via COLMAP [41,42], are not accurate enough
to provide supervision targets for PIFu and its variants, as
discussed in [22,36]. In consequence, PIFu and its vari-
ants are trained on small scale datasets and tend not to gen-
eralize well to unseen data, especially non-upright stand-
ing poses, as discussed in [20,36].Second, the absence of
3D scans prevents explicit 3D geometry supervision, mak-
ing disentangling geometry and texture non-trivial. We ex-
tract geometry-dedicated features for disentanglement and
use them for density estimation without RGB estimation.
3. MonoNHR
The overall pipeline of MonoNHR is detailed in Fig-
ure 2. It is trained in an end-to-end manner and consists
of an image feature backbone, a Mesh Inpainter, a geometry
branch, and a texture branch.
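Before detailing each component, the following sketch summarizes how they interact; the module interfaces, signatures, and the role given to the Mesh Inpainter output are illustrative assumptions based on the description above, not the actual implementation. It captures the disentanglement discussed in Section 2: the geometry branch predicts density from geometry-dedicated features without any RGB estimation, and texture inference is conditioned on those features.

```python
def mononhr_forward(image, smpl_mesh, query_points,
                    backbone, mesh_inpainter, geometry_branch, texture_branch):
    """Illustrative forward pass; all interfaces below are assumptions.

    It encodes the idea stated in the text: geometry-dedicated features
    are used for density estimation without RGB estimation, and texture
    inference is conditioned on them, disentangling geometry from texture
    even though no 3D scan supervision is available.
    """
    feat_map = backbone(image)                        # pixel-aligned image features
    mesh_feat = mesh_inpainter(smpl_mesh, feat_map)   # hypothetical: features on SMPL vertices

    geo_feat = geometry_branch.features(query_points, feat_map, mesh_feat)
    sigma = geometry_branch.density(geo_feat)         # density only, no color here

    rgb = texture_branch(query_points, feat_map, mesh_feat, geo_feat)
    return sigma, rgb                                 # composited by volume rendering as usual
```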