scenes, (ii) with insufficient views that are 360° around the scenes, and (iii) incorporating depth data for better rendering. Fig. 1 displays an example of our setting. To tackle this
hard setting, we propose an explicit neural radiance field
(X-NeRF), which can take RGB-D images as inputs. Dif-
ferent from other NeRF-like approaches implicitly mapping
coordinates to colors and densities, we explicitly model this problem as a completion task. The intuition comes from the observation that, given a few RGB-D input images, a large part of the scene is already known. In other words, plenty of information is available from the start, so we only need to learn a general, scene-agnostic completion mapping. Since the network is designed to encode
a general completion mapping rather than a specific scene,
we can naturally deal with the multi-scene problem.
Specifically, the input RGB-D images are converted to sparse colored point clouds and quantized into sparse tensors on which the Minkowski Engine [4] can operate directly. We adopt a 3D sparse generative CNN to construct and complete the explicit neural radiance fields. Our backbone uses a UNet-like [34] encoder-decoder structure whose decoder contains multi-stage generative transposed convolution and pruning layers. To avoid overfitting to the seen views, besides the common rendering loss, we also apply a perceptual loss with patch-wise sampling, as well as view augmentation through random rotations of the point clouds. Volumetric rendering with post-activation is used: by shooting a ray from a pixel and querying the field along it, the accumulated color and depth of that pixel can be rendered.
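To make the rendering step concrete, the following is a minimal PyTorch sketch of volumetric compositing with post-activation along a batch of rays. The function name, tensor layout, and the specific activations (softplus for density, sigmoid for color) are illustrative assumptions rather than the actual X-NeRF implementation; the accumulation of per-pixel color and depth follows the standard alpha-compositing formulation.

```python
import torch
import torch.nn.functional as F


def composite_ray(raw_sigma, raw_rgb, t_vals):
    """Accumulate color and depth along each ray via alpha compositing.

    raw_sigma: (N_rays, N_samples)    pre-activation densities queried from the field
    raw_rgb:   (N_rays, N_samples, 3) pre-activation colors queried from the field
    t_vals:    (N_rays, N_samples)    depths of the samples along each ray
    """
    # Post-activation: non-linearities are applied after querying the raw field values.
    sigma = F.softplus(raw_sigma)       # non-negative volume density
    rgb = torch.sigmoid(raw_rgb)        # colors constrained to [0, 1]

    # Distances between consecutive samples; the last interval is padded with a large value.
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)

    # Per-sample opacity and transmittance accumulated from the camera.
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
    )[:, :-1]
    weights = alpha * trans             # contribution of each sample to the pixel

    color = (weights.unsqueeze(-1) * rgb).sum(dim=1)   # (N_rays, 3) rendered pixel colors
    depth = (weights * t_vals).sum(dim=1)               # (N_rays,)   rendered pixel depths
    return color, depth
```

The rendering loss and the patch-wise perceptual loss can then be computed on these per-pixel colors, and the rendered depths can be compared against the input depth observations.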
Extensive experiments demonstrate that the proposed task is extremely challenging for existing methods, while our approach handles it well. In single-scene experiments, we first compare our approach with DS-NeRF [7], an advanced NeRF-based method that also supports depth supervision, and DVGO [39], a state-of-the-art NeRF-like method utilizing explicit structures; for a fair comparison, we add depth supervision to DVGO [39]. We then compare X-NeRF with recent NeRF-related works that support multi-scene training, such as pixelNeRF [50] and IBRNet [44] (depth supervision is also added). The results clearly show that X-NeRF is robust to insufficient 360° views across multiple scenes and produces reliable novel view predictions. Our method outperforms previous methods in this extreme setting, indicating that X-NeRF can be applied in practice at low cost: one model can be trained for many scenes, and inference is lightweight.
2. Related Work
Novel view synthesis. Synthesizing a novel view image from a given set of images is a classic and long-standing task.
Rendering methods can be broadly divided into image-based and model-based approaches. Image-based methods [9, 15, 44] directly learn image-level transformations such as warping or interpolation and are typically more computationally efficient. However, they need reference views during inference, and the number and density of the reference images can greatly influence rendering quality. Model-based methods [16, 18, 32, 35, 42] express scenes as high-dimensional representations and apply physically meaningful models, such as an optical model [27], to render novel view images. Scenes can be represented in various forms. Earlier works apply lumigraphs [2, 12] and light fields [5, 19, 20, 36] to interpolate directly between input images. Nevertheless, they need exceedingly dense inputs, which is unaffordable in many applications. Other methods utilize explicit representations such as meshes [6, 40, 43, 45] to deal with sparse inputs. However, mesh-based approaches do not work well with gradient-based optimization due to discontinuities and local minima. Recently, many deep-learning-based methods employ CNNs to construct multi-plane images (MPIs) [8, 22, 28, 38, 41, 53] for forward-facing captures. There are also approaches that encode scenes as volumetric representations [16, 18, 28, 41, 42, 53], but they often struggle with complex and large-scale scenes.
Neural Radiance Fields. NeRFs have attracted great interest and achieved remarkable success in novel view synthesis in recent years. The classic NeRF [29] learns a direct mapping from coordinates to attributes such as color and density, implicitly encoding a scene in MLPs. Since its proposal, NeRF [29] has been extended into many variants with different characteristics, including editable [17, 24, 47], fast inference and/or training [7, 11, 23, 39], deformable [30, 33], and unconstrained images [3, 26], etc. Some recent works [39, 48, 49] introduce explicit structures to gain substantial performance improvements, which indicates that the implicit MLP architecture is not necessarily the key to success. Nevertheless, despite their explicit voxel-grid structures, these methods are still essentially implicit models, as they still encode the scene information in learnable parameters. This implicit modeling makes it hard for NeRF-based methods to generalize freely across multiple scenes. Although some works such as [50] claim to handle the multi-scene task, in practice they can only process multiple small objects or multiple similar simulated scenes. Moreover, in the extreme situation proposed in this paper, where the input views are insufficient, i.e., extremely sparse yet spread 360° around real scenes with almost no overlap (often less than 10% to 20%), implicit modeling easily overfits to a trivial solution because it imposes few constraints on the scene structure. Some approaches, such as [7, 50], can deal with few inputs, but their applicable scenario is mostly forward-facing captures, which are still not sparse enough.
Multi-modal RGB-D data. Nowadays, with the rapid development of hardware devices, the depth modality is becoming