functions, the geometry and appearance of the underlying scene can be optimized based on the error
derived from the downstream supervision signals.
[3;21;33;28;16] adopt neural implicit fields to represent 3D shapes and attain highly detailed geometries. On the other hand, [27;29;31] work on discrete grids, UV maps, and point clouds, respectively, with attached learnable neural features that can produce pleasing novel view imagery. More recently, the Neural Radiance Field (NeRF) technique [17] has revolutionized several research fields with a trained MLP-based radiance and opacity field, achieving
unprecedented success in producing photo-realistic imagery. Since then, an explosion of follow-up techniques has improved NeRF in various aspects [12;22;35;11;32;15;10;30;13]. A research direction drawing increasing interest, which we discuss in the following, is to incorporate such neural representations to learn 3D generative models with photo-realistic viewing effects.
Generative neural scene generation.
With great success achieved in 2D image generation tasks, researchers have taken several routes to realize 3D-aware generative models. At the heart of these methods is a 3D neural scene representation that, paired with volume rendering, makes the whole pipeline differentiable with respect to supervision imposed in the image domain. [23;1] integrate a neural radiance field into generative models and directly produce the final images via volume rendering, yielding a 3D field with strong view consistency under varying views. To overcome the low query efficiency and high memory cost of this approach, researchers adopt 2D neural renderers to improve inference speed and achieve high-resolution outputs [20;6;2]. [20] utilizes multiple radiance sub-fields to represent a scene and shows its potential to model scenes containing multiple objects on synthetic primitive datasets. More often than not, however, these methods are demonstrated on single-object scenes.
[4] finds that the capacity of a generative model conditioned on a global latent vector is limited, and instead proposes a grid of locally conditioned radiance sub-fields to model more complicated scenes. All these works learn category-specific models, requiring training on sufficient volumes of image data collected from many homogeneous scenes. In this work, we target general natural scenes, which typically possess intricate and distinctive characteristics; collecting the necessary volume of training data is therefore difficult, rendering such data-hungry learning setups intractable. Moreover, as mentioned above, our task necessitates localizing the training over local regions, a capability lacking in MLP-based representations, which leads us to adopt voxel grids in this work.
3 Method
SinGRAV learns a powerful generative model of neural radiance volumes from multi-view observations $X = \{x_1, \ldots, x_m\}$ of a single scene. In contrast to learning class-specific priors, our specific emphasis is to learn the internal distribution of the input scene. To this end, we resort to convolutional networks, which inherently possess a spatial locality bias, with limited receptive fields for learning over a variety of local regions within the input scene. The generative model is learned via adversarial generation and discrimination through 2D projections of the generated volumes. During training, the camera pose is randomly selected from the training set; we omit the camera pose notation for brevity. Moreover, we use a multi-scale framework to learn properties at different scales, ranging from global configurations to local fine texture details. Figure 2 presents an overview.
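The sketch below shows how such a setup could be wired at a single scale: a 3D convolutional generator produces a volume from noise, a differentiable renderer projects it under a randomly selected training pose, and a 2D discriminator with a limited receptive field provides the adversarial signal. All module names, attributes, the non-saturating GAN loss, and the optimizer settings are hypothetical placeholders for exposition, not details taken from SinGRAV.

```python
import torch
import torch.nn.functional as F

def train_single_scale(G, D, render, views, steps=2000, lr=1e-4):
    """Adversarial training at one scale of the multi-scale pyramid (a sketch).

    G:      hypothetical 3D conv generator mapping noise to a radiance volume
    D:      hypothetical 2D conv discriminator with a limited receptive field
    render: differentiable volume renderer (see Sec. 3.1)
    views:  list of (image, pose) pairs from the single input scene,
            with images shaped (1, 3, H, W)
    """
    opt_G = torch.optim.Adam(G.parameters(), lr=lr)
    opt_D = torch.optim.Adam(D.parameters(), lr=lr)
    for _ in range(steps):
        # Randomly select a camera pose (and its real view) from the training set.
        image, pose = views[torch.randint(len(views), (1,)).item()]

        # Generate a radiance volume from noise and render its 2D projection.
        z = torch.randn(1, G.noise_dim, *G.noise_resolution)  # hypothetical attributes
        fake = render(G(z), pose)

        # Discriminator update on real and rendered (fake) projections,
        # using a non-saturating logistic loss (an assumed choice).
        loss_D = F.softplus(-D(image)).mean() + F.softplus(D(fake.detach())).mean()
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

        # Generator update, back-propagating through the differentiable renderer.
        loss_G = F.softplus(-D(fake)).mean()
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```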
3.1 Neural radiance volume and rendering
The generated scene is represented by a discrete 3D voxel grid, and is to be produced by a 3D convolutional network. Each voxel center stores a 4-channel vector that contains a density scalar $\sigma$ and a color vector $\mathbf{c}$. Trilinear interpolation is used to define a continuous radiance field in the volume.
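As a minimal illustration of this representation, the sketch below queries density and color at continuous positions from such a 4-channel voxel grid via trilinear interpolation. This is not the paper's implementation: the function name and tensor layout are assumptions, PyTorch's grid_sample is used for the interpolation (its trilinear behavior on 5D inputs is selected with mode="bilinear"), and query points are assumed to be pre-normalized to the grid's $[-1, 1]^3$ extent.

```python
import torch
import torch.nn.functional as F

def query_radiance_volume(volume, points):
    """Trilinearly interpolate a discrete radiance volume at continuous points.

    volume: (1, 4, D, H, W) tensor; channel 0 is the density sigma,
            channels 1-3 are the color vector c.
    points: (N, 3) query positions, assumed pre-normalized to [-1, 1]^3
            in (x, y, z) order.
    Returns (sigma, rgb) with shapes (N,) and (N, 3).
    """
    # grid_sample expects sampling locations shaped (1, D_out, H_out, W_out, 3);
    # we fold the N queries into the first spatial dimension.
    grid = points.view(1, -1, 1, 1, 3)
    sampled = F.grid_sample(volume, grid, mode="bilinear", align_corners=True)
    sampled = sampled.view(4, -1).t()   # (N, 4) interpolated voxel features
    sigma = sampled[:, 0]               # interpolated density
    rgb = sampled[:, 1:]                # interpolated color
    return sigma, rgb
```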
We use the same differentiable function as in NeRF for volume-rendering generated volumes. The expected color $\hat{C}$ of a camera ray $\mathbf{r}$ is approximated by integrating over $M$ samples spread along the ray:
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{M} T_i \big(1 - \exp(-\sigma_i \delta_i)\big)\, \mathbf{c}_i, \quad \text{and} \quad T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big), \tag{1}$$
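To make the quadrature in Eq. (1) concrete, the following sketch alpha-composites $M$ samples along a batch of rays given their densities, colors, and sample depths. It is a hedged reference implementation, not code from the paper; the function name, tensor shapes, and the large padding distance for the final sample interval are illustrative assumptions.

```python
import torch

def composite_rays(sigma, rgb, t_vals):
    """Evaluate Eq. (1): alpha-composite M samples along each of R rays.

    sigma:  (R, M)    per-sample densities
    rgb:    (R, M, 3) per-sample colors
    t_vals: (R, M)    sample depths between the near and far bounds
    Returns the expected colors C_hat with shape (R, 3).
    """
    # delta_i = t_{i+1} - t_i; the last sample has no successor, so pad it
    # with a large distance (a common convention, assumed here).
    delta = t_vals[:, 1:] - t_vals[:, :-1]
    delta = torch.cat([delta, torch.full_like(delta[:, :1], 1e10)], dim=-1)

    # alpha_i = 1 - exp(-sigma_i * delta_i)
    alpha = 1.0 - torch.exp(-sigma * delta)

    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance accumulated over
    # the preceding samples (exclusive cumulative sum, so T_1 = 1).
    accum = torch.cumsum(sigma * delta, dim=-1)
    T = torch.exp(-torch.cat([torch.zeros_like(accum[:, :1]), accum[:, :-1]], dim=-1))

    # C_hat(r) = sum_i T_i * alpha_i * c_i
    weights = (T * alpha).unsqueeze(-1)   # (R, M, 1)
    return (weights * rgb).sum(dim=-2)    # (R, 3)
```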
In Eq. (1), the subscript denotes the sample index between the near bound $t_n$ and the far bound $t_f$, $\delta_i = t_{i+1} - t_i$ is the distance between two consecutive samples, and $T_i$ is the accumulated transmittance at sample $i$, obtained by integrating over the preceding samples along the ray. This color rendering is