SinGRAV: Learning a Generative Radiance Volume
from a Single Natural Scene
Yujie Wang1,3   Xuelin Chen2   Baoquan Chen3
1Shandong University   2Tencent AI Lab   3Peking University
yujiew.cn@gmail.com   xuelin.chen.3d@gmail.com   baoquan.chen@gmail.com
Abstract
We present a 3D generative model for general natural scenes. Lacking necessary
volumes of 3D data characterizing the target scene, we propose to learn from
a single scene. Our key insight is that a natural scene often contains multiple
constituents whose geometry, texture, and spatial arrangements follow some clear
patterns, but still exhibit rich variations over different regions within the same
scene. This suggests localizing the learning of a generative model on substantial
local regions. Hence, we exploit a multi-scale convolutional network, which
inherently possesses a spatial locality bias, to learn from the statistics of local
regions at multiple scales within a single scene. In contrast to existing methods,
our learning setup bypasses the need to collect data from many homogeneous 3D
scenes for learning common features. We coin our method SinGRAV, for learning
a Generative RAdiance Volume from a Single natural scene. We demonstrate the
ability of SinGRAV in generating plausible and diverse variations from a single
scene, the merits of SinGRAV over state-of-the-art generative neural scene methods,
as well as the versatility of SinGRAV by its use in a variety of applications, spanning
3D scene editing, composition, and animation. Code and data will be released to
facilitate further research.
1 Introduction
Recently, 3D generative modeling has made great strides by gravitating towards neural scene representations,
which boast unprecedented photo-realism. Generative neural scene models [23;20;2;6;1]
can now draw class-specific realistic scenes (e.g., cars and portraits), offering a glimpse into the
boundless virtual universe. Yet an obvious question is how we can go beyond class-specific
scenes and achieve similar success with general natural scenes, creating at scale diverse scenes of
more sorts. This work presents, to our knowledge, the first endeavor towards answering this question.
While neural scene generation has boosted the field via learning from images with differentiable
projections, there remains a strong dependence on datasets containing images of many homogeneous
scenes. Collecting homogeneous data for each scene type ad hoc is cumbersome, and would become
prohibitive when the scene type of interest varies dynamically. Fortunately, a parallel can be drawn
from nature, which implicitly maintains a rich variety of natural scenes. In nature, no scene ever
stays absolutely unaltered: imagine how, as the stars shift, a river changes its course and little pebbles
get scattered and weathered in various ways. Analogously, in scene creation, new scenes can be created by
"varying" an existing one, yet still resembling the original (see Figure 1).
Following this spirit, we present SinGRAV, for a generative radiance volume learned to synthesize
variations from a single general natural scene. Building upon recent advances in 3D generative
modeling, SinGRAV also learns from visual observations of the target scene via differentiable volume
rendering, but the training supervision, scene representation, and network architecture all have
to be altered fundamentally to meet the unique challenges arising from learning with a single scene.
Most notably, the supervision comes from merely a single scene, which manifestly falls outside the
reach of state-of-the-art models that learn common features across many homogeneous scenes.

Figure 1: Top two rows: From observations of a single natural scene, we learn a generative model to
synthesize highly plausible variations. Three views rendered from each of the input/generated scenes
are presented. Note how the global and object configurations vary in the generated samples, yet still
resemble the original. Bottom: Applications enabled by SinGRAV, including removal (left) and
duplication (middle) operations for editing a 3D scene sample, and scene composition (right) that
combines 5 different generated samples to form a novel complex scene.
Natural scenes often contain multiple constituents whose geometry, appearance, and spatial arrangements
follow some clear patterns, but still exhibit rich variations over different regions within the
same scene. This naturally inspires one to learn from the statistics of internal regions by localizing
the training. As our supervision comes from 2D images, learning internal distributions consequently
grounds SinGRAV on the assumption that multi-view observations share a consistent internal distribution
for learning, which can be simply realized by uniformly placing the training cameras around
the scene to obtain homogeneous images. On the other hand, MLP-based representations tend
to synthesize holistically and model global patterns better than local ones [5], and thus
favor class-specific distributions. Hence, we resort to convolutional operations, which
generate discrete radiance volumes from noise volumes with limited receptive fields, for learning
local properties over a confined spatial extent, granting better out-of-distribution generation in terms
of the global configuration. Moreover, we adopt a multi-scale architecture containing a pyramid of
convolutional GANs to capture the internal distribution at various scales, alleviating the notorious
mode-collapse issue. This framework is similar in spirit to [24;25]; however, important designs must
be incorporated to efficiently and effectively improve the plausibility of the spatial arrangement and
the appearance realism of the generated 3D scene.
We demonstrate that SinGRAV enables us to easily generate plausible variations of the input scene in large
quantities and varieties, with novel geometry, textures, and configurations. Plausibility is evaluated
via perceptual studies, and comparisons are made to state-of-the-art generative models.
The importance of various network design choices is validated. Finally, we show the versatility of
SinGRAV through its use in a series of applications, spanning 3D scene editing, retargeting, and animation.
2 Related work
Neural scene representation and rendering.
In recent years, neural scene representations have become the de facto infrastructure in several tasks,
including representing shapes [3;21;16;28;14], novel view synthesis [18;17;31], and 3D generative
modeling [19;23;1;20;6;4;2]. Paired with differentiable projection functions, the geometry and
appearance of the underlying scene can be optimized based on the error derived from the downstream
supervision signals. [3;21;33;28;16] adopt neural implicit fields to represent 3D shapes and attain
highly detailed geometries. On the other hand, [27;29;31] work on discrete grids, UV maps, and point
clouds, respectively, with attached learnable neural features that can produce pleasing novel-view
imagery. More recently, the Neural Radiance Field (NeRF) technique [17] has revolutionized several
research fields with a trained MLP-based radiance and opacity field, achieving unprecedented success
in producing photo-realistic imagery. An explosion of NeRF techniques has since occurred in the
research community, improving NeRF in various aspects of the problem [12;22;35;11;32;15;10;30;13].
A research direction drawing increasing interest, which we discuss in the following, is to incorporate
such neural representations to learn a 3D generative model possessing photo-realistic viewing effects.
Generative neural scene generation.
With great success achieved in 2D image generation tasks, research has taken several routes to realize
3D-aware generative models. At the heart of these methods is a 3D neural scene representation, which,
paired with volume rendering, makes the whole pipeline differentiable with respect to the supervision
imposed in the image domain. [23;1] integrate a neural radiance field into generative models and
directly produce the final images via volume rendering, yielding a 3D field with great view-consistency
under varying views. To overcome the low query efficiency and high memory cost, researchers have
proposed to adopt 2D neural renderers to improve inference speed and achieve high-resolution
outputs [20;6;2]. [20] utilizes multiple radiance sub-fields to represent a scene, and shows its potential
to model scenes containing multiple objects on synthetic primitive datasets. More often than not,
these methods are demonstrated on single-object scenes. [4] found that the capacity of a generative
model conditioned on a global latent vector is limited, and instead propose to use a grid of locally
conditioned radiance sub-fields to model more complicated scenes. All these works learn
category-specific models, requiring training on sufficient volumes of image data collected from many
homogeneous scenes. In this work, we target general natural scenes, which generally possess intricate
and exclusive characteristics, making it difficult to collect the necessary volumes of training data and
rendering these data-consuming learning setups intractable. Moreover, as aforementioned, our task
necessitates localizing the training over local regions, which is lacking in MLP-based representations,
leading us to the use of voxel grids in this work.
3 Method
SinGRAV learns a powerful generative model for generating neural radiance volumes from multi-view
observations $X = \{x_1, \ldots, x_m\}$ of a single scene. In contrast to learning class-specific priors, our
specific emphasis is to learn the internal distribution of the input scene. To this end, we resort to
convolutional networks, which inherently possess a spatial locality bias, with limited receptive fields
for learning over a variety of local regions within the input scene. The generative model is learned via
adversarial generation and discrimination through 2D projections of the generated volumes. During
training, the camera pose is randomly selected from the training set; we omit the camera pose notation
for brevity. Moreover, we use a multi-scale framework to learn properties at different scales, ranging
from global configurations to local fine texture details. Figure 2 presents an overview.
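As a hypothetical illustration of the uniform camera placement assumed above, the sketch below samples camera-to-world poses on an upper hemisphere looking at the scene center; the radius, camera count, and axis conventions are assumptions rather than the paper's actual capture setup.

```python
# Hypothetical uniform camera placement around the scene (not the paper's exact setup).
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Camera-to-world pose for a camera at `eye` looking at `target` (OpenGL-style axes)."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)                 # degenerate at the zenith; ignored in this sketch
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, :3] = np.stack([right, true_up, -forward], axis=1)
    pose[:3, 3] = eye
    return pose

def hemisphere_cameras(m=100, radius=4.0, seed=0):
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(m, 3))                # uniform directions via normalized Gaussians
    dirs[:, 2] = np.abs(dirs[:, 2])               # keep cameras on the upper hemisphere
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    return [look_at(radius * d) for d in dirs]

poses = hemisphere_cameras()   # m camera-to-world matrices; one is drawn per training step
```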
3.1 Neural radiance volume and rendering

The generated scene is represented by a discrete 3D voxel grid, to be produced by a 3D convolutional
network. Each voxel center stores a 4-channel vector that contains a density scalar $\sigma$ and a color
vector $c$. Trilinear interpolation is used to define a continuous radiance field within the volume.
We use the same differentiable function as in NeRF for volume-rendering generated volumes. The
expected color $\hat{C}$ of a camera ray $r$ is approximated by integrating over $M$ samples spread along
the ray:

$$\hat{C}(r) = \sum_{i=1}^{M} T_i \big(1 - \exp(-\sigma_i \delta_i)\big)\, c_i, \quad \text{and} \quad T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big), \qquad (1)$$

where the subscript denotes the sample index between the near bound $t_n$ and the far bound $t_f$,
$\delta_i = t_{i+1} - t_i$ is the distance between two consecutive samples, and $T_i$ is the accumulated
transmittance at sample $i$, obtained by integrating over the preceding samples along the ray. This
color rendering reduces to traditional alpha compositing with alpha values $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$.
This function is differentiable and enables updating the volume based on the error derived from
supervision signals.
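To make Eq. (1) concrete, below is a minimal NumPy sketch of the per-ray compositing. The density and color samples, the number of samples $M$, and the near/far bounds are placeholders; in SinGRAV they would come from trilinearly interpolating the generated voxel grid at sample positions along each camera ray.

```python
# Minimal NumPy sketch of Eq. (1); not the authors' implementation.
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Approximate the expected color of one ray from M density/color samples.

    sigmas: (M,) non-negative densities at the samples
    colors: (M, 3) RGB colors at the samples
    t_vals: (M + 1,) sample depths between the near and far bounds
    """
    deltas = t_vals[1:] - t_vals[:-1]                  # delta_i = t_{i+1} - t_i
    alphas = 1.0 - np.exp(-sigmas * deltas)            # alpha_i = 1 - exp(-sigma_i * delta_i)
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance accumulated before sample i
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))
    weights = trans * alphas                           # per-sample contribution
    return (weights[:, None] * colors).sum(axis=0)     # expected color C_hat(r)

# Toy usage with placeholder samples along a ray between t_n = 2 and t_f = 6.
M = 64
t_vals = np.linspace(2.0, 6.0, M + 1)
sigmas = np.random.rand(M) * 5.0
colors = np.random.rand(M, 3)
print(render_ray(sigmas, colors, t_vals))
```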
Figure 2: SinGRAV training setup. A series of convolutional generators are trained to generate a scene
in a coarse-to-fine manner. At each scale, $G_n$ learns to form a volume by generating realistic 3D
overlapping patches that collectively contribute to volume-rendered imagery indistinguishable, to the
discriminator $D_n$, from the observation images of the input scene. At the finest scale, the generator
$G_N$ operates purely in the 2D domain to super-resolve the imagery produced at the previous scale
(top), significantly reducing the computation overhead.
3.2 Hybrid multi-scale architecture

We use the multi-scale framework widely followed in image generation. In addition, SinGRAV adopts
a hybrid framework, which contains a series of 3D convolutional generators $\{G_n\}_{n=1}^{N-1}$ and a
lightweight 2D convolutional generator $G_N$ (see Figure 2). Specifically, the 3D generators
$\{G_n\}_{n=1}^{N-1}$ sequentially generate a volume with increasing resolution at the $N-1$ coarser scales,
with the volume resolution increased by a factor of $\theta$ between two consecutive scales and the
rendering resolution increased by a factor of $\mu_r$. At the $N$-th scale, to address the overly high
computation cost, we use the lightweight 2D generator $G_N$ to super-resolve the imagery from the
preceding scale by a factor of $\mu_s$, achieving higher-resolution outputs. Essentially, each generator
$G_n$ learns to generate realistic outputs to fool an associated 2D discriminator $D_n^{2D}$, which is
designed to distinguish the generated renderings from real ones. Importantly, the generators and
discriminators are designed to have limited receptive fields, preventing them from over-fitting to the
whole scene.

Starting from the coarsest volume, the generation sequentially passes through all generators and
increases the resolution up to the finest scale, with noise injected at each scale. At the coarsest scale,
the volume generation is purely generative, i.e., $G_1$ maps a Gaussian noise volume $z_1$ to a radiance
volume.
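As a rough illustration of this coarse-to-fine pipeline, the sketch below (not the authors' implementation) chains a few small 3D convolutional generators with trilinear upsampling and per-scale noise injection. The resolutions, channel counts, the residual refinement at finer scales, and the value of the scaling factor $\theta$ are illustrative assumptions, and the final 2D super-resolution generator $G_N$ is omitted.

```python
# Hypothetical sketch of the coarse-to-fine volume generation described above.
import torch
import torch.nn.functional as F
from torch import nn

class VolumeGenerator(nn.Module):
    """Shallow 3D conv generator; its small kernel stack keeps the receptive field limited."""
    def __init__(self, in_ch, hidden=32, out_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, hidden, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(hidden, hidden, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(hidden, out_ch, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

theta = 4 / 3      # assumed per-scale volume resolution factor
num_scales = 4     # assumed number of 3D scales (the N - 1 coarser scales)
base_res = 16      # assumed coarsest volume resolution

# G1 consumes pure noise (4 channels); finer generators consume the upsampled
# volume concatenated with injected noise (4 + 4 channels).
generators = [VolumeGenerator(in_ch=4)] + \
             [VolumeGenerator(in_ch=8) for _ in range(num_scales - 1)]

def coarse_to_fine(generators):
    res = base_res
    z1 = torch.randn(1, 4, res, res, res)              # Gaussian noise volume z_1
    volume = generators[0](z1)                          # purely generative coarsest scale
    for G in generators[1:]:
        res = int(round(res * theta))
        up = F.interpolate(volume, size=(res, res, res),
                           mode='trilinear', align_corners=True)
        z = torch.randn_like(up)                        # noise injected at this scale
        volume = up + G(torch.cat([up, z], dim=1))      # assumed residual refinement
    return volume                                       # (1, 4, R, R, R): density + RGB

volume = coarse_to_fine(generators)
```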
Figure 3: The spatial anchors provided by $e_{csg}$.
We observe that, compared to existing 3D category-level models, learning the internal distribution
with spatially invariant, receptive-field-limited convolutional networks leads to more difficulty in
producing plausible 3D structures. Following [34], which alleviates a similar issue in image generation
by introducing spatial inductive bias, we introduce spatial inductive bias into our pipeline by lifting
the normalized Cartesian Spatial Grid (CSG) to 3D:

$$e_{csg}(x, y, z) = 2 \cdot \Big[\frac{x}{W} - \frac{1}{2}, \; \frac{y}{H} - \frac{1}{2}, \; \frac{z}{U} - \frac{1}{2}\Big], \qquad (2)$$

where $W$, $H$, and $U$ are the sizes of the volume along the $x$, $y$, and $z$ axes. As illustrated in
Figure 3, the grid is equipped with distinct spatial anchors, empowering the model with better spatial
localization. The spatial anchors provided by $e_{csg}$ are injected into the noise volume $z_1$ at the
coarsest level: $\tilde{V}_1 = G_1(z_1, e_{csg})$. Note that we only inject the spatial inductive bias at the
coarsest scale, as the positional-encoded information will be propagated through subsequent scales.
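Below is a small sketch of how the spatial anchors of Eq. (2) could be built and injected at the coarsest scale. The tensor layout (1, 3, W, H, U), the exact index normalization, and the concatenation with $z_1$ are assumptions chosen to match a Conv3d input.

```python
# Hypothetical construction of the 3D Cartesian Spatial Grid e_csg of Eq. (2).
import torch

def make_csg_grid(W, H, U):
    # Voxel indices normalized to roughly [-1, 1] per axis, as in Eq. (2);
    # the exact endpoint convention is a minor assumption.
    xs = 2.0 * (torch.arange(W, dtype=torch.float32) / W - 0.5)
    ys = 2.0 * (torch.arange(H, dtype=torch.float32) / H - 0.5)
    zs = 2.0 * (torch.arange(U, dtype=torch.float32) / U - 0.5)
    gx, gy, gz = torch.meshgrid(xs, ys, zs, indexing='ij')   # each (W, H, U)
    return torch.stack([gx, gy, gz], dim=0).unsqueeze(0)     # (1, 3, W, H, U)

# Assumed injection at the coarsest scale: concatenate the anchors with the
# noise volume z1 along the channel dimension before feeding G1.
# V1_tilde = G1(torch.cat([z1, make_csg_grid(W, H, U)], dim=1))
```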