functions, the geometry and appearance of the underlying scene can be optimized based on the error
derived from the downstream supervision signals.
[3;21;33;28;16] adopt neural implicit fields to represent 3D shapes and attain highly detailed geometries. On the other hand, [27;29;31] work on discrete grids, UV maps, and point clouds, respectively, with attached learnable neural features that can produce pleasing novel view imagery. More recently, the Neural Radiance Field (NeRF) technique [17] has revolutionized several research fields with a trained MLP-based radiance and opacity field, achieving
unprecedented success in producing photo-realistic imagery. Since then, an explosion of follow-up techniques has improved NeRF in various aspects [12;22;35;11;32;15;10;30;13]. A research direction drawing increasing interest, which we discuss in the following, is to incorporate such neural representations to learn 3D generative models with photo-realistic viewing effects.
Generative neural scene generation.
With great success achieved in 2D image generation tasks, researchers have taken several routes to realize 3D-aware generative models. At the heart of these methods is a 3D neural scene representation that, paired with volume rendering, makes the whole pipeline differentiable with respect to supervision imposed in the image domain. [23;1] integrate a neural radiance field into generative models and directly produce the final images via volume rendering, yielding a 3D field with strong view consistency under varying views. To overcome the low query efficiency and high memory cost of this approach, researchers adopt 2D neural renderers to improve inference speed and achieve high-resolution outputs [20;6;2]. [20] utilizes multiple radiance sub-fields to represent a scene and shows its potential to model scenes containing multiple objects on synthetic primitive datasets. More often than not, however, these methods are demonstrated on single-object scenes.
[4] finds that the capacity of a generative model conditioned on a global latent vector is limited, and instead proposes a grid of locally conditioned radiance sub-fields to model more complicated scenes. All these works learn category-specific models, requiring training on sufficient volumes of image data collected from many homogeneous scenes. In this work, we target general natural scenes, which typically possess intricate and distinctive characteristics; collecting the necessary volume of training data is therefore difficult, rendering such data-hungry learning setups intractable. Moreover, as mentioned above, our task necessitates localizing the training over local regions, a capability lacking in MLP-based representations, which leads us to adopt voxel grids in this work.
3 Method
SinGRAV learns a powerful generative model of neural radiance volumes from multi-view observations $X = \{x_1, \ldots, x_m\}$ of a single scene. In contrast to learning class-specific priors, our specific emphasis is to learn the internal distribution of the input scene. To this end, we resort to convolutional networks, which inherently possess a spatial locality bias, with limited receptive fields for learning over a variety of local regions within the input scene. The generative model is learned via adversarial generation and discrimination through 2D projections of the generated volumes. During training, the camera pose is randomly selected from the training set; we omit the camera pose notation for brevity. Moreover, we use a multi-scale framework to learn properties at different scales, ranging from global configurations to local fine texture details. Figure 2 presents an overview.
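The sketch below shows how such a setup could be wired at a single scale: a 3D convolutional generator produces a volume from noise, a differentiable renderer projects it under a randomly selected training pose, and a 2D discriminator with a limited receptive field provides the adversarial signal. All module names, attributes, the non-saturating GAN loss, and the optimizer settings are hypothetical placeholders for exposition, not details taken from SinGRAV.

```python
import torch
import torch.nn.functional as F

def train_single_scale(G, D, render, views, steps=2000, lr=1e-4):
    """Adversarial training at one scale of the multi-scale pyramid (a sketch).

    G:      hypothetical 3D conv generator mapping noise to a radiance volume
    D:      hypothetical 2D conv discriminator with a limited receptive field
    render: differentiable volume renderer (see Sec. 3.1)
    views:  list of (image, pose) pairs from the single input scene,
            with images shaped (1, 3, H, W)
    """
    opt_G = torch.optim.Adam(G.parameters(), lr=lr)
    opt_D = torch.optim.Adam(D.parameters(), lr=lr)
    for _ in range(steps):
        # Randomly select a camera pose (and its real view) from the training set.
        image, pose = views[torch.randint(len(views), (1,)).item()]

        # Generate a radiance volume from noise and render its 2D projection.
        z = torch.randn(1, G.noise_dim, *G.noise_resolution)  # hypothetical attributes
        fake = render(G(z), pose)

        # Discriminator update on real and rendered (fake) projections,
        # using a non-saturating logistic loss (an assumed choice).
        loss_D = F.softplus(-D(image)).mean() + F.softplus(D(fake.detach())).mean()
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

        # Generator update, back-propagating through the differentiable renderer.
        loss_G = F.softplus(-D(fake)).mean()
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```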
3.1 Neural radiance volume and rendering
The generated scene is represented by a discrete 3D voxel grid, and is to be produced by a 3D convolutional network. Each voxel center stores a 4-channel vector that contains a density scalar $\sigma$ and a color vector $\mathbf{c}$. Trilinear interpolation is used to define a continuous radiance field in the volume.
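As a minimal illustration of this representation, the sketch below queries density and color at continuous positions from such a 4-channel voxel grid via trilinear interpolation. This is not the paper's implementation: the function name and tensor layout are assumptions, PyTorch's grid_sample is used for the interpolation (its trilinear behavior on 5D inputs is selected with mode="bilinear"), and query points are assumed to be pre-normalized to the grid's $[-1, 1]^3$ extent.

```python
import torch
import torch.nn.functional as F

def query_radiance_volume(volume, points):
    """Trilinearly interpolate a discrete radiance volume at continuous points.

    volume: (1, 4, D, H, W) tensor; channel 0 is the density sigma,
            channels 1-3 are the color vector c.
    points: (N, 3) query positions, assumed pre-normalized to [-1, 1]^3
            in (x, y, z) order.
    Returns (sigma, rgb) with shapes (N,) and (N, 3).
    """
    # grid_sample expects sampling locations shaped (1, D_out, H_out, W_out, 3);
    # we fold the N queries into the first spatial dimension.
    grid = points.view(1, -1, 1, 1, 3)
    sampled = F.grid_sample(volume, grid, mode="bilinear", align_corners=True)
    sampled = sampled.view(4, -1).t()   # (N, 4) interpolated voxel features
    sigma = sampled[:, 0]               # interpolated density
    rgb = sampled[:, 1:]                # interpolated color
    return sigma, rgb
```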
We use the same differentiable function as in NeRF for volume-rendering generated volumes. The expected color $\hat{C}$ of a camera ray $\mathbf{r}$ is approximated by integrating over $M$ samples spread along the ray:
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{M} T_i \big(1 - \exp(-\sigma_i \delta_i)\big)\, \mathbf{c}_i, \quad \text{and} \quad T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big), \tag{1}$$
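To make the quadrature in Eq. (1) concrete, the following sketch alpha-composites $M$ samples along a batch of rays given their densities, colors, and sample depths. It is a hedged reference implementation, not code from the paper; the function name, tensor shapes, and the large padding distance for the final sample interval are illustrative assumptions.

```python
import torch

def composite_rays(sigma, rgb, t_vals):
    """Evaluate Eq. (1): alpha-composite M samples along each of R rays.

    sigma:  (R, M)    per-sample densities
    rgb:    (R, M, 3) per-sample colors
    t_vals: (R, M)    sample depths between the near and far bounds
    Returns the expected colors C_hat with shape (R, 3).
    """
    # delta_i = t_{i+1} - t_i; the last sample has no successor, so pad it
    # with a large distance (a common convention, assumed here).
    delta = t_vals[:, 1:] - t_vals[:, :-1]
    delta = torch.cat([delta, torch.full_like(delta[:, :1], 1e10)], dim=-1)

    # alpha_i = 1 - exp(-sigma_i * delta_i)
    alpha = 1.0 - torch.exp(-sigma * delta)

    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance accumulated over
    # the preceding samples (exclusive cumulative sum, so T_1 = 1).
    accum = torch.cumsum(sigma * delta, dim=-1)
    T = torch.exp(-torch.cat([torch.zeros_like(accum[:, :1]), accum[:, :-1]], dim=-1))

    # C_hat(r) = sum_i T_i * alpha_i * c_i
    weights = (T * alpha).unsqueeze(-1)   # (R, M, 1)
    return (weights * rgb).sum(dim=-2)    # (R, 3)
```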
In Eq. (1), the subscript denotes the sample index between the near bound $t_n$ and the far bound $t_f$, $\delta_i = t_{i+1} - t_i$ is the distance between two consecutive samples, and $T_i$ is the accumulated transmittance at sample $i$, obtained by integrating over the preceding samples along the ray. This color rendering is