scenes, (ii) with insufficient views that are 360° around the scenes, and (iii) incorporating depth data for better rendering. Fig. 1 displays an example of our setting. To tackle this
hard setting, we propose an explicit neural radiance field
(X-NeRF), which can take RGB-D images as inputs. Dif-
ferent from other NeRF-like approaches implicitly mapping
coordinates to colors and densities, we explicitly model this problem as a completion task. The intuition comes from the observation that, given a few RGB-D input images, a large part of the scene is already known. In other words, plenty of information is available from the start, so we only need to learn a general, scene-agnostic completion mapping. Since the network is designed to encode
a general completion mapping rather than a specific scene,
we can naturally deal with the multi-scene problem.
Specifically, the input RGB-D images are converted to sparse colored point clouds and quantized into sparse tensors on which the Minkowski Engine [4] can operate directly. We adopt a 3D sparse generative CNN to construct and complete the explicit neural radiance fields. Our backbone uses a UNet-like [34] encoder-decoder structure whose decoder contains multi-stage generative transposed convolution and pruning layers. To avoid overfitting to the seen views, besides the common rendering loss, we also apply a perceptual loss with patch-wise sampling, as well as view augmentation through random rotations of the point clouds. Volumetric rendering with post-activation is used: by shooting a ray from a pixel and querying the field along it, the accumulated color and depth of that pixel can be rendered.
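To make the rendering step concrete, the following is a minimal PyTorch sketch of volumetric compositing with post-activation along a batch of rays. The function name, tensor layout, and the specific activations (softplus for density, sigmoid for color) are illustrative assumptions rather than the actual X-NeRF implementation; the accumulation of per-pixel color and depth follows the standard alpha-compositing formulation.

```python
import torch
import torch.nn.functional as F


def composite_ray(raw_sigma, raw_rgb, t_vals):
    """Accumulate color and depth along each ray via alpha compositing.

    raw_sigma: (N_rays, N_samples)    pre-activation densities queried from the field
    raw_rgb:   (N_rays, N_samples, 3) pre-activation colors queried from the field
    t_vals:    (N_rays, N_samples)    depths of the samples along each ray
    """
    # Post-activation: non-linearities are applied after querying the raw field values.
    sigma = F.softplus(raw_sigma)       # non-negative volume density
    rgb = torch.sigmoid(raw_rgb)        # colors constrained to [0, 1]

    # Distances between consecutive samples; the last interval is padded with a large value.
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)

    # Per-sample opacity and transmittance accumulated from the camera.
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
    )[:, :-1]
    weights = alpha * trans             # contribution of each sample to the pixel

    color = (weights.unsqueeze(-1) * rgb).sum(dim=1)   # (N_rays, 3) rendered pixel colors
    depth = (weights * t_vals).sum(dim=1)               # (N_rays,)   rendered pixel depths
    return color, depth
```

The rendering loss and the patch-wise perceptual loss can then be computed on these per-pixel colors, and the rendered depths can be compared against the input depth observations.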
Extensive experiments demonstrate that the proposed task is extremely challenging for existing methods, while our approach handles it well. In single-scene experiments, we first compare our approach with DS-NeRF [7], an advanced NeRF-based method that also supports depth supervision, and DVGO [39], a state-of-the-art NeRF-like method utilizing explicit structures; for a fair comparison, we add depth supervision to DVGO [39]. We then compare X-NeRF with recent NeRF-related works that support multi-scene training, such as pixelNeRF [50] and IBRNet [44] (depth supervision is also added). The results clearly show that X-NeRF is robust to insufficient 360° views across multiple scenes and produces reliable novel view predictions. Our method outperforms previous methods in this extreme setting, indicating that X-NeRF can be applied in practice at low cost: one model can be trained for many scenes, and inference is lightweight.
2. Related Work
Novel view synthesis. Synthesizing a novel view image from a given set of images is a classic and long-standing task.
Rendering methods can be broadly divided into image-based and model-based approaches. Image-based methods [9, 15, 44] directly learn image-level transformations such as warping or interpolation and are typically more computationally efficient. However, they need reference views during inference, and the number and density of the reference images can greatly influence rendering quality. Model-based methods [16, 18, 32, 35, 42] express scenes as high-dimensional representations and apply physically meaningful models, such as an optical model [27], to render novel view images. Scenes can be represented in various forms. Earlier works apply lumigraphs [2, 12] and light fields [5, 19, 20, 36] to interpolate directly between input images. Nevertheless, they need exceedingly dense inputs, which is unaffordable in many applications. Other methods utilize explicit representations such as meshes [6, 40, 43, 45] to deal with sparse inputs. However, mesh-based approaches do not work well with gradient-based optimization due to discontinuities and local minima. Recently, many deep-learning-based methods employ CNNs to construct multi-plane images (MPIs) [8, 22, 28, 38, 41, 53] for forward-facing captures. There are also approaches that encode scenes as volumetric representations [16, 18, 28, 41, 42, 53], but they often struggle with complex and large-scale scenes.
Neural Radiance Fields. NeRFs have attracted great interest and achieved remarkable success in novel view synthesis in recent years. The classic NeRF [29] learns a direct mapping from coordinates to attributes such as color and density, implicitly encoding a scene in MLPs. Since its proposal, NeRF [29] has been extended into many variants with different characteristics, including editable [17, 24, 47], fast inference and/or training [7, 11, 23, 39], deformable [30, 33], and unconstrained images [3, 26], etc. Some recent works [39, 48, 49] introduce explicit structures to gain substantial performance improvements, which indicates that the implicit MLP architecture is not necessarily the key to success. Nevertheless, despite their explicit voxel-grid structures, these methods are still essentially implicit models, as they still encode the scene information in learnable parameters. This implicit modeling makes it hard for NeRF-based methods to generalize freely across multiple scenes. Although some works such as [50] claim to handle the multi-scene task, in practice they can only process multiple small objects or multiple similar simulated scenes. Moreover, in the extreme situation proposed in this paper, where the input views are insufficient, i.e., extremely sparse yet spread 360° around real scenes with almost no overlap (often less than 10% to 20%), implicit modeling easily overfits to a trivial solution because it imposes few constraints on the scene structure. Some approaches, such as [7, 50], can deal with few inputs, but their applicable scenario is mostly forward-facing captures, which are still not sparse enough.
Multi-modal RGB-D data. Nowadays, with the rapid development of hardware devices, the depth modality is becoming