X-NeRF: Explicit Neural Radiance Field for
Multi-Scene 360° Insufficient RGB-D Views
Haoyi Zhu, Hao-Shu Fang, Cewu Lu
Shanghai Jiao Tong University
Shanghai, China
{hyizhu1108, fhaoshu}@gmail.com, lucewu@sjtu.edu.cn
Abstract
Neural Radiance Fields (NeRFs), despite their outstand-
ing performance on novel view synthesis, often need dense
input views. Many approaches train a separate model for each scene, and few of them explore incorporating multi-modal data into this problem. In this paper, we focus on a rarely discussed but important setting: can we train one model that represents multiple scenes, given 360° insufficient views and RGB-D images? By insufficient views we mean a few extremely sparse and almost non-overlapping views. To handle this setting, we propose X-NeRF, a fully explicit approach that learns a general scene completion process instead of a coordinate-based mapping. Given a few insufficient RGB-D input views, X-NeRF first transforms them into a sparse point cloud tensor and then applies a 3D sparse generative Convolutional Neural Network (CNN) to complete it into an explicit radiance field whose volumetric rendering can be performed quickly, without running any network at inference time. To avoid overfitting, in addition to the common rendering loss, we apply a perceptual loss as well as view augmentation through random rotation of the point clouds. The proposed methodology significantly outperforms previous implicit methods in our setting, indicating the great potential of the proposed problem and approach. Code and data are available at https://github.com/HaoyiZhu/XNeRF.
1. Introduction
Neural Radiance Fields (NeRFs) [29] have attracted significant research interest recently. They typically encode scenes implicitly in coordinate-based multi-layer perceptrons (MLPs) and have a wide range of applications such as novel view synthesis [1, 7, 29, 44, 50, 51]. A lot of follow-up work improves and extends NeRF [29] in various ways, from convergence and rendering speed [7, 11, 23, 39] to dynamic scenes [10, 21, 46], etc.
etc. Some methods [39, 48, 49] utilize explicit structures to
Figure 1. An illustration of our problem setting. The center
shows an incomplete scene captured by a few low-cost RGB-D
cameras. The small square cones around the scene represent the
locations and directions of cameras and their corresponding RGB
images are shown in surrounding rectangles. Among them, the red
are seen views for training while the green one denotes the unseen
view for testing. The insufficient views are very sparse with less
than 10% to 20% overlapping with each other, making the problem
extremely hard.
gain huge performance improvement but they still directly
encode scenes in learnable network parameters.
Despite their exceptional performance in many scenarios, most NeRF-like methods need a large number of densely captured views for training, making them hard or expensive to apply in practice. Although some work [7, 50] has studied few-view training, the applicable scenarios usually require views with small perspective changes and large overlap. Moreover, owing to the implicit modeling, most methods train one model for only one scene, making it difficult to scale them to massive numbers of scenes. Finally, with the rapid development of hardware, depth data is increasingly available, but most current NeRF-related work takes only the RGB modality as input. How to utilize depth information for better rendering deserves more exploration.
To this end, in this paper we aim to propose a methodology which allows a single model to (i) deal with multiple scenes, (ii) handle insufficient views that are 360° around the scenes, and (iii) incorporate depth data for better rendering. Fig. 1 displays an example of our setting. To tackle this
hard setting, we propose an explicit neural radiance field
(X-NeRF), which can take RGB-D images as inputs. Dif-
ferent from other NeRF-like approaches that implicitly map coordinates to colors and densities, we explicitly model this problem as a completion task. The intuition comes from the observation that, given a few RGB-D input views, a large part of the scene is actually known. In other words, plenty of information is already available initially, so we only have to learn a general, scene-irrelevant completion relation. Since the network is designed to encode
a general completion mapping rather than a specific scene,
we can naturally deal with the multi-scene problem.
Specifically, the input RGB-D images are converted to sparse colored point clouds and quantized into sparse tensors, on which the Minkowski Engine [4] can operate directly. We adopt a 3D sparse generative CNN to construct and complete the explicit neural radiance fields. Our backbone applies a UNet-like [34] encoder-decoder structure with multi-stage generative transposed convolutions and pruning layers in the decoder. To avoid overfitting to the seen views, in addition to the common rendering loss, we also apply a perceptual loss with patch-wise sampling as well as view augmentation through random rotation of the point clouds. Volumetric rendering with post-activation is used. By shooting a ray from a pixel and querying the radiance field along it, the accumulated color and depth of that pixel can be rendered.
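To make the pipeline above concrete, the following is a minimal sketch of the point-cloud-to-sparse-tensor conversion and the sparse generative backbone, written against the public MinkowskiEngine API. The voxel size, channel widths, output feature dimension, and two-stage decoder depth are illustrative assumptions rather than the paper's exact configuration, and skip connections are omitted for brevity.

```python
# Assumption-laden sketch of the X-NeRF data path (not the authors' exact code):
# back-projected RGB-D points -> quantized sparse tensor -> sparse generative
# encoder-decoder that creates new voxels and prunes the empty ones.
import torch
import torch.nn as nn
import MinkowskiEngine as ME


def rgbd_to_sparse_tensor(points_xyz, colors_rgb, voxel_size=0.02):
    """Quantize a colored point cloud into a Minkowski sparse tensor.
    `voxel_size` is an assumed value, not taken from the paper."""
    coords, feats = ME.utils.sparse_quantize(
        coordinates=points_xyz, features=colors_rgb, quantization_size=voxel_size)
    coords = ME.utils.batched_coordinates([coords])       # prepend batch index
    return ME.SparseTensor(features=feats.float(), coordinates=coords)


class SparseGenerativeUNet(nn.Module):
    """Toy two-level encoder-decoder. The decoder uses generative transposed
    convolutions, which can create voxels that were empty in the input, followed
    by pruning layers that drop voxels classified as empty."""

    def __init__(self, in_ch=3, out_ch=28):  # out_ch: e.g. density + color features (assumed)
        super().__init__()
        self.enc1 = nn.Sequential(
            ME.MinkowskiConvolution(in_ch, 32, kernel_size=3, stride=2, dimension=3),
            ME.MinkowskiBatchNorm(32), ME.MinkowskiReLU())
        self.enc2 = nn.Sequential(
            ME.MinkowskiConvolution(32, 64, kernel_size=3, stride=2, dimension=3),
            ME.MinkowskiBatchNorm(64), ME.MinkowskiReLU())
        self.dec2 = ME.MinkowskiGenerativeConvolutionTranspose(
            64, 32, kernel_size=3, stride=2, dimension=3)
        self.cls2 = ME.MinkowskiConvolution(32, 1, kernel_size=1, dimension=3)
        self.dec1 = ME.MinkowskiGenerativeConvolutionTranspose(
            32, out_ch, kernel_size=3, stride=2, dimension=3)
        self.cls1 = ME.MinkowskiConvolution(out_ch, 1, kernel_size=1, dimension=3)
        self.prune = ME.MinkowskiPruning()

    def forward(self, x):
        e2 = self.enc2(self.enc1(x))
        d2 = self.dec2(e2)
        d2 = self.prune(d2, (self.cls2(d2).F > 0).squeeze(-1))   # drop voxels predicted empty
        d1 = self.dec1(d2)
        d1 = self.prune(d1, (self.cls1(d1).F > 0).squeeze(-1))
        return d1  # occupied voxels carrying per-voxel radiance-field features
```

Because only occupied voxels are stored and the completion network is shared across scenes, inference only needs to query this explicit grid along rays, which is consistent with the paper's claim that no network has to be run during inference.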
Extensive experiments demonstrate that the proposed
task is extremely challenging for existing methods while our
approach can handle it well. We first compare our approach
with DS-NeRF [7], an advanced NeRF-based work that
also supports depth supervision, and DVGO [39], a state-of-the-art NeRF-like method utilizing explicit structures, in single-scene experiments. For fairness, we add depth supervision to DVGO [39]. Then we compare X-NeRF with
some recent NeRF-related work that supports multi-scene
training such as pixelNeRF [50] and IBRNet [44] (depth
supervision is also added). The results clearly show that X-NeRF is robust to multi-scene 360° insufficient views and can produce reliable novel view predictions. Our method outperforms previous methods in this extreme setting, indicating that X-NeRF can be applied in practice in a low-cost manner, as we can train one model for many scenes while the inference process is quite lightweight.
2. Related Work
Novel view synthesis. Synthesizing a novel-view image from a set of images is a classic and long-standing task. Rendering methods can be broadly divided into image-based and model-based ones. Image-based methods [9, 15, 44] directly learn transformations at the image level, such as warping or interpolation, and are typically more computationally efficient. However, they need reference views during inference, and the number and density of the reference images may greatly influence the rendering quality. Model-based
methods [16, 18, 32, 35, 42] express scenes as high-dimensional representations and apply physically meaningful models, such as the optical model [27], to render novel-view images. There are various forms for representing scenes. Earlier works apply lumigraphs [2, 12] and light fields [5, 19, 20, 36] to directly interpolate between input images. Nevertheless, they need exceedingly dense inputs, which is unaffordable in many applications. Other methods utilize explicit
representations such as meshes [6, 40, 43, 45] to deal with
sparse inputs. However, mesh-based approaches cannot
work well with gradient-based optimization due to discon-
tinuities and local minima. Recently, many deep learning
based methods employ CNNs to construct multi-plane im-
ages (MPIs) [8, 22, 28, 38, 41, 53] for forward-facing cap-
tures. There are also approaches that encode scenes as vol-
umetric representations [16, 18, 28, 41, 42, 53], but they
often struggle with complex and large-scale scenes.
Neural Radiance Fields. NeRFs have aroused great interest and achieved huge success in the novel view synthesis task in recent years. A classic NeRF [29] learns a direct mapping from coordinates to scene properties such as color and density, implicitly encoding a scene in MLPs. Since its proposal, NeRF [29] has been extended to many variants with different characteristics, including editable representations [17, 24, 47], fast inference and/or training [7, 11, 23, 39], deformable scenes [30, 33], unconstrained images [3, 26],
etc. Some recent work [39, 48, 49] introduces explicit struc-
tures to gain great performance enhancement, which indi-
cates that the implicit MLP architecture is not necessar-
ily the key to success. Nevertheless, despite their explicit voxel-grid structures, these methods are still essentially implicit models, as they still encode the scene information in learnable parameters. The implicit modeling makes it hard for NeRF-based methods to generalize freely to multi-scene cases. Though some work such as [50] claims the ability to handle the multi-scene task, in practice it can only process multiple small objects or multiple similar simulated scenes. Moreover, in the extreme situation proposed in this paper, where the input views are insufficient, i.e., extremely sparse yet 360° around real scenes with almost no overlap (often less than 10% to 20%), the implicit modeling easily overfits to a trivial solution because it places few constraints on the scene structure. Some approaches such as [7, 50] can deal with few inputs, but their applicable scenario is mostly forward-facing captures, which are still far less sparse than our setting.
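For reference, the coordinate-based mapping described above and the volumetric rendering quadrature that produces the accumulated color and depth mentioned in Sec. 1 can be written as follows. This is the standard NeRF formulation [29]; in the post-activation variant used by DVGO [39] and adopted here, the density activation is applied after interpolating raw grid values rather than before.

```latex
% Radiance field: maps a 3D point x and a view direction d to a color c and a density sigma.
F_{\Theta} : (\mathbf{x}, \mathbf{d}) \;\mapsto\; (\mathbf{c}, \sigma)

% Along a ray r(t) = o + t d, sampled at depths t_1 < \dots < t_N with intervals
% \delta_i = t_{i+1} - t_i:
\alpha_i = 1 - \exp(-\sigma_i \, \delta_i), \qquad
T_i = \prod_{j < i} (1 - \alpha_j)

% Accumulated (rendered) color and depth of the pixel:
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \, \alpha_i \, \mathbf{c}_i, \qquad
\hat{D}(\mathbf{r}) = \sum_{i=1}^{N} T_i \, \alpha_i \, t_i
```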
Multi-modal RGB-D data. Nowadays, with the rapid development of hardware devices, the depth modality is becoming increasingly available.