Feature-Realistic Neural Fusion for Real-Time, Open Set Scene
Understanding
Kirill Mazur, Edgar Sucar and Andrew J. Davison
Abstract—General scene understanding for robotics requires
flexible semantic representation, so that novel objects and
structures which may not have been known at training time can
be identified, segmented and grouped. We present an algorithm
which fuses general learned features from a standard pre-
trained network into a highly efficient 3D geometric neural
field representation during real-time SLAM. The fused 3D
feature maps inherit the coherence of the neural field’s geom-
etry representation. This means that tiny amounts of human
labelling interacting at runtime enable objects or even parts
of objects to be robustly and accurately segmented in an open
set manner. Project page: https://makezur.github.io/FeatureRealisticFusion/
I. INTRODUCTION
Robots which aim towards general, long-term capabilities
in complex environments such as homes must use vision and
other sensors to build scene representations which are both
geometric and semantic. Ideally these representations should
be general purpose, enabling many types of task reasoning,
while also efficient to build, update and store.
Semantic segmentation outputs from powerful single-
frame neural networks can be fused into dense 3D scene
reconstructions to create semantic maps. Systems such as
SemanticFusion [1] have shown that this can be achieved in
real-time to be useful for robotics. However, such systems
only make maps of the semantic classes pre-defined in
training datasets, which limits how broadly they can be
used. Further, their performance in applications is often
disappointing as soon as real-world conditions vary too much
from their training data.
In this paper we demonstrate the advantages of an alterna-
tive real-time fusion method using general learned features,
which tend to have semantic properties but remain general
purpose when fused into 3D. They can then be grouped
with scene-specific semantic meaning in an open-set manner
at runtime via tiny amounts of labelling such as a human
teaching interaction. Semantic regions, objects or even object
parts can be persistently segmented in the 3D map.
In our method, input 2D RGB frames are processed by
networks pre-trained on the largest image datasets available,
such as ImageNet [2], to produce pixel-aligned banks of
features, at the same or often lower resolution than the
input frames. We employ either a classification CNN [3] or
a Transformer trained in a self-supervised manner [4]. We
deliberately use these off-the-shelf pre-trained networks to make the strong point that any sufficiently descriptive learned features are suitable for our approach.

Dyson Robotics Lab, Imperial College London, UK. Research presented in this paper has been supported by Dyson Technology Ltd.

[Fig. 1 diagram: depth and colour input feeds a Vision Front-End (general 2D feature extractor) and a SLAM Back-End (tracking and mapping); the fused features support downstream online tasks.]
Fig. 1: Method Overview. We fuse general pre-trained features into a coherent 3D neural field SLAM model in real-time. The fused feature maps enable highly efficient open set scene labelling during live operation.
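As an illustration of how the vision front-end can produce such pixel-aligned feature banks, the sketch below pulls dense patch tokens from an off-the-shelf self-supervised ViT. It is a minimal sketch, assuming PyTorch, the public DINO hub entry point dino_vits16, an input whose height and width are multiples of the 16-pixel patch size, and bilinear upsampling back to image resolution; none of these choices are claimed to be the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

# Off-the-shelf, frozen self-supervised ViT (one possible choice of extractor).
extractor = torch.hub.load('facebookresearch/dino:main', 'dino_vits16').eval()

@torch.no_grad()
def extract_feature_map(rgb):
    """rgb: [3, H, W] ImageNet-normalised image, H and W divisible by 16.
    Returns a pixel-aligned feature bank of shape [C, H, W]."""
    x = rgb.unsqueeze(0)                                       # [1, 3, H, W]
    tokens = extractor.get_intermediate_layers(x, n=1)[0]      # [1, 1 + h*w, C]
    patch_tokens = tokens[:, 1:, :]                            # drop the [CLS] token
    h, w = x.shape[-2] // 16, x.shape[-1] // 16                # coarse patch grid
    feat = patch_tokens.permute(0, 2, 1).reshape(1, -1, h, w)  # [1, C, h, w]
    # Upsample to full image resolution so every pixel has a feature vector.
    return F.interpolate(feat, size=rgb.shape[-2:], mode='bilinear',
                         align_corners=False)[0]               # [C, H, W]
```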
Rather than fusing features via essentially painting feature
distributions onto an explicit 3D geometric reconstruction
as is done with semantic classes in [1], here we represent
geometry and feature maps jointly via a neural field. Neural
fields have been recently shown to enable joint representation
of geometry and semantics within a single network, such
as in the off-line SemanticNeRF system [5]. The great
advantage of this is that the semantic representation inherits
the coherence of shape and colour reconstruction, and this
means that semantic regions can accurately fit the shapes of
objects even with very sparse annotation.
We base our new real-time neural feature fusion system on
iMAP [6], a neural field SLAM system which uses RGB-D
input to efficiently map scenes up to room scale. We augment
iMAP with a new latent volumetric rendering technique,
which enables fusion of very high dimensional feature maps
with little computational or memory overhead.
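The technique is detailed later in the paper; as a rough sketch of what latent volumetric rendering means, a small per-sample latent can be alpha-composited along each ray with the usual NeRF-style weights, and only the single composited latent per ray is then lifted to the full feature dimension, so the expensive high-dimensional output is computed once per pixel rather than once per 3D sample. The latent size, hidden width and output dimension below are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LatentFeatureRenderer(nn.Module):
    """Illustrative latent volumetric rendering: composite low-dimensional
    latents along each ray, then lift the result to a high-dimensional feature."""

    def __init__(self, latent_dim=32, feature_dim=1536):
        super().__init__()
        # Applied once per ray (per pixel), not once per 3D sample.
        self.lift = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def forward(self, sigma, latent, deltas):
        # sigma:  [R, S]     per-sample volume density along R rays, S samples each
        # latent: [R, S, D]  per-sample low-dimensional latent from the scene MLP
        # deltas: [R, S]     spacing between consecutive samples along each ray
        alpha = 1.0 - torch.exp(-sigma * deltas)
        transmittance = torch.cumprod(
            torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
            dim=1)[:, :-1]
        weights = alpha * transmittance                           # [R, S]
        ray_latent = (weights.unsqueeze(-1) * latent).sum(dim=1)  # [R, D]
        return self.lift(ray_latent)                              # [R, feature_dim]
```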
We call our scene representation “feature-realistic” as a
counterpoint to the “photo-realistic” scene models which are
the aim of many neural field approaches such as NeRF [7].
We believe that robotics usually does not need scene rep-
resentations which precisely model the light and colours in
a scene, and that it is more valuable and efficient to store
abstract feature maps which relate much more closely to
semantic properties.
We demonstrate the scene understanding properties of our
system via an open-set semantic segmentation task with
sparse user interaction, which represents the way a human
might interact with a robot to efficiently teach it about a
scene’s properties and objects. The user provides a few pointing
clicks to give labels to pixels, and the system then predicts
these label properties for the whole scene. We show that
compelling dense 3D scene semantic mapping is possible
with incredibly sparse teaching input at runtime, even for object categories which were never present in training datasets.
Usually the user only needs to place one click on an object
of a certain type for all instances of that class to be densely
segmented from their surroundings. We evaluate the system
on a new custom open-set video segmentation dataset.
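The classification head itself belongs to the method section; one simple way to picture how a handful of clicks can label an entire scene is a nearest-prototype lookup over the fused feature map, sketched below. The cosine-similarity rule and the 2D feature-map layout are illustrative assumptions, not the paper's exact classifier.

```python
import torch
import torch.nn.functional as F

def propagate_clicks(fused_feats, clicks):
    """fused_feats: [C, H, W] feature map rendered from the neural field.
    clicks: list of ((row, col), label_id) pairs given interactively.
    Returns an [H, W] label map assigning every pixel to its closest click."""
    feats = F.normalize(fused_feats, dim=0)                     # unit-norm channels
    prototypes = torch.stack([feats[:, r, c] for (r, c), _ in clicks])  # [K, C]
    labels = torch.tensor([label for _, label in clicks])               # [K]
    sims = torch.einsum('kc,chw->khw', prototypes, feats)      # cosine similarity
    best = sims.argmax(dim=0)                                   # closest prototype
    return labels.to(best.device)[best]                         # [H, W] label map
```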
To summarise, our contributions are as follows:
• The first neural field feature fusion system operating in real-time;
• A system that operates incrementally and successfully handles exploration of previously unobserved scene regions;
• A latent volumetric rendering technique which allows fusion of up to 1536-dimensional feature maps with negligible performance overhead compared to iMAP, and a scene representation of only 3 MB of parameters;
• A dynamic open set semantic segmentation application of the presented method.
II. RELATED WORK
SemanticFusion [1], an extension of ElasticFusion [8],
introduced a mechanism to incrementally fuse 2D semantic
label predictions from a CNN into a three-dimensional
environment map. Among other similar systems, the panoptic
fusion approach of [9] made an advance by explicitly repre-
senting object instances alongside semantic region classes.
The latest systems in this vein wield neural fields as an
underlying 3D representation. The advantageous properties
of the coherence of neural fusion were first shown by Se-
mantic NeRF [5], with variations aimed towards multi-scene
generalisation and panoptic fusion demonstrated in [10, 11].
The aforementioned methods suffer from a train-
ing/runtime domain gap and the inherently closed-set nature
of a fixed semantic label set: both the domain and the target label set are determined by the dataset used to pre-train the semantic segmentation model.
Our method relates to two recently released approaches,
Distilled Feature Fields (DFF) [12] and Neural Feature
Fusion Fields (N3F) [13], which also add a feature output
branch to a neural field network and supervise the renders
with the outputs of a pre-trained feature extractor.
Unlike our work, N3F and DFF supervise neural fields
with up to 64- and 384-dimensional feature maps respectively, 24× and 5× smaller than the feature maps fused by our proposed method. Both DFF and N3F operate in an off-line protocol
similar to NeRF and require approximately a day to converge
on a single scene, whereas our system operates at interactive
frame rates making it useful for robotics. Additionally, N3F
heavily leverages offline assumptions about the input sequence:
all frames have to be known prior to training, due to a
pre-processing step which executes dimensionality reduction
jointly on all input feature maps. In our online execution
paradigm these assumptions would be fundamentally vio-
lated and the input distribution might change drastically in
a few seconds (e.g. entering a new room).
Both N3F and DFF mainly consider object retrieval and
3D object segmentation mask extraction scenarios. In con-
trast, we focus on extracting all object instances of varying
appearance and geometry, given a semantic class. While
DFF also considers the semantic segmentation scenario, it
fuses the penultimate activations of a pre-trained semantic
segmentation model. This method is therefore essentially
equivalent to a SemanticNeRF-style approach with the same
benefits and pitfalls, such as the domain gap.
Our method achieves real-time performance by using a
core neural field SLAM approach based on iMAP [6], with a
small MLP network, RGB-D input and guided keyframe and
pixel sampling for efficiency. This type of efficient network
is well suited to semantic and label fusion. Recent work
iLabel [14], also based on iMAP, demonstrated a form of interactive scene segmentation that requires no prior training data. The
coherence of the neural field alone was shown to be a basis
for segmenting objects from sparse interaction. However, in
iLabel there was little evidence that annotation of an object
led to grouping with other instances of the same class. In our
work we specifically show that this becomes possible due to
fusion of general features from an off-the-shelf pre-trained
network.
Our method also closely relates to SemanticPaint [15],
an older online interactive labelling system. SemanticPaint,
like our system, operates by propagating user-given labels
to novel object instances. However, propagation is severely
limited to objects which are almost identical apart from
colour. The core of SemanticPaint is a random forest
classifier with hand-crafted features and refinement with a
Conditional Random Field. This machinery cannot compete
in pattern recognition abilities with the modern deep learning
methods for computer vision our approach builds on. Our
system combines the best properties of neural fields, which encourage coherent segmentation [14], with the power of features from general pre-trained networks.
III. METHOD
[Fig. 2 diagram: a position (x, y, z) passes through a positional encoding and the scene MLP, which outputs volume density σ, semantics s, and latent features f; 2D volumetric rendering of the latents is followed by refinement into the final feature f_r.]
Fig. 2: Scene Network. Our scene MLP predicts semantics and latent features, which are further refined after the volumetric rendering.
Our system is composed of two principal components:
a pre-trained frozen 2D image feature extractor (vision
front-end) and an iMAP-like SLAM system (SLAM back-
end). While our method technically allows an image feature
extractor of any choice, we focus on ones that are general,
i.e. not trained for dense prediction tasks.
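To give a feel for how the frozen front-end supervises the feature branch of the SLAM back-end during mapping, the snippet below sketches one plausible per-pixel training objective: the usual iMAP-style photometric and depth terms plus a term tying rendered features to the extractor's output at the sampled pixels. The loss weights and the exact form of each term are assumptions for illustration, not the authors' published objective.

```python
import torch.nn.functional as F

def mapping_loss(rendered, target, w_rgb=5.0, w_depth=1.0, w_feat=1.0):
    """rendered / target: dicts with 'rgb' [N, 3], 'depth' [N], 'feat' [N, C]
    for a batch of N sampled pixels. Weights are illustrative assumptions."""
    loss_rgb = F.l1_loss(rendered['rgb'], target['rgb'])        # photometric term
    loss_depth = F.l1_loss(rendered['depth'], target['depth'])  # geometric term
    # Feature fusion term: match the frozen extractor's features at these pixels.
    loss_feat = F.mse_loss(rendered['feat'], target['feat'])
    return w_rgb * loss_rgb + w_depth * loss_depth + w_feat * loss_feat
```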