
clicks to give labels to pixels, and the system then predicts
these label properties for the whole scene. We show that
compelling dense 3D scene semantic mapping is possible
with incredibly sparse teaching input at runtime, even for object categories which were never present in training datasets.
Usually the user only needs to place one click on an object
of a certain type for all instances of that class to be densely
segmented from their surroundings. We evaluate the system
on a new custom open-set video segmentation dataset.
To summarise, our contributions are as follows:
• The first neural field feature fusion system operating in real time;
• A system that operates incrementally and successfully handles exploration of previously unobserved scene regions;
• A latent volumetric rendering technique which allows fusion of up to 1536-dimensional feature maps with negligible performance overhead compared to iMAP, and a scene representation of only 3 MB of parameters;
• A dynamic open-set semantic segmentation application of the presented method.
II. RELATED WORK
SemanticFusion [1], an extension of ElasticFusion [8],
introduced a mechanism to incrementally fuse 2D semantic
label predictions from a CNN into a three-dimensional
environment map. Among other similar systems, the panoptic
fusion approach of [9] made an advance by explicitly repre-
senting object instances alongside semantic region classes.
The latest systems in this vein wield neural fields as an
underlying 3D representation. The advantageous coherence properties of neural fusion were first shown by Semantic NeRF [5], with variations aimed at multi-scene generalisation and panoptic fusion demonstrated in [10, 11].
The aforementioned methods suffer from a training/runtime domain gap and the inherently closed-set nature of a fixed semantic label set: both the domain and the target label set are fixed by the dataset used to pre-train the semantic segmentation model.
Our method relates to two recently released approaches,
Distilled Feature Fields (DFF) [12] and Neural Feature
Fusion Fields (N3F) [13], which also add a feature output
branch to a neural field network and supervise the renders
with the outputs of a pre-trained feature extractor.
Unlike our work, N3F and DFF supervise neural fields with only up to 64- and 384-dimensional feature maps respectively, which is 24× and 5× smaller than in our proposed method. Both DFF and N3F operate in an offline protocol similar to NeRF and require approximately a day to converge on a single scene, whereas our system operates at interactive frame rates, making it useful for robotics. Additionally, N3F heavily leverages offline assumptions about the input sequence: all frames have to be known prior to training, due to a pre-processing step which performs dimensionality reduction jointly on all input feature maps. In our online execution paradigm these assumptions would be fundamentally violated, and the input distribution may change drastically within a few seconds (e.g. when entering a new room).
Both N3F and DFF mainly consider object retrieval and
3D object segmentation mask extraction scenarios. In con-
trast, we focus on extracting all object instances of varying
appearance and geometry, given a semantic class. While
DFF also considers the semantic segmentation scenario, it
fuses the penultimate activations of a pre-trained semantic
segmentation model. This method is therefore essentially
equivalent to a SemanticNeRF-style approach with the same
benefits and pitfalls, such as the domain gap.
Our method achieves real-time performance by using a
core neural field SLAM approach based on iMAP [6], with a
small MLP network, RGB-D input and guided keyframe and
pixel sampling for efficiency. This type of efficient network
is well suited to semantic and label fusion. The recent iLabel system [14], also based on iMAP, demonstrated interactive scene segmentation that requires no prior training data. The
coherence of the neural field alone was shown to be a basis
for segmenting objects from sparse interaction. However, in
iLabel there was little evidence that annotation of an object
led to grouping with other instances of the same class. In our
work we specifically show that this becomes possible due to
fusion of general features from an off-the-shelf pre-trained
network.
Our method also closely relates to SemanticPaint [15],
an older online interactive labelling system. SemanticPaint,
like our system, operates by propagating user-given labels
to novel object instances. However, propagation is severely
limited to objects which are almost identical apart from
colour. The core of SemanticPaint is a random forest classifier with hand-crafted features, refined with a Conditional Random Field. This machinery cannot compete in pattern recognition ability with the modern deep learning methods for computer vision that our approach builds on. Our system combines the best properties of neural fields, which encourage coherent segmentation [14], with the power of features from general pre-trained networks.
III. METHOD
Fig. 2: Scene Network. Overview of our scene network. Given a positionally encoded 3D position (x, y, z), the scene MLP predicts volume density σ, semantics s, and latent features f; the latent features are volumetrically rendered into 2D and further refined into the final feature f_r.
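To make the scene network of Fig. 2 concrete, below is a minimal PyTorch sketch of a network of this kind. The layer widths, the 32-dimensional latent size, the sinusoidal positional encoding, and the post-render refinement MLP are illustrative assumptions rather than our exact configuration; the sketch only illustrates that a low-dimensional latent is composited along each ray and lifted to the full 1536-D feature once per pixel, after rendering.

```python
# Illustrative sketch of a Fig. 2-style scene network (not the exact architecture).
import torch
import torch.nn as nn

class SceneNet(nn.Module):
    def __init__(self, n_freq=10, hidden=256, n_classes=10,
                 latent_dim=32, final_dim=1536):
        super().__init__()
        self.n_freq = n_freq
        in_dim = 3 + 3 * 2 * n_freq  # xyz plus sin/cos positional encoding
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma = nn.Linear(hidden, 1)             # volume density head
        self.sem = nn.Linear(hidden, n_classes)       # semantic logits head
        self.latent = nn.Linear(hidden, latent_dim)   # low-dimensional latent feature head
        # Refinement MLP applied *after* rendering: lifts the rendered latent to 1536-D.
        self.refine = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, final_dim))

    def encode(self, x):
        # Standard sinusoidal positional encoding of 3D points.
        feats = [x]
        for k in range(self.n_freq):
            feats += [torch.sin((2 ** k) * x), torch.cos((2 ** k) * x)]
        return torch.cat(feats, dim=-1)

    def forward(self, xyz, deltas):
        """xyz: (R, S, 3) samples along R rays; deltas: (R, S) inter-sample distances."""
        h = self.trunk(self.encode(xyz))
        sigma = torch.relu(self.sigma(h)).squeeze(-1)          # (R, S) volume density
        alpha = 1.0 - torch.exp(-sigma * deltas)                # per-sample opacity
        trans = torch.cumprod(torch.cat(
            [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
            dim=1)[:, :-1]                                      # accumulated transmittance
        w = alpha * trans                                       # rendering weights
        sem = (w[..., None] * self.sem(h)).sum(dim=1)           # rendered semantics
        lat = (w[..., None] * self.latent(h)).sum(dim=1)        # rendered low-dim latent
        feat = self.refine(lat)   # lift to the final high-dimensional feature, once per pixel
        return sigma, sem, feat
```

Because the refinement MLP runs once per rendered pixel rather than once per ray sample, the per-sample cost stays close to that of a standard iMAP-style network even when the output features are high-dimensional.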
Our system is composed of two principal components:
a pre-trained frozen 2D image feature extractor (vision
front-end) and an iMAP-like SLAM system (SLAM back-
end). While our method technically allows any choice of image feature extractor, we focus on ones that are general, i.e. not trained for dense prediction tasks.
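As a rough illustration of how the frozen vision front-end can supervise the SLAM back-end, the sketch below compares features rendered by the scene network above with the extractor's output at sparsely sampled keyframe pixels. The extractor interface, the bilinear upsampling of its feature map to pixel resolution, and the L1 loss are assumptions made for this example, not a specification of our training losses; any backbone producing a dense 2D feature map of matching dimensionality could play this role.

```python
# Illustrative supervision of rendered features by a frozen 2D feature extractor.
import torch
import torch.nn.functional as F

@torch.no_grad()
def target_features(extractor, rgb):
    """rgb: (3, H, W) keyframe image -> (C, H, W) feature map at pixel resolution.
    Assumes `extractor` maps a (1, 3, H, W) batch to a (1, C, H', W') feature map."""
    fmap = extractor(rgb.unsqueeze(0))
    return F.interpolate(fmap, size=rgb.shape[-2:], mode='bilinear',
                         align_corners=False)[0]

def feature_loss(scene_net, extractor, rgb, xyz, deltas, pix_uv):
    """Supervise rendered features at sparsely sampled pixels of one keyframe.
    xyz: (R, S, 3) ray samples for R sampled pixels, deltas: (R, S),
    pix_uv: (R, 2) integer (u, v) pixel coordinates of those rays."""
    tgt = target_features(extractor, rgb)               # frozen front-end output
    tgt = tgt[:, pix_uv[:, 1], pix_uv[:, 0]].t()         # (R, C) targets at sampled pixels
    _, _, feat = scene_net(xyz, deltas)                  # (R, C) rendered features
    return F.l1_loss(feat, tgt)
```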