
clicks to give labels to pixels, and the system then predicts
these label properties for the whole scene. We show that
compelling dense 3D scene semantic mapping is possible
with incredibly sparse teaching input at runtime, even for object categories which were never present in training datasets.
Usually the user only needs to place one click on an object
of a certain type for all instances of that class to be densely
segmented from their surroundings. We evaluate the system
on a new custom open-set video segmentation dataset.
To summarise, our contributions are as follows:
• The first neural field feature fusion system operating in real time;
• A system that operates incrementally and successfully handles exploration of previously unobserved scene regions;
• A latent volumetric rendering technique which allows fusion of up to 1536-dimensional feature maps with negligible performance overhead compared to iMAP, and a scene representation of only 3 MB of parameters;
• A dynamic open-set semantic segmentation application of the presented method.
II. RELATED WORK
SemanticFusion [1], an extension of ElasticFusion [8],
introduced a mechanism to incrementally fuse 2D semantic
label predictions from a CNN into a three-dimensional
environment map. Among other similar systems, the panoptic
fusion approach of [9] made an advance by explicitly repre-
senting object instances alongside semantic region classes.
The latest systems in this vein wield neural fields as an
underlying 3D representation. The advantageous coherence properties of neural fusion were first shown by Semantic NeRF [5], with variations aimed at multi-scene generalisation and panoptic fusion demonstrated in [10, 11].
The aforementioned methods suffer from a training/runtime domain gap and the inherently closed-set nature of a fixed semantic label set: both the domain and the target label set are fixed by the dataset used to pre-train the semantic segmentation model.
Our method relates to two recently released approaches,
Distilled Feature Fields (DFF) [12] and Neural Feature
Fusion Fields (N3F) [13], which also add a feature output
branch to a neural field network and supervise the renders
with the outputs of a pre-trained feature extractor.
Unlike our work, N3F and DFF supervise neural fields with only up to 64- and 384-dimensional feature maps respectively, which is 24× and 5× smaller than in our proposed method. Both DFF and N3F operate in an offline protocol similar to NeRF and require approximately a day to converge on a single scene, whereas our system operates at interactive frame rates, making it useful for robotics. Additionally, N3F heavily leverages offline assumptions about the input sequence: all frames have to be known prior to training, due to a pre-processing step which performs dimensionality reduction jointly on all input feature maps. In our online execution paradigm these assumptions would be fundamentally violated, and the input distribution may change drastically within a few seconds (e.g. when entering a new room).
Both N3F and DFF mainly consider object retrieval and
3D object segmentation mask extraction scenarios. In con-
trast, we focus on extracting all object instances of varying
appearance and geometry, given a semantic class. While
DFF also considers the semantic segmentation scenario, it
fuses the penultimate activations of a pre-trained semantic
segmentation model. This method is therefore essentially
equivalent to a SemanticNeRF-style approach with the same
benefits and pitfalls, such as the domain gap.
Our method achieves real-time performance by using a
core neural field SLAM approach based on iMAP [6], with a
small MLP network, RGB-D input and guided keyframe and
pixel sampling for efficiency. This type of efficient network
is well suited to semantic and label fusion. The recent iLabel system [14], also based on iMAP, demonstrated interactive scene segmentation that requires no prior training data. The
coherence of the neural field alone was shown to be a basis
for segmenting objects from sparse interaction. However, in
iLabel there was little evidence that annotation of an object
led to grouping with other instances of the same class. In our
work we specifically show that this becomes possible due to
fusion of general features from an off-the-shelf pre-trained
network.
Our method also closely relates to SemanticPaint [15],
an older online interactive labelling system. SemanticPaint,
like our system, operates by propagating user-given labels
to novel object instances. However, propagation is severely
limited to objects which are almost identical apart from
colour. The core of SemanticPaint is a random forest classifier with hand-crafted features, refined with a Conditional Random Field. This machinery cannot compete in pattern recognition ability with the modern deep learning methods for computer vision that our approach builds on. Our system combines the best properties of neural fields, which encourage coherent segmentation [14], with the power of features from general pre-trained networks.
III. METHOD
Fig. 2: Scene Network. Overview of our scene network. Given a positionally encoded 3D position (x, y, z), the scene MLP predicts volume density σ, semantics s, and latent features f; the latent features are volumetrically rendered into 2D and further refined into the final feature f_r.
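To make the scene network of Fig. 2 concrete, below is a minimal PyTorch sketch of a network of this kind. The layer widths, the 32-dimensional latent size, the sinusoidal positional encoding, and the post-render refinement MLP are illustrative assumptions rather than our exact configuration; the sketch only illustrates that a low-dimensional latent is composited along each ray and lifted to the full 1536-D feature once per pixel, after rendering.

```python
# Illustrative sketch of a Fig. 2-style scene network (not the exact architecture).
import torch
import torch.nn as nn

class SceneNet(nn.Module):
    def __init__(self, n_freq=10, hidden=256, n_classes=10,
                 latent_dim=32, final_dim=1536):
        super().__init__()
        self.n_freq = n_freq
        in_dim = 3 + 3 * 2 * n_freq  # xyz plus sin/cos positional encoding
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma = nn.Linear(hidden, 1)             # volume density head
        self.sem = nn.Linear(hidden, n_classes)       # semantic logits head
        self.latent = nn.Linear(hidden, latent_dim)   # low-dimensional latent feature head
        # Refinement MLP applied *after* rendering: lifts the rendered latent to 1536-D.
        self.refine = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, final_dim))

    def encode(self, x):
        # Standard sinusoidal positional encoding of 3D points.
        feats = [x]
        for k in range(self.n_freq):
            feats += [torch.sin((2 ** k) * x), torch.cos((2 ** k) * x)]
        return torch.cat(feats, dim=-1)

    def forward(self, xyz, deltas):
        """xyz: (R, S, 3) samples along R rays; deltas: (R, S) inter-sample distances."""
        h = self.trunk(self.encode(xyz))
        sigma = torch.relu(self.sigma(h)).squeeze(-1)          # (R, S) volume density
        alpha = 1.0 - torch.exp(-sigma * deltas)                # per-sample opacity
        trans = torch.cumprod(torch.cat(
            [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
            dim=1)[:, :-1]                                      # accumulated transmittance
        w = alpha * trans                                       # rendering weights
        sem = (w[..., None] * self.sem(h)).sum(dim=1)           # rendered semantics
        lat = (w[..., None] * self.latent(h)).sum(dim=1)        # rendered low-dim latent
        feat = self.refine(lat)   # lift to the final high-dimensional feature, once per pixel
        return sigma, sem, feat
```

Because the refinement MLP runs once per rendered pixel rather than once per ray sample, the per-sample cost stays close to that of a standard iMAP-style network even when the output features are high-dimensional.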
Our system is composed of two principal components:
a pre-trained frozen 2D image feature extractor (vision
front-end) and an iMAP-like SLAM system (SLAM back-
end). While our method technically allows any choice of image feature extractor, we focus on ones that are general, i.e. not trained for dense prediction tasks.
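As a rough illustration of how the frozen vision front-end can supervise the SLAM back-end, the sketch below compares features rendered by the scene network above with the extractor's output at sparsely sampled keyframe pixels. The extractor interface, the bilinear upsampling of its feature map to pixel resolution, and the L1 loss are assumptions made for this example, not a specification of our training losses; any backbone producing a dense 2D feature map of matching dimensionality could play this role.

```python
# Illustrative supervision of rendered features by a frozen 2D feature extractor.
import torch
import torch.nn.functional as F

@torch.no_grad()
def target_features(extractor, rgb):
    """rgb: (3, H, W) keyframe image -> (C, H, W) feature map at pixel resolution.
    Assumes `extractor` maps a (1, 3, H, W) batch to a (1, C, H', W') feature map."""
    fmap = extractor(rgb.unsqueeze(0))
    return F.interpolate(fmap, size=rgb.shape[-2:], mode='bilinear',
                         align_corners=False)[0]

def feature_loss(scene_net, extractor, rgb, xyz, deltas, pix_uv):
    """Supervise rendered features at sparsely sampled pixels of one keyframe.
    xyz: (R, S, 3) ray samples for R sampled pixels, deltas: (R, S),
    pix_uv: (R, 2) integer (u, v) pixel coordinates of those rays."""
    tgt = target_features(extractor, rgb)               # frozen front-end output
    tgt = tgt[:, pix_uv[:, 1], pix_uv[:, 0]].t()         # (R, C) targets at sampled pixels
    _, _, feat = scene_net(xyz, deltas)                  # (R, C) rendered features
    return F.l1_loss(feat, tgt)
```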