with only a few methods focusing on 3D object detection
[34, 37, 39] or 3D semantic segmentation [29, 40, 63] and
no methods for 3D instance segmentation. Overall, these
approaches still lag behind current state-of-the-art methods [38, 48, 56, 57] in terms of performance.
In this work, we propose the first Transformer-based model
for 3D semantic instance segmentation of large-scale scenes
that sets new state-of-the-art scores over a wide range of
datasets, and addresses the aforementioned problems of hand-crafted model designs. The main challenge lies in
directly predicting instance masks and their corresponding
semantic labels. To this end, our model predicts instance
queries that encode semantic and geometric information
of each instance in the scene. Each instance query is then
further decoded into a semantic class and an instance feature.
The key idea for directly generating masks is to compute similarity scores between individual instance features and all point features in the point cloud [4, 6, 21]. This results in a heatmap over the point cloud which, after normalization and thresholding, yields the final binary instance mask (cf. Fig. 1).
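To make this step concrete, the following is a minimal sketch of mask prediction from similarity scores; the tensor shapes, the `predict_masks` helper, and the sigmoid normalization with a fixed threshold are illustrative assumptions rather than the exact implementation:

```python
import torch

def predict_masks(instance_feats, point_feats, threshold=0.5):
    """Turn per-instance features into binary point-level masks.

    instance_feats: (K, D) tensor, one feature vector per instance.
    point_feats:    (N, D) tensor, one feature vector per point.
    """
    # Similarity of every instance feature to every point feature
    # gives one heatmap over the point cloud per instance.
    heatmaps = instance_feats @ point_feats.T        # (K, N)
    # Normalize scores to [0, 1], then threshold to binary masks.
    return torch.sigmoid(heatmaps) > threshold

# Example: 20 instances, 50k points, 128-dim features.
masks = predict_masks(torch.randn(20, 128), torch.randn(50_000, 128))
```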
Our model, called Mask3D, builds on recent advances in both Transformers [5, 37] and 3D deep learning [8, 17, 57]: to compute strong point features, we leverage
a sparse convolutional feature backbone [8] that efficiently
processes full scenes and naturally provides multi-scale point
features.
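As a flavor of such a backbone, below is a toy sketch built on the MinkowskiEngine library (the sparse convolution framework of [8]); the `TinyEncoder` module and its layer widths are illustrative assumptions, not our actual backbone configuration:

```python
import torch
import MinkowskiEngine as ME

class TinyEncoder(torch.nn.Module):
    """Toy sparse encoder returning features at several resolutions."""

    def __init__(self, in_ch=3, chs=(32, 64, 128)):
        super().__init__()
        layers, prev = [], in_ch
        for ch in chs:
            layers.append(torch.nn.Sequential(
                ME.MinkowskiConvolution(prev, ch, kernel_size=3,
                                        stride=2, dimension=3),
                ME.MinkowskiReLU()))
            prev = ch
        self.stages = torch.nn.ModuleList(layers)

    def forward(self, x):
        # Each stride-2 stage halves the voxel resolution; collecting the
        # intermediate outputs yields multi-scale point features.
        multi_scale = []
        for stage in self.stages:
            x = stage(x)
            multi_scale.append(x)
        return multi_scale

# Voxelized input: per-point RGB features and integer voxel coordinates
# (first column is the batch index, as MinkowskiEngine expects).
coords = torch.randint(0, 100, (1000, 4)); coords[:, 0] = 0
x = ME.SparseTensor(torch.rand(1000, 3), coordinates=coords.int())
features = TinyEncoder()(x)
```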
To generate instance queries, we rely on stacked Transformer decoders [5, 6] that iteratively attend to learned
point features in a coarse-to-fine fashion using non-parametric
queries [37].
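The sketch below illustrates such a query-refinement loop with standard PyTorch attention layers; the `QueryRefinement` module, its dimensions, and the use of a single feature resolution are simplifying assumptions (our model attends to multiple feature scales and additionally applies feed-forward layers):

```python
import torch
import torch.nn as nn

class QueryRefinement(nn.Module):
    """Refine instance queries against point features over several layers."""

    def __init__(self, dim=128, num_layers=3, num_heads=8):
        super().__init__()
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers))
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers))

    def forward(self, queries, point_feats):
        # queries:     (B, K, D) instance queries
        # point_feats: (B, N, D) point features from the backbone
        for cross, self_ in zip(self.cross_attn, self.self_attn):
            # Queries gather evidence about "their" instance from the points,
            queries = queries + cross(queries, point_feats, point_feats)[0]
            # then exchange information with the other queries.
            queries = queries + self_(queries, queries, queries)[0]
        return queries

# Example: refine 20 queries against 4096 point features.
refined = QueryRefinement()(torch.randn(1, 20, 128), torch.randn(1, 4096, 128))
```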
Unlike voting-based methods, our approach directly predicts and supervises masks, which poses a challenge during training:
before computing a mask loss, we first have to establish
correspondences between predicted and annotated masks.
A naïve solution would be to choose for each predicted
mask the nearest ground truth mask [21]. However, this
does not guarantee an optimal matching and any unmatched
annotated mask would not contribute to the loss. Instead,
we perform bipartite graph matching to obtain optimal
associations between ground truth and predicted masks [2, 61].
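For intuition, here is a minimal sketch of such an optimal assignment using the Hungarian algorithm from SciPy; the dice-style matching cost and the `match_masks` helper are illustrative choices, not necessarily the exact cost used in our model:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_masks(pred_masks, gt_masks, eps=1e-6):
    """Optimal one-to-one matching of predicted and ground-truth masks.

    pred_masks: (K, N) predicted mask probabilities in [0, 1].
    gt_masks:   (M, N) binary ground-truth masks (float tensor).
    """
    # Pairwise negated dice overlap as the matching cost.
    inter = pred_masks @ gt_masks.T                              # (K, M)
    denom = pred_masks.sum(-1, keepdim=True) + gt_masks.sum(-1)  # (K, M)
    cost = -(2 * inter + eps) / (denom + eps)
    # The Hungarian algorithm minimizes the total cost; with K >= M,
    # every annotated mask is matched and contributes to the loss.
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx

# Example: 20 predictions vs. 5 annotated instances on 4096 points.
rows, cols = match_masks(torch.rand(20, 4096),
                         (torch.rand(5, 4096) > 0.5).float())
```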
We evaluate our model on four challenging 3D instance
segmentation datasets, ScanNet v2 [9], ScanNet200 [47],
S3DIS [1], and STPLS3D [3], and significantly outperform
prior art, even surpassing architectures that are highly tuned
towards specific datasets. Our experimental study compares
various query types and different mask losses, and evaluates the number of queries as well as the number of Transformer decoder steps.
Our contributions are as follows: (1) We propose the first competitive Transformer-based model for 3D semantic instance segmentation. (2) Our model, named Mask3D, builds on domain-agnostic components, avoiding center voting, non-maximum suppression, or grouping heuristics, and overall requires less hand-tuning. (3) Mask3D achieves state-of-the-art performance on ScanNet, ScanNet200, S3DIS and STPLS3D. To reach that level of performance with a Transformer-based approach, it is key to predict instance queries that encode the semantics and geometry of the scene and objects.
II. RELATED WORK
3D Instance Segmentation.
Numerous methods have been proposed for 3D semantic instance segmentation, including bottom-up approaches [12, 28, 32, 58, 59], top-down approaches [22, 61], and more recently, voting-based approaches [4, 13, 18, 26, 56]. MASC [32] uses a multi-scale
hierarchical feature backbone similar to ours; however, the
multi-scale features are used to compute pairwise affinities
followed by an offline clustering step. Such backbones are
also successfully employed in other fields [5, 48]. Another
influential work is DyCo3D [21], which is among the few
approaches that directly predict instance masks without
a subsequent clustering step. DyCo3D relies on dynamic
convolutions [25, 54], which are similar in spirit to our mask
prediction mechanism. However, it does not use optimal
supervision assignment during training, resulting in subpar
performance. Optimal assignment of the supervision signal
was first implemented by 3D-BoNet [61] using Hungarian
matching. Similar to our approach, [61] directly predicts all instances in parallel. However, it uses only a single-scale scene descriptor, which cannot encode object masks of diverse sizes.
Transformers.
Initially proposed by Vaswani et al. [55] for
NLP, Transformers have recently revolutionized the field of
computer vision with successful models such as ViT [11] for
image classification, DETR [2] for 2D object detection, or
Mask2Former [5, 6] for 2D segmentation tasks. The success
of Transformers has been less prominent in the 3D point cloud domain, however, and recent Transformer-based methods focus
on either 3D object detection [34, 37, 39] or 3D semantic
segmentation [29, 40, 63]. Most of these rely on specific
attention modifications to deal with the quadratic complexity
of the attention [29, 39, 40, 63]. Liu et al. [34] use a vanilla Transformer decoder, but only to refine object proposals,
whereas Misra et al. [37] are the first to show how to apply
a vanilla Transformer to point clouds, though still relying on an initial learned downsampling stage. DyCo3D [21]
also uses a Transformer, though only at the bottleneck of the feature backbone to increase the receptive field size; it is not related to our mask prediction mechanism for 3D instance segmentation.
In this work, we show how a vanilla Transformer decoder can
be applied to the task of 3D semantic instance segmentation
and achieve state-of-the-art performance.
III. METHOD
Fig. 2 illustrates our end-to-end 3D instance segmentation model Mask3D. As in Mask2Former [5], our model includes a feature backbone, a Transformer decoder built from mask modules, and Transformer decoder layers used for query refinement. At the core of the model are instance
queries, each of which should represent one object instance in
the scene and predict the corresponding point-level instance
mask. To that end, the instance queries are iteratively refined
by the Transformer decoder (Fig. 2), which allows the
instance queries to cross-attend to point features extracted
from the feature backbone and self-attend to the other instance
queries. This process is repeated for multiple iterations and