Mask3D: Mask Transformer for 3D Semantic Instance Segmentation
Jonas Schult1, Francis Engelmann2,3, Alexander Hermans1, Or Litany4, Siyu Tang2, Bastian Leibe1
Abstract Modern 3D semantic instance segmentation ap-
proaches predominantly rely on specialized voting mechanisms
followed by carefully designed geometric clustering techniques.
Building on the successes of recent Transformer-based methods
for object detection and image segmentation, we propose the
first Transformer-based approach for 3D semantic instance seg-
mentation. We show that we can leverage generic Transformer
building blocks to directly predict instance masks from 3D point
clouds. In our model – called Mask3D – each object instance is
represented as an instance query. Using Transformer decoders,
the instance queries are learned by iteratively attending to point
cloud features at multiple scales. Combined with point features,
the instance queries directly yield all instance masks in parallel.
Mask3D has several advantages over current state-of-the-art
approaches: it (1) does not rely on voting schemes that require
hand-selected geometric properties (such as centers), (2) needs
no geometric grouping mechanisms with manually tuned
hyper-parameters (e.g., radii), and (3) enables a loss that directly
optimizes instance masks. Mask3D sets a new state-of-the-art on
ScanNet test (+ 6.2 mAP), S3DIS 6-fold (+ 10.1 mAP), STPLS3D
(+ 11.2 mAP) and ScanNet200 test (+ 12.4 mAP).
I. INTRODUCTION
This work addresses the task of semantic instance seg-
mentation of 3D scenes. That is, given a 3D point cloud,
the desired output is a set of object instances represented as
binary foreground masks (over all input points) with their
corresponding semantic labels (e.g., ‘chair’, ‘table’, ‘window’).
Instance segmentation resides at the intersection of two
problems: semantic segmentation and object detection. There-
fore, methods have opted either to first learn semantic point
features and then group them into separate instances
(bottom-up), or to detect object instances and then refine
their semantic masks (top-down). Bottom-up approaches
(ASIS [59], SGPN [58], 3D-BEVIS [12]) employ contrastive
learning, mapping points to a high-dimensional feature space
where features of the same instance are close together, and
far apart otherwise. Top-down methods (3D-SIS [22], 3D-
BoNet [61]) use an approach akin to Mask R-CNN [19]:
First detect instances as bounding boxes and then perform
mask segmentation on each box individually. While 3D-SIS
[22] relies on predefined anchor boxes [19], 3D-BoNet [61]
proposes an interesting variation that predicts bounding boxes
from a global scene descriptor and optimizes an association
loss based on bipartite matching [27]. A major step forward
was sparked by powerful feature backbones [17, 53, 57] such
as sparse convolutional networks [8, 16] that improve over
existing PointNets [42, 44] and dense 3D CNNs [36, 43, 60].
1Computer Vision Group, RWTH Aachen University, Germany.
2Computer Vision and Learning Group, ETH Zürich, Switzerland.
3ETH AI Center, Zürich, Switzerland.
4NVIDIA, Santa Clara, USA
Fig. 1: Mask3D.
We train an end-to-end model for 3D semantic
instance segmentation on point clouds. Given an input 3D point cloud
(left), our Transformer-based model uses an attention mechanism to
produce instance heatmaps across all points (center) and directly
predicts all semantic object instances in parallel (right).
Well-established 2D CNN architectures [20, 46] can now easily
be adapted to sparse 3D data. These models can process large-
scale 3D scenes in one pass, which is necessary to capture
global scene context at multiple scales. As a result, bottom-
up approaches which benefit from strong features (MTML
[28], MASC [32]) experienced another performance boost.
Soon after, inspired by Hough voting approaches [24, 30],
VoteNet [41] proposed center-voting for 3D object detection.
Instead of mapping points to an abstract high-dimensional
feature space (as in bottom-up approaches), points now vote
for their object center – votes from the same object are
then closer to each other which enables geometric grouping
into instance masks. This idea quickly influenced the 3D
instance segmentation field, and by now, the vast majority
of current state-of-the-art 3D instance segmentation methods
[4, 13, 18, 26, 56] make use of both object center-voting and
sparse feature backbones.
Although 3D instance segmentation has made impressive
progress, current approaches have several major problems:
typical state-of-the-art models are based on manually-tuned
components, such as voting mechanisms that predict hand-
selected geometric properties (e.g., centers [26], bounding
boxes [7], occupancy [18]), and heuristics for clustering the
votes (e.g., dual-set grouping [26], proposal aggregation [13],
set aggregation/filtering [4]). Another limitation of these
models is that they are not designed to directly predict
instance masks. Instead, masks are obtained by grouping
votes, and the model is trained using proxy-losses on the votes.
A more elegant alternative consists of directly predicting
and supervising instance masks, such as 3D-BoNet [61]
or DyCo3D [21]. Recently, this idea gained popularity in
2D object detection (DETR [2]) and image segmentation
(Mask2Former [5, 6]) but so far received less attention in 3D
[21, 37, 61]. At the same time, in 2D image processing, we
observe a strong shift from ubiquitous CNN architectures
[19, 20, 45, 46] towards Transformer-based models [6, 11, 33].
In 3D, the move towards Transformers is less pronounced
with only a few methods focusing on 3D object detection
[34, 37, 39] or 3D semantic segmentation [29, 40, 63] and
no methods for 3D instance segmentation. Overall, these
approaches still lag behind current state-of-the-art methods
[38, 48, 56, 57] in terms of performance.
In this work, we propose the first Transformer-based model
for 3D semantic instance segmentation of large-scale scenes
that sets new state-of-the-art scores over a wide range of
datasets, and addresses the aforementioned problems on
hand-crafted model designs. The main challenge lies in
directly predicting instance masks and their corresponding
semantic labels. To this end, our model predicts instance
queries that encode semantic and geometric information
of each instance in the scene. Each instance query is then
further decoded into a semantic class and an instance feature.
The key idea (to directly generate masks) is to compute
similarity scores between individual instance features and all
point features in the point cloud [4, 6, 21]. This results in a
heatmap over the point cloud, which (after normalization
and thresholding) yields the final binary instance mask
(cf. Fig. 1). Our model, called Mask3D, builds on recent
advances in both Transformers [5, 37] and 3D deep learning
[8, 17, 57]: to compute strong point features, we leverage
a sparse convolutional feature backbone [8] that efficiently
processes full scenes and naturally provides multi-scale point
features. To generate instance queries, we rely on stacked
Transformer decoders [5, 6] that iteratively attend to learned
point features in a coarse-to-fine fashion using non-parametric
queries [37]. Unlike in voting-based methods, directly predicting
and supervising masks poses some challenges during training:
before computing a mask loss, we first have to establish
correspondences between predicted and annotated masks.
A naïve solution would be to choose for each predicted
mask the nearest ground truth mask [21]. However, this
does not guarantee an optimal matching and any unmatched
annotated mask would not contribute to the loss. Instead,
we perform bipartite graph matching to obtain optimal
associations between ground truth and predicted masks [2, 61].
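As a rough illustration of this matching step, the sketch below pairs predicted and ground-truth masks with the Hungarian algorithm; the Dice-style cost and all names are our own assumptions rather than the exact matching cost used by Mask3D.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_masks(pred_heatmaps, gt_masks):
    # pred_heatmaps: (K, N) soft instance heatmaps in [0, 1]
    # gt_masks:      (G, N) binary ground-truth instance masks
    eps = 1e-6
    inter = pred_heatmaps @ gt_masks.T                               # (K, G) soft intersections
    sums = pred_heatmaps.sum(1, keepdims=True) + gt_masks.sum(1)[None, :]
    cost = 1.0 - (2.0 * inter + eps) / (sums + eps)                  # 1 - Dice as matching cost
    pred_idx, gt_idx = linear_sum_assignment(cost)                   # optimal bipartite assignment
    return list(zip(pred_idx, gt_idx))                               # each GT mask matched at most once
```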
We evaluate our model on four challenging 3D instance
segmentation datasets, ScanNet v2 [9], ScanNet200 [47],
S3DIS [1] and STPLS3D [3] and significantly outperform
prior art, even surpassing architectures that are highly tuned
towards specific datasets. Our experimental study compares
various query types, different mask losses, and evaluates the
number of queries as well as Transformer decoder steps.
Our contributions are as follows: (1) We propose the first
competitive Transformer-based model for 3D semantic instance
segmentation. (2) Our model, named Mask3D, builds on
domain-agnostic components, avoiding center voting, non-maximum
suppression, and grouping heuristics, and overall requires less
hand-tuning. (3) Mask3D achieves state-of-the-art performance on
ScanNet, ScanNet200, S3DIS and STPLS3D. To reach that level of
performance with a Transformer-based approach, it is key to
predict instance queries that encode the semantics and geometry
of the scene and objects.
II. RELATED WORK
3D Instance Segmentation.
Numerous methods have been
proposed for 3D semantic instance segmentation, includ-
ing bottom-up approaches [12, 28, 32, 58, 59], top-down ap-
proaches [22, 61], and more recently, voting-based ap-
proaches [4, 13, 18, 26, 56]. MASC [32] uses a multi-scale
hierarchical feature backbone similar to ours; however, its
multi-scale features are used to compute pairwise affinities
followed by an offline clustering step. Such backbones are
also successfully employed in other fields [5, 48]. Another
influential work is DyCo3D [21], which is among the few
approaches that directly predict instance masks without
a subsequent clustering step. DyCo3D relies on dynamic
convolutions [25, 54], which are similar in spirit to our mask
prediction mechanism. However, it does not use optimal
supervision assignment during training, resulting in subpar
performance. Optimal assignment of the supervision signal
was first implemented by 3D-BoNet [61] using Hungarian
matching. Similar to ours, 3D-BoNet directly predicts all instances
in parallel. However, it uses only a single-scale scene
descriptor which cannot encode object masks of diverse sizes.
Transformers.
Initially proposed by Vaswani et al. [55] for
NLP, Transformers have recently revolutionized the field of
computer vision with successful models such as ViT [11] for
image classification, DETR [2] for 2D object detection, or
Mask2Former [5, 6] for 2D segmentation tasks. The success
of Transformers has been less prominent in the 3D point cloud
domain though and recent Transformer-based methods focus
on either 3D object detection [34, 37, 39] or 3D semantic
segmentation [29, 40, 63]. Most of these rely on specific
attention modifications to deal with the quadratic complexity
of the attention [29, 39, 40, 63]. Liu et al. [34] use a vanilla
Transformer decoder, but only to refine object proposals,
whereas Misra et al. [37] are the first to show how to apply
a vanilla Transformer to point clouds, still relying on an
initial learned downsampling stage though. DyCo3D [21]
also uses a Transformer, but only at the bottleneck of the
feature backbone to increase the receptive field size; it is not
related to our mask prediction mechanism for 3D instance segmentation.
In this work, we show how a vanilla Transformer decoder can
be applied to the task of 3D semantic instance segmentation
and achieve state-of-the-art performance.
III. METHOD
Fig. 2 illustrates our end-to-end 3D instance segmentation
model Mask3D. As in Mask2Former [5], our model includes
a feature backbone, and a Transformer decoder built from
mask modules and Transformer decoder layers used for
query refinement. At the core of the model are instance
queries, each of which should represent one object instance in
the scene and predict the corresponding point-level instance
mask. To that end, the instance queries are iteratively refined
by the Transformer decoder (Fig. 2), which allows the
instance queries to cross-attend to point features extracted
from the feature backbone and to self-attend to the other instance
queries. This process is repeated for multiple iterations and
feature scales, yielding the final set of refined instance queries.
A mask module consumes the refined instance queries together
with the point features, and returns (for each query) a semantic
class and a binary instance mask based on the dot product
between point features and instance queries. Next, we describe
each of these components in more detail.

Fig. 2: Illustration of the Mask3D model. The feature backbone
outputs multi-scale point features $F$, while the Transformer decoder
iteratively refines the instance queries $X$. Given point features and
instance queries, the mask module predicts for each query a semantic
class and an instance heatmap, which (after thresholding) results
in a binary instance mask $B$. $\tau$ applies a threshold of 0.5 and
spatially rescales if required. $\odot$ denotes the dot product. $\sigma$ is the
sigmoid function. We show a simplified model with fewer layers.
Sparse Feature Backbone. (Fig. 2) We use a sparse
convolutional U-net backbone with a symmetrical encoder and
decoder, based on the MinkowskiEngine [8]. Given a colored
input point cloud $P \in \mathbb{R}^{N \times 6}$ of size $N$, it is first quantized
into $M_0$ voxels $V \in \mathbb{R}^{M_0 \times 3}$, where each voxel is assigned
the average RGB color of the points within that voxel as its
initial feature. Next to the full-resolution output feature map
$F_0 \in \mathbb{R}^{M_0 \times D}$, we also extract a multi-resolution hierarchy of
features from the backbone decoder before upsampling to the
next finer feature map. At each of these resolutions $r \geq 0$ we
can extract features for a set of $M_r$ voxels, which we linearly
project to a fixed and common dimension $D$, yielding feature
matrices $F_r \in \mathbb{R}^{M_r \times D}$. We let the queries attend to features
from coarser feature maps of the backbone decoder, i.e. $r \geq 1$,
and use the full-resolution feature map ($r = 0$) to compute
the auxiliary and final per-voxel instance masks.
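For concreteness, the following minimal NumPy sketch illustrates the input quantization step; the voxel size, function name, and data layout are our own assumptions, and the actual implementation relies on the MinkowskiEngine [8].

```python
import numpy as np

def voxelize(points_xyz, points_rgb, voxel_size=0.02):
    """Toy sketch: map N colored points to M0 voxels, assigning each voxel
    the mean RGB of the points it contains (illustrative only)."""
    # integer voxel coordinates for every point
    coords = np.floor(points_xyz / voxel_size).astype(np.int64)       # (N, 3)
    # unique voxels and an index mapping each point to its voxel
    voxels, inverse = np.unique(coords, axis=0, return_inverse=True)  # (M0, 3), (N,)
    # average the RGB colors of all points falling into the same voxel
    feats = np.zeros((len(voxels), 3), dtype=np.float64)
    counts = np.bincount(inverse, minlength=len(voxels))
    for c in range(3):
        feats[:, c] = np.bincount(inverse, weights=points_rgb[:, c],
                                  minlength=len(voxels)) / counts
    return voxels, feats  # voxel coordinates V and initial per-voxel features
```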
Mask Module. (Fig. 2) Given the set of $K$ instance queries
$X \in \mathbb{R}^{K \times D}$, we predict a binary mask for each instance
and classify each of them as one of $C$ classes or as being
inactive. To create the binary mask, we map the instance
queries through an MLP $f_{\text{mask}}(\cdot)$ to the same feature space
as the backbone output features. We then compute the dot
product between these instance features and the backbone
features $F_0$. The resulting similarity scores are fed through
a sigmoid and thresholded at 0.5, yielding the final binary
mask $B \in \{0,1\}^{M_0 \times K}$:

$B = \big\{\, b_{i,j} = \big[\,\sigma\big(F_0\, f_{\text{mask}}(X)^T\big)_{i,j} > 0.5\,\big] \,\big\}. \qquad (1)$

We apply the mask module to the refined queries $X$ at each
Transformer layer using the full-resolution feature map $F_0$, to
create auxiliary binary masks for the masked cross-attention
of the following refinement step. When this mask is used as
input for the masked cross-attention, we reduce the resolution
according to the voxel feature resolution by average pooling.
Next to the binary mask, we predict a single semantic class
per instance. This step is done via a linear projection layer
into $C+1$ dimensions, followed by a softmax. While prior
work [4, 13, 56] typically needs to obtain the semantic label
of an instance via majority voting or grouping over per-point
predicted semantics, this information is directly contained in
the refined instance queries.
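A minimal PyTorch sketch of this module is given below, assuming a two-layer MLP for $f_{\text{mask}}$ and generic layer names; the exact layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskModule(nn.Module):
    """Sketch of the mask module: project instance queries with an MLP,
    take the dot product with full-resolution voxel features, apply a
    sigmoid and a 0.5 threshold (Eq. 1), and predict C+1 class logits."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.f_mask = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))
        self.classifier = nn.Linear(dim, num_classes + 1)  # +1 for "inactive"

    def forward(self, queries, voxel_feats):
        # queries: (K, D) refined instance queries X
        # voxel_feats: (M0, D) full-resolution backbone features F0
        heatmap = torch.sigmoid(voxel_feats @ self.f_mask(queries).T)  # (M0, K)
        binary_mask = heatmap > 0.5                                    # B in {0,1}^(M0 x K)
        class_logits = self.classifier(queries)                        # (K, C+1)
        return heatmap, binary_mask, class_logits
```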
Query Refinement. (Fig. 2) The Transformer decoder
starts with $K$ instance queries, and refines them through
a stack of $L$ Transformer decoder layers to a final set of
accurate, scene-specific instance queries by cross-attending
to scene features, and reasoning at the instance level through
self-attention. We discuss different types of instance queries
in Sec. III-A. Each layer attends to one of the feature maps
from the feature backbone using standard cross-attention:

$X = \mathrm{softmax}\big(Q K^T / \sqrt{D}\big)\, V. \qquad (2)$
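The following minimal PyTorch sketch mirrors Eq. 2; the linear projections that produce $Q$, $K$, and $V$ are described in the next paragraph, and all module and variable names here are our own assumptions.

```python
import torch
import torch.nn as nn

class QueryCrossAttention(nn.Module):
    """Sketch of the cross-attention in Eq. 2: instance queries attend to
    the voxel features of one backbone resolution."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)   # project instance queries X -> Q
        self.to_k = nn.Linear(dim, dim)   # project voxel features F_r -> K
        self.to_v = nn.Linear(dim, dim)   # project voxel features F_r -> V
        self.scale = dim ** -0.5

    def forward(self, inst_queries, voxel_feats):
        # inst_queries: (K, D), voxel_feats: (M_r, D)
        q, k, v = self.to_q(inst_queries), self.to_k(voxel_feats), self.to_v(voxel_feats)
        attn = torch.softmax(q @ k.T * self.scale, dim=-1)   # (K, M_r) attention weights
        return attn @ v                                      # refined queries, shape (K, D)
```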
To do so, the voxel features $F_r \in \mathbb{R}^{M_r \times D}$ are first linearly
projected to a set of keys and values of fixed dimensionality
$K, V \in \mathbb{R}^{M_r \times D}$, and our $K$ instance queries $X$ are linearly
projected to the queries $Q \in \mathbb{R}^{K \times D}$. This cross-attention
thus allows the queries to extract information from the voxel
features. The cross-attention is followed by a self-attention
step between the queries, where the keys, values, and queries
are all computed based on linear projections of the instance
queries. Without such inter-query communication, the model
could not avoid multiple instance queries latching onto the
same object, resulting in duplicate instance masks. Similar
to most Transformer-based approaches, we use positional
encodings for our keys and queries. We use Fourier positional
encodings [52] based on voxel positions. We add the resulting
positional encodings to their respective keys before computing
the cross-attention. All instance queries are also assigned a
fixed (and potentially learned) positional embedding that is
not updated throughout the query refinement process. These
positional encodings are added to the respective queries in the
cross-attention, as well as to both the keys and queries in the
self-attention. Instead of using the vanilla cross-attention
(where each query attends to all voxel features in one
resolution), we use a masked variant where each instance
query only attends to the voxels within its corresponding
intermediate instance mask $B$ predicted by the previous layer.
This is realized by adding $-\infty$ to the attention matrix for all
voxels for which the mask is 0. Eq. 2 then becomes:

$X = \mathrm{softmax}\big(Q K^T / \sqrt{D} + B'\big)\, V \quad \text{with} \quad B'_{ij} = -\infty \cdot [B_{ij} = 0] \qquad (3)$

where $[\cdot]$ are Iverson brackets. In [5], masking out the context
from the cross-attention improved segmentation. A likely
reason is that the Transformer does not need to learn to focus
on a specific instance instead of irrelevant context, but is
forced to do so by design.
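A minimal sketch of this masked variant (Eq. 3) is given below, assuming the intermediate mask has already been average-pooled to the current feature resolution; the empty-mask fallback is our own addition for numerical safety, and all names are assumptions.

```python
import torch

def masked_cross_attention(q, k, v, inst_mask):
    """Sketch of masked cross-attention (Eq. 3). q: (K, D) projected
    instance queries, k/v: (M_r, D) projected voxel features, inst_mask:
    (M_r, K) binary masks B from the previous layer, pooled to this resolution."""
    d = q.shape[-1]
    logits = q @ k.T / d ** 0.5                                # (K, M_r)
    # B'_{ij} = -inf where B_{ij} = 0: queries cannot attend outside their mask
    masked = logits.masked_fill(inst_mask.T == 0, float("-inf"))
    # guard: a query whose mask is completely empty falls back to full attention
    empty = inst_mask.sum(dim=0) == 0                          # (K,)
    masked[empty] = logits[empty]
    attn = torch.softmax(masked, dim=-1)
    return attn @ v                                            # refined queries, shape (K, D)
```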