with only a few methods focusing on 3D object detection
[34, 37, 39] or 3D semantic segmentation [29, 40, 63] and
no methods for 3D instance segmentation. Overall, these
approaches still lag behind current state-of-the-art methods [38, 48, 56, 57] in terms of performance.
In this work, we propose the first Transformer-based model
for 3D semantic instance segmentation of large-scale scenes
that sets new state-of-the-art scores over a wide range of
datasets, and addresses the aforementioned problems of hand-crafted model designs. The main challenge lies in
directly predicting instance masks and their corresponding
semantic labels. To this end, our model predicts instance
queries that encode semantic and geometric information
of each instance in the scene. Each instance query is then
further decoded into a semantic class and an instance feature.
The key idea for directly generating masks is to compute similarity scores between individual instance features and all point features in the point cloud [4, 6, 21]. This results in a heatmap over the point cloud which, after normalization and thresholding, yields the final binary instance mask (cf. Fig. 1).
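To make this step concrete, the following is a minimal sketch of mask prediction from similarity scores; the tensor shapes, the `predict_masks` helper, and the sigmoid normalization with a fixed threshold are illustrative assumptions rather than the exact implementation:

```python
import torch

def predict_masks(instance_feats, point_feats, threshold=0.5):
    """Turn per-instance features into binary point-level masks.

    instance_feats: (K, D) tensor, one feature vector per instance.
    point_feats:    (N, D) tensor, one feature vector per point.
    """
    # Similarity of every instance feature to every point feature
    # gives one heatmap over the point cloud per instance.
    heatmaps = instance_feats @ point_feats.T        # (K, N)
    # Normalize scores to [0, 1], then threshold to binary masks.
    return torch.sigmoid(heatmaps) > threshold

# Example: 20 instances, 50k points, 128-dim features.
masks = predict_masks(torch.randn(20, 128), torch.randn(50_000, 128))
```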
Our model, called Mask3D, builds on recent advances in both Transformers [5, 37] and 3D deep learning [8, 17, 57]: to compute strong point features, we leverage
a sparse convolutional feature backbone [8] that efficiently
processes full scenes and naturally provides multi-scale point
features.
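As a flavor of such a backbone, below is a toy sketch built on the MinkowskiEngine library (the sparse convolution framework of [8]); the `TinyEncoder` module and its layer widths are illustrative assumptions, not our actual backbone configuration:

```python
import torch
import MinkowskiEngine as ME

class TinyEncoder(torch.nn.Module):
    """Toy sparse encoder returning features at several resolutions."""

    def __init__(self, in_ch=3, chs=(32, 64, 128)):
        super().__init__()
        layers, prev = [], in_ch
        for ch in chs:
            layers.append(torch.nn.Sequential(
                ME.MinkowskiConvolution(prev, ch, kernel_size=3,
                                        stride=2, dimension=3),
                ME.MinkowskiReLU()))
            prev = ch
        self.stages = torch.nn.ModuleList(layers)

    def forward(self, x):
        # Each stride-2 stage halves the voxel resolution; collecting the
        # intermediate outputs yields multi-scale point features.
        multi_scale = []
        for stage in self.stages:
            x = stage(x)
            multi_scale.append(x)
        return multi_scale

# Voxelized input: per-point RGB features and integer voxel coordinates
# (first column is the batch index, as MinkowskiEngine expects).
coords = torch.randint(0, 100, (1000, 4)); coords[:, 0] = 0
x = ME.SparseTensor(torch.rand(1000, 3), coordinates=coords.int())
features = TinyEncoder()(x)
```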
To generate instance queries, we rely on stacked Transformer decoders [5, 6] that iteratively attend to learned
point features in a coarse-to-fine fashion using non-parametric
queries [37].
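The sketch below illustrates such a query-refinement loop with standard PyTorch attention layers; the `QueryRefinement` module, its dimensions, and the use of a single feature resolution are simplifying assumptions (our model attends to multiple feature scales and additionally applies feed-forward layers):

```python
import torch
import torch.nn as nn

class QueryRefinement(nn.Module):
    """Refine instance queries against point features over several layers."""

    def __init__(self, dim=128, num_layers=3, num_heads=8):
        super().__init__()
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers))
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers))

    def forward(self, queries, point_feats):
        # queries:     (B, K, D) instance queries
        # point_feats: (B, N, D) point features from the backbone
        for cross, self_ in zip(self.cross_attn, self.self_attn):
            # Queries gather evidence about "their" instance from the points,
            queries = queries + cross(queries, point_feats, point_feats)[0]
            # then exchange information with the other queries.
            queries = queries + self_(queries, queries, queries)[0]
        return queries

# Example: refine 20 queries against 4096 point features.
refined = QueryRefinement()(torch.randn(1, 20, 128), torch.randn(1, 4096, 128))
```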
Unlike voting-based methods, our approach directly predicts and supervises masks, which poses a challenge during training:
before computing a mask loss, we first have to establish
correspondences between predicted and annotated masks.
A naïve solution would be to choose for each predicted
mask the nearest ground truth mask [21]. However, this
does not guarantee an optimal matching and any unmatched
annotated mask would not contribute to the loss. Instead,
we perform bipartite graph matching to obtain optimal
associations between ground truth and predicted masks [2, 61].
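For intuition, here is a minimal sketch of such an optimal assignment using the Hungarian algorithm from SciPy; the dice-style matching cost and the `match_masks` helper are illustrative choices, not necessarily the exact cost used in our model:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_masks(pred_masks, gt_masks, eps=1e-6):
    """Optimal one-to-one matching of predicted and ground-truth masks.

    pred_masks: (K, N) predicted mask probabilities in [0, 1].
    gt_masks:   (M, N) binary ground-truth masks (float tensor).
    """
    # Pairwise negated dice overlap as the matching cost.
    inter = pred_masks @ gt_masks.T                              # (K, M)
    denom = pred_masks.sum(-1, keepdim=True) + gt_masks.sum(-1)  # (K, M)
    cost = -(2 * inter + eps) / (denom + eps)
    # The Hungarian algorithm minimizes the total cost; with K >= M,
    # every annotated mask is matched and contributes to the loss.
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx

# Example: 20 predictions vs. 5 annotated instances on 4096 points.
rows, cols = match_masks(torch.rand(20, 4096),
                         (torch.rand(5, 4096) > 0.5).float())
```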
We evaluate our model on four challenging 3D instance
segmentation datasets, ScanNet v2 [9], ScanNet200 [47],
S3DIS [1], and STPLS3D [3], and significantly outperform
prior art, even surpassing architectures that are highly tuned
towards specific datasets. Our experimental study compares
various query types and different mask losses, and evaluates the number of queries as well as the number of Transformer decoder steps.
Our contributions are as follows: (1) We propose the first competitive Transformer-based model for 3D semantic instance segmentation. (2) Our model, named Mask3D, builds on domain-agnostic components, avoiding center voting, non-maximum suppression, or grouping heuristics, and overall requires less hand-tuning. (3) Mask3D achieves state-of-the-art performance on ScanNet, ScanNet200, S3DIS and STPLS3D. To reach that level of performance with a Transformer-based approach, it is key to predict instance queries that encode the semantics and geometry of the scene and objects.
II. RELATED WORK
3D Instance Segmentation.
Numerous methods have been proposed for 3D semantic instance segmentation, including bottom-up approaches [12, 28, 32, 58, 59], top-down approaches [22, 61], and more recently, voting-based approaches [4, 13, 18, 26, 56]. MASC [32] uses a multi-scale
hierarchical feature backbone similar to ours; however, the
multi-scale features are used to compute pairwise affinities
followed by an offline clustering step. Such backbones are
also successfully employed in other fields [5, 48]. Another
influential work is DyCo3D [21], which is among the few
approaches that directly predict instance masks without
a subsequent clustering step. DyCo3D relies on dynamic
convolutions [25, 54], which are similar in spirit to our mask
prediction mechanism. However, it does not use optimal
supervision assignment during training, resulting in subpar
performance. Optimal assignment of the supervision signal
was first implemented by 3D-BoNet [61] using Hungarian
matching. Similar to our approach, [61] directly predicts all instances in parallel. However, it uses only a single-scale scene descriptor, which cannot encode object masks of diverse sizes.
Transformers.
Initially proposed by Vaswani et al. [55] for
NLP, Transformers have recently revolutionized the field of
computer vision with successful models such as ViT [11] for
image classification, DETR [2] for 2D object detection, or
Mask2Former [5, 6] for 2D segmentation tasks. The success
of Transformers has been less prominent in the 3D point cloud domain, however, and recent Transformer-based methods focus
on either 3D object detection [34, 37, 39] or 3D semantic
segmentation [29, 40, 63]. Most of these rely on specific
attention modifications to deal with the quadratic complexity
of the attention [29, 39, 40, 63]. Liu et al. [34] use a vanilla Transformer decoder, but only to refine object proposals,
whereas Misra et al. [37] are the first to show how to apply
a vanilla Transformer to point clouds, though still relying on an initial learned downsampling stage. DyCo3D [21]
also uses a Transformer, though only at the bottleneck of the feature backbone to increase the receptive field size; it is not related to our mask prediction mechanism for 3D instance segmentation.
In this work, we show how a vanilla Transformer decoder can
be applied to the task of 3D semantic instance segmentation
and achieve state-of-the-art performance.
III. METHOD
Fig. 2 illustrates our end-to-end 3D instance segmentation model Mask3D. As in Mask2Former [5], our model includes a feature backbone, a Transformer decoder built from mask modules, and Transformer decoder layers used for query refinement. At the core of the model are instance
queries, each of which should represent one object instance in
the scene and predict the corresponding point-level instance
mask. To that end, the instance queries are iteratively refined
by the Transformer decoder (Fig. 2), which allows the
instance queries to cross-attend to point features extracted
from the feature backbone and self-attend to the other instance
queries. This process is repeated for multiple iterations and