Learning Equivariant Segmentation with
Instance-Unique Querying
Wenguan Wang
ReLER, AAII, University of Technology Sydney
James Liang
Rochester Institute of Technology
Dongfang Liu
Rochester Institute of Technology
https://github.com/JamesLiang819/Instance_Unique_Querying
Abstract
Prevalent state-of-the-art instance segmentation methods fall into a query-based
scheme, in which instance masks are derived by querying the image feature using
a set of instance-aware embeddings. In this work, we devise a new training framework that boosts query-based models through discriminative query embedding
learning. It explores two essential properties, namely dataset-level uniqueness
and transformation equivariance, of the relation between queries and instances.
First, our algorithm uses the queries to retrieve the corresponding instances from
the whole training dataset, instead of only searching within individual scenes. As
querying instances across scenes is more challenging, the segmenters are forced to
learn more discriminative queries for effective instance separation. Second, our
algorithm encourages both image (instance) representations and queries to be equivariant against geometric transformations, leading to more robust instance-query matching. On top of four famous query-based models (i.e., CondInst, SOLOv2, SOTR, and Mask2Former), our training algorithm provides significant performance gains (e.g., +1.6∼3.2 AP) on the COCO dataset. In addition, our algorithm improves the performance of SOLOv2 by 2.7 AP on the LVISv1 dataset.
1 Introduction
Instance segmentation, i.e., labeling image pixels with classes and instances, plays a critical role in a
wide range of applications, e.g., autonomous driving, medical health, and augmented reality. Modern
instance segmentation solutions are largely built upon three paradigms: top-down (‘detect-then-
segment’) [1
19], bottom-up (‘label-then-cluster’) [20
28], and single-shot (‘directly-predict’) [29
44]. Among them, the top-leading algorithms [18,34,39
44] typically operate in a query-based mode,
in which a set of instance-aware embeddings is learned and used to query the dense image feature for
instance mask prediction. The key to their triumph is the instance-aware query vectors that are learned
to encode the characteristics (e.g., location, appearance) of instances [34,43]. By straightforwardly
minimizing the differences between the retrieved and groundtruth instance masks, the query-based
methods, in essence, learn the query vectors for instance discrimination only within individual scenes.
As a result, existing query-based instance segmentation algorithms place a premium on intra-scene
analysis during network training. Since the scenario in a single training scene is simple, i.e., the diversity and volume of object instances as well as the complexity of the background are typically limited, learning to distinguish between object instances only within the same training scene is less challenging, and this inevitably hinders the discriminative potential of the learned instance queries.
Authors contributed equally.
Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.00911v2 [cs.CV] 19 Dec 2022
This work brings a paradigm shift in training query-based instance segmenters: it goes beyond the de
facto, within-scene training strategy by further considering the cross-scene level query embedding
separation of different instances – querying instances from the whole training dataset. The underlying
rationale is intuitive yet powerful: an advanced instance segmenter should be able to differentiate all
the instances of the entire dataset, rather than only the ones within single scenes. Concretely, in our
training framework, the queries are not only learned to fire on the pixels of their counterpart instances
in the current training image, but also forced to mismatch the pixels in other training images. By virtue
of intra- and inter-scene instance disambiguation, our framework forces the query-based segmenters to learn more discriminative query vectors capable of uniquely identifying the corresponding instances even at the dataset level. To further facilitate the establishment of robust, one-to-one relation between queries and instances, we complement our training framework with a transformation equivariance constraint, accommodating the equivariance property of the instance segmentation task to
geometric transformations. For example, if we crop or flip the input image, we expect the image (instance) features and query embeddings to change accordingly, so as to appropriately reflect the variation of instance patterns (e.g., scale, position, shape, etc.) caused by the input transformation.

[Figure 1: scatter plot of COCO Mask AP against inference speed (ms) for CondInst, SOLOv2, SOTR, and Mask2Former with ResNet-50, ResNet-101, and Swin-S/B/L backbones, comparing each base model with "+ ours".]
Figure 1: Our training algorithm yields solid performance gains over state-of-the-art query-based models [34,39–41] without architectural modification or inference speed delay.
Exploring intra- and inter-scene instance uniqueness as well as transformation equivariance leads to a general yet powerful training framework. Our algorithm, in principle, can be seamlessly incorporated into the training process of existing query-based instance segmenters. For comprehensive evaluation, we apply our algorithm to four representative, query-based models (i.e., CondInst [34], SOLOv2 [39], SOTR [41], and Mask2Former [40]) with various backbones (i.e., ResNet [45], Swin [46]). Experiments on COCO [47]
verify our impressive performance, i.e., +2.8∼3.1, +2.9∼3.2, +2.4∼2.6, and +1.6∼2.4 AP gains over CondInst, SOLOv2, SOTR, and Mask2Former, respectively (see Fig. 1). Our algorithm also brings remarkable improvement, +2.7 AP, on the LVISv1 [48] dataset, on top of SOLOv2 [39]. These results are particularly impressive considering our training algorithm causes neither architectural change nor extra computational load during model deployment.
2 Related Work
This section summarizes the most relevant research on
instance segmentation and equivariant learning.
Instance Segmentation.
With the renaissance of connectionism, remarkable progress has been made in instance segmentation. Existing deep learning based solutions can be broadly classified into three paradigms: top-down, bottom-up, and single-shot. Following the idea of ‘detect-then-segment’, top-down methods [1–18] predict a bounding box for each object and then output an instance mask for each box. Though effective, this type of method is complicated and depends on the prior detection results. In contrast, bottom-up methods [20–27] adopt a ‘label-then-cluster’ strategy: learning per-pixel embeddings and then grouping them into different instances. Albeit simple, this type of method relies on the performance of post-processing and easily suffers from under- or over-segmentation problems. Inspired by the advance of single-stage object detection [49,50], a few recent efforts approach instance segmentation in a single-shot fashion, by coalescing detection and segmentation over pre-defined anchor boxes [29–33], or directly predicting instance masks from feature maps [34–44]. This type of method is well recognized and generally demonstrates a better speed-accuracy trade-off [51].
Despite the blossoming of diverse approaches, the vast majority of recent top-performing algorithms [18,34,39–44] fall into one grand category – query-based models. The query-based methods
utilize compact, learnable embedding vectors to represent instances of interest and leverage them as
queries to decode masks from image features. Their triumph is founded on comprehensively encoding
instance-specific properties (e.g., location, appearance) into the query vectors, which significantly
increases prediction robustness. For instance, [34,39] exploit the technique of dynamic filter [52] to
generate instance-specific descriptors, which are convolved with image feature maps for instance
mask decoding. Inspired by DETR [53], [40,42,43] alternatively leverage a Transformer decoder to
obtain instance-aware query embeddings and cast instance segmentation as a set prediction problem.
Our contribution is orthogonal to these designs, and existing query-based segmenters can all benefit from it. We build a new training framework that sharpens the instance-discriminative capability of query-based segmenters. This is achieved by matching query embeddings with instances both within and across training scenes. Such an intra- and inter-scene instance querying strategy is further enhanced by an equivariance regularisation term, addressing not only the uniqueness but also the robustness of instance-query relations.
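To make the cross-scene matching idea concrete, the minimal sketch below is entirely our own illustration (the function name, shapes, and the plain mean-response penalty are assumptions; the paper's actual inter-scene loss is its Eq. 5): queries learned for one image should not fire on the pixels of a different training image.

```python
import numpy as np

def inter_scene_mismatch_loss(queries_a, feat_b):
    """Penalize responses of image A's queries on image B's feature map.

    queries_a: (N, D) instance-aware query embeddings from image A.
    feat_b:    (H, W, D) dense feature map of a different training image B.
    Returns the mean sigmoid response, to be driven toward 0 during training.
    """
    H, W, D = feat_b.shape
    logits = feat_b.reshape(H * W, D) @ queries_a.T  # (H*W, N) similarities
    responses = 1.0 / (1.0 + np.exp(-logits))        # soft "mask" activations
    return float(responses.mean())                   # low value = good separation
```

Minimizing such a term alongside the standard intra-scene mask loss pushes queries to uniquely identify their instances across the dataset rather than only within one scene.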
Equivariant Representation Learning.
Transformations play a critical role in learning expressive
representations by transforming images as a means to reveal the intrinsic patterns from transformed
visual structures [54]. Motivated by the concept of translation equivariance underlying the success of
CNNs, numerous efforts (e.g., capsule nets [55,56], group equivariant convolutions [57], and harmonic
networks [58]) investigate learning more powerful representations equivariant to generic types of
transformations [59,60]. A representation $f$ is said to be equivariant with respect to a transformation $g$ for an input (say, image) $I$ if $f(g(I)) \approx g(f(I))$. In other words, the output representation $f(I)$ transforms in the same manner (or, in a broad sense, a predictable manner) given the input transformation $g$. Many recent self-supervised learning methods [61–63] encourage the representations to be invariant under transformations, i.e., $f(g(I)) \approx f(I)$. As such, invariance can be viewed as a special case of equivariance [64] where the output representation $f(I)$ does not vary with the input transformation $g$.
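As a toy numerical illustration (our own, not from the paper), a stride-1 average pooling "feature extractor" $f$ is equivariant to a horizontal flip $g$, while invariance clearly fails:

```python
import numpy as np

def f(img):
    # Toy dense "feature extractor": 2x2 average pooling, stride 1, valid padding.
    return (img[:-1, :-1] + img[:-1, 1:] + img[1:, :-1] + img[1:, 1:]) / 4.0

def g(x):
    # Geometric transformation: horizontal flip.
    return x[:, ::-1]

I = np.random.default_rng(0).random((8, 8))

# Equivariance: the feature map transforms the same way as the input.
assert np.allclose(f(g(I)), g(f(I)))

# Invariance (f(g(I)) == f(I)) does NOT hold for this spatial feature map.
assert not np.allclose(f(g(I)), f(I))
```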
In our training framework, we fully explore the inherent, transformation-equivariance nature of the
instance segmentation task to pursue reliable, one-to-one correspondence between learnable queries
and object instances. This is accomplished by promoting the equivariance of both query embeddings
and feature representations with respect to spatial transformations, i.e., cropping or flipping of an
input image should result in correspondingly changed feature representation, query embeddings, as
well as instance mask predictions. Note that invariance is not suitable for the instance segmentation task, as it encourages the feature map (and segmentation mask) not to vary with the input transformation.
Our algorithm is also in contrast to the common data augmentation strategy, in which the transformed
images and annotations are used directly as additional individual training examples, without any
constraint about the relation between the representations (and queries) produced from the original
and transformed views. Our experimental results (see §4.3) also evidence the superiority of our
transformation equivariance learning over transformation-based data augmentation.
3 Methodology
Next, we first formulate instance segmentation from a classical view of mask prediction and classification (§3.1). Then we describe our new training framework (§3.2) and implementation details (§3.3).
3.1 Problem Formulation
Instance segmentation seeks a partition of an observed image $I \in \mathbb{R}^{H \times W \times 3}$ into $K$ instance regions:
$$\{Y_k\}_{k=1}^{K} = \{(M_k, c_k)\}_{k=1}^{K}, \quad \text{where}\ M_k \in \{0,1\}^{H \times W},\ c_k \in \{1,\cdots,C\}. \tag{1}$$
Here the instances of interest are represented by a total of $K$ non-overlapping binary masks $\{M_k\}_{k=1}^{K}$ as well as corresponding class labels $\{c_k\}_{k=1}^{K}$ (e.g., table, chair, etc.). For a pixel $i \in I$, its counterpart value in the $k$-th groundtruth mask, i.e., $M_k(i)$, denotes whether $i$ belongs to instance $k$ ($1$) or not ($0$). Note that the number of instances, $K$, varies across different images. Existing mainstream (or, more precisely, most top-down and single-shot) solutions approach the task by decomposing the image $I$ into a fixed-size set of soft masks. In this setting, each mask is associated with a probability distribution over all the $C$ categories. The output can thus be represented as a set of $N$ mask-probability pairs:
$$\{\hat{Y}_n\}_{n=1}^{N} = \{(\hat{M}_n, \hat{p}_n)\}_{n=1}^{N}, \quad \text{where}\ \hat{M}_n \in [0,1]^{H \times W},\ \hat{p}_n \in \triangle^{C}. \tag{2}$$
Here $\triangle^{C}$ stands for the $C$-dimensional probability simplex. The size of the prediction set, $N$, is usually set as a constant and much larger than the typical number of object instances in an image. Hence the training objective penalizes the errors of both label prediction and mask estimation:
$$\mathcal{L}(\{\hat{Y}_n\}_{n=1}^{N}, \{Y_k\}_{k=1}^{K}) = \sum\nolimits_{n=1}^{N} \mathcal{L}_{\text{cls}}(\hat{p}_n, c_{\sigma(n)}) + \mathcal{L}_{\text{mask}}(\hat{M}_n, M_{\sigma(n)}), \tag{3}$$
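To make the objective concrete, here is a minimal sketch of Eq. 3 (our own naming and simplifications, not the authors' code): cross-entropy is used for $\mathcal{L}_{\text{cls}}$ and a soft dice loss [66] for $\mathcal{L}_{\text{mask}}$, and the matching $\sigma$ is assumed given; real systems establish it with Hungarian matching or location-based rules [6,53], and unmatched predictions are simply skipped here rather than assigned a "no object" class.

```python
import numpy as np

def dice_loss(pred, gt, eps=1.0):
    # Soft dice loss between a predicted mask in [0,1] and a binary GT mask.
    inter = (pred * gt).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def set_prediction_loss(pred_masks, pred_probs, gt_masks, gt_classes, sigma):
    """Eq. 3: sum classification + mask losses over matched predictions.

    sigma[n] is the matched groundtruth index for prediction n, or None.
    """
    total = 0.0
    for n, k in enumerate(sigma):
        if k is None:  # unmatched query (simplification: no "no object" penalty)
            continue
        total += -np.log(pred_probs[n][gt_classes[k]] + 1e-8)  # L_cls
        total += dice_loss(pred_masks[n], gt_masks[k])         # L_mask
    return total
```

A prediction that closely matches its assigned groundtruth mask and class produces a lower total loss than a degenerate one, which is what drives query learning in the standard intra-scene regime.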
[Figure 2: an image $I$ and a transformed view $g(I)$ are processed by the dense feature extractor $f$ and the query creator $h$ to produce queries $\{q_n\}_{n=1}^{N}$ and mask predictions; training combines intra-scene instance discrimination $\mathcal{L}_{\text{intra\_mask}}$ (Eq. 5, standard), inter-scene instance discrimination $\mathcal{L}_{\text{inter\_mask}}$ (Eq. 5, ours) against other dataset images, and transformation equivariance $\mathcal{L}_{\text{equi}}$ (Eq. 8, ours).]
Figure 2: Overview of our new training framework for query-based instance segmentation. Rather than the current intra-scene training paradigm, our framework addresses inter-scene instance discrimination and transformation equivariance for discriminative instance query embedding learning (see §3.2). To improve readability, for image $I$, we only plot one extra image $I'$ for cross-scene training.
where $\sigma$ refers to the matching between the prediction and groundtruth sets (established by certain rules [6,53]). The classification loss $\mathcal{L}_{\text{cls}}$ is typically the cross-entropy loss or focal loss [65]; the mask prediction loss $\mathcal{L}_{\text{mask}}$ can be the cross-entropy loss in [40], the dice loss [66] in [34,39], or the focal loss in [53]. While many approaches supplement $\mathcal{L}$ with various extra losses (e.g., bounding box loss [6,32,34,53], ranking loss [67], semantic segmentation loss [13,32]), later we will show that our cross-scene training scheme is fundamentally different from (yet complementary to) the current scene-wise training paradigm.
3.2 Equivariant Learning with Intra- and Inter-Scene Instance Uniqueness
Query-based Instance Segmentation.
As clearly indicated by Eq. 2, the prediction masks $\{\hat{M}_n\}_{n=1}^{N}$ are the means of separating instances at the pixel level. Current top-performing instance segmenters [18,34,39–43] typically generate mask predictions in a query-based fashion (see the middle part of Fig. 2). Let $f$ be a dense feature extractor (e.g., an encoder-decoder fully convolutional network [68]) that produces a $D$-dimensional dense embedding $\boldsymbol{I}$ for image $I$, i.e., $\boldsymbol{I} = f(I) \in \mathbb{R}^{H \times W \times D}$. Then a query creator $h$ is adopted to produce a set of $N$ instance-aware embedding vectors $\{q_n \in \mathbb{R}^{d}\}_{n=1}^{N}$, which are used to query the image representation $\boldsymbol{I}$ for instance mask decoding:
$$\{\hat{M}_n\}_{n=1}^{N} = \{\langle q_n, \boldsymbol{I} \rangle\}_{n=1}^{N}, \quad \text{where}\ \{q_n\}_{n=1}^{N} = h(\boldsymbol{I}). \tag{4}$$
Here $\langle \cdot, \cdot \rangle$ is a certain similarity measure performed pixel-wise, and $\boldsymbol{I}$ (typically) refers to a low-resolution feature representation of image $I$. Note that position information is integrated into either or both of $I$ and $\boldsymbol{I}$, to make the model location-sensitive. The query creator $h$ is implemented as a dynamic network [34,39] or a Transformer decoder [40,42]. For a dynamic network based $h$, it predicts $N$ convolution filters $\{q_n\}_{n=1}^{N}$ dynamically conditioned on the input $\boldsymbol{I}$, and hence $\langle \cdot, \cdot \rangle$ refers to convolution ($d \neq D$). For a Transformer decoder based $h$, it additionally leverages a set of $N$ learnable positional embeddings (omitted for brevity) to gather instance-related context from $\boldsymbol{I}$; the collected context is stored in $\{q_n\}_{n=1}^{N}$ and $\langle \cdot, \cdot \rangle$ is computed as a dot product for instance mask decoding ($d = D$).
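The dot-product variant of Eq. 4 ($d = D$) can be sketched in a few lines; the function name and the sigmoid squashing are our assumptions for illustration:

```python
import numpy as np

def decode_masks(feat, queries):
    """Query-based mask decoding, Eq. 4 with <.,.> as a pixel-wise dot product.

    feat:    (H, W, D) dense image embedding I = f(I).
    queries: (N, D) instance-aware query embeddings {q_n} = h(I).
    Returns (N, H, W) soft instance masks.
    """
    H, W, D = feat.shape
    logits = feat.reshape(H * W, D) @ queries.T         # (H*W, N) similarities
    masks = 1.0 / (1.0 + np.exp(-logits))               # squash to [0, 1]
    return masks.reshape(H, W, -1).transpose(2, 0, 1)   # (N, H, W)
```

For dynamic-filter based models [34,39], the dot product would instead be a convolution with query-generated kernels ($d \neq D$).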
Eq. 4 informs us that query-based segmenters in essence learn $N$ compact descriptors $\{q_n\}_{n=1}^{N}$ to grasp critical characteristics (e.g., appearance and location) of potentially interesting instances, and use these instance descriptors as queries to retrieve corresponding pixels from $\boldsymbol{I}$. It is thus reasonable to assume the discriminative ability of the learned query embeddings is crucial for the performance of query-based methods. Viewed in this light, a question naturally arises: how to learn discriminative instance query embeddings? Yet, this fundamental question has been largely ignored in the literature so far.
To respond to this question, we exploit two crucial properties of instance-query matching, namely uniqueness and robustness. This is achieved by addressing dataset-level uniqueness and transformation equivariance during the learning of query-based segmenters, eventually leading to a powerful training scheme.
Learning with Intra- and Inter-Scene Instance Uniqueness.
If we closely scrutinize the current de facto training regime (cf. Eq. 3) and the working mode of query-based segmenters (cf. Eq. 4), we can find that the mask prediction loss $\mathcal{L}_{\text{mask}}$ forces each query $q_n$ to match the pixels $i \in I$ of its counterpart instance $k = \sigma(n)$, i.e., $M_{\sigma(n)}(i) = 1$, and to mismatch the pixels $i' \in I$ of other instances $k \neq \sigma(n)$, i.e.,