
Learning Equivariant Segmentation with Instance-Unique Querying

Wenguan Wang∗

ReLER, AAII, University of Technology Sydney

James Liang∗

Rochester Institute of Technology

Dongfang Liu†

Rochester Institute of Technology

https://github.com/JamesLiang819/Instance_Unique_Querying

Abstract

Prevalent state-of-the-art instance segmentation methods fall into a query-based scheme, in which instance masks are derived by querying the image feature using a set of instance-aware embeddings. In this work, we devise a new training framework that boosts query-based models through discriminative query embedding learning. It explores two essential properties, namely dataset-level uniqueness and transformation equivariance, of the relation between queries and instances. First, our algorithm uses the queries to retrieve the corresponding instances from the whole training dataset, instead of only searching within individual scenes. As querying instances across scenes is more challenging, the segmenters are forced to learn more discriminative queries for effective instance separation. Second, our algorithm encourages both image (instance) representations and queries to be equivariant against geometric transformations, leading to more robust instance-query matching. On top of four well-known query-based models (i.e., CondInst, SOLOv2, SOTR, and Mask2Former), our training algorithm provides significant performance gains (e.g., +1.6–3.2 AP) on the COCO dataset. In addition, our algorithm improves the performance of SOLOv2 by 2.7 AP on the LVISv1 dataset.

1 Introduction

Instance segmentation, i.e., labeling image pixels with classes and instances, plays a critical role in a wide range of applications, e.g., autonomous driving, medical health, and augmented reality. Modern instance segmentation solutions are largely built upon three paradigms: top-down (‘detect-then-segment’) [1–19], bottom-up (‘label-then-cluster’) [20–28], and single-shot (‘directly-predict’) [29–44]. Among them, the top-leading algorithms [18,34,39–44] typically operate in a query-based mode, in which a set of instance-aware embeddings is learned and used to query the dense image feature for instance mask prediction. The key to their success is the instance-aware query vectors that are learned to encode the characteristics (e.g., location, appearance) of instances [34,43]. By straightforwardly minimizing the differences between the retrieved and groundtruth instance masks, the query-based methods, in essence, learn the query vectors for instance discrimination only within individual scenes. As a result, existing query-based instance segmentation algorithms place a premium on intra-scene analysis during network training. Since the scenario in a single training scene is simple, i.e., the diversity and volume of object instances as well as the complexity of the background are typically limited, learning to distinguish between object instances only within the same training scene is less challenging, and inevitably hinders the discrimination potential of the learned instance queries.

∗authors contributed equally

†corresponding author

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

arXiv:2210.00911v2 [cs.CV] 19 Dec 2022

This work brings a paradigm shift in training query-based instance segmenters: it goes beyond the de facto within-scene training strategy by further considering the cross-scene level query embedding separation of different instances, i.e., querying instances from the whole training dataset. The underlying rationale is intuitive yet powerful: an advanced instance segmenter should be able to differentiate all the instances of the entire dataset, rather than only the ones within single scenes. Concretely, in our training framework, the queries are not only learned to fire on the pixels of their counterpart instances in the current training image, but also forced to mismatch the pixels in other training images. By virtue of intra- and inter-scene instance disambiguation, our framework forces the query-based segmenters to learn more discriminative query vectors capable of uniquely identifying the corresponding instances even at the dataset level. To further facilitate the establishment of a robust, one-to-one relation between queries and instances, we complement our training framework with a transformation equivariance constraint, accommodating the equivariance property of the instance segmentation task to geometric transformations. For example, if we crop or flip the input image, we expect the image (instance) features and query embeddings to change accordingly, so as to appropriately reflect the variation of instance patterns (e.g., scale, position, shape, etc.) caused by the input transformation.

Figure 1: Our training algorithm yields solid performance gains over state-of-the-art query-based models [34,39–41] without architectural modification or inference speed delay. (Plot: COCO mask AP versus inference speed (ms) for CondInst [34], SOLOv2 [39], SOTR [41], and Mask2Former [40], base model versus ours, with ResNet-50, ResNet-101, Swin-S, Swin-B, and Swin-L backbones.)

Exploring intra- and inter-scene instance uniqueness as well as transformation equivariance leads to a general yet powerful training framework. Our algorithm, in principle, can be seamlessly incorporated into the training process of existing query-based instance segmenters. For comprehensive evaluation, we apply our algorithm to four representative query-based models (i.e., CondInst [34], SOLOv2 [39], SOTR [41], and Mask2Former [40]) with various backbones (i.e., ResNet [45], Swin [46]). Experiments on COCO [47] verify its effectiveness, i.e., +2.8–3.1, +2.9–3.2, +2.4–2.6, and +1.6–2.4 AP gains over CondInst, SOLOv2, SOTR, and Mask2Former, respectively (see Fig. 1). Our algorithm also brings a remarkable improvement, +2.7 AP, on the LVISv1 [48] dataset, on top of SOLOv2 [39]. These results are particularly notable considering that our training algorithm causes neither architectural change nor extra computational load during model deployment.

2 Related Work

This section summarizes the most relevant research on instance segmentation and equivariant learning.

Instance Segmentation.

With the renaissance of connectionism, remarkable progress has been made in instance segmentation. Existing deep learning based solutions can be broadly classified into three paradigms: top-down, bottom-up, and single-shot. Following the idea of ‘detect-then-segment’, top-down methods [1–18] predict a bounding box for each object and then output an instance mask for each box. Though effective, this type of method is complicated and dependent on the prior detection results. In contrast, bottom-up methods [20–27] adopt a ‘label-then-cluster’ strategy: learning per-pixel embeddings and then grouping them into different instances. Albeit simple, this type of method relies on the performance of post-processing and easily suffers from under-segmentation or over-segmentation. Inspired by the advance of single-stage object detection [49,50], a few recent efforts approach instance segmentation in a single-shot fashion, by coalescing detection and segmentation over pre-defined anchor boxes [29–33], or directly predicting instance masks from feature maps [34–44]. This type of method is well recognized and generally demonstrates a better speed-accuracy trade-off [51].

Despite the blossoming of diverse approaches, the vast majority of recent top-performing algorithms [18,34,39–44] fall into one grand category: query-based models. The query-based methods utilize compact, learnable embedding vectors to represent instances of interest and leverage them as queries to decode masks from image features. Their success is founded on comprehensively encoding instance-specific properties (e.g., location, appearance) into the query vectors, which significantly increases prediction robustness. For instance, [34,39] exploit the technique of dynamic filter [52] to generate instance-specific descriptors, which are convolved with image feature maps for instance mask decoding. Inspired by DETR [53], [40,42,43] alternatively leverage a Transformer decoder to obtain instance-aware query embeddings and cast instance segmentation as a set prediction problem. Our contribution is orthogonal, and these query-based segmenters can all benefit from it. We scaffold a new training framework that sharpens the instance discriminative capability of the query-based segmenters. This is achieved by matching query embeddings with instances within and across training scenes. Such an intra- and inter-scene instance querying strategy is further enhanced by an equivariance regularisation term, addressing not only the uniqueness but also the robustness of instance-query relations.

Equivariant Representation Learning.

Transformations play a critical role in learning expressive representations by transforming images as a means to reveal the intrinsic patterns from transformed visual structures [54]. Motivated by the concept of translation equivariance underlying the success of CNNs, numerous efforts (e.g., capsule nets [55,56], group equivariant convolutions [57], and harmonic networks [58]) investigate learning more powerful representations equivariant to generic types of transformations [59,60]. A representation $f$ is said to be equivariant with a transformation $g$ for input (say image) $I$ if $f(g(I)) \approx g(f(I))$. In other words, the output representation $f(I)$ transforms in the same manner (or, in a broad sense, a predictable manner) given the input transformation $g$. Many recent self-supervised learning methods [61–63] encourage the representations to be invariant under transformations, i.e., $f(g(I)) \approx f(I)$. As such, invariance can be viewed as a special case of equivariance [64] where the output representation $f(I)$ does not vary with the input transformation $g$.
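The distinction between equivariance and invariance can be illustrated with a small NumPy sketch. It is only a toy under stated assumptions, not the paper's implementation: `flip` plays the role of the transformation $g$, `local_feature` is a hypothetical dense representation $f$ (a wrap-around horizontal average), and `global_feature` is a hypothetical image-level representation (global average pooling).

```python
import numpy as np

def flip(x):
    """Transformation g: horizontal flip of an H x W array."""
    return x[:, ::-1]

def local_feature(img):
    """Toy dense representation f (hypothetical): average of each pixel
    with its left and right neighbours, using wrap-around borders."""
    return (np.roll(img, 1, axis=1) + img + np.roll(img, -1, axis=1)) / 3.0

def global_feature(img):
    """Toy image-level representation (hypothetical): global average pooling."""
    return img.mean()

rng = np.random.default_rng(0)
image = rng.random((4, 6))

# Equivariance: f(g(I)) matches g(f(I)) for the dense representation.
equivariant = np.allclose(local_feature(flip(image)), flip(local_feature(image)))

# Invariance: f(g(I)) matches f(I) for the pooled representation.
invariant = np.isclose(global_feature(flip(image)), global_feature(image))

# The dense representation is not invariant: flipping changes the map itself.
dense_invariant = np.allclose(local_feature(flip(image)), local_feature(image))

print(equivariant, invariant, dense_invariant)
```

The dense map moves with the input while the pooled scalar stays fixed, which is why a dense prediction task such as segmentation calls for equivariance rather than invariance.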

In our training framework, we fully explore the inherent transformation-equivariance nature of the instance segmentation task to pursue reliable, one-to-one correspondence between learnable queries and object instances. This is accomplished by promoting the equivariance of both query embeddings and feature representations with respect to spatial transformations, i.e., cropping or flipping of an input image should result in correspondingly changed feature representations, query embeddings, as well as instance mask predictions. Note that invariance is not suitable for the instance segmentation task, as it encourages the feature map (and segmentation mask) to not vary with the input transformation. Our algorithm is also in contrast to the common data augmentation strategy, in which the transformed images and annotations are used directly as additional individual training examples, without any constraint on the relation between the representations (and queries) produced from the original and transformed views. Our experimental results (see §4.3) also evidence the superiority of our transformation equivariance learning over transformation-based data augmentation.

3 Methodology

Next, we first formulate instance segmentation from a classical view of mask prediction and classification (§3.1). Then we describe our new training framework (§3.2) and implementation details (§3.3).

3.1 Problem Formulation

Instance segmentation seeks a partition of an observed image $I \in \mathbb{R}^{H \times W \times 3}$ into $K$ instance regions:

$$\{Y_k\}_{k=1}^{K} = \{(M_k, c_k)\}_{k=1}^{K}, \quad \text{where } M_k \in \{0,1\}^{H \times W},\ c_k \in \{1, \cdots, C\}. \tag{1}$$

Here the instances of interest are represented by a total of $K$ non-overlapping binary masks $\{M_k\}_{k=1}^{K}$ as well as corresponding class labels $\{c_k\}_{k=1}^{K}$ (e.g., table, chair, etc.). For a pixel $i \in I$, its counterpart value in the $k$-th groundtruth mask, i.e., $M_k(i)$, denotes whether $i$ belongs to instance $k$ ($1$) or not ($0$). Note that the number of instances, $K$, varies across different images. Existing mainstream (or, more precisely, most top-down and one-shot) solutions approach the task by decomposing the image $I$ into a fixed-size set of soft masks. In this setting, each mask is associated with a probability distribution over all the $C$ categories. The output can thus be represented as a set of $N$ mask-probability pairs:

$$\{\hat{Y}_n\}_{n=1}^{N} = \{(\hat{M}_n, \hat{p}_n)\}_{n=1}^{N}, \quad \text{where } \hat{M}_n \in [0,1]^{H \times W},\ \hat{p}_n \in \Delta^{C}. \tag{2}$$

Here $\Delta^{C}$ stands for the $C$-dimensional probability simplex. The size of the prediction set, $N$, is usually set as a constant and much larger than the typical number of object instances in an image. Hence the training objective penalizes the errors of both label prediction and mask estimation:

$$\mathcal{L}\big(\{\hat{Y}_n\}_{n=1}^{N}, \{Y_k\}_{k=1}^{K}\big) = \sum_{n=1}^{N} \mathcal{L}_{cls}(\hat{p}_n, c_{\sigma(n)}) + \mathcal{L}_{mask}(\hat{M}_n, M_{\sigma(n)}), \tag{3}$$

where $\sigma$ refers to the matching between the prediction and groundtruth sets (established by certain rules [6,53]). The classification loss $\mathcal{L}_{cls}$ is typically the cross-entropy loss or focal loss [65]; the mask prediction loss $\mathcal{L}_{mask}$ can be the cross-entropy loss in [40], dice loss [66] in [34,39], or focal loss in [53]. While many approaches supplement $\mathcal{L}$ with various extra losses (e.g., bounding box loss [6,32,34,53], ranking loss [67], semantic segmentation loss [13,32]), later we will show that our cross-scene training scheme is fundamentally different from (yet complementary to) the current scene-wise training paradigm.

Figure 2: Overview of our new training framework for query-based instance segmentation. Rather than the current intra-scene training paradigm, our framework addresses inter-scene instance discrimination and transformation equivariance for discriminative instance query embedding learning (see §3.2). To improve readability, for image $I$, we only plot one extra image $I'$ for cross-scene training. (Diagram labels: dense feature extractor $f$; query creator; flipping; dataset; Intra-Scene Instance Discrimination (standard), $\mathcal{L}_{intra\_mask}$, Eq. 5; Inter-Scene Instance Discrimination (ours), $\mathcal{L}_{inter\_mask}$, Eq. 5; Transformation Equivariance (ours), $\mathcal{L}_{equi}$, Eq. 8.)
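To make the objective in Eq. 3 concrete, here is a minimal NumPy sketch under several assumptions that are not taken from the paper: $\mathcal{L}_{cls}$ is plain cross-entropy, $\mathcal{L}_{mask}$ is a dice loss (one of the options listed above), the matching $\sigma$ is supplied as an index array rather than computed, class labels are 0-indexed, and unmatched predictions (marked `-1`) are simply skipped. The names `cross_entropy`, `dice_loss`, and `set_loss` are hypothetical.

```python
import numpy as np

def cross_entropy(p, c):
    """L_cls: negative log-probability of ground-truth class c under p."""
    return -np.log(p[c] + 1e-12)

def dice_loss(m_pred, m_gt, eps=1e-6):
    """L_mask as a dice loss between a soft mask and a binary mask."""
    inter = (m_pred * m_gt).sum()
    return 1.0 - (2.0 * inter + eps) / (m_pred.sum() + m_gt.sum() + eps)

def set_loss(pred_masks, pred_probs, gt_masks, gt_labels, sigma):
    """Sum of L_cls + L_mask over matched predictions (cf. Eq. 3).

    sigma[n] is the ground-truth index matched to prediction n, or -1 if
    prediction n is unmatched (skipped here, which is a simplification).
    """
    total = 0.0
    for n, k in enumerate(sigma):
        if k < 0:
            continue
        total += cross_entropy(pred_probs[n], gt_labels[k])
        total += dice_loss(pred_masks[n], gt_masks[k])
    return total

# One ground-truth instance (K = 1), two predictions (N = 2), three classes.
gt_masks = np.zeros((1, 4, 4))
gt_masks[0, :2, :2] = 1.0
gt_labels = np.array([1])
sigma = np.array([0, -1])

# A perfect first prediction versus an uninformative one.
good_masks = np.stack([gt_masks[0], np.zeros((4, 4))])
good_probs = np.array([[0.0, 1.0, 0.0], [1 / 3, 1 / 3, 1 / 3]])
poor_masks = np.full((2, 4, 4), 0.5)
poor_probs = np.full((2, 3), 1 / 3)

loss_good = set_loss(good_masks, good_probs, gt_masks, gt_labels, sigma)
loss_poor = set_loss(poor_masks, poor_probs, gt_masks, gt_labels, sigma)
print(loss_good, loss_poor)
```

Every term in this sketch compares a prediction only with groundtruth from the same image, which is the scene-wise training setting the text refers to.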

3.2 Equivariant Learning with Intra- and Inter-Scene Instance Uniqueness

Query-based Instance Segmentation.

As clearly indicated by Eq. 2, the prediction masks $\{\hat{M}_n\}_{n=1}^{N}$ are the means of separating instances at the pixel level. Current top-performing instance segmenters [18,34,39–43] typically generate mask predictions in a query-based fashion (see the middle part of Fig. 2). Let $f$ be a dense feature extractor (e.g., an encoder-decoder fully convolutional network [68]) that produces a $D$-dimensional dense embedding $\boldsymbol{I}$ for image $I$, i.e., $\boldsymbol{I} = f(I) \in \mathbb{R}^{H \times W \times D}$. Then a query creator $h$ is adopted to produce a set of $N$ instance-aware embedding vectors $\{q_n \in \mathbb{R}^{d}\}_{n=1}^{N}$, which are used to query the image representation $\boldsymbol{I}$ for instance mask decoding:

$$\{\hat{M}_n\}_{n=1}^{N} = \{\langle q_n, \boldsymbol{I} \rangle\}_{n=1}^{N}, \quad \text{where } \{q_n\}_{n=1}^{N} = h(\boldsymbol{I}_{\Downarrow}). \tag{4}$$

Here $\langle \cdot, \cdot \rangle$ is a certain similarity measure performed pixel-wise, and $\boldsymbol{I}_{\Downarrow}$ (typically) refers to a low-resolution feature representation of image $I$. Note that position information is integrated into either or both of $\boldsymbol{I}$ and $\boldsymbol{I}_{\Downarrow}$, to make the model location-sensitive. The query creator $h$ is implemented as a dynamic network [34,39], or a Transformer decoder [40,42]. For a dynamic network based $h$, it predicts convolution filters $\{q_n\}_{n=1}^{N}$ dynamically conditioned on the input $\boldsymbol{I}_{\Downarrow}$, and hence $\langle \cdot, \cdot \rangle$ refers to convolution ($d \neq D$). For a Transformer decoder based $h$, it additionally leverages a set of learnable positional embeddings (omitted for brevity) to gather instance-related context from $\boldsymbol{I}_{\Downarrow}$; the collected context is stored in $\{q_n\}_{n=1}^{N}$ and $\langle \cdot, \cdot \rangle$ is computed as a dot product for instance mask decoding ($d = D$).
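The dot-product case of Eq. 4 ($d = D$) can be sketched in a few lines of NumPy. This is a toy illustration rather than any particular model's decoder: the function name `decode_masks`, the hand-built features and queries, and the sigmoid used to squash similarities into $[0,1]$ are all assumptions made for the example.

```python
import numpy as np

def decode_masks(queries, features):
    """Query a dense feature map with N query vectors.

    queries:  (N, D) instance-aware embeddings q_n
    features: (H, W, D) dense image embedding
    returns:  (N, H, W) soft masks in [0, 1]
    """
    # Pixel-wise dot product between each query and each pixel embedding.
    logits = np.einsum("nd,hwd->nhw", queries, features)
    # Sigmoid is an assumed choice for mapping similarities to [0, 1].
    return 1.0 / (1.0 + np.exp(-logits))

# Toy 4 x 4 feature map: left half carries channel 0, right half channel 1.
features = np.zeros((4, 4, 2))
features[:, :2, 0] = 1.0
features[:, 2:, 1] = 1.0

# Two hand-written queries, one per region.
queries = np.array([[8.0, -8.0], [-8.0, 8.0]])

masks = decode_masks(queries, features)
print(masks.shape)
print((masks[0] > 0.5).astype(int))
```

Each query fires only on the pixels whose embedding it aligns with, so the quality of the decoded masks depends directly on how well the queries separate instances.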

Eq. 4 shows that query-based segmenters in essence learn $N$ compact descriptors $\{q_n\}_{n=1}^{N}$ to grasp critical characteristics (e.g., appearance and location) of instances of potential interest, and use these instance descriptors as queries to retrieve corresponding pixels from $\boldsymbol{I}$. It is thus reasonable to assume that the discriminative ability of the learned query embeddings is crucial for the performance of query-based methods. Viewed in this light, a question naturally arises: how to learn discriminative instance query embeddings? Yet, this fundamental question has been largely ignored in the literature so far. To respond to this question, we exploit two crucial properties of instance-query matching, namely uniqueness and robustness. This is achieved by addressing dataset-level uniqueness and transformation equivariance during the learning of query-based segmenters, eventually leading to a powerful training scheme.

Learning with Intra- and Inter-Scene Instance Uniqueness.

If we closely scrutinize the current de facto training regime (cf. Eq. 3) and the work mode of query-based segmenters (cf. Eq. 4), we can find: the mask prediction loss $\mathcal{L}_{mask}$ forces each query $q_n$ to match the pixels $i \in I$ of its counterpart instance $k = \sigma(n)$, i.e., $M_{\sigma(n)}(i) = 1$, and mismatch the pixels $i' \in I$ of other instances $k \neq \sigma(n)$, i.e.,
