
This work brings a paradigm shift in training query-based instance segmenters: it goes beyond the de
facto, within-scene training strategy by further considering the cross-scene level query embedding
separation of different instances – querying instances from the whole training dataset. The underlying
rationale is intuitive yet powerful: an advanced instance segmenter should be able to differentiate all
the instances of the entire dataset, rather than only the ones within single scenes. Concretely, in our
training framework, the queries are not only learned to fire on the pixels of their counterpart instances
in the current training image, but also forced to mismatch the pixels in other training images. By virtue
of intra- and inter-scene instance disambiguation, our framework forces the query-based segmenters
to learn more discriminative query vectors capable of uniquely identifying the corresponding
instances even at the dataset level. To further facilitate the establishment of a robust, one-to-one
relation between queries and instances, we complement our training framework with a transformation
equivariance constraint, accommodating the equivariance property of the instance segmentation task to
geometric transformations. For example, if we crop or flip the input image, we expect the image
(instance) features and query embeddings to change accordingly, so as to appropriately reflect the
variation of instance patterns (e.g., scale, position, shape, etc.) caused by the input transformation.

Figure 1: Our training algorithm yields solid performance gains over state-of-the-art query-based
models [34,39–41] without architectural modification and inference speed delay. (Plot of COCO Mask AP
versus Inference Speed (ms) for CondInst [34], SOLOv2 [39], SOTR [41], and Mask2Former [40], each as
base model and with ours, across ResNet-50, ResNet-101, Swin-S, Swin-B, and Swin-L backbones.)

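To make the equivariance property concrete, the following is a minimal sketch, not the constraint used in this work. It only checks final masks under a horizontal flip, whereas the constraint described above acts on features and query embeddings. The two toy "segmenters" and the `equivariance_gap` measure are hypothetical names introduced purely for illustration.

```python
def hflip(grid):
    """Horizontally flip a 2D grid given as a list of rows."""
    return [row[::-1] for row in grid]

def threshold_segmenter(image, thr=0.5):
    """Toy segmenter: a pixel is foreground if its intensity exceeds thr."""
    return [[1 if v > thr else 0 for v in row] for row in image]

def left_biased_segmenter(image, thr=0.5):
    """Toy segmenter that only fires in the left half of the image.
    It depends on absolute position, so it is not flip-equivariant."""
    half = len(image[0]) // 2
    return [[1 if (v > thr and x < half) else 0 for x, v in enumerate(row)]
            for row in image]

def equivariance_gap(segmenter, image, transform):
    """Fraction of pixels where segment(transform(x)) != transform(segment(x))."""
    a = segmenter(transform(image))
    b = transform(segmenter(image))
    diffs = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diffs / (len(image) * len(image[0]))

# Illustrative 2x4 image with a bright object on the left.
image = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.9, 0.2, 0.1],
]
gap_ok = equivariance_gap(threshold_segmenter, image, hflip)
gap_bad = equivariance_gap(left_biased_segmenter, image, hflip)
print(gap_ok, gap_bad)
```

A gap of zero means the prediction follows the input transformation exactly. A training-time constraint in this spirit would penalize a nonzero gap, here measured on masks only as a simplifying assumption.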
Exploring intra- and inter-scene instance uniqueness as well as transformation equivariance leads to a
general yet powerful training framework. Our algorithm, in principle, can be seamlessly incorporated
into the training process of existing query-based instance segmenters. For comprehensive evaluation,
we apply our algorithm to four representative query-based models (i.e., CondInst [34], SOLOv2 [39],
SOTR [41], and Mask2Former [40]) with various backbones (i.e., ResNet [45], Swin [46]). Experiments
on COCO [47] verify its effectiveness, i.e., +2.8–3.1, +2.9–3.2, +2.4–2.6, and +1.6–2.4 AP gains over
CondInst, SOLOv2, SOTR, and Mask2Former, respectively (see Fig. 1). Our algorithm also brings a
remarkable improvement, +2.7 AP, on the LVISv1 [48] dataset, on top of SOLOv2 [39]. These results are
particularly impressive considering that our training algorithm causes neither architectural change nor
extra computational load during model deployment.
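The intra- and inter-scene training idea can be illustrated with a small sketch. It rests on loudly stated assumptions: masks are decoded as a sigmoid over query-pixel dot products, and both terms use binary cross-entropy. Neither choice is confirmed by the text above, and all function names are hypothetical. The query is rewarded for firing on its own instance in the current image and penalized for firing on any pixel of another training image.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mask_probs(query, pixel_feats):
    """Dot each pixel embedding with the query and squash to (0, 1)."""
    return [sigmoid(sum(q * f for q, f in zip(query, feat)))
            for feat in pixel_feats]

def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def intra_inter_scene_loss(query, cur_feats, cur_mask, other_feats):
    """Assumed loss: match the instance mask in the current image (intra-scene)
    and predict an all-zero mask on another image (inter-scene)."""
    intra = bce(mask_probs(query, cur_feats), cur_mask)
    inter = bce(mask_probs(query, other_feats), [0] * len(other_feats))
    return intra + inter

# Illustrative data: one query, two pixels per image, 2-D embeddings.
query = [2.0, 0.0]
cur_feats = [[3.0, 0.0], [-3.0, 0.0]]          # pixel 0 belongs to the instance
cur_mask = [1, 0]
distinct_feats = [[-3.0, 1.0], [-3.0, -1.0]]   # other scene, dissimilar pixels
confusable_feats = [[3.0, 0.0], [-3.0, 0.0]]   # other scene, look-alike pixel

loss_distinct = intra_inter_scene_loss(query, cur_feats, cur_mask, distinct_feats)
loss_confusable = intra_inter_scene_loss(query, cur_feats, cur_mask, confusable_feats)
print(loss_distinct, loss_confusable)
```

In this toy setting, the loss stays small when the other scene contains no pixels resembling the queried instance and grows when a look-alike pixel appears. That is the pressure toward dataset-level discriminative queries described above.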
2 Related Work
This section summarizes the most relevant research on
instance segmentation and equivariant learning.
Instance Segmentation. With the renaissance of connectionism, remarkable progress has been made
in instance segmentation. Existing deep learning based solutions can be broadly classified into three
paradigms: top-down, bottom-up, and single-shot. Following the idea of ‘detect-then-segment’,
top-down methods [1–18] predict a bounding box for each object and then output an instance mask for
each box. Though effective, this type of method is complicated and dependent on prior detection results.
In contrast, bottom-up methods [20–27] adopt a ‘label-then-cluster’ strategy: learning per-pixel
embeddings and then grouping them into different instances. Albeit simple, this type of method relies on
the performance of post-processing and easily suffers from under- or over-segmentation.
Inspired by the advance of single-stage object detection [49,50], a few recent efforts approach instance
segmentation in a single-shot fashion, by coalescing detection and segmentation over pre-defined
anchor boxes [29–33], or directly predicting instance masks from feature maps [34–44]. This type of
method is well recognized and generally demonstrates a better speed-accuracy trade-off [51].
Despite the blossoming of diverse approaches, the vast majority of recent top-performing
algorithms [18,34,39–44] fall into one grand category – query-based models. The query-based methods
utilize compact, learnable embedding vectors to represent instances of interest and leverage them as
queries to decode masks from image features. Their triumph is founded on comprehensively encoding
instance-specific properties (e.g., location, appearance) into the query vectors, which significantly
increases prediction robustness. For instance, [34,39] exploit the technique of dynamic filter [52] to