We conduct extensive experiments on two challenging datasets (i.e., COCO-Stuff-10K [3] and
ADE20K [50]) to demonstrate the effectiveness of our method. We summarize our contributions as
follows:
1) We propose a dual similarity transfer framework named SimFormer for weak-shot semantic
segmentation, in which MaskFormer lays the foundation for proposal-pixel similarity transfer.
2) We propose pixel-pixel similarity transfer, which learns pixel-pixel semantic similarity from base
classes and distills such class-agnostic similarity to the segmentation results of novel classes. We also
propose a complementary loss to facilitate the mask learning of novel classes.
3) Extensive experiments on the challenging COCO-Stuff-10K [3] and ADE20K [50] datasets
demonstrate the practicality of our scenario and the effectiveness of our method.
2 Related Works
Weakly-supervised Semantic Segmentation (WSSS). Because annotating pixel-level masks is
expensive, WSSS [32, 11, 34, 20] relies only on image-level labels to train the segmentation
model and has attracted increasing attention. Most WSSS methods [4, 1, 40] first train a
classifier to obtain class activation maps (CAMs) [49], derive pseudo masks from them, and then
use the pseudo masks to train a standard segmentation model. For example, SEC [18] proposed the
principle of “seed, expand, and constrain”, which has had a great impact on WSSS. Following a
similar pipeline, some works [22, 4, 40] focus on enhancing the seed, while others [17, 39, 46]
focus on improving the expansion strategy. Although WSSS has achieved great success, the
expanded CAM struggles to cover the complete semantic region because informative pixel-level
annotations are lacking. Fortunately, in the problem we focus on, such information can be
derived from an off-the-shelf dataset and transferred to facilitate the learning of novel
classes that have only image-level labels.
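The first two steps of the pipeline described above (classifier features to CAM, CAM to pseudo mask) can be sketched as follows. This is a minimal illustration, not the procedure of any specific cited method: the CAM is the weighted sum of the final feature maps using the classifier weights of one class, as in [49], while the min-max normalisation, the fixed threshold, and the function names `class_activation_map` and `pseudo_mask` are assumptions made for this example.

```python
def class_activation_map(features, class_weights):
    """Weighted sum of feature maps for a single class.

    features: list of K feature maps, each an H x W nested list.
    class_weights: list of K classifier weights for the class of interest.
    Returns an H x W nested list with cam[y][x] = sum_k w_k * f_k[y][x].
    """
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(class_weights, features):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def pseudo_mask(cam, threshold=0.5):
    """Min-max normalise the CAM and threshold it into a binary mask.

    The normalisation and the threshold value are illustrative assumptions.
    """
    flat = [value for row in cam for value in row]
    low, high = min(flat), max(flat)
    scale = (high - low) or 1.0  # avoid division by zero for a constant map
    return [[1 if (value - low) / scale >= threshold else 0 for value in row]
            for row in cam]


# Toy input: two 2 x 2 feature maps and the classifier weights of one class.
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
weights = [2.0, 1.0]
cam = class_activation_map(features, weights)
mask = pseudo_mask(cam, threshold=0.6)
```

With a higher threshold only the most discriminative location survives, which mirrors the incomplete coverage of CAM-derived seeds noted above.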
Weak-shot Learning. Reducing annotation cost is a practical and widespread demand across
applications of deep learning. Recently, weak-shot learning, i.e., learning weakly supervised
novel classes with the support of strongly supervised base classes, has been explored with
promising results in image classification [5], object detection [26, 48, 23], semantic
segmentation [51], instance segmentation [16, 19, 2], and other tasks [35]. In weak-shot
semantic segmentation [51], the problem is to learn to segment novel classes that have only
image-level labels, with the support of base classes that have pixel-level labels. Concerning
the task setting, as mentioned above, RETAB [51] further assumes that the off-the-shelf dataset
contains no novel classes and that the background class is annotated with pixel-level masks,
while the setting in this paper is more succinct and practical. Concerning the technical method,
RETAB [51] follows the WSSS framework, which suffers from a complex and tedious multi-stage
pipeline, i.e., training a classifier, deriving CAMs, expanding them to pseudo-labels, and
re-training. In contrast, we build our framework on MaskFormer [8] to perform dual similarity
transfer, which can achieve satisfactory performance in a single stage without re-training.
Similarity Transfer. Similarity transfer is an effective technique that has been widely applied
in various transfer learning tasks [7, 36, 5]. Specifically, semantic similarity (whether two
inputs belong to the same class) is class-agnostic and thus transferable across classes. To name
a few examples, Guo et al. [13] transferred class-class similarity and sample-sample similarity
across domains in active learning. CCN [15] proposed to learn semantic similarity between image
pairs, which is robust in both cross-domain and cross-task transfer learning. PrototypeNet [33]
proposed to learn and transfer the similarity between images and prototypes across categories
for both few-shot and zero-shot classification. In this paper, we propose to transfer
proposal-pixel similarity and pixel-pixel similarity for weak-shot semantic segmentation. Both
are forms of semantic similarity and are therefore highly transferable across classes.
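To make the class-agnostic property concrete, the sketch below builds pairwise "same class or not" targets from per-pixel labels and scores a predicted similarity matrix against them with binary cross-entropy. It is only an illustration under stated assumptions: the function names `pairwise_same_class` and `similarity_bce`, and the choice of binary cross-entropy, are hypothetical and are not taken from the method described in this paper.

```python
import math


def pairwise_same_class(labels):
    """Return an N x N 0/1 matrix: entry (i, j) is 1 iff labels[i] == labels[j].

    labels: flat list of per-pixel class ids (any hashable values).
    """
    return [[1 if a == b else 0 for b in labels] for a in labels]


def similarity_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted similarities in [0, 1]
    and 0/1 targets. Illustrative loss choice, assumed for this example."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log finite
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


# The targets depend only on which pixels share a class, not on class identity:
# renaming the classes leaves the target matrix unchanged.
base_targets = pairwise_same_class([0, 0, 1])
renamed_targets = pairwise_same_class([7, 7, 3])

perfect = [[float(t) for t in row] for row in base_targets]
uninformative = [[0.5] * 3 for _ in range(3)]
loss_perfect = similarity_bce(perfect, base_targets)
loss_uninformative = similarity_bce(uninformative, base_targets)
```

Because the targets are invariant to class identity, a similarity predictor supervised on base-class pixels poses the same question ("do these two inputs share a class?") that it must answer for novel-class pixels.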
3 Methodology
3.1 Problem Definition
In our weak-shot semantic segmentation, we are given a standard segmentation dataset annotated
for base classes $\mathcal{C}^b$, and we would like to further segment another set of novel
classes $\mathcal{C}^n$ ignored in the off-the-shelf dataset, where
$\mathcal{C}^b \cap \mathcal{C}^n = \emptyset$. We assume that we have image-level labels for
$\mathcal{C}^n$, which are much cheaper and more convenient to obtain than pixel-level masks.
In summary, for each