Weak-shot Semantic Segmentation
via Dual Similarity Transfer
Junjie Chen1, Li Niu1, Siyuan Zhou1, Jianlou Si2, Chen Qian2, Liqing Zhang1
1The MoE Key Lab of AI, CSE department, Shanghai Jiao Tong University
2SenseTime Research, SenseTime
{chen.bys, ustcnewly, ssluvble}@sjtu.edu.cn
{sijianlou,qianchen}@sensetime.com,zhang-lq@cs.sjtu.edu.cn
Abstract
Semantic segmentation is an important and prevalent task, but severely suffers from
the high cost of pixel-level annotations when extending to more classes in wider
applications. To this end, we focus on the problem named weak-shot semantic
segmentation, where the novel classes are learnt from cheaper image-level labels
with the support of base classes having off-the-shelf pixel-level labels. To tackle
this problem, we propose SimFormer, which performs dual similarity transfer upon
MaskFormer. Specifically, MaskFormer disentangles the semantic segmentation
task into two sub-tasks: proposal classification and proposal segmentation for each
proposal. Proposal segmentation allows proposal-pixel similarity transfer from
base classes to novel classes, which enables the mask learning of novel classes. We
also learn pixel-pixel similarity from base classes and distill such class-agnostic
semantic similarity to the semantic masks of novel classes, which regularizes
the segmentation model with pixel-level semantic relationship across images. In
addition, we propose a complementary loss to facilitate the learning of novel
classes. Comprehensive experiments on the challenging COCO-Stuff-10K and
ADE20K datasets demonstrate the effectiveness of our method. Codes are available
at https://github.com/bcmi/SimFormer-Weak-Shot-Semantic-Segmentation.
1 Introduction
Semantic segmentation [27, 6, 45, 47] is a fundamental and active task in computer vision, which
aims to produce a class label for each pixel in an image. Training modern semantic segmentation models
usually requires pixel-level labels for each class in each image, and thus costly datasets have been
constructed for learning. As a practical task, there is a growing requirement to segment more classes
in wider applications. However, annotating dense pixel-level masks is too expensive to cover the
continuously increasing classes, which dramatically limits the applications of semantic segmentation.
In practice, we have some already annotated semantic segmentation datasets, e.g., COCO-80 [25].
These costly datasets focus on certain classes (e.g., the 80 classes in [25]) and ignore other classes
of no interest (e.g., classes beyond those 80 in [25]). Over time, there may be demand
to segment more classes, e.g., extending to COCO-171 [3]. We refer to the off-the-shelf classes
having already annotated masks as base classes, and those newly covered classes as novel classes. As
illustrated in Fig. 1 (a), one sample in COCO-80 focuses on segmenting cat, bed, etc., but ignores
the lamp. The existing scenario for expansion in [3] is to annotate dense pixel-level masks for the novel
classes again (i.e., annotating the mask for lamp in Fig. 1 (a)), which is too expensive to scale up. To
this end, we focus on a cheaper yet effective scenario in this paper, where we only need to annotate
image-level labels for the novel classes.
Corresponding author
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.02270v1 [cs.CV] 5 Oct 2022
Figure 1: In (a), off-the-shelf datasets focus on certain base classes, i.e., the colored masks in (a.2),
while leaving novel classes ignored, i.e., the regions in black. We would like to segment novel classes
by virtue of cheaper image-level labels. In (b), our method obtains proposal embeddings and pixel
embeddings for each image based on MaskFormer [8]. Then, the transferred proposal-pixel similarity
could produce masks for proposal embeddings of novel classes, and the transferred pixel-pixel
similarity could provide semantic relation regularization for pixel pairs across images. The solid
(resp., dotted) arrow indicates similarity (resp., dissimilarity). Best viewed in colors.
We refer to our learning scenario as weak-shot semantic segmentation, which focuses on further
segmenting novel classes by virtue of cheaper image-level labels with the support of base classes
having pixel-level masks. Specifically, given a standard semantic segmentation dataset annotated
only for base classes (the novel classes hide in the ignored regions), we assume that the image-level
labels are available for novel classes in each image. We would like to segment all the base classes
and novel classes in the test stage. A similar scenario has been explored in RETAB [51], but they 1)
further assume that the off-the-shelf dataset contains no novel classes; 2) assume that the background
class (in the PASCAL VOC dataset [10]) consisting of all ignored classes is annotated with a pixel-level
mask. However, these two assumptions are hard to satisfy in real-world applications, because the
ignored region in off-the-shelf dataset may contain novel classes and we actually cannot have the
masks of background class before having the masks of novel classes. In contrast, our scenario is
similar in spirit but more succinct and practical, which is well demonstrated by the aforementioned
instance of expanding from COCO-80 to COCO-171. Besides, RETAB [51] only conducted experiments
on PASCAL VOC, while we focus on more challenging datasets (i.e., COCO-Stuff-10K [3] and
ADE20K [50]), without a background class annotated with masks.
In weak-shot semantic segmentation, the key problem is how to learn dense pixel-level masks from
the image-level labels of novel classes with the support of pixel-level masks of base classes. Our
proposed solution is SimFormer, which performs dual similarity transfer upon MaskFormer [8], as
shown in Fig. 1 (b). We choose MaskFormer [8] because it disentangles the segmentation task into two
sub-tasks: proposal classification and proposal segmentation. Specifically, MaskFormer [8] produces
some proposal embeddings (aka per-segment embeddings in [8]) from shared query embeddings for the
input image, each of which is assigned to be responsible for one class present in image (allowing
empty assignment). Then MaskFormer performs proposal classification sub-task and proposal
segmentation sub-task for each proposal embedding according to their assigned classes.
In our setting and framework, the proposal embeddings of base classes are supervised in both sub-
tasks using class and mask annotations, while the proposal embeddings of novel classes are supervised
only in classification sub-task due to lacking mask annotations. The proposal segmentation sub-
task is essentially learning proposal-pixel similarity. Such similarity belongs to pair-wise semantic
similarity, which is class-agnostic and transferable across different categories [7, 36, 5]. Therefore,
proposal segmentation for novel classes could be accomplished based on the proposal-pixel similarity
transferred from base classes. To further improve the quality of binary masks of novel classes,
we additionally propose pixel-pixel similarity transfer, which learns the pixel pair-wise semantic
similarity from base classes and distills such class-agnostic similarity into the produced masks of
novel classes. In this way, the model is supervised to produce segmentation results containing
semantic relationship for novel classes. Inspired by [38, 42, 37], we learn from and distill to cross-image
pixel pairs, which also introduces global context into the model, i.e., enhancing the pixel
semantic consistency across images. In addition, we propose a complementary loss, based on the
insight that the union set of pixels belonging to base classes is complementary to the union set of
pixels belonging to novel classes or ignore class in each image, which provides supervision for the
union of masks of novel classes.
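To make the complementary insight concrete, below is a minimal NumPy sketch of how such a loss could be formed; this is our hypothetical illustration (function and variable names are ours, not the authors' implementation), and for simplicity it lumps the ignore class into the target region:

```python
import numpy as np

def complementary_loss(novel_masks, base_region, eps=1e-7):
    """Hypothetical sketch of a complementary loss.

    novel_masks: (N_novel, H, W) predicted soft masks in [0, 1] for novel proposals.
    base_region: (H, W) binary map, 1 where the pixel carries a base-class label.
    The per-pixel union (max) of the novel masks is pushed toward the
    complement of the base-class region via binary cross-entropy.
    """
    union = novel_masks.max(axis=0)            # per-pixel union of novel masks
    target = 1.0 - base_region                 # complement of the base region
    union = np.clip(union, eps, 1.0 - eps)     # avoid log(0)
    bce = -(target * np.log(union) + (1.0 - target) * np.log(1.0 - union))
    return bce.mean()

# Toy example: two novel proposals on a 2x2 image; the bottom row is base pixels.
novel_masks = np.array([[[0.9, 0.1], [0.1, 0.1]],
                        [[0.1, 0.8], [0.1, 0.1]]])
base_region = np.array([[0.0, 0.0], [1.0, 1.0]])
loss = complementary_loss(novel_masks, base_region)
```

Here a well-aligned prediction (novel masks covering exactly the non-base pixels) yields a small loss, while a prediction that overlaps the base region is penalized.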
We conduct extensive experiments on two challenging datasets (i.e., COCO-Stuff-10K [3] and
ADE20K [50]) to demonstrate the effectiveness of our method. We summarize our contributions as follows:
1) We propose a dual similarity transfer framework named SimFormer for weak-shot semantic
segmentation, in which MaskFormer lays the foundation for proposal-pixel similarity transfer.
2) We propose pixel-pixel similarity transfer, which learns pixel-pixel semantic similarity from base
classes and distills such class-agnostic similarity to the segmentation results of novel classes. We also
propose a complementary loss to facilitate the mask learning of novel classes.
3) Extensive experiments on the challenging COCO-Stuff-10K [3] and ADE20K [50] datasets
demonstrate the practicality of our scenario and the effectiveness of our method.
2 Related Works
Weakly-supervised Semantic Segmentation (WSSS).
Considering the expensive cost of annotating pixel-level masks, WSSS [32, 11, 34, 20] relies only
on image-level labels to train the segmentation model, which has attracted increasing attention.
The majority of WSSS methods [4, 1, 40]
firstly train a classifier to obtain a class activation map (CAM) [49] to derive pseudo masks, which are
then used to train a standard segmentation model. For example, SEC [18] proposed the principle
of “seed, expand, and constrain”, which has had a great impact on WSSS. Under a similar pipeline, some
works [22, 4, 40] focus on enhancing the seed, while other works [17, 39, 46] pay attention to
improving the expanding strategy. Although WSSS has achieved great success, the expanded CAM
struggles to cover the intact semantic region due to the lack of informative pixel-level annotations.
Fortunately, in our focused problem, such information could be derived from an off-the-shelf dataset
and transferred to facilitate the learning of novel classes with only image-level labels.
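For reference, the CAM mechanism [49] that seeds most WSSS pipelines amounts to a class-weight-weighted sum of the classifier's final feature maps. The following is a generic NumPy sketch of that computation with random stand-in features and weights (not the code of any cited method):

```python
import numpy as np

rng = np.random.default_rng(1)
D, H, W, K = 8, 7, 7, 3               # feature channels, spatial size, classes

feats = rng.random(size=(D, H, W))    # final conv feature maps of a classifier
w = rng.normal(size=(K, D))           # classification weights, one row per class

# CAM for class c: weighted sum of feature maps with that class's weights.
cam = np.einsum("kd,dhw->khw", w, feats)          # (K, H, W)

# Normalize each map to [0, 1] and threshold to get a crude pseudo-mask seed,
# which WSSS pipelines then expand and refine.
cam -= cam.min(axis=(1, 2), keepdims=True)
cam /= cam.max(axis=(1, 2), keepdims=True) + 1e-7
seed = cam > 0.6                                   # boolean (K, H, W) seeds
```

The thresholded seed is exactly the kind of sparse, incomplete region that the "expand" step of SEC-style pipelines tries to grow into the intact object.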
Weak-shot Learning.
Reducing the annotation cost is a practical and extensive demand for various
applications of deep learning. Recently, weak-shot learning, i.e., learning weakly supervised novel
classes with the support of strongly supervised base classes, has been explored in image classification
[5], object detection [26, 48, 23], semantic segmentation [51], instance segmentation [16, 19, 2], and
so on [35], which has achieved promising success. In weak-shot semantic segmentation [51], the
problem is learning to segment novel classes with only image-level labels with the support of base
classes having pixel-level labels. Concerning the task setting, as aforementioned, RETAB [51] further
assumes that the off-the-shelf dataset has no novel classes and that the background class is annotated
with a pixel-level mask, while the setting in this paper is more succinct and practical. Concerning the
technical method, RETAB [51] follows the framework of WSSS, which suffers from a complex and
tedious multi-stage pipeline, i.e., training a classifier, deriving CAM, expanding to pseudo-labels, and
re-training. In contrast, we build our framework on MaskFormer [8] to perform dual similarity
transfer, which achieves satisfactory performance in a single stage without re-training.
Similarity Transfer.
As an effective method, similarity transfer has been widely applied in various
transfer learning tasks [7, 36, 5]. Specifically, semantic similarity (whether two inputs belong
to the same class) is class-agnostic, and thus transferable across classes. To name a few, Guo et
al. [13] transferred class-class similarity and sample-sample similarity across domains in active
learning. CCN [15] proposed to learn the semantic similarity between image pairs, which is robust in
both cross-domain and cross-task transfer learning. PrototypeNet [33] proposed to learn and transfer
the similarity between image and prototype across categories for both few-shot classification and
zero-shot classification. In this paper, we propose to transfer proposal-pixel similarity and pixel-pixel
similarity for weak-shot semantic segmentation. These two types of similarities both belong to
semantic similarity, which are highly transferable across classes.
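Because pair-wise semantic similarity only asks "same class or not" and never "which class", its supervision can be read directly off base-class masks. A minimal NumPy sketch of how such binary pair labels are typically constructed (our illustrative names, not code from any cited work):

```python
import numpy as np

def pair_similarity_labels(gt, pairs):
    """Binary similarity labels for sampled pixel pairs.

    gt: (H, W) integer ground-truth class map (base classes only).
    pairs: (P, 4) array of pixel pairs, each row (h1, w1, h2, w2).
    Returns a (P,) array: 1 if the two pixels share a class, else 0.
    The label carries no class identity, so it is class-agnostic and can
    transfer from base classes to novel classes.
    """
    c1 = gt[pairs[:, 0], pairs[:, 1]]
    c2 = gt[pairs[:, 2], pairs[:, 3]]
    return (c1 == c2).astype(np.int64)

gt = np.array([[0, 0, 1],
               [2, 1, 1]])
pairs = np.array([[0, 0, 0, 1],   # both class 0 -> similar
                  [0, 2, 1, 1],   # both class 1 -> similar
                  [0, 0, 1, 0]])  # class 0 vs. class 2 -> dissimilar
labels = pair_similarity_labels(gt, pairs)
```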
3 Methodology
3.1 Problem Definition
In our weak-shot semantic segmentation, we are given a standard segmentation dataset annotated
for base classes Cb, and we would like to further segment another set of novel classes Cn ignored in
the off-the-shelf dataset, where Cb ∩ Cn = ∅. We assume that we have the image-level labels for Cn,
which are rather cheaper and more convenient to obtain than pixel-level masks. In summary, for each
training image, we have image-level labels for both Cb and Cn, and we have pixel-level masks only
for Cb. In the test stage, we need to predict pixel-level masks for both Cb and Cn.
Figure 2: The detailed illustration of our framework. As in MaskFormer, we produce N proposal
embeddings in each image. On the one hand, each proposal embedding is fed to the classifier, where
both base and novel classes are supervised by classification loss (ClsLoss). On the other hand, the
similarities between each proposal embedding and pixel embeddings are computed to produce binary
masks, where only base masks (in red) are supervised by the GT mask (MaskLoss) while novel masks (in
blue) are supervised by the complementary loss (CompLoss). We sample some pixels and construct pixel
pairs across two images. The concatenated pixel embeddings are fed to SimNet, where the base pixel
pairs (in red) are used to train SimNet with a similarity loss (SimLoss) and the novel pixel pairs (in blue)
are used for similarity distillation (DistLoss).
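To make the annotation protocol concrete, one training sample under this setting could be represented as below; this is a minimal sketch with hypothetical field names (base_mask, novel_labels, etc. are ours, not from the paper):

```python
import numpy as np

# Hypothetical representation of one weak-shot training image.
# Base classes come with full pixel-level masks; novel classes only with
# image-level labels, hiding inside the ignored region of the base mask.
IGNORE = 255  # pixels not covered by any base-class annotation

sample = {
    # (H, W) map: base-class id per pixel, IGNORE where no base mask exists.
    "base_mask": np.array([[0, 0, IGNORE],
                           [1, 1, IGNORE]]),
    "base_labels": {0, 1},   # image-level labels for base classes
    "novel_labels": {7},     # image-level labels for novel classes (no masks)
}

# At test time the model must predict a full (H, W) map over base + novel
# classes, even though no novel-class mask was ever provided for training.
has_novel_mask = "novel_mask" in sample
```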
3.2 Review of MaskFormer
In this section, we briefly review MaskFormer [8], which lays the foundation of our framework.
The general pipeline of MaskFormer disentangles the semantic segmentation task into two sub-tasks:
proposal classification and proposal segmentation. Specifically, MaskFormer maintains N learnable
C-dim query embeddings Q ∈ R^(C×N) shared for all images, as shown in the upper part of Fig. 2.
When an image is input, the backbone features are extracted via a backbone. Then, the N query
embeddings attend to the backbone features to produce proposal embeddings Eprop ∈ R^(C×N) via a
transformer decoder. For each proposal embedding, a bipartite matching algorithm assigns a class
present in the input image to it, considering the classification loss and mask loss as the cost for
assignment. For the proposal classification sub-task, each proposal embedding is fed to a simple classifier
to yield class probability predictions Y ∈ R^((K+1)×N) over K semantic classes and 1 ignore class. For
the proposal segmentation sub-task, pixel embeddings Epix ∈ R^(C×H×W) are extracted from the backbone
features via a pixel decoder. Afterwards, the proposal embeddings are processed by several FC layers,
and their dot-products with the pixel embeddings, followed by a sigmoid, are computed to produce binary
masks M ∈ R^(N×H×W), i.e., M[i, h, w] = sigmoid(Eprop[:, i] · Epix[:, h, w]). In the training stage,
the two sub-tasks for each proposal embedding are supervised by the label and mask of the assigned
class (the mask loss of the ignore class is eliminated). In the test stage, the semantic segmentation result
for each class at pixel (h, w) is obtained by summarizing all the masks weighted by the class scores,
i.e., argmax_{c ∈ {1,...,K}} Σ_{i=1}^{N} Y[c, i] · M[i, h, w]. For more details, please refer to MaskFormer [8].
3.3 Proposal-Pixel Similarity Transfer on MaskFormer
In our setting, we have only image-level labels for novel classes and we choose to produce pixel-level
masks via proposal-pixel similarity transfer based on MaskFormer [8]. For concise description, we