Weak-shot Semantic Segmentation
via Dual Similarity Transfer
Junjie Chen1, Li Niu1, Siyuan Zhou1, Jianlou Si2, Chen Qian2, Liqing Zhang1
1The MoE Key Lab of AI, CSE department, Shanghai Jiao Tong University
2SenseTime Research, SenseTime
{chen.bys, ustcnewly, ssluvble}@sjtu.edu.cn
{sijianlou,qianchen}@sensetime.com,zhang-lq@cs.sjtu.edu.cn
Abstract
Semantic segmentation is an important and prevalent task, but severely suffers from
the high cost of pixel-level annotations when extending to more classes in wider
applications. To this end, we focus on the problem named weak-shot semantic
segmentation, where the novel classes are learnt from cheaper image-level labels
with the support of base classes having off-the-shelf pixel-level labels. To tackle
this problem, we propose SimFormer, which performs dual similarity transfer upon
MaskFormer. Specifically, MaskFormer disentangles the semantic segmentation
task into two sub-tasks: proposal classification and proposal segmentation for each
proposal. Proposal segmentation allows proposal-pixel similarity transfer from
base classes to novel classes, which enables the mask learning of novel classes. We
also learn pixel-pixel similarity from base classes and distill such class-agnostic
semantic similarity to the semantic masks of novel classes, which regularizes
the segmentation model with pixel-level semantic relationship across images. In
addition, we propose a complementary loss to facilitate the learning of novel
classes. Comprehensive experiments on the challenging COCO-Stuff-10K and
ADE20K datasets demonstrate the effectiveness of our method. Codes are available
at https://github.com/bcmi/SimFormer-Weak-Shot-Semantic-Segmentation.
1 Introduction
Semantic segmentation [27, 6, 45, 47] is a fundamental and active task in computer vision, which
aims to produce a class label for each pixel in an image. Training modern semantic segmentation models
usually requires pixel-level labels for each class in each image, and thus costly datasets have been
constructed for learning. As a practical task, there is a growing requirement to segment more classes
in wider applications. However, annotating dense pixel-level masks is too expensive to cover the
continuously increasing classes, which dramatically limits the applications of semantic segmentation.
In practice, we have some already annotated semantic segmentation datasets, e.g., COCO-80 [25].
These costly datasets focus on certain classes (e.g., the 80 classes in [25]) and ignore other classes
of no interest (e.g., classes beyond those 80 in [25]). Over time, there may be demand
to segment more classes, e.g., extending to COCO-171 [3]. We refer to the off-the-shelf classes
having already annotated masks as base classes, and those newly covered classes as novel classes. As
illustrated in Fig. 1 (a), one sample in COCO-80 focuses on segmenting cat, bed, etc., but ignores
the lamp. The existing scenario for expansion in [3] is to annotate dense pixel-level masks for the novel
classes again (i.e., annotating the mask for lamp in Fig. 1 (a)), which is too expensive to scale up. To
this end, we focus on a cheaper yet effective scenario in this paper, where we only need to annotate
image-level labels for the novel classes.
Corresponding author
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.02270v1 [cs.CV] 5 Oct 2022
Figure 1: In (a), off-the-shelf datasets focus on certain base classes, i.e., the colored masks in (a.2),
while leaving novel classes ignored, i.e., the regions in black. We would like to segment novel classes
by virtue of cheaper image-level labels. In (b), our method obtains proposal embeddings and pixel
embeddings for each image based on MaskFormer [8]. Then, the transferred proposal-pixel similarity
could produce masks for proposal embeddings of novel classes, and the transferred pixel-pixel
similarity could provide semantic relation regularization for pixel pairs across images. The solid
(resp., dotted) arrow indicates similarity (resp., dissimilarity). Best viewed in colors.
We refer to our learning scenario as weak-shot semantic segmentation, which focuses on further
segmenting novel classes by virtue of cheaper image-level labels with the support of base classes
having pixel-level masks. Specifically, given a standard semantic segmentation dataset annotated
only for base classes (the novel classes hide in the ignored regions), we assume that the image-level
labels are available for novel classes in each image. We would like to segment all the base classes
and novel classes in the test stage. A similar scenario has been explored in RETAB [51], but they 1)
further assume that the off-the-shelf dataset contains no novel classes; 2) assume that the background
class (in the PASCAL VOC dataset [10]) consisting of all ignored classes is annotated with a pixel-level
mask. However, these two assumptions are hard to satisfy in real-world applications, because the
ignored region in off-the-shelf dataset may contain novel classes and we actually cannot have the
masks of background class before having the masks of novel classes. In contrast, our scenario is
similar in spirit but more succinct and practical, which is well demonstrated by the aforementioned
instance of expanding from COCO-80 to COCO-171. Besides, RETAB [51] only conducted experiments
on PASCAL VOC, while we focus on more challenging datasets (i.e., COCO-Stuff-10K [3] and
ADE20K [50]), without a background class annotated with masks.
In weak-shot semantic segmentation, the key problem is how to learn dense pixel-level masks from
the image-level labels of novel classes with the support of pixel-level masks of base classes. Our
proposed solution is SimFormer, which performs dual similarity transfer upon MaskFormer [8], as
shown in Fig. 1 (b). We choose MaskFormer [8] because it disentangles the segmentation task into two
sub-tasks: proposal classification and proposal segmentation. Specifically, MaskFormer [8] produces
some proposal embeddings (aka per-segment embeddings in [8]) from shared query embeddings for the
input image, each of which is assigned to be responsible for one class present in image (allowing
empty assignment). Then MaskFormer performs proposal classification sub-task and proposal
segmentation sub-task for each proposal embedding according to their assigned classes.
In our setting and framework, the proposal embeddings of base classes are supervised in both sub-
tasks using class and mask annotations, while the proposal embeddings of novel classes are supervised
only in classification sub-task due to lacking mask annotations. The proposal segmentation sub-
task is essentially learning proposal-pixel similarity. Such similarity belongs to pair-wise semantic
similarity, which is class-agnostic and transferable across different categories [7, 36, 5]. Therefore,
proposal segmentation for novel classes could be accomplished based on the proposal-pixel similarity
transferred from base classes. To further improve the quality of binary masks of novel classes,
we additionally propose pixel-pixel similarity transfer, which learns the pixel pair-wise semantic
similarity from base classes and distills such class-agnostic similarity into the produced masks of
novel classes. In this way, the model is supervised to produce segmentation results containing
semantic relationship for novel classes. Inspired by [38, 42, 37], we learn from and distill to cross-image
pixel pairs, which also introduces global context into the model, i.e., enhancing the pixel
semantic consistency across images. In addition, we propose a complementary loss, based on the
insight that the union set of pixels belonging to base classes is complementary to the union set of
pixels belonging to novel classes or ignore class in each image, which provides supervision for the
union of masks of novel classes.
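To make the complementary insight concrete, below is a minimal NumPy sketch of how such a loss could be formed; this is our hypothetical illustration (function and variable names are ours, not the authors' implementation), and for simplicity it lumps the ignore class into the target region:

```python
import numpy as np

def complementary_loss(novel_masks, base_region, eps=1e-7):
    """Hypothetical sketch of a complementary loss.

    novel_masks: (N_novel, H, W) predicted soft masks in [0, 1] for novel proposals.
    base_region: (H, W) binary map, 1 where the pixel carries a base-class label.
    The per-pixel union (max) of the novel masks is pushed toward the
    complement of the base-class region via binary cross-entropy.
    """
    union = novel_masks.max(axis=0)            # per-pixel union of novel masks
    target = 1.0 - base_region                 # complement of the base region
    union = np.clip(union, eps, 1.0 - eps)     # avoid log(0)
    bce = -(target * np.log(union) + (1.0 - target) * np.log(1.0 - union))
    return bce.mean()

# Toy example: two novel proposals on a 2x2 image; the bottom row is base pixels.
novel_masks = np.array([[[0.9, 0.1], [0.1, 0.1]],
                        [[0.1, 0.8], [0.1, 0.1]]])
base_region = np.array([[0.0, 0.0], [1.0, 1.0]])
loss = complementary_loss(novel_masks, base_region)
```

Here a well-aligned prediction (novel masks covering exactly the non-base pixels) yields a small loss, while a prediction that overlaps the base region is penalized.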
We conduct extensive experiments on two challenging datasets (i.e., COCO-Stuff-10K [3] and
ADE20K [50]) to demonstrate the effectiveness of our method. We summarize our contributions as follows:
1) We propose a dual similarity transfer framework named SimFormer for weak-shot semantic
segmentation, in which MaskFormer lays the foundation for proposal-pixel similarity transfer.
2) We propose pixel-pixel similarity transfer, which learns pixel-pixel semantic similarity from base
classes and distills such class-agnostic similarity to the segmentation results of novel classes. We also
propose a complementary loss to facilitate the mask learning of novel classes.
3) Extensive experiments on the challenging COCO-Stuff-10K [3] and ADE20K [50] datasets
demonstrate the practicality of our scenario and the effectiveness of our method.
2 Related Works
Weakly-supervised Semantic Segmentation (WSSS).
Considering the expensive cost of annotating pixel-level masks, WSSS [32, 11, 34, 20] relies only
on image-level labels to train the segmentation model, which has attracted increasing attention.
The majority of WSSS methods [4, 1, 40]
firstly train a classifier to obtain a class activation map (CAM) [49] to derive pseudo masks, which are
then used to train a standard segmentation model. For example, SEC [18] proposed the principle
of “seed, expand, and constrain”, which has had a great impact on WSSS. Under a similar pipeline, some
works [22, 4, 40] focus on enhancing the seed, while other works [17, 39, 46] pay attention to
improving the expanding strategy. Although WSSS has achieved great success, the expanded CAM
struggles to cover the intact semantic region due to the lack of informative pixel-level annotations.
Fortunately, in our focused problem, such information could be derived from an off-the-shelf dataset
and transferred to facilitate the learning of novel classes with only image-level labels.
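For reference, the CAM mechanism [49] that seeds most WSSS pipelines amounts to a class-weight-weighted sum of the classifier's final feature maps. The following is a generic NumPy sketch of that computation with random stand-in features and weights (not the code of any cited method):

```python
import numpy as np

rng = np.random.default_rng(1)
D, H, W, K = 8, 7, 7, 3               # feature channels, spatial size, classes

feats = rng.random(size=(D, H, W))    # final conv feature maps of a classifier
w = rng.normal(size=(K, D))           # classification weights, one row per class

# CAM for class c: weighted sum of feature maps with that class's weights.
cam = np.einsum("kd,dhw->khw", w, feats)          # (K, H, W)

# Normalize each map to [0, 1] and threshold to get a crude pseudo-mask seed,
# which WSSS pipelines then expand and refine.
cam -= cam.min(axis=(1, 2), keepdims=True)
cam /= cam.max(axis=(1, 2), keepdims=True) + 1e-7
seed = cam > 0.6                                   # boolean (K, H, W) seeds
```

The thresholded seed is exactly the kind of sparse, incomplete region that the "expand" step of SEC-style pipelines tries to grow into the intact object.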
Weak-shot Learning.
Reducing the annotation cost is a practical and extensive demand for various
applications of deep learning. Recently, weak-shot learning, i.e., learning weakly supervised novel
classes with the support of strongly supervised base classes, has been explored in image classification
[5], object detection [26, 48, 23], semantic segmentation [51], instance segmentation [16, 19, 2], and
so on [35], which has achieved promising success. In weak-shot semantic segmentation [51], the
problem is learning to segment novel classes with only image-level labels with the support of base
classes having pixel-level labels. Concerning the task setting, as aforementioned, RETAB [51] further
assumes that the off-the-shelf dataset has no novel classes and that the background class is annotated
with a pixel-level mask, while the setting in this paper is more succinct and practical. Concerning the
technical method, RETAB [51] follows the framework of WSSS, which suffers from a complex and
tedious multi-stage pipeline, i.e., training a classifier, deriving CAM, expanding to pseudo-labels, and
re-training. In contrast, we build our framework on MaskFormer [8] to perform dual similarity
transfer, which achieves satisfactory performance in a single stage without re-training.
Similarity Transfer.
As an effective method, similarity transfer has been widely applied in various
transfer learning tasks [7, 36, 5]. Specifically, semantic similarity (whether two inputs belong
to the same class) is class-agnostic, and thus transferable across classes. To name a few, Guo et
al. [13] transferred class-class similarity and sample-sample similarity across domains in active
learning. CCN [15] proposed to learn the semantic similarity between image pairs, which is robust in
both cross-domain and cross-task transfer learning. PrototypeNet [33] proposed to learn and transfer
the similarity between image and prototype across categories for both few-shot classification and
zero-shot classification. In this paper, we propose to transfer proposal-pixel similarity and pixel-pixel
similarity for weak-shot semantic segmentation. These two types of similarities both belong to
semantic similarity, which are highly transferable across classes.
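Because pair-wise semantic similarity only asks "same class or not" and never "which class", its supervision can be read directly off base-class masks. A minimal NumPy sketch of how such binary pair labels are typically constructed (our illustrative names, not code from any cited work):

```python
import numpy as np

def pair_similarity_labels(gt, pairs):
    """Binary similarity labels for sampled pixel pairs.

    gt: (H, W) integer ground-truth class map (base classes only).
    pairs: (P, 4) array of pixel pairs, each row (h1, w1, h2, w2).
    Returns a (P,) array: 1 if the two pixels share a class, else 0.
    The label carries no class identity, so it is class-agnostic and can
    transfer from base classes to novel classes.
    """
    c1 = gt[pairs[:, 0], pairs[:, 1]]
    c2 = gt[pairs[:, 2], pairs[:, 3]]
    return (c1 == c2).astype(np.int64)

gt = np.array([[0, 0, 1],
               [2, 1, 1]])
pairs = np.array([[0, 0, 0, 1],   # both class 0 -> similar
                  [0, 2, 1, 1],   # both class 1 -> similar
                  [0, 0, 1, 0]])  # class 0 vs. class 2 -> dissimilar
labels = pair_similarity_labels(gt, pairs)
```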
3 Methodology
3.1 Problem Definition
In our weak-shot semantic segmentation, we are given a standard segmentation dataset annotated
for base classes Cb, and we would like to further segment another set of novel classes Cn ignored in
the off-the-shelf dataset, where Cb ∩ Cn = ∅. We assume that we have the image-level labels for Cn,
which are rather cheaper and more convenient to obtain than pixel-level masks. In summary, for each
training image, we have image-level labels for both Cb and Cn, and we have pixel-level masks only
for Cb. In the test stage, we need to predict pixel-level masks for both Cb and Cn.
Figure 2: The detailed illustration of our framework. As in MaskFormer, we produce N proposal
embeddings in each image. On the one hand, each proposal embedding is fed to the classifier, where
both base and novel classes are supervised by classification loss (ClsLoss). On the other hand, the
similarities between each proposal embedding and pixel embeddings are computed to produce binary
masks, where only base masks (in red) are supervised by the GT mask (MaskLoss) while novel masks (in
blue) are supervised by the complementary loss (CompLoss). We sample some pixels and construct pixel
pairs across two images. The concatenated pixel embeddings are fed to SimNet, where the base pixel
pairs (in red) are used to train SimNet with a similarity loss (SimLoss) and the novel pixel pairs (in blue)
are used for similarity distillation (DistLoss).
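To make the annotation protocol concrete, one training sample under this setting could be represented as below; this is a minimal sketch with hypothetical field names (base_mask, novel_labels, etc. are ours, not from the paper):

```python
import numpy as np

# Hypothetical representation of one weak-shot training image.
# Base classes come with full pixel-level masks; novel classes only with
# image-level labels, hiding inside the ignored region of the base mask.
IGNORE = 255  # pixels not covered by any base-class annotation

sample = {
    # (H, W) map: base-class id per pixel, IGNORE where no base mask exists.
    "base_mask": np.array([[0, 0, IGNORE],
                           [1, 1, IGNORE]]),
    "base_labels": {0, 1},   # image-level labels for base classes
    "novel_labels": {7},     # image-level labels for novel classes (no masks)
}

# At test time the model must predict a full (H, W) map over base + novel
# classes, even though no novel-class mask was ever provided for training.
has_novel_mask = "novel_mask" in sample
```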
3.2 Review of MaskFormer
In this section, we briefly review MaskFormer [8], which lays the foundation of our framework.
The general pipeline of MaskFormer disentangles the semantic segmentation task into two sub-tasks:
proposal classification and proposal segmentation. Specifically, MaskFormer maintains N learnable
C-dim query embeddings Q ∈ R^(C×N) shared for all images, as shown in the upper part of Fig. 2.
When an image is input, the backbone features are extracted via a backbone. Then, the N query
embeddings attend to the backbone features to produce proposal embeddings Eprop ∈ R^(C×N) via a
transformer decoder. For each proposal embedding, a bipartite matching algorithm assigns a class
present in the input image to it, considering the classification loss and mask loss as the cost for
assignment. For the proposal classification sub-task, each proposal embedding is fed to a simple classifier
to yield class probability predictions Y ∈ R^((K+1)×N) over K semantic classes and 1 ignore class. For
the proposal segmentation sub-task, pixel embeddings Epix ∈ R^(C×H×W) are extracted from the backbone
features via a pixel decoder. Afterwards, the proposal embeddings are processed by several FC layers,
and their dot-products with the pixel embeddings, followed by a sigmoid, are computed to produce binary
masks M ∈ R^(N×H×W), i.e., M[i, h, w] = sigmoid(Eprop[:, i] · Epix[:, h, w]). In the training stage,
the two sub-tasks for each proposal embedding are supervised by the label and mask of the assigned
class (the mask loss of the ignore class is eliminated). In the test stage, the semantic segmentation result
for each class at pixel (h, w) is obtained by summarizing all the masks weighted by the class scores,
i.e., argmax_{c ∈ {1,...,K}} Σ_{i=1}^{N} Y[c, i] · M[i, h, w]. For more details, please refer to MaskFormer [8].
3.3 Proposal-Pixel Similarity Transfer on MaskFormer
In our setting, we have only image-level labels for novel classes and we choose to produce pixel-level
masks via proposal-pixel similarity transfer based on MaskFormer [8]. For concise description, we