the network to focus on the less discriminative regions by erasing the regions with high response
[26, 37, 20].
Recently, vision transformers have been introduced to the computer vision community. Vision Transformer
(ViT) [10] splits the given image into a patch sequence. The patches are projected to patch embeddings
by a shared fully-connected (FC) layer and summed with their corresponding learnable position
embeddings. The result is then fed into a stack of transformer blocks, each containing Multi-Head
Self-Attention (MHSA) and a Multi-Layer Perceptron (MLP). Thanks to the MHSA and MLP,
ViT can build long-range relationships between patches and handle complex feature transformations.
Recent studies [10, 33, 35, 28] show the superiority of vision transformers over conventional
convolutional neural networks (CNNs).
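As a concrete illustration of this tokenization step, the following is a minimal sketch in PyTorch (a framework assumption on our part); the layer sizes are illustrative defaults, not those of any specific model:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches, project them with a
    shared FC layer, and add learnable position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        # One FC layer shared across all patches.
        self.proj = nn.Linear(in_chans * patch_size * patch_size, embed_dim)
        # One learnable position embedding per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # Split into patches: (B, C, H/p, W/p, p, p) -> (B, N, C*p*p).
        x = x.unfold(2, p, p).unfold(3, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        # Project and add position embeddings; the resulting token
        # sequence feeds the MHSA/MLP transformer blocks.
        return self.proj(x) + self.pos_embed
```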
In this paper, we propose a novel vision-transformer-based framework for WSSS, which generates
activation maps based on class semantics. In this framework, we design a transformer-based Class-
Aware AutoEncoder (CAAE) to extract class embeddings of the input image and learn the semantics
of each class in the dataset with a semantic similarity loss. Via the CAAE and the learned class
semantics, we design an adversarial learning framework that uses a transformer to generate a class
activation map (CAM) for each class. Given an input image and its CAM, we construct a complementary
image pair, class-foreground and class-background images, by multiplying the CAM with the given
image. Both complementary images are fed into the CAAE to extract the corresponding class
embeddings. While the similarity between the embedding of the class-foreground image and the
class semantic should be maximized, the similarity between the embedding of the class-background
image and the corresponding class semantic should be minimized. We design two losses, an activation
suppression loss and an activation complementary loss, to improve the activation map. In addition,
we introduce extra class tokens into the segmentation network to learn the class embeddings, and
refine the CAM using the class-relative attention of each class.
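A minimal sketch of this complementary-pair objective follows; the `caae` encoder interface, the use of cosine similarity, and the equal loss weighting are simplifying assumptions for illustration, not our exact formulation:

```python
import torch
import torch.nn.functional as F

def adversarial_similarity_loss(image, cam, class_semantic, caae):
    """Sketch of the complementary-pair objective (assumed form).
    `caae` maps an image to a class embedding; `class_semantic` is the
    learned semantic vector of the target class, shape (B, D)."""
    cam = cam.clamp(0, 1)                  # (B, 1, H, W), aligned to the image
    fg_image = image * cam                 # class-foreground image
    bg_image = image * (1.0 - cam)         # complementary class-background image

    fg_emb = caae(fg_image)                # (B, D) embedding of the foreground
    bg_emb = caae(bg_image)                # (B, D) embedding of the background

    sim_fg = F.cosine_similarity(fg_emb, class_semantic, dim=-1)
    sim_bg = F.cosine_similarity(bg_emb, class_semantic, dim=-1)

    # Maximize foreground similarity, minimize background similarity.
    return (1.0 - sim_fg).mean() + sim_bg.mean()
```

Because the two losses pull the CAM in opposite directions, the activation map is driven to cover the full object while excluding the background.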
We summarize our contributions as follows. First, different from conventional classification-network-
based CAM approaches, we design a Class-Aware AutoEncoder (CAAE) to extract the class
embeddings of the foreground and background images and learn the semantics of each class in the
dataset, to guide the generation of the activation map. Second, to maximize the similarity between
the class embedding of the input image and the semantic of its class, adversarial losses are designed
to measure the similarity between the class-foreground or class-background image and this class,
and thus guide the generation of the activation map. Third, we introduce extra class tokens into the
segmentation network to learn the class embeddings and refine the CAM with the generated multi-
head self-attention of each class. Finally, experimental results show that our proposed SemFormer
achieves 74.3% mIoU and surpasses many recent mainstream WSSS approaches on the PASCAL VOC
2012 dataset.
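To make the third contribution concrete, a simplified sketch of how class-relative attention can re-weight a CAM is given below; the attention shape and the renormalization step are illustrative assumptions:

```python
import torch

def refine_cam_with_class_attention(cam, attn):
    """Refine a CAM with the attention from a class token to the patch
    tokens. `attn` is assumed to have shape (B, heads, num_patches),
    with num_patches equal to the H*W grid of the CAM."""
    B, _, H, W = cam.shape
    # Average over heads and reshape the patch attention to the CAM grid.
    class_attn = attn.mean(dim=1).reshape(B, 1, H, W)
    refined = cam * class_attn                    # re-weight the activations
    # Renormalize each map to [0, 1].
    return refined / (refined.amax(dim=(2, 3), keepdim=True) + 1e-6)
```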
2 Related Work
2.1 Weakly-Supervised Semantic Segmentation
As CAM only activates the most discriminative regions of the given image, many approaches attempt
to address this issue. [16, 19] use object boundary information to expand the regions of the initial
CAM. [13, 1] utilize pixel relationships to refine the initial CAM and generate better pseudo labels.
However, these approaches heavily rely on the quality of the initial CAM. Many methods therefore
try to improve the initial CAM itself. MDC [38] introduces multi-dilated convolutions to enlarge the
receptive field for classifying the non-discriminative regions. SEAM [36] designs a module that
explores pixel-context correlations. However, these methods add complicated modules to the
segmentation network or require complex training procedures. Beyond the above approaches, another
popular technique, adversarial erasing (AE), erases the discriminative regions found by CAM and
drives the network to recognize objects in the non-discriminative regions. AE-PSL [37] erases
regions according to the CAM and forces the network to discover objects in the remaining regions,
but suffers from over-activation, i.e., some background regions are incorrectly activated. GAIN [26]
proposes a two-stage scheme with a shared classifier that minimizes the classification score of the
image masked by the thresholded CAM generated in the first stage. OC-CSE [20] improves GAIN by
training a segmentation network in the first stage and using the pre-trained classifier in the second
stage. However, Zhang et al. [45] point out that AE-based approaches have some weaknesses. First,
their training is unstable, as some regions might be lost due to the random hiding process. Second,
they may over-activate the background regions, resulting in very small classification scores in the
second stage and thus increasing the training instability.
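The erasing step shared by these AE methods can be sketched as follows; the threshold value here is an assumed illustrative choice, not one prescribed by any of the cited works:

```python
import torch

def erase_discriminative_regions(image, cam, threshold=0.5):
    """Generic adversarial-erasing step: mask out regions whose CAM
    response exceeds a threshold, so the network must respond to the
    remaining, less discriminative regions."""
    mask = (cam < threshold).float()   # keep only low-response regions
    return image * mask                # erased image fed back to the network
```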