SemFormer: Semantic Guided Activation
Transformer for Weakly Supervised Semantic
Segmentation
Junliang Chen, Xiaodong Zhao, Cheng Luo, Linlin Shen
Shenzhen University
{chenjunliang2016,zhaoxiaodong2020,luocheng2020}@email.szu.edu.cn
llshen@szu.edu.cn
Abstract
Recent mainstream weakly supervised semantic segmentation (WSSS) approaches are mainly based on the Class Activation Map (CAM) generated by a CNN (Convolutional Neural Network) based image classifier. In this paper, we propose a novel transformer-based framework, named Semantic Guided Activation Transformer (SemFormer), for WSSS. We design a transformer-based Class-Aware AutoEncoder (CAAE) to extract the class embeddings for the input image and learn class semantics for all classes of the dataset. The class embeddings and learned class semantics are then used to guide the generation of activation maps with four losses, i.e., class-foreground, class-background, activation suppression, and activation complementation loss. Experimental results show that our SemFormer achieves 74.3% mIoU and surpasses many recent mainstream WSSS approaches by a large margin on the PASCAL VOC 2012 dataset. Code will be available at https://github.com/JLChen-C/SemFormer.
1 Introduction
Semantic segmentation, a fundamental but challenging task in computer vision, has received much attention from researchers. The task is to assign a class to each pixel of a given image. With the rapid development of the computer vision community, many approaches [29, 4, 6, 43, 46, 40] have achieved impressive performance in fully-supervised semantic segmentation (FSSS). However, FSSS relies on pixel-level annotations, which are very time-consuming and labour-intensive to obtain. To reduce this annotation cost, many recent weakly supervised semantic segmentation (WSSS) approaches aim at making full use of weaker annotations, e.g., bounding boxes [31, 18], scribbles [34, 27], points [2], and image-level labels [17, 37, 45, 26, 32]. These WSSS approaches follow a two-stage paradigm: first generating pseudo semantic segmentation labels from the weak annotations, and then training an FSSS network on the pseudo labels. Among these annotations, image-level labels are the most convenient to acquire and are therefore broadly studied by researchers. Consequently, in this paper, we mainly focus on approaches using image-level annotations.
However, image-level labels cannot provide enough location information. Fortunately, the Class Activation Map (CAM) [47] provides an effective way to acquire location information using only image-level annotations, and many WSSS approaches are built upon it. Although CAM is simple and effective, it has an obvious disadvantage: it only activates the most discriminative regions, which results in under-activation. As a result, many WSSS approaches, including CAM expansion methods [17, 38, 36], try to solve this problem. Erasing-based methods drive
the network to focus on the less discriminative regions by erasing the regions with high response
[26, 37, 20].
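For reference, a vanilla CAM can be sketched as follows. This is only a minimal illustration assuming a CNN classifier with global average pooling; the tensor names and shapes are ours, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def vanilla_cam(features, fc_weight, class_idx):
    """Minimal CAM: weight the last convolutional feature maps by the
    classifier weights of one class and collapse the channel dimension.

    features:  (C, H, W) feature maps from the last conv layer
    fc_weight: (num_classes, C) weights of the final FC classifier
    class_idx: index of the target class
    """
    cam = torch.einsum('c,chw->hw', fc_weight[class_idx], features)
    cam = F.relu(cam)                    # keep only positive evidence
    cam = cam / (cam.max() + 1e-6)       # normalize to [0, 1]
    return cam
```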
Recently, vision transformers have been introduced to the computer vision community. Vision Transformer (ViT) [10] splits the given image into a patch sequence. The patches are projected to patch embeddings by a shared fully-connected (FC) layer and summed with the corresponding learnable position embeddings. The result is then fed to sequential transformer blocks containing Multi-Head Self-Attention (MHSA) and Multi-Layer Perceptron (MLP) modules. Thanks to the MHSA and MLP, ViT can build long-range relationships between patches and handle complex feature transforms. Recent studies [10, 33, 35, 28] show the superiority of vision transformers over conventional convolutional neural networks (CNNs).
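For illustration, the patchification step described above can be sketched as below; the image size, patch size, and embedding dimension are assumed values and are not tied to any particular ViT variant.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping P x P patches, project each patch
    to a d-dimensional token with a shared linear layer, and add learnable
    position embeddings (a simplified ViT-style front end)."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_size = patch_size
        self.proj = nn.Linear(in_chans * patch_size * patch_size, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, x):                                # x: (B, 3, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # (B, C, H, W) -> (B, C, h, w, P, P) -> (B, h*w, C*P*P)
        x = x.unfold(2, P, P).unfold(3, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        tokens = self.proj(x) + self.pos_embed            # (B, hw, dim)
        return tokens
```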
In this paper, we propose a novel vision-transformer-based framework for WSSS, which generates activation maps under the guidance of class semantics. In this framework, we design a transformer-based Class-Aware AutoEncoder (CAAE) to extract the class embeddings of the input image and learn the semantic of each class in the dataset with a semantic similarity loss. Using the CAAE and the learned class semantics, we design an adversarial learning framework that generates a class activation map (CAM) for each class with a transformer. Given an input image and its CAM, we construct a complementary image pair, the class-foreground and class-background images, by multiplying the CAM (and its complement) with the given image. Both complementary images are fed to the CAAE to extract the corresponding class embeddings. The similarity between the embedding of the class-foreground image and the class semantic should be maximized, while the similarity between the embedding of the class-background image and the corresponding class semantic should be minimized. We further design two losses, the activation suppression and activation complementation losses, to improve the activation maps. Besides, we introduce extra class tokens into the segmentation network to learn the class embeddings, and refine the CAM using the class-relative attention of each class.
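The construction of the complementary image pair and the two opposing similarity objectives can be sketched as follows. This is only an illustration: the CAAE is abstracted as a callable, and the exact form of the class-foreground and class-background losses is an assumption, since the text only specifies that the foreground similarity is maximized and the background similarity is minimized.

```python
import torch
import torch.nn.functional as F

def class_fg_bg_losses(image, cam_c, caae, class_semantics, c, eps=1e-6):
    """image:           (B, 3, H, W) input image
    cam_c:           (B, 1, H, W) activation map of class c, values in [0, 1]
    caae:            frozen class-aware autoencoder, abstracted here as a
                     callable with caae(x) -> (B, K, D) class embeddings
    class_semantics: (K, D) learned class semantics of the dataset
    c:               index of the target class
    """
    fg_image = cam_c * image                 # class-foreground image  M_c * I
    bg_image = (1.0 - cam_c) * image         # class-background image (1 - M_c) * I

    e_fg = caae(fg_image)[:, c]              # (B, D) foreground embedding of class c
    e_bg = caae(bg_image)[:, c]              # (B, D) background embedding of class c
    s_c = class_semantics[c].unsqueeze(0)    # (1, D) semantic of class c

    sim_fg = F.cosine_similarity(e_fg, s_c, dim=1)   # should be driven towards 1
    sim_bg = F.cosine_similarity(e_bg, s_c, dim=1)   # should be driven towards 0

    # One plausible realization: BCE-style terms that maximize sim_fg and
    # minimize sim_bg (the exact loss form is not given in this excerpt).
    loss_fg = -torch.log(sim_fg.clamp(eps, 1.0)).mean()
    loss_bg = -torch.log((1.0 - sim_bg).clamp(eps, 1.0)).mean()
    return loss_fg, loss_bg
```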
We summarize our contributions as follows. First, different from conventional classification-network-based CAM approaches, we design a Class-Aware AutoEncoder (CAAE) that extracts the class embeddings of the foreground and background images and learns the semantic of each class in the dataset, to guide the generation of the activation maps. Second, to maximize the similarity between the class embedding of the input image and the semantic of its class, adversarial losses are designed to measure the similarity between the class-foreground or class-background image and this class, and thus guide the generation of the activation maps. Third, we introduce extra class tokens into the segmentation network to learn the class embeddings and refine the CAM with the generated multi-head self-attention of each class. Finally, experimental results show that our proposed SemFormer achieves 74.3% mIoU and surpasses many recent mainstream WSSS approaches on the PASCAL VOC 2012 dataset.
2 Related Work
2.1 Weakly-Supervised Semantic Segmentation
As CAM only activates the most discriminative regions of the given image, many approaches attempt to address this issue. [16, 19] use object boundary information to expand the regions of the initial CAM. [13, 1] utilize pixel relationships to refine the initial CAM and generate better pseudo labels. However, these approaches heavily rely on the quality of the initial CAM, so many methods try to improve it. MDC [38] introduces multi-dilated convolutions to enlarge the receptive field for classifying the non-discriminative regions. SEAM [36] designs a module that exploits pixel context correlation. However, these methods add complicated modules to the segmentation network or require a complex training procedure. Beyond the above approaches, another popular technique, adversarial erasing (AE), erases the discriminative regions found by CAM and drives the network to recognize the objects in the non-discriminative regions. AE-PSL [37] erases the regions indicated by the CAM and forces the network to discover objects in the remaining regions, but suffers from over-activation, i.e., some background regions are incorrectly activated. GAIN [26] proposes a two-stage scheme using a shared classifier that minimizes the classification score of the image masked by the thresholded CAM generated in the first stage. OC-CSE [20] improves GAIN by training a segmentation network in the first stage and using the pre-trained classifier in the second stage. However, Zhang et al. [45] point out that AE-based approaches have several weaknesses. First, their training is unstable, as some regions might be lost due to the random hiding process. Second, they may over-activate background regions, resulting in very small classification scores in the second stage and thus further increasing training instability.
2.2 Vision Transformer
Recently, vision transformers have been introduced to the computer vision community. Vision Transformer (ViT) [10] first proposes to split the input image into a non-overlapping patch sequence. Each patch is then projected into a token using a shared fully-connected (FC) layer. The patch tokens, plus a class token, are summed with the corresponding position embeddings and passed to transformer blocks with Multi-Head Self-Attention (MHSA) and Multi-Layer Perceptron (MLP) modules. DeiT [33] improves ViT via distillation on an extra distillation token. As the MHSA with MLP connects every token to all other tokens, message passing within the transformer is global and better than the local connections of conventional CNNs. Many approaches [35, 28, 7, 9] aim at improving ViT with better architectures. Recently, TS-CAM [14] and MCTFormer [42] introduce transformers to weakly-supervised object localization and WSSS, respectively, and both achieve impressive performance. However, TS-CAM [14] and MCTFormer [42] are still classification-based CAM generation approaches and may still suffer from the incomplete activation problem. In this paper, we propose a novel transformer-based learning framework for WSSS that generates activation maps under the guidance of semantics, which can produce more complete regions than those CAM-based approaches.
3 Method
3.1 Overall Framework
Figure 1: Overview of the proposed two-stage SemFormer. The first stage (top) is to train a class-
aware autoencoder (CAAE). The second stage (bottom) is to train the segmentation network using
the pre-trained CAAE.
In this section, we introduce the overall framework of our SemFormer. As shown in Figure 1, SemFormer contains two training stages. In the first stage, we train a Class-Aware AutoEncoder (CAAE) under the constraints of the semantic similarity and reconstruction losses, so that it correctly extracts the class embeddings of the input image and learns the class semantics of the dataset. In the second stage, we train the segmentation network based on the class embeddings extracted by the pre-trained CAAE and the learned class semantics. The similarities for the same class and for different classes are maximized and minimized, respectively, through four losses (class-foreground, class-background, activation suppression, and activation complementation loss) to guarantee the correctness, completeness, compactness, and complementarity of the activation maps.
3.2 Class-Aware AutoEncoder
3.2.1 Architecture
The Class-Aware AutoEncoder (CAAE) is a transformer-based architecture designed to extract the class embeddings of the input image and learn the class semantics of the dataset. It consists of an encoder (a transformer backbone without a classification layer) and an asymmetric decoder (transformer blocks).
3.2.2 Class-Aware Semantic Representation
Let $I$ be the input image with $H \times W$ pixels, which is split into $h \times w$ patches, where $h = H/P$, $w = W/P$, and $P$ is the patch size. The patches are flattened and projected to patch tokens $t_0 \in \mathbb{R}^{hw \times d}$ via a shared FC layer, where $d$ is the dimension of each token embedding. The patch tokens (in our experiments, we find that omitting the class token has a negligible influence on the performance) are then summed with the corresponding position embeddings and passed to $L$ transformer blocks with MHSA and MLP. Let $t_L \in \mathbb{R}^{hw \times d}$ be the patch tokens after the $L$-th transformer block. $t_L$ is passed to a fully-connected (FC) layer followed by a ReLU function and reshaped to generate the primary tokens $T \in \mathbb{R}^{hw \times K \times D}$:
$$T = \mathrm{reshape}\left(\mathrm{ReLU}(t_L W_1), (hw, K, D)\right), \qquad (1)$$
where $W_1 \in \mathbb{R}^{d \times KD}$ is the weight matrix of the FC layer, $\mathrm{reshape}$ denotes the reshaping function, $K$ is the number of classes in the dataset, and $D$ is the class dimension.
After that, we sum up the primary tokens $T$ to generate the class embeddings $E \in \mathbb{R}^{K \times D}$, i.e., $E = \frac{1}{K} \sum_{i=1}^{hw} T_i$.
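A small sketch of Eq. (1) and the pooling into class embeddings, with a batch dimension added and illustrative sizes; the variable names follow the text, everything else is assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, hw, d = 2, 196, 768        # batch, number of patch tokens, token dimension
K, D = 21, 64                 # number of classes, class dimension (illustrative)

t_L = torch.randn(B, hw, d)   # patch tokens after the L-th transformer block
W1 = nn.Linear(d, K * D, bias=False)

# Eq. (1): primary tokens T of shape (B, hw, K, D)
T = F.relu(W1(t_L)).reshape(B, hw, K, D)

# Class embeddings E of shape (B, K, D): sum over the hw patch tokens, scaled by 1/K
E = T.sum(dim=1) / K
```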
The class embeddings $E$ represent the semantics of the classes present in the input image. We also define class semantics $S \in \mathbb{R}^{K \times D}$ to represent the overall semantics of the $K$ classes in the dataset, where each semantic is a $D$-dimensional vector corresponding to one class. $S$ is obtained by $S = \mathrm{ReLU}(W_S)$, where $W_S$ is a weight matrix that is randomly initialized before training the CAAE and fixed after training.
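The dataset-level class semantics $S = \mathrm{ReLU}(W_S)$ can be kept as a learnable matrix, for example as in the sketch below; the initialization scheme is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassSemantics(nn.Module):
    """Learnable per-class semantic vectors S = ReLU(W_S), one D-dimensional
    vector per class; trained with the CAAE and frozen afterwards."""

    def __init__(self, num_classes, class_dim):
        super().__init__()
        self.W_S = nn.Parameter(torch.randn(num_classes, class_dim))

    def forward(self):
        return F.relu(self.W_S)    # S, shape (K, D), non-negative
```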
3.2.3 Loss Functions
For the CAAE, we design a semantic similarity loss and a reconstruction loss to extract the class embeddings of the input image and learn the class semantics of the dataset.
Semantic Similarity Loss.
For the input image $I$, the semantic similarity (SS) loss is designed to make the CAAE aware of which classes exist in the image. For class $c$, the cosine similarity $\mathrm{Sim}(E_c, S_c)$ between the class embedding $E_c$ and the class semantic $S_c$ is maximized if any object of class $c$ exists in the image; otherwise, $\mathrm{Sim}(E_c, S_c)$ is minimized. Here $\mathrm{Sim}(a, b) = \frac{a^{\top} b}{\|a\|_2 \|b\|_2}$ is the cosine similarity between $a$ and $b$. The value of $\mathrm{Sim}(E_c, S_c)$ lies in $[0, 1]$ because both $E$ and $S$ are passed through ReLU functions and therefore contain no negative values, as mentioned above. Let $Y \in \{0, 1\}^K$ be the image-level annotation of the input image: for class $c$, $Y_c = 1$ if class $c$ exists in the image, otherwise $Y_c = 0$. The SS loss $\mathcal{L}_{SS}$ is a binary cross-entropy (BCE) loss:
$$\mathcal{L}_{SS} = -\frac{1}{K} \sum_{c=1}^{K} \left[ Y_c \log\left(\mathrm{Sim}(E_c, S_c)\right) + (1 - Y_c) \log\left(1 - \mathrm{Sim}(E_c, S_c)\right) \right]. \qquad (2)$$
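Eq. (2) translates almost directly into code; the sketch below adds clamping for numerical stability, which is not part of the formulation above.

```python
import torch
import torch.nn.functional as F

def semantic_similarity_loss(E, S, Y, eps=1e-6):
    """E: (K, D) class embeddings of one image (non-negative, after ReLU)
    S: (K, D) class semantics of the dataset (non-negative, after ReLU)
    Y: (K,)   float image-level labels, 1 if the class is present, else 0
    """
    sim = F.cosine_similarity(E, S, dim=1).clamp(eps, 1.0 - eps)   # (K,), in [0, 1]
    loss = -(Y * torch.log(sim) + (1.0 - Y) * torch.log(1.0 - sim)).mean()
    return loss
```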
Reconstruction Loss.
There is no guarantee that the embeddings extracted by the encoder can
correctly represent the semantic for each class of the input image. To address this issue, we introduce