the network to focus on the less discriminative regions by erasing the regions with high response
[26, 37, 20].
Recently, vision transformers have been introduced to the computer vision community. Vision Transformer
(ViT) [10] splits the given image into a patch sequence. The patches are projected to patch embeddings
by a shared fully-connected (FC) layer and summed with their corresponding learnable position
embeddings. The result is then fed into a stack of transformer blocks, each containing Multi-Head
Self-Attention (MHSA) and a Multi-Layer Perceptron (MLP). Thanks to the MHSA and MLP,
ViT can build long-range relationships between patches and handle complex feature transformations.
Recent studies [10, 33, 35, 28] show the superiority of vision transformers over conventional
convolutional neural networks (CNNs).
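As a concrete illustration of this tokenization step, the following is a minimal sketch in PyTorch (a framework assumption on our part); the layer sizes are illustrative defaults, not those of any specific model:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches, project them with a
    shared FC layer, and add learnable position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        # One FC layer shared across all patches.
        self.proj = nn.Linear(in_chans * patch_size * patch_size, embed_dim)
        # One learnable position embedding per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # Split into patches: (B, C, H/p, W/p, p, p) -> (B, N, C*p*p).
        x = x.unfold(2, p, p).unfold(3, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        # Project and add position embeddings; the resulting token
        # sequence feeds the MHSA/MLP transformer blocks.
        return self.proj(x) + self.pos_embed
```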
In this paper, we propose a novel vision-transformer-based framework for WSSS, which generates
activation maps based on class semantics. In this framework, we design a transformer-based Class-
Aware AutoEncoder (CAAE) to extract class embeddings of the input image and learn the semantics
of each class in the dataset with a semantic similarity loss. Via the CAAE and the learned class
semantics, we design an adversarial learning framework that uses a transformer to generate a class
activation map (CAM) for each class. Given an input image and its CAM, we construct a complementary
image pair, class-foreground and class-background images, by multiplying the CAM with the given
image. Both complementary images are fed into the CAAE to extract the corresponding class
embeddings. While the similarity between the embedding of the class-foreground image and the
class semantic should be maximized, the similarity between the embedding of the class-background
image and the corresponding class semantic should be minimized. We design two losses, an activation
suppression loss and an activation complementary loss, to improve the activation map. In addition,
we introduce extra class tokens into the segmentation network to learn the class embeddings, and
refine the CAM using the class-relative attention of each class.
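A minimal sketch of this complementary-pair objective follows; the `caae` encoder interface, the use of cosine similarity, and the equal loss weighting are simplifying assumptions for illustration, not our exact formulation:

```python
import torch
import torch.nn.functional as F

def adversarial_similarity_loss(image, cam, class_semantic, caae):
    """Sketch of the complementary-pair objective (assumed form).
    `caae` maps an image to a class embedding; `class_semantic` is the
    learned semantic vector of the target class, shape (B, D)."""
    cam = cam.clamp(0, 1)                  # (B, 1, H, W), aligned to the image
    fg_image = image * cam                 # class-foreground image
    bg_image = image * (1.0 - cam)         # complementary class-background image

    fg_emb = caae(fg_image)                # (B, D) embedding of the foreground
    bg_emb = caae(bg_image)                # (B, D) embedding of the background

    sim_fg = F.cosine_similarity(fg_emb, class_semantic, dim=-1)
    sim_bg = F.cosine_similarity(bg_emb, class_semantic, dim=-1)

    # Maximize foreground similarity, minimize background similarity.
    return (1.0 - sim_fg).mean() + sim_bg.mean()
```

Because the two losses pull the CAM in opposite directions, the activation map is driven to cover the full object while excluding the background.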
We summarize our contributions as follows. First, different from conventional classification-network-
based CAM approaches, we design a Class-Aware AutoEncoder (CAAE) to extract the class
embeddings of the foreground and background images and learn the semantics of each class in the
dataset, to guide the generation of the activation map. Second, to maximize the similarity between
the class embedding of the input image and the semantic of its class, adversarial losses are designed
to measure the similarity between the class-foreground or class-background image and this class,
and thus guide the generation of the activation map. Third, we introduce extra class tokens into the
segmentation network to learn the class embeddings and refine the CAM with the generated multi-
head self-attention of each class. Finally, experimental results show that our proposed SemFormer
achieves 74.3% mIoU and surpasses many recent mainstream WSSS approaches on the PASCAL VOC
2012 dataset.
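To make the third contribution concrete, a simplified sketch of how class-relative attention can re-weight a CAM is given below; the attention shape and the renormalization step are illustrative assumptions:

```python
import torch

def refine_cam_with_class_attention(cam, attn):
    """Refine a CAM with the attention from a class token to the patch
    tokens. `attn` is assumed to have shape (B, heads, num_patches),
    with num_patches equal to the H*W grid of the CAM."""
    B, _, H, W = cam.shape
    # Average over heads and reshape the patch attention to the CAM grid.
    class_attn = attn.mean(dim=1).reshape(B, 1, H, W)
    refined = cam * class_attn                    # re-weight the activations
    # Renormalize each map to [0, 1].
    return refined / (refined.amax(dim=(2, 3), keepdim=True) + 1e-6)
```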
2 Related Work
2.1 Weakly-Supervised Semantic Segmentation
As CAM only activates the most discriminative regions of the given image, many approaches attempt
to address this issue. [16, 19] use object boundary information to expand the regions of the initial
CAM. [13, 1] utilize pixel relationships to refine the initial CAM and generate better pseudo labels.
However, these approaches heavily rely on the quality of the initial CAM. Many methods therefore
try to improve the initial CAM itself. MDC [38] introduces multi-dilated convolutions to enlarge the
receptive field for classifying the non-discriminative regions. SEAM [36] designs a module that
explores pixel-context correlations. However, these methods add complicated modules to the
segmentation network or require complex training procedures. Beyond the above approaches, another
popular technique, adversarial erasing (AE), erases the discriminative regions found by CAM and
drives the network to recognize objects in the non-discriminative regions. AE-PSL [37] erases
regions according to the CAM and forces the network to discover objects in the remaining regions,
but suffers from over-activation, i.e., some background regions are incorrectly activated. GAIN [26]
proposes a two-stage scheme with a shared classifier that minimizes the classification score of the
image masked by the thresholded CAM generated in the first stage. OC-CSE [20] improves GAIN by
training a segmentation network in the first stage and using the pre-trained classifier in the second
stage. However, Zhang et al. [45] point out that AE-based approaches have some weaknesses. First,
their training is unstable, as some regions might be lost due to the random hiding process. Second,
they may over-activate the background regions, resulting in very small classification scores in the
second stage and thus increasing the training instability.
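The erasing step shared by these AE methods can be sketched as follows; the threshold value here is an assumed illustrative choice, not one prescribed by any of the cited works:

```python
import torch

def erase_discriminative_regions(image, cam, threshold=0.5):
    """Generic adversarial-erasing step: mask out regions whose CAM
    response exceeds a threshold, so the network must respond to the
    remaining, less discriminative regions."""
    mask = (cam < threshold).float()   # keep only low-response regions
    return image * mask                # erased image fed back to the network
```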