
view consistency [6,21], edge detection [20,58], or saliency
priors [45]. Recently, the self-supervised ViT [4] has provided a
new paradigm for USS because its pixel-level representations
contain semantic information. Figure 1 makes this intuitive:
in the representation space of an image, the pixel-level
representations produced by the self-supervised ViT form
underlying clusters. When projected back into the image,
these clusters become semantically consistent groups of
pixels or regions representing “concepts”.
In this work, we aim to achieve USS by accurately extracting
and classifying these “concepts” in the pixel representation
space of each image. Unlike previous attempts, which only
consider a foreground-background partition [42,46,50] or
divide each image into a fixed number of clusters [19,33],
we argue that it is crucial to treat different images
differently because of the varying complexity of scenes
(Figure 1). We therefore propose Adaptive Conceptualization
for unsupervised semantic Segmentation (ACSeg), a framework
that finds these underlying concepts adaptively for each
image and achieves USS by classifying the discovered
concepts in an unsupervised manner.
To achieve conceptualization, we explicitly encode concepts
as learnable prototypes and adaptively update them for
different images with a network, as shown in Figure 2. This
network, named the Adaptive Concept Generator (ACG), is
implemented by iteratively applying scaled dot-product
attention [47] to the prototypes and the pixel-level
representations of the image being processed. Through this
structure, the ACG learns to project the initial prototypes
to the concepts in the representation space, conditioned on
the input pixel-level representations. The concepts are then
made explicit in the image as different regions by assigning
each pixel to the nearest concept in the representation
space.
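To make this structure concrete, the following is a minimal PyTorch sketch of one plausible instantiation of the iterative prototype update and the nearest-concept assignment; the prototype count, feature dimension, iteration number, and the residual form of the update are our own illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of the ACG idea, assuming PyTorch. Hyper-parameters
# (n_prototypes, dim, n_iters) and the residual update form are
# illustrative assumptions, not the paper's exact design.
import torch
import torch.nn.functional as F

class ACG(torch.nn.Module):
    def __init__(self, n_prototypes: int = 8, dim: int = 384, n_iters: int = 3):
        super().__init__()
        # Learnable initial prototypes, shared across all images.
        self.prototypes = torch.nn.Parameter(torch.randn(n_prototypes, dim))
        self.n_iters = n_iters

    def forward(self, pixels: torch.Tensor):
        # pixels: (N, dim) pixel-level representations of one image.
        p = self.prototypes
        scale = pixels.shape[-1] ** 0.5
        for _ in range(self.n_iters):
            # Scaled dot-product attention: prototypes attend to the
            # pixels, so the update depends on the image being processed.
            attn = F.softmax(p @ pixels.t() / scale, dim=-1)  # (K, N)
            p = p + attn @ pixels
            # (A self-attention step over p could model the prototypes
            # updating each other, as suggested by Figure 2.)
        # Assign each pixel to the nearest concept in representation space.
        assignment = torch.cdist(pixels, p).argmin(dim=-1)  # (N,)
        return p, assignment
```

Pixels sharing an assignment index then form one concept region of the image.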
The ACG is optimized end to end, without any annotations,
using the proposed modularity loss. Specifically, we
construct an affinity graph over the pixel-level
representations and use the connectivity of two pixels in
this graph to modulate the strength with which they are
assigned to the same concept, motivated by modularity [35].
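For reference, modularity [35] scores a partition of a graph by comparing within-group affinity to what a random graph with the same degrees would yield; a soft, differentiable surrogate of this score is one natural way to obtain a training loss. The second formula below is such a sketch under our own assumptions, not necessarily the paper's exact objective.

```latex
% Modularity [35]: A_{ij} is the affinity between pixels i and j,
% k_i = \sum_j A_{ij}, 2m = \sum_{i,j} A_{ij}, and \delta(c_i, c_j) = 1
% iff pixels i and j are assigned to the same concept.
\[
  Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)
\]
% A differentiable surrogate (our sketch): replace \delta with the inner
% product of soft concept-assignment vectors s_i and minimize
\[
  \mathcal{L}_{\mathrm{mod}} = -\frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) s_i^{\top} s_j .
\]
```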
As the main part of ACSeg, the ACG achieves precise
conceptualization for different images thanks to its
adaptiveness, which is reflected in two aspects. First, it
adaptively operates on the pixel-level representations of
different images through its dynamic update structure.
Second, the training objective does not constrain the number
of concepts, yielding an adaptive number of concepts per
image. With these properties, we obtain accurate partitions
for images of different scene complexity via the concepts
produced by the ACG, as shown in Figure 1(c). In ACSeg, the
semantic segmentation of an image is therefore finally
achieved by matting the corresponding regions in the image
and classifying them with powerful image-level pre-trained
models.
Figure 2. Intuitive explanation of the basic idea of the ACG.
The concepts are explicitly encoded as learnable prototypes and
dynamically updated according to the input pixel-level representa-
tions. After the update, the pixels are assigned to the nearest
concept in the representation space.
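As an illustration of this final step, the excerpt does not name the image-level model; the sketch below assumes a CLIP-style zero-shot classifier applied to each matted concept region, with a hypothetical label set and prompt template.

```python
# Hedged sketch of the final classification step: the text only says
# "powerful image-level pre-trained models"; CLIP, the prompt template,
# and the class list below are illustrative assumptions.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "car", "person"]  # hypothetical label set
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(prompts)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def classify_region(region: Image.Image) -> str:
    """Zero-shot classify one matted concept region."""
    image = preprocess(region).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return class_names[(img_feat @ text_feat.t()).argmax().item()]
```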
For evaluation, we apply ACSeg to commonly used semantic
segmentation datasets, including PASCAL VOC 2012 [11] and
COCO-Stuff [21,29]. The experimental results show that the
proposed ACSeg surpasses previous methods under different
settings of the unsupervised semantic segmentation task and
achieves state-of-the-art performance on the PASCAL VOC 2012
unsupervised semantic segmentation benchmark without
post-processing or re-training.
Moreover, the visualization of the pixel-level
representations and the concepts shows that the ACG is
applicable to decomposing images of various scene
complexity. Since the ACG converges quickly without learning
new representations, and the concept classifier is employed
in a zero-shot manner, we regard the proposed ACSeg as a
generalizable method that is easy to modify and adapt to a
wide range of unsupervised image understanding tasks.
2. Related Works
Vision Transformer. Transformer, a model mainly based on
the self-attention mechanism, is widely used in natural
language processing [3,8] and cross-modal understand-
ing [22,26,27,38]. The Vision Transformer (ViT) [9] is the
first purely visual transformer model for processing images.
Recently, Caron et al. [4] proposed self-distillation with no
labels (DINO) to train the ViT and found that its features
contain explicit information about the segmentation of an
image. Based on DINO, several previous studies
[15,33,42,46,50,55] successfully extend this property to
unsupervised dense prediction tasks.
Unsupervised Semantic Segmentation. With the development
of self-supervised and unsupervised learning, unsupervised
methods for the semantic segmentation task have started to
emerge. Among them, some methods focus on pixel-level
self-supervised representation learning by introducing
cross-view consistency [6,21,23,52,53,57,61], visual
priors [20,45,58], or the continuity of video frames [2]. In
contrast, Zadaianchuk et al. [56] adopt pre-trained object-
centric representations and cluster them to segment ob-
jects. Other methods exploit pixel-level knowledge of