ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation
Kehan Li1,3  Zhennan Wang2  Zesen Cheng1,3  Runyi Yu1,3  Yian Zhao5  Guoli Song2  Chang Liu4  Li Yuan1,2,3*  Jie Chen1,2,3
1School of Electronic and Computer Engineering, Peking University, Shenzhen, China
2Peng Cheng Laboratory, Shenzhen, China
3AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, China
4Department of Automation and BNRist, Tsinghua University, Beijing, China
5Dalian University of Technology
Abstract
Recently, self-supervised large-scale visual pre-training
models have shown great promise in representing pixel-
level semantic relationships, significantly promoting the de-
velopment of unsupervised dense prediction tasks, e.g., un-
supervised semantic segmentation (USS). The extracted re-
lationship among pixel-level representations typically con-
tains rich class-aware information, in that semantically identical pixel embeddings in the representation space gather together to form sophisticated concepts. However, lever-
aging the learned models to ascertain semantically con-
sistent pixel groups or regions in the image is non-trivial
since over-/under-clustering overwhelms the conceptualiza-
tion procedure under various semantic distributions of dif-
ferent images. In this work, we investigate the pixel-level
semantic aggregation in self-supervised ViT pre-trained
models as image Segmentation and propose the Adaptive
Conceptualization approach for USS, termed ACSeg. Con-
cretely, we explicitly encode concepts into learnable proto-
types and design the Adaptive Concept Generator (ACG),
which adaptively maps these prototypes to informative con-
cepts for each image. Meanwhile, considering the scene
complexity of different images, we propose the modularity
loss to optimize ACG independent of the concept number
based on estimating the intensity of pixel pairs belonging to
the same concept. Finally, we turn the USS task into clas-
sifying the discovered concepts in an unsupervised manner.
Extensive experiments with state-of-the-art results demon-
strate the effectiveness of the proposed ACSeg.
1. Introduction
Semantic segmentation is one of the primary tasks in
computer vision, which has been widely used in many do-
mains, such as autonomous driving [7,14] and medical
imaging [12,25,43].

*Corresponding author. Project page: https://lkhl.github.io/ACSeg.

Figure 1. Comparison between existing methods and our adaptive conceptualization on finding underlying “concepts” in the pixel-level representations produced by a pre-trained model. While under-clustering focuses on only a single object and over-clustering splits objects, our adaptive conceptualization processes different images adaptively by updating the initialized prototypes with the representations of each image.

With the development of deep learn-
ing and the increasing amount of data [7,11,29,59], steadily improving performance has been achieved on this task by optimizing deep neural networks with pixel-level annotations [30].
However, large-scale pixel-level annotations are expensive
and laborious to obtain. Different kinds of weak supervision
have been explored to achieve label efficiency [39], e.g.,
image-level [1,51], scribble-level [28], and box-level super-
vision [36]. Beyond this, some methods also achieve se-
mantic segmentation without relying on any labels [20,21],
namely unsupervised semantic segmentation (USS).
Early approaches for USS are based on pixel-level self-
supervised representation learning by introducing cross-
view consistency [6,21], edge detection [20,58], or saliency
prior [45]. Recently, the self-supervised ViT [4] has provided a new paradigm for USS due to the property that its pixel-level representations contain semantic information. Figure 1 makes this intuitive: in the
representation space of an image, the pixel-level representa-
tions produced by the self-supervised ViT contain underly-
ing clusters. When projecting these clusters into the image,
they become semantically consistent groups of pixels or re-
gions representing “concepts”.
In this work, we aim to achieve USS by accurately ex-
tracting and classifying these “concepts” in the pixel rep-
resentation space of each image. Unlike the previous at-
tempts which only consider foreground-background parti-
tion [42,46,50] or divide each image into a fixed number
of clusters [19,33], we argue that it is crucial to treat different images distinctly due to the complexity of
various scenarios (Figure 1). We thus propose the Adaptive
Conceptualization for unsupervised semantic Segmentation
(ACSeg), a framework that finds these underlying concepts
adaptively for each image and achieves USS by classifying
the discovered concepts in an unsupervised manner.
To achieve conceptualization, we explicitly encode con-
cepts into learnable prototypes and adaptively update them for
different images by a network, as shown in Figure 2. This
network, named the Adaptive Concept Generator (ACG), is
implemented by iteratively applying scaled dot-product at-
tention [47] on the prototypes and pixel-level representa-
tions in the image to be processed. Through such a struc-
ture, the ACG learns to project the initial prototypes to the
concept in the representation space depending on the in-
put pixel-level representations. Then the concepts are ex-
plicitly presented in the image as different regions by as-
signing each pixel to the nearest concept in the representa-
tion space. The ACG is end-to-end optimized without any
annotations by the proposed modularity loss. Specifically,
we construct an affinity graph on the pixel-level representa-
tions and use the connection relationship of two pixels in the
affinity graph to adjust the strength of assigning two pixels
to the same concept, motivated by the modularity [35].
As the main part of ACSeg, the ACG achieves precise
conceptualization for different images due to its adaptive-
ness, which is reflected in two aspects: first, it can adaptively operate on pixel-level representations of different images thanks to the dynamic update structure; second, the training objective does not enforce the number of concepts, resulting in an adaptive number of concepts for different images. With these properties, we obtain accurate partitions for
images with different scene complexity via the concepts
produced by the ACG, as shown in Figure 1(c). Therefore,
in ACSeg, the semantic segmentation of an image can fi-
nally be achieved by matting the corresponding regions in
the image and classifying them with the help of powerful
image-level pre-trained models.

Figure 2. Intuitive explanation of the basic idea of the ACG. The concepts are explicitly encoded into learnable prototypes and dynamically updated according to the input pixel-level representations. After the update, the pixels are assigned to the nearest concept in the representation space.
For evaluation, we apply ACSeg on commonly used
semantic segmentation datasets, including PASCAL VOC
2012 [11] and COCO-Stuff [21,29]. The experimental
results show that the proposed ACSeg surpasses previous
methods on different settings of unsupervised semantic seg-
mentation tasks and achieves state-of-the-art performance
on the PASCAL VOC 2012 unsupervised semantic segmen-
tation benchmark without post-processing and re-training.
Moreover, the visualization of the pixel-level representa-
tions and the concepts shows that the ACG is applicable for
decomposing images with various scene complexity. Since
the ACG is fast to converge without learning new represen-
tations and the concept classifier is employed in a zero-shot
manner, we regard the proposed ACSeg as a generalizable method that is easy to modify and adapt to a wide range of unsupervised image understanding tasks.
2. Related Works
Vision Transformer. The Transformer, a model mainly based
on self-attention mechanism, is widely used in natural
language processing [3,8] and cross-modal understand-
ing [22,26,27,38]. The Vision Transformer (ViT) [9] is the
first pure visual transformer model to process images. Re-
cently, Caron et al. [4] proposed self-distillation with no labels (DINO) to train the ViT and found that its features contain explicit information about the segmentation of an image. Based on DINO, some previous studies [15,33,42,46,50,55] successfully extend this property to unsupervised dense prediction tasks.
Unsupervised Semantic Segmentation. With the devel-
opment of self-supervised and unsupervised learning, un-
supervised methods for the semantic segmentation task start
to emerge. Among them, some methods focus on pixel-
level self-supervised representation learning by introduc-
ing cross-view consistency [6,21,23,52,53,57,61], visual
prior [20,45,58], and continuity of video frames [2]. In
contrast, Zadaianchuk et al. [56] adopt pre-trained object-
centric representations and cluster them to segment ob-
jects. Other methods exploit pixel-level knowledge of
2
pre-trained generative models [32] or self-supervised pre-
trained convolutional neural networks [19,49]. Recently, self-supervised ViTs trained with DINO have been explored for unsupervised dense prediction tasks due to their ability to represent pixel-level semantic relationships.
For semantic segmentation, Hamilton et al. [15] train a seg-
mentation head by distilling the feature correspondences,
which further encourages pixel features to form compact
clusters and learn better pixel-level representations. Trans-
FGU [55] obtains semantic segmentation in a top-down
manner by extracting class activation maps from DINO mod-
els. Some approaches use the representations from DINO to
segment images into regions. Melas et al. [33] adopt spec-
tral decomposition on the affinity graph to discover mean-
ingful parts in an image and implement semantic segmen-
tation of an image. MaskDistill [46] uses hand-crafted
rules based on pixel-level representations to find the salient
region in an image. In contrast, we aim at better extracting
underlying concepts among the representations from DINO
in an image by tackling the over/under-clustering problem.
Semantic Segmentation with Text. Vision-language pre-
training models enable learning without annotations or
zero-shot transfer on vision tasks [38]. For semantic seg-
mentation, MaskCLIP [60] modifies the visual encoder of
CLIP [38] and applies the text-based classifier on pixel
level. Xu et al. [54] propose GroupViT, a hierarchical
grouping vision transformer, and train it with image-to-text
contrastive loss. Finally, the semantic segmentation results
can be obtained by the grouping result and text embeddings.
ReCo [41] leverages the retrieval abilities of CLIP and the
robust correspondences offered by modern image represen-
tations to co-segment entities. Shin et al. [40] use CLIP
to construct category-specific images and produce pseudo-
labels with a category-agnostic salient object detector boot-
strapped from DINO. In this paper, we also show that our
method can be combined with recent vision-language pre-
trained model to perform semantic segmentation with only
text-image supervision.
3. The Proposed ACSeg
In this section, we describe the proposed method for USS
in detail, starting from the whole framework.
3.1. Overall Approach
Figure 3 illustrates the overall structure of ACSeg. Start-
ing with an image, we first apply a self-supervised ViT to
generate pixel-level representations. As mentioned above,
these representations contain underlying concepts, which
represent meaningful groups or regions of pixels. The
Adaptive Concept Generator (ACG) is designed to output
the concepts explicitly. Specifically, the ACG takes a se-
ries of learnable prototypes as input and iteratively updates
them by interacting with the pixel-level representations, re-
sulting in adaptive concept representations for each image.
Finally, the concepts are explicitly represented by pixel
groups, which are obtained by assigning each pixel to the
nearest concept in the representation space.
For optimization, we propose a novel loss function called
modularity loss to train the ACG without any annotations.
Intuitively, the modularity loss works on pixel pairs. We
construct an affinity graph taking the pixel-level represen-
tations as vertices and their cosine similarity as edges. The
modularity loss calculates the intensity of two pixels be-
longing to the same concept using the metric defined in
modularity [35], thus adjusting the concept representations.
At last, the concept classifier assigns each concept to a pre-
defined category to obtain per-pixel class predictions, i.e., se-
mantic segmentation of an image. We introduce the details
of each component in ACSeg in the following sections.
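For concreteness, the following sketch traces this pipeline end to end. The module names (`vit`, `acg`, `concept_classifier`), tensor shapes, and the use of PyTorch are illustrative assumptions for this exposition, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def acseg_inference(image, vit, acg, concept_classifier):
    """Illustrative ACSeg inference pass; all shapes and module interfaces are assumed."""
    # 1) Frozen self-supervised ViT produces pixel-level (patch-token) representations.
    #    X: (n, d), n = number of patch tokens, d = embedding dimension.
    X = vit(image)

    # 2) The Adaptive Concept Generator maps k learnable prototypes to
    #    image-specific concept representations C: (k, d).
    C = acg(X)

    # 3) Pixel assignment: each pixel goes to its most similar concept (cosine similarity).
    S = F.normalize(X, dim=-1) @ F.normalize(C, dim=-1).T    # (n, k) soft assignment
    assignment = S.argmax(dim=-1)                             # (n,) hard assignment

    # 4) The unsupervised concept classifier maps each discovered concept
    #    (i.e., its image region) to one of the pre-defined categories.
    concept_labels = concept_classifier(X, assignment)        # (k,) category id per concept

    # Per-pixel prediction = the category of the concept each pixel is assigned to.
    return concept_labels[assignment]
```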
3.2. Adaptive Concept Generator
The role of ACG is to map the initial prototypes to the
concept representations in each image. Since the concept
representations are different in different images and depend
on the pixel-level representations of the image, we intro-
duce the scaled dot-product attention [47] to iteratively up-
date the prototypes according to the pixel-level representa-
tions. Specifically, we first apply cross-attention taking the
prototypes as the query and the pixel-level representations
as the key and value. Let $C_l \in \mathbb{R}^{k \times d}$ denote the $k$ prototypes after the $l$-th update and $X \in \mathbb{R}^{n \times d}$ denote the $n$ pixel-level representations from an image; the cross-attention can be formulated as
$$\bar{C}_l = \mathrm{Softmax}\left(\frac{C_{l-1} W_q (X W_k)^T}{\sqrt{d}}\right)(X W_v), \quad (1)$$
$$C_l = C_{l-1} + \bar{C}_l W_o, \quad (2)$$
where $W_q, W_k, W_v, W_o \in \mathbb{R}^{d \times d}$ are learnable linear pro-
jections. The cross-attention updates prototypes adaptively
with the pixel-level representations, which makes it possi-
ble to generate concepts adaptively for different images.
After that, self-attention is used to model the connections
among different concepts. Formally, it can be expressed as
$$\bar{C}_l = \mathrm{Softmax}\left(\frac{C_{l-1} W_q (C_{l-1} W_k)^T}{\sqrt{d}}\right)(C_{l-1} W_v), \quad (3)$$
$$C_l = C_{l-1} + \bar{C}_l W_o. \quad (4)$$
The self-attention updates each prototype with other proto-
types and makes it aware of the presence of others, for better
adjusting their relative positions in the embedding space.
Figure 3. Illustration of the proposed ACSeg. For an image, we first use a self-supervised ViT to extract pixel-level representations, which imply the semantic relationships among pixels. The Adaptive Concept Generator (ACG) dynamically updates the initial prototypes to the underlying concepts in the representation space through scaled dot-product attention. The assignment of pixels is produced by the cosine similarity between the pixel-level representations and the concepts, and the modularity loss is used to optimize the ACG. Finally, the concept classifier assigns each concept to a pre-defined category, thus obtaining the semantic segmentation of an image.

The ACG consists of $N$ update steps, and each update step is made up of cross-attention, self-attention, and
a Feed-forward Network (FFN) [47]. With the attention
mechanism, the ACG can learn the mapping from initial prototypes to concepts adaptively for different images. For implementation, we adopt multi-head attention, layer normalization, and residual connections after the attention operation and the FFN, following the Transformer [47].
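A minimal PyTorch sketch of one ACG update step following Eqs. (1)-(4), plus the surrounding module holding the learnable prototypes, is given below. The single-head attention and the hyperparameter defaults (d, k, number of steps) are assumptions for illustration; the paper itself uses multi-head attention, and the exact settings are not reproduced here.

```python
import torch
import torch.nn as nn


class ACGUpdateStep(nn.Module):
    """One update step: cross-attention (Eqs. 1-2), self-attention (Eqs. 3-4), then an FFN."""

    def __init__(self, d: int, num_heads: int = 1):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, C: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        # C: (B, k, d) prototypes/concepts; X: (B, n, d) pixel-level representations.
        # Eqs. (1)-(2): prototypes attend to the pixel-level representations.
        C = self.norm1(C + self.cross_attn(C, X, X)[0])
        # Eqs. (3)-(4): prototypes attend to each other.
        C = self.norm2(C + self.self_attn(C, C, C)[0])
        # Feed-forward network with a residual connection.
        C = self.norm3(C + self.ffn(C))
        return C


class AdaptiveConceptGenerator(nn.Module):
    """Maps k learnable initial prototypes to image-specific concepts (a sketch)."""

    def __init__(self, d: int = 384, k: int = 5, num_steps: int = 3):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(k, d))   # learnable initial prototypes
        self.steps = nn.ModuleList([ACGUpdateStep(d) for _ in range(num_steps)])

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (B, n, d) pixel-level representations -> C: (B, k, d) concepts.
        C = self.prototypes.unsqueeze(0).expand(X.size(0), -1, -1)
        for step in self.steps:
            C = step(C, X)
        return C
```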
3.3. Pixel Assignment
After ACG, each image has its own concepts. Abstractly,
each concept is a vector in the representation space, approx-
imately the average of a group of gathered pixels. Con-
cretely, a concept consists of pixels with the same seman-
tics. This abstract-to-concrete transformation is achieved by
assigning each pixel to a concept in ACSeg.
We first get a soft assignment for each pixel by calcu-
lating the cosine similarity with the concepts in the same
image
$$S_{i,j} = \cos\langle x_i, c_j \rangle, \quad (5)$$
where $S \in \mathbb{R}^{n \times k}$ is the assignment matrix, $x_i = X_{i,:}$ is the $i$-th pixel embedding, and $c_j = C_{j,:}$ is the $j$-th concept. The
soft assignment is differentiable and is used to optimize the
network when training, which is described in Section 3.4.
We assign each pixel to a definite concept during inference
by the maximum similarity
$$a_i = \underset{j}{\arg\max}\; \cos\langle x_i, c_j \rangle. \quad (6)$$
By doing so, an image is segmented into $m$ regions. Each region is identified by a concept and consists of the pixels assigned to this concept. It is worth noting that $m$ can be different for different images because the assignment is obtained by an argmax operation, which does not guarantee that every concept is assigned at least once. Due to the adaptive nature of this assignment and the adaptive generation of concepts for each image, we name the network the Adaptive Concept Generator.
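A sketch of this assignment step, Eqs. (5)-(6), is shown below, including how the hard assignment yields a variable number $m$ of regions per image; the reshaping of the $n$ tokens to an $h \times w$ patch grid is an assumption about their layout.

```python
import torch
import torch.nn.functional as F

def assign_pixels(X: torch.Tensor, C: torch.Tensor, h: int, w: int):
    """Soft/hard assignment of n = h*w pixels to k concepts for one image (Eqs. 5-6)."""
    # Eq. (5): cosine similarity between every pixel and every concept
    # (differentiable, used by the modularity loss during training).
    S = F.normalize(X, dim=-1) @ F.normalize(C, dim=-1).T         # (n, k)

    # Eq. (6): hard assignment used at inference.
    a = S.argmax(dim=-1)                                           # (n,)

    # Concepts that win no pixel simply vanish, so the number of regions
    # m <= k adapts to the image.
    used = a.unique()                                              # the m active concepts
    masks = torch.stack([(a == c).reshape(h, w) for c in used])    # (m, h, w) binary regions
    return S, a.reshape(h, w), masks
```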
3.4. Modularity Loss
For training the ACG, we design a loss function based
on the idea of estimating the intensity of assigning two pix-
els to the same concept. To achieve this goal, we introduce
modularity [35], which is commonly used in community de-
tection. The modularity is built upon a graph, thus we first
construct a fully connected undirected affinity graph for pix-
els from an image by taking them as vertices. The weight
of the edge between two pixels, which represents their affinity, is calculated as their cosine similarity:
$$A_{i,j} = \max(0, \cos\langle x_i, x_j \rangle). \quad (7)$$
Here we truncate the value to a minimum of zero to avoid
negative values in the calculation. Given two vertices $i$, $j$, following modularity, we estimate the intensity $w_{ij}$ of assigning them to the same concept by
$$w_{ij} = A_{i,j} - \frac{k_i \cdot k_j}{2m}, \quad (8)$$
where $k_i = \sum_j A_{i,j}$ is the sum of the weights of edges connected to vertex $i$ and $2m = \sum_{i,j} A_{i,j}$ is the sum of all edge weights in the graph. Intuitively, $w_{ij}$ reflects the intensity of dividing pixels $i$ and $j$ into the same cluster by comparing the actual affinity $A_{i,j}$ with the affinity expected under a random graph.
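These two quantities translate directly into a few tensor operations. The sketch below computes the affinity graph of Eq. (7) and the intensities $w_{ij}$ of Eq. (8); the `modularity_loss` function is only a plausible way of combining them with the soft assignment $S$ of Eq. (5), since the exact loss formula is not reproduced in this excerpt.

```python
import torch
import torch.nn.functional as F

def modularity_weights(X: torch.Tensor) -> torch.Tensor:
    """Pairwise intensities w_ij for assigning pixels i, j to the same concept (Eqs. 7-8)."""
    Xn = F.normalize(X, dim=-1)
    # Eq. (7): fully connected affinity graph from truncated cosine similarity.
    A = (Xn @ Xn.T).clamp(min=0)                   # (n, n)
    k = A.sum(dim=-1)                              # k_i: total edge weight at vertex i
    two_m = A.sum()                                # 2m: total edge weight of the graph
    # Eq. (8): actual affinity minus the affinity expected under a random graph.
    return A - torch.outer(k, k) / two_m           # (n, n) matrix of w_ij

def modularity_loss(S: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Hypothetical training objective: pixel pairs with large w_ij should share a concept.

    S is the (n, k) soft assignment of Eq. (5); the weighting ACSeg actually uses
    is not shown in this excerpt, so treat this as an illustrative stand-in.
    """
    co_assign = S @ S.T                            # (n, n) agreement of soft assignments
    return -(W * co_assign).mean()
```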