
view consistency [6,21], edge detection [20,58], or saliency
priors [45]. Recently, the self-supervised ViT [4] has provided a
new paradigm for USS because its pixel-level representations
contain semantic information. Figure 1 makes this intuitive:
in the representation space of an image, the pixel-level
representations produced by the self-supervised ViT form
underlying clusters. When projected back into the image,
these clusters become semantically consistent groups of
pixels or regions representing “concepts”.
In this work, we aim to achieve USS by accurately extracting
and classifying these “concepts” in the pixel representation
space of each image. Unlike previous attempts, which only
consider a foreground-background partition [42,46,50] or
divide each image into a fixed number of clusters [19,33],
we argue that it is crucial to treat different images
differently because of the varying complexity of scenes
(Figure 1). We therefore propose Adaptive Conceptualization
for unsupervised semantic Segmentation (ACSeg), a framework
that finds these underlying concepts adaptively for each
image and achieves USS by classifying the discovered
concepts in an unsupervised manner.
To achieve conceptualization, we explicitly encode concepts
as learnable prototypes and adaptively update them for
different images with a network, as shown in Figure 2. This
network, named the Adaptive Concept Generator (ACG), is
implemented by iteratively applying scaled dot-product
attention [47] to the prototypes and the pixel-level
representations of the image being processed. Through this
structure, the ACG learns to project the initial prototypes
to the concepts in the representation space, conditioned on
the input pixel-level representations. The concepts are then
made explicit in the image as different regions by assigning
each pixel to the nearest concept in the representation
space.
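To make this structure concrete, the following is a minimal PyTorch sketch of one plausible instantiation of the iterative prototype update and the nearest-concept assignment; the prototype count, feature dimension, iteration number, and the residual form of the update are our own illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of the ACG idea, assuming PyTorch. Hyper-parameters
# (n_prototypes, dim, n_iters) and the residual update form are
# illustrative assumptions, not the paper's exact design.
import torch
import torch.nn.functional as F

class ACG(torch.nn.Module):
    def __init__(self, n_prototypes: int = 8, dim: int = 384, n_iters: int = 3):
        super().__init__()
        # Learnable initial prototypes, shared across all images.
        self.prototypes = torch.nn.Parameter(torch.randn(n_prototypes, dim))
        self.n_iters = n_iters

    def forward(self, pixels: torch.Tensor):
        # pixels: (N, dim) pixel-level representations of one image.
        p = self.prototypes
        scale = pixels.shape[-1] ** 0.5
        for _ in range(self.n_iters):
            # Scaled dot-product attention: prototypes attend to the
            # pixels, so the update depends on the image being processed.
            attn = F.softmax(p @ pixels.t() / scale, dim=-1)  # (K, N)
            p = p + attn @ pixels
            # (A self-attention step over p could model the prototypes
            # updating each other, as suggested by Figure 2.)
        # Assign each pixel to the nearest concept in representation space.
        assignment = torch.cdist(pixels, p).argmin(dim=-1)  # (N,)
        return p, assignment
```

Pixels sharing an assignment index then form one concept region of the image.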
The ACG is optimized end to end, without any annotations,
using the proposed modularity loss. Specifically, we
construct an affinity graph over the pixel-level
representations and use the connectivity of two pixels in
this graph to modulate the strength with which they are
assigned to the same concept, motivated by modularity [35].
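For reference, modularity [35] scores a partition of a graph by comparing within-group affinity to what a random graph with the same degrees would yield; a soft, differentiable surrogate of this score is one natural way to obtain a training loss. The second formula below is such a sketch under our own assumptions, not necessarily the paper's exact objective.

```latex
% Modularity [35]: A_{ij} is the affinity between pixels i and j,
% k_i = \sum_j A_{ij}, 2m = \sum_{i,j} A_{ij}, and \delta(c_i, c_j) = 1
% iff pixels i and j are assigned to the same concept.
\[
  Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)
\]
% A differentiable surrogate (our sketch): replace \delta with the inner
% product of soft concept-assignment vectors s_i and minimize
\[
  \mathcal{L}_{\mathrm{mod}} = -\frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) s_i^{\top} s_j .
\]
```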
As the main part of ACSeg, the ACG achieves precise
conceptualization for different images thanks to its
adaptiveness, which is reflected in two aspects. First, it
adaptively operates on the pixel-level representations of
different images through its dynamic update structure.
Second, the training objective does not constrain the number
of concepts, yielding an adaptive number of concepts per
image. With these properties, we obtain accurate partitions
for images of different scene complexity via the concepts
produced by the ACG, as shown in Figure 1(c). In ACSeg, the
semantic segmentation of an image is therefore finally
achieved by matting the corresponding regions in the image
and classifying them with powerful image-level pre-trained
models.
Figure 2. Intuitive explanation of the basic idea of the ACG.
The concepts are explicitly encoded as learnable prototypes and
dynamically updated according to the input pixel-level representa-
tions. After the update, the pixels are assigned to the nearest
concept in the representation space.
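As an illustration of this final step, the excerpt does not name the image-level model; the sketch below assumes a CLIP-style zero-shot classifier applied to each matted concept region, with a hypothetical label set and prompt template.

```python
# Hedged sketch of the final classification step: the text only says
# "powerful image-level pre-trained models"; CLIP, the prompt template,
# and the class list below are illustrative assumptions.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "car", "person"]  # hypothetical label set
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(prompts)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def classify_region(region: Image.Image) -> str:
    """Zero-shot classify one matted concept region."""
    image = preprocess(region).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return class_names[(img_feat @ text_feat.t()).argmax().item()]
```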
For evaluation, we apply ACSeg to commonly used semantic
segmentation datasets, including PASCAL VOC 2012 [11] and
COCO-Stuff [21,29]. The experimental results show that the
proposed ACSeg surpasses previous methods under different
settings of the unsupervised semantic segmentation task and
achieves state-of-the-art performance on the PASCAL VOC 2012
unsupervised semantic segmentation benchmark without
post-processing or re-training.
Moreover, the visualization of the pixel-level
representations and the concepts shows that the ACG is
applicable to decomposing images of various scene
complexity. Since the ACG converges quickly without learning
new representations, and the concept classifier is employed
in a zero-shot manner, we regard the proposed ACSeg as a
generalizable method that is easy to modify and adapt to a
wide range of unsupervised image understanding tasks.
2. Related Works
Vision Transformer. Transformer, a model mainly based on
the self-attention mechanism, is widely used in natural
language processing [3,8] and cross-modal understand-
ing [22,26,27,38]. The Vision Transformer (ViT) [9] is the
first purely visual transformer model for processing images.
Recently, Caron et al. [4] proposed self-distillation with no
labels (DINO) to train the ViT and found that its features
contain explicit information about the segmentation of an
image. Based on DINO, several previous studies
[15,33,42,46,50,55] successfully extend this property to
unsupervised dense prediction tasks.
Unsupervised Semantic Segmentation. With the development
of self-supervised and unsupervised learning, unsupervised
methods for the semantic segmentation task have started to
emerge. Among them, some methods focus on pixel-level
self-supervised representation learning by introducing
cross-view consistency [6,21,23,52,53,57,61], visual
priors [20,45,58], or the continuity of video frames [2]. In
contrast, Zadaianchuk et al. [56] adopt pre-trained object-
centric representations and cluster them to segment ob-
jects. Other methods exploit pixel-level knowledge of