like brightness or affine transformations of the image.
In this paper we propose a novel method that leverages
the recent advances in the field of Neural Superpixel Seg-
mentation and GNNs to improve segmentation accuracy.
Specifically, given an input image, we propose to first pre-
dict its soft superpixel representation using a superpixel ex-
traction CNN (SPNN). Then, we employ a GNN to refine
and allow the interaction of superpixel features. This step is
crucial to share semantic information across distant pixels
in the original image. Lastly, we project the learnt features
back into an image and feed it to a designated segmentation
CNN to obtain the final semantic segmentation map. We
distinguish our work from existing methods by first observing that most existing methods rely on pixel-wise label prediction and differ mainly in their training procedure or architecture. Also, while some recent works [41] propose incorporating superpixel information for unsupervised image semantic segmentation, the superpixels are obtained from non-differentiable methods such as SLIC [1] or SEEDS [52], which rely on low-level features like color and pixel proximity. Our proposed method is differentiable and can be trained in an end-to-end manner, which allows us to jointly learn the superpixel representation and its high-level features together with the predicted segmentation map.
Also, the aforementioned methods did not incorporate a
GNN as part of the network architecture. We show in our experiments that these additions are key to obtaining improved accuracy. Our contributions are as follows:
• We propose SGSeg: a method that incorporates super-
pixel segmentation, GNNs and MIM for unsupervised
image semantic segmentation.
• We show that our network improves the CNN baseline
and other recent models, reaching similar or better accuracy than state-of-the-art models on 4 datasets.
• Our extensive ablation study reveals the importance of
superpixels and GNNs in unsupervised image segmen-
tation, and provides a solid evaluation of our method.
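To make the pipeline concrete, the three stages can be sketched at the level of tensor shapes. Everything below (the layer stand-ins, dimensions, and names such as `spnn` and `gnn_refine`) is illustrative only, not the architecture proposed here:

```python
import numpy as np

# Shape-level sketch of the three-stage pipeline: superpixel extraction,
# GNN refinement of superpixel features, and projection back to the pixel
# grid for a final segmentation head. All layers are random stand-ins.
rng = np.random.default_rng(0)
H, W, C, N, K = 8, 8, 16, 10, 4   # image size, feature dim, #superpixels, #classes

def spnn(image):
    """Stand-in for the superpixel extraction CNN (SPNN): returns a soft
    assignment q (N x H*W, each pixel's column sums to 1) and pixel
    features. A real SPNN would compute both from the image."""
    logits = rng.standard_normal((N, H * W))
    q = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    feats = rng.standard_normal((C, H * W))
    return q, feats

def gnn_refine(sp_feats):
    """Stand-in for the GNN: one uniform message-passing step that lets
    every superpixel aggregate the features of all the others."""
    adj = np.full((N, N), 1.0 / N)
    return adj @ sp_feats

image = rng.standard_normal((H, W, 3))
q, pixel_feats = spnn(image)

# Pool pixel features into superpixel features (soft weighted average).
sp_feats = (q @ pixel_feats.T) / q.sum(axis=1, keepdims=True)   # N x C
sp_feats = gnn_refine(sp_feats)

# Project refined superpixel features back onto the pixel grid and apply
# a stand-in 1x1 segmentation head mapping C features to K classes.
proj = (q.T @ sp_feats).reshape(H, W, C)
seg_map = (proj @ rng.standard_normal((C, K))).argmax(axis=-1)
```

The key point of the sketch is the round trip through the compact N-superpixel representation, which is where the GNN operates.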
The rest of the paper is outlined as follows: In Sec. 2
we cover in detail existing methods and related background
material. In Sec. 3 we present our method and in Sec. 4 we
present our numerical experiments.
2. Related Work
2.1. Unsupervised Semantic Segmentation
The unsupervised image segmentation task seeks to se-
mantically classify each pixel in an image without the use of
ground-truth labels. Early models like the Geodesic Active
Contours [7] propose a variational approach that minimizes a functional to obtain background-object segmentation. Other works propose to extract information from low-
level features, for instance by considering the histogram of
the red-green-blue (RGB) values of the image pixels [44]
and by employing a Markov random field [12] to model the
semantic relations of pixels. In the context of deep learn-
ing frameworks, there has been great improvement in recent
years [42, 25, 31, 10]. The common factor of these methods is the incorporation of the concept of MIM, which measures the similarity between two tensors of possibly different sizes and from different sources [57]. Specifically, the
aforementioned methods encourage the network to predict consistent segmentation maps under image transformations and perturbations such as affine transformations [31]. Other methods propose to apply the transformation to the convolutional kernels [42] and require consistent predictions across different rasterizations of the convolution kernels. Such an approach improves the consistency of the learnt features under transformations while avoiding the forward and inverse transformation costs incurred in [31]. Additional works like [41] also follow the idea of MIM, but propose to employ adversarial perturbations of the image in addition to geometric transformations.
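One common way to instantiate such a MIM objective, roughly in the spirit of [31], is to form the joint class distribution of the predictions on an image and on its transformed copy, and score its mutual information. The sketch below is an illustrative formulation with toy inputs, not the exact loss of any one cited paper:

```python
import numpy as np

# Illustrative MIM objective: build the K x K joint class distribution of
# two per-pixel soft predictions and compute its mutual information.
rng = np.random.default_rng(0)
K, M = 4, 1000                       # number of classes, number of pixels

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mutual_information(pa, pb):
    """MI of the empirical joint class distribution of two soft maps."""
    joint = (pa.T @ pb) / pa.shape[0]        # K x K, entries sum to 1
    joint = (joint + joint.T) / 2            # symmetrize the joint
    pi_, pj_ = joint.sum(axis=1), joint.sum(axis=0)
    return float((joint * (np.log(joint)
                           - np.log(pi_[:, None] * pj_[None, :]))).sum())

p1 = softmax(rng.standard_normal((M, K)))    # predictions on the image
p2 = softmax(rng.standard_normal((M, K)))    # predictions on its transform

# Identical predictions share maximal information, while unrelated random
# predictions give a joint close to the product of marginals (MI near 0).
mi_same = mutual_information(p1, p1)
mi_diff = mutual_information(p1, p2)
```

Maximizing this quantity pushes the two predictions toward being consistent yet non-degenerate, since a constant prediction also drives the mutual information to zero.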
Operating on a pixel-level representation of images to
obtain a segmentation map is computationally demanding
[33, 14], and modelling the interactions between distant pixels in the image requires very deep networks, which are harder to train [22]. It is therefore natural to consider the
compact superpixel representation of the image. The su-
perpixel representation stems from the over-segmentation
problem [1], where one seeks to segment the image into se-
mantically meaningful regions represented by the superpix-
els. Also, superpixels are often utilized as an initial guess
for non-deep learning based unsupervised image semantic
segmentation [58, 64, 60] that greatly reduces the complex-
ity of transferring an input image to its semantic segmenta-
tion map. In the context of CNNs, it was proposed [41] to
utilize a superpixel representation based on SLIC in addi-
tion to the pixel representation to improve accuracy.
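For intuition on the low-level nature of such classical methods, a single SLIC-flavored assignment step can be sketched as follows. Real SLIC [1] alternates assignment and center updates within local search windows; the random center initialization and the compactness weight `m` below are simplifications:

```python
import numpy as np

# One SLIC-flavored assignment step: each pixel goes to the nearest
# cluster center under a combined color + spatial distance.
rng = np.random.default_rng(0)
H, W, N, m = 32, 32, 8, 10.0         # image size, #superpixels, compactness

image = rng.random((H, W, 3))        # toy RGB image in [0, 1]
ys, xs = np.mgrid[0:H, 0:W]
# Each row of `pixels`: (r, g, b, y, x), coordinates normalized to [0, 1).
pixels = np.concatenate([image.reshape(-1, 3),
                         np.stack([ys.ravel() / H, xs.ravel() / W], axis=1)],
                        axis=1)

# Initialize centers at random pixels (SLIC uses a regular grid instead).
centers = pixels[rng.choice(H * W, size=N, replace=False)]

color_d = ((pixels[:, None, :3] - centers[None, :, :3]) ** 2).sum(-1)
space_d = ((pixels[:, None, 3:] - centers[None, :, 3:]) ** 2).sum(-1)
assignment = (color_d + m * space_d).argmin(axis=1).reshape(H, W)
```

The assignment depends only on raw color and pixel proximity, with no learnable component, which is exactly the limitation the differentiable SPNN-based approach addresses.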
Furthermore, other works propose to use pre-trained networks to obtain semantic information about the images [23, 6], and [32] utilized co-segmentation from multiple views of images. In this work we focus on networks
that are neither trained nor pre-trained with any labelled
data and are based on a single image view.
2.2. Neural Superpixel Segmentation
The superpixel segmentation task considers the problem
of over-segmenting an image, i.e., dividing the image into
several sub-regions, where each sub-region has similar features within its pixels. That is, given an image I ∈ R^(H×W), a superpixel segmentation algorithm returns an assignment matrix π ∈ [0, . . . , N − 1]^(H×W) that classifies each pixel into a superpixel, where N is the number of superpixels.
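A differentiable pipeline typically predicts a soft assignment over the N superpixels and recovers the hard map π by a per-pixel argmax; a minimal sketch, where the random soft assignment stands in for a network output:

```python
import numpy as np

# Recovering the hard assignment matrix pi from a soft superpixel
# prediction q of shape N x H x W: a per-pixel argmax over the N
# channels. The random q below stands in for a network prediction.
rng = np.random.default_rng(0)
N, H, W = 16, 4, 6

logits = rng.standard_normal((N, H, W))
q = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # sums to 1 per pixel

pi = q.argmax(axis=0)   # hard assignment in [0, ..., N-1]^(H x W)
```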
Classical methods like SLIC [1] use Euclidean coordinates
and color space similarity to define superpixels, while other