Unsupervised Image Semantic Segmentation through Superpixels and Graph
Neural Networks
Moshe Eliasof
Ben-Gurion University of the Negev
Beer-Sheva, Israel
eliasof@post.bgu.ac.il
Nir Ben Zikri
Ben-Gurion University of the Negev
Beer-Sheva, Israel
nirbenz@post.bgu.ac.il
Eran Treister
Ben-Gurion University of the Negev
Beer-Sheva, Israel
erant@cs.bgu.ac.il
Abstract
Unsupervised image segmentation is an important task in many real-world scenarios where labelled data is scarce. In this paper we propose a novel approach that harnesses recent advances in unsupervised learning, combining Mutual Information Maximization (MIM), Neural Superpixel Segmentation and Graph Neural Networks (GNNs) in an end-to-end manner, a combination that has not been explored before. We take advantage of the compact representation of superpixels and combine it with GNNs in order to learn strong and semantically meaningful representations of images. Specifically, we show that our GNN-based approach makes it possible to model interactions between distant pixels in the image and serves as a strong prior for existing CNNs, improving accuracy. Our experiments reveal both the qualitative and quantitative advantages of our approach compared to current state-of-the-art methods on four popular datasets.
1. Introduction
The emergence of Convolutional Neural Networks (CNNs) in recent years [37, 26] has tremendously transformed the handling and processing of information in various fields, from Computer Vision [46, 26, 9] to Computational Biology [48] and others. Specifically, the task of supervised semantic segmentation has been widely studied in a series of works like VGG [49], U-net [46], DeepLab [9] and others [50, 39, 63, 19, 40]. However, the task of unsupervised image semantic segmentation using deep learning frameworks, where no labels are available, has been less researched until recently [31, 42, 25, 10]. These methods
[Figure 1: Illustration of the superpixel extraction and semantic segmentation on COCO-Stuff. (a) Input images. (b) Superpixelated images. (c) Predicted semantic segmentation maps. (d) Ground-truth semantic segmentation maps. Non-stuff examples are marked in black.]
mostly rely on the concept of Mutual Information Maximization (MIM), which was used for image and volume registration in classical methods [57, 55], and was recently incorporated into CNNs and GNNs by [27] and [54], respectively. The concept of MIM suggests generating two or more perturbations of the input image (e.g., by geometric or photometric variations), feeding them to a neural network, and demanding consistent outputs. The motivation of this approach is to steer the network towards learning semantic features from the data, while ignoring low-level variations
like brightness or affine transformations of the image.
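To make the MIM idea concrete, below is a minimal PyTorch sketch of an MIM consistency objective in the spirit of IIC [31]; the function name and the per-pixel flattening convention are our own assumptions, not code from the paper.

```python
import torch

def mim_loss(p1: torch.Tensor, p2: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """IIC-style mutual information objective for two views (sketch)."""
    # p1, p2: (B, k) softmax outputs for two perturbed views of the same
    # inputs; per-pixel predictions can be flattened so B = batch * H * W.
    joint = p1.t() @ p2 / p1.shape[0]                 # (k, k) empirical joint
    joint = ((joint + joint.t()) / 2).clamp(min=eps)  # symmetrize over views
    pi = joint.sum(dim=1, keepdim=True)               # marginal of view 1
    pj = joint.sum(dim=0, keepdim=True)               # marginal of view 2
    mi = (joint * (joint.log() - pi.log() - pj.log())).sum()
    return -mi                                        # maximize MI = minimize -MI
```

Maximizing the mutual information rewards predictions that are both consistent across the two views and spread across the k classes, which discourages the degenerate solution of assigning all pixels to a single cluster.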
In this paper we propose a novel method that leverages recent advances in the fields of Neural Superpixel Segmentation and GNNs to improve segmentation accuracy. Specifically, given an input image, we propose to first predict its soft superpixel representation using a superpixel extraction CNN (SPNN). Then, we employ a GNN to refine the superpixel features and allow them to interact. This step is crucial for sharing semantic information across distant pixels in the original image. Lastly, we project the learnt features back into an image and feed it to a dedicated segmentation CNN to obtain the final semantic segmentation map. We distinguish our work from existing methods by first observing that other methods usually rely on a pixel-wise label prediction, varying mostly in the training procedure or architectural choices. Also, while some recent works [41] propose incorporating superpixel information for the task of unsupervised image semantic segmentation, that information was obtained from a non-differentiable method like SLIC [1] or SEEDS [52], which rely on low-level features like color and pixel proximity. Our proposed method is differentiable and can be trained in an end-to-end manner, which allows jointly learning the superpixel representation and its high-level features together with the predicted segmentation map. Also, the aforementioned methods did not incorporate a GNN as part of the network architecture. We show in our experiments that those additions are key to obtaining accuracy improvements. Our contributions are as follows:
• We propose SGSeg: a method that incorporates superpixel segmentation, GNNs and MIM for unsupervised image semantic segmentation.
• We show that our network improves on the CNN baseline and other recent models, reaching similar or better accuracy than state-of-the-art models on 4 datasets.
• Our extensive ablation study reveals the importance of superpixels and GNNs in unsupervised image segmentation, and provides a solid evaluation of our method.
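To make the pipeline described above concrete, the following is a minimal PyTorch skeleton of the three stages. The module interfaces (an SPNN returning soft assignment logits and pixel features, a GNN operating on the set of superpixel feature vectors) are our assumptions for illustration, not the paper's exact code.

```python
import torch
import torch.nn as nn

class SGSegSketch(nn.Module):
    """Hypothetical skeleton of the three-stage pipeline; spnn, gnn and
    seg_cnn are placeholders for the actual sub-networks."""

    def __init__(self, spnn: nn.Module, gnn: nn.Module, seg_cnn: nn.Module):
        super().__init__()
        self.spnn, self.gnn, self.seg_cnn = spnn, gnn, seg_cnn

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W). Assumed SPNN outputs: soft superpixel
        # logits (B, N, H, W) and pixel features (B, c, H, W).
        logits, feats = self.spnn(image)
        P = logits.softmax(dim=1)                       # soft assignments
        w = P / P.sum(dim=(2, 3), keepdim=True).clamp(min=1e-8)
        sp = torch.einsum('bnhw,bchw->bnc', w, feats)   # pool pixels -> superpixels
        sp = self.gnn(sp)                               # superpixel interaction
        refined = torch.einsum('bnhw,bnc->bchw', P, sp) # project back to pixels
        # seg_cnn is assumed to accept 3 + c input channels
        return self.seg_cnn(torch.cat([image, refined], dim=1))
```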
The rest of the paper is outlined as follows: In Sec. 2
we cover in detail existing methods and related background
material. In Sec. 3 we present our method and in Sec. 4 we
present our numerical experiments.
2. Related Work
2.1. Unsupervised Semantic Segmentation
The unsupervised image segmentation task seeks to semantically classify each pixel in an image without the use of ground-truth labels. Early models like Geodesic Active Contours [7] propose a variational approach that minimizes a functional to obtain background-object segmentation. Other works propose to extract information from low-level features, for instance by considering the histogram of the red-green-blue (RGB) values of the image pixels [44], or by employing a Markov random field [12] to model the semantic relations of pixels. In the context of deep learning frameworks, there has been great improvement in recent years [42, 25, 31, 10]. The common factor of these methods is the incorporation of MIM, which measures the similarity between two tensors of possibly different sizes and from different sources [57]. Specifically, the aforementioned methods encourage the network to predict consistent segmentation maps with respect to image transformations and perturbations such as affine transformations [31]. Other methods propose to apply the transformation to the convolutional kernels [42] and demand consistent predictions across different rasterizations of the convolution kernels. Such an approach improves the consistency of the learnt features under transformations while avoiding the forward and inverse transformation costs of [31]. Additional works like [41] also follow the idea of MIM, but instead of using only geometric transformations, propose to also employ adversarial perturbations of the image.
Operating on a pixel-level representation of images to obtain a segmentation map is computationally demanding [33, 14], and modelling the interactions between distant pixels in the image requires very deep networks, which are also harder to train [22]. It is therefore natural to consider the compact superpixel representation of the image. The superpixel representation stems from the over-segmentation problem [1], where one seeks to segment the image into semantically meaningful regions represented by superpixels. Also, superpixels are often utilized as an initial guess in non-deep-learning-based unsupervised image semantic segmentation [58, 64, 60], which greatly reduces the complexity of transferring an input image to its semantic segmentation map. In the context of CNNs, it was proposed [41] to utilize a superpixel representation based on SLIC, in addition to the pixel representation, to improve accuracy.
Furthermore, other works propose to use pre-trained networks in order to obtain semantic information about the images [23, 6], and [32] utilized co-segmentation from multiple views of images. In this work we focus on networks that are neither trained nor pre-trained with any labelled data and that operate on a single image view.
2.2. Neural Superpixel Segmentation
The superpixel segmentation task considers the problem of over-segmenting an image, i.e., dividing the image into several sub-regions, where each sub-region has similar features within its pixels. That is, given an image $\mathbf{I} \in \mathbb{R}^{H \times W}$, a superpixel segmentation algorithm returns an assignment matrix $\pi \in \{0, \ldots, N-1\}^{H \times W}$ that classifies each pixel into a superpixel, where $N$ is the number of superpixels. Classical methods like SLIC [1] use Euclidean coordinates and color-space similarity to define superpixels, while other works like SEEDS [52] and FH [17] define an energy functional that is minimized by graph cuts. Recently, methods like [30] proposed to use CNNs to extract superpixels in a supervised manner, and following that it was shown in [51, 61, 16] that a CNN is also beneficial for unsupervised superpixel extraction, substantially improving the accuracy of classical methods like SLIC, SEEDS and FH.
We note that in addition to performing better than classi-
cal methods, the mentioned CNN based models are fully-
differentiable, which is a desired property that we leverage
in this work. Specifically, it allows the end-to-end learn-
ing of superpixels and semantic segmentation from images,
which are intimately coupled problems [30].
2.3. Graph Neural Networks
Graph Neural Networks (GNNs) are a generalization of
CNNs, operating on an unstructured grid, and specifically
on data that can be represented as a graph, like point clouds
[56, 15], social networks [35] and protein structures [48].
Among the popular GNNs one can find ChebNet [11], GCN
[35], GAT [53] and others [4, 24, 59]. For a comprehensive
overview of GNNs, we refer the interested reader to [3].
Building on the success of GNNs in handling sparse
point-clouds in 3D, we propose to treat the obtained su-
perpixel representation as a point-cloud in 2D, where each
point corresponds to a superpixel located in 2D inside the
image boundaries. We may also attach a high-dimensional
feature vector to each superpixel, as we elaborate later in
Sec. 3. We consider three types of GNNs that are known
to be useful for data that is geometrically meaningful like
superpixels. We start from a baseline PointNet [45], which
acts as a graph-free GNN consisting of point-wise 1×1
convolutions. We then examine the utilization of DGCNN
[56] which has shown great improvement over PointNet
for 3D point-cloud tasks. However, as we show in Sec. 3, DGCNN does not consider the variable distances between points, which may occur when considering superpixels. We therefore turn to a GNN that considers the distances along the x- and y-axes, based on DiffGCN [15].
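As an illustration of what such a distance-aware layer can look like, here is a sketch of a k-NN message-passing layer over superpixel centroids that feeds the per-edge x/y offsets into the message MLP, in the spirit of DGCNN/DiffGCN. It is our simplification for exposition, not the exact layer used in this work.

```python
import torch
import torch.nn as nn

class OffsetGNNLayer(nn.Module):
    """Sketch of a distance-aware message-passing layer over superpixel
    centroids (DGCNN/DiffGCN flavor; hypothetical simplification)."""

    def __init__(self, c_in: int, c_out: int, k: int = 8):
        super().__init__()
        self.k = k
        # the message MLP sees [own features, neighbor features, x/y offset]
        self.mlp = nn.Sequential(nn.Linear(2 * c_in + 2, c_out), nn.ReLU())

    def forward(self, x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        # x: (N, c_in) superpixel features; pos: (N, 2) centroid coordinates
        dist = torch.cdist(pos, pos)                               # pairwise distances
        idx = dist.topk(self.k + 1, largest=False).indices[:, 1:]  # k-NN, drop self
        neigh_x, neigh_pos = x[idx], pos[idx]                      # (N, k, ...)
        offset = neigh_pos - pos.unsqueeze(1)                      # per-edge x/y offsets
        msg = torch.cat([x.unsqueeze(1).expand(-1, self.k, -1), neigh_x, offset], dim=-1)
        return self.mlp(msg).max(dim=1).values                     # max-aggregation
```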
2.4. Mutual Information in Neural Networks
The concept of Mutual Information in machine learn-
ing tasks has been utilized for image and volume alignment
[57, 55]. Recently, it was implemented into CNNs by the
seminal Deep InfoMax [27] and in GNNs by [54], where
unsupervised learning tasks are addressed by defining a pretext task derived from the data itself, for example by demanding signal reconstruction and enforcing MIM between inputs and their reconstructions or predictions. This concept
was found to be useful in a wide array of applications, from
image superpixel segmentation [51, 61, 16] to unsupervised
image semantic segmentation [31, 42, 41] to unsupervised
graph related tasks [54]. In this paper we utilize mutual
information maximization in a similar fashion to [42], by demanding similar predictions given different rotations and rasterizations of the learnt convolution kernels.
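A minimal sketch of this idea follows, using a single convolution layer for illustration: the same input is passed through four 90-degree rotations of the kernel and the softmax outputs are pushed to agree. This is our reading of the [42]-style objective, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def rotation_consistency_loss(x: torch.Tensor, weight: torch.Tensor,
                              eps: float = 1e-8) -> torch.Tensor:
    """Consistency across 90-degree kernel rotations (illustrative sketch)."""
    # x: (B, C_in, H, W); weight: (C_out, C_in, kh, kw) with kh == kw odd.
    preds = []
    for k in range(4):
        w = torch.rot90(weight, k, dims=[2, 3])   # rotate the kernel spatially
        preds.append(F.conv2d(x, w, padding=weight.shape[-1] // 2).softmax(dim=1))
    loss = x.new_zeros(())
    for p in preds[1:]:
        # symmetric KL between each rotated prediction and the base one
        loss = loss + F.kl_div((p + eps).log(), preds[0], reduction='batchmean')
        loss = loss + F.kl_div((preds[0] + eps).log(), p, reduction='batchmean')
    return loss
```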
3. Method
We start by defining the notations and setup that will be
used throughout this paper. We denote an RGB input image by $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ denote the image height and width, respectively. The goal of unsupervised image semantic segmentation with $k$ classes is to predict a segmentation map $\hat{\mathbf{M}} \in \mathbb{R}^{H \times W \times k}$, and we denote the ground-truth one-hot labels tensor by $\mathbf{M} \in \mathbb{R}^{H \times W \times k}$.
3.1. Superpixel Setup and Notation
In this paper we consider superpixels as a medium to predict the desired segmentation map $\hat{\mathbf{M}}$. Let us denote by $N$ the maximal number of superpixels, which is a hyper-parameter of our method.
In the image superpixel segmentation task, the goal is to assign every pixel in the image $\mathbf{I}$ to a superpixel. We therefore may treat the pixel assignment prediction as a classification problem. Namely, we denote by $\mathbf{P} \in \mathbb{R}^{H \times W \times N}$ a probabilistic representation of the superpixels. The superpixel to which the $(i,j)$-th pixel belongs is given by the hard assignment $s_{i,j} = \arg\max_{s} \mathbf{P}_{i,j,s}$. Thus, we define the value of the $(i,j)$-th pixel of the hard-superpixelated image $\mathbf{I}^{P}$ as follows:

$$\mathbf{I}^{P}_{i,j} = \frac{\sum_{h,w} \mathbb{1}\left[ s_{i,j} = s_{h,w} \right] \mathbf{I}_{h,w}}{\sum_{h,w} \mathbb{1}\left[ s_{i,j} = s_{h,w} \right]}, \qquad (1)$$

where $\mathbb{1}\left[ s_{i,j} = s_{h,w} \right]$ is an indicator function that reads 1 if the hard assignments of the $(i,j)$-th and $(h,w)$-th pixels are the same. Also, let us define the differentiable soft-superpixelated image, which at the $(i,j)$-th pixel reads
$$\hat{\mathbf{I}}^{P}_{i,j} = \sum_{s=0}^{N-1} \mathbf{P}_{i,j,s} \left( \frac{\sum_{h,w} \mathbf{P}_{h,w,s}\, \mathbf{I}_{h,w}}{\sum_{h,w} \mathbf{P}_{h,w,s}} \right). \qquad (2)$$
Consequently, we can also consider the set of superpixels and their features as a cloud of points by denoting $\mathbf{F}^{sp} = \left\{ \mathbf{F}^{sp}_{0}, \ldots, \mathbf{F}^{sp}_{N-1} \right\}$. This set of points is defined as a weighted average of a feature map $\mathbf{F} \in \mathbb{R}^{H \times W \times c}$ that includes the $x, y$ coordinates, RGB values and higher-dimensional features that stem from the penultimate layer of the SPNN network. We define the weighting according to $\mathbf{P}$, such that the feature vector of the $s$-th superpixel is:

$$\mathbf{F}^{sp}_{s} = \frac{\sum_{h,w} \mathbf{P}_{h,w,s}\, \mathbf{F}_{h,w}}{\sum_{h,w} \mathbf{P}_{h,w,s}} \in \mathbb{R}^{c}. \qquad (3)$$
We note that it is also possible to define Eq. (3) using a
hard-assignment as in Eq. (1). However, such a definition
is not differentiable.
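As a reference implementation, here is a small PyTorch sketch of the soft pooling of Eq. (3) and the soft-superpixelated image of Eq. (2) for a single image, assuming $\mathbf{P}$ sums to one over the superpixel dimension at every pixel (e.g., via a softmax); the channels-first tensor layout is our choice.

```python
import torch

def superpixel_features(P: torch.Tensor, feat: torch.Tensor,
                        eps: float = 1e-8) -> torch.Tensor:
    # Eq. (3): P: (N, H, W) soft assignments; feat: (c, H, W) pixel features.
    # Returns (N, c): the P-weighted average feature of each superpixel.
    w = P.flatten(1)                                   # (N, H*W)
    f = feat.flatten(1)                                # (c, H*W)
    return (w @ f.t()) / w.sum(dim=1, keepdim=True).clamp(min=eps)

def soft_superpixelated_image(P: torch.Tensor, I: torch.Tensor) -> torch.Tensor:
    # Eq. (2): each pixel becomes a P-weighted mixture of superpixel means.
    means = superpixel_features(P, I)                  # (N, 3) mean color per superpixel
    return torch.einsum('nhw,nc->chw', P, means)       # (3, H, W)
```

Because every operation here is a weighted average over $\mathbf{P}$, gradients flow back into the superpixel predictor, which is exactly the differentiability property noted above.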