like brightness or affine transformations of the image.
In this paper we propose a novel method that leverages
the recent advances in the field of Neural Superpixel Seg-
mentation and GNNs to improve segmentation accuracy.
Specifically, given an input image, we propose to first pre-
dict its soft superpixel representation using a superpixel ex-
traction CNN (SPNN). Then, we employ a GNN to refine
and allow the interaction of superpixel features. This step is
crucial to share semantic information across distant pixels
in the original image. Lastly, we project the learnt features
back into an image and feed it to a designated segmentation
CNN to obtain the final semantic segmentation map. We
distinguish our work from existing methods by first observing that most existing methods rely on pixel-wise label prediction and differ mainly in their training procedure or architecture. Also, while some recent works [41] propose incorporating superpixel information for unsupervised image semantic segmentation, the superpixels are obtained from non-differentiable methods such as SLIC [1] or SEEDS [52], which rely on low-level features like color and pixel proximity. Our proposed method is differentiable and can be trained in an end-to-end manner, which allows us to jointly learn the superpixel representation and its high-level features together with the predicted segmentation map.
Also, the aforementioned methods did not incorporate a
GNN as part of the network architecture. We show in our experiments that these additions are key to obtaining improved accuracy. Our contributions are as follows:
• We propose SGSeg: a method that incorporates super-
pixel segmentation, GNNs and MIM for unsupervised
image semantic segmentation.
• We show that our network improves the CNN baseline
and other recent models, reaching similar or better accuracy than state-of-the-art models on 4 datasets.
• Our extensive ablation study reveals the importance of
superpixels and GNNs in unsupervised image segmen-
tation, and provides a solid evaluation of our method.
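To make the pipeline concrete, the three stages can be sketched at the level of tensor shapes. Everything below (the layer stand-ins, dimensions, and names such as `spnn` and `gnn_refine`) is illustrative only, not the architecture proposed here:

```python
import numpy as np

# Shape-level sketch of the three-stage pipeline: superpixel extraction,
# GNN refinement of superpixel features, and projection back to the pixel
# grid for a final segmentation head. All layers are random stand-ins.
rng = np.random.default_rng(0)
H, W, C, N, K = 8, 8, 16, 10, 4   # image size, feature dim, #superpixels, #classes

def spnn(image):
    """Stand-in for the superpixel extraction CNN (SPNN): returns a soft
    assignment q (N x H*W, each pixel's column sums to 1) and pixel
    features. A real SPNN would compute both from the image."""
    logits = rng.standard_normal((N, H * W))
    q = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    feats = rng.standard_normal((C, H * W))
    return q, feats

def gnn_refine(sp_feats):
    """Stand-in for the GNN: one uniform message-passing step that lets
    every superpixel aggregate the features of all the others."""
    adj = np.full((N, N), 1.0 / N)
    return adj @ sp_feats

image = rng.standard_normal((H, W, 3))
q, pixel_feats = spnn(image)

# Pool pixel features into superpixel features (soft weighted average).
sp_feats = (q @ pixel_feats.T) / q.sum(axis=1, keepdims=True)   # N x C
sp_feats = gnn_refine(sp_feats)

# Project refined superpixel features back onto the pixel grid and apply
# a stand-in 1x1 segmentation head mapping C features to K classes.
proj = (q.T @ sp_feats).reshape(H, W, C)
seg_map = (proj @ rng.standard_normal((C, K))).argmax(axis=-1)
```

The key point of the sketch is the round trip through the compact N-superpixel representation, which is where the GNN operates.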
The rest of the paper is outlined as follows: In Sec. 2
we cover in detail existing methods and related background
material. In Sec. 3 we present our method and in Sec. 4 we
present our numerical experiments.
2. Related Work
2.1. Unsupervised Semantic Segmentation
The unsupervised image segmentation task seeks to se-
mantically classify each pixel in an image without the use of
ground-truth labels. Early models like the Geodesic Active
Contours [7] propose a variational approach that minimizes a functional to obtain background-object segmentation. Other works propose to extract information from low-
level features, for instance by considering the histogram of
the red-green-blue (RGB) values of the image pixels [44]
and by employing a Markov random field [12] to model the
semantic relations of pixels. In the context of deep learn-
ing frameworks, there has been great improvement in recent
years [42, 25, 31, 10]. The common factor of these methods is the incorporation of the concept of MIM, which measures the similarity between two tensors of possibly different sizes and from different sources [57]. Specifically, the
aforementioned methods encourage the network to predict consistent segmentation maps under image transformations and perturbations such as affine transformations [31]. Other methods propose to apply the transformation to the convolutional kernels [42] and require consistent predictions across different rasterizations of the convolution kernels. Such an approach improves the consistency of the learnt features under transformations while avoiding the forward and inverse transformation costs incurred in [31]. Additional works like [41] also follow the idea of MIM, but propose to employ adversarial perturbations of the image in addition to geometric transformations.
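One common way to instantiate such a MIM objective, roughly in the spirit of [31], is to form the joint class distribution of the predictions on an image and on its transformed copy, and score its mutual information. The sketch below is an illustrative formulation with toy inputs, not the exact loss of any one cited paper:

```python
import numpy as np

# Illustrative MIM objective: build the K x K joint class distribution of
# two per-pixel soft predictions and compute its mutual information.
rng = np.random.default_rng(0)
K, M = 4, 1000                       # number of classes, number of pixels

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mutual_information(pa, pb):
    """MI of the empirical joint class distribution of two soft maps."""
    joint = (pa.T @ pb) / pa.shape[0]        # K x K, entries sum to 1
    joint = (joint + joint.T) / 2            # symmetrize the joint
    pi_, pj_ = joint.sum(axis=1), joint.sum(axis=0)
    return float((joint * (np.log(joint)
                           - np.log(pi_[:, None] * pj_[None, :]))).sum())

p1 = softmax(rng.standard_normal((M, K)))    # predictions on the image
p2 = softmax(rng.standard_normal((M, K)))    # predictions on its transform

# Identical predictions share maximal information, while unrelated random
# predictions give a joint close to the product of marginals (MI near 0).
mi_same = mutual_information(p1, p1)
mi_diff = mutual_information(p1, p2)
```

Maximizing this quantity pushes the two predictions toward being consistent yet non-degenerate, since a constant prediction also drives the mutual information to zero.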
Operating on a pixel-level representation of images to
obtain a segmentation map is computationally demanding
[33, 14], and modelling the interactions between distant pixels in the image requires very deep networks, which are harder to train [22]. It is therefore natural to consider the
compact superpixel representation of the image. The su-
perpixel representation stems from the over-segmentation
problem [1], where one seeks to segment the image into se-
mantically meaningful regions represented by the superpix-
els. Also, superpixels are often utilized as an initial guess
for non-deep learning based unsupervised image semantic
segmentation [58, 64, 60] that greatly reduces the complex-
ity of transferring an input image to its semantic segmenta-
tion map. In the context of CNNs, it was proposed [41] to
utilize a superpixel representation based on SLIC in addi-
tion to the pixel representation to improve accuracy.
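For intuition on the low-level nature of such classical methods, a single SLIC-flavored assignment step can be sketched as follows. Real SLIC [1] alternates assignment and center updates within local search windows; the random center initialization and the compactness weight `m` below are simplifications:

```python
import numpy as np

# One SLIC-flavored assignment step: each pixel goes to the nearest
# cluster center under a combined color + spatial distance.
rng = np.random.default_rng(0)
H, W, N, m = 32, 32, 8, 10.0         # image size, #superpixels, compactness

image = rng.random((H, W, 3))        # toy RGB image in [0, 1]
ys, xs = np.mgrid[0:H, 0:W]
# Each row of `pixels`: (r, g, b, y, x), coordinates normalized to [0, 1).
pixels = np.concatenate([image.reshape(-1, 3),
                         np.stack([ys.ravel() / H, xs.ravel() / W], axis=1)],
                        axis=1)

# Initialize centers at random pixels (SLIC uses a regular grid instead).
centers = pixels[rng.choice(H * W, size=N, replace=False)]

color_d = ((pixels[:, None, :3] - centers[None, :, :3]) ** 2).sum(-1)
space_d = ((pixels[:, None, 3:] - centers[None, :, 3:]) ** 2).sum(-1)
assignment = (color_d + m * space_d).argmin(axis=1).reshape(H, W)
```

The assignment depends only on raw color and pixel proximity, with no learnable component, which is exactly the limitation the differentiable SPNN-based approach addresses.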
Furthermore, other works propose to use pre-trained networks to obtain semantic information about the images [23, 6], and [32] utilized co-segmentation from multiple views of images. In this work we focus on networks
that are neither trained nor pre-trained with any labelled
data and are based on a single image view.
2.2. Neural Superpixel Segmentation
The superpixel segmentation task considers the problem
of over-segmenting an image, i.e., dividing the image into
several sub-regions, where each sub-region has similar features within its pixels. That is, given an image I ∈ R^(H×W), a superpixel segmentation algorithm returns an assignment matrix π ∈ [0, . . . , N − 1]^(H×W) that classifies each pixel into a superpixel, where N is the number of superpixels.
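A differentiable pipeline typically predicts a soft assignment over the N superpixels and recovers the hard map π by a per-pixel argmax; a minimal sketch, where the random soft assignment stands in for a network output:

```python
import numpy as np

# Recovering the hard assignment matrix pi from a soft superpixel
# prediction q of shape N x H x W: a per-pixel argmax over the N
# channels. The random q below stands in for a network prediction.
rng = np.random.default_rng(0)
N, H, W = 16, 4, 6

logits = rng.standard_normal((N, H, W))
q = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # sums to 1 per pixel

pi = q.argmax(axis=0)   # hard assignment in [0, ..., N-1]^(H x W)
```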
Classical methods like SLIC [1] use Euclidean coordinates
and color space similarity to define superpixels, while other