
IDEAL: IMPROVED DENSE LOCAL CONTRASTIVE LEARNING FOR SEMI-SUPERVISED
MEDICAL IMAGE SEGMENTATION
Hritam Basak1, Soumitri Chattopadhyay∗2, Rohit Kundu∗2, Sayan Nag∗3, Rammohan Mallipeddi4
1Stony Brook University, 2Jadavpur University, 3University of Toronto, 4Kyungpook National University
∗Equal contribution.
ABSTRACT
Due to the scarcity of labeled data, Contrastive Self-Supervised
Learning (SSL) frameworks have lately shown great potential
in several medical image analysis tasks. However, the existing
contrastive mechanisms are sub-optimal for dense pixel-level segmentation tasks due to their inability to mine local features. To this end, we extend the concept of metric learning to the segmentation task, using dense (dis)similarity learning to pre-train a deep encoder network and employing a semi-supervised paradigm to fine-tune it for the downstream task. Specifically, we propose a simple convolutional projection head for obtaining dense pixel-level features, and a new contrastive loss to utilize these dense projections, thereby
improving the local representations. A bidirectional consis-
tency regularization mechanism involving two-stream model
training is devised for the downstream task. Our IDEAL method outperforms state-of-the-art (SoTA) methods by fair margins on cardiac MRI segmentation. Our source code is publicly available at: https://github.com/Rohit-Kundu/IDEAL-ICASSP23.
Index Terms—Semi-supervised learning, Segmentation,
MRI, Contrastive learning
1. INTRODUCTION
The success of supervised deep learning approaches can be
attributed to the availability of large quantities of labeled data
that is essential for network training [1]. However, in the
biomedical domain [2,3], it is difficult to acquire such large quantities of annotated data, since annotations must be performed
by trained medical professionals. Although supervised learn-
ing has been used extensively in the past decade in biomedi-
cal imaging [1], Self-supervised Learning (SSL) [4,5,6,7,8]
provides more traction for sustaining deep learning methods in the medical vision domain [9,10,11]. SSL-based pre-training alleviates the data annotation problem by utilizing only unlabeled data to learn distinctive representations, which
can further be utilized in downstream applications, typically
in a semi-supervised fashion [12,13,9]. SSL has proven exceedingly promising in domains with enormous amounts of data, such as natural images [6,7], with a focus on both contrastive [14,7,6,15] and non-contrastive variants [16,17]. Recently, SSL has begun to be employed for medical imaging as well, as evidenced by the survey in [2].
One of the most popular ways of employing SSL is contrastive learning [14,7,6], which, owing to its strong performance across various vision tasks, has almost become a de facto standard for self-supervision. The intuition of con-
trastive learning is to pull embeddings of semantically similar
objects closer and simultaneously push away the representa-
tions of dissimilar objects. In medical image segmentation,
contrastive learning has been leveraged in a few prior works
[13,18]. However, naive contrastive learning frameworks such as SimCLR [6] and MoCo [7] learn a global representation and are thus not directly applicable to segmentation, a pixel-level task. The seminal work by Chaitanya et al. [13] utilized a contrastive pre-training strategy to learn
useful representations for a segmentation task in a low-data
regime. Despite its success, their method has a few limi-
tations, which we attempt to address and improve upon in this
paper. In particular, we take a cue from the aforementioned
work [13] to propose a novel contrastive learning-based med-
ical image segmentation framework.
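To make the pull/push intuition above concrete, the following is a minimal PyTorch sketch of a generic InfoNCE-style objective in the spirit of SimCLR [6]; the function name, shapes, and temperature value are illustrative assumptions, not the exact loss proposed in this paper or in [13].

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Symmetric InfoNCE loss over two batches of embeddings.

    z1, z2: (N, D) projections of two augmented views of the same
    N images; row i of z1 and row i of z2 form a positive pair,
    while all other rows in the batch serve as negatives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature  # (N, N) scaled cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Contrast each view against the other and average the two directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```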
Although our work is built upon [13], there are several
salient points of difference between the two models that ren-
der our contributions non-trivial and unique. First, our proposed method preserves spatial information and constructs a dense output projection that retains locality. This is in
contrast to the projection used in [13] where a global pooling
is applied to the backbone encoder, thereby obtaining a sin-
gle global feature representation vector for every input image.
In other words, [13] employed global contrastive learning for encoder pre-training, restricting local contrastive learning to decoder fine-tuning. In contrast, our method
involves pre-training of the encoder on dense local features.
The intuitive difference between our proposed framework and that of [13] is depicted in Fig. 1. We argue that this
will benefit pixel-level informed downstream tasks such as
segmentation (supported by our findings in Section 3).
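To illustrate this architectural distinction, below is a hedged PyTorch sketch contrasting a pooled global projection head (in the spirit of [13]) with a dense convolutional head that keeps the spatial grid; the class names, channel widths, and layer counts are assumptions for illustration, not the exact architecture of either method.

```python
import torch
import torch.nn as nn

class GlobalProjectionHead(nn.Module):
    """Pooled head: global average pooling collapses the (H, W) grid
    to a single vector per image, discarding spatial detail."""
    def __init__(self, in_channels=512, hidden=256, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, feats):            # feats: (N, C, H, W)
        pooled = feats.mean(dim=(2, 3))  # (N, C): locality is lost here
        return self.mlp(pooled)          # (N, D): one vector per image

class DenseProjectionHead(nn.Module):
    """Dense head: 1x1 convolutions project every spatial location
    independently, so the output keeps per-pixel structure."""
    def __init__(self, in_channels=512, hidden=256, out_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, out_dim, kernel_size=1),
        )

    def forward(self, feats):    # feats: (N, C, H, W)
        return self.proj(feats)  # (N, D, H, W): one embedding per location
```

A dense contrastive loss can then be computed between corresponding locations of the two augmented views, rather than between single per-image vectors.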
Secondly, our definition of positive and negative pairs for contrastive learning differs from that of traditional contrastive methods [6,7]. This is because, in our
case, we find dense correspondences across views and define
positives and negatives accordingly, i.e., we perform feature-