segmentation model.
II. RELATED WORK
Our work relates to previous self-supervised learning
efforts in robotics, which exploit the idea that robots can
gather their own data and, with proper training strategies,
need not rely on the laborious labeling required for
supervised training. In the past this idea has been explored in
the context of different tasks, such as pose estimation [13],
object detection [14], [15], and learning object-specific visual
representations for manipulation and tracking [16], to mention
a few. Reinforcement learning and contrastive learning have been
the two most commonly used frameworks that spearheaded
the technical advances in self-supervised learning in both
computer vision and robotics. Below we review in more
detail the works that are most relevant to our task of semantic
segmentation.
A. Self-supervised Learning
To mitigate the need for the large amounts of labeled data
required to train deep models, self-supervised methods
typically use various pretext tasks to generate training data.
In the single-image setting, these have included masked
image modeling [7], object mask prediction [8], instance
discrimination [17], and others. These auxiliary tasks provide
the model with an objective that embeds semantically
similar inputs closer together in the learned embedding space.
a) Contrastive Learning: Contrastive learning [18] has
been a "workhorse" of self-supervised learning approaches
for training DCNNs. It exploits the ability to associate
(semantically) similar examples (positive pairs) and distinguish
them from negative pairs as a supervision signal for
learning suitable representations (embeddings). Existing
approaches vary in the final task, the CNN architecture
(often variations of Siamese neural network architectures),
the loss function, and the method for obtaining
similar and dissimilar training examples. The most common
concern in representation learning is avoiding model collapse.
The authors of [17] consider each image as a separate class
and train the model to disambiguate an image from
all other images in the dataset. MoCo [19] employed a dynamic
memory bank [17] to store features from the current iteration
of the model as negatives during training. To remove
the requirement of a memory bank, SimCLR [20] proposed
using negative pairs from the mini-batch itself, but this
consequently necessitates a larger batch size. BYOL [21] further removes
the need for negative examples altogether by introducing
an asymmetric architecture. Recently, [10] introduced a new objective
function termed Barlow Twins, based on redundancy
reduction, which removes the need for both large batches and
asymmetric models.
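For concreteness, a minimal sketch of an InfoNCE-style contrastive
objective is shown below, where the two views of each image in a batch
form positive pairs and all other images act as negatives; the function
name and the temperature value are illustrative and not tied to any
particular method discussed above.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss: z1[i] and z2[i] are embeddings of two views of image i.

    Each positive pair (z1[i], z2[i]) is contrasted against all other
    embeddings in the batch, which act as negatives.
    """
    z1 = F.normalize(z1, dim=1)                 # (N, D) unit-norm embeddings
    z2 = F.normalize(z2, dim=1)                 # (N, D)
    logits = z1 @ z2.t() / temperature          # (N, N) scaled cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)     # diagonal entries are positives
```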
b) Point-Level Contrast: While contrastive learning
at the global image level has proven beneficial for
image classification, problems such as object detection
and semantic segmentation, where predictions are made at the
object bounding-box or pixel level, require disambiguation of
finer features. Obtaining pixel-level positive pairs is more difficult
than obtaining their image-level counterparts. PixPro [22] follows
SimSiam-style [23] training but at the pixel level; positives
are obtained by thresholding the cosine distance between
pixel-level features within a given image. The authors
of [24] sample positive pixel pairs within regions obtained
by k-means clustering of the initial features, while [25]
circumvents the need to find regions by dividing the image
into a fixed N × N grid where each grid cell is considered a
separate region.
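As an illustration of the grid-based variant, the sketch below pools a
dense feature map into an N × N grid of region embeddings (the function
name and pooling choice are illustrative assumptions); corresponding
cells from two views of the same image can then serve as positive pairs,
with cells from other images acting as negatives.

```python
import torch.nn.functional as F

def grid_region_embeddings(feat, n=7):
    """Pool a dense feature map (B, C, H, W) into an n x n grid of
    region embeddings (B, n*n, C); each grid cell acts as a separate
    'region' for region-level contrastive learning."""
    pooled = F.adaptive_avg_pool2d(feat, (n, n))   # (B, C, n, n)
    return pooled.flatten(2).transpose(1, 2)       # (B, n*n, C)
```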
In situations where a small number of labeled examples is
available, one can combine a cross-entropy loss on the available
labels with a self-supervised loss.
This requires a momentum-updated teacher network along
with a memory bank to associate features across examples
beyond the images in a given iteration [26]. To obtain
additional labels when labels are sparse, [27], [28]
proposed using label propagation techniques. This, however,
requires a complete and accurate 3D reconstruction in order to
fuse predictions from different frames.
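A minimal sketch of the momentum (EMA) teacher update used in such
hybrid schemes is given below, together with the combined objective;
the weighting term lambda_ssl is an illustrative hyper-parameter, not
one prescribed by [26].

```python
import torch

@torch.no_grad()
def ema_update(student, teacher, m=0.999):
    """Momentum (EMA) update of the teacher network from the student weights."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)

# Combined objective when sparse labels are available:
#   loss = cross_entropy(predictions, labels) + lambda_ssl * self_supervised_loss
```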
Our work is most closely related to self-supervised
learning efforts for object detection [14], [15], which use
multiple views and view association to guide training,
but extends these ideas to dense pixel-level prediction tasks
such as semantic segmentation.
III. METHOD
We assume a robotic agent with the ability to perceive and
recover the ego-motion and 3D structure of the environment and
to associate overlapping views of the same scene. This can be
achieved with appropriate sensors, such as a depth sensor or
stereo camera, with 3D structure and motion
estimation techniques [29], or with a suitable SLAM approach [1].
To instantiate a self-supervised learning approach for semantic
segmentation, we propose RegConsist, a Region Consistency
method for temporal and spatial alignment of
overlapping views, which we describe next.
A. Temporal Consistency
Let $I_1$ and $I_2$ be two images captured by the agent in
the fixed indoor environment. Assuming the availability of
known intrinsic and extrinsic camera parameters and depth,
we can associate the pixels in the overlapping views of the
same scene using (1).
$$ T_{1\rightarrow 2}(I_1) = \{\, K(T_2^{-1}(T_1(K^{-1}(x)))) \;\; \forall x \in I_1 \,\} \qquad (1) $$
where $K$ is the matrix of intrinsic camera parameters, $T_1 =
[R_1 \,|\, t_1]$ is the camera pose for image $I_1$, with rotation
$R_1$ and translation $t_1$ with respect to a fixed coordinate
system, and $x$ is a pixel in $I_1$. The operator $T_{1\rightarrow 2}$ transforms
the 2D pixel coordinates in $I_1$ to the 3D world coordinate
system and projects them back to pixel coordinates in $I_2$.
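As a concrete illustration, the following is a minimal NumPy sketch of
the pixel association in (1), assuming camera-to-world poses stored as
4 × 4 homogeneous matrices, a 3 × 3 intrinsic matrix, and a per-pixel
depth map for $I_1$; the function name and these conventions are
illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def warp_pixels(depth1, K, T1, T2):
    """Map every pixel of image 1 into image 2's pixel coordinates, as in (1).

    depth1 : (H, W) depth map for image 1
    K      : (3, 3) camera intrinsics
    T1, T2 : (4, 4) camera-to-world poses for images 1 and 2
    Returns an (H, W, 2) array of (u, v) coordinates in image 2.
    """
    H, W = depth1.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW

    # Back-project to camera-1 coordinates: K^{-1} x, scaled by depth.
    cam1 = np.linalg.inv(K) @ pix * depth1.reshape(1, -1)

    # Camera 1 -> world -> camera 2: T2^{-1}(T1(.)) in homogeneous coordinates.
    cam1_h = np.vstack([cam1, np.ones((1, cam1.shape[1]))])
    cam2_h = np.linalg.inv(T2) @ (T1 @ cam1_h)

    # Project into image 2 with the intrinsics K.
    proj = K @ cam2_h[:3]
    uv2 = (proj[:2] / np.clip(proj[2:3], 1e-6, None)).T.reshape(H, W, 2)
    return uv2
```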
We assume that reliable correspondences can be estimated either
using a learning-based method [30] or, as in our case, with the
availability of a depth sensor. Let $I_1^p$ and $I_2^q$ be the $p$-th and
$q$-th pixels in images $I_1$ and $I_2$, respectively. If the pixels
belong to the same 3D location in the environment and