Self-supervised Pre-training for Semantic Segmentation in an Indoor Scene

Sulabh Shrestha1, Yimeng Li1 and Jana Košecká1
Abstract— The ability to endow maps of indoor scenes with
semantic information is an integral part of robotic agents
which perform different tasks such as target driven navigation,
object search or object rearrangement. The state-of-the-art
methods use Deep Convolutional Neural Networks (DCNNs)
for predicting semantic segmentation of an image as a useful representation for these tasks. The accuracy of semantic segmentation depends on the availability and the amount of labeled
data from the target environment or the ability to bridge the
domain gap between test and training environment. We propose
RegConsist, a method for self-supervised pre-training of a
semantic segmentation model, exploiting the ability of the agent
to move and register multiple views in the novel environment.
Given the spatial and temporal consistency cues used for pixel
level data association, we use a variant of contrastive learning to
train a DCNN model for predicting semantic segmentation from
RGB views in the target environment. The proposed method
outperforms models pre-trained on ImageNet and achieves competitive performance compared to models trained for exactly the same task but on a different dataset. We also
perform various ablation studies to analyze and demonstrate
the efficacy of our proposed method.
I. INTRODUCTION
Semantic segmentation has been used extensively both for semantic mapping [1] and as an input representation for training policies for embodied agents (e.g. policies for
target driven or point goal navigation) that rely on visual
perception [2], [3]. Training a semantic segmentation model for a particular environment requires a large amount of per-pixel annotations [4], which are very costly and laborious to obtain.
Alternatively, for similar classes of environments that share a large subset of semantic labels, a model can be trained for the entire domain (say, indoor environments), followed by domain adaptation [5]. In a robotic setting the agent is often
able to move around and capture large amounts of visual data
and the ability to estimate ego-motion and depth perception
enables the agent to effectively associate multiple views of
the same scene.
Recent years have marked notable progress in various self-supervised methods for training large DCNNs from scratch
without relying on commonly used backbones pre-trained
on ImageNet. The existing techniques used various forms
of contrastive learning and different pretext tasks such as
predicting the masked portion of the image [7], predicting
masks of objects [8] or predicting the rotation of the image
[9]. The self-supervised pre-training is then followed by fine-
tuning with a small fraction of the labeled data. Even though
1Sulabh Shrestha, Yimeng Li and Jana Koˇ
secka are with the Department
of Computer Science, George Mason University, 4400 University Dr,
Fairfax, VA, USA {sshres2,yli44,kosecka}@gmu.edu
Fig. 1: Example of consistency. A pair of views from different locations and poses of the agent inside an environment
in Replica Dataset [6]. We find and match pixels across
the view-pairs. In exact matching, corresponding points are matched (yellow arrow). In region matching, any pixel across
overlapping regions from the two views can be matched (blue
arrow). Best viewed digitally or in color.
these models have proven to be very effective, they are
typically evaluated for image classification tasks.
In this paper we explore the use of self-supervision that
comes from the spatial and temporal consistency between
pairs of overlapping views, and demonstrate how to pre-train a Deep Convolutional Neural Network (DCNN) model for semantic segmentation using data captured in the environment of interest. We assume that within a single traversal path the environment remains static, which simplifies the process of computing correspondences between neighboring views that will be used for self-supervised training of the model.
Contribution 1) We propose RegConsist, a method for self-
supervised pre-training of a semantic segmentation model
using spatial and temporal consistency cues. We exploit correspondences between multiple views to generate positive examples for a contrastive learning framework and evaluate the effect of different sampling strategies on the result.
2) We demonstrate that the resulting model can be fine-
tuned with only a small fraction of image annotations,
obtaining competitive or better performance in a novel indoor
environment compared to fully supervised models. 3) We are
the first to demonstrate that the Barlow twins loss [10] works
well for semantic segmentation in an indoor environment. 4)
We demonstrate the efficacy of our method on Replica [6]
and AVD [11] datasets both qualitatively and quantitatively
while using as little as 5 percent of the annotated data. 5) We
perform extensive ablation studies to verify our method’s
performance including different starting conditions of the
segmentation model.

arXiv:2210.01884v1 [cs.CV] 4 Oct 2022
II. RELATED WORK
Our work relates to previous self-supervised learning
efforts in robotics, exploiting the idea that robots are able
to gather their own data and with proper training strategies
do not have to rely on laborious labeling required for
supervised training. In the past this idea was explored in
the context of different tasks, such as pose-estimation [13],
object detection [14], [15] and learning object specific visual
representation for manipulation and tracking [16], to mention
a few. Reinforcement learning and contrastive learning were
the two most commonly used frameworks that spearheaded
the technical advances in self-supervised learning in both
computer vision and robotics. Below we review in more
detail the works that are most relevant to our task of semantic
segmentation.
A. Self-supervised Learning
To mitigate the need for large amounts of labeled data
required for training deep models, self-supervised methods
typically use various pretext tasks to generate training data.
In the past, in single image setting, these included masked
image modeling [7], object mask prediction [8], instance
discrimination [17] and others. These auxiliary tasks provide
the model with the desired objective to embed semantically
similar inputs closer in the learned embedding space.
a) Contrastive Learning: Contrastive learning [18] has been a "workhorse" of self-supervised learning approaches
for training DCNNs. It exploits the ability to associate
(semantically) similar examples (positive pairs) and distin-
guish them from negative pairs as a supervision signal for
learning suitable representations (embeddings). The existing
approaches vary depending on the final task, CNN archi-
tectures (often using variations of Siamese Neural Network
architectures), loss functions and methods for obtaining
similar and dissimilar training examples. The most common
concern in representation learning is to avoid model collapse.
Authors in [17] consider each image as a separate class and train the model to disambiguate an image from the other images in the dataset. MoCo [19] employed a dynamic memory bank [17] to store features from the current iteration of the model as negatives during training. In order to remove the requirement of a memory bank, SimCLR [20] proposed to use negative pairs from the mini-batch itself, consequently necessitating a larger batch size. BYOL [21] further removes the need for negative examples altogether by introducing an asymmetric architecture. Recently, [10] introduced a new objective function termed Barlow Twins, based on redundancy reduction, which removes the need for large batches and asymmetric models altogether.
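For reference, the redundancy-reduction objective of [10] can be sketched as follows; this is a minimal illustrative implementation on batch embeddings, not the authors' released code (the function name and the normalization epsilon are our choices):

```python
import numpy as np

def barlow_twins_loss(z1, z2, lambd=5e-3):
    """Barlow Twins loss sketch: z1, z2 are (N, D) embeddings of two
    augmented views of the same batch of N examples."""
    # Standardize each feature dimension over the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-9)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-9)
    n = z1.shape[0]
    c = z1.T @ z2 / n                    # (D, D) cross-correlation matrix
    # Invariance term: pull diagonal toward 1 (views agree per feature).
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()
    # Redundancy term: push off-diagonal correlations toward 0.
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lambd * off_diag
```

The loss is small when the two views produce decorrelated, per-feature-consistent embeddings, which is why neither negatives nor large batches are needed.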
b) Point Level Contrast: While contrastive learning at the global image level has proven to be beneficial for image classification, problems such as object detection and semantic segmentation, where predictions are made at the object bounding box or pixel level, require disambiguation of finer features. Obtaining pixel-level positive pairs is more difficult than their image-level counterpart. PixPro [22] follows SimSiam [23]-like training but at the pixel level; the positives
are obtained by thresholding the cosine distance between the
features at the pixel level within a given image. Authors
in [24] sample positive pixel pairs within regions obtained
by k-means clustering of the initial features, while [25] circumvents the need for finding regions by dividing the image into a fixed N x N grid where each grid-cell is considered a
separate region.
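For illustration, the fixed grid-cell assignment described in [25] amounts to something like the following sketch (the helper name is ours, not from [25]):

```python
import numpy as np

def grid_regions(h, w, n):
    """Assign each pixel of an h x w image to one of n*n grid-cell
    regions, returning an (h, w) array of region ids in [0, n*n)."""
    rows = np.minimum(np.arange(h) * n // h, n - 1)  # row band per pixel row
    cols = np.minimum(np.arange(w) * n // w, n - 1)  # column band per pixel col
    return rows[:, None] * n + cols[None, :]
```

All pixels sharing a region id are then treated as belonging to the same region for sampling positive pairs.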
In situations where a small number of labeled examples is available, one can learn using both a cross entropy loss based on the available labels and a self-supervised loss.
This requires a momentum-updated teacher network along
with a memory bank to associate features between examples
outside of the images in a given iteration [26]. In order to obtain additional labels when labels are sparse, [27], [28] proposed to use label propagation techniques. This, however,
requires complete and accurate 3D reconstruction in order to
fuse predictions from different frames.
Our work is most closely related to the efforts of self-
supervised learning for object detection [15], [14] by using
multiple views and view association to guide the training,
but extends these ideas to dense pixel-level prediction tasks
such as semantic segmentation.
III. METHOD
We assume a robotic agent with the ability to perceive and recover ego-motion and the 3D structure of the environment and to associate overlapping views of the same scene. This can be achieved with appropriate sensors such as a depth sensor or a stereo camera, or with the use of 3D structure and motion estimation techniques [29] or a suitable SLAM approach [1]. To instantiate a self-supervised learning approach for semantic segmentation, we propose RegConsist (Region Consistency), a method for temporal and spatial alignment of overlapping views that we describe next.
A. Temporal Consistency
Let I_1 and I_2 be two images captured by the agent in the fixed indoor environment. Assuming the availability of known intrinsic and extrinsic camera parameters and depth, we can associate the pixels in the overlapping views of the same scene using (1).
T_{1→2}(I_1) = { K(T_2^{-1}(T_1(K^{-1}(x)))) : ∀ x ∈ I_1 }    (1)
where K is the intrinsic camera matrix, T_1 = [R_1 | t_1] is the camera pose for the image I_1, with rotation R_1 and translation t_1 with respect to a fixed coordinate system, and x is a pixel in I_1. The operator T_{1→2} lifts the 2D pixel coordinates in I_1 to the 3D world coordinate system and projects them back to pixel coordinates in I_2. We assume that reliable correspondences can be estimated either using a learning based method [30] or, as in our case, with the availability of a depth sensor. Let I_1^p and I_2^q be the p-th and q-th pixels in the images I_1 and I_2 respectively. If the pixels belong to the same 3D location in the environment and
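The view transfer in (1) can be sketched concretely as follows, assuming known intrinsics K, camera-to-world poses T_1 and T_2, and a depth map for the first view; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def transfer_pixels(K, T1, T2, depth1, pixels):
    """Map integer (u, v) pixel coordinates from view 1 into view 2.
    K: (3,3) intrinsics; T1, T2: (4,4) camera-to-world poses;
    depth1: (H,W) depth map of view 1; pixels: (N,2) array of (u, v)."""
    u, v = pixels[:, 0], pixels[:, 1]
    d = depth1[v, u]                                  # depth at each pixel
    # Back-project to camera-1 coordinates: d * K^{-1} [u, v, 1]^T.
    homo = np.stack([u, v, np.ones_like(u)], axis=0).astype(float)
    pts_c1 = d * (np.linalg.inv(K) @ homo)            # (3, N)
    pts_c1_h = np.vstack([pts_c1, np.ones(len(u))])   # homogeneous (4, N)
    # Camera 1 -> world (T1), then world -> camera 2 (T2^{-1}).
    pts_c2 = np.linalg.inv(T2) @ (T1 @ pts_c1_h)
    # Project with K and dehomogenize to (u, v) in view 2.
    proj = K @ pts_c2[:3]
    return (proj[:2] / proj[2]).T                     # (N, 2)
```

In practice the returned coordinates must still be checked against the image bounds of I_2 (and against I_2's depth for occlusion) before a pixel pair is accepted as a correspondence.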