segmentation model.
II. RELATED WORK
Our work relates to previous self-supervised learning
efforts in robotics, which exploit the idea that robots can
gather their own data and, with proper training strategies,
need not rely on the laborious labeling required for
supervised training. In the past this idea has been explored in
the context of different tasks, such as pose estimation [13],
object detection [14], [15], and learning object-specific visual
representations for manipulation and tracking [16], to mention
a few. Reinforcement learning and contrastive learning have been
the two most commonly used frameworks that spearheaded
the technical advances in self-supervised learning in both
computer vision and robotics. Below we review in more
detail the works that are most relevant to our task of semantic
segmentation.
A. Self-supervised Learning
To mitigate the need for the large amounts of labeled data
required to train deep models, self-supervised methods
typically use various pretext tasks to generate training data.
In the single-image setting, these have included masked
image modeling [7], object mask prediction [8], instance
discrimination [17], and others. These auxiliary tasks provide
the model with an objective that embeds semantically
similar inputs closer together in the learned embedding space.
a) Contrastive Learning: Contrastive learning [18] has
been a "workhorse" of self-supervised learning approaches
for training DCNNs. It exploits the ability to associate
(semantically) similar examples (positive pairs) and distinguish
them from negative pairs as a supervision signal for
learning suitable representations (embeddings). Existing
approaches vary in the final task, the CNN architecture
(often variations of Siamese neural network architectures),
the loss function, and the method for obtaining
similar and dissimilar training examples. The most common
concern in representation learning is avoiding model collapse.
The authors of [17] consider each image as a separate class
and train the model to disambiguate an image from
all other images in the dataset. MoCo [19] employed a dynamic
memory bank [17] to store features from the current iteration
of the model as negatives during training. To remove
the requirement of a memory bank, SimCLR [20] proposed
using negative pairs from the mini-batch itself, but this
consequently necessitates a larger batch size. BYOL [21] further removes
the need for negative examples altogether by introducing
an asymmetric architecture. Recently, [10] introduced a new objective
function termed Barlow Twins, based on redundancy
reduction, which removes the need for both large batches and
asymmetric models.
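For concreteness, a minimal sketch of an InfoNCE-style contrastive
objective is shown below, where the two views of each image in a batch
form positive pairs and all other images act as negatives; the function
name and the temperature value are illustrative and not tied to any
particular method discussed above.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss: z1[i] and z2[i] are embeddings of two views of image i.

    Each positive pair (z1[i], z2[i]) is contrasted against all other
    embeddings in the batch, which act as negatives.
    """
    z1 = F.normalize(z1, dim=1)                 # (N, D) unit-norm embeddings
    z2 = F.normalize(z2, dim=1)                 # (N, D)
    logits = z1 @ z2.t() / temperature          # (N, N) scaled cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)     # diagonal entries are positives
```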
b) Point-Level Contrast: While contrastive learning
at the global image level has proven beneficial for
image classification, problems such as object detection
and semantic segmentation, where predictions are made at the
object bounding-box or pixel level, require disambiguation of
finer features. Obtaining pixel-level positive pairs is more difficult
than obtaining their image-level counterparts. PixPro [22] follows
SimSiam-style [23] training but at the pixel level; positives
are obtained by thresholding the cosine distance between
pixel-level features within a given image. The authors
of [24] sample positive pixel pairs within regions obtained
by k-means clustering of the initial features, while [25]
circumvents the need to find regions by dividing the image
into a fixed N × N grid where each grid cell is considered a
separate region.
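As an illustration of the grid-based variant, the sketch below pools a
dense feature map into an N × N grid of region embeddings (the function
name and pooling choice are illustrative assumptions); corresponding
cells from two views of the same image can then serve as positive pairs,
with cells from other images acting as negatives.

```python
import torch.nn.functional as F

def grid_region_embeddings(feat, n=7):
    """Pool a dense feature map (B, C, H, W) into an n x n grid of
    region embeddings (B, n*n, C); each grid cell acts as a separate
    'region' for region-level contrastive learning."""
    pooled = F.adaptive_avg_pool2d(feat, (n, n))   # (B, C, n, n)
    return pooled.flatten(2).transpose(1, 2)       # (B, n*n, C)
```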
In situations where a small number of labeled examples is
available, one can combine a cross-entropy loss on the available
labels with a self-supervised loss.
This requires a momentum-updated teacher network along
with a memory bank to associate features across examples
beyond the images in a given iteration [26]. To obtain
additional labels when labels are sparse, [27], [28]
proposed using label propagation techniques. This, however,
requires a complete and accurate 3D reconstruction in order to
fuse predictions from different frames.
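A minimal sketch of the momentum (EMA) teacher update used in such
hybrid schemes is given below, together with the combined objective;
the weighting term lambda_ssl is an illustrative hyper-parameter, not
one prescribed by [26].

```python
import torch

@torch.no_grad()
def ema_update(student, teacher, m=0.999):
    """Momentum (EMA) update of the teacher network from the student weights."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)

# Combined objective when sparse labels are available:
#   loss = cross_entropy(predictions, labels) + lambda_ssl * self_supervised_loss
```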
Our work is most closely related to self-supervised
learning efforts for object detection [14], [15], which use
multiple views and view association to guide training,
but extends these ideas to dense pixel-level prediction tasks
such as semantic segmentation.
III. METHOD
We assume a robotic agent with the ability to perceive and
recover the ego-motion and 3D structure of the environment and
to associate overlapping views of the same scene. This can be
achieved with appropriate sensors, such as a depth sensor or
stereo camera, with 3D structure and motion
estimation techniques [29], or with a suitable SLAM approach [1].
To instantiate a self-supervised learning approach for semantic
segmentation, we propose RegConsist, a Region Consistency
method for temporal and spatial alignment of
overlapping views, which we describe next.
A. Temporal Consistency
Let $I_1$ and $I_2$ be two images captured by the agent in
the fixed indoor environment. Assuming the availability of
known intrinsic and extrinsic camera parameters and depth,
we can associate the pixels in the overlapping views of the
same scene using (1).
$$ T_{1\rightarrow 2}(I_1) = \{\, K(T_2^{-1}(T_1(K^{-1}(x)))) \;\; \forall x \in I_1 \,\} \qquad (1) $$
where $K$ is the matrix of intrinsic camera parameters, $T_1 =
[R_1 \,|\, t_1]$ is the camera pose for image $I_1$, with rotation
$R_1$ and translation $t_1$ with respect to a fixed coordinate
system, and $x$ is a pixel in $I_1$. The operator $T_{1\rightarrow 2}$ transforms
the 2D pixel coordinates in $I_1$ to the 3D world coordinate
system and projects them back to pixel coordinates in $I_2$.
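As a concrete illustration, the following is a minimal NumPy sketch of
the pixel association in (1), assuming camera-to-world poses stored as
4 × 4 homogeneous matrices, a 3 × 3 intrinsic matrix, and a per-pixel
depth map for $I_1$; the function name and these conventions are
illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def warp_pixels(depth1, K, T1, T2):
    """Map every pixel of image 1 into image 2's pixel coordinates, as in (1).

    depth1 : (H, W) depth map for image 1
    K      : (3, 3) camera intrinsics
    T1, T2 : (4, 4) camera-to-world poses for images 1 and 2
    Returns an (H, W, 2) array of (u, v) coordinates in image 2.
    """
    H, W = depth1.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW

    # Back-project to camera-1 coordinates: K^{-1} x, scaled by depth.
    cam1 = np.linalg.inv(K) @ pix * depth1.reshape(1, -1)

    # Camera 1 -> world -> camera 2: T2^{-1}(T1(.)) in homogeneous coordinates.
    cam1_h = np.vstack([cam1, np.ones((1, cam1.shape[1]))])
    cam2_h = np.linalg.inv(T2) @ (T1 @ cam1_h)

    # Project into image 2 with the intrinsics K.
    proj = K @ cam2_h[:3]
    uv2 = (proj[:2] / np.clip(proj[2:3], 1e-6, None)).T.reshape(H, W, 2)
    return uv2
```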
We assume that reliable correspondences can be estimated either
using a learning-based method [30] or, as in our case, with the
availability of a depth sensor. Let $I_1^p$ and $I_2^q$ be the $p$-th and
$q$-th pixels in images $I_1$ and $I_2$, respectively. If the pixels
belong to the same 3D location in the environment and