
Evaluating the label efficiency of contrastive self-supervised
learning for multi-resolution satellite imagery
Jules Bourciera,b, Gohar Dashyanb, Jocelyn Chanussota, and Karteek Alaharia
aUniv. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
bPreligens (ex-Earthcube), 75000 Paris, France
ABSTRACT
The application of deep neural networks to remote sensing imagery is often constrained by the lack of ground-
truth annotations. Addressing this issue requires models that generalize efficiently from limited amounts of
labeled data, allowing us to tackle a wider range of Earth observation tasks. Another challenge in this domain
is developing algorithms that operate at variable spatial resolutions, e.g., for the problem of classifying land use
at different scales. Recently, self-supervised learning has been applied in the remote sensing domain to exploit
readily-available unlabeled data, and was shown to reduce or even close the gap with supervised learning. In this
paper, we study self-supervised visual representation learning through the lens of label efficiency, for the task
of land use classification on multi-resolution/multi-scale satellite images. We benchmark two contrastive self-
supervised methods adapted from Momentum Contrast (MoCo) and provide evidence that these methods can
perform effectively given little downstream supervision, where randomly initialized networks fail to generalize.
Moreover, they outperform out-of-domain pretraining alternatives. We use the large-scale fMoW dataset to
pretrain and evaluate the networks, and validate our observations with transfer to the RESISC45 dataset.
Keywords: deep learning, computer vision, remote sensing, self-supervised learning, land use classification,
label-efficient learning, optical imagery
1. INTRODUCTION
The application of deep learning techniques to remote sensing imagery presents many challenges, one of the most
important being the scarcity of annotations. Although large amounts of satellite imagery are readily available,
curating and annotating them for a specific Earth observation task is usually very expensive, time-consuming,
and requires fine-grained domain expertise. This implies that it is impractical in many real-world contexts to acquire the
data needed to effectively leverage classical supervised learning methods. From this perspective, it is necessary to
develop label-efficient approaches, i.e., models that are able to learn with few annotated samples. Self-supervised
learning (SSL) is a promising approach for this purpose, as it pretrains representations without requiring human
labeling. Inspired by the success of recent methods on natural image benchmarks,1–5 SSL has been applied in
the remote sensing domain to exploit the plentiful unlabeled data, and was shown to reduce or even close the
gap with supervised learning and transfer from ImageNet6, 7 (IN).
Another common problem in remote sensing is to process images covering various spatial scales, e.g., for
the task of classifying land use, where categories can range from individual storage tanks to full harbors. The
capacity of SSL methods to generalize from few labels on this important problem has not, to the best of our
knowledge, been explored in prior work.
To address this, in this paper we study self-supervised visual representation learning through the lens of
label efficiency, for the classification of land use at different spatial resolutions. We benchmark two contrastive
self-supervised methods adapted from Momentum Contrast (MoCo)1 to assess their capacity to learn generic,
multi-resolution features, which do not need many labeled examples for downstream image classification. We use
the large-scale and diverse fMoW dataset8 to pretrain and evaluate the networks, and validate our observations
with transfer to the RESISC45 dataset,9 using diverse evaluation methods and amounts of labels. We provide
Further author information: (Send correspondence to J. Bourcier)
J. Bourcier: E-mail: jules.bourcier@preligens.com