Evaluating the label efficiency of contrastive self-supervised
learning for multi-resolution satellite imagery
Jules Bourciera,b, Gohar Dashyanb, Jocelyn Chanussota, and Karteek Alaharia
aUniv. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
bPreligens (ex-Earthcube), 75000 Paris, France
ABSTRACT
The application of deep neural networks to remote sensing imagery is often constrained by the lack of ground-
truth annotations. Addressing this issue requires models that generalize efficiently from limited amounts of
labeled data, allowing us to tackle a wider range of Earth observation tasks. Another challenge in this domain
is developing algorithms that operate at variable spatial resolutions, e.g., for the problem of classifying land use
at different scales. Recently, self-supervised learning has been applied in the remote sensing domain to exploit
readily-available unlabeled data, and was shown to reduce or even close the gap with supervised learning. In this
paper, we study self-supervised visual representation learning through the lens of label efficiency, for the task
of land use classification on multi-resolution/multi-scale satellite images. We benchmark two contrastive self-
supervised methods adapted from Momentum Contrast (MoCo) and provide evidence that these methods can
perform effectively given little downstream supervision, where randomly initialized networks fail to generalize.
Moreover, they outperform out-of-domain pretraining alternatives. We use the large-scale fMoW dataset to
pretrain and evaluate the networks, and validate our observations with transfer to the RESISC45 dataset.
Keywords: deep learning, computer vision, remote sensing, self-supervised learning, land use classification,
label-efficient learning, optical imagery
1. INTRODUCTION
The application of deep learning techniques to remote sensing imagery presents many challenges, one of the most
important being the scarcity of annotations. Although large amounts of satellite imagery are readily available,
curating and annotating them for a specific Earth observation task is usually very expensive, time-consuming
and requires fine domain expertise. This implies that it is impractical in many real-world contexts to acquire the
data needed to effectively leverage classical supervised learning methods. From this perspective, it is necessary to
develop label-efficient approaches, i.e., models that are able to learn with few annotated samples. Self-supervised
learning (SSL) is a promising approach for this purpose, as it pretrains representations without requiring human
labeling. Inspired by the success of recent methods on natural image benchmarks,1–5 SSL has been applied in
the remote sensing domain to exploit the plentiful unlabeled data, and was shown to reduce or even close the
gap with supervised learning and transfer from ImageNet6,7 (IN).
Another common problem in remote sensing is to process images covering various spatial scales, e.g., for
the task of classifying land use, where categories can range from individual storage tanks to full harbors. The
capacity of SSL methods to generalize from few labels on this important problem was not explored by previous
works, to the best of our knowledge.
To address this, in this paper we study self-supervised visual representation learning through the lens of
label efficiency, for the classification of land use at different spatial resolutions. We benchmark two contrastive
self-supervised methods adapted from Momentum Contrast (MoCo)1 to assess their capacity to learn generic,
multi-resolution features, which do not need many labeled examples for downstream image classification. We use
the large-scale and diverse fMoW dataset8 to pretrain and evaluate the networks, and validate our observations
with transfer to the RESISC45 dataset,9 using diverse evaluation methods and amounts of labels. We provide
Further author information: (Send correspondence to J. Bourcier)
J. Bourcier: E-mail: jules.bourcier@preligens.com
arXiv:2210.06786v1 [eess.IV] 13 Oct 2022
evidence that these methods can be trained effectively in few-label settings that are insufficient for randomly
initialized networks to generalize. Thanks to MoCo with temporal positives,6 when finetuning the pretrained
models to RESISC45 with only about 4 examples per class, we reach 5×the accuracy of a classifier trained
from scratch. Moreover, with simple linear probing on frozen representations, we surpass from-scratch networks
in every label setting on fMoW and RESISC45. Additionally, the MoCo variants applied on fMoW images
outperform out-of-domain pretraining on IN by significant margins, despite being pretrained on 3×less data.
We also reveal that a basic k-nearest neighbors (k-NN) classifier on the learned representations provides out-
of-the-box efficient generalization, and competes with or outperforms finetuning with other methods when only
few labels are available.
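The k-NN evaluation mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; it assumes L2-normalized embeddings compared with cosine similarity and majority voting among neighbors, which are common choices for this protocol:

```python
import numpy as np

def knn_classify(train_emb, train_labels, test_emb, k=5):
    """Classify each test embedding by majority vote among its k nearest
    training embeddings. Embeddings are L2-normalized first, so the dot
    product equals cosine similarity."""
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = test @ train.T                       # (n_test, n_train) similarities
    nn_idx = np.argsort(-sims, axis=1)[:, :k]   # indices of the k nearest neighbors
    preds = []
    for row in nn_idx:
        votes = np.bincount(train_labels[row])  # count labels among neighbors
        preds.append(np.argmax(votes))          # majority vote
    return np.array(preds)
```

Because no classifier is trained, this evaluation directly reflects how well the frozen representation separates classes, which is why it can be competitive in few-label regimes.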
Our main contributions are as follows:
• We experiment with two contrastive SSL methods, MoCo1 and MoCoTP,6 on multi-resolution land use
classification on the fMoW dataset, and observe their efficiency in terms of annotations required, on the
common evaluation settings for representation learning with k-NN, linear, and finetuned classifiers.
• To demonstrate the transferability of the pretrained representations for land use prediction at different
spatial resolutions, we study how the models pretrained on fMoW generalize to the smaller RESISC45
dataset, including settings with extremely few images per class.
2. RELATED WORK
2.1 Self-supervised learning and contrastive learning
SSL methods learn representations of data without relying on manual annotations. They consist of pretraining a
neural network to solve a pretext task on unlabeled data, for the purpose of extracting semantic representations
that allow for effective transfer to downstream predictive tasks such as classification, segmentation or object
detection. In computer vision, the popularity of SSL is due to recent methods that have been shown to perform
comparably well or even better than their supervised counterparts on natural image benchmarks.1–5
Contrastive learning has established itself as the staple framework for SSL of visual representations, with
approaches such as MoCo,1 SimCLR,2 and SwAV.3 These methods work by attracting embeddings of pairs of
sample images known to be semantically similar (positive pairs) while simultaneously repelling pairs of dissimilar
samples (negative pairs). The most common way to define similarity is with the instance discrimination pretext
task,10 in which positives are generated as random data augmentations on the same image, and negatives are
simply generated from different images. Thanks to this objective, the encoder learns to produce close representa-
tions for different views of the same object instance in an image and distant representations for other instances.
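The objective typically used for instance discrimination is the InfoNCE loss. As a minimal sketch (illustrative only; the temperature value and embedding dimensions are assumptions, not taken from this paper), the loss for one query with one positive and K negatives can be written in NumPy as:

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for a single query: cross-entropy over one positive and
    K negatives, on L2-normalized embeddings scaled by a temperature.
    query: (d,), positive: (d,), negatives: (K, d)."""
    q = query / np.linalg.norm(query)
    pos = positive / np.linalg.norm(positive)
    negs = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    # Logits: similarity to the positive first, then to each negative.
    logits = np.concatenate([[q @ pos], negs @ q]) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # positive sits at index 0
```

Minimizing this loss pushes the query's similarity to its positive above its similarity to every negative, which is exactly the attract/repel behavior described above.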
In this work, we use the strong contrastive method of MoCo1 as well as an extension proposed in Ref. 6 that
adapts the learning objective to the spatio-temporal structure of satellite imagery.
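The component that gives MoCo its name is the momentum-updated key encoder: its parameters track an exponential moving average of the query encoder's. A one-line sketch, with parameters represented as plain lists of floats purely for illustration:

```python
def momentum_update(key_params, query_params, m=0.999):
    """MoCo-style momentum update of the key encoder:
    theta_k <- m * theta_k + (1 - m) * theta_q.
    A large m makes the key encoder evolve slowly and smoothly."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]
```

The slowly evolving key encoder keeps the representations of the queued negatives consistent across training steps, which is what allows MoCo to use a large dictionary of negatives.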
2.2 SSL in remote sensing
Following their success in computer vision, several works have applied SSL methods to remotely sensed imagery
for Earth observation tasks. Ref. 11 was one of the first to use contrastive learning for remote sensing represen-
tation learning. Ref. 12 applies a spatial augmentation criterion on top of MoCo.1 These works exploit a relevant
assumption about the remote sensing domain: images that are geographically close should be semantically more
similar than distant images. Another way of making the learning procedure geography-aware is to exploit the
spatio-temporal nature of satellite imagery. Ref. 6 uses spatially-aligned images over time to construct temporal
positive pairs with MoCo. The resulting temporally-aligned features were shown to improve generalization for
classification, segmentation and object detection downstream tasks. In this work, we adopt this model with
notable improvements, to study how such temporal invariance can benefit learning from few labels. In the same
vein, Ref. 7 proposes a method that learns representations that are simultaneously variant and invariant to tem-
poral changes. One can also exploit the multi-spectral and multi-sensor nature of remote sensing. Ref. 13 applies
CMC14 on multi-spectral images, using different subsets of channels as augmented (positive) views. Ref. 15 ex-
tends this to co-located images from multiple sensors, combining different sensor channels to construct positive
pairs.
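Constructing a temporal positive pair as in Ref. 6 amounts to sampling two acquisitions of the same location taken at different times. The sketch below illustrates the idea; the data structure mapping a location id to its time series of images is hypothetical, not the dataset's actual format:

```python
import random

def temporal_positive_pair(images_by_location, location, rng=random):
    """Sample a temporal positive pair: two images of the same location
    captured at different times. `images_by_location` maps a location id
    to a list of (timestamp, image) tuples (hypothetical structure)."""
    series = images_by_location[location]
    # Sample two distinct acquisitions without replacement.
    (t1, img1), (t2, img2) = rng.sample(series, 2)
    return img1, img2
```

Treating such temporally separated views as positives encourages the encoder to be invariant to seasonal and illumination changes while remaining sensitive to what actually distinguishes locations.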