
Evaluating the label efficiency of contrastive self-supervised
learning for multi-resolution satellite imagery
Jules Bourciera,b, Gohar Dashyanb, Jocelyn Chanussota, and Karteek Alaharia
aUniv. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
bPreligens (ex-Earthcube), 75000 Paris, France
ABSTRACT
The application of deep neural networks to remote sensing imagery is often constrained by the lack of ground-
truth annotations. Addressing this issue requires models that generalize efficiently from limited amounts of
labeled data, allowing us to tackle a wider range of Earth observation tasks. Another challenge in this domain
is developing algorithms that operate at variable spatial resolutions, e.g., for the problem of classifying land use
at different scales. Recently, self-supervised learning has been applied in the remote sensing domain to exploit
readily-available unlabeled data, and was shown to reduce or even close the gap with supervised learning. In this
paper, we study self-supervised visual representation learning through the lens of label efficiency, for the task
of land use classification on multi-resolution/multi-scale satellite images. We benchmark two contrastive self-
supervised methods adapted from Momentum Contrast (MoCo) and provide evidence that these methods can
perform effectively given little downstream supervision, where randomly initialized networks fail to generalize.
Moreover, they outperform out-of-domain pretraining alternatives. We use the large-scale fMoW dataset to
pretrain and evaluate the networks, and validate our observations with transfer to the RESISC45 dataset.
Keywords: deep learning, computer vision, remote sensing, self-supervised learning, land use classification,
label-efficient learning, optical imagery
1. INTRODUCTION
The application of deep learning techniques to remote sensing imagery presents many challenges, one of the most
important being the scarcity of annotations. Although large amounts of satellite imagery are readily available,
curating and annotating them for a specific Earth observation task is usually very expensive, time-consuming,
and requires fine-grained domain expertise. This implies that it is impractical in many real-world contexts to acquire the
data needed to effectively leverage classical supervised learning methods. From this perspective, it is necessary to
develop label-efficient approaches, i.e., models that are able to learn with few annotated samples. Self-supervised
learning (SSL) is a promising approach for this purpose, as it pretrains representations without requiring human
labeling. Inspired by the success of recent methods on natural image benchmarks,1–5 SSL has been applied in
the remote sensing domain to exploit the plentiful unlabeled data, and was shown to reduce or even close the
gap with supervised learning and transfer from ImageNet6, 7 (IN).
Another common problem in remote sensing is to process images covering various spatial scales, e.g., for
the task of classifying land use, where categories can range from individual storage tanks to full harbors. The
capacity of SSL methods to generalize from few labels on this important problem has not, to the best of our
knowledge, been explored in prior work.
To address this, in this paper we study self-supervised visual representation learning through the lens of
label efficiency, for the classification of land use at different spatial resolutions. We benchmark two contrastive
self-supervised methods adapted from Momentum Contrast (MoCo)1 to assess their capacity to learn generic,
multi-resolution features, which do not need many labeled examples for downstream image classification. We use
the large-scale and diverse fMoW dataset8 to pretrain and evaluate the networks, and validate our observations
with transfer to the RESISC45 dataset,9 using diverse evaluation methods and amounts of labels. We provide
Further author information: (Send correspondence to J. Bourcier)
J. Bourcier: E-mail: jules.bourcier@preligens.com