be intrinsically rare, difficult to localize and to
identify accurately. This makes it impractical to acquire the thousands of examples typically required for classic supervised deep learning methods to generalize. Consequently, a major challenge
is the development of label-efficient approaches, i.e.
models that are able to learn with few annotated
examples.
To reduce the number of training samples required for difficult vision tasks such as object detection, transfer learning from pretrained neural networks is used
extensively. The idea is to reuse a network trained
upstream on a large, diverse source dataset. Ima-
geNet [18] has become the de facto standard for
pretraining: thanks to its large scale and genericity, ImageNet-pretrained models have proven adaptable beyond their source domain, including remote sensing imagery [16]. Nonetheless, the domain gap between ImageNet and remote sensing raises questions about the limitations of this transfer when very few samples are available for the task at hand, e.g. the detection of rare observables in satellite
images. To fit the distributions of downstream tasks
with maximum efficiency, one would ideally use
generic in-domain representations, obtained by pre-
training on large amounts of remote sensing data.
This is infeasible in the remote sensing domain due
to the difficulty of curating and labeling these data
at the scale of ImageNet. However, imaging satel-
lites provide an ever-growing amount of unlabeled
data, which makes it highly relevant for learning
visual representations in an unsupervised way.
Self-supervised learning (SSL) has recently
emerged as an effective paradigm for learning repre-
sentations on unlabeled data. It derives a supervision signal from the data itself, by solving a pretext task on the inputs, in order to learn semantic representations. A model trained in a self-supervised fashion
can then be transferred using the same methods as a network pretrained on a supervised task. In the last two years, SSL has shown impressive results, closing the gap with, or even outperforming, supervised learning on multiple benchmarks [2], [3], [7], [8]. Recently, SSL has been applied in the
remote sensing domain to exploit readily-available
unlabeled data, and has been shown to reduce or even
close the gap with transfer from ImageNet [1], [15],
Nonetheless, the capacity of these methods to generalize from few labels has not yet been explored on the important problem of object detection in VHR satellite images.
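To illustrate the pretext-task principle, consider rotation prediction (a classic pretext task from the SSL literature, shown here only as an example; it is not the contrastive method studied in this paper, and the helper below is a hypothetical sketch): each image is labeled by the rotation applied to it, so supervision comes for free from the unlabeled data.

```python
import numpy as np

def rotation_pretext_batch(images, rng):
    """Build a self-labeled batch: each image is rotated by a random
    multiple of 90 degrees, and the rotation index is the target."""
    labels = rng.integers(0, 4, size=len(images))  # 0, 90, 180, 270 degrees
    rotated = np.stack([np.rot90(img, k) for img, k in zip(images, labels)])
    return rotated, labels
```

A network trained to predict `labels` from `rotated` must extract orientation-sensitive semantic features, without any human annotation.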
In this paper, we explore in-domain self-
supervised representation learning for the task of
object detection on VHR optical satellite imagery.
We use the large land use classification dataset
Functional Map of the World (fMoW) [5] to pretrain
representations using the unsupervised framework
of MoCo [8]. We then investigate transferability to a difficult real-world task of fine-grained vehicle
detection on proprietary data, which is designed
to be representative of an operational use case of
strategic site surveillance. Our contributions are:
• We apply a method based on MoCo with temporal positives [1] to learn self-supervised representations of remote sensing images, which we improve using (i) additional augmentations for rotational invariance; (ii) a corrected loss function that removes false temporal negatives from the learning process.
• We investigate the benefit of in-domain self-
supervised pretraining as a function of the
annotation effort, using different budgets of
annotated instances for detecting vehicles.
• We show that our method is better than or
at least competitive with supervised ImageNet
pretraining, despite using no upstream labels
and 3× less upstream data.
Furthermore, our in-domain SSL model is more
label-efficient than ImageNet: when using very
limited annotation budgets (∼20 images totalling ∼12k observables), we outperform ImageNet pretraining by 4 points of AP on vehicle detection and 0.5 point of mAP on joint detection and classification.
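To make the loss correction concrete, the sketch below shows, for a single query, a contrastive InfoNCE-style loss in which queued keys flagged as temporal views of the same scene are excluded from the denominator, so they no longer act as false negatives. This is a simplified single-query illustration under our own naming assumptions (`false_neg_mask` and the function itself are hypothetical), not the exact implementation used in our experiments.

```python
import numpy as np

def masked_info_nce(query, key_pos, queue, false_neg_mask, tau=0.07):
    """InfoNCE for one L2-normalized query vector: the positive is a
    temporal view of the same scene; queued keys flagged in
    `false_neg_mask` (other views of that scene) are dropped from the
    denominator instead of being treated as negatives."""
    l_pos = query @ key_pos / tau        # similarity to the positive key
    l_neg = (queue @ query) / tau        # similarities to all queued keys
    l_neg = l_neg[~false_neg_mask]       # remove false temporal negatives
    logits = np.concatenate(([l_pos], l_neg))
    # numerically stable cross-entropy with the positive at index 0
    m = logits.max()
    return float(-l_pos + m + np.log(np.exp(logits - m).sum()))
```

With the mask applied, a queued temporal view of the query's scene no longer pushes the loss up, so the model is not penalized for mapping two views of the same location to nearby representations.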
II. RELATED WORK
A. Self-supervised representation learning
SSL methods use unlabeled data to learn repre-
sentations that are transferable to downstream tasks
(e.g. image classification or object detection) for
which annotated data samples are insufficient. In
recent years, these methods have been successfully
applied to computer vision with impressive results