CityScapes [10] setting. Thus, when copying objects from one
image onto another, they are likely to end up in a reasonable
context. This is not true for aerial images, where there is no
consistent semantic structure (see Fig. 1 left).
To solve both problems, we propose a new mixing strategy
for aerial segmentation across domains called Hierarchical
Instance Mixing (HIMix). HIMix extracts from each seman-
tic mask a set of connected components, akin to instance
labels. The intuition is that aerial tiles often present very
large stretches of land, divided into instances (e.g., forested
areas separated by a road). HIMix randomly selects from
the individual instances a set of layers that will compose
the binary mixing mask. This helps to mitigate the pixel
imbalance between source and target domains in the artificial
image. Afterwards, HIMix composes these sampled layers by
sorting them based on the observation that there is a semantic
hierarchy in the aerial scenes (e.g., cars lie on the road and
roads lie on stretches of land). We use the pixel count of the
instances to determine their order in this hierarchy, placing
smaller layers on top of larger ones. While not optimal in
some contexts (e.g., buildings should not appear on top of
water bodies), this ordering also reduces the bias towards
those categories with larger surfaces in terms of pixels as
they are placed below the other layers of the mask (see Fig. 1
right).
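As a rough illustration of the procedure described above, the following Python sketch extracts connected components from two label maps, samples a subset of them, and paints them from largest to smallest so that smaller instances end up on top. It is not the reference implementation: the 4-connectivity, the `keep` sampling ratio, the choice of sampling instances from both the source labels and the target pseudo-labels, and all function names are assumptions made for the example.

```python
from collections import deque

import numpy as np


def connected_components(label):
    """Split a 2-D class mask into 4-connected, single-class boolean masks."""
    h, w = label.shape
    seen = np.zeros((h, w), dtype=bool)
    instances = []
    for y in range(h):
        for x in range(w):
            if seen[y, x]:
                continue
            cls = label[y, x]
            mask = np.zeros((h, w), dtype=bool)
            queue = deque([(y, x)])
            seen[y, x] = True
            while queue:  # breadth-first flood fill over same-class neighbours
                cy, cx = queue.popleft()
                mask[cy, cx] = True
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny, nx] and label[ny, nx] == cls):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            instances.append(mask)
    return instances


def himix_mask(src_label, tgt_label, rng, keep=0.5):
    """Return a boolean mask that is True where the source pixel is used.

    Assumption: a fraction `keep` of the instances of each domain is sampled.
    """
    layers = []
    for from_source, label in ((True, src_label), (False, tgt_label)):
        instances = connected_components(label)
        n = max(1, round(keep * len(instances)))
        for i in rng.choice(len(instances), size=n, replace=False):
            layers.append((from_source, instances[i]))
    # Paint the largest layers first so that smaller ones end up on top.
    layers.sort(key=lambda layer: int(layer[1].sum()), reverse=True)
    mask = np.zeros(src_label.shape, dtype=bool)
    for from_source, instance in layers:
        mask[instance] = from_source
    return mask


def mix(src, tgt, mask):
    """Blend two arrays (labels of shape HxW or images of shape HxWxC)."""
    m = mask if src.ndim == 2 else mask[..., None]
    return np.where(m, src, tgt)


# Toy example: two 8-pixel source instances, one 4-pixel and one 12-pixel target instance.
rng = np.random.default_rng(0)
src_label = np.array([[0] * 4, [0] * 4, [1] * 4, [1] * 4])
tgt_label = np.array([[2] * 4, [3] * 4, [3] * 4, [3] * 4])
mask = himix_mask(src_label, tgt_label, rng, keep=1.0)
mixed_label = mix(src_label, tgt_label, mask)
```

In this toy case the small target instance in the first row stays on top of the larger source instance beneath it, while the large target region is covered by the smaller source instances. On real tiles, a library routine such as `scipy.ndimage.label` applied per class would be far faster than the pure-Python flood fill used here.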
Beyond the mixing strategy itself, the effectiveness of
domain mixing depends strongly on the accuracy of the
pseudo-labels generated on the target images during
training. This is especially
true when the combination itself requires layering individual
entities from either domain into a more coherent label. A key
factor for an effective domain adaptation using self-training
is in fact the ability to produce consistent predictions, re-
silient to visual changes. For this reason, we propose as a
second contribution a twin-head UDA architecture in which
two separate segmentation heads are fed with contrastive
variations of the same images to improve pseudo-label confi-
dence and make the model more robust and less susceptible
to perturbations across domains, thereby driving the model
towards augmentation-consistent representations.
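The text above does not specify how the outputs of the two heads are combined, so the following is only one plausible sketch of augmentation-consistent pseudo-labelling, not the architecture proposed here. It assumes that the two heads produce per-class logits of shape (C, H, W) for two differently augmented views of the same target image, already aligned pixel by pixel, and that a pseudo-label is kept only where the heads agree and their averaged confidence exceeds a threshold. The names `twin_head_pseudo_labels`, `threshold` and `ignore_index` are hypothetical.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def twin_head_pseudo_labels(logits_a, logits_b, threshold=0.9, ignore_index=255):
    """Combine two (C, H, W) logit maps into one (H, W) pseudo-label map.

    Assumption: pixels where the heads disagree, or where the averaged
    confidence is below `threshold`, are marked with `ignore_index`.
    """
    prob = 0.5 * (softmax(logits_a) + softmax(logits_b))
    labels = prob.argmax(axis=0)
    confident = prob.max(axis=0) >= threshold
    agree = logits_a.argmax(axis=0) == logits_b.argmax(axis=0)
    return np.where(confident & agree, labels, ignore_index)


# Two classes, one row of two pixels: the heads agree on pixel 0 only.
logits_a = np.array([[[0.0, 5.0]], [[5.0, 0.0]]])
logits_b = np.array([[[0.0, 0.0]], [[5.0, 5.0]]])
pseudo = twin_head_pseudo_labels(logits_a, logits_b)
```

Masking out disagreeing pixels is a conservative design choice: it trades pseudo-label coverage for reliability, which matters when the labels are later layered into a mixed sample.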
We test our complete framework on the LoveDA benchmark [18],
the only dataset designed for evaluating unsupervised
domain adaptation in aerial segmentation, where we
exceed the current state of the art. We further provide a
comprehensive ablation study to assess the impact of the
proposed solutions. The code will be made available to the
public to foster research in this field.
II. RELATED WORK
A. Aerial Semantic Segmentation
Current semantic segmentation methods mostly rely on
convolutional encoder-decoder architectures [19]–[22], but
the recent breakthroughs of vision Transformers introduced
new effective encoder architectures such as ViT [23], Swin
[24] or Twins [25], as well as end-to-end segmentation
approaches such as Segmenter [26] and SegFormer [27].
Concerning the application to aerial images, although the
processing pipeline is comparable to that of other settings, there
are peculiar challenges that demand specific solutions.
Firstly, aerial and satellite data often include multiple spec-
tra besides the visible bands, which can be leveraged in
different ways, such as including them as extra channels
[9] or adopting multi-modal encoders [4]. Visual features
represent another major difference: unlike other settings,
aerial scenes often display a large number of entities on
complex backgrounds, with wider spatial relationships. In
this case, attention layers [28] or relation networks [29] are
employed to better model long-distance similarities among
pixels. Another distinctive trait of aerial imagery is the top-
down point of view and the lack of reference points that
can be observed in natural images (e.g., sky always on top).
This can be exploited to produce rotation-invariant features
using ad-hoc networks [30], [31], or through regularization
[32]. Lastly, aerial images are characterized by disparities
in class distributions, since they include both small objects (e.g.,
cars) and large stretches of land. This pixel imbalance can be
addressed with sampling and class weighting [13], or ad-hoc
loss functions [33].
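To make the class-weighting idea concrete, the sketch below computes inverse-frequency weights from the pixel counts of a label map. This is one common form of class weighting and not necessarily the scheme adopted in [13]; the function name, the `ignore_index` convention and the normalization to unit mean over the classes present are assumptions for the example.

```python
import numpy as np


def inverse_frequency_weights(label, num_classes, ignore_index=255):
    """Weight each class by the inverse of its pixel frequency.

    Assumption: all valid labels lie in [0, num_classes); absent classes
    receive weight 0, and weights are rescaled to have unit mean over the
    classes that are present.
    """
    valid = label[label != ignore_index]
    counts = np.bincount(valid, minlength=num_classes).astype(float)
    weights = np.zeros(num_classes)
    present = counts > 0
    weights[present] = counts.sum() / counts[present]
    if present.any():
        weights /= weights[present].mean()
    return weights


# 12 pixels of class 0 (e.g., land) and 4 pixels of class 1 (e.g., cars).
label = np.array([[0, 0, 0, 1]] * 4)
weights = inverse_frequency_weights(label, num_classes=3)
```

The rarer class receives three times the weight of the dominant one, which is the effect sought when small objects would otherwise contribute little to the loss.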
B. Domain Adaptation
Domain Adaptation (DA) is the task of training a model
on one domain while adapting it to another. Its
main objective is to bridge the domain
shift between these two dissimilar distributions, which are
commonly referred to as the source and target domains.
The initial DA techniques proposed in the literature attempt
to minimize a measure of divergence across domains,
such as the Maximum Mean Discrepancy (MMD) [34]–[36]. Another
popular approach to DA in semantic segmentation is
adversarial training [37]–[40], which involves playing a min-max
game between the segmentation network and a discriminator.
The latter is responsible for discriminating between
domains, whereas the segmentation network attempts to fool
it by making the features of the two distributions indistinguishable.
Other approaches, such as [41]–[43], employ image-to-image
translation algorithms to generate target images styled as
source images or vice versa, while [44] identifies the batch
normalization layers as the major bottleneck for domain
adaptation. More recent methods like [45]–[47] use self-learning
techniques to generate fine pseudo-labels on target data to
fine-tune the model, whereas [12], [13] combine self-training
with class mix to reduce low-quality pseudo-labels caused by
domain shifts among the different distributions.
These mixing algorithms are very effective on data with a
consistent semantic organization of the scene, such as in self-
driving scenes [10], [48]. In these scenarios, naively copying
half of the source image onto the target image increases
the likelihood that the semantic elements will end up in a
reasonable context. This is not the case with aerial imagery
(see Fig. 1). HIMix not only mitigates this problem, but it
also reduces the bias towards categories with larger surfaces.
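For contrast with the instance-level strategy, a generic class-level mixing step can be sketched as follows. This is an illustrative approximation rather than the exact procedure of [12], [13]: it assumes that half of the classes present in the source label are selected at random and that all of their pixels are pasted onto the target. The function names are hypothetical.

```python
import numpy as np


def class_mix_mask(src_label, rng):
    """Return a boolean mask selecting all pixels of half the source classes."""
    classes = np.unique(src_label)
    n = max(1, len(classes) // 2)
    chosen = rng.choice(classes, size=n, replace=False)
    return np.isin(src_label, chosen)


def apply_mask(src, tgt, mask):
    """Take source values where the mask is True, target values elsewhere."""
    m = mask if src.ndim == mask.ndim else mask[..., None]
    return np.where(m, src, tgt)


rng = np.random.default_rng(0)
src_label = np.array([[0, 0, 1, 1]] * 2)
tgt_label = np.full((2, 4), 9)
mask = class_mix_mask(src_label, rng)
mixed_label = apply_mask(src_label, tgt_label, mask)
```

Because the mask follows whole classes, a single large category can dominate the mixed sample, and the pasted region carries no guarantee of landing in a sensible context when the scene has no fixed layout.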