Hierarchical Instance Mixing across Domains in Aerial Segmentation
Edoardo Arnaudo1,2, Antonio Tavera1, Fabrizio Dominici2, Carlo Masone3, Barbara Caputo1
1Politecnico di Torino, Turin, Italy
2LINKS Foundation, Turin, Italy
3CINI - Consorzio Interuniversitario Nazionale per l’Informatica, Rome, Italy
1{edoardo.arnaudo, antonio.tavera, barbara.caputo}@polito.it
Abstract: We investigate the task of unsupervised domain
adaptation in aerial semantic segmentation and discover that
the current state-of-the-art algorithms designed for autonomous
driving based on domain mixing do not translate well to the
aerial setting. This is due to two factors: (i) a large disparity in
the extension of the semantic categories, which causes a domain
imbalance in the mixed image, and (ii) a weaker structural
consistency in aerial scenes than in driving scenes since the
same scene might be viewed from different perspectives and
there is no well-defined and repeatable structure of the semantic
elements in the images. Our solution to these problems is
composed of: (i) a new mixing strategy for aerial segmentation
across domains called Hierarchical Instance Mixing (HIMix),
which extracts a set of connected components from each
semantic mask and mixes them according to a semantic hierarchy,
and (ii) a twin-head architecture in which two separate
segmentation heads are fed with variations of the same images
in a contrastive fashion to produce finer segmentation maps.
We conduct extensive experiments on the LoveDA benchmark,
where our solution outperforms the current state-of-the-art.
I. INTRODUCTION
Semantic segmentation aims to predict, for each individual
pixel in an image, a semantic category from a predefined set
of labels. Such a fine-grained understanding of images finds
numerous applications in aerial robotics [1]–[9], where it
has achieved remarkable results by leveraging deep learning
models trained on open datasets with large quantities of labeled
images. However, these results do not carry over when
the models are deployed to operate on images that come
from a distribution (target domain) different from the data
experienced during training (source domain). The difficulty
in adapting semantic segmentation models to different data
distributions is not limited to the aerial setting, and it
is tightly linked to the high cost of generating pixel-level
annotations [10], which makes it unreasonable to supplement
the training dataset with large quantities of labeled images
from the target domain. A recent trend in the state-of-the-art
addresses this challenge using domain mixing as an online
augmentation to create artificial images with elements from
both the source and the target domain, thus encouraging
the model to learn domain-agnostic features [11]–[14]. In
particular, both DACS [12] and DAFormer [13] rely on
ClassMix [15] to dynamically create a binary mixing mask
for a pair of source-target images by randomly selecting half
of the classes from their semantic labels (the true label for the
source, the predicted pseudo-label for the target).

(Author footnote: equal contribution.)

[Figure 1. Panels: Source Domain (Image, Ground Truth); Target Domain (Image, Pseudo Label); Standard Class Mix (Class Mix Mask, Mixed Image); HIMix (HIMix Mask, Mixed Image).]
Fig. 1: Class Mix superimposes classes of the source domain
onto the target without taking into account the semantic
hierarchy of the visual elements. As a result, it generates
erroneous images that are detrimental to Unsupervised Domain
Adaptation training in the aerial scenario. Instead,
our HIMix extracts instances from each semantic label and
then composes the mixing mask after sorting the extracted
instances based on their pixel count. This mitigates some
artifacts (e.g. partial buildings) and improves the balance of
the two domains.

Although
this mixing strategy yields state-of-the-art results in driving
scenes, it is less effective in an aerial context. We conjecture
that this is largely caused by two factors:
Domain imbalance in mixed images. Segmentation-
oriented aerial datasets are often characterized by categories
with vastly different extensions (e.g., cars and forest). While
this may be dealt with using techniques such as multi-scale training
in standard semantic segmentation [16], the disparity in raw
pixel counts between classes may be detrimental to
effective domain adaptation through class mixing, as the
composition may favor either domain (see Fig. 1 left).
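The imbalance described above can be illustrated with a minimal sketch of class-level mixing on a toy label map. This is not the implementation used in [12], [13] or [15]: the function names, the rounding of "half of the classes", and the toy class ids are assumptions made only for illustration.

```python
import random


def classmix_mask(source_label, rng):
    """Binary mask that is 1 where the source pixel belongs to a chosen class."""
    classes = sorted({c for row in source_label for c in row})
    # Assumption: "half of the classes" is rounded down, with at least one class.
    n_chosen = max(1, len(classes) // 2)
    chosen = set(rng.sample(classes, n_chosen))
    return [[1 if c in chosen else 0 for c in row] for row in source_label]


def apply_mask(source, target, mask):
    """Take source values where the mask is 1 and target values elsewhere."""
    return [
        [s if m else t for s, t, m in zip(s_row, t_row, m_row)]
        for s_row, t_row, m_row in zip(source, target, mask)
    ]


def source_fraction(mask):
    """Share of pixels in the mixed sample that come from the source domain."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total


# Toy 4x4 source label: class 0 ("forest") covers 14 pixels, class 1 ("car") covers 2.
source_label = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# Toy target pseudo-label made of a single class with id 2.
target_pseudo_label = [[2] * 4 for _ in range(4)]

mask = classmix_mask(source_label, random.Random(0))
mixed_label = apply_mask(source_label, target_pseudo_label, mask)
print(source_fraction(mask))
```

With only two classes of very different extension, the mixed sample is either almost entirely source (0.875 when the large class is drawn) or almost entirely target (0.125 when the small class is drawn), which is the domain imbalance discussed above.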
Weak structural consistency. The scenes captured by a
front-facing camera onboard a car have a consistent structure,
with the street at the bottom, the sky at the top, sidewalks and
buildings at the sides, and so on. This structure is also
preserved across domains, as in the classic Synthia [17] →
CityScapes [10] setting. Thus, when objects are copied from one
image onto the other, they are likely to end up in a reasonable
context. This is not true for aerial images, where there is no
consistent semantic structure (see Fig. 1 left).

arXiv:2210.06216v1 [cs.CV] 12 Oct 2022

To solve both problems, we propose a new mixing strategy
for aerial segmentation across domains called Hierarchical
Instance Mixing (HIMix). HIMix extracts from each semantic
mask a set of connected components, akin to instance
labels. The intuition is that aerial tiles often present very
large stretches of land, divided into instances (e.g., forested
areas separated by a road). HIMix randomly selects from
the individual instances a set of layers that will compose
the binary mixing mask. This helps to mitigate the pixel
imbalance between source and target domains in the artificial
image. Afterwards, HIMix composes these sampled layers by
sorting them based on the observation that there is a semantic
hierarchy in the aerial scenes (e.g., cars lie on the road and
roads lie on stretches of land). We use the pixel count of the
instances to determine their order in this hierarchy, placing
smaller layers on top of larger ones. While not optimal in
some contexts (e.g., buildings should not appear on top of
water bodies), this ordering also reduces the bias towards
those categories with larger surfaces in terms of pixels as
they are placed below the other layers of the mask (see Fig. 1
right).
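The steps described above (instance extraction, random selection, size-based layering) can be sketched as follows. This is one possible reading of the description, not the authors' code. It assumes that instances are pooled from both the source label and the target pseudo-label, that source layers write 1 and target layers write 0 into the mask, that components are 4-connected, and that unselected pixels default to the target. The names `connected_components`, `himix_mask` and `keep_ratio` are hypothetical.

```python
import random
from collections import deque


def connected_components(label):
    """Split a 2-D class map into 4-connected instances: (class_id, pixel_set)."""
    h, w = len(label), len(label[0])
    seen = [[False] * w for _ in range(h)]
    instances = []
    for r in range(h):
        for c in range(w):
            if seen[r][c]:
                continue
            cls, pixels, queue = label[r][c], set(), deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill over same-class neighbours
                y, x = queue.popleft()
                pixels.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and label[ny][nx] == cls):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            instances.append((cls, pixels))
    return instances


def himix_mask(source_label, target_pseudo_label, rng, keep_ratio=0.5):
    """Compose a binary mixing mask (1 = source, 0 = target) from instances."""
    h, w = len(source_label), len(source_label[0])
    # Assumption: instances of both domains are pooled before sampling.
    pool = [(1, px) for _, px in connected_components(source_label)]
    pool += [(0, px) for _, px in connected_components(target_pseudo_label)]
    layers = rng.sample(pool, max(1, int(len(pool) * keep_ratio)))
    # Paint the largest layers first, so smaller instances end up on top.
    layers.sort(key=lambda layer: len(layer[1]), reverse=True)
    mask = [[0] * w for _ in range(h)]  # unselected pixels default to target
    for value, pixels in layers:
        for y, x in pixels:
            mask[y][x] = value
    return mask


# Toy source label: a 2x2 "building" (class 1) on a 12-pixel background (class 0).
source_label = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# Toy target pseudo-label: two 8-pixel regions (classes 2 and 3).
target_pseudo_label = [[2, 2, 3, 3] for _ in range(4)]

# keep_ratio=1.0 keeps every instance, which makes the layering deterministic.
mask = himix_mask(source_label, target_pseudo_label, random.Random(0), keep_ratio=1.0)
for row in mask:
    print(row)
# Only the small 2x2 source instance survives on top (top-left block of ones).
```

In this toy case the 12-pixel source background is painted first and is then covered by the two smaller target regions, so it does not dominate the mixed sample. The 4-pixel source instance is painted last and stays visible, which matches the stated goal of reducing the bias towards large-surface categories.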
Besides the mixing strategy itself, there is also the general
problem that the effectiveness of domain mixing is
strongly dependent on the accuracy of the pseudo-labels generated
on the target images during training. This is especially
true when the combination itself requires layering individual
entities from either domain into a more coherent label. A key
factor for effective domain adaptation using self-training
is in fact the ability to produce consistent predictions that are
resilient to visual changes. For this reason, we propose as a
second contribution a twin-head unsupervised domain adaptation
(UDA) architecture in which
two separate segmentation heads are fed with contrastive
variations of the same images to improve pseudo-label confidence
and make the model more robust and less susceptible
to perturbations across domains, driving the model
towards augmentation-consistent representations.
We test our complete framework on the LoveDA benchmark
[18], the only dataset designed for evaluating unsupervised
domain adaptation in aerial segmentation, where we
exceed the current state-of-the-art. We further provide a
comprehensive ablation study to assess the impact of the
proposed solutions. The code will be made available to the
public to foster research in this field.
II. RELATED WORK
A. Aerial Semantic Segmentation
Current semantic segmentation methods mostly rely on
convolutional encoder-decoder architectures [19]–[22], but
the recent breakthroughs of vision Transformers introduced
new effective encoder architectures such as ViT [23], Swin
[24] or Twins [25], as well as end-to-end segmentation
approaches such as Segmenter [26] and SegFormer [27].
Concerning the application to aerial images, despite a
processing pipeline comparable to that of other settings, there
are peculiar challenges that demand specific solutions.
Firstly, aerial and satellite data often include multiple spectra
besides the visible bands, which can be leveraged in
different ways, such as including them as extra channels
[9] or adopting multi-modal encoders [4]. Visual features
represent another major difference: unlike other settings,
aerial scenes often display a large number of entities on
complex backgrounds, with wider spatial relationships. In
this case, attention layers [28] or relation networks [29] are
employed to better model long-distance similarities among
pixels. Another distinctive trait of aerial imagery is the top-
down point of view and the lack of reference points that
can be observed in natural images (e.g., sky always on top).
This can be exploited to produce rotation-invariant features
using ad-hoc networks [30], [31], or through regularization
[32]. Lastly, aerial images are characterized by disparities
in class distributions, since these include small objects (e.g.
cars) and large stretches of land. This pixel imbalance can be
addressed with sampling and class weighting [13], or ad-hoc
loss functions [33].
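As a small illustration of the class weighting mentioned above, the following sketch computes generic inverse-frequency weights from a label map. This is a common textbook scheme and is not claimed to be the one used in [13] or [33]. The function name and the toy class ids are assumptions.

```python
from collections import Counter


def inverse_frequency_weights(label):
    """Weight each class by total / (num_classes * pixel_count)."""
    counts = Counter(c for row in label for c in row)
    total = sum(counts.values())
    n_classes = len(counts)
    return {cls: total / (n_classes * count) for cls, count in counts.items()}


# Toy 4x4 label: class 0 (large stretch of land) has 14 pixels, class 1 (car) has 2.
label = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
weights = inverse_frequency_weights(label)
print(weights)
```

Under this scheme the rare class receives a weight of 4.0 and the dominant class a weight of about 0.57, so a per-pixel loss scaled by these weights gives both classes the same total contribution.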
B. Domain Adaptation
Domain Adaptation (DA) is the task of training
a model on one domain while adapting it to another. The
main objective of domain adaptation is to reduce the domain
shift between these two dissimilar distributions, which are
commonly referred to as the source and target domains.
The initial DA techniques proposed in the literature attempt
to minimize a measure of divergence across domains by
utilizing a distance measure such as the Maximum Mean
Discrepancy (MMD) [34]–[36]. Another popular approach to DA
in semantic segmentation is
adversarial training [37]–[40], which involves playing a min-max
game between the segmentation network and a discriminator.
The latter is responsible for discriminating between
domains, whereas the segmentation network attempts to trick
it by making the features of the two distributions indistinguishable.
Other approaches, such as [41]–[43], employ image-to-image
translation algorithms to generate target pictures styled as
source images or vice versa, while [44] identifies the batch
normalization layers as the major bottleneck for domain
adaptation. More recent methods like [45]–[47] use self-learning
techniques to generate fine pseudo-labels on target data to
fine-tune the model, whereas [12], [13] combine self-training
with class mix to reduce low-quality pseudo-labels caused by
domain shifts among the different distributions.
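To make the divergence-based family mentioned above more concrete, the following is a minimal sketch of the (biased) squared MMD estimate between two sets of feature vectors with a Gaussian kernel. It is a generic textbook formulation, not the specific variant of [34]–[36]. The kernel choice, the `gamma` parameter and the toy features are assumptions.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2) between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(a, b):
        return sum(rbf_kernel(u, v, gamma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Toy 2-D features: the "target" set is a shifted copy of the "source" set.
source_features = [(0.0, 0.0), (1.0, 0.0)]
target_features = [(3.0, 3.0), (4.0, 3.0)]

print(mmd_squared(source_features, source_features))  # identical sets give 0.0
print(mmd_squared(source_features, target_features))  # shifted sets give a positive value
```

A DA method of this family would add such a term to the training loss, computed on network features of source and target batches, so that minimizing it pulls the two feature distributions together.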
These mixing algorithms are very effective on data with a
consistent semantic organization of the scene, such as in self-
driving scenes [10], [48]. In these scenarios, naively copying
half of the source image onto the target image increases
the likelihood that the semantic elements will end up in a
reasonable context. This is not the case with aerial imagery
(see Fig. 1). HIMix not only mitigates this problem, but it
also reduces the bias towards categories with larger surfaces.