

Hierarchical Instance Mixing across Domains in Aerial Segmentation

Edoardo Arnaudo∗1,2, Antonio Tavera∗1, Fabrizio Dominici2, Carlo Masone3, Barbara Caputo1

1Politecnico di Torino, Turin, Italy

2LINKS Foundation, Turin, Italy

3CINI - Consorzio Interuniversitario Nazionale per l’Informatica, Rome, Italy

1{edoardo.arnaudo, antonio.tavera, barbara.caputo}@polito.it

Abstract— We investigate the task of unsupervised domain adaptation in aerial semantic segmentation and discover that the current state-of-the-art algorithms designed for autonomous driving based on domain mixing do not translate well to the aerial setting. This is due to two factors: (i) a large disparity in the extension of the semantic categories, which causes a domain imbalance in the mixed image, and (ii) a weaker structural consistency in aerial scenes than in driving scenes, since the same scene might be viewed from different perspectives and there is no well-defined and repeatable structure of the semantic elements in the images. Our solution to these problems is composed of: (i) a new mixing strategy for aerial segmentation across domains called Hierarchical Instance Mixing (HIMix), which extracts a set of connected components from each semantic mask and mixes them according to a semantic hierarchy, and (ii) a twin-head architecture in which two separate segmentation heads are fed with variations of the same images in a contrastive fashion to produce finer segmentation maps. We conduct extensive experiments on the LoveDA benchmark, where our solution outperforms the current state-of-the-art.

I. INTRODUCTION

Semantic segmentation aims to predict, for each individual pixel in an image, a semantic category from a predefined set of labels. Such a fine-grained understanding of images finds numerous applications in aerial robotics [1]–[9], where it has achieved remarkable results by leveraging deep learning models trained on open datasets with large quantities of labeled images. However, these results do not carry over when the models are deployed to operate on images that come from a distribution (target domain) different from the data experienced during training (source domain). The difficulty in adapting semantic segmentation models to different data distributions is not limited to the aerial setting, and it is tightly linked to the high cost of generating pixel-level annotations [10], which makes it unreasonable to supplement the training dataset with large quantities of labeled images from the target domain. A recent trend in the state of the art addresses this challenge using domain mixing as an online augmentation to create artificial images with elements from both the source and the target domain, thus encouraging the model to learn domain-agnostic features [11]–[14]. In particular, both DACS [12] and DAFormer [13] rely on ClassMix [15] to dynamically create a binary mixing mask for a pair of source-target images by randomly selecting half of the classes from their semantic labels (the true label for the source, the predicted pseudo-label for the target). Although this mixing strategy yields state-of-the-art results in driving scenes, it is less effective in an aerial context. We conjecture that this is largely caused by two factors:

Domain imbalance in mixed images. Segmentation-oriented aerial datasets are often characterized by categories with vastly different extensions (e.g., cars and forest). While this may be dealt with using techniques such as multi-scale training in standard semantic segmentation [16], the disparity in raw pixel counts between classes may be detrimental to an effective domain adaptation through class mixing, as the composition may favor either domain (see Fig. 1 left).

Weak structural consistency. The scenes captured by a front-facing camera onboard a car have a consistent structure, with the street at the bottom, the sky at the top, sidewalks and buildings at the sides, etc. This structure is also preserved across domains, as in the classic Synthia [17] → CityScapes [10] setting. Thus, when copying objects from one image onto the other, they are likely to end up in a reasonable context. This is not true for aerial images, where there is no consistent semantic structure (see Fig. 1 left).

[Figure 1 panels. Source Domain: Image, Ground Truth. Target Domain: Image, Pseudo Label. Standard Class Mix: Class Mix Mask, Mixed Image. HIMix: HIMix Mask, Mixed Image.]

Fig. 1: Class Mix superimposes classes of the source domain onto the target without taking into account the semantic hierarchy of the visual elements. As a result, it generates erroneous images that are detrimental to Unsupervised Domain Adaptation training in the aerial scenario. Instead, our HIMix extracts instances from each semantic label and then composes the mixing mask after sorting the extracted instances based on their pixel count. This mitigates some artifacts (e.g. partial buildings) and improves the balance of the two domains.

∗Equal contribution.

arXiv:2210.06216v1 [cs.CV] 12 Oct 2022
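The class-level mixing described above for ClassMix can be illustrated with a minimal NumPy sketch. It is not the DACS or DAFormer implementation: the function names `classmix_mask` and `mix`, the toy arrays, and the choice to round half of the classes upwards are assumptions made only for illustration.

```python
import numpy as np

def classmix_mask(source_label, rng):
    """Binary mask covering a random half of the classes found in source_label.

    Rounding the half upwards is an assumption of this sketch.
    """
    classes = np.unique(source_label)
    n_pick = (len(classes) + 1) // 2
    picked = rng.choice(classes, size=n_pick, replace=False)
    return np.isin(source_label, picked)

def mix(mask, source, target):
    """Take source where mask is True and target elsewhere (labels or images)."""
    if source.ndim == 3:          # HxWxC image: broadcast the mask over channels
        mask = mask[..., None]
    return np.where(mask, source, target)

rng = np.random.default_rng(0)
src_img = rng.random((4, 4, 3))   # toy source image
tgt_img = rng.random((4, 4, 3))   # toy target image
src_lbl = np.array([[0, 0, 1, 1],
                    [0, 0, 1, 1],
                    [2, 2, 3, 3],
                    [2, 2, 3, 3]])  # ground truth of the source image
tgt_pseudo = np.full((4, 4), 5)     # stand-in for a predicted pseudo-label

mask = classmix_mask(src_lbl, rng)
mixed_img = mix(mask, src_img, tgt_img)
mixed_lbl = mix(mask, src_lbl, tgt_pseudo)
```

Because the mask is built from whole classes, a single large class (such as forest in an aerial tile) can cover most of the mixed image, which is the domain imbalance discussed above.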

To solve both problems, we propose a new mixing strategy for aerial segmentation across domains called Hierarchical Instance Mixing (HIMix). HIMix extracts from each semantic mask a set of connected components, akin to instance labels. The intuition is that aerial tiles often present very large stretches of land, divided into instances (e.g., forested areas separated by a road). HIMix randomly selects from the individual instances a set of layers that will compose the binary mixing mask. This helps to mitigate the pixel imbalance between source and target domains in the artificial image. Afterwards, HIMix composes these sampled layers by sorting them based on the observation that there is a semantic hierarchy in the aerial scenes (e.g., cars lie on the road and roads lie on stretches of land). We use the pixel count of the instances to determine their order in this hierarchy, placing smaller layers on top of larger ones. While not optimal in some contexts (e.g., buildings should not appear on top of water bodies), this ordering also reduces the bias towards those categories with larger surfaces in terms of pixels as they are placed below the other layers of the mask (see Fig. 1 right).
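The steps just described (instance extraction, random selection, ordering by pixel count) can be sketched as follows. This is one possible reading and not the authors' code: it assumes that instances are drawn from both the source label and the target pseudo-label, that a fixed fraction `keep_ratio` of them is sampled, and that 4-connectivity defines a connected component. The names `extract_instances` and `himix_mask` are hypothetical.

```python
import numpy as np
from scipy import ndimage

def extract_instances(label, domain):
    """Split every class of a label map into connected components.

    Returns (boolean mask, pixel count, domain) tuples; domain is 1 for
    source and 0 for target (a convention assumed by this sketch).
    """
    instances = []
    for cls in np.unique(label):
        components, count = ndimage.label(label == cls)  # 4-connectivity in 2D
        for idx in range(1, count + 1):
            inst = components == idx
            instances.append((inst, int(inst.sum()), domain))
    return instances

def himix_mask(source_label, target_pseudo, rng, keep_ratio=0.5):
    """Binary mixing mask: True where the mixed image takes source pixels."""
    instances = (extract_instances(source_label, 1)
                 + extract_instances(target_pseudo, 0))
    n_pick = max(1, int(round(len(instances) * keep_ratio)))
    chosen = rng.choice(len(instances), size=n_pick, replace=False)
    layers = [instances[i] for i in chosen]
    # Paint the largest layers first so that smaller ones end up on top.
    layers.sort(key=lambda layer: layer[1], reverse=True)
    mask = np.zeros(source_label.shape, dtype=bool)  # default: target pixels
    for inst, _, domain in layers:
        mask[inst] = bool(domain)
    return mask

src_lbl = np.array([[0, 0, 1, 0],
                    [0, 0, 1, 0],
                    [0, 0, 1, 0],
                    [0, 0, 1, 0]])   # land split in two by a vertical road
tgt_pseudo = np.array([[2, 2, 2, 2],
                       [2, 3, 2, 2],
                       [2, 2, 2, 2],
                       [2, 2, 2, 2]])  # one small object on a large background
rng = np.random.default_rng(0)
full_mask = himix_mask(src_lbl, tgt_pseudo, rng, keep_ratio=1.0)
half_mask = himix_mask(src_lbl, tgt_pseudo, rng)
```

With `keep_ratio=1.0` every instance is layered, so the single-pixel target object stays visible on top of the larger source regions instead of being overwritten by them.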

Besides the mixing strategy itself, there is also the general problem that the effectiveness of the domain mixing is strongly dependent on the accuracy of the pseudo-labels generated on the target images during training. This is especially true when the combination itself requires layering individual entities from either domain into a more coherent label. A key factor for an effective domain adaptation using self-training is in fact the ability to produce consistent predictions, resilient to visual changes. For this reason, we propose as a second contribution a twin-head UDA architecture in which two separate segmentation heads are fed with contrastive variations of the same images to improve pseudo-label confidence and make the model more robust and less susceptible to perturbations across domains, inevitably driving the model towards augmentation-consistent representations.
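The idea of two heads producing consistent predictions on different variations of the same input can be illustrated with a generic consistency sketch. It does not reproduce the twin-head architecture or its loss, which are not specified in this section: the linear heads, the flip-plus-noise perturbation applied to stand-in features, the mean squared consistency term and the averaged pseudo-label are all assumptions.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def head(features, weights):
    """Hypothetical linear segmentation head: per-pixel class probabilities."""
    return softmax(features @ weights)

rng = np.random.default_rng(0)
H, W, C, K = 4, 4, 8, 3                   # height, width, channels, classes
features = rng.normal(size=(H, W, C))     # stand-in for shared encoder output
w_a = rng.normal(size=(C, K))             # weights of the first head
w_b = rng.normal(size=(C, K))             # weights of the second head

# Two variations of the same input: the original and a flipped, noisy copy.
view_a = features
view_b = features[:, ::-1] + 0.05 * rng.normal(size=(H, W, C))

probs_a = head(view_a, w_a)
probs_b = head(view_b, w_b)[:, ::-1]      # undo the flip to realign pixels

# Penalize disagreement between the two heads (assumed consistency term).
consistency_loss = float(np.mean((probs_a - probs_b) ** 2))

# Average the heads to obtain a pseudo-label and a per-pixel confidence.
mean_probs = 0.5 * (probs_a + probs_b)
pseudo_label = mean_probs.argmax(axis=-1)
confidence = mean_probs.max(axis=-1)
```

In a self-training loop, a term like `consistency_loss` would be minimized alongside the segmentation loss, and `confidence` could be used to weight or filter the pseudo-labels fed to the mixing step.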

We test our complete framework on the LoveDA benchmark [18], the only dataset designed for evaluating unsupervised domain adaptation in aerial segmentation, where we exceed the current state of the art. We further provide a comprehensive ablation study to assess the impact of the proposed solutions. The code will be made available to the public to foster research in this field.

II. RELATED WORK

A. Aerial Semantic Segmentation

Current semantic segmentation methods mostly rely on convolutional encoder-decoder architectures [19]–[22], but the recent breakthroughs of vision Transformers introduced new effective encoder architectures such as ViT [23], Swin [24] or Twins [25], as well as end-to-end segmentation approaches such as Segmenter [26] and SegFormer [27]. Concerning the application to aerial images, despite a processing pipeline comparable to that of other settings, there are peculiar challenges that call for specific solutions. Firstly, aerial and satellite data often include multiple spectra besides the visible bands, which can be leveraged in different ways, such as including them as extra channels [9] or adopting multi-modal encoders [4]. Visual features represent another major difference: unlike other settings, aerial scenes often display a large number of entities on complex backgrounds, with wider spatial relationships. In this case, attention layers [28] or relation networks [29] are employed to better model long-distance similarities among pixels. Another distinctive trait of aerial imagery is the top-down point of view and the lack of the reference points that can be observed in natural images (e.g., sky always on top). This can be exploited to produce rotation-invariant features using ad-hoc networks [30], [31], or through regularization [32]. Lastly, aerial images are characterized by disparities in class distributions, since these include small objects (e.g., cars) and large stretches of land. This pixel imbalance can be addressed with sampling and class weighting [13], or ad-hoc loss functions [33].

B. Domain Adaptation

Domain Adaptation (DA) is the task of training a model on one domain while adapting it to another. The main objective of domain adaptation is to bridge the domain shift between these two dissimilar distributions, which are commonly referred to as the source and target domains. The initial DA techniques proposed in the literature attempt to minimize a measure of divergence across domains by utilizing a distance measure such as the MMD [34]–[36]. Another popular approach to DA in semantic segmentation is adversarial training [37]–[40], which involves playing a min-max game between the segmentation network and a discriminator. The latter is responsible for discriminating between domains, whereas the segmentation network attempts to trick it by making features of the two distributions identical. Other approaches, such as [41]–[43], employ image-to-image translation algorithms to generate target pictures styled as source images or vice versa, while [44] identifies the batch normalization layer as the major bottleneck for domain adaptation. More recent methods like [45]–[47] use self-learning techniques to generate fine pseudo-labels on target data to fine-tune the model, whereas [12], [13] combine self-training with class mix to reduce low-quality pseudo-labels caused by domain shifts among the different distributions.

These mixing algorithms are very effective on data with a consistent semantic organization of the scene, such as in self-driving scenes [10], [48]. In these scenarios, naively copying half of the source image onto the target image increases the likelihood that the semantic elements will end up in a reasonable context. This is not the case with aerial imagery (see Fig. 1). HIMix not only mitigates this problem, but it also reduces the bias towards categories with larger surfaces.
