
Figure 1. (a), (b), and (c) show the architectures of U-Net, TransUNet, and SwinUnet, respectively. These figures are inspired directly by [40], [7], and [3].
evidence in that context. However, it is unclear whether their success extends to the unique statistics and conditions of overhead imagery, a major area of vision research. Transformers excel at modeling long-range dependencies, which, while generally beneficial in most vision tasks, may be less important in the segmentation of overhead imagery, where building information is compact, highly localized, and often isolated from other structures. To our knowledge, there has been no systematic study of this question for overhead imagery tasks.
In this work we perform a carefully controlled empirical comparison of three state-of-the-art segmentation models on overhead imagery, where each model utilizes progressively more transformer-based structures. Specifically, we consider the following three models: U-Net [21], TransUNet [7], and SwinUnet [3]. This is the first time TransUNet and SwinUnet have been applied to a large-scale dataset of overhead imagery.¹ Aside from the model architecture, we carefully control all other experimental factors, such as the size of the models, the quantity of training data, and the training procedures. We use a large and diverse dataset of overhead imagery, comprising two publicly available benchmarks, to maximize the generality and relevance of our results. To provide a transparent and unbiased hyperparameter optimization procedure, we use Bayesian Optimization (BO) with a fixed budget of iterations to select the hyperparameters of each model (a minimal sketch of this search loop is given after the contribution list below). We provide each model with approximately 330 hours of optimization time to identify effective hyperparameters.
These experimental controls allow us to study whether, and
to what degree, transformers are beneficial in the context
of overhead imagery. Using our optimized models, we also
conduct several additional ablation studies to evaluate the
impact of specific design choices in the transformer-based
models. We can summarize our contributions as follows:
• The first investigation of two recent state-of-the-art segmentation models for processing overhead imagery: TransUNet [7] and SwinUnet [3].
• The first controlled evaluation of whether, and to what extent, transformers are beneficial for vision models in overhead imagery.

¹The recent work in [43] independently and concurrently studied these models in a complementary setting.
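To make the hyperparameter search concrete, the following is a minimal sketch of a fixed-budget BO loop using scikit-optimize's gp_minimize. The search space, objective, and iteration budget shown here are illustrative assumptions for exposition only; they are not the exact configuration used in our experiments, which fix wall-clock time (roughly 330 hours per model) rather than iteration count.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real


def train_and_validate(lr, batch_size):
    """Placeholder objective: in practice, train the segmentation model
    with these hyperparameters and return a score to minimize, e.g.
    1 - mean IoU on a held-out validation split. The toy expression
    below simply makes this sketch runnable."""
    return (lr - 1e-3) ** 2 + abs(batch_size - 16) / 100.0


# Illustrative search space; the actual per-model search spaces are
# assumptions and would differ in practice.
space = [
    Real(1e-5, 1e-2, prior="log-uniform", name="learning_rate"),
    Integer(4, 32, name="batch_size"),
]


def objective(params):
    # gp_minimize passes candidate values in the order of `space`.
    lr, batch_size = params
    return train_and_validate(lr, int(batch_size))


# A fixed evaluation budget (n_calls) keeps the search comparable
# across models.
result = gp_minimize(objective, space, n_calls=50, random_state=0)
print("best hyperparameters:", result.x, "best objective:", result.fun)
```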
2. Related Work
Segmentation in overhead imagery. Segmentation of overhead imagery requires rich features to describe a vast visual domain, as well as pixel-level precision. Initially developed for medical imagery, U-Net [40] has been shown to be a powerful model for overhead image segmentation [18, 16] and in the broader segmentation community, with many variations on the model, such as Dense-Unet [29], Res-Unet [44], Unet++ [50], V-Net [33], and Unet3+ [19]. This success is often attributed to its autoencoder-like structure, from which the model receives its "U" shape and name, combined with skip connections that feed high-resolution spatial information into the final layers of the model.
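To illustrate this design, the following is a minimal PyTorch sketch of a two-level U-Net-style network with a single skip connection. The channel widths, depth, and layer choices are illustrative assumptions and are far smaller than the models evaluated in this paper.

```python
import torch
import torch.nn as nn


class TinyUNet(nn.Module):
    """Minimal U-Net-style sketch: an encoder-decoder with one skip
    connection, not the exact architecture evaluated in this work."""

    def __init__(self, in_ch=3, num_classes=1):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        # The decoder sees 64 channels: 32 upsampled + 32 from the skip.
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        s = self.enc1(x)             # high-resolution encoder features
        b = self.enc2(self.down(s))  # lower-resolution bottleneck features
        u = self.up(b)               # upsample back to input resolution
        # Skip connection: concatenating the high-resolution encoder
        # features feeds precise spatial information into the final layers.
        return self.head(self.dec(torch.cat([u, s], dim=1)))


x = torch.randn(1, 3, 64, 64)
print(TinyUNet()(x).shape)  # torch.Size([1, 1, 64, 64])
```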
Other models, such as DeepLabv3 [9] and Mask R-CNN [14], have also been used successfully for segmentation of overhead imagery [27, 4]. While these models perform very well, we chose to evaluate U-Net-based architectures due to their high performance and the large number of variant models. This abundance of variants allowed us to more easily compare small changes in the model architecture.
Transformers in segmentation. Very recently, transformers have started to be used in the segmentation of overhead imagery [17, 42], achieving good performance. Transformers had already become common in other domains, particularly medical image segmentation, with models such as TransUNet [7], ViT-V-Net [6], TransClaw U-Net [5], UTNet [13], CoTr [45], and SwinUnet [3].
Evaluation of transformers. Transformers are relatively new in computer vision and have only recently become state-of-the-art. As a result, their impact on performance has not been thoroughly analyzed in many domains and applications, including our own. While prior work has evaluated their generalization capabilities with respect to distribution shift [47] and the transferability of their learned representations [49], these are very general results about feature representations derived for other applications. For
segmentation of overhead imagery, to our knowledge, there