Transformers For Recognition In Overhead Imagery: A Reality Check
Francesco Luzi
Duke University
francesco.luzi@duke.edu
Aneesh Gupta
Duke University
aneeshgupta8@gmail.com
Leslie Collins
Duke University
leslie.collins@duke.edu
Kyle Bradbury
Duke University
kyle.bradbury@duke.edu
Jordan Malof
University of Montana
jordan.malof@umontana.edu
Abstract
There is evidence that transformers offer state-of-the-art recognition performance on tasks involving overhead imagery (e.g., satellite imagery). However, it is difficult to make unbiased empirical comparisons between competing deep learning models, making it unclear whether, and to what extent, transformer-based models are beneficial. In this paper we systematically compare the impact of adding transformer structures into state-of-the-art segmentation models for overhead imagery. Each model is given a similar budget of free parameters, and their hyperparameters are optimized using Bayesian Optimization with a fixed quantity of data and computation time. We conduct our experiments with a large and diverse dataset comprising two large public benchmarks: Inria and DeepGlobe. We perform additional ablation studies to explore the impact of specific transformer-based modeling choices. Our results suggest that transformers provide consistent, but modest, performance improvements. However, we only observe this advantage in hybrid models that combine convolutional and transformer-based structures; fully transformer-based models achieve relatively poor performance.
1. Introduction
Transformer-based models have become prevalent in computer vision tasks and have achieved state-of-the-art performance in classification [12, 31], object detection [10, 48], and segmentation [7, 3]. This success might ostensibly suggest that transformers are superior to other existing models, such as those based upon convolutional structures; however, this is difficult to conclude from the existing research literature due to the absence of experimental controls when comparing different vision models. The performance of modern vision models, all of which are based upon deep neural networks, is affected by numerous factors that vary widely among competing models used in public benchmarks and in the research literature [34]. These include the quantity and quality of training data, the training algorithm (e.g., optimizer), the training time allotted, and the model's size (i.e., the number of free model parameters). Another more subtle, but highly influential, factor is the computation time and effort invested by the designer in hyperparameter optimization, which can result in misleading performance comparisons [34].

Dataset    Region         Country  Size (km²)
Inria      Austin         USA      81
           Chicago        USA      81
           Kitsap County  USA      81
           West Tyrol     Austria  81
           Vienna         Austria  81
DeepGlobe  Las Vegas      USA      150.2
           Paris          France   41.88
           Shanghai       China    173.32
           Khartoum       Sudan    32.88

Table 1. The regions that compose the Inria and DeepGlobe datasets, and their sizes.
Figure 1. (a), (b), and (c) show the architectures of the Unet, TransUnet, and SwinUnet, respectively. These figures were inspired directly by [40], [7], and [3].

If one or more of the aforementioned factors vary between competing vision models, then it is unclear which factors among them are responsible for any performance differences [38, 24, 20, 34]. Consequently, it is unclear whether the recent success of transformer-based models applied to overhead imagery has been driven by the use of transformers, or by the variety of other factors that vary among the competing models. A major goal of vision research is to uncover the underlying causal factors and design principles that underpin vision systems; this not only advances our understanding of vision systems, but also often leads to substantive performance improvements in those systems. Therefore, an important question in the vision literature is whether, and to what extent, transformers generally benefit vision models. Controlled studies of transformers have been conducted with natural imagery [37], providing some evidence in that context. However, it is unclear whether their success extends to the unique statistics and conditions present in overhead imagery, a major area of vision research. Transformers excel at modeling long-range dependencies which, while generally beneficial in most vision tasks, may be less important in segmentation of overhead imagery, where building information is compact, highly localized, and often isolated from other structures. To our knowledge there has been no systematic study of this question for overhead imagery tasks.
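To make this contrast concrete, the following minimal PyTorch sketch (our illustration; the 14×14 patch grid and embedding size are arbitrary assumptions, and learned projections are omitted for brevity) shows why a single self-attention layer has a global receptive field while a convolution's receptive field is local:

```python
# Contrast between the global receptive field of self-attention and the
# local receptive field of a convolution (illustrative sizes only).
import torch
import torch.nn.functional as F

tokens = torch.randn(1, 196, 64)  # 196 patch embeddings (14x14 grid), dim 64

# Single-head self-attention with identity projections for brevity: the
# (196, 196) score matrix couples every patch to every other patch, so even
# distant structures interact within a single layer.
q = k = v = tokens
scores = q @ k.transpose(-2, -1) / (64 ** 0.5)
attended = F.softmax(scores, dim=-1) @ v

# A 3x3 convolution: each output pixel sees only its immediate neighborhood,
# which matches the compact, localized footprint of buildings.
image = tokens.transpose(1, 2).reshape(1, 64, 14, 14)
conv_out = F.conv2d(image, torch.randn(64, 64, 3, 3), padding=1)
```

Whether that extra global context is worth its cost for compact, often isolated buildings is precisely the question studied here.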
In this work we perform a carefully controlled empirical comparison of three state-of-the-art segmentation models on overhead imagery, where each model utilizes progressively more transformer-based structure. Specifically, we consider the following three models: the Unet [21], the TransUnet [7], and the SwinUnet [3]. This is the first time the TransUnet and SwinUnet have been applied to a large-scale dataset of overhead imagery.¹ Aside from the model variation, we carefully control all other experimental factors, such as the size of the models, their quantity of training data, and their training procedures. We use a large and diverse dataset of overhead imagery, comprising two publicly available benchmarks, to maximize the generality and relevance of our results. To provide a transparent and unbiased hyperparameter optimization procedure, we use Bayesian Optimization (BO) with a fixed budget of iterations to select the hyperparameters of each model, providing each model with approximately 330 hours of optimization time to identify effective hyperparameters (a minimal sketch of this protocol appears at the end of this section). These experimental controls allow us to study whether, and to what degree, transformers are beneficial in the context of overhead imagery. Using our optimized models, we also conduct several additional ablation studies to evaluate the impact of specific design choices in the transformer-based models. We summarize our contributions as follows:
• The first investigation of two recent state-of-the-art segmentation models for processing overhead imagery: the TransUnet [7] and the SwinUnet [3].

• The first controlled evaluation of whether, and to what extent, transformers are beneficial for vision models in overhead imagery.

¹ The recent work in [43] independently and concurrently studied these models, in a complementary setting.
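To make the fixed-budget hyperparameter selection protocol concrete, below is a minimal sketch using scikit-optimize's `gp_minimize`. The search space, the synthetic `train_and_validate` stand-in, and the budget of 30 iterations are illustrative assumptions, not the settings used in this paper.

```python
# A minimal sketch of fixed-budget Bayesian Optimization for hyperparameter
# selection (illustrative assumptions throughout, not the paper's settings).
from skopt import gp_minimize
from skopt.space import Integer, Real

search_space = [
    Real(1e-5, 1e-2, prior="log-uniform", name="learning_rate"),
    Real(0.0, 0.5, name="dropout"),
    Integer(4, 32, name="batch_size"),
]

def train_and_validate(lr, dropout, batch_size):
    # Hypothetical stand-in: in practice this trains the segmentation model
    # under a fixed data and compute budget and returns validation loss.
    return (lr - 1e-3) ** 2 + 0.1 * dropout + 1e-3 * abs(batch_size - 16)

def objective(params):
    lr, dropout, batch_size = params
    return train_and_validate(lr, dropout, batch_size)

# Each competing model receives the same number of BO iterations, so no
# model benefits from extra hand-tuning effort.
result = gp_minimize(objective, search_space, n_calls=30, random_state=0)
print("best hyperparameters:", result.x, "best validation loss:", result.fun)
```

The matching free-parameter budget is straightforward to verify as well; in PyTorch, `sum(p.numel() for p in model.parameters())` gives a model's number of free parameters.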
2. Related Work
Segmentation in overhead imagery. Segmentation of overhead imagery requires complex features to describe a vast domain, as well as pixel-level precision. Initially developed for medical imagery, the Unet [40] has been shown to be a powerful model in overhead image segmentation [18, 16] and in the broader segmentation community, with many variations on the model such as Dense-Unet [29], Res-Unet [44], Unet++ [50], V-Net [33], and Unet3+ [19]. This success is attributed to its autoencoder-like structure, from which it receives its "U" shape and name, combined with skip connections that feed high-resolution spatial information into the last layers of the model.
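As a toy illustration of this structure (our sketch, with arbitrary channel sizes, not the Unet variant evaluated in this paper), a two-level encoder-decoder with a single skip connection can be written as:

```python
# A toy two-level Unet in PyTorch, sketching how a skip connection carries
# high-resolution encoder features past the bottleneck to the decoder.
import torch
import torch.nn as nn

class TinyUnet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 16, 3, padding=1)       # high-res encoder features
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Conv2d(16, 32, 3, padding=1)      # low-res bottleneck
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # The decoder sees upsampled bottleneck features concatenated with
        # the skip connection, restoring spatial detail lost to downsampling.
        self.dec = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, x):
        skip = torch.relu(self.enc(x))                  # (B, 16, H, W)
        bottom = torch.relu(self.mid(self.down(skip)))  # (B, 32, H/2, W/2)
        up = self.up(bottom)                            # (B, 16, H, W)
        return self.dec(torch.cat([up, skip], dim=1))   # per-pixel logits

mask_logits = TinyUnet()(torch.randn(1, 3, 64, 64))    # (1, 1, 64, 64)
```

Real Unets stack several such levels, with multiple convolutions per level, but the concatenation across the bottleneck is the essential mechanism.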
Other models, such as DeepLabv3 [9] and Mask-RCNN [14], have also been used successfully in segmentation of overhead imagery [27, 4]. While these models also perform very well, we chose to evaluate Unet-based architectures due to their high performance and the large number of variant models, which allowed us to more easily compare small changes in model architecture.
Transformers in segmentation. Very recently, transformers have started to be used in the segmentation of overhead imagery [17, 42], achieving good performance. Transformers had already become common in other domains, particularly medical image segmentation, with models such as TransUnet [7], ViT-V-Net [6], TransClaw U-Net [5], UTNet [13], CoTr [45], and SwinUnet [3].
Evaluation of transformers. Transformers are relatively new in computer vision and have only recently become state-of-the-art. As a result, their impact on performance has not been thoroughly analyzed in many domains and applications, including our own. While work has been done evaluating their generalization capabilities with respect to distribution shift [47] and the transferability of their learned representations [49], these are very general results about feature representations derived for other applications. For segmentation of overhead imagery, to our knowledge, there has been no such systematic evaluation.