Transformers For Recognition In Overhead Imagery A Reality Check Francesco Luzi Duke University

2025-05-06 0 0 3.39MB 10 页 10玖币

侵权投诉

Transformers For Recognition In Overhead Imagery: A Reality Check

Francesco Luzi

Duke University

francesco.luzi@duke.edu

Aneesh Gupta

Duke University

aneeshgupta8@gmail.com

Leslie Collins

Duke University

leslie.collins@duke.edu

Kyle Bradbury

Duke University

kyle.bradbury@duke.edu

Jordan Malof

University of Montana

jordan.malof@umontana.edu

Abstract

There is evidence that transformers offer state-of-the-

art recognition performance on tasks involving overhead

imagery (e.g., satellite imagery). However, it is difﬁcult

to make unbiased empirical comparisons between com-

peting deep learning models, making it unclear whether,

and to what extent, transformer-based models are beneﬁ-

cial. In this paper we systematically compare the impact of

adding transformer structures into state-of-the-art segmen-

tation models for overhead imagery. Each model is given a

similar budget of free parameters, and their hyperparame-

ters are optimized using Bayesian Optimization with a ﬁxed

quantity of data and computation time. We conduct our ex-

periments with a large and diverse dataset comprising two

large public benchmarks: Inria and DeepGlobe. We per-

form additional ablation studies to explore the impact of

speciﬁc transformer-based modeling choices. Our results

suggest that transformers provide consistent, but modest,

performance improvements. We only observe this advan-

tage however in hybrid models that combine convolutional

and transformer-based structures, while fully transformer-

based models achieve relatively poor performance.

1. Introduction

Transformer-based models have become prevalent in

computer vision tasks and have achieved state-of-the-art

performance in classiﬁcation [12, 31], object detection

[10, 48], and segmentation [7, 3]. This success might osten-

sibly suggest that transformers are superior to other existing

models, such as those based upon convolutional structures,

however this is difﬁcult to conclude based upon the exist-

ing research literature due to the absence of experimental

controls when comparing different vision models. The per-

formance of modern vision models - all of which are based

upon deep neural networks - are affected by numerous fac-

Dataset Region Country Size (km2)

Austin USA 81

Chicago USA 81

Inria Kitsap County USA 81

West Tyrol Austria 81

Vienna Austria 81

Las Vegas USA 150.2

DeepGlobe Paris France 41.88

Shanghai China 173.32

Khartoum Sudan 32.88

Table 1. The cities and their size that compose the Inria and Deep-

Globe datasets.

tors that vary widely among competing models used in pub-

lic benchmarks, and in the research literature [34]. This in-

cludes factors such as the quantity and quality of training

data, the training algorithm (e.g., optimizer), the training

time allotted, and the model’s size (i.e., the number of free

model parameters). Another more subtle, but highly inﬂu-

ential factor, is the computation time and effort invested by

the designer on hyperparameter optimization, which can re-

sult in misleading performance comparisons [34].

If one or more of the aforementioned factors vary be-

tween competing vision models, then it is unclear which

factors among them are responsible for any performance

differences [38, 24, 20, 34]. Consequently, it is unclear

whether the recent success of transformer-based models ap-

plied to overhead imagery has been driven by the use of

transformers, or the variety of other factors that vary among

the competing models. A major goal of vision research is

to uncover the underlying causal factors and design princi-

ples that underpin vision systems; this not only advances

our understanding of vision systems, but also often leads

to substantive performance improvements in such systems.

Therefore an important question in the vision literature is

whether, and to what extent, transformers generally bene-

ﬁt vision models. Controlled studies of transformers have

been conducted with natural imagery [37], providing some

arXiv:2210.12599v2 [cs.CV] 31 Oct 2022

Figure 1. (a), (b), and (c) show the architecture for the Unet, TransUnet, and SwinUnet respectively. These ﬁgures were inspired directly

by [40, 7, 3], respectively

evidence in that context. However, it is unclear whether

their success extends to the unique statistics and condi-

tions present in overhead imagery, a major area of vision

research. Transformers excel at modeling long range de-

pendencies, which while generally beneﬁcial in most vision

tasks, may not be as important in segmentation of overhead

imagery where building information is compact, highly lo-

calized, and many times isolated from other structures. To

our knowledge there has been no systematic study of this

question for overhead imagery tasks.

In this work we perform a carefully-controlled empiri-

cal comparison of three state-of-the-art segmentation mod-

els using overhead imagery, where each model utilizes

progressively more transformer-based structures. Speciﬁ-

cally, we consider the following three models: U-Net [21],

TransUNet[7], and SwinUnet[3]. This is the ﬁrst time the

TransUnet and SwinUnet have been applied to a large-scale

dataset of overhead imagery 1. Aside from the model varia-

tion we carefully control all other experimental factors, such

as the size of the models, their quantity of training data,

and training procedures. We use a large and diverse dataset

of overhead imagery, comprising two publicly-available

benchmarks to maximize the generality and relevance of

our results. To provide a transparent and unbiased hyper-

parameter optimization procedure, we use Bayesian Opti-

mization (BO) with a ﬁxed budget of iterations to select the

hyperparameters of each model. We provide each model

with approximately 330 hours of optimization time in or-

der to identify effective hyperparameters for each model.

These experimental controls allow us to study whether, and

to what degree, transformers are beneﬁcial in the context

of overhead imagery. Using our optimized models, we also

conduct several additional ablation studies to evaluate the

impact of speciﬁc design choices in the transformer-based

models. We can summarize our contributions as follows:

• The ﬁrst investigation of two recent state-of-the-art

segmentation models for processing overhead im-

agery: the transUnet [7], and the swinUnet [3].

1The recent work in [43] independently and concurrently studied these

models, in a complementary setting

• The ﬁrst controlled evaluation of whether, and to what

extent, transformers are beneﬁcial for vision models in

overhead imagery.

2. Related Work

Segmentation in overhead imagery. Segmentation of

overhead imagery requires complex features to describe the

vast domain as well as pixel level precision. Initially devel-

oped for medical imagery, Unet [40] has been show to be a

powerful model in overhead image segmentation [18, 16]

and in the broader segmentation community, with many

variations on the model such as Dense-Unet [29], Res-Unet

[44], Unet++ [50], V-Net[33], and Unet3+ [19]. This is at-

tributed to the auto encoder-like structure where it receives

its ”U” shape and name combined with the skip connec-

tions, feeding high resolution spatial information into the

last layers of the model.

Other models such as DeepLabv3 [9] and Mask-RCNN

[14] have been used successfully in segmentation of over-

head imagery [27, 4]. While these models also perform

very well, we chose to evaluate Unet-based architectures

due to the high performance and large number of variant

models. This high number of variants allowed us to more

easily compare small changes in the model architecture.

Transformers in segmentation. Very recently trans-

formers have started to be used in that segmentation of over-

head imagery [17, 42], achieving good performance. Trans-

formers had already started to become common in other do-

mains such as TransUnet [7], ViT-V-Net [6], TransClaw U-

Net [5], UTNet [13], Cotr [45], and SwinUnet [3] in medi-

cal image segmentation.

Evaluation of transformers. Transformers are rather

new in computer vision and have only recently become

state-of-the-art. As a result, their impact on performance

has not been thoroughly analyzed in many domains and ap-

plications, including our own. While work has been done in

evaluating their generalization capabilities in respect to dis-

tribution shift [47] and how transferable their learned repre-

sentations are [49], these are very general results about the

feature representations derived for other applications. For

segmentation of overhead imagery, to our knowledge, there

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

TransformersForRecognitionInOverheadImagery:ARealityCheckFrancescoLuziDukeUniversityfrancesco.luzi@duke.eduAneeshGuptaDukeUniversityaneeshgupta8@gmail.comLeslieCollinsDukeUniversityleslie.collins@duke.eduKyleBradburyDukeUniversitykyle.bradbury@duke.eduJordanMalofUniversityofMontanajordan.malof@umont...

展开>> 收起<<

Transformers For Recognition In Overhead Imagery A Reality Check Francesco Luzi Duke University.pdf

共10页,预览2页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Transformers For Recognition In Overhead Imagery A Reality Check Francesco Luzi Duke University

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: