Centerpoints Are All You Need in Overhead Imagery
James Inder1, Mark Lowell*†2, and A.J. Maltenfort†2
1Booz Allen Hamilton
2National Geospatial-Intelligence Agency
*Corresponding author: Mark.C.Lowell@nga.mil
†Equal contribution
October 6, 2022
Abstract
Labeling data to use for training object detectors is expensive and time-consuming. Publicly available overhead datasets for object detection are labeled with image-aligned bounding boxes, object-aligned bounding boxes, or object masks, but it is not clear whether such detailed labeling is necessary. To test the idea, we developed novel single- and two-stage network architectures that use centerpoints for labeling. In this paper we show that these architectures achieve nearly equivalent performance to approaches using more detailed labeling on three overhead object detection datasets.
1 Introduction
Every day, observation satellites capture terabytes of imagery of the Earth's surface that feed into a wide variety of civil and military applications. This stream of data has grown so large that only automated methods can feasibly analyze it. One critical component of remote sensing analysis is object detection: locating objects of interest on the Earth's surface in overhead imagery. Automated object detection algorithms have advanced by leaps and bounds over the last decade, but they still require vast amounts of labeled data for training, which is expensive and tedious to produce. Any technique that can reduce the resources needed to label objects in overhead imagery is therefore desirable.
Most existing datasets for training overhead object detectors are labeled with horizontal bounding boxes [1][2][3][4][5], object-aligned bounding boxes [6][7][8][9][10], or segmentation masks [11][12]. These methods of labeling appear to have been inherited from work on natural images, primarily cell phone pictures. Unlike objects in cell phone pictures, objects in overhead images are only seen in a narrow range of viewpoints and scales. This paper examines whether the extra work required to create such detailed labels is worthwhile in terms of the resulting detector performance.
In this paper, we show that centerpoints alone are sufficient for training overhead object detectors for most targets in overhead imagery, and that they require significantly less time and work by labelers than image-aligned or object-aligned bounding boxes. We designed single- and two-stage object detection architectures for centerpoints based on RetinaNet [13] and Faster Region-Based Convolutional Neural Network (Faster R-CNN) [14]. We compare the performance of our Centerpoint RetinaNet and Centerpoint R-CNN against RetinaNet and Faster R-CNN trained with horizontal and object-aligned bounding boxes on a variety of overhead datasets, and show that our centerpoint detectors match or exceed the performance of bounding box detectors.
In Section 2, we review past work on object detection, focusing on overhead imagery. In Section 3, we describe our centerpoint architectures and our methods for evaluating detectors for centerpoints, horizontal bounding boxes, and object-aligned bounding boxes on a common basis. In Section 4, we present the results of our experiments using each detector on a variety of overhead datasets. In Section 5, we conclude by discussing the implications of our results for further work in object detection in overhead imagery.
2 Related Work
Labeling Methods in Overhead Imagery Datasets: A survey of overhead object detection datasets shows that most use horizontal bounding boxes [1][2][3][4][5], object-aligned bounding boxes [6][7][8][9][10], or segmentation masks [11][12]. Although the cost and difficulty of labeling large object detection datasets are generally acknowledged, regardless of the domain, we are unaware of any published systematic studies of the costs and benefits of different labeling approaches. Published efforts to reduce labeling costs for overhead imagery have instead focused on the use of synthetic data [15][16][17][18]. Outside of the overhead domain specifically, approaches include active learning [19][20], weak supervision [21], few-shot learning [22], zero-shot learning [23], and semi-supervised learning [24][25]. However, networks trained solely on synthetic imagery struggle to match the performance of networks trained with real, fully annotated data, and the other approaches all require at least some human annotation.
A small number of works have examined the use of point annotations. Papadopoulos et al. 2017 [26] used Amazon Mechanical Turk to relabel the PASCAL VOC object detection dataset with centerpoints and then used those centerpoints to train object detectors. They found that nearly equivalent accuracy could be obtained at substantially lower labeling cost. However, instead of training a detector to predict centerpoints, they used the Edge Boxes algorithm [27] to propose bounding boxes for the centerpoints and trained a Fast R-CNN detector [28] to classify the proposals. Fast R-CNN is now obsolete compared to networks that generate their own proposals, such as Faster R-CNN [14], which combine higher performance with a faster runtime.
Mundhenk et al. 2016 [29] labeled cars in overhead imagery using centerpoints and trained sliding-window classifiers and regression networks to count them in aerial images. They experimented with object detection using a heatmap approach with a strided classifier, but they did not compare their performance to networks trained with bounding box labels.
The work closest to our own is Ribera et al. 2019 [30], which labeled computer vision datasets, including a dataset of overhead imagery, using centerpoints, and trained a modified U-Net [31] to predict those centerpoints using a Hausdorff distance loss. They showed that their U-Net achieved equal or superior performance to a Faster R-CNN that predicted bounding boxes, but the Faster R-CNN used a different feature extractor architecture and was trained by imputing fixed-size bounding boxes to the centerpoints. They did not address whether the Faster R-CNN would have performed better if it had been trained with true bounding boxes tight around the targets, or whether the difference in performance was attributable to the architecture of the feature extractor. Deshapriya et al. 2021 [32] trained a similar network using Gaussian kernels on a dataset of buildings and a dataset of coconut trees, but did not compare their results to conventional object detectors.
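
To make the heatmap formulation concrete, the sketch below shows one common way to render centerpoint labels as dense training targets with Gaussian kernels, in the spirit of the approach of Deshapriya et al. [32]. It is a minimal illustration under our own assumptions: the function name and the fixed kernel width sigma are ours, and published methods typically scale the kernel to the expected object size.

import numpy as np

def centerpoint_heatmap(points, height, width, sigma=4.0):
    # Render centerpoint labels as a Gaussian heatmap target.
    # points: iterable of (row, col) centerpoints in pixel coordinates.
    # sigma: kernel width in pixels (an assumed value, not taken from [32]).
    heatmap = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for r, c in points:
        g = np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2.0 * sigma ** 2))
        # Pixelwise max keeps nearby objects as distinct peaks rather than
        # summing them into a single blob.
        heatmap = np.maximum(heatmap, g)
    return heatmap

# Example: two vehicles labeled only by their centers in a 128 x 128 chip.
target = centerpoint_heatmap([(40, 52), (47, 60)], 128, 128)

A network regressed against such targets can recover detections at inference time by thresholding the local maxima of the predicted map.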
Object Detectors: Modern object detectors can be classified as single-stage or multi-stage. Single-stage detectors such as RetinaNet [13] treat object detection as a regression problem. They use a backbone such as a ResNet [33] as a feature extractor and then pass these features through a region proposal network to produce a set of predictions. Each prediction consists of class logits and offsets to an associated anchor box. These predictions are then compared to the ground truth and trained directly using a regression loss.
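
For concreteness, the usual anchor-offset parameterization from the R-CNN family, which RetinaNet also uses, can be sketched as follows. This is a minimal PyTorch rendering under our own naming, not code from any of the cited detectors.

import torch

def decode_boxes(anchors, deltas):
    # anchors: (N, 4) tensor of anchor boxes as (cx, cy, w, h).
    # deltas: (N, 4) tensor of predicted offsets (tx, ty, tw, th).
    # Centers shift in units of the anchor size; sizes scale exponentially.
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * torch.exp(deltas[:, 2])
    h = anchors[:, 3] * torch.exp(deltas[:, 3])
    return torch.stack([cx, cy, w, h], dim=1)

During training, the inverse transform encodes each ground-truth box against its matched anchor, and the regression loss is applied to the resulting offsets.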
Multi-stage detectors such as Faster R-CNN [14] follow the region proposal network by subsampling the proposals, then using a ROIAlign operation to crop features corresponding to the proposals out of the features from the backbone. These features are then passed to a classifier head, which predicts both the class logits and a set of corrections to the anchor box offsets. Some multi-stage networks such as ROI Transformer [34] repeat this process several times, refining the prediction at each stage. Multi-stage methods tend to perform slightly better than single-stage methods in public rankings, but they are significantly slower.
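
The cropping step can be sketched with torchvision's roi_align operation. The feature shapes and the 1/8 scale below are assumed values for illustration, not the configuration of any network discussed here.

import torch
from torchvision.ops import roi_align

# Backbone features for one image: (batch, channels, height, width).
features = torch.randn(1, 256, 64, 64)

# Two proposals in image coordinates (x1, y1, x2, y2), assuming a
# 512 x 512 input image, so the feature map is at 1/8 resolution.
proposals = [torch.tensor([[32.0, 48.0, 96.0, 112.0],
                           [200.0, 180.0, 260.0, 240.0]])]

# Crop a fixed 7 x 7 feature patch per proposal for the classifier head.
crops = roi_align(features, proposals, output_size=(7, 7),
                  spatial_scale=1.0 / 8.0, sampling_ratio=2)
print(crops.shape)  # torch.Size([2, 256, 7, 7])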
All of these detectors make their predictions as bounding boxes. These are usually horizontal bounding boxes, but variants using object-aligned boxes or segmentation masks have been created for both single-stage and multi-stage detectors [35][34][36][37]. Some detectors incorporate a centerpoint prediction, most famously Duan et al. 2019 [38], but only as a step in predicting a bounding box. The only work that we are aware of on detectors that specifically predict centerpoints is Ribera et al. 2019 [30], but this work cannot be directly compared to existing bounding box detectors because of the difference in feature extractor architecture. Regression networks have been trained to generate a heatmap as part of their processing [29][39][40], but this heatmap is used to generate a count rather than to detect individual objects.