Centerpoints Are All You Need in Overhead Imagery

James Inder1, Mark Lowell∗†2, and A.J. Maltenfort†2

1Booz Allen Hamilton

2National Geospatial-Intelligence Agency

October 6, 2022

Abstract

Labeling data to use for training object detectors is expensive and time-consuming. Publicly available overhead datasets for object detection are labeled with image-aligned bounding boxes, object-aligned bounding boxes, or object masks, but it is not clear whether such detailed labeling is necessary. To test whether it is, we developed novel single- and two-stage network architectures that use centerpoints for labeling. In this paper we show that these architectures achieve nearly equivalent performance to approaches using more detailed labeling on three overhead object detection datasets.

1 Introduction

Every day, observation satellites capture terabytes of imagery of the Earth’s surface that feed into a wide variety of civil and military applications. This stream of data has grown so large that only automated methods can feasibly analyze it. One critical component of remote sensing analysis is object detection: locating objects of interest on the Earth’s surface in overhead imagery. Automated object detection algorithms have advanced by leaps and bounds over the last decade, but they still require vast amounts of labeled data for training, which is expensive and tedious to produce. Any technique that can reduce the resources needed to label objects in overhead imagery is therefore desirable.

Most existing datasets for training overhead object detectors are labeled with horizontal bounding boxes [1][2][3][4][5], object-aligned bounding boxes [6][7][8][9][10], or segmentation masks [11][12]. These methods of labeling appear to have been inherited from work on natural images, primarily cell phone pictures. Unlike objects in cell phone pictures, objects in overhead images are only seen in a narrow range of viewpoints and scales. This paper examines whether the extra work required to create such detailed labels is worthwhile in terms of the resulting detector performance.

∗Corresponding author: Mark.C.Lowell@nga.mil

†Equal contribution

arXiv:2210.01857v1 [cs.CV] 4 Oct 2022

In this paper, we show that centerpoints alone are sufficient for training overhead object detectors for most targets in overhead imagery and that they require significantly less time and work by labelers than image-aligned or object-aligned bounding boxes. We designed single- and two-stage object detection architectures for centerpoints based on RetinaNet [13] and Faster Region-Based Convolutional Neural Network (Faster R-CNN) [14]. We compare the performance of our Centerpoint RetinaNet and Centerpoint R-CNN against RetinaNet and Faster R-CNN trained with horizontal and object-aligned bounding boxes on a variety of overhead datasets, and show that our centerpoint detectors match or exceed the performance of bounding box detectors.
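To make the relationship between the label types concrete, the following minimal sketch derives a centerpoint from a horizontal bounding box and from an object-aligned bounding box given as four corners. The function names and the corner-list representation are illustrative assumptions; this is not the conversion procedure used in our experiments, which is described in Section 3.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def center_of_horizontal_box(x_min: float, y_min: float,
                             x_max: float, y_max: float) -> Point:
    """Midpoint of an image-aligned (horizontal) bounding box."""
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def center_of_aligned_box(corners: Sequence[Point]) -> Point:
    """Center of an object-aligned box given as its four corners.

    For a rectangle, the mean of the four corners is its center,
    regardless of how the rectangle is rotated.
    """
    if len(corners) != 4:
        raise ValueError("expected exactly four corners")
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (sum(xs) / 4.0, sum(ys) / 4.0)


# Hypothetical labels, for illustration only.
print(center_of_horizontal_box(10, 20, 30, 60))
print(center_of_aligned_box([(0, 1), (1, 0), (2, 1), (1, 2)]))
```

Either label type reduces to a single point, so existing box-labeled datasets can supply centerpoint labels without relabeling.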

In Section 2, we review past work on object detection, focusing on overhead imagery. In Section 3, we describe our centerpoint architectures and our methods for evaluating detectors for centerpoints, horizontal bounding boxes, and object-aligned bounding boxes on a common basis. In Section 4, we present the results of our experiments using each detector on a variety of overhead datasets. In Section 5, we conclude by discussing the implications of our results for further work in object detection in overhead imagery.

2 Related Work

Labeling Methods in Overhead Imagery Datasets: A survey of overhead object detection datasets shows that most use horizontal bounding boxes [1][2][3][4][5], object-aligned bounding boxes [6][7][8][9][10], or segmentation masks [11][12]. Although the cost and difficulty of labeling large object detection datasets are generally acknowledged, regardless of the domain, we are unaware of any published systematic studies of the costs and benefits of different labeling approaches. Published efforts to reduce labeling costs for overhead imagery have instead focused on the use of synthetic data [15][16][17][18]. Outside of the overhead domain specifically, approaches include active learning [19][20], weak supervision [21], few-shot learning [22], zero-shot learning [23], and semi-supervised learning [24][25]. However, networks trained solely on synthetic imagery struggle to match the performance of networks trained with real, fully annotated data, and the other approaches all require at least some human annotation.

A small number of works have examined the use of point annotations. Papadopoulos et al. 2017 [26] used Amazon Mechanical Turk to relabel the PASCAL VOC object detection dataset with centerpoints and then used those centerpoints to train object detectors. They found that nearly equivalent accuracy could be obtained at substantially lower labeling cost. However, instead of training a detector to predict centerpoints, they used the Edge-Boxes algorithm [27] to propose bounding boxes for the centerpoints and trained a Fast R-CNN detector [28] to classify the proposals. Fast R-CNN is now obsolete compared to networks that generate their own proposals, such as Faster R-CNN [14], which combine higher performance with a faster runtime.

Mundhenk et al. 2016 [29] labeled cars in overhead imagery using centerpoints and trained sliding window classifiers and regression networks to count them in aerial images. They experimented with object detection using a heatmap approach with a strided classifier, but they did not compare their performance to networks trained with bounding box labels.

The work closest to our own is Ribera et al. 2019 [30], which labeled computer vision datasets using centerpoints, including a dataset of overhead imagery, and trained a modified U-Net [31] to predict those centerpoints using Hausdorff distance. They showed that their U-Net achieved equal or superior performance to a Faster R-CNN that predicted bounding boxes, but the Faster R-CNN used a different feature extractor architecture and was trained by imputing fixed-size bounding boxes to the centerpoints. They did not address whether the Faster R-CNN would have performed better if it had been trained with true bounding boxes tight around the targets or whether the difference in performance was attributable to the architecture of the feature extractor. Deshapriya et al. 2021 [32] trained a similar network using Gaussian kernels on a dataset of buildings and a dataset of coconut trees but did not compare their results to conventional object detectors.
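For readers unfamiliar with the Hausdorff distance, the sketch below computes the textbook symmetric Hausdorff distance between a set of predicted centerpoints and a set of ground-truth centerpoints. It is an illustration of the underlying metric only: the training loss in [30] may differ, since a loss must be differentiable with respect to the network output, and the function names and sample points here are assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def directed_hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Largest distance from any point in `a` to its nearest point in `b`.

    Both sets are assumed to be non-empty.
    """
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance: the worse of the two directions."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical centerpoints: two true objects, three predictions.
truth = [(0.0, 0.0), (10.0, 0.0)]
predicted = [(0.0, 1.0), (10.0, 0.0), (20.0, 0.0)]

print(directed_hausdorff(truth, predicted))  # every true object has a nearby prediction
print(hausdorff(truth, predicted))           # the spurious prediction dominates
```

The directed distance from truth to predictions is small when every object is found, while the symmetric distance also penalizes predictions that lie far from any object.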

Object Detectors: Modern object detectors can be classified as single-stage or multi-stage. Single-stage detectors such as RetinaNet [13] treat object detection as a regression problem. They use a backbone such as a ResNet [33] as a feature extractor and then pass these features through a region proposal network to produce a set of predictions. Each prediction consists of class logits and offsets to an associated anchor box. These predictions are then compared to the ground truth and trained directly using a regression loss.
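As an illustration of what "offsets to an associated anchor box" means, the sketch below decodes a predicted offset vector into a box using the parameterization common in the R-CNN family: center shifts scaled by the anchor size, and log-space width and height changes. This is an assumed, simplified form; individual detectors may additionally normalize or clip the offsets, and the function name is hypothetical.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]


def decode_anchor(anchor: Box, offsets: Box) -> Box:
    """Apply predicted offsets to an anchor box.

    anchor:  (center_x, center_y, width, height)
    offsets: (dx, dy, dw, dh), where dx and dy are in units of the
             anchor's width and height and dw, dh are log scale factors.
    Returns the decoded box as (x_min, y_min, x_max, y_max).
    """
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh = offsets
    cx = xa + dx * wa
    cy = ya + dy * ha
    w = wa * math.exp(dw)
    h = ha * math.exp(dh)
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


# Zero offsets reproduce the anchor itself.
print(decode_anchor((50.0, 50.0, 20.0, 10.0), (0.0, 0.0, 0.0, 0.0)))
# Shift half an anchor width right and one anchor height up.
print(decode_anchor((50.0, 50.0, 20.0, 10.0), (0.5, -1.0, 0.0, 0.0)))
```

Note that the first two offsets already describe the displacement of the box center from the anchor center, which is the only part of the prediction a centerpoint detector needs.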

Multi-stage detectors such as Faster R-CNN [14] follow the region proposal network by subsampling the proposals, then using a ROIAlign operation to crop features corresponding to the proposals out of the features from the backbone. These features are then passed to a classifier head, which predicts both the class logits and a set of corrections to the anchor box offsets. Some multi-stage networks such as ROI Transformer [34] repeat this process several times, refining the prediction at each stage. Multi-stage methods tend to perform slightly better than single-stage methods in public rankings, but they are significantly slower.
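The cropping step can be sketched as follows. This is a deliberately simplified, single-channel version of the ROIAlign idea: the proposal is divided into a fixed grid of bins and the feature map is sampled by bilinear interpolation at the center of each bin. Production implementations typically average several samples per bin and handle coordinate conventions more carefully; the function names and the convention that the value at index k sits at coordinate k are assumptions made for this sketch.

```python
import math
from typing import List, Sequence, Tuple


def bilinear_sample(feature: Sequence[Sequence[float]], y: float, x: float) -> float:
    """Bilinearly interpolate a 2-D feature map at a fractional location."""
    height, width = len(feature), len(feature[0])
    # Clamp the sample location to the valid range of the map.
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature[y0][x0] + wx * feature[y0][x1]
    bottom = (1 - wx) * feature[y1][x0] + wx * feature[y1][x1]
    return (1 - wy) * top + wy * bottom


def roi_align_simple(feature: Sequence[Sequence[float]],
                     box: Tuple[float, float, float, float],
                     out_size: int = 2) -> List[List[float]]:
    """Crop `box` = (x_min, y_min, x_max, y_max) to an out_size x out_size grid,
    taking one bilinear sample at the center of each bin."""
    x_min, y_min, x_max, y_max = box
    bin_w = (x_max - x_min) / out_size
    bin_h = (y_max - y_min) / out_size
    return [
        [bilinear_sample(feature, y_min + (i + 0.5) * bin_h, x_min + (j + 0.5) * bin_w)
         for j in range(out_size)]
        for i in range(out_size)
    ]


# A 4x4 ramp feature map: value = x + 10 * y.
ramp = [[x + 10 * y for x in range(4)] for y in range(4)]
crop = roi_align_simple(ramp, (0.0, 0.0, 2.0, 2.0))
print(crop)
```

Because the output grid has a fixed size whatever the size of the proposal, the classifier head can process every proposal with the same layers.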

All of these detectors make their predictions as bounding boxes. These are usually horizontal bounding boxes, but variants using object-aligned boxes or segmentation masks have been created for both single-stage and multi-stage detectors [35][34][36][37]. Some detectors incorporate a centerpoint prediction, most famously Duan et al. 2019 [38], but only as a step in predicting a bounding box. The only work that we are aware of on detectors that specifically predict centerpoints is Ribera et al. 2019 [30], but this work cannot be directly compared to existing bounding box detectors because of the difference in feature extractor architecture. Regression networks have been trained to generate a heatmap as part of their processing [29][39][40], but this heatmap is used to generate a
