detector [28] to classify the proposals. Fast R-CNN has since been superseded by
networks that generate their own proposals, such as Faster R-CNN [14], which
combine higher performance with a faster runtime.
Mundhenk et al. 2016 [29] labeled cars in overhead imagery using centerpoints
and trained sliding window classifiers and regression networks to count
them in aerial images. They experimented with object detection using a heatmap
approach with a strided classifier, but they did not compare their performance
to networks trained with bounding box labels.
The work closest to our own is Ribera et al. 2019 [30], which labeled computer
vision datasets using centerpoints, including a dataset of overhead imagery,
and trained a modified U-Net [31] to predict those centerpoints using a loss
based on Hausdorff distance. They showed that their U-Net achieved equal or superior
performance to a Faster R-CNN that predicted bounding boxes, but the Faster
R-CNN used a different feature extractor architecture and was trained by imputing
fixed-size bounding boxes around the centerpoints. They did not address
whether the Faster R-CNN would have performed better if it had been trained
with true bounding boxes tight around the targets or whether the difference in
performance was attributable to the architecture of the feature extractor.
Deshapriya et al. 2021 [32] trained a similar network using Gaussian kernels on a
dataset of buildings and a dataset of coconut trees but did not compare their
results to conventional object detectors.
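The two ways of turning centerpoint labels into training targets mentioned above can be sketched as follows. This is a minimal illustration under our own assumptions, not the procedure of the cited works: the function names, the box size, the value of sigma, and the choice to combine overlapping kernels with a maximum are all hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]               # (x, y) centerpoint in pixels
Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def impute_boxes(points: List[Point], box_size: float,
                 image_width: float, image_height: float) -> List[Box]:
    """Center a square box of side `box_size` on each point, clipped to the image."""
    half = box_size / 2.0
    boxes = []
    for x, y in points:
        boxes.append((
            max(0.0, x - half),
            max(0.0, y - half),
            min(image_width, x + half),
            min(image_height, y + half),
        ))
    return boxes


def gaussian_heatmap(points: List[Point], width: int, height: int,
                     sigma: float) -> List[List[float]]:
    """Render one Gaussian kernel per centerpoint onto a height x width grid.

    Overlapping kernels are combined with a maximum (an assumption), so the
    peak value at every labeled center stays at 1.0.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for cx, cy in points:
        for row in range(height):
            for col in range(width):
                dist_sq = (col - cx) ** 2 + (row - cy) ** 2
                value = math.exp(-dist_sq / (2.0 * sigma ** 2))
                heatmap[row][col] = max(heatmap[row][col], value)
    return heatmap


# Hypothetical labels: the second point sits near the image corner,
# so its imputed box is clipped.
boxes = impute_boxes([(50.0, 40.0), (5.0, 5.0)], box_size=20.0,
                     image_width=100.0, image_height=100.0)
heatmap = gaussian_heatmap([(2.0, 1.0)], width=5, height=3, sigma=1.0)
```

Every imputed box has the same extent regardless of the true object size, which is why a detector trained this way is not equivalent to one trained on tight bounding boxes.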
Object Detectors: Modern object detectors can be classified as single-
stage or multi-stage. Single-stage detectors such as RetinaNet [13] treat object
detection as a regression problem. They use a backbone such as a ResNet [33]
as a feature extractor and then pass these features through a region proposal
network to produce a set of predictions. Each prediction consists of class logits
and offsets to an associated anchor box. The network is trained directly by
comparing these predictions to the ground truth with classification and
regression losses.
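To make the anchor offsets concrete, the sketch below decodes one predicted offset vector into a box. It assumes the center-shift and log-scale parameterization that is common in the R-CNN family; the detectors cited here may scale or clip the offsets differently, and the function name `decode_offsets` is our own.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def decode_offsets(anchor: Box,
                   offsets: Tuple[float, float, float, float]) -> Box:
    """Apply (dx, dy, dw, dh) offsets to an anchor box.

    dx and dy shift the anchor center in units of the anchor width and
    height; dw and dh rescale the width and height on a log scale.
    """
    ax_min, ay_min, ax_max, ay_max = anchor
    anchor_w = ax_max - ax_min
    anchor_h = ay_max - ay_min
    anchor_cx = ax_min + anchor_w / 2.0
    anchor_cy = ay_min + anchor_h / 2.0

    dx, dy, dw, dh = offsets
    cx = anchor_cx + dx * anchor_w
    cy = anchor_cy + dy * anchor_h
    w = anchor_w * math.exp(dw)
    h = anchor_h * math.exp(dh)
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


anchor = (10.0, 10.0, 30.0, 50.0)
unchanged = decode_offsets(anchor, (0.0, 0.0, 0.0, 0.0))  # zero offsets keep the anchor
shifted = decode_offsets(anchor, (0.5, 0.0, 0.0, 0.0))    # moves right by half a width
```

During training the same mapping is inverted to turn each ground-truth box into regression targets for its matched anchor.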
Multi-stage detectors such as Faster R-CNN [14] follow the region proposal
network by subsampling the proposals, then using a ROIAlign operation to crop
features corresponding to the proposals out of the features from the backbone.
These features are then passed to a classifier head, which predicts both the
class logits and a set of corrections to the anchor box offsets. Some multi-stage
networks such as ROI Transformer [34] repeat this process several times, refining
the prediction at each stage. Multi-stage methods tend to perform slightly better
than single-stage methods in public rankings, but they are significantly slower.
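The feature-cropping step can be illustrated with a simplified ROIAlign. This sketch is an assumption-laden reduction of the operation, not the implementation used by any cited detector: it handles a single-channel feature map stored as nested lists, takes one bilinear sample at the center of each output bin, and treats integer coordinates as pixel centers.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def bilinear(feature: List[List[float]], x: float, y: float) -> float:
    """Bilinearly interpolate a 2-D feature map at continuous (x, y)."""
    height, width = len(feature), len(feature[0])
    # Clamp the sample location to the valid range of the map.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = feature[y0][x0] * (1.0 - fx) + feature[y0][x1] * fx
    bottom = feature[y1][x0] * (1.0 - fx) + feature[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy


def roi_align(feature: List[List[float]], box: Box, output_size: int,
              spatial_scale: float = 1.0) -> List[List[float]]:
    """Crop `box` from `feature` into an output_size x output_size grid.

    `spatial_scale` maps image coordinates to feature-map coordinates,
    for example 1/16 for a backbone with stride 16.
    """
    x_min, y_min, x_max, y_max = (c * spatial_scale for c in box)
    bin_w = (x_max - x_min) / output_size
    bin_h = (y_max - y_min) / output_size
    output = []
    for i in range(output_size):
        row = []
        for j in range(output_size):
            # Sample once at the center of bin (i, j), without rounding.
            sample_x = x_min + (j + 0.5) * bin_w
            sample_y = y_min + (i + 0.5) * bin_h
            row.append(bilinear(feature, sample_x, sample_y))
        output.append(row)
    return output


# Toy 4x4 feature map whose value is x + 10 * y.
feature_map = [[float(x + 10 * y) for x in range(4)] for y in range(4)]
crop = roi_align(feature_map, (0.0, 0.0, 2.0, 2.0), output_size=2)
```

Because the sample locations are never rounded to the pixel grid, the crop stays aligned with the proposal, and every proposal yields a fixed-size feature grid for the classifier head.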
All of these detectors make their predictions as bounding boxes. These are
usually horizontal bounding boxes, but variants using object-aligned boxes or
segmentation masks have been created for both single-stage and multi-stage
detectors [35][34][36][37]. Some detectors incorporate a centerpoint prediction,
most famously Duan et al. 2019 [38], but only as a step in predicting a bounding
box. To our knowledge, the only work on detectors that specifically predict
centerpoints is Ribera et al. 2019 [30], but this work cannot be directly compared
to existing bounding box detectors because of the difference in feature extractor
architecture. Regression networks have been trained to generate a heatmap
as part of their processing [29][39][40], but this heatmap is used to generate a