
[Fig. 2: Overview of our object tracking and 3D localization system. Module labels in the figure: Visual-Inertial Odometry, Trajectory Prediction, Visual Object Tracker, ROI Feature Tracking, Single Image Depth, Ground Plane Mask, Robust Plane Fitting, Temporal Fusion, Object Localization.]
II. RELATED WORK
Recently, in generic visual object tracking, several
learning-based efforts have been made to address the problem
of target switching [3,4]. In Multi-Object Tracking [5,6],
this problem is formulated as a data association problem
which is commonly addressed using a constant-velocity
Kalman filter and the Hungarian algorithm. However, this
Kalman filter operates in image space, where optical flow
is non-linear, and it assumes a static camera. Camera motion
compensation is performed in [6] based on image registration and
in [7] based on homography warping, which assumes the
homography is estimated from the object's ground plane. In
[8], the target's 3D trajectory is modeled in a SLAM factor graph,
but this relies on a stereo camera. In [9,10], trajectory models
are learned for human motion using LSTMs.
Object 3D localization from a UAV has been addressed
using GPS receivers [11], a laser range finder [12], georeferenced
topographic maps [13], and a flat-earth assumption [14].
There has been extensive work [15–17] using ground plane
estimation for 3D object localization. The most similar to
our work is [15], which uses depth estimates from Visual
Odometry and a barometer to estimate the plane normals
and height, but this also assumes the scene is planar. In
terms of monocular object 3D localization from the ground,
[18] proposes to estimate 3D car poses by combining 2D
bounding boxes, orientation regression, and the object dimensions.
Single-image depth networks [19,20] have demonstrated
compelling results on several datasets (e.g., KITTI). In this
work, we investigate how these models generalize
to aerial downward-looking cameras.
III. SYSTEM OVERVIEW
Our system pipeline is shown in Fig. 2. Firstly, a Discrim-
inative Correlation Filter (DCF) tracker is initialized as usual
with a bounding box on an initial frame. The bounding box is
then used to initialize the ROI Feature Tracking by detecting
Harris corners within a Region of Interest (ROI) that surrounds,
but excludes, the bounding box. The
ROI is then shifted as the bounding box moves through
tracking. Using this dedicated feature tracking module allows us
to maintain a dense distribution of features around the object,
without adding overhead to the VIO.
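A minimal sketch of how this ROI feature initialization could look, using OpenCV's Harris-based corner detection restricted to a ring-shaped mask around the bounding box; the function name, margin, and detector parameters are illustrative assumptions, not the exact implementation:

import cv2
import numpy as np

def init_roi_tracks(gray, bbox, margin=0.5, max_corners=200):
    """Detect Harris corners in a ring ROI around (but excluding) the bbox.

    gray   : single-channel image
    bbox   : (x, y, w, h) target bounding box
    margin : ROI extends by margin * bbox size on each side (illustrative)
    """
    x, y, w, h = bbox
    dx, dy = int(margin * w), int(margin * h)
    H, W = gray.shape
    # Outer ROI rectangle, clipped to the image bounds.
    x0, y0 = max(0, x - dx), max(0, y - dy)
    x1, y1 = min(W, x + w + dx), min(H, y + h + dy)

    mask = np.zeros_like(gray, dtype=np.uint8)
    mask[y0:y1, x0:x1] = 255
    mask[y:y + h, x:x + w] = 0  # exclude the target box itself

    # Harris corners restricted to the ring-shaped ROI mask.
    corners = cv2.goodFeaturesToTrack(
        gray, maxCorners=max_corners, qualityLevel=0.01,
        minDistance=7, mask=mask, useHarrisDetector=True, k=0.04)
    return corners  # Nx1x2 array of (x, y) points, or None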
For every frame, depth is estimated and refined for the ROI
tracks given the camera poses from VIO. These tracks can
be backprojected to a point cloud, to which a plane can be fit. However,
since not all tracks are from the object’s ground plane, we
first select tracks based on our ground plane segmentation
which relies on a single image depth model to provide dense
depth for both the target and the ROI. However, since this is
only relative depth, we use the ROI feature depth estimates
to effectively scale it.
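One simple way to perform this scaling is sketched below, assuming the network outputs (up-to-scale) depth rather than inverse depth: a robust median ratio between the triangulated ROI track depths and the network's depth at the same pixels. This is an illustration, not necessarily the exact scheme used:

import numpy as np

def scale_relative_depth(rel_depth, track_uv, track_depth):
    """Scale a relative (up-to-scale) depth map to metric depth.

    rel_depth   : HxW relative depth from a single-image network
    track_uv    : Nx2 integer pixel coordinates of ROI feature tracks
    track_depth : N metric depths of those tracks (VIO-posed triangulation)
    """
    u, v = track_uv[:, 0], track_uv[:, 1]
    rel_at_tracks = rel_depth[v, u]
    valid = (rel_at_tracks > 1e-6) & (track_depth > 1e-6)
    # Robust global scale from the median ratio at the sparse tracks.
    scale = np.median(track_depth[valid] / rel_at_tracks[valid])
    return scale * rel_depth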
Given the resulting ground plane mask, the selected ROI
tracks with 3D coordinates are used in a RANSAC multi-
plane fitting routine. Since the ground plane segmentation
can fail and the ROI features may not be enough, we use a
temporally-fused plane model, which aggregates the inlier
points from the last RANSAC plane fitting in a buffer
together with inliers from past frames. The temporal fusion
also includes a gating strategy to enforce temporal consis-
tency. The aggregated points are used in a RANSAC multi-
plane fitting loop once again to estimate the final plane. Then,
given the target image coordinates from the DCF tracker and
the camera pose, we can raycast the 3D location. This is then
used to update the trajectory model, whose predictions for the
next frames are used to guide the DCF Tracker, as described
in the next section. The remaining sections provide more
details for each module of our pipeline.
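The sketch below illustrates the plane-fitting part of this pipeline with a single-plane RANSAC routine and a temporal buffer of inliers gated by a normal-angle test. It is a simplification of the multi-plane fitting described above, and all thresholds, buffer sizes, and the gating criterion are illustrative assumptions:

import numpy as np
from collections import deque

def ransac_plane(points, iters=200, thresh=0.05):
    """Fit one plane (n, d), with n.p + d = 0, to Nx3 points by RANSAC."""
    if len(points) < 3:
        return None, np.zeros(len(points), dtype=bool)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_model = None
    rng = np.random.default_rng(0)
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue
        n /= norm
        d = -n @ sample[0]
        inliers = np.abs(points @ n + d) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (n, d)
    return best_model, best_inliers

class TemporalPlaneFusion:
    """Buffer recent RANSAC inliers and refit on the aggregate each frame."""
    def __init__(self, max_frames=10, max_normal_angle_deg=15.0):
        self.buffer = deque(maxlen=max_frames)
        self.max_angle = np.deg2rad(max_normal_angle_deg)
        self.plane = None

    def update(self, points):
        model, inliers = ransac_plane(points)
        if model is not None:
            n, _ = model
            # Gating: reject fits whose normal deviates too much
            # from the currently fused plane.
            if self.plane is None or np.arccos(
                    np.clip(abs(n @ self.plane[0]), 0.0, 1.0)) < self.max_angle:
                self.buffer.append(points[inliers])
        if self.buffer:
            fused_pts = np.concatenate(self.buffer, axis=0)
            self.plane, _ = ransac_plane(fused_pts)
        return self.plane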
IV. TRACKING WITH TRAJECTORY ESTIMATES
Visual object trackers generally output a 2D score map
(shown in Fig. 1) that maps to locations in an image
search window around the previous target location. Then, the
location with the highest score is simply selected as the new
target location.
Instead, we first center the search window around the
location predicted by our trajectory model, which is projected
to the current image using the camera pose. We then perform
peak selection on the score map: first we normalize the score
map with a softmax function; then, using Non-Maximum
Suppression, we select as location candidates the peaks
within a certain fraction of the maximum peak and take the
peak closest to the search window origin.
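A possible implementation of this peak selection is sketched below. The softmax normalization and peak-fraction threshold follow the description above, while the max-filter NMS, window size, and threshold values are illustrative choices:

import numpy as np
from scipy.ndimage import maximum_filter

def select_peak(score_map, peak_fraction=0.5, nms_size=5):
    """Pick the candidate peak closest to the search-window center."""
    # Softmax normalization over the whole score map.
    s = np.exp(score_map - score_map.max())
    s /= s.sum()
    # Local maxima via a max-filter NMS, kept only if within a
    # fraction of the global maximum.
    peaks = (s == maximum_filter(s, size=nms_size)) & (s >= peak_fraction * s.max())
    ys, xs = np.nonzero(peaks)
    # The window is centered on the projected trajectory prediction,
    # so the candidate closest to the center is preferred.
    cy, cx = (np.array(s.shape) - 1) / 2.0
    i = np.argmin((ys - cy) ** 2 + (xs - cx) ** 2)
    return xs[i], ys[i]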
As a trajectory model, we use a linear Kalman filter to
estimate the state {p, v, a}, respectively the object's absolute
3D location, velocity and acceleration. To prevent unbounded
motion during temporary tracking loss, we apply a damping
factor to both velocity and acceleration in the state transition,
instead of a constant model. The state is updated using
only the 3D location observation residuals. The process and
measurement noise covariances were set empirically.
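The sketch below shows one way to build such a damped constant-acceleration transition and a position-only measurement model for the state {p, v, a}; the damping factors and the generic predict/update steps are illustrative, not the exact values or implementation used:

import numpy as np

def make_damped_transition(dt, gamma_v=0.98, gamma_a=0.9):
    """9x9 transition for state [p, v, a] (3D each) with damping factors
    on velocity and acceleration (gamma values are illustrative)."""
    I = np.eye(3)
    Z = np.zeros((3, 3))
    return np.block([
        [I, dt * I, 0.5 * dt**2 * I],
        [Z, gamma_v * I, dt * I],
        [Z, Z, gamma_a * I],
    ])

# Measurement model: only the 3D position is observed.
H = np.hstack([np.eye(3), np.zeros((3, 6))])

def kf_predict(x, P, F, Q):
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, R):
    y = z - H @ x                      # 3D location observation residual
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ y, (np.eye(9) - K @ H) @ P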
V. ROBUST OBJECT 3D LOCALIZATION
Our object localization is based on the projection of the
object bounding box center onto the ground plane. However,
as illustrated in Fig. 3.a, a camera off-nadir angle $\beta$ leads to a
lateral error $\tilde{x} = h \tan\beta$, where $h$ is the height at which
the ray intersects the object. To reduce this error, we lift the
ground plane by an estimate of half the object height before
raycasting the depth. The next subsections cover all modules
of our localization approach.
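To make the geometry concrete, the sketch below lifts a ground plane along its normal by half an (assumed known) object height and intersects the camera ray through the bounding-box center with the lifted plane. The variable names, the camera-to-world pose convention, and the assumption that the plane normal points upward are ours, for illustration only:

import numpy as np

def localize_target(uv, K, R_wc, t_wc, plane_n, plane_d, object_height):
    """Raycast pixel uv onto the ground plane lifted by half the object height.

    K             : 3x3 camera intrinsics
    R_wc, t_wc    : camera-to-world rotation and translation (camera center)
    plane_n, d    : ground plane n.x + d = 0 in world frame, n unit, pointing up
    Returns the 3D target location in world coordinates, or None if the ray
    is (near) parallel to the plane.
    """
    # Lift the plane by half the object height along its upward normal,
    # so the ray terminates near the object's vertical center and the
    # lateral error h*tan(beta) from the off-nadir angle is reduced.
    d_lifted = plane_d - 0.5 * object_height

    ray_cam = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    ray_world = R_wc @ ray_cam
    denom = plane_n @ ray_world
    if abs(denom) < 1e-9:
        return None
    s = -(plane_n @ t_wc + d_lifted) / denom
    return t_wc + s * ray_world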