ROBUST MONOCULAR LOCALIZATION OF DRONES BY ADAPTING DOMAIN MAPS TO
DEPTH PREDICTION INACCURACIES
Priyesh Shukla, Sureshkumar S., Alex C. Stutts, Sathya Ravi, Theja Tulabandhula, and Amit R. Trivedi
University of Illinois at Chicago, USA
ABSTRACT
We present a novel monocular localization framework by
jointly training deep learning-based depth prediction and
Bayesian filtering-based pose reasoning. The proposed cross-
modal framework significantly outperforms deep learning-
only predictions with respect to model scalability and tol-
erance to environmental variations. Specifically, we show
little-to-no degradation of pose accuracy even with extremely
poor depth estimates from a lightweight depth predictor. Our
framework also maintains high pose accuracy in extreme
lighting variations compared to standard deep learning, even
without explicit domain adaptation. By openly representing
the map and intermediate feature maps (such as depth es-
timates), our framework also allows for faster updates and
reusing intermediate predictions for other tasks, such as ob-
stacle avoidance, resulting in much higher resource efficiency.
Index Terms—Depth neural network, drone localization.
1 Introduction
For self-navigation, the most fundamental computation required for a vehicle is to determine its position and orientation, i.e., its pose, during motion. Higher-level path planning objectives such as motion tracking and obstacle avoidance operate by continuously estimating the vehicle's pose. Recently, deep
neural networks (DNNs) have shown a remarkable ability for
vision-based pose estimation in highly complex and cluttered
environments [1-3]. For visual pose estimation, DNNs can learn the correlation between a vehicle's position/orientation and the visual field of a mounted camera; thereby, the vehicle's pose can be predicted using a monocular camera alone. In contrast, traditional methods require bulky and power-hungry range sensors or stereo vision sensors to resolve the ambiguity between an object's distance and its scale [4, 5].
However, a standard pose-DNN's implicit learning of flying-domain features, such as the map, placement of objects, coordinate frame, and domain structure, also limits the robustness and adaptability of its pose estimates. Traditional filtering-based approaches [6] account for the flying-space structure using explicit representations such as voxel grids, occupancy grids, and Gaussian mixture models (GMMs) [7]; thereby, updates to the flying space, such as map extensions, new objects, and new locations, can be more easily accommodated. In comparison, DNN-based estimators cannot handle selective map updates, and the entire model must be retrained even under small randomized or structured perturbations. Additionally, filtering loops in traditional methods can adjudicate predictive uncertainties against measurements to systematically prune the hypothesis space and can express prediction confidence along with the prediction itself [8]. In contrast, feedforward pose estimates from a deterministic DNN are vulnerable to measurement and modeling uncertainties.
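To make this contrast concrete, the following minimal sketch (not from the paper; the class, parameter values, and update rule are illustrative assumptions) shows how an explicit map representation, here a log-odds occupancy grid, accommodates a selective update: only the cells affected by a new obstacle are touched, whereas a pose-DNN that encodes the map implicitly in its weights would require retraining.

import numpy as np

class OccupancyGrid:
    def __init__(self, shape, resolution):
        self.resolution = resolution          # meters per cell (assumed)
        self.log_odds = np.zeros(shape)       # log-odds of occupancy per cell

    def update_region(self, cell_slices, occupied, l_occ=0.85, l_free=-0.4):
        # Update only the cells covered by cell_slices; the rest of the map is untouched.
        self.log_odds[cell_slices] += l_occ if occupied else l_free

    def probability(self):
        # Convert log-odds back to occupancy probabilities.
        return 1.0 / (1.0 + np.exp(-self.log_odds))

# Adding a new obstacle touches only the affected cells, unlike retraining a DNN.
grid = OccupancyGrid(shape=(200, 200, 50), resolution=0.1)
grid.update_region((slice(40, 60), slice(100, 120), slice(0, 10)), occupied=True)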
In this paper, we integrate traditional filtering techniques with deep learning to overcome such limitations of DNN-based pose estimation while exploiting deep learning's suitability for operating efficiently with a monocular camera alone. Specifically, we present a novel framework for visual localization by integrating DNN-based depth prediction and Bayesian filtering-based pose localization. As shown in Figure 1, to avoid range sensors for localization, we utilize a lightweight DNN-based depth prediction network at the front end and sequential Bayesian estimation at the back end. Our key observation
is that, unlike pose estimation, which innately depends on
map characteristics such as spatial structure, objects, coordi-
nate frame, etc., depth prediction is map-independent [9, 10].
Thus, applying deep learning only to domain-independent tasks and utilizing traditional models where the domain is openly (or explicitly) represented improves predictive robustness. Limiting deep learning to only domain-independent
tasks also allows our framework to utilize vast training sets
from unrelated domains. Open representation of the map and the depth estimates enables faster domain-specific updates and reuse of intermediate feature maps for other autonomy objectives, such as obstacle avoidance, thus improving computational efficiency.
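A minimal sketch of this front-end/back-end split is given below, under stated assumptions: depth_dnn (the map-independent depth predictor) and render_depth_from_map (which renders an expected depth image from the explicit domain map at a hypothesized pose) are hypothetical placeholders, and the likelihood model is illustrative rather than the paper's implementation. Pose hypotheses (particles) are weighted by how well the DNN-predicted depth matches the depth expected from the map, and the particle spread expresses the prediction confidence discussed above.

import numpy as np

def particle_filter_step(particles, weights, control, image,
                         depth_dnn, render_depth_from_map,
                         motion_noise=0.05, sigma=0.5):
    # 1. Predict: propagate each pose hypothesis with the motion model plus noise.
    particles = particles + control + np.random.normal(0.0, motion_noise, particles.shape)

    # 2. Update: compare DNN-predicted depth with map-rendered depth per particle.
    predicted_depth = depth_dnn(image)                  # map-independent prediction
    for i, pose in enumerate(particles):
        expected_depth = render_depth_from_map(pose)    # uses the explicit domain map
        error = np.nanmean((predicted_depth - expected_depth) ** 2)
        weights[i] *= np.exp(-error / (2.0 * sigma ** 2))
    weights /= np.sum(weights)

    # 3. Resample: prune unlikely hypotheses and concentrate on probable poses.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))

    # Pose estimate and its spread (prediction confidence) are both available.
    return particles, weights, particles.mean(axis=0), particles.std(axis=0)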
2 Monocular Localization with Depth
Neural Network and Pose Filters
As shown in Figure 1, our framework integrates deep learning-based depth prediction and Bayesian filtering for visual pose localization in 3D space. At the front end, a depth DNN scans
monocular camera images to predict the relative depth of im-
age pixels from the camera’s focal point. A particle filter lo-