
DMODE: Differential Monocular Object Distance Estimation Module
without Class Specific Information
Pedram Agand1, Michael Chang1, and Mo Chen1
Abstract— Utilizing a single camera for measuring object
distances is a cost-effective alternative to stereo-vision and
LiDAR. Although monocular distance estimation has been
explored in the literature, most existing techniques rely on
object class knowledge to achieve high performance. Without
this contextual data, monocular distance estimation becomes
more challenging, lacking reference points and object-specific
cues. However, these cues can be misleading for objects with wide size variation or in adversarial
situations, which is a key challenge for object-agnostic distance estimation. In
this paper, we propose DMODE, a class-agnostic method for
monocular distance estimation that does not require object class
knowledge. DMODE estimates an object’s distance by fusing its
fluctuation in size over time with the camera’s motion, making
it adaptable to various object detectors and unknown objects,
thus addressing these challenges. We evaluate our model on the
KITTI MOTS dataset using ground-truth bounding box annotations
and outputs from TrackRCNN and EagerMOT. The
object’s location is determined using the change in bounding
box sizes and camera position, without relying on the object’s
detection source or class attributes. Our approach demonstrates
superior performance in multi-class object distance detection
scenarios compared to conventional methods.
I. INTRODUCTION
In AI-enabled object detection applications such as simultaneous
localization and mapping (SLAM), virtual reality,
surveillance video perception, and autonomous vehicles, real-time
and precise estimation of object distances is crucial for
safe and efficient navigation [1]–[4]. Traditionally, distance
estimation is performed using stereo or multi-camera imaging
systems or LiDAR measurements, both of which have
their own limitations that can impact their use cases and
scalability. Stereo imaging requires precise synchronization
between two cameras, which can introduce multiple points of
failure. Furthermore, stereo vision is limited by the distance
between the cameras and the texture of the region of interest
[5]. Although accurate, LiDAR systems are considerably
more expensive to purchase and operate than a single camera.
Moreover, they have several moving parts and components
that can fail, and equipping a vehicle with multiple LiDAR
devices for redundancy is prohibitively expensive [6]. In
contrast, a system that uses a single camera can incorporate
several backup cameras for the price of a single LiDAR
device, making it more cost-effective and scalable.
However, existing monocular object distance estimation
techniques suffer from accuracy issues or labor-intensive
data collection requirements. Monocular vision has inherent
difficulties in accurately estimating object distances, and
1Simon Fraser University, Burnaby, Canada {pagand,
michael_chang_7, mochen}@sfu.ca
current solutions typically involve a combination of a 2D
object detector and a monocular image depth estimator or
a monocular 3D object detector [7]. The former approach
relies heavily on a monocular depth estimator that is not
optimized for precise object-wise depth estimation, while
the latter requires additional annotations of 3D bounding
box (BBox) coordinates for training, which entails specialized
equipment and high labeling costs. Consequently, there is a
need for a reliable and cost-effective approach that can be
easily adapted to new settings and object detectors.
Numerous studies have investigated the use of deep neural
networks (DNNs) for direct object distance estimation. Early
approaches such as inverse perspective mapping (IPM) [8]
converted image points into bird’s-eye view coordinates.
However, IPM has limitations, especially for distant objects
(around 40 m) or non-straight roads [9]. Other unsupervised
methods include learning from unstructured video
sequences [10], employing CNN feature extractors with
distance and keypoint regressors [9], and modifying Mask
R-CNN as demonstrated in [1]. Additionally, a self-supervised
framework for fisheye cameras in autonomous driving was
enhanced with a multi-task learning strategy [11]. The authors
of [12] proposed an end-to-end approach called structured
convolutional neural field (SCNF) that combines a CNN with a
continuous conditional random field.
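To make the IPM idea concrete, the snippet below is a minimal illustrative sketch of our own (not code from [8]): under a flat-road assumption, a pixel on the road surface is back-projected through a pinhole camera with known intrinsics, mounting height, and pitch, and intersected with the ground plane to obtain bird’s-eye-view coordinates. The intrinsics and mounting height used here are hypothetical values loosely resembling a KITTI-like setup.

```python
import numpy as np

def ipm_point(u, v, K, cam_height, pitch):
    """Map an image pixel assumed to lie on the road surface to ground-plane
    coordinates (lateral offset, forward distance) in meters."""
    # Back-project the pixel into a viewing ray in camera coordinates
    # (x right, y down, z forward).
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Rotate the ray by the camera pitch about the x-axis so it is expressed
    # in a road-aligned frame (the sign convention here is illustrative only).
    c, s = np.cos(pitch), np.sin(pitch)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0,   c,  -s],
                  [0.0,   s,   c]])
    ray = R @ ray_cam
    # Intersect the ray with the ground plane located cam_height below the camera.
    scale = cam_height / ray[1]
    point = scale * ray
    return point[0], point[2]

# Hypothetical intrinsics and mounting height, roughly KITTI-like.
K = np.array([[721.5,   0.0, 609.6],
              [  0.0, 721.5, 172.9],
              [  0.0,   0.0,   1.0]])
x_lat, z_fwd = ipm_point(640.0, 300.0, K, cam_height=1.65, pitch=0.0)
print(f"bird's-eye-view position: {x_lat:.2f} m lateral, {z_fwd:.2f} m forward")
```

The same flat-ground assumption is what degrades IPM accuracy for distant objects and non-straight roads, as noted above.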
The accuracy of class-specific object detection relies on
matching the training environment [13]. For example, in a
test scenario involving toy objects, a toy car at the same
distance as a real car will appear much smaller but may
still be detected as a “Car” by object classification networks.
Similarly, when an object appears tilted in the camera’s field of
view, a class-specific approach can only estimate its distance
correctly if the object is present in the training dataset in
that exact pose, which is unlikely or would require an enormous
dataset. Finally, the precision of class-specific approaches
with multiple classes is limited by the accuracy of the
classification technique. These limitations do not affect class-agnostic
approaches [14], which do not require knowledge of
expected object sizes at varying distances.
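As a brief illustration of why class-agnostic estimation is feasible at all, consider the following simplified one-dimensional sketch (our own idealization, not DMODE’s actual formulation). For a pinhole camera moving straight toward a static object, the projected size is inversely proportional to distance, so the unknown true object size cancels when two frames are compared, and the camera’s known displacement fixes the absolute scale.

```python
def distance_from_size_change(h1, h2, cam_displacement):
    """Distance to the object at the first frame, given its projected heights
    h1 and h2 (pixels) in two frames and the known distance (meters) the camera
    moved straight toward the object between them."""
    # Pinhole model: h_t = f * H / d_t, so h1 / h2 = d2 / d1, with the unknown
    # focal length f and true size H cancelling out. Combined with
    # d2 = d1 - cam_displacement, this gives d1 = cam_displacement / (1 - h1/h2).
    return cam_displacement / (1.0 - h1 / h2)

# Example: a bounding box grows from 40 px to 50 px while the camera advances 4 m,
# implying the object was about 20 m away in the first frame.
print(f"{distance_from_size_change(40.0, 50.0, 4.0):.1f} m")  # 20.0 m
```

The approach introduced next builds on this kind of differential cue rather than on learned class-specific size priors.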
In this paper, we introduce DMODE, a novel approach
to object distance estimation that addresses significant challenges.
By avoiding reliance on object class information, we
prevent the model from memorizing object size patterns at
various distances. Instead, DMODE utilizes changes in an
object’s projected size over time and camera motion for
distance estimation. Our approach makes three primary
contributions: 1) It provides accurate distance estimations
without requiring object class information, overcoming the