DMODE: Differential Monocular Object Distance Estimation Module
without Class Specific Information
Pedram Agand1, Michael Chang1, and Mo Chen1
Abstract— Utilizing a single camera for measuring object
distances is a cost-effective alternative to stereo-vision and
LiDAR. Although monocular distance estimation has been
explored in the literature, most existing techniques rely on
object class knowledge to achieve high performance. Without
this contextual data, monocular distance estimation becomes
more challenging, lacking reference points and object-specific
cues. However, these cues can also be misleading for objects with wide intra-class size variation or in adversarial situations, a central challenge for object-agnostic distance estimation. In
this paper, we propose DMODE, a class-agnostic method for
monocular distance estimation that does not require object class
knowledge. DMODE estimates an object’s distance by fusing its
fluctuation in size over time with the camera’s motion, making
it adaptable to various object detectors and unknown objects,
thus addressing these challenges. We evaluate our model on the
KITTI MOTS dataset using ground-truth bounding box anno-
tations and outputs from TrackRCNN and EagerMOT. The
object’s location is determined using only the change in bounding box sizes and the camera’s position, without reference to the object’s detection source or class attributes. Our approach demonstrates
superior performance in multi-class object distance detection
scenarios compared to conventional methods.
I. INTRODUCTION
In AI-enabled object detection applications such as simultaneous localization and mapping (SLAM), virtual reality, surveillance video perception, and autonomous vehicles, real-time and precise estimation of object distances is crucial for safe and efficient navigation [1]–[4]. Traditionally, distance
estimation is performed using stereo or multi-camera imag-
ing systems or LiDAR measurements, both of which have
their own limitations that can impact their use cases and
scalability. Stereo imaging requires precise synchronization
between two cameras, which can introduce multiple points of
failure. Furthermore, stereo vision is limited by the distance
between the cameras and the texture of the region of interest
[5]. Although accurate, LiDAR systems are considerably
more expensive to purchase and operate than a single camera.
Moreover, they have several moving parts and components
that can fail, and equipping a vehicle with multiple LiDAR
devices for redundancy is prohibitively expensive [6]. In
contrast, a system that uses a single camera can incorporate
several backup cameras for the price of a single LiDAR
device, making it more cost-effective and scalable.
However, existing monocular object distance estimation
techniques suffer from accuracy issues or labor-intensive
data collection requirements. Monocular vision has inherent
difficulties in accurately estimating object distances, and
1Simon Fraser University, Burnaby, Canada. {pagand, michael_chang_7, mochen}@sfu.ca
current solutions typically involve a combination of a 2D
object detector and a monocular image depth estimator or
a monocular 3D object detector [7]. The former approach
relies heavily on a monocular depth estimator that is not
optimized for precise object-wise depth estimation, while
the latter requires additional annotations of 3D bounding
box (BBox) coordinates for training, which entails specialized equipment and high labeling costs. Consequently, there is a
need for a reliable and cost-effective approach that can be
easily adapted to new settings and object detectors.
Numerous studies have investigated the use of deep neural
networks (DNN) for direct object distance estimation. Early
approaches such as inverse perspective mapping (IPM) [8]
converted image points into bird’s-eye view coordinates.
However, IPM has limitations, especially for distant objects (beyond about 40 m) or non-straight roads [9]. Other unsu-
pervised methods include learning from unstructured video
sequences [10], employing CNN feature extractors with
distance and keypoint regressors [9], and modifying MaskR-
CNN as demonstrated in [1]. Additionally, a self-supervised
framework for fisheye cameras in autonomous driving was
enhanced with a multi-task learning strategy [11]. The authors of [12] proposed an end-to-end approach called structured convolutional neural field (SCNF) that combines a CNN with a continuous conditional random field.
The accuracy of class-specific object detection relies on
matching the training environment [13]. For example, in a
test scenario involving toy objects, a toy car at the same
distance as a real car will appear much smaller but may
still be detected as a “Car” by object classification networks.
Similarly, when an object is presented in the camera field of
view while tilted, a class-specific approach can only estimate
the distance correctly if the object is in the dataset with
the exact pose, which is unlikely or requires an enormous
dataset. Finally, the precision of class-specific approaches with multiple classes is limited by the accuracy of the classification technique. These limitations do not affect class-agnostic approaches [14], which do not require knowledge of expected object sizes at varying distances.
In this paper, we introduce DMODE, a novel approach
to object distance estimation that addresses significant chal-
lenges. By avoiding reliance on object class information, we
prevent the model from memorizing object size patterns at
various distances. Instead, DMODE utilizes changes in an
object’s projected size over time and camera motion for
distance estimation. Our approach achieves three primary
contributions: 1) It provides accurate distance estimations
without requiring object class information, overcoming the
challenge of class-agnostic estimation. 2) It is independent of
camera intrinsic parameters, ensuring adaptability to diverse
camera setups. 3) It generalizes to unseen object classes, estimating their distances accurately, enables efficient transfer learning for new classes, and adapts across different object tracking networks (OTNs) and deployment scenarios. To facilitate future studies, the code is available on GitHub at https://github.com/pagand/distance-estimation.
II. RELATED WORK
A. Monocular depth estimation
Depth estimation has been approached using DNNs, such as the continuous conditional random fields for image patches proposed by [15]. The accuracy of depth estimation was
improved by [16], who incorporated ordinal regression into
the depth estimation network and used spacing-increasing discretization to convert continuous depth data into an ordinal
class vector. These techniques require significant manpower
and computing resources, as well as specific training images
and corresponding depth maps for each pixel [17]. In the
absence of a depth image serving as ground truth (GT),
unsupervised training might utilize additional depth cues
from stereo images [18] or optical flows [19]. The authors of
[20] introduced a novel deep visual-inertial odometry and
depth estimation framework to enhance the precision of
depth estimation and ego-motion using image sequences
and inertial measurement unit (IMU) raw data. However,
unsupervised depth estimation methods have inherent scale
ambiguity and poor performance due to the lack of perfect
GT and geometric constraints [20].
B. Monocular 3D object detection
The challenging task of 3D object recognition from
monocular images is related to object distance estimation.
Mousavian et al. [21] proposed Deep3DBox, which employs
a 3D regressor module to estimate the 3D box’s dimensions
and orientation and uses the 2D detection to crop the input
features. To replace the widely used 2D R-CNN architecture,
[22] introduced a 3D region proposal network, significantly
improving performance. Furthermore, some studies use point
cloud detection networks or monocular depth estimation
networks as supplementary components of monocular 3D
object recognition [23]. These additional details enhance the
accuracy of 3D object detection networks.
C. Monocular object distance estimation
Ali and Hussein [24] used a geometric model incorpo-
rating camera characteristics and vehicle height as inputs to
determine the distance between two cars. Bertoni et al. [25]
employed a lightweight network to predict human positions
from 2D human postures. A generic object distance estimator was developed by adding a depth regression module to a Faster R-CNN-based structure and an additional keypoint
regressor to improve performance for objects near the camera
[9]. Cai et al. [26] proposed a framework that breaks down
the problem of monocular 3D object recognition into smaller
[Fig. 1. Simplified 1D DMODE: a mathematical viewpoint. The diagram shows camera positions C_0, C_1, C_2 at times t_0, t_1, t_2, the corresponding distances d_0, d_1, d_2 and captured sizes H_0, H_1, H_2, and the displacements ∆C_1, ∆C_2 and ∆D_1, ∆D_2.]
tasks, including object depth estimation, and introduced
a height-guided depth estimation technique to address the
information loss caused by the 3D to 2D projection. An R-
CNN-based network was used to achieve object recognition
and distance estimation simultaneously [27].
III. PROBLEM STATEMENT
Our goal is to estimate object distances in 3D by determin-
ing their relative Cartesian positions to a single camera with-
out using any class-related information. The camera can be
mounted on a moving vehicle or be stationary. For the sake
of simplicity, Fig. 1 depicts an illustrative scenario involving
camera motion in 1D for three time instants, t_0, t_1, t_2. Let d_j be the object's distance from the camera, D_j the object's absolute position, H_j the object's size as captured by the camera, and C_j the camera's absolute position at time t_j. We define ∆F_j = F_j − F_{j−1}, where F can be any of the aforementioned variables. Our objective is to compute d_2, the object's relative position with respect to the camera at the current (latest) time, given H_0, H_1, H_2, ∆C_1, ∆C_2.
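For concreteness, the 1D setting of Fig. 1 can be summarized by the standard pinhole relations below (our paraphrase of the geometry, not text from the paper); the unknown constant k, the focal length times the object's true size, cancels out of the final estimate and never needs to be known:
\[
H_j = \frac{k}{d_j}, \qquad D_j = C_j + d_j, \qquad j \in \{0, 1, 2\}.
\]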
Here are the assumptions: 1) The distance between the
object and the camera in the captured frames varies, which
can be caused by the movement of the object, camera, or
both. 2) The camera’s movement is measured by an IMU
[28]. 3) Within the captured frames, the object does not have
a pitch rate (rotation around a horizontal axis perpendicular
to the line connecting the camera and the object).
IV. METHOD
The framework is depicted in Fig. 2. Our method
involves tracking the projected size of an object in the camera
lens over a predefined time frame while taking into account
the camera’s motion to estimate the object’s distance from the
camera. To achieve this, we require an OTN and an IMU for
the camera. The distance to an object moving with constant velocity can be computed analytically using the following relation:
\[
d = \frac{H_0 H_1 \, (\Delta C_1 - \Delta C_2)}{H_2 \, \Delta H_1 - H_0 \, \Delta H_2}, \tag{1}
\]
where H_i and C_i are the object's pixel height and the camera's position at the i-th frame, and ∆X_i = X_i − X_{i−1} for X ∈ {H, C}.
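Purely as an illustration, here is a minimal Python sketch of Eq. (1), assuming the pinhole relations above; the function name and toy numbers are ours, not the paper's:

    def dmode_distance(h0, h1, h2, dc1, dc2):
        # Eq. (1): distance at the latest of three frames for an object
        # moving at constant velocity. h0..h2 are the object's pixel heights
        # at t0..t2; dc1, dc2 are camera displacements C1-C0 and C2-C1,
        # e.g. integrated from the IMU.
        dh1, dh2 = h1 - h0, h2 - h1
        denom = h2 * dh1 - h0 * dh2
        if abs(denom) < 1e-12:
            # Degenerate geometry (e.g. everything moving at constant
            # velocity) makes the scale unobservable from three frames.
            raise ValueError("insufficient motion excitation for Eq. (1)")
        return h0 * h1 * (dc1 - dc2) / denom

    # Toy check: stationary object 20 m ahead, accelerating camera at
    # positions 0, 1, 3 m; pixel height behaves as k/d for some unknown k.
    k, obj = 100.0, 20.0
    cam = [0.0, 1.0, 3.0]
    h = [k / (obj - c) for c in cam]
    print(dmode_distance(*h, cam[1] - cam[0], cam[2] - cam[1]))  # -> 17.0

The guard against a vanishing denominator reflects the need for relative-motion excitation across the three frames.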
In Sec. IV-A, to demonstrate this model-agnosticity, we derive
a mathematical expression for calculating the distance to
a 3D object with any dynamic movement of order q and
unknown parameters. This analytical proof will illustrate
that, to determine the distance to an object with sufficient
frames (i.e., q + 1 frames), there is no requirement for object class information or camera intrinsic parameters, as sketched below.
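Purely as an illustration of that idea (our own finite-difference sketch under the pinhole relations above, not the paper's Sec. IV-A derivation; we use q + 2 equally spaced frames for an order-q motion, which may differ from the paper's frame-count convention):

    import numpy as np

    def dmode_general(H, C, q=1):
        # If the object's absolute position D_j = C_j + k / H_j follows a
        # polynomial of order q in time, its (q+1)-th finite difference
        # vanishes, yielding one linear equation in the unknown scale k.
        H, C = np.asarray(H, float), np.asarray(C, float)
        k = -np.diff(C, n=q + 1)[-1] / np.diff(1.0 / H, n=q + 1)[-1]
        return k / H[-1]  # distance at the latest frame

    # Reproduces the three-frame constant-velocity example above (q = 1):
    print(dmode_general([5.0, 100 / 19, 100 / 17], [0.0, 1.0, 3.0]))  # -> 17.0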