
DMODE: Differential Monocular Object Distance Estimation Module
without Class Specific Information
Pedram Agand1, Michael Chang1, and Mo Chen1
Abstract— Utilizing a single camera for measuring object
distances is a cost-effective alternative to stereo-vision and
LiDAR. Although monocular distance estimation has been
explored in the literature, most existing techniques rely on
object class knowledge to achieve high performance. Without
this contextual data, monocular distance estimation becomes
more challenging, lacking reference points and object-specific
cues. However, these cues can be misleading for objects with wide size variation or in adversarial
situations, which is a key challenge for object-agnostic distance estimation. In
this paper, we propose DMODE, a class-agnostic method for
monocular distance estimation that does not require object class
knowledge. DMODE estimates an object’s distance by fusing its
fluctuation in size over time with the camera’s motion, making
it adaptable to various object detectors and unknown objects,
thus addressing these challenges. We evaluate our model on the
KITTI MOTS dataset using ground-truth bounding box annotations
and outputs from TrackRCNN and EagerMOT. The
object’s location is determined using the change in bounding
box sizes and camera position, without relying on the object’s
detection source or class attributes. Our approach demonstrates
superior performance in multi-class object distance detection
scenarios compared to conventional methods.
I. INTRODUCTION
In AI-enabled object detection applications such as simultaneous
localization and mapping (SLAM), virtual reality,
surveillance video perception, and autonomous vehicles, real-time
and precise estimation of object distances is crucial for
safe and efficient navigation [1]–[4]. Traditionally, distance
estimation is performed using stereo or multi-camera imaging
systems or LiDAR measurements, both of which have
their own limitations that can impact their use cases and
scalability. Stereo imaging requires precise synchronization
between two cameras, which can introduce multiple points of
failure. Furthermore, stereo vision is limited by the distance
between the cameras and the texture of the region of interest
[5]. Although accurate, LiDAR systems are considerably
more expensive to purchase and operate than a single camera.
Moreover, they have several moving parts and components
that can fail, and equipping a vehicle with multiple LiDAR
devices for redundancy is prohibitively expensive [6]. In
contrast, a system that uses a single camera can incorporate
several backup cameras for the price of a single LiDAR
device, making it more cost-effective and scalable.
However, existing monocular object distance estimation
techniques suffer from accuracy issues or labor-intensive
data collection requirements. Monocular vision has inherent
difficulties in accurately estimating object distances, and
1Simon Fraser University, Burnaby, Canada {pagand,
michael_chang_7, mochen}@sfu.ca
current solutions typically involve a combination of a 2D
object detector and a monocular image depth estimator or
a monocular 3D object detector [7]. The former approach
relies heavily on a monocular depth estimator that is not
optimized for precise object-wise depth estimation, while
the latter requires additional annotations of 3D bounding
box (BBox) coordinates for training, which entails specialized
equipment and high labeling costs. Consequently, there is a
need for a reliable and cost-effective approach that can be
easily adapted to new settings and object detectors.
Numerous studies have investigated the use of deep neural
networks (DNNs) for direct object distance estimation. Early
approaches such as inverse perspective mapping (IPM) [8]
converted image points into bird’s-eye view coordinates.
However, IPM has limitations, especially for distant objects
(around 40 m) or non-straight roads [9]. Other unsupervised
methods include learning from unstructured video
sequences [10], employing CNN feature extractors with
distance and keypoint regressors [9], and modifying Mask
R-CNN as demonstrated in [1]. Additionally, a self-supervised
framework for fisheye cameras in autonomous driving was
enhanced with a multi-task learning strategy [11]. The authors
of [12] proposed an end-to-end approach called structured
convolutional neural field (SCNF) that combines a CNN with a
continuous conditional random field.
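To make the IPM idea concrete, the snippet below is a minimal illustrative sketch of our own (not code from [8]): under a flat-road assumption, a pixel on the road surface is back-projected through a pinhole camera with known intrinsics, mounting height, and pitch, and intersected with the ground plane to obtain bird’s-eye-view coordinates. The intrinsics and mounting height used here are hypothetical values loosely resembling a KITTI-like setup.

```python
import numpy as np

def ipm_point(u, v, K, cam_height, pitch):
    """Map an image pixel assumed to lie on the road surface to ground-plane
    coordinates (lateral offset, forward distance) in meters."""
    # Back-project the pixel into a viewing ray in camera coordinates
    # (x right, y down, z forward).
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Rotate the ray by the camera pitch about the x-axis so it is expressed
    # in a road-aligned frame (the sign convention here is illustrative only).
    c, s = np.cos(pitch), np.sin(pitch)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0,   c,  -s],
                  [0.0,   s,   c]])
    ray = R @ ray_cam
    # Intersect the ray with the ground plane located cam_height below the camera.
    scale = cam_height / ray[1]
    point = scale * ray
    return point[0], point[2]

# Hypothetical intrinsics and mounting height, roughly KITTI-like.
K = np.array([[721.5,   0.0, 609.6],
              [  0.0, 721.5, 172.9],
              [  0.0,   0.0,   1.0]])
x_lat, z_fwd = ipm_point(640.0, 300.0, K, cam_height=1.65, pitch=0.0)
print(f"bird's-eye-view position: {x_lat:.2f} m lateral, {z_fwd:.2f} m forward")
```

The same flat-ground assumption is what degrades IPM accuracy for distant objects and non-straight roads, as noted above.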
The accuracy of class-specific object detection relies on
matching the training environment [13]. For example, in a
test scenario involving toy objects, a toy car at the same
distance as a real car will appear much smaller but may
still be detected as a “Car” by object classification networks.
Similarly, when an object appears tilted in the camera’s field of
view, a class-specific approach can only estimate its distance
correctly if the object is present in the training dataset in
that exact pose, which is unlikely or would require an enormous
dataset. Finally, the precision of class-specific approaches
with multiple classes is limited by the accuracy of the
classification technique. These limitations do not affect class-agnostic
approaches [14], which do not require knowledge of
expected object sizes at varying distances.
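As a brief illustration of why class-agnostic estimation is feasible at all, consider the following simplified one-dimensional sketch (our own idealization, not DMODE’s actual formulation). For a pinhole camera moving straight toward a static object, the projected size is inversely proportional to distance, so the unknown true object size cancels when two frames are compared, and the camera’s known displacement fixes the absolute scale.

```python
def distance_from_size_change(h1, h2, cam_displacement):
    """Distance to the object at the first frame, given its projected heights
    h1 and h2 (pixels) in two frames and the known distance (meters) the camera
    moved straight toward the object between them."""
    # Pinhole model: h_t = f * H / d_t, so h1 / h2 = d2 / d1, with the unknown
    # focal length f and true size H cancelling out. Combined with
    # d2 = d1 - cam_displacement, this gives d1 = cam_displacement / (1 - h1/h2).
    return cam_displacement / (1.0 - h1 / h2)

# Example: a bounding box grows from 40 px to 50 px while the camera advances 4 m,
# implying the object was about 20 m away in the first frame.
print(f"{distance_from_size_change(40.0, 50.0, 4.0):.1f} m")  # 20.0 m
```

The approach introduced next builds on this kind of differential cue rather than on learned class-specific size priors.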
In this paper, we introduce DMODE, a novel approach
to object distance estimation that addresses significant challenges.
By avoiding reliance on object class information, we
prevent the model from memorizing object size patterns at
various distances. Instead, DMODE utilizes changes in an
object’s projected size over time and camera motion for
distance estimation. Our approach makes three primary
contributions: 1) It provides accurate distance estimations
without requiring object class information, overcoming the