Depth Is All You Need for Monocular 3D Detection
Dennis Park∗, Jie Li∗, Dian Chen, Vitor Guizilini, Adrien Gaidon
∗Equal Contribution. Toyota Research Institute, firstname.lastname@tri.global
Abstract— A key contributor to recent progress in 3D detection from single images is monocular depth estimation. Existing methods focus on how to leverage depth explicitly, by generating pseudo-pointclouds or by providing attention cues for image features. More recent works leverage depth prediction as a pretraining task and fine-tune the depth representation while training it for 3D detection. However, the adaptation is insufficient and is limited in scale by manual labels. In this work, we propose further aligning the depth representation with the target domain in an unsupervised fashion. Our methods leverage commonly available LiDAR or RGB videos during training time to fine-tune the depth representation, which leads to improved 3D detectors. Especially when using RGB videos, we show that our two-stage training, which first generates pseudo-depth labels, is critical because of the inconsistency in loss distribution between the two tasks. With either type of reference data, our multi-task learning approach improves over the state of the art on both KITTI and nuScenes, while matching the test-time complexity of its single-task sub-network.
I. INTRODUCTION
Recognizing and localizing objects in 3D space is crucial for applications in robotics, autonomous driving, and augmented reality. Hence, monocular 3D detection has attracted substantial scientific interest in recent years [1], [2], [3], [4], because of its wide impact and the ubiquity of cameras. However, as quantitatively shown in [5], the biggest challenge in monocular 3D detection is the inherent depth ambiguity caused by camera projection. Monocular depth estimation [6], [7], [8], [9] directly addresses this limitation by learning a statistical mapping between pixels and their corresponding depth values, given monocular images.
One of the long-standing questions in 3D detection is how to leverage advances in monocular depth estimation to improve image-based 3D detection. Pioneered by [10], pseudo-LiDAR detectors [11], [12], [13] use monocular depth networks to generate intermediate pseudo point clouds, which are then fed to a point cloud-based 3D detection network. However, the performance of such methods is bounded by the quality of the pseudo point clouds, which deteriorates drastically in the presence of domain gaps. Alternatively, [1] showed that by pre-training a network on a large-scale multi-modal dataset, where point cloud data serves as supervision for depth, a simple end-to-end architecture is capable of learning a geometry-aware representation and achieving state-of-the-art detection accuracy on the target datasets.
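For concreteness, the sketch below illustrates the unprojection step at the heart of pseudo-LiDAR methods: a predicted depth map is lifted into a 3D point cloud using the camera intrinsics, and any error in the depth estimate propagates directly into the points consumed by the downstream detector. This is a minimal illustration, not code from any of the cited works; the function and variable names are placeholders.

```python
import torch

def depth_to_pseudo_pointcloud(depth, K):
    """Unproject a dense depth map (H, W) into a pseudo point cloud (H*W, 3).

    depth: per-pixel depth in meters, e.g. from a monocular depth network (float tensor).
    K:     3x3 camera intrinsics (float tensor).
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Homogeneous pixel coordinates (u, v, 1), flattened to shape (3, H*W).
    pixels = torch.stack([u.flatten().float(), v.flatten().float(), torch.ones(H * W)])
    # Back-project each pixel along its camera ray and scale by depth: x = d * K^-1 [u, v, 1]^T.
    rays = torch.linalg.inv(K) @ pixels            # (3, H*W)
    points = rays * depth.flatten().unsqueeze(0)   # (3, H*W)
    return points.T                                # (H*W, 3), ready for a point cloud detector
```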
However, in [1] the dataset used for pre-training exhibits a significant domain gap from the target data used for 3D detection. The sources of this domain gap include geographical location (which affects scene density, weather, types of objects, etc.) and sensor configuration (e.g., camera extrinsics and intrinsics). It is unclear whether the geometry-aware representation learned during pre-training is sufficiently adapted to the new domain during fine-tuning. The goal of this work is to push the boundaries of how much pre-trained networks can be adapted for robust 3D detection using various types of unlabeled data available in the target domain.
We first consider scenarios where in-domain point cloud data is available at training time, sharing the assumptions of [8], [9]. In this case, we show that a simple multi-task framework, supervised directly with projected depth maps along with 3D bounding boxes, yields substantial improvements over pseudo-LiDAR approaches [11], [12] and pre-training based methods [1]. Unlike pseudo-LiDAR methods, our method entails no additional overhead at test time.
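One way such a multi-task objective can be written is sketched below, assuming a shared image backbone with a detection head and a dense depth head; the sparse depth target comes from projecting in-domain LiDAR points onto the image. Names such as `detection_head`, `depth_head`, and `detection_loss` are illustrative placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(features, lidar_depth, boxes_gt, detection_head, depth_head,
                   detection_loss, depth_weight=1.0):
    """Joint 3D-detection + depth loss on a shared image representation.

    lidar_depth: (B, 1, H, W) depth map obtained by projecting LiDAR points onto
                 the image; pixels without a return are set to 0 and masked out.
    """
    det_out = detection_head(features)
    loss_det = detection_loss(det_out, boxes_gt)          # e.g. an L1-style loss on 3D boxes

    depth_pred = depth_head(features)                     # (B, 1, H, W)
    valid = lidar_depth > 0                               # supervise only where LiDAR hits
    loss_depth = F.l1_loss(depth_pred[valid], lidar_depth[valid])

    # The depth head can be discarded at test time, leaving inference cost unchanged.
    return loss_det + depth_weight * loss_depth
```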
While the assumption that in-domain point cloud data is available during training has spawned insightful research ideas, it can be impractical. For example, most outdoor datasets for 3D detection assume either a multi-modal setting [14], [15], [16] or a camera-only setting [17], [18] during both training and testing. Therefore, we propose a variant of our method that adapts depth representations using only RGB videos.
Inspired by advances in self-supervised monocular depth estimation [6], [7], [19], we extend our method to use temporally adjacent video frames when the LiDAR modality is not available. In this case, we observe that naively applying the same multi-task strategy with two heterogeneous types of loss (a 2D photometric loss [7] and a 3D box L1 distance) results in sub-par performance. To address this heterogeneity, we propose a two-stage method: first, we train a self-supervised depth estimator on raw sequence data to generate dense depth predictions, or pseudo-depth labels. Afterward, we train a multi-task network supervised on these pseudo labels, using a distance-based loss akin to the one used to train the 3D detector. We show that this two-stage framework is crucial to effectively harness the learned self-supervised depth for accurate 3D detection.
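A minimal sketch of this two-stage recipe follows, under the assumption of a frozen self-supervised depth network producing dense pseudo-labels that replace the LiDAR targets of the previous sketch; all module and function names (e.g. `self_sup_depth_net`, `stage2_loss`) are hypothetical.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_pseudo_depth(self_sup_depth_net, images):
    """Stage 1: run a frozen self-supervised depth network (trained on raw videos
    with a photometric loss) to produce dense pseudo-depth labels."""
    self_sup_depth_net.eval()
    return self_sup_depth_net(images)                     # (B, 1, H, W)

def stage2_loss(features, images, boxes_gt, self_sup_depth_net,
                detection_head, depth_head, detection_loss, depth_weight=1.0):
    """Stage 2: supervise the multi-task detector with pseudo-depth labels, using a
    distance-based (L1) depth loss that is homogeneous with the 3D box loss, rather
    than mixing a 2D photometric loss with a 3D box loss."""
    pseudo_depth = generate_pseudo_depth(self_sup_depth_net, images)
    loss_det = detection_loss(detection_head(features), boxes_gt)
    loss_depth = F.l1_loss(depth_head(features), pseudo_depth)
    return loss_det + depth_weight * loss_depth
```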
In summary, our contributions are as follows:
• We propose a simple and effective multi-task network, DD3Dv2, to refine the depth representation for more accurate 3D detection. Our method uses depth supervision from unlabeled data in the target domain only during training time.
• We propose methods for learning depth representa-