Depth Is All You Need for Monocular 3D Detection
Dennis Park, Jie Li, Dian Chen, Vitor Guizilini, Adrien Gaidon
Abstract— A key contributor to recent progress in 3D de-
tection from single images is monocular depth estimation.
Existing methods focus on how to leverage depth explicitly by
generating pseudo-pointclouds or providing attention cues for
image features. More recent works leverage depth prediction
as a pretraining task and fine-tune the depth representation
while training it for 3D detection. However, the adaptation is
insufficient and is limited in scale by manual labels. In this
work, we propose to further align the depth representation with the
target domain in an unsupervised fashion. Our methods leverage
commonly available LiDAR or RGB videos during training time
to fine-tune the depth representation, which leads to improved
3D detectors. In particular, when using RGB videos, we show that
our two-stage training, which first generates pseudo-depth labels, is
critical because of the inconsistency in loss distribution between
the two tasks. With either type of reference data, our multi-
task learning approach improves over the state of the art on both
KITTI and NuScenes, while matching the test-time complexity
of its single-task sub-network.
I. INTRODUCTION
Recognizing and localizing objects in 3D space is cru-
cial for applications in robotics, autonomous driving, and
augmented reality. Hence, in recent years monocular 3D
detection has attracted substantial scientific interest [1],
[2], [3], [4], because of its wide impact and the ubiquity
of cameras. However, as quantitatively shown in [5], the
biggest challenge in monocular 3D detection is the inherent
ambiguity in depth caused by camera projection. Monocular
depth estimation [6], [7], [8], [9] directly addresses this
limitation by learning statistical models between pixels and
their corresponding depth values, given monocular images.
One of the long-standing questions in 3D detection is
how to leverage advances in monocular depth estimation
to improve image-based 3D detection. Pioneered by [10],
pseudo-LiDAR detectors [11], [12], [13] leverage monocular
depth networks to generate intermediate pseudo point clouds,
which are then fed to a point cloud-based 3D detection
network. However, the performance of such methods is
bounded by the quality of the pseudo point clouds, which
deteriorates drastically when facing domain gaps. Alterna-
tively, [1] showed that by pre-training a network on a large-
scale multi-modal dataset where point cloud data serves as
supervision for depth, a simple end-to-end architecture
is capable of learning a geometry-aware representation and
achieving state-of-the-art detection accuracy on the target
datasets.
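To make the pseudo-LiDAR idea above concrete, the following is a minimal sketch of the unprojection step that converts a predicted depth map into a pseudo point cloud; the function name and the KITTI-like intrinsics are illustrative and not taken from any of the cited implementations.

```python
import torch

def depth_to_pseudo_pointcloud(depth, K):
    """Unproject a dense depth map into a camera-frame 3D point cloud.

    depth: (H, W) tensor of per-pixel metric depth from a monocular network.
    K: (3, 3) camera intrinsic matrix.
    Returns an (N, 3) tensor with one point per pixel with positive depth.
    """
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pixels = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3).float()
    rays = pixels @ torch.inverse(K).T        # K^-1 [u, v, 1]^T, one ray per pixel
    points = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    return points[depth.reshape(-1) > 0]      # drop pixels with no valid depth

# Example with a KITTI-like intrinsic matrix and a constant 10 m depth map.
K = torch.tensor([[721.5, 0.0, 609.6],
                  [0.0, 721.5, 172.9],
                  [0.0, 0.0, 1.0]])
cloud = depth_to_pseudo_pointcloud(torch.full((375, 1242), 10.0), K)
print(cloud.shape)  # torch.Size([465750, 3])
```

The resulting points are then fed to a point cloud-based detector, which is why errors in the predicted depth propagate directly into the detector's input.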
However, in [1] the dataset used for pre-training exhibits
a significant domain gap from the target data used for 3D
*Equal Contribution
Toyota Research Institute, firstname.lastname@tri.global
detection. The sources of this domain gap include geographical
location (which affects scene density, weather, types of objects,
etc.) and sensor configuration (e.g., camera
extrinsics and intrinsics). It is unclear whether the geometry-
aware representation learned during pretraining is sufficiently
adapted to the new domain during fine-tuning. The goal
of this work is to push the boundaries of how much pre-
trained networks can be adapted for robust 3D detection
using various types of unlabeled data available in the target
domain.
We first consider scenarios where in-domain point cloud
data is available at training time, sharing the assumptions
with [8], [9]. In this case, we show that a simple multi-task
framework supervised directly with projected depth maps
along with 3D bounding boxes yields impressive improve-
ments, compared with pseudo-LiDAR approaches [11], [12]
or pre-training based methods [1]. Unlike pseudo-LiDAR
methods, our methods entail no additional overhead at test
time.
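As an illustration of this kind of multi-task supervision, the sketch below projects LiDAR returns into a sparse depth map and adds an L1 depth term to the detection loss; tensor shapes, function names, and the naive loss weighting are assumptions for the example, not the exact formulation used in our experiments.

```python
import torch
import torch.nn.functional as F

def project_lidar_to_depth(points, K, image_size):
    """Project LiDAR points (N, 3) in the camera frame onto the image plane,
    producing a sparse depth map; pixels without a return stay at zero."""
    h, w = image_size
    z = points[:, 2]
    uv = (points @ K.T)[:, :2] / z.unsqueeze(1)            # perspective projection
    u, v = uv[:, 0].long(), uv[:, 1].long()
    keep = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = torch.zeros(h, w)
    depth[v[keep], u[keep]] = z[keep]
    return depth

def depth_supervision_loss(pred_depth, lidar_depth):
    """L1 depth loss evaluated only at pixels with a LiDAR return.
    pred_depth and lidar_depth are both (H, W)."""
    mask = lidar_depth > 0
    return F.l1_loss(pred_depth[mask], lidar_depth[mask])

def multitask_loss(det_loss, pred_depth, lidar_depth, depth_weight=1.0):
    # Detection loss plus a weighted auxiliary depth term; the depth head is
    # discarded at test time, so inference cost is unchanged.
    return det_loss + depth_weight * depth_supervision_loss(pred_depth, lidar_depth)
```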
While it has spawned insightful research ideas, the assumption
that in-domain point cloud data is available during training
can be impractical. For example, most outdoor datasets for
3D detection assume either multi-modal settings [14], [15],
[16] or a camera-only setting [17], [18] during both training
and testing. Therefore, we propose an alternative variant
of our method that adapts depth representations using
only RGB videos.
Inspired by advances in self-supervised monocular depth
estimation [6], [7], [19], we extend our method to using
temporally adjacent video frames when the LiDAR modality is
not available. In this case, we observe that naively applying
the same multi-task strategy with the two heterogeneous
types of loss (2D photometric loss [7] and 3D box L1
distance) results in sub-par performance. To address this
heterogeneity, we propose a two-stage method: first, we train
a self-supervised depth estimator using raw sequence data
to generate dense depth predictions, which serve as pseudo-depth labels.
Afterward, we train a multi-task network supervised on these
pseudo labels, using a distance-based loss akin to the one
used to train the 3D detection head. We show that this two-stage
framework is crucial to effectively harness the learned self-
supervised depth as a means for accurate 3D detection. In
summary, our contributions are as follows:
• We propose a simple and effective multi-task network, DD3Dv2, to refine the depth representation for more accurate 3D detection. Our method uses depth supervision from unlabelled data in the target domain only at training time.
• We propose methods for learning depth representation under two practical scenarios of data availability: LiDAR or RGB video. For the latter scenario, we propose a two-stage training strategy to resolve the heterogeneity among the multi-task losses imposed by image-based self-supervised depth estimation, and we show empirically that this is crucial for the performance gain.
• We evaluate our proposed algorithm on two challenging 3D detection benchmarks and achieve state-of-the-art performance.

Fig. 1: DD3Dv2. This paper proposes a simple and effective algorithm to improve monocular 3D detection through depth supervision. (a) The overall flowchart of our proposed system, which can be adapted to either LiDAR supervision or camera videos through pseudo-labels generated by self-supervision algorithms. (b) Our multi-task decoder head, which improves on the original DD3D head by removing redundant information streams.
II. RELATED WORK
A. Monocular 3D detection
Early methods in monocular 3D detection focused on
using geometry cues or pre-trained 3D representations to
predict 3D attributes from 2D detections and enforce 2D-3D
consistency [20], [21], [22], [2], [23]. They often require
additional data to obtain geometry information, such as CAD
models or instance segmentation masks at training time, and
the resulting performance was quite limited.
Inspired by the success of point-cloud based detectors, a
series of Pseudo-LiDAR methods were proposed [10], [24],
[13], [25], [26], which first convert images into a point-cloud
using depth estimators, and then apply ideas from point cloud-
based detectors. A clear advantage of such methods is that, in
theory, a continuous improvement in depth estimation leads
to more accurate detectors. However, the additional depth
estimator incurs a large overhead in inference.
An alternative category is end-to-end 3D detection, in
which 3D bounding boxes are directly regressed from CNN
features [27], [4], [3], [1]. These methods directly regress 3D
cuboid parameterizations from standard 2D detectors [28],
[29]. While these methods tend to be simpler and more
efficient, they do not address the biggest challenge
of image-based detectors: the ambiguity in depth. DD3D [1]
partially addresses this issue by pre-training the network on
a large-scale image-LiDAR dataset.
Our work adopts the idea of end-to-end detectors, pushing
the boundary of how far a good depth representation can help
accurate 3D detection. Our key idea is to leverage raw data
in the target domain, such as point clouds or video frames,
to improve the learning of a geometry-aware representation for
accurate 3D detection.
Other recent works try to leverage dense depth or its
uncertainty as explicit information for 3D lifting [30], feature
attention [31] or detection score [32]. MonoDTR [33] shares
a similar spirit to ours in leveraging in-domain depth through
a multi-task network. However, MonoDTR focuses on using
the predicted depth to help query learning in a Transformer-
style detector [34]. Compared to these methods, our method
focuses on implicit learning of the depth information through
proper supervision signals and training strategies. No additional
module or test-time overhead is added to the baseline 3D
detector.
B. Monocular Depth Estimation
Monocular depth estimation is the task of generating per-
pixel depth from a single image. Such methods usually fall
within two different categories, depending on how training
is conducted. Supervised methods rely on ground-truth depth
maps, generated by projecting information from a range
sensor (e.g., LiDAR) onto the image plane. The training ob-
jective aims to directly minimize the 3D prediction error. In
contrast, self-supervised methods minimize the 2D reprojec-
tion error between temporally adjacent frames, obtained by
warping information from one onto another given predicted
depth and camera transformation. A photometric objective is
used to minimize the error between original and warped
frames, which enables the learning of depth estimation as
a proxy task.
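For reference, a simplified version of the photometric objective commonly used in self-supervised depth estimation (an SSIM plus L1 combination in the spirit of [7]); the implementation details below are illustrative rather than an exact reproduction of any cited loss.

```python
import torch
import torch.nn.functional as F

def photometric_loss(target, warped, alpha=0.85):
    """Photometric error between a target frame and a temporally adjacent
    frame warped into the target view using predicted depth and ego-motion.

    target, warped: (B, 3, H, W) RGB images in [0, 1].
    """
    l1 = (target - warped).abs().mean(1, keepdim=True)

    # 3x3 SSIM computed with average pooling, a standard simplification.
    mu_x = F.avg_pool2d(target, 3, 1, 1)
    mu_y = F.avg_pool2d(warped, 3, 1, 1)
    sigma_x = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(warped ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(target * warped, 3, 1, 1) - mu_x * mu_y
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    dssim = ((1 - ssim) / 2).clamp(0, 1).mean(1, keepdim=True)

    # Weighted combination of the structural and absolute photometric terms.
    return (alpha * dssim + (1 - alpha) * l1).mean()
```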
Another aspect that differentiates these two approaches is
the nature of learned features. Supervised methods optimize
3D quantities (i.e., the metric location of ground-truth and
predicted point-clouds), whereas self-supervised methods op-
erate in the 2D space, aiming to minimize reprojected RGB
information. Because of that, most semi-supervised methods,
which combine small-scale supervision with large-scale self-
supervision, need ways to harmonize these two losses to
avoid task interference, even though the task is the same. In
[35], the supervised loss is projected onto the image plane
in the form of a reprojected distance, leading to improved
results relative to the naive combination of both losses. In
this work, we take the opposite approach and propose to
revert the 2D self-supervised loss back into 3D space
through pseudo-labels.
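The following sketch illustrates this two-stage idea: a depth network trained with photometric self-supervision is first distilled into dense pseudo-depth labels, which then supervise the multi-task detector with a distance-based loss that matches the 3D box term. The detector interface and loader names are hypothetical placeholders, not the exact training code.

```python
import torch

@torch.no_grad()
def generate_pseudo_depth_labels(selfsup_depth_net, image_loader):
    """Stage 1: run a depth network trained with photometric self-supervision
    on the target-domain images and store its dense predictions as labels."""
    labels = []
    for images in image_loader:              # images: (B, 3, H, W)
        labels.append(selfsup_depth_net(images).cpu())
    return labels

def train_step(detector, images, pseudo_depth, gt_boxes, optimizer,
               depth_weight=1.0):
    """Stage 2: multi-task training. The depth head is supervised with a
    distance-based (L1) loss on the pseudo labels, matching the box loss,
    instead of the heterogeneous 2D photometric loss."""
    optimizer.zero_grad()
    pred_depth, det_loss = detector(images, gt_boxes)   # hypothetical API
    depth_loss = (pred_depth - pseudo_depth).abs().mean()
    loss = det_loss + depth_weight * depth_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```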