Depth Is All You Need for Monocular 3D Detection
Dennis Park∗, Jie Li∗, Dian Chen, Vitor Guizilini, Adrien Gaidon
∗Equal Contribution. Toyota Research Institute, firstname.lastname@tri.global
Abstract— A key contributor to recent progress in 3D detection from single images is monocular depth estimation. Existing methods focus on how to leverage depth explicitly, by generating pseudo-pointclouds or by providing attention cues for image features. More recent works leverage depth prediction as a pretraining task and fine-tune the depth representation while training it for 3D detection. However, the adaptation is insufficient and is limited in scale by manual labels. In this work, we propose further aligning the depth representation with the target domain in an unsupervised fashion. Our methods leverage commonly available LiDAR or RGB videos during training time to fine-tune the depth representation, which leads to improved 3D detectors. Especially when using RGB videos, we show that our two-stage training, which first generates pseudo-depth labels, is critical because of the inconsistency in loss distribution between the two tasks. With either type of reference data, our multi-task learning approach improves over the state of the art on both KITTI and nuScenes, while matching the test-time complexity of its single-task sub-network.
I. INTRODUCTION
Recognizing and localizing objects in 3D space is crucial for applications in robotics, autonomous driving, and augmented reality. Hence, monocular 3D detection has attracted substantial scientific interest in recent years [1], [2], [3], [4], because of its wide impact and the ubiquity of cameras. However, as quantitatively shown in [5], the biggest challenge in monocular 3D detection is the inherent depth ambiguity caused by camera projection. Monocular depth estimation [6], [7], [8], [9] directly addresses this limitation by learning a statistical mapping between pixels and their corresponding depth values, given monocular images.
One of the long-standing questions in 3D detection is how to leverage advances in monocular depth estimation to improve image-based 3D detection. Pioneered by [10], pseudo-LiDAR detectors [11], [12], [13] use monocular depth networks to generate intermediate pseudo point clouds, which are then fed to a point cloud-based 3D detection network. However, the performance of such methods is bounded by the quality of the pseudo point clouds, which deteriorates drastically in the presence of domain gaps. Alternatively, [1] showed that by pre-training a network on a large-scale multi-modal dataset, where point cloud data serves as supervision for depth, a simple end-to-end architecture is capable of learning a geometry-aware representation and achieving state-of-the-art detection accuracy on the target datasets.
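For concreteness, the sketch below illustrates the unprojection step at the heart of pseudo-LiDAR methods: a predicted depth map is lifted into a 3D point cloud using the camera intrinsics, and any error in the depth estimate propagates directly into the points consumed by the downstream detector. This is a minimal illustration, not code from any of the cited works; the function and variable names are placeholders.

```python
import torch

def depth_to_pseudo_pointcloud(depth, K):
    """Unproject a dense depth map (H, W) into a pseudo point cloud (H*W, 3).

    depth: per-pixel depth in meters, e.g. from a monocular depth network (float tensor).
    K:     3x3 camera intrinsics (float tensor).
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Homogeneous pixel coordinates (u, v, 1), flattened to shape (3, H*W).
    pixels = torch.stack([u.flatten().float(), v.flatten().float(), torch.ones(H * W)])
    # Back-project each pixel along its camera ray and scale by depth: x = d * K^-1 [u, v, 1]^T.
    rays = torch.linalg.inv(K) @ pixels            # (3, H*W)
    points = rays * depth.flatten().unsqueeze(0)   # (3, H*W)
    return points.T                                # (H*W, 3), ready for a point cloud detector
```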
However, in [1] the dataset used for pre-training exhibits a significant domain gap from the target data used for 3D detection. The sources of this domain gap include geographical location (which affects scene density, weather, types of objects, etc.) and sensor configuration (e.g., camera extrinsics and intrinsics). It is unclear whether the geometry-aware representation learned during pre-training is sufficiently adapted to the new domain during fine-tuning. The goal of this work is to push the boundaries of how much pre-trained networks can be adapted for robust 3D detection using various types of unlabeled data available in the target domain.
We first consider scenarios where in-domain point cloud data is available at training time, sharing the assumptions of [8], [9]. In this case, we show that a simple multi-task framework, supervised directly with projected depth maps along with 3D bounding boxes, yields substantial improvements over pseudo-LiDAR approaches [11], [12] and pre-training based methods [1]. Unlike pseudo-LiDAR methods, our method entails no additional overhead at test time.
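One way such a multi-task objective can be written is sketched below, assuming a shared image backbone with a detection head and a dense depth head; the sparse depth target comes from projecting in-domain LiDAR points onto the image. Names such as `detection_head`, `depth_head`, and `detection_loss` are illustrative placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(features, lidar_depth, boxes_gt, detection_head, depth_head,
                   detection_loss, depth_weight=1.0):
    """Joint 3D-detection + depth loss on a shared image representation.

    lidar_depth: (B, 1, H, W) depth map obtained by projecting LiDAR points onto
                 the image; pixels without a return are set to 0 and masked out.
    """
    det_out = detection_head(features)
    loss_det = detection_loss(det_out, boxes_gt)          # e.g. an L1-style loss on 3D boxes

    depth_pred = depth_head(features)                     # (B, 1, H, W)
    valid = lidar_depth > 0                               # supervise only where LiDAR hits
    loss_depth = F.l1_loss(depth_pred[valid], lidar_depth[valid])

    # The depth head can be discarded at test time, leaving inference cost unchanged.
    return loss_det + depth_weight * loss_depth
```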
While the assumption that in-domain point cloud data is available during training has spawned insightful research ideas, it can be impractical. For example, most outdoor datasets for 3D detection assume either a multi-modal setting [14], [15], [16] or a camera-only setting [17], [18] during both training and testing. Therefore, we propose a variant of our method that adapts depth representations using only RGB videos.
Inspired by advances in self-supervised monocular depth estimation [6], [7], [19], we extend our method to use temporally adjacent video frames when the LiDAR modality is not available. In this case, we observe that naively applying the same multi-task strategy with two heterogeneous types of loss (a 2D photometric loss [7] and a 3D box L1 distance) results in sub-par performance. To address this heterogeneity, we propose a two-stage method: first, we train a self-supervised depth estimator on raw sequence data to generate dense depth predictions, or pseudo-depth labels. Afterward, we train a multi-task network supervised on these pseudo labels, using a distance-based loss akin to the one used to train the 3D detector. We show that this two-stage framework is crucial to effectively harness the learned self-supervised depth for accurate 3D detection.
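A minimal sketch of this two-stage recipe follows, under the assumption of a frozen self-supervised depth network producing dense pseudo-labels that replace the LiDAR targets of the previous sketch; all module and function names (e.g. `self_sup_depth_net`, `stage2_loss`) are hypothetical.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_pseudo_depth(self_sup_depth_net, images):
    """Stage 1: run a frozen self-supervised depth network (trained on raw videos
    with a photometric loss) to produce dense pseudo-depth labels."""
    self_sup_depth_net.eval()
    return self_sup_depth_net(images)                     # (B, 1, H, W)

def stage2_loss(features, images, boxes_gt, self_sup_depth_net,
                detection_head, depth_head, detection_loss, depth_weight=1.0):
    """Stage 2: supervise the multi-task detector with pseudo-depth labels, using a
    distance-based (L1) depth loss that is homogeneous with the 3D box loss, rather
    than mixing a 2D photometric loss with a 3D box loss."""
    pseudo_depth = generate_pseudo_depth(self_sup_depth_net, images)
    loss_det = detection_loss(detection_head(features), boxes_gt)
    loss_depth = F.l1_loss(depth_head(features), pseudo_depth)
    return loss_det + depth_weight * loss_depth
```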
In summary, our contributions are as follows:
• We propose a simple and effective multi-task network, DD3Dv2, to refine the depth representation for more accurate 3D detection. Our method uses depth supervision from unlabeled data in the target domain only during training time.
• We propose methods for learning depth representa-