
encoder-decoder [5], [6], and hourglass [7], [8]. Although the
performance of depth-only methods has improved dramatically,
a large gap remains between them and RGB-guided methods.
We observe three different distributions of measurement
points in the areas to be filled: 1) the measurement points are
mostly accurate and evenly distributed; 2) foreground and
background points overlap due to occlusion, and points whose
depth values change abruptly should be removed as outliers
[9], [10]; 3) no measurement points are available nearby.
As shown in Fig. 1, according to these distributions, we divide
the areas that need to be filled into normal areas, overlap
areas, and blank areas. Through experiments, we observe that
existing depth-only methods such as S2D (depth-only) [6]
and IP-Basic [11] obtain promising results in normal areas;
the performance gap between them and state-of-the-art
RGB-guided methods such as PENet [12] and S2D (RGB-guided) [6]
derives primarily from the overlap and blank areas. We
attribute this to the fact that depth-only methods have no
reliable input information in these areas, whereas RGB-guided
methods can exploit the rich semantic information of the
color images to obtain better results.
Therefore, predicting initial input information for the blank
and overlap areas, and removing the outlier points from the
overlap areas, are key for depth-only methods. To address
this issue, we propose the Coupled U-Net (CU-Net) method,
which is enhanced by an outlier removal module. First, the
first U-Net predicts an initial dense depth map and a
corresponding confidence map from the sparse LiDAR
measurements. The initial dense depth map has accurate depth
values and high confidence in the normal areas, and it also
provides the second U-Net with reliable initial depth values
in the overlap and blank areas. Second, we propose a
confidence-based outlier removal method: instead of judging
whether each measurement point meets complex conditions [10],
it uses the confidence map predicted by the first U-Net to
identify the regions containing outliers and removes them
with a simple judgment condition, as sketched below.
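A minimal sketch of this idea, assuming NumPy arrays and a
hypothetical confidence threshold `tau` (the exact judgment
condition used in the paper may differ):

```python
import numpy as np

def remove_outliers(sparse_depth, confidence, tau=0.5):
    """Zero out sparse measurements that fall in low-confidence regions.

    sparse_depth: (H, W) array, 0 where no LiDAR point exists.
    confidence:   (H, W) array in [0, 1], predicted by the first U-Net.
    tau:          hypothetical confidence threshold (illustrative value).
    """
    valid = sparse_depth > 0              # only measured points
    outlier = valid & (confidence < tau)  # measurements in low-confidence regions
    cleaned = sparse_depth.copy()
    cleaned[outlier] = 0.0                # drop suspected outliers
    return cleaned
```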
Then, the corrected sparse depth map and the initial dense
depth map are fed to the second U-Net, which outputs a dense
depth map with improved results in the overlap and blank
areas, together with its corresponding confidence map. Since
the depth map predicted by the first U-Net performs well
around the ordered depth points and their local neighborhoods,
while the depth map predicted by the second U-Net performs
well in the remaining global areas, we refer to the two
U-Nets as the local U-Net and the global U-Net, respectively.
The dense depth maps predicted by the local and global U-Nets
are fused by their confidence maps to obtain the final
completion result.
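A common form of such confidence-weighted fusion, shown here
only as an illustration (the paper's exact normalization, e.g.,
a softmax over the two confidences, may differ), is

$$\hat{D} = \frac{C_l \odot D_l + C_g \odot D_g}{C_l + C_g},$$

where $D_l$ and $D_g$ are the dense depth maps predicted by the
local and global U-Nets, $C_l$ and $C_g$ are their confidence
maps, and $\odot$ denotes element-wise multiplication.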
We conduct comprehensive experiments to verify the
effectiveness and generalization of our method on the KITTI
dataset [13], [14] and the DDAD dataset [15]. Our contributions
can be summarized as follows:
• We quantitatively analyze the cause of the performance
gap between depth-only methods and RGB-guided methods,
and show that the primary reason for the limited
performance of depth-only methods is their lack of reliable
input information in the overlap and blank areas.
• To address this issue, we propose a two-stage network
with learned intermediate confidence maps, in which the
first network provides initial depth values in the overlap
and blank areas for the second network. Furthermore, we
propose a confidence-based outlier removal method that
enhances the proposed network by employing a learned
confidence map to identify the areas containing outliers
and remove them.
• Experimental results on the popular KITTI benchmark
and the DDAD dataset show that our method achieves
state-of-the-art performance among all published methods
that employ only depth data during training and inference.
Meanwhile, it shows strong generalization under different
depth densities, changing lighting, and varying weather
conditions.
II. RELATED WORK
This section introduces representative work on RGB-guided
methods and depth-only methods.
RGB-guided methods. The input of RGB-guided methods includes
sparse depth maps and their corresponding color images. How
to fuse the information of these two modalities remains an
open problem. A straightforward approach, called "early
fusion", concatenates each depth map and color image into a
4D tensor. Ma et al. [6] propose a "late fusion" method that
extracts features from the color images and the sparse depth
maps separately and feeds the fused features into an
encoder-decoder network.
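As an illustration of the two fusion strategies, here is a
minimal PyTorch-style sketch (not the architecture of any
cited paper; the layer widths are arbitrary):

```python
import torch
import torch.nn as nn

# Early fusion: concatenate RGB (3 channels) and sparse depth (1 channel)
# into a 4-channel input before any feature extraction.
class EarlyFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(4, 32, kernel_size=3, padding=1)

    def forward(self, rgb, sparse_depth):
        x = torch.cat([rgb, sparse_depth], dim=1)  # (B, 4, H, W)
        return self.encoder(x)

# Late fusion: extract features from each modality separately,
# then fuse the resulting feature maps.
class LateFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_branch = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.depth_branch = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(32, 32, kernel_size=3, padding=1)

    def forward(self, rgb, sparse_depth):
        f = torch.cat([self.rgb_branch(rgb),
                       self.depth_branch(sparse_depth)], dim=1)
        return self.fuse(f)
```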
Gansbeke et al. [8] propose a method based on color-image guidance and
uncertainty, which employs two branches and achieves better
results. Qiu et al. [16] consider that the color images and depth
maps are not strongly correlated, and propose a method that
consists of the surface normal guided branch and the RGB
guided branch. The proposed method first predicts the surface
normal from the color images, and the results of two branches
are fused through confidence images to obtain the final dense
depth maps. Hu et al. [12] add the 3D information of the
sparse depth maps to the convolution operation. Considering
that the existing fusion methods between the color images and
depth maps are too simple, Tang et al. [17] propose a guided
convolutional network for feature fusion. To address the dense
depth maps predicted by end-to-end networks that are blurred
at the boundaries of objects, a series of affinity-based spatial
propagation methods [18], [19], [20], [21], [22] have been
proposed.
Depth-only methods. Depth-only methods use only the given
sparse depth maps to predict the corresponding dense depth
maps. Many classical techniques, such as bilateral filtering,
can efficiently fill only relatively dense depth maps. To fill
highly sparse depth maps, Ku et al. [11] propose a method
that relies on traditional image processing operations such
as morphological transformations and image smoothing; the
method is fast, and its performance exceeds that of
contemporaneous learning-based methods.
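A minimal sketch in the spirit of such classical morphological
pipelines (assuming OpenCV; the kernel sizes and steps are
illustrative and not the exact pipeline of [11]):

```python
import cv2
import numpy as np

def morphological_completion(sparse_depth, max_depth=100.0):
    """Densify a sparse depth map with simple morphological operations.

    sparse_depth: (H, W) float32 array, 0 where no measurement exists.
    """
    depth = sparse_depth.copy()
    valid = depth > 0
    # Invert valid depths so that dilation (a max filter) favors
    # closer, foreground points over farther ones.
    depth[valid] = max_depth - depth[valid]
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    # Spread measurements into nearby empty pixels.
    depth = cv2.dilate(depth, kernel)
    # Close small remaining holes and smooth the result.
    depth = cv2.morphologyEx(depth, cv2.MORPH_CLOSE, kernel)
    depth = cv2.medianBlur(depth, 5)
    # Undo the inversion for all filled pixels; untouched pixels stay 0.
    filled = depth > 0
    depth[filled] = max_depth - depth[filled]
    return depth
```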
Zhao et al. [10] also propose a non-learning method based on surface geometry, which is
enhanced by an outlier removal algorithm. In the deep learning
era, considering the sparsity of the input depth maps, Uhrig et