
2. To overcome these two problems, we first propose Ambiguity-Masking to exclude the ambiguous pixels that produce irrational supervision. Second, we present Auto-Blur which, to the best of our knowledge, is the first to show that blurring input images can universally enhance depth estimators by reducing unfairness and enlarging receptive fields.
3. Our methods are highly versatile and lightweight, providing performance boosts to a large number of existing models, including those claiming SoTA, while introducing no extra inference computation at all.
Despite our superior results, the key motivation of this paper is to shed light on problems rarely noticed by previous MDE researchers, and we hope our analysis and solutions will inspire subsequent work.
2. Related Work
2.1. Supervised Depth Estimation
A large body of recent research has shown that deep neural networks bring remarkable improvements to MDE models. Many MDE (or stereo matching [26,33]) methods are fully supervised, requiring depth labels collected from RGB-D cameras or LiDAR sensors. Eigen et al. [5] introduced a multi-scale architecture that learns a coarse depth map and then refines it with another network. Fu et al. [7] recast depth regression as classification over discrete depth values, and [2] further extended this idea by adaptively adjusting the depth bins for each input image. With direct access to depth labels, the loss is formulated as the distance between the predicted and ground-truth depth (scale-invariant loss [21,2], L1 distance [19,33]), without relying on assumptions such as photometric consistency or static scenes. [1,29] also computed an L1 loss between the gradient maps of the predicted and ground-truth depth.
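For reference, a commonly used form of the scale-invariant loss (notation ours, following [5]; the exact weighting varies across works) is
$$
\mathcal{L}_{\mathrm{SI}} = \frac{1}{n}\sum_{i} d_i^{2} - \frac{\lambda}{n^{2}}\Bigl(\sum_{i} d_i\Bigr)^{2}, \qquad d_i = \log \hat{D}_i - \log D^{*}_i,
$$
where $\hat{D}_i$ and $D^{*}_i$ are the predicted and ground-truth depths at pixel $i$, and $\lambda \in [0,1]$ interpolates between a plain L2 error in log-depth ($\lambda = 0$) and a fully scale-invariant error ($\lambda = 1$).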
2.2. Self-Supervised Depth Estimation
Self-supervised MDE transforms depth regression into image reconstruction [9,38]. Monodepth [11] introduced the left-right consistency loss to alleviate depth map discontinuities. Monodepth2 [12] proposed the minimum reprojection loss to deal with occlusions, and auto-masking to alleviate moving objects and static cameras. To produce sharper depth edges, [18] leveraged off-the-shelf fine-grained semantic segmentations, and [35] designed an attention-based network to capture detailed textures. In terms of image gradients, self-supervised methods [8,12,27,32] usually adopt the disparity smoothness loss [16]. [20] trained an additional ‘local network’ to predict the depth gradients of small image patches, and then integrated them with the depths from a ‘global network’. [22] computed the photometric loss on the gradient map to deal with sudden brightness changes, but this is not robust to objects with different colors yet identical gradients. Most related to our Auto-Blur is Depth-Hints [32], which helped the network escape from local minima on thin structures by using depth proxy labels obtained from SGM stereo matching [17], whereas we use no additional supervision and are not restricted to stereo datasets.
3. The Need to Consider Spatial Frequency
This section describes our motivation; specifically, it reveals two problems that few previous works have noticed. We begin with a quick review of the universally used photometric loss in self-supervised MDE (Sec. 3.1), and then demonstrate from two aspects (Sec. 3.2 and Sec. 3.3) that the photometric loss is not a good supervisor for guiding MDE models at certain pixels or regions.
3.1. Appearance Based Reprojection Loss
In the self-supervised MDE setting, the network predicts a dense depth map $D_t$ given an input RGB image $I_t$ at test time. To evaluate $D_t$, based on the geometric projection constraint, we generate the reconstructed image $\tilde{I}_{t+n}$ by sampling from the source images $I_{t+n}$ taken from different viewpoints of the same scene. The loss is based on the pixel-level appearance distance between $I_t$ and $\tilde{I}_{t+n}$. The majority of self-supervised MDE methods [11,27,12,38,25,36,23] adopt L1 + SSIM [31] as the photometric loss:
$$
L(I_t, \tilde{I}_{t+n}) = \frac{\alpha}{2}\bigl(1 - \mathrm{SSIM}(I_t, \tilde{I}_{t+n})\bigr) + (1-\alpha)\,\bigl\lVert I_t - \tilde{I}_{t+n} \bigr\rVert_1, \tag{1}
$$
where $\alpha = 0.85$ by default and SSIM [31] computes pixel similarity over a $3\times3$ window.
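For clarity, below is a minimal PyTorch-style sketch of Eq. (1). It is our own illustration rather than the exact implementation of any cited method: the SSIM term uses a simplified 3×3 average-pooling window with zero padding, and function names such as photometric_loss are ours.

import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified SSIM over a 3x3 window using average pooling (zero padding).
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp(num / den, 0, 1)

def photometric_loss(target, recon, alpha=0.85):
    # Eq. (1): returns a per-pixel error map of shape (B, 1, H, W).
    ssim_term = (1 - ssim(target, recon)) / 2
    l1_term = (target - recon).abs()
    loss = alpha * ssim_term + (1 - alpha) * l1_term
    return loss.mean(1, keepdim=True)

In practice this per-pixel map is typically combined with the minimum-reprojection reduction and auto-masking of [12] before being averaged into a scalar loss.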
3.2. Does Loss at Object Boundary Make Sense?
As seen in Fig. 1, when training reaches its middle stage, the losses appear in two types of regions:
1. On whole objects (true positives): because the estimation of the object's depth (or the camera motion) is inaccurate, the object reprojects onto another object;
2. At object boundaries (false positives): such as the black chimney in the upper right corner.
So why does some loss appear only at the object boundaries, and is it reasonable? In fact, few works have analyzed its cause. To minimize the per-pixel reprojection error, the network adjusts every single pixel's depth so that it reprojects to its location in the source view. This process works only under the condition that each pixel belongs to one deterministic object, since we can never use one depth value to characterize a pixel that represents two different objects. However, as illustrated in Fig. 1c&d, anti-aliasing breaks this training condition by making the object boundary color a weighted sum of both sides' colors.
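A toy numerical sketch of this ambiguity (the colors and blending weight below are ours, chosen purely for illustration): whichever single depth the network assigns to an anti-aliased boundary pixel, the warped color cannot match the mixed boundary color, so the photometric loss stays non-zero.

import numpy as np

# Hypothetical RGB colors on the two sides of an object boundary (values in [0, 1]).
foreground = np.array([0.9, 0.1, 0.1])  # near object
background = np.array([0.5, 0.5, 0.5])  # far background
w = 0.5                                 # anti-aliasing blending weight

# The rendered boundary pixel mixes both sides' colors.
boundary = w * foreground + (1 - w) * background

# Warping with the foreground depth samples a pure foreground color from the
# source view; warping with the background depth samples a pure background
# color. Neither reproduces the mixed boundary color.
err_if_foreground_depth = np.abs(boundary - foreground).mean()  # > 0
err_if_background_depth = np.abs(boundary - background).mean()  # > 0
print(err_if_foreground_depth, err_if_background_depth)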
Specifically, in self-supervised MDE, pixels are first
(2D-3D) back-projected to construct the 3D scene using