Second, we point out that the training objective of the original triplet loss [37] is to discriminate: the network only has to ensure that the correct answer's score (the anchor-positive distance $D^+$) beats the other choices' scores (the anchor-negative distances $D^-$) by a predefined margin, i.e., $D^- - D^+ > m$, while the absolute value of $D^+$ is not that important; see an example in Fig. 4. However, depth estimation is a regression problem, since every pixel has its own unique depth solution. Here, we do not know the exact depth differences between the intersecting objects, so it is also unknown by how much $D^-$ should exceed $D^+$. One thing is certain, though: in depth estimation, the smaller $D^+$ is, the better, since depths within the same object are generally the same. We therefore split $D^-$ and $D^+$ out of the original triplet and optimize them in isolation, so that errors of the positives and of the negatives are each penalized individually and more directly. This strategy is motivated by the problematic case illustrated in Fig. 5, where the negatives are good enough to cover up the badness of the positives. In other words, even though $D^+$ is large and needs to be optimized, $D^-$ already exceeds $D^+$ by more than $m$, which hinders the optimization of $D^+$.
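To make the difference concrete, the following PyTorch-style sketch contrasts the standard hinge-form triplet loss with a decoupled variant in the spirit of the split described above. The function names, the margin value, and the use of per-sample Euclidean distances are illustrative assumptions; the exact per-pixel, patch-based formulation used in this paper is given later.

```python
import torch
import torch.nn.functional as F

def standard_triplet_loss(anchor, positive, negative, margin=0.3):
    # Original hinge formulation [37]: only the gap D- - D+ matters.
    # Once D- exceeds D+ by the margin, the loss is zero even if D+ is large.
    d_pos = F.pairwise_distance(anchor, positive)  # D+: anchor-positive distance
    d_neg = F.pairwise_distance(anchor, negative)  # D-: anchor-negative distance
    return F.relu(d_pos - d_neg + margin).mean()

def decoupled_triplet_loss(anchor, positive, negative, margin=0.3):
    # Decoupled variant sketched here (an assumption, not the paper's exact loss):
    # D+ and D- are penalized in isolation, so a very good negative can no
    # longer shelter a poor positive.
    d_pos = F.pairwise_distance(anchor, positive)  # always pull positives closer
    d_neg = F.pairwise_distance(anchor, negative)  # push negatives beyond the margin
    return d_pos.mean() + F.relu(margin - d_neg).mean()

# Toy usage with random 16-dim features for a batch of 8 anchors.
a, p, n = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)
print(standard_triplet_loss(a, p, n).item(), decoupled_triplet_loss(a, p, n).item())
```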
To sum up, this paper’s contributions are:
• We show two weaknesses of the raw patch-based triplet optimization strategy in MDE: it (i) can miss fattening areas that are thin yet still erroneous, and (ii) suffers from mutual effects between positives and negatives.
• To overcome these two limitations, (i) we present a min-operator-based strategy applied over all negative samples, which prevents the good negatives from sheltering the error of poor (edge-fattening) negatives (see the sketch after this list); and (ii) we split the anchor-positive distance and the anchor-negative distance out of the original triplet, which prevents the good negatives from sheltering the error of poor positives.
• Our redesigned triplet loss is powerful, generalizable
and lightweight: Experiments show that it not only
makes our model outperform all previous methods by
a large margin, but also provides substantial boosts to
a large number of existing models, while introducing
no extra inference computation at all.
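As a concrete illustration of contribution (i), the sketch below (a hypothetical PyTorch-style function; the actual per-pixel, patch-based formulation in this paper may differ) applies a min operator over all anchor-negative distances so that the closest, i.e. hardest, negative dominates the penalty:

```python
import torch
import torch.nn.functional as F

def min_based_negative_loss(anchor, negatives, margin=0.3):
    # anchor:    (C,)   feature of a single anchor pixel/patch
    # negatives: (N, C) features of all candidate negative samples
    # Using the minimum anchor-negative distance lets the closest (hardest,
    # typically edge-fattening) negative drive the loss, so it cannot be
    # averaged away by the many easy negatives.
    d_neg = F.pairwise_distance(anchor.unsqueeze(0).expand_as(negatives), negatives)
    return F.relu(margin - d_neg.min())

# Toy usage: one 16-dim anchor against 32 candidate negatives.
print(min_based_negative_loss(torch.randn(16), torch.randn(32, 16)).item())
```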
2. Related Work
2.1. Self-Supervised Monocular Depth Estimation
Garg et al. [7] first introduced the novel concept of estimating depth without depth labels. Then, SfMLearner, presented by Zhou et al. [51], required only monocular videos to predict depth, because it employed an additional pose
network to learn the camera ego-motion. Godard et al. [10]
presented Monodepth2, with surprisingly simple methods
handling occlusions and dynamic objects in a non-learning
manner, both of which add no network parameters. Multi-
ple works leveraged additional supervision, e.g. estimating
depth with traditional stereo matching methods [45, 38]
and semantics [22]. HR-Depth [30] proved that higher-
resolution input images can reduce photometric loss with
the same prediction. Manydepth [46] proposed to make use
of multiple frames available at test time and leverage the
geometric constraint by building a cost volume, achieving
superior performance. [35] integrated wavelet decomposi-
tion into the depth decoder, reducing its computational com-
plexity. Some other recent works estimated depth in more challenging environments, e.g. indoor scenes [21] or at nighttime [40, 27]. Innovative loss functions were also developed, e.g. constraining 3D point cloud consistency [20].
To deal with the notorious edge-fattening issue illustrated in Fig. 1, most existing methods utilize an occlusion mask [11, 52] to remove the incorrect supervision from the photometric loss. We argue that although this exclusion strategy works to some extent, the masking technique can prevent the network from learning in these occluded regions, because no supervision exists for them any longer. In contrast, our triplet loss closes this gap by providing additional supervision signals directly to these occluded areas.
2.2. Deep Metric Learning
The idea of comparing training samples in a high-level feature space [3, 1] is powerful, since there can be more task-specific semantic information in the feature space than in the low-level image space. The contrastive loss function (a.k.a. discriminative loss) [16] is formulated based on whether a pair of input samples belongs to the same
class. It learns an embedding space where samples within
the same class are close in distance, whereas unassociated
ones are farther away from each other. The triplet loss [37]
is an extension of contrastive loss, with three samples as
input each time, i.e. the anchor, the positive(s), and the negative(s). The triplet loss encourages the anchor-positive distance to be smaller than the anchor-negative distance by a margin $m$. The triplet loss function has been applied to face recognition and re-identification [37, 54] and image ranking [41], to name a few. Jung et al. [24] first plugged the
triplet loss into self-supervised MDE, guided by another
pretrained semantic segmentation network. In our experi-
ments, we show that without the other contributions in [24], the raw semantic-guided triplet loss yields only a very limited improvement. We tackle various difficulties in plugging a triplet loss into MDE, allowing our redesigned triplet loss to outperform existing ones by a large margin, with unparalleled accuracy.
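For reference, the commonly used hinge form of the triplet loss of [37], written with the anchor-positive distance $D^+$, the anchor-negative distance $D^-$, and the margin $m$ (the per-pixel, patch-based variant used later in this paper may differ in detail), is

$$\mathcal{L}_{\mathrm{tri}} = \max\big(D^{+} - D^{-} + m,\ 0\big),$$

which becomes zero as soon as $D^{-} - D^{+} > m$, regardless of the absolute value of $D^{+}$.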
3. Analysis of the Edge-fattening Problem
Before delving into our powerful triplet loss, it is nec-
essary to make the problem clear. We first show the