Self-Supervised Monocular Depth Estimation:
Solving the Edge-Fattening Problem
Xingyu Chen1, Ruonan Zhang1, Ji Jiang1, Yan Wang1, Ge Li1, Thomas H. Li1,2,3
1School of Electronic and Computer Engineering, Peking University
2Advanced Institute of Information Technology, Peking University
3Information Technology R&D Innovation Center of Peking University
{cxy,zhangrn,jiangji,wyan}@stu.pku.edu.cn geli@ece.pku.edu.cn tli@aiit.org.cn
https://github.com/xingyuuchen/tri-depth
Abstract
Self-supervised monocular depth estimation (MDE)
models universally suffer from the notorious edge-fattening
issue. Triplet loss, as a widespread metric learning strat-
egy, has largely succeeded in many computer vision appli-
cations. In this paper, we redesign the patch-based triplet
loss in MDE to alleviate the ubiquitous edge-fattening is-
sue. We show two drawbacks of the raw triplet loss in MDE
and demonstrate our problem-driven redesigns. First, we
present a min. operator based strategy applied to all nega-
tive samples, to prevent well-performing negatives shelter-
ing the error of edge-fattening negatives. Second, we split
the anchor-positive distance and anchor-negative distance
from within the original triplet, which directly optimizes
the positives without any mutual effect with the negatives.
Extensive experiments show the combination of these two
small redesigns can achieve unprecedented results: Our
powerful and versatile triplet loss not only makes our model
outperform all previous SoTA by a large margin, but also
provides substantial performance boosts to a large num-
ber of existing models, while introducing no extra inference
computation at all.
1. Introduction
Estimating how far each pixel in the image is away from
the camera is a fundamental problem in 3D computer vi-
sion. This technique is desirable in various fields, such as
autonomous driving [43,49], AR [29] and robotics [13].
The majority of these applications adopt appealing off-the-shelf hardware, e.g. LiDAR sensors or RGB-D cameras, to enable agents. In contrast, monocular videos or stereo pairs are much easier to obtain. Hence, a large body of research [51,10,45] has been conducted to estimate promising dense depth maps from only a single RGB image.
Most of them and their follow-ups formulated the problem
as image reconstruction [10,51,45,46,24]. In particular,
given a target image, the network infers its pixel-aligned
depth map. Next, with a known or estimated camera ego-
motion, every pixel in the target image can be reprojected onto the reference image(s), which are taken from different viewpoint(s) of the same scene. The reconstructed image can
be generated by sampling from the source image, and the
training loss is based on the photometric distance between
the reconstructed and target image. In this way, the network
is trained under self-supervision.
Nevertheless, these approaches suffer severely from the notorious ‘edge-fattening’ problem, where objects’ depth predictions are always ‘fatter’ than the objects themselves. We visualize the problem and analyse its cause in Fig. 1. Disappointingly, there is no one-size-fits-all solution, let alone a lightweight one.
Deep metric learning seeks to learn a feature space where semantically similar samples are mapped to close locations, while semantically dissimilar samples are mapped to distant locations. As a pioneering step, [24] first introduced the patch-based semantics-guided triplet loss into MDE. Its key idea is to encourage pixels within each object instance to have similar depths, while pixels across semantic boundaries have depth differences, as shown in Fig. 2.
However, we find that the straightforward application of
the triplet loss only produces poor results. In this paper, we
dig into the weakness of the patch-based triplet loss, and
improve it through a problem-driven manner, reaching un-
precedented performances.
First, in some boundary regions the edge-fattening areas can be thin, so their contribution is small compared to that of the non-fattening area; in such cases, the error of the thin but defective regions can be covered up. Therefore, we change the optimizing strategy to focus only on the fattening area. The problematic case illustrated in Fig. 3 motivates our strategy, where we leave the normal area alone and concentrate the optimization on the poor-performing negatives.
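As a minimal sketch of this idea (the function names are ours, and distances live in an arbitrary embedding space), replacing the mean over a patch's negatives with a min. operator changes which negative drives the loss:

```python
import numpy as np

def triplet_mean_neg(anchor, positives, negatives, margin=0.3):
    """Raw patch-based triplet: averaging over negatives lets
    well-performing (distant) ones shelter one edge-fattening
    (too-close) negative."""
    d_pos = np.linalg.norm(positives - anchor, axis=1).mean()
    d_neg = np.linalg.norm(negatives - anchor, axis=1).mean()
    return max(d_pos - d_neg + margin, 0.0)

def triplet_min_neg(anchor, positives, negatives, margin=0.3):
    """Redesign: the min. operator keeps only the hardest (closest)
    negative, so a defective one cannot hide behind the rest."""
    d_pos = np.linalg.norm(positives - anchor, axis=1).mean()
    d_neg = np.linalg.norm(negatives - anchor, axis=1).min()
    return max(d_pos - d_neg + margin, 0.0)
```

With one bad negative at distance 0.1 among negatives at distance 10, the mean version reports zero loss while the min. version still penalizes it.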
arXiv:2210.00411v3 [cs.CV] 3 Jan 2023
Second, we point out that the training objective of the original triplet loss [37] is to distinguish / discriminate: the network only has to make sure the correct answer’s score (the anchor-positive distance D+) beats the other choices’ scores (the anchor-negative distances D−) by a predefined margin, i.e., D− − D+ > m, while the absolute value of D+ is not that important; see an example in Fig. 4. However, depth estimation is a regression problem, since every pixel has its unique depth solution. Here, we have no idea of the exact depth differences between the intersecting objects; thus, it is also unknown how much D− should exceed D+. But one thing is for sure: in depth estimation, the smaller the D+, the better, since depths within the same object are generally the same. We therefore split D− and D+ from within the original triplet and optimize them in isolation, where any error of the positives or negatives is penalized individually and more directly. The problematic case illustrated in Fig. 5 motivates this strategy, where the negatives are good enough to cover up the badness of the positives. In other words, even though D+ is large and needs to be optimized, D− already exceeds D+ by more than m, which hinders the optimization of D+.
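The decoupling can be sketched as follows (a schematic, hypothetical formulation to illustrate the point; the paper's exact loss may differ in details):

```python
def coupled_triplet(d_pos, d_neg, margin=0.3):
    """Original form: a large d_neg absorbs the margin and can
    hide a poor (large) d_pos."""
    return max(d_pos - d_neg + margin, 0.0)

def split_triplet(d_pos, d_neg, margin=0.3):
    """Split form: d_pos is pushed down unconditionally (depths within
    an object should agree); d_neg is pushed only until it clears the
    margin, with no mutual effect between the two terms."""
    return d_pos + max(margin - d_neg, 0.0)
```

For the Fig. 5 scenario, say d_pos = 1.0 (poor positives) and d_neg = 2.0 (good negatives): the coupled loss is already zero and provides no gradient, while the split loss still penalizes the positives.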
To sum up, this paper’s contributions are:
• We show two weaknesses of the raw patch-based triplet optimizing strategy in MDE: it (i) can miss thin but poorly performing fattening areas; and (ii) suffers from mutual effects between positives and negatives.
• To overcome these two limitations, (i) we present a min. operator based strategy applied to all negative samples, to prevent the good negatives from sheltering the error of poor (edge-fattening) negatives; and (ii) we split the anchor-positive distance and anchor-negative distance from within the original triplet, to prevent the good negatives from sheltering the error of poor positives.
• Our redesigned triplet loss is powerful, generalizable and lightweight: experiments show that it not only makes our model outperform all previous methods by a large margin, but also provides substantial boosts to a large number of existing models, while introducing no extra inference computation at all.
2. Related Work
2.1. Self-Supervised Monocular Depth Estimation
Garg et al. [7] first introduced the novel concept of estimating depth without depth labels. Then, SfMLearner, presented by Zhou et al. [51], required only monocular videos
to predict depths, because they employed an additional pose
network to learn the camera ego-motion. Godard et al. [10]
presented Monodepth2, with surprisingly simple methods
handling occlusions and dynamic objects in a non-learning
manner, both of which add no network parameters. Multi-
ple works leveraged additional supervisions, e.g. estimat-
ing depth with traditional stereo matching method [45,38]
and semantics [22]. HR-Depth [30] proved that higher-
resolution input images can reduce photometric loss with
the same prediction. Manydepth [46] proposed to make use
of multiple frames available at test time and leverage the
geometric constraint by building a cost volume, achieving
superior performance. [35] integrated wavelet decomposi-
tion into the depth decoder, reducing its computational com-
plexity. Some other recent works estimated depths in more
severe environments, e.g. indoor scenes [21] or in night-
time [40,27]. Innovative loss functions were also devel-
oped, e.g. constraining 3D point clouds consistency [20].
To deal with the notorious edge-fattening issue illustrated in Fig. 1, most existing methods utilize an occlusion mask [11,52] to remove the incorrect supervision under the photometric loss. We argue that although this exclusion strategy works to some extent, the masking technique can prevent these occluded regions from learning, because no supervision exists for them any longer. In contrast, our triplet loss closes this gap by providing additional supervision signals directly to these occluded areas.
2.2. Deep Metric Learning
The idea of comparing training samples in the high-level
feature space [3,1] is a powerful concept, since there could
be more task-specific semantic information in the feature
space than in the low-level image space. The contrastive
loss function (a.k.a. discriminative loss) [16] is formu-
lated by whether a pair of input samples belong to the same
class. It learns an embedding space where samples within
the same class are close in distance, whereas unassociated
ones are farther away from each other. The triplet loss [37] is an extension of the contrastive loss, taking three samples as input each time, i.e. the anchor, the positive(s) and the negative(s). The triplet loss encourages the anchor-positive distance to be smaller than the anchor-negative distance by a margin m. The triplet loss has been applied to face recognition and re-identification [37,54] and image ranking [41], to name a few. Jung et al. [24] first plugged
the triplet loss into self-supervised MDE, guided by another pretrained semantic segmentation network. In our experiments, we show that without the other contributions in [24], the raw semantics-guided triplet loss yields only a very limited improvement. We tackle various difficulties when plugging the triplet loss into MDE, allowing our redesigned triplet loss to outperform existing ones by a large margin, with unparalleled accuracy.
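For reference, with anchor a, positive p, negative n, an embedding function f and margin m, the standard triplet objective [37] can be written as:

```latex
\mathcal{L}_{\mathrm{tri}}(a,p,n)
= \max\Bigl(
    \underbrace{\lVert f(a)-f(p)\rVert_2}_{D^{+}}
  - \underbrace{\lVert f(a)-f(n)\rVert_2}_{D^{-}}
  + m,\; 0 \Bigr)
```

The loss vanishes as soon as D− exceeds D+ by the margin, which is exactly the coupling our redesign in Sec. 1 removes.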
3. Analysis of the Edge-Fattening Problem
Before delving into our powerful triplet loss, it is necessary to make the problem clear. We first show the behaviour of the so-called edge-fattening issue in self-supervised MDE, then analyse its cause in Fig. 1. This motivates us to introduce our redesigned triplet loss.
[Figure 1 graphics: an example depth prediction; left and right stereo views marking pixels p, q (background, disparity 5 px) and r (foreground, disparity 10 px), with the region occluded in the right view shaded; and photometric error vs. disparity curves for q, p and r, with ground-truth disparities marked.]
Figure 1. Analysis of the edge-fattening issue. (a) Example of the edge-fattening issue. The depth predictions of foreground objects (e.g. the tree-trunk and poles) are ‘fatter’ than the objects themselves. (b) In the left view, pixels p and q are located in the background with a disparity of 5 pixels, and q would be occluded if it were any further to the right. r is on the tree with a disparity of 10 pixels. (c) p and r are fine: their gt disparity is the global optimum of the photometric error (loss). q suffers from the edge-fattening issue. Since q is occluded by the tree in the right view, the photometric error of its gt disparity, 5, is large. The photometric loss therefore settles for another location that has a small loss, i.e., shifting another 5 pixels to reach the nearest background pixel q′. However, q′ is not the true correspondence of q. As a result, the disparity of the background pixel q equals that of the foreground r, leading to the edge-fattening issue. Details in Sec. 3.
The ubiquitous edge-fattening problem, which limits the performance of the vast majority of self-supervised MDE models [10,46,45,24,51,30], manifests itself as inaccurate object depths that partially leak into the background at object boundaries, as shown in Fig. 1a.
We first lay out the final conclusion: the networks misjudge the background near the foreground as the foreground, so that the foreground looks fatter than it actually is.
The cause could be traced back to occlusions of back-
ground pixels as illustrated in Fig. 1b&c. The background
pixels visible in the target (left) image but invisible in the
source (right) image suffer from incorrect supervision un-
der the photometric loss, since no exact correspondences in
the source image exist for them at all.
The crux of the matter is that, for an occluded pixel (e.g. pixel q in Fig. 1b), the photometric loss still seeks a pixel with a similar appearance (in the right view) to be its fake correspondence. Generally, for a background pixel, only another background pixel can yield a small photometric loss, so the problem turns into finding the nearest background pixel for the occluded pixel (q) in the source (right) view. Since the foreground has a smaller depth Z than the background, it has a larger disparity d owing to the geometric projection constraint Z = f·b / d, where f is the focal length and b is the fixed camera baseline. Consequently, the photometric loss has to shift further to the left to find the nearest background pixel (the fake solution, e.g. pixel q′ in Fig. 1b&c). In this way, the occluded background pixels share the same disparities as the foreground, forming the edge-fattening issue.
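Plugging illustrative numbers into the projection relation (the focal length and baseline below are roughly KITTI-like assumptions, not values from the paper):

```python
# d = f * b / Z: nearer objects have larger disparities.
f_px, baseline_m = 720.0, 0.54           # assumed focal length (px) and baseline (m)
z_background, z_foreground = 20.0, 10.0  # assumed depths in meters

d_bg = f_px * baseline_m / z_background  # background disparity: 19.44 px
d_fg = f_px * baseline_m / z_foreground  # foreground disparity: 38.88 px

# An occluded background pixel therefore has to shift well past its
# true 19.44 px disparity to find a background match, ending up with
# a foreground-like disparity -- the fattened edge.
assert d_fg > d_bg
```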
4. Methodology
4.1. Self-supervised Depth Estimation
Following [51,10], given a monocular and/or stereo video, we first train a depth network ψdepth that takes a single target image It as input and outputs its pixel-aligned depth map Dt = ψdepth(It). Then, we train a pose network ψpose that takes temporally adjacent frames as input and outputs the relative camera pose Tt→t+n = ψpose(It, It+n).
Given access to the camera intrinsics K, along with Dt and Tt→t+n, we warp It+n into It to generate the reconstructed image Ĩt+n:

Ĩt+n = It+n ⟨ proj(Dt, Tt→t+n, K) ⟩ ,   (1)

where ⟨·⟩ is the differentiable bilinear sampling operator according to [10].
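For the rectified stereo case, this warp reduces to a horizontal shift by the predicted disparity; a minimal NumPy sketch of the bilinear sampling (a simplification of Eq. (1), which uses the full projective warp; function names are ours):

```python
import numpy as np

def warp_by_disparity(source, disparity):
    """Reconstruct the target view by bilinearly sampling `source`
    at x - d(x) along each row (rectified-stereo simplification)."""
    h, w = source.shape
    xs = np.arange(w)[None, :] - disparity        # sampling x-coordinates
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    frac = np.clip(xs, 0, w - 1) - x0             # bilinear weights
    rows = np.arange(h)[:, None]
    return (1 - frac) * source[rows, x0] + frac * source[rows, x0 + 1]
```

In the monocular setting, the sampling locations instead come from reprojecting with Dt, Tt→t+n and K rather than from a raw disparity map.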
To make use of multiple input frames, we build a cost
volume [46] using discrete depth values from a predefined
range [dmin, dmax]. Moreover, dmin and dmax are dynami-
cally adjusted during training to find the best scale [46].
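A toy sketch of the cost-volume idea (the names and the linear bin spacing are our assumptions; [46] handles the feature warping and bin placement differently):

```python
import numpy as np

def depth_hypotheses(d_min, d_max, n=96):
    """Discrete depth candidates in [d_min, d_max]."""
    return np.linspace(d_min, d_max, n)

def build_cost_volume(target_feat, warp_at_depth, depths):
    """For each candidate depth, warp source features into the target
    view (via the caller-supplied `warp_at_depth`) and record the
    per-pixel L1 matching cost; the best depth has the lowest cost."""
    return np.stack([np.abs(target_feat - warp_at_depth(d)).mean(axis=0)
                     for d in depths])
```

A per-pixel argmin over the depth axis of the resulting (D, H, W) volume then gives the geometrically best-matching hypothesis.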
In order to evaluate the reconstructed images Ĩt+n, we adopt the edge-aware smoothness loss [18] and the photometric reprojection loss measured by L1 + LSSIM [51,10]:

Lpe = (α/2) (1 − SSIM(It, Ĩt+n)) + (1 − α) ‖It − Ĩt+n‖1 ,   (2)
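A self-contained NumPy sketch of Eq. (2), using a 3×3 box-filtered SSIM (the constants C1, C2 and α = 0.85 follow common practice in [10]; the helper names are ours):

```python
import numpy as np

def box3(x):
    """3x3 local mean with edge padding."""
    p = np.pad(x, 1, mode='edge')
    h, w = x.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM over 3x3 windows."""
    mu_x, mu_y = box3(x), box3(y)
    var_x = box3(x * x) - mu_x ** 2
    var_y = box3(y * y) - mu_y ** 2
    cov = box3(x * y) - mu_x * mu_y
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def photometric_loss(target, recon, alpha=0.85):
    """Per-pixel L_pe of Eq. (2): weighted SSIM term plus L1 distance."""
    return alpha / 2 * (1 - ssim(target, recon)) + \
           (1 - alpha) * np.abs(target - recon)
```

A perfectly reconstructed image yields zero loss everywhere; in practice the per-pixel map is averaged (or, following [10], reduced with a per-pixel minimum over source frames) to form the training objective.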