Second, we point out that the training objective of the original triplet loss [37] is to discriminate: the network only has to ensure that the correct answer's score (the anchor-positive distance $D^+$) beats the other choices' scores (the anchor-negative distances $D^-$) by a predefined margin, i.e., $D^- - D^+ > m$, while the absolute value of $D^+$ is not that important; see an example in Fig. 4. However, depth estimation is a regression problem, since every pixel has its own unique depth solution. Here, we do not know the exact depth differences between the intersecting objects, so it is also unknown by how much $D^-$ should exceed $D^+$. One thing is certain, though: in depth estimation, the smaller $D^+$ is, the better, since depths within the same object are generally the same. We therefore split $D^-$ and $D^+$ out of the original triplet and optimize them in isolation, so that errors of the positives and of the negatives are each penalized individually and more directly. This strategy is motivated by the problematic case illustrated in Fig. 5, where the negatives are good enough to cover up the badness of the positives. In other words, even though $D^+$ is large and needs to be optimized, $D^-$ already exceeds $D^+$ by more than $m$, which hinders the optimization of $D^+$.
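To make the difference concrete, the following PyTorch-style sketch contrasts the standard hinge-form triplet loss with a decoupled variant in the spirit of the split described above. The function names, the margin value, and the use of per-sample Euclidean distances are illustrative assumptions; the exact per-pixel, patch-based formulation used in this paper is given later.

```python
import torch
import torch.nn.functional as F

def standard_triplet_loss(anchor, positive, negative, margin=0.3):
    # Original hinge formulation [37]: only the gap D- - D+ matters.
    # Once D- exceeds D+ by the margin, the loss is zero even if D+ is large.
    d_pos = F.pairwise_distance(anchor, positive)  # D+: anchor-positive distance
    d_neg = F.pairwise_distance(anchor, negative)  # D-: anchor-negative distance
    return F.relu(d_pos - d_neg + margin).mean()

def decoupled_triplet_loss(anchor, positive, negative, margin=0.3):
    # Decoupled variant sketched here (an assumption, not the paper's exact loss):
    # D+ and D- are penalized in isolation, so a very good negative can no
    # longer shelter a poor positive.
    d_pos = F.pairwise_distance(anchor, positive)  # always pull positives closer
    d_neg = F.pairwise_distance(anchor, negative)  # push negatives beyond the margin
    return d_pos.mean() + F.relu(margin - d_neg).mean()

# Toy usage with random 16-dim features for a batch of 8 anchors.
a, p, n = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)
print(standard_triplet_loss(a, p, n).item(), decoupled_triplet_loss(a, p, n).item())
```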
To sum up, this paper’s contributions are:
• We show two weaknesses of the raw patch-based triplet optimization strategy in MDE: it (i) can miss fattening areas that are thin yet still erroneous, and (ii) suffers from mutual effects between positives and negatives.
• To overcome these two limitations, (i) we present a min-operator-based strategy applied over all negative samples, which prevents the good negatives from sheltering the error of poor (edge-fattening) negatives (see the sketch after this list); and (ii) we split the anchor-positive distance and the anchor-negative distance out of the original triplet, which prevents the good negatives from sheltering the error of poor positives.
• Our redesigned triplet loss is powerful, generalizable
and lightweight: Experiments show that it not only
makes our model outperform all previous methods by
a large margin, but also provides substantial boosts to
a large number of existing models, while introducing
no extra inference computation at all.
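As a concrete illustration of contribution (i), the sketch below (a hypothetical PyTorch-style function; the actual per-pixel, patch-based formulation in this paper may differ) applies a min operator over all anchor-negative distances so that the closest, i.e. hardest, negative dominates the penalty:

```python
import torch
import torch.nn.functional as F

def min_based_negative_loss(anchor, negatives, margin=0.3):
    # anchor:    (C,)   feature of a single anchor pixel/patch
    # negatives: (N, C) features of all candidate negative samples
    # Using the minimum anchor-negative distance lets the closest (hardest,
    # typically edge-fattening) negative drive the loss, so it cannot be
    # averaged away by the many easy negatives.
    d_neg = F.pairwise_distance(anchor.unsqueeze(0).expand_as(negatives), negatives)
    return F.relu(margin - d_neg.min())

# Toy usage: one 16-dim anchor against 32 candidate negatives.
print(min_based_negative_loss(torch.randn(16), torch.randn(32, 16)).item())
```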
2. Related Work
2.1. Self-Supervised Monocular Depth Estimation
Garg et al. [7] first introduced the novel concept of estimating depth without depth labels. Then, SfMLearner, presented by Zhou et al. [51], required only monocular videos to predict depth, because it employed an additional pose
network to learn the camera ego-motion. Godard et al. [10]
presented Monodepth2, with surprisingly simple methods
handling occlusions and dynamic objects in a non-learning
manner, both of which add no network parameters. Multi-
ple works leveraged additional supervision, e.g. estimating
depth with traditional stereo matching methods [45, 38]
and semantics [22]. HR-Depth [30] proved that higher-
resolution input images can reduce photometric loss with
the same prediction. Manydepth [46] proposed to make use
of multiple frames available at test time and leverage the
geometric constraint by building a cost volume, achieving
superior performance. [35] integrated wavelet decomposi-
tion into the depth decoder, reducing its computational com-
plexity. Some other recent works estimated depth in more challenging environments, e.g. indoor scenes [21] or at nighttime [40, 27]. Innovative loss functions were also developed, e.g. constraining 3D point cloud consistency [20].
To deal with the notorious edge-fattening issue illustrated in Fig. 1, most existing methods utilize an occlusion mask [11, 52] to remove the incorrect supervision from the photometric loss. We argue that although this exclusion strategy works to some extent, the masking technique can prevent the network from learning in these occluded regions, because no supervision exists for them any longer. In contrast, our triplet loss closes this gap by providing additional supervision signals directly to these occluded areas.
2.2. Deep Metric Learning
The idea of comparing training samples in a high-level feature space [3, 1] is powerful, since there can be more task-specific semantic information in the feature space than in the low-level image space. The contrastive loss function (a.k.a. discriminative loss) [16] is formulated based on whether a pair of input samples belongs to the same
class. It learns an embedding space where samples within
the same class are close in distance, whereas unassociated
ones are farther away from each other. The triplet loss [37]
is an extension of contrastive loss, with three samples as
input each time, i.e. the anchor, the positive(s), and the negative(s). The triplet loss encourages the anchor-positive distance to be smaller than the anchor-negative distance by a margin $m$. The triplet loss function has been applied to face recognition and re-identification [37, 54] and image ranking [41], to name a few. Jung et al. [24] first plugged the
triplet loss into self-supervised MDE, guided by another
pretrained semantic segmentation network. In our experi-
ments, we show that without the other contributions in [24], the raw semantic-guided triplet loss yields only a very limited improvement. We tackle various difficulties in plugging a triplet loss into MDE, allowing our redesigned triplet loss to outperform existing ones by a large margin, with unparalleled accuracy.
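For reference, the commonly used hinge form of the triplet loss of [37], written with the anchor-positive distance $D^+$, the anchor-negative distance $D^-$, and the margin $m$ (the per-pixel, patch-based variant used later in this paper may differ in detail), is

$$\mathcal{L}_{\mathrm{tri}} = \max\big(D^{+} - D^{-} + m,\ 0\big),$$

which becomes zero as soon as $D^{-} - D^{+} > m$, regardless of the absolute value of $D^{+}$.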
3. Analysis of the Edge-fattening Problem
Before delving into our powerful triplet loss, it is nec-
essary to make the problem clear. We first show the