
2. To overcome these two problems, we first propose Ambiguity-Masking to exclude the ambiguous pixels that produce irrational supervision. Second, we present Auto-Blur which, to the best of our knowledge, is the first to show that blurring input images can universally enhance depth estimators by reducing unfairness and enlarging receptive fields.
3. Our methods are highly versatile and lightweight, providing performance boosts to a large number of existing models, including those claiming SoTA, while introducing no extra inference computation at all.
Despite our superior results, the key motivation of this paper is to shed light on problems rarely noticed by previous MDE researchers, and we hope our analysis and solutions will inspire subsequent work.
2. Related Work
2.1. Supervised Depth Estimation
A large body of recent research has shown that deep neural networks bring remarkable improvements to MDE models. Many MDE (or stereo matching [26,33]) methods are fully supervised, requiring depth labels collected from RGB-D cameras or LiDAR sensors. Eigen et al. [5] introduced a multi-scale architecture that learns a coarse depth map and then refines it with another network. Fu et al. [7] recast depth regression as classification over discrete depth values, and [2] further extended this idea by adaptively adjusting the depth bins for each input image. With direct access to depth labels, the loss is formulated as the distance between the predicted and ground-truth depth (scale-invariant loss [21,2], L1 distance [19,33]), without relying on assumptions such as photometric consistency or static scenes. [1,29] also computed an L1 loss between the gradient maps of the predicted and ground-truth depth.
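For reference, a commonly used form of the scale-invariant loss (notation ours, following [5]; the exact weighting varies across works) is
$$
\mathcal{L}_{\mathrm{SI}} = \frac{1}{n}\sum_{i} d_i^{2} - \frac{\lambda}{n^{2}}\Bigl(\sum_{i} d_i\Bigr)^{2}, \qquad d_i = \log \hat{D}_i - \log D^{*}_i,
$$
where $\hat{D}_i$ and $D^{*}_i$ are the predicted and ground-truth depths at pixel $i$, and $\lambda \in [0,1]$ interpolates between a plain L2 error in log-depth ($\lambda = 0$) and a fully scale-invariant error ($\lambda = 1$).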
2.2. Self-Supervised Depth Estimation
Self-supervised MDE transforms depth regression into image reconstruction [9,38]. Monodepth [11] introduced the left-right consistency loss to alleviate depth map discontinuities. Monodepth2 [12] proposed the minimum reprojection loss to deal with occlusions, and auto-masking to alleviate moving objects and static cameras. To produce sharper depth edges, [18] leveraged off-the-shelf fine-grained semantic segmentations, and [35] designed an attention-based network to capture detailed textures. In terms of image gradients, self-supervised methods [8,12,27,32] usually adopt the disparity smoothness loss [16]. [20] trained an additional ‘local network’ to predict the depth gradients of small image patches, and then integrated them with the depths from a ‘global network’. [22] computed the photometric loss on the gradient map to deal with sudden brightness changes, but this is not robust to objects with different colors yet identical gradients. Most related to our Auto-Blur is Depth-Hints [32], which helped the network escape from local minima on thin structures by using depth proxy labels obtained from SGM stereo matching [17], whereas we use no additional supervision and are not restricted to stereo datasets.
3. The Need to Consider Spatial Frequency
This section describes our motivation; specifically, it reveals two problems that few previous works have noticed. We begin with a quick review of the universally used photometric loss in self-supervised MDE (Sec. 3.1), and then demonstrate from two aspects (Sec. 3.2 and Sec. 3.3) that the photometric loss is not a good supervisor for guiding MDE models at certain pixels or regions.
3.1. Appearance Based Reprojection Loss
In the self-supervised MDE setting, the network predicts a dense depth map $D_t$ given an input RGB image $I_t$ at test time. To evaluate $D_t$, based on the geometric projection constraint, we generate the reconstructed image $\tilde{I}_{t+n}$ by sampling from the source images $I_{t+n}$ taken from different viewpoints of the same scene. The loss is based on the pixel-level appearance distance between $I_t$ and $\tilde{I}_{t+n}$. The majority of self-supervised MDE methods [11,27,12,38,25,36,23] adopt L1 + SSIM [31] as the photometric loss:
$$
L(I_t, \tilde{I}_{t+n}) = \frac{\alpha}{2}\bigl(1 - \mathrm{SSIM}(I_t, \tilde{I}_{t+n})\bigr) + (1-\alpha)\,\bigl\lVert I_t - \tilde{I}_{t+n} \bigr\rVert_1, \tag{1}
$$
where $\alpha = 0.85$ by default and SSIM [31] computes pixel similarity over a $3\times3$ window.
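For clarity, below is a minimal PyTorch-style sketch of Eq. (1). It is our own illustration rather than the exact implementation of any cited method: the SSIM term uses a simplified 3×3 average-pooling window with zero padding, and function names such as photometric_loss are ours.

import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified SSIM over a 3x3 window using average pooling (zero padding).
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp(num / den, 0, 1)

def photometric_loss(target, recon, alpha=0.85):
    # Eq. (1): returns a per-pixel error map of shape (B, 1, H, W).
    ssim_term = (1 - ssim(target, recon)) / 2
    l1_term = (target - recon).abs()
    loss = alpha * ssim_term + (1 - alpha) * l1_term
    return loss.mean(1, keepdim=True)

In practice this per-pixel map is typically combined with the minimum-reprojection reduction and auto-masking of [12] before being averaged into a scalar loss.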
3.2. Does Loss at Object Boundary Make Sense?
As seen in Fig. 1, when training reaches its middle stage, the losses appear in two types of regions:
1. On whole objects (true positives): because the estimation of the object's depth (or the camera motion) is inaccurate, the object reprojects onto another object;
2. At object boundaries (false positives): such as the black chimney in the upper right corner.
So why does some loss appear only at the object boundaries, and is it reasonable? In fact, few works have analyzed its cause. To minimize the per-pixel reprojection error, the network adjusts every single pixel's depth so that it reprojects to its location in the source view. This process works only under the condition that each pixel belongs to one deterministic object, since we can never use one depth value to characterize a pixel that represents two different objects. However, as illustrated in Fig. 1c&d, anti-aliasing breaks this training condition by making the object boundary color a weighted sum of both sides' colors.
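A toy numerical sketch of this ambiguity (the colors and blending weight below are ours, chosen purely for illustration): whichever single depth the network assigns to an anti-aliased boundary pixel, the warped color cannot match the mixed boundary color, so the photometric loss stays non-zero.

import numpy as np

# Hypothetical RGB colors on the two sides of an object boundary (values in [0, 1]).
foreground = np.array([0.9, 0.1, 0.1])  # near object
background = np.array([0.5, 0.5, 0.5])  # far background
w = 0.5                                 # anti-aliasing blending weight

# The rendered boundary pixel mixes both sides' colors.
boundary = w * foreground + (1 - w) * background

# Warping with the foreground depth samples a pure foreground color from the
# source view; warping with the background depth samples a pure background
# color. Neither reproduces the mixed boundary color.
err_if_foreground_depth = np.abs(boundary - foreground).mean()  # > 0
err_if_background_depth = np.abs(boundary - background).mean()  # > 0
print(err_if_foreground_depth, err_if_background_depth)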
Specifically, in self-supervised MDE, pixels are first
(2D-3D) back-projected to construct the 3D scene using