Frequency-Aware Self-Supervised Monocular Depth Estimation
Xingyu Chen1Thomas H. Li1,2,3 Ruonan Zhang1Ge Li B1
1School of Electronic and Computer Engineering, Peking University 2Advanced Institute of Information Technology, Peking University
3Information Technology R&D Innovation Center of Peking University
cxy@stu.pku.edu.cn tli@aiit.org.cn zhangrn@stu.pku.edu.cn geli@ece.pku.edu.cn
https://github.com/xingyuuchen/freq-aware-depth
Abstract
We present two versatile methods to generally enhance self-supervised monocular depth estimation (MDE) models. The high generalizability of our methods is achieved by solving fundamental and ubiquitous problems in the photometric loss function. In particular, from the perspective of spatial frequency, we first propose Ambiguity-Masking to suppress the incorrect supervision under the photometric loss at specific object boundaries, whose cause can be traced to pixel-level ambiguity. Second, we present a novel frequency-adaptive Gaussian low-pass filter designed to robustify the photometric loss in high-frequency regions. We are the first to propose blurring images to improve depth estimators, supported by an interpretable analysis. Both modules are lightweight, adding no parameters and requiring no manual changes to network structures. Experiments show that our methods provide performance boosts to a large number of existing models, including those claiming state-of-the-art, while introducing no extra inference computation at all.
1. Introduction
Inferring the depth of each pixel in a single RGB image is a versatile tool for various fields, such as robot navigation [14], autonomous driving [30,37] and augmented reality [24]. However, it is extremely difficult to obtain a large number of depth labels from the real world, and even expensive Lidar sensors can only obtain depth information for sparse points on the image [34]. Therefore, a large body of self-supervised MDE research has been conducted, with accuracy getting closer and closer to supervised methods. By exploiting the geometric projection constraint, the self-supervision comes from image reconstruction, requiring only known (or estimated) camera poses between different viewpoints. Though significant progress has been made, some general problems remain undiscovered.
First. Many works [25,18,13,35,39] concentrated on predicting clearer (sharper) depth at object boundaries. Despite their success, they mainly relied on well-designed network architectures. In this work, we show a more fundamental cause of this limitation: the input images themselves. An interesting observation in Fig. 1b raises the question: does the photometric loss at object boundaries really indicate inaccurate depth predictions? Self-supervised training minimizes the per-pixel photometric loss based on the 2D-3D-2D reprojection [38,12]. Every single pixel is expected to attach to one deterministic object; otherwise, the depth of a mixed object has no physical meaning. Pixel-level ambiguity (Fig. 1c), as it happens, makes the object boundary the fused color of two different objects. These ambiguous pixels belong to no object in the 2D-3D back-projection (see the point cloud in Fig. 1d), and have no correspondence when evaluating the photometric loss (on the target and synthesized images) after the 3D-2D reprojection. As a result, the network always learns irrational loss from them, regardless of its predicted depths.
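The 2D-3D-2D reprojection underlying the photometric loss can be sketched in a few lines. The following minimal numpy illustration uses hypothetical pinhole intrinsics; it is a sketch of the standard warp, not the paper's implementation:

```python
import numpy as np

def reproject(u, v, depth, K, R, t):
    """Warp pixel (u, v) with known depth from the target view into a
    source view, given relative rotation R and translation t (2D-3D-2D)."""
    # 2D -> 3D: back-project the pixel into camera space
    p3d = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # Rigid transform into the source camera's frame
    p3d_src = R @ p3d + t
    # 3D -> 2D: project onto the source image plane
    p2d = K @ p3d_src
    return p2d[0] / p2d[2], p2d[1] / p2d[2]

# Hypothetical pinhole intrinsics (placeholder values)
K = np.array([[721.0, 0.0, 640.0],
              [0.0, 721.0, 192.0],
              [0.0, 0.0, 1.0]])

# With an identity relative pose, a pixel must reproject onto itself,
# no matter what depth the network predicts for it.
u2, v2 = reproject(100.0, 50.0, depth=12.5, K=K, R=np.eye(3), t=np.zeros(3))
```

Note that for an ambiguous boundary pixel no single `depth` makes the warped color match: the pixel's color belongs to neither of the two objects it straddles.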
Second. Intuitively, for a loss function, predictions close to the ground truth (gt) should have a small loss, whereas predictions with large error ought to receive harsh penalties. However, the photometric loss does not obey this rule in high-frequency regions, as shown in Fig. 2. In such regions, a tiny deviation from the gt receives a harsh penalty, while a large error may have an even smaller loss than the gt itself. This unfairness arises from high spatial frequency and from the breaking of the photometric consistency assumption, respectively. To reduce such unfairness, we present a frequency-adaptive Gaussian blur technique called Auto-Blur. It enlarges the receptive field by radiating photometric information of pixels where needed.
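A minimal sketch of such a frequency-adaptive blur is given below. The gradient-magnitude frequency proxy, the blending rule, and the threshold are illustrative assumptions for exposition, not the exact Auto-Blur design:

```python
import numpy as np

def _gauss_kernel(sigma):
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def _blur(img, sigma):
    """Separable Gaussian blur built from 1-D convolutions."""
    k = _gauss_kernel(sigma)
    out = np.apply_along_axis(lambda m: np.convolve(m, k, mode='same'), 0, img)
    return np.apply_along_axis(lambda m: np.convolve(m, k, mode='same'), 1, out)

def auto_blur(img, sigma=2.0, freq_thresh=0.1):
    """Blur only the high-frequency regions of a grayscale image.

    Smoothed gradient energy serves as a crude spatial-frequency proxy
    (an illustrative assumption, not the paper's exact measure)."""
    gy, gx = np.gradient(img)
    freq = _blur(np.sqrt(gx**2 + gy**2), sigma)   # local frequency map
    w = np.clip(freq / freq_thresh, 0.0, 1.0)     # -> 1 where high-frequency
    return (1.0 - w) * img + w * _blur(img, sigma)

# Pure noise is all high frequency, so it gets blurred and its variance
# shrinks; a flat region has zero gradient and passes through untouched.
rng = np.random.default_rng(0)
noise = rng.random((64, 64))
smoothed = auto_blur(noise)
```

The key property is selectivity: low-frequency regions, where the photometric loss is already well behaved, are left unmodified.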
To sum up, our contributions are threefold:
1. We show that the depth network suffers from irrational supervision under the photometric loss at specific boundary areas. We trace its cause to pixel-level ambiguity introduced by the anti-aliasing technique. Furthermore, we demonstrate that the photometric loss cannot fairly and accurately evaluate depth predictions in high-frequency regions.
2. To overcome these two problems, we first propose Ambiguity-Masking to exclude the ambiguous pixels that produce irrational supervision. Second, we present Auto-Blur, which pioneeringly shows that blurring images can universally enhance depth estimators by reducing unfairness and enlarging receptive fields.
3. Our methods are highly versatile and lightweight, providing performance boosts to a large number of existing models, including those claiming SoTA, while introducing no extra inference computation at all.
arXiv:2210.05479v2 [cs.CV] 14 Oct 2022
Despite our superior results, the key motivation of this paper is to shed light on problems rarely noticed by previous MDE researchers, and we hope our analysis and solutions will inspire subsequent works.
2. Related Work
2.1. Supervised Depth Estimation
Plenty of recent research has shown that deep neural networks bring remarkable improvements to MDE models. Many MDE (or stereo matching [26,33]) methods are fully supervised, requiring depth labels collected from RGB-D cameras or Lidar sensors. Eigen et al. [5] introduced a multi-scale architecture to learn coarse depth and then refine it with another network. Fu et al. [7] recast depth regression as classification over discrete depth values. [2] further extended this idea to adaptively adjust depth bins for each input image. With direct access to depth labels, the loss is formulated using the distance between predicted and ground truth depth (Scale-Invariant loss [21,2], L1 distance [19,33]), without relying on assumptions such as photometric consistency or static scenes. [1,29] also computed an L1 loss between the gradient maps of the predicted and gt depth.
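For instance, the scale-invariant loss of Eigen et al. [5] operates on log-depth differences; a minimal numpy version (with `lam` as the variance-weighting hyperparameter) might look like:

```python
import numpy as np

def scale_invariant_loss(pred, gt, lam=1.0):
    """Scale-invariant log loss in the spirit of Eigen et al. [5].

    With lam = 1 the loss ignores a global scale factor entirely:
    multiplying `pred` by any constant leaves the loss unchanged."""
    d = np.log(pred) - np.log(gt)
    n = d.size
    return (d**2).sum() / n - lam * d.sum()**2 / n**2

gt = np.array([1.0, 2.0, 4.0, 8.0])
loss_exact = scale_invariant_loss(gt, gt)         # perfect prediction
loss_scaled = scale_invariant_loss(3.0 * gt, gt)  # right shape, wrong scale
```

With `lam = 1`, `loss_scaled` equals `loss_exact` because a constant log offset cancels, which is exactly why this loss suits monocular depth, where absolute scale is unobservable.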
2.2. Self-Supervised Depth Estimation
Self-supervised MDE transforms depth regression into image reconstruction [9,38]. Monodepth [11] introduced the left-right consistency to alleviate depth map discontinuity. Monodepth2 [12] proposed the minimum reprojection loss to deal with occlusions, and auto-masking to handle moving objects and static cameras. To produce sharper depth edges, [18] leveraged off-the-shelf fine-grained semantic segmentations, while [35] designed an attention-based network to capture detailed textures. In terms of image gradients, self-supervised methods [8,12,27,32] usually adopt the disparity smoothness loss [16]. [20] trained an additional 'local network' to predict depth gradients of small image patches, and then integrated them with depths from a 'global network'. [22] computed the photometric loss on the gradient map to handle sudden brightness changes, but this is not robust to objects with different colors but the same gradients. Most related to our Auto-Blur is Depth-Hints [32], which helped the network escape from local minima on thin structures by using depth proxy labels obtained from SGM stereo matching [17]; in contrast, we use no additional supervision and are not restricted to stereo datasets.
3. The Need to Consider Spatial Frequency
This section mainly describes our motivation, specifically revealing two problems that few previous works have noticed. We begin with a quick review of the universally used photometric loss in self-supervised MDE (Sec. 3.1); then we demonstrate from two aspects (Sec. 3.2 and Sec. 3.3) that the photometric loss function is not a good supervisor for guiding MDE models at certain pixels or in certain areas.
3.1. Appearance Based Reprojection Loss
In the self-supervised MDE setting, the network predicts a dense depth image $D_t$ given an input RGB image $I_t$ at test time. To evaluate $D_t$, based on the geometry projection constraint, we generate the reconstructed image $\tilde{I}_{t+n}$ by sampling from the source images $I_{t+n}$ taken from different viewpoints of the same scene. The loss is based on the pixel-level appearance distance between $I_t$ and $\tilde{I}_{t+n}$. The majority of self-supervised MDE methods [11,27,12,38,25,36,23] adopt the combination of L1 and SSIM [31] as the photometric loss:

$$L(I_t, \tilde{I}_{t+n}) = \frac{\alpha}{2}\left(1 - \mathrm{SSIM}(I_t, \tilde{I}_{t+n})\right) + (1-\alpha)\,\lVert I_t - \tilde{I}_{t+n}\rVert_1, \tag{1}$$

where $\alpha = 0.85$ by default and SSIM [31] computes pixel similarity over a $3\times3$ window.
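As a sketch, Eq. (1) can be implemented directly; the SSIM below uses 3×3 box statistics and the standard stabilizing constants, as a simplified stand-in for a full implementation:

```python
import numpy as np

def _box3(x):
    """3x3 box-filter mean with edge padding."""
    p = np.pad(x, 1, mode='edge')
    return sum(p[i:i + x.shape[0], j:j + x.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def ssim(x, y, c1=0.01**2, c2=0.03**2):
    """Per-pixel SSIM over a 3x3 window (images scaled to [0, 1])."""
    mx, my = _box3(x), _box3(y)
    vx = _box3(x * x) - mx**2
    vy = _box3(y * y) - my**2
    cov = _box3(x * y) - mx * my
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def photometric_loss(target, recon, alpha=0.85):
    """Eq. (1): alpha/2 * (1 - SSIM) + (1 - alpha) * L1, mean over pixels."""
    ssim_term = np.clip((1.0 - ssim(target, recon)) / 2.0, 0.0, 1.0)
    l1_term = np.abs(target - recon)
    return (alpha * ssim_term + (1.0 - alpha) * l1_term).mean()

rng = np.random.default_rng(0)
img = rng.random((32, 32))
perfect = photometric_loss(img, img)                      # ~0
offset = photometric_loss(img, np.clip(img + 0.2, 0, 1))  # clearly > perfect
```

A perfect reconstruction yields (near) zero loss; any appearance mismatch raises it, which is precisely the signal the depth network trains against.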
3.2. Does Loss at Object Boundary Make Sense?
As seen in Fig. 1, midway through training, the losses appear in two types of regions:
1. On whole objects (true positives): because the estimate of the object's depth (or camera motion) is inaccurate, it reprojects onto another object;
2. At object boundaries (false positives): such as the black chimney in the upper right corner.
So why does some loss appear only at object boundaries, and is it reasonable? In fact, few works have analyzed its cause. To minimize the per-pixel reprojection error, the network adjusts every single pixel's depth to make it reproject to where it lies in the source view. This process works under the condition that each pixel belongs to one deterministic object, since we can never use one depth value to characterize a pixel that represents two different objects. However, we illustrate in Fig. 1c&d that anti-aliasing breaks this training condition by making the object boundary color a weighted sum of the colors on both sides.
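The effect is easy to reproduce: on a synthetic two-region image whose boundary column has been anti-aliased into a mixed color, a simple mixed-color test flags exactly those boundary pixels. The detection rule below is only an illustrative proxy, not the paper's Ambiguity-Masking criterion:

```python
import numpy as np

def ambiguous_pixels(img, tol=0.05):
    """Flag pixels whose intensity lies strictly between both horizontal
    neighbors, i.e. looks like a blend of two adjacent regions.

    An illustrative proxy for pixel-level ambiguity, not the paper's
    actual Ambiguity-Masking rule."""
    left, right = img[:, :-2], img[:, 2:]
    mid = img[:, 1:-1]
    lo, hi = np.minimum(left, right), np.maximum(left, right)
    mask = np.zeros_like(img, dtype=bool)
    mask[:, 1:-1] = (mid > lo + tol) & (mid < hi - tol)
    return mask

# Two flat regions (0.0 and 1.0) with one anti-aliased boundary column
img = np.zeros((8, 9))
img[:, 5:] = 1.0
img[:, 4] = 0.5        # the blended, ambiguous boundary column
mask = ambiguous_pixels(img)
```

Only the blended column is flagged: its 0.5 intensity is the anti-aliased mixture that belongs to neither object, so no depth value can make it reproject consistently.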
Specifically, in self-supervised MDE, pixels are first
(2D-3D) back-projected to construct the 3D scene using