Image Masking for Robust Self-Supervised Monocular Depth Estimation
Hemang Chawla1, Kishaan Jeeveswaran1, Elahe Arani1,2*, and Bahram Zonooz1,2*
Abstract: Self-supervised monocular depth estimation is a
salient task for 3D scene understanding. Learned jointly with
monocular ego-motion estimation, several methods have been
proposed to predict accurate pixel-wise depth without using
labeled data. Nevertheless, these methods focus on improving
performance under ideal conditions without natural or digital
corruptions. The general absence of occlusions is assumed
even for object-specific depth estimation. These methods are
also vulnerable to adversarial attacks, which is a pertinent
concern for their reliable deployment in robots and autonomous
driving systems. We propose MIMDepth, a method that adapts
masked image modeling (MIM) for self-supervised monocular
depth estimation. While MIM has been used to learn gener-
alizable features during pre-training, we show how it could
be adapted for direct training of monocular depth estimation.
Our experiments show that MIMDepth is more robust to noise,
blur, weather conditions, digital artifacts, occlusions, as well as
untargeted and targeted adversarial attacks.
I. INTRODUCTION
Depth estimation is an essential component of vision
systems that capture 3D scene structures for applications
in mobile robots, self-driving cars, and augmented reality.
Although expensive and power-hungry LiDARs offer high-
accuracy depth measurements, the ubiquity of low-cost,
energy-efficient cameras makes monocular depth estimation
techniques a popular alternative. Traditional methods often
estimate depth from multiple views of the scene [1]. Instead,
deep learning methods have demonstrated depth estimation
from a single image. Nevertheless, supervised depth estimation
approaches [2], [3] require ground truth labels, making them
difficult to scale. In contrast, self-supervised depth
estimation approaches are trained without ground truth labels
by using concepts from traditional structure-from-motion
and offer the possibility of training on a wide variety of
data [4], [5]. However, the deployment requires a focus
on the generalizability and robustness of models beyond
performance under ideal conditions [6].
Recently, MT-SfMLearner [7] showed that the transformer
architecture for self-supervised depth estimation results in
higher robustness to image corruptions as well as against
adversarial attacks. This is attributed to transformers utiliz-
ing the global context of the scene for predictions, unlike
convolutional neural networks that have a limited receptive
field. However, most research in self-supervised monocular
depth estimation focuses primarily on achieving excellent
Authors are with 1Advanced Research Lab, NavInfo Europe, The Nether-
lands, and 2Department of Mathematics and Computer Science, Eindhoven
University of Technology, The Netherlands.
Contact: hemang.chawla@navinfo.eu
*Contributed equally.
Code: https://github.com/NeurAI-Lab/MIMDepth
performance on the independent and identically distributed
(i.i.d.) test set. It is assumed that the images are free from
noise (e.g. Gaussian) and blur (e.g. due to ego-motion or
moving objects in the scene), have clear daylight weather,
and are without digital artifacts (e.g. pixelation). Even for the
task of object-specific depth estimation [8], it is assumed that
the objects are without occlusions. Finally, the robustness
of methods against adversarial attacks is not considered,
which is a pertinent concern for safety while deploying deep
learning models.
Since the performance and robustness of the models are
determined by the learned representations, influencing the
encoding of features could lead to more robust estimations.
We hypothesize that integrating Masked Image Modeling
(MIM) into the training of depth estimation would result in
learning features that make the model more robust to natural
and digital corruptions as well as adversarial attacks,
by better modeling the global context. MIM is a
technique that has been used until now for self-supervised
representation learning in pre-training of image transform-
ers [9]–[12]. MIM pre-training involves masking a portion of
image patches and then using the unmasked portion to predict
the masked input [10], [11] or its features [9], [12]. It models
long-range dependencies and focuses on the low-frequency
details present in the images. However, when pre-trained
models are fine-tuned for downstream training, the general
features that were learned could possibly be overwritten.
Instead, adapting MIM for direct training of a task, such as
depth estimation, could lead to richer learned representations
that make the model more robust and generalizable.
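The MIM objective described above can be illustrated with a minimal numpy sketch; the function name, array shapes, and mask ratio below are our own illustrative assumptions, not the implementation of [9]–[12]. A random subset of patches is hidden, and the reconstruction error is scored only on those patches.

```python
import numpy as np

def mim_loss(patches, predicted, mask_ratio=0.6, rng=None):
    """Toy MIM objective: reconstruction loss on masked patches only.

    patches:   (N, D) ground-truth flattened image patches.
    predicted: (N, D) network reconstruction of every patch.
    """
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    n_masked = int(round(mask_ratio * n))
    # Choose which patches the encoder does not see.
    masked_idx = rng.choice(n, size=n_masked, replace=False)
    # Average the L1 error over the masked patches only,
    # as in SimMIM/MAE-style pre-training.
    return np.abs(predicted[masked_idx] - patches[masked_idx]).mean()
```

In actual pre-training the prediction comes from a transformer that only observes the unmasked patches; here plain arrays stand in for both sides of the objective.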
While both MIM and depth estimation are self-supervised
methods, they differ in how they are trained. MIM, used to
pre-train a network, learns by reconstructing the input image,
which is generally passed through an autoencoder. Instead,
the self-supervised depth estimation network is trained
along with a self-supervised ego-motion estimation network,
whose output is used to synthesize adjacent images in the
training set via the perspective projection transform [13].
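The view-synthesis step can be sketched as follows; the function name and single-pixel formulation are our own illustrative assumptions, since real implementations warp entire images differentiably. A target pixel is back-projected with its predicted depth, moved by the estimated ego-motion, and re-projected into the source view.

```python
import numpy as np

def project_to_source(u, v, depth, K, T):
    """Map a target pixel (u, v) with predicted depth into the source view.

    K: (3, 3) camera intrinsics; T: (4, 4) target-to-source pose.
    Returns the continuous (u', v') source coordinates used to sample
    the source image when synthesizing the target view.
    """
    # Back-project the pixel to a 3D point in the target camera frame.
    p_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # Move the point into the source camera frame with the estimated ego-motion.
    p_src = (T @ np.append(p_cam, 1.0))[:3]
    # Perspective projection back onto the source image plane.
    uvw = K @ p_src
    return uvw[0] / uvw[2], uvw[1] / uvw[2]
```

With an identity pose each pixel maps back to itself, while a translation shifts the sampling location; minimizing the photometric error between the synthesized and real target image supervises both networks.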
Thus, applying MIM to depth estimation requires different
considerations than its use for pre-training. Here, we examine
the following questions:
Would integrating MIM into the depth and/or ego-motion
estimation networks result in improved robustness of
depth estimation?
MIM has been shown to work well with either blockwise
masking [9] or random masking with a high mask
size [10]. Which masking strategy would work better
for depth estimation?
MIM pre-training uses a relatively high mask ratio and
mask size [9]–[11] due to more information redundancy
in images than in sentences. Would the high mask ratio
or high mask size used for MIM pre-training be suitable
for integrating it into depth estimation?
MIM has been shown to result in better features for
downstream tasks when its loss is applied only to
masked regions [10]. Would similarly applying the loss
on only the masked regions result in more robust depth
estimation?
Fig. 1: An overview of Masked Image Modeling for Depth Estimation. The method learns to predict depth from the masked
as well as unmasked patches with a better understanding of the global context.
arXiv:2210.02357v2 [cs.CV] 1 Feb 2023
With our proposed method MIMDepth, we demonstrate
that applying blockwise masking with a relatively lower
mask ratio (than MIM pre-training) only to the depth es-
timation network, with a loss on the complete image, results
in improved robustness to natural and digital corruptions,
occlusions, as well as untargeted and targeted adversarial
attacks. It is additionally found to improve the performance
of the ego-motion estimation network, while maintaining
competitive performance of the depth estimation network on
the i.i.d. test sets.
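The blockwise mask adopted by MIMDepth can be sketched as follows; the block-size limit and mask ratio here are illustrative stand-ins, not the paper's actual hyperparameters. Rectangular blocks of patches are masked repeatedly until the target ratio is reached.

```python
import numpy as np

def blockwise_mask(grid_h, grid_w, mask_ratio=0.3, max_block=4, rng=None):
    """Generate a boolean patch mask by repeatedly masking rectangular blocks.

    Returns a (grid_h, grid_w) array where True marks a masked patch.
    """
    rng = rng or np.random.default_rng(0)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(mask_ratio * grid_h * grid_w)
    while mask.sum() < target:
        # Sample a random block size and position, then mask that block.
        bh = rng.integers(1, max_block + 1)
        bw = rng.integers(1, max_block + 1)
        top = rng.integers(0, grid_h - bh + 1)
        left = rng.integers(0, grid_w - bw + 1)
        mask[top:top + bh, left:left + bw] = True
    return mask
```

MIMDepth applies such a mask to the depth network's input patches but, unlike MIM pre-training, computes its loss over the complete image.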
II. RELATED WORK
Self-supervised Monocular Depth Estimation. One of
the challenging tasks of interest in 3D scene understanding
is monocular depth estimation. Although self-supervised
depth estimation was introduced for stereo pairs [14], it
was soon extended to a monocular setup [13]. Monoc-
ular self-supervised approaches to depth estimation have
the advantage of not requiring any labels and can learn
from a wide variety of data from multiple sources. Over
the years, improvements have been made to deal with
challenges due to occlusions [15], dynamic objects [16]–
[18], and scale-consistency issues [4], [19] and more. While
most methods generally used 2D convolutional architectures,
a 3D convolutional architecture was proposed to estimate
depth from symmetrical packing and unpacking blocks that
preserve depth details [5]. Recently, MT-SfMLearner [7] has
shown that transformers can also be used for depth and
pose estimation, resulting in comparable performance but
improved robustness to natural corruptions and adversarial
attacks due to their global receptive fields. Although other
methods that use transformers have also been proposed [20],
[21], they do not consider the robustness of their proposed
approaches. We show that integrating masked image modeling
trains networks to capture long-range dependencies and
can further improve the robustness of depth estimation.
Masked Image Modeling. Masked image modeling is a
method for self-supervised representation learning through
images corrupted by masking. This was developed following
masked language modeling (MLM) [22]. These methods
are based on replacing a portion of the tokenized input
sequence with learnable mask tokens and learning to predict
the missing content using only the visible context. iGPT [23]
operates on clustered pixel tokens and predicts unknown
pixels directly. ViT [24] explores masked patch prediction by
predicting the mean color. In contrast, BEIT [9] operates
on image patches but uses an additional discrete Variational
AutoEncoder (dVAE) tokenizer [25] to learn to predict
discrete tokens corresponding to masked portions. BEIT uses
special blockwise masking that mitigates the wastage of
modeling capabilities on short-range dependencies and high-
frequency details similar to BERT. Instead, SiMMIM [10]
and MAE [11] show that even random masking with a higher
mask ratio or mask size can similarly perform well for self-
supervised pre-training from image data. However, MIM has
not been explored for directly training a task of interest,
rather than only as a pre-training method. We demonstrate
how it can be adapted to improve the robustness of self-
supervised monocular depth estimation.
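The token preparation step shared by these BERT-style methods can be sketched as follows (names and shapes are illustrative assumptions): each masked patch embedding is replaced by a single shared, learnable mask token before position embeddings are added.

```python
import numpy as np

def apply_mask_tokens(tokens, mask, mask_token, pos_embed):
    """Replace masked patch embeddings with a shared mask token,
    then add position embeddings, as in BERT-style masked modeling.

    tokens:     (N, D) patch embeddings.
    mask:       (N,) boolean, True where the patch is masked.
    mask_token: (D,) learnable embedding shared by all masked positions.
    pos_embed:  (N, D) position embeddings.
    """
    # Broadcast the shared mask token over all masked positions.
    out = np.where(mask[:, None], mask_token[None, :], tokens)
    return out + pos_embed
```

Because every masked position receives the same token, the model must rely on position embeddings and the visible context to infer what is missing.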
III. METHOD
We propose MIMDepth, a method for masked self-
supervised monocular depth estimation (see Figure 1) that in-