Image Masking for Robust Self-Supervised Monocular Depth Estimation
Hemang Chawla1, Kishaan Jeeveswaran1, Elahe Arani1,2*, and Bahram Zonooz1,2*

Authors are with 1Advanced Research Lab, NavInfo Europe, The Netherlands, and 2Department of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands.
Contact: hemang.chawla@navinfo.eu
*Contributed equally.
Code: https://github.com/NeurAI-Lab/MIMDepth
Abstract— Self-supervised monocular depth estimation is a
salient task for 3D scene understanding. Several methods that learn it jointly with monocular ego-motion estimation have been proposed to predict accurate pixel-wise depth without labeled data. Nevertheless, these methods focus on improving performance under ideal conditions, free of natural or digital corruptions. Even for object-specific depth estimation, objects are generally assumed to be unoccluded. These methods are
also vulnerable to adversarial attacks, which is a pertinent
concern for their reliable deployment in robots and autonomous
driving systems. We propose MIMDepth, a method that adapts
masked image modeling (MIM) for self-supervised monocular
depth estimation. While MIM has been used to learn generalizable features during pre-training, we show how it can be adapted for the direct training of monocular depth estimation.
Our experiments show that MIMDepth is more robust to noise, blur, weather conditions, digital artifacts, and occlusions, as well as to untargeted and targeted adversarial attacks.
I. INTRODUCTION
Depth estimation is an essential component of vision
systems that capture 3D scene structures for applications
in mobile robots, self-driving cars, and augmented reality.
Although expensive and power-hungry LiDARs offer high-
accuracy depth measurements, the ubiquity of low-cost,
energy-efficient cameras makes monocular depth estimation
techniques a popular alternative. Traditional methods often estimate depth from multiple views of the scene [1]. In contrast, deep learning methods have demonstrated depth estimation from a single image. Nevertheless, supervised depth estimation approaches [2], [3] require ground-truth labels, making them difficult to scale. Self-supervised depth estimation approaches, on the other hand, are trained without ground-truth labels
by using concepts from traditional structure-from-motion
and offer the possibility of training on a wide variety of
data [4], [5]. However, deployment requires a focus on the generalizability and robustness of models beyond their performance under ideal conditions [6].
Recently, MT-SfMLearner [7] showed that the transformer
architecture for self-supervised depth estimation results in
higher robustness to image corruptions as well as to adversarial attacks. This is attributed to transformers utilizing the global context of the scene for predictions, unlike convolutional neural networks, which have a limited receptive
field. However, most research in self-supervised monocular
depth estimation focuses primarily on achieving excellent
performance on the independent and identically distributed
(i.i.d.) test set. It is assumed that the images are free from noise (e.g., Gaussian) and blur (e.g., due to ego-motion or moving objects in the scene), are captured in clear daylight weather, and contain no digital artifacts (e.g., pixelation). Even for the
task of object-specific depth estimation [8], it is assumed that
the objects are free of occlusions. Finally, the robustness of these methods against adversarial attacks is not considered, a pertinent safety concern when deploying deep learning models.
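For illustration, robustness to such corruptions can be probed by perturbing the i.i.d. test images before evaluation; the following is a minimal PyTorch sketch of two such perturbations (our own simplified stand-ins, not the exact corruption functions of any particular benchmark):

```python
import torch
import torch.nn.functional as F

def gaussian_noise(img, sigma=0.08):
    # img: float tensor in [0, 1]; additive Gaussian noise corruption
    return (img + sigma * torch.randn_like(img)).clamp(0.0, 1.0)

def box_blur(img, k=5):
    # crude stand-in for motion/defocus blur; img: (B, C, H, W)
    c = img.shape[1]
    weight = torch.ones(c, 1, k, k, device=img.device) / (k * k)
    return F.conv2d(img, weight, padding=k // 2, groups=c)
```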
Since the performance and robustness of the models are
determined by the learned representations, influencing the
encoding of features could lead to more robust estimations.
We hypothesize that integrating Masked Image Modeling
(MIM) into the training of depth estimation would result in
learning features that make the model more robust to natural and digital corruptions, as well as to adversarial attacks, by better modeling the global context. MIM is a technique that has thus far been used for self-supervised representation learning in the pre-training of image transformers [9]–[12]. MIM pre-training involves masking a portion of
image patches and then using the unmasked portion to predict
the masked input [10], [11] or its features [9], [12]. It models
long-range dependencies and focuses on the low-frequency
details present in the images. However, when pre-trained models are fine-tuned on downstream tasks, the general features learned during pre-training may be overwritten.
Instead, adapting MIM for direct training of a task, such as
depth estimation, could lead to richer learned representations
that make the model more robust and generalizable.
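As a concrete illustration of the masking step, below is a minimal PyTorch sketch in the style of SimMIM-like mask-token substitution; the class name, hyperparameter values, and interface are our illustrative assumptions rather than the exact scheme used in MIMDepth:

```python
import torch
import torch.nn as nn

class RandomPatchMasking(nn.Module):
    """Replace a random subset of patch embeddings with a learnable
    mask token, so the network must predict the missing content."""

    def __init__(self, embed_dim=768, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio  # placeholder value, for illustration
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, tokens):
        # tokens: (B, N, D) patch embeddings of an image
        B, N, D = tokens.shape
        num_masked = int(self.mask_ratio * N)
        # random masking: choose patches uniformly at random;
        # block-wise masking would instead mask contiguous groups
        ids = torch.rand(B, N, device=tokens.device).argsort(dim=1)
        mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
        mask.scatter_(1, ids[:, :num_masked], True)
        # substitute the learnable mask token at masked positions
        masked = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, D),
                             tokens)
        return masked, mask
```

In MIM pre-training, the reconstruction loss is then computed only at the masked positions, e.g., an L1 loss between the predicted and true pixels of the masked patches.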
While both MIM and depth estimation are self-supervised
methods, they differ in how they are trained. MIM, used to pre-train a network, learns by reconstructing the masked input image, generally with an autoencoder. In contrast, the self-supervised depth estimation network is trained jointly with a self-supervised ego-motion estimation network, whose output is used to synthesize adjacent images in the training set via the perspective projection transform [13].
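For concreteness, a minimal sketch of this view-synthesis step follows (function and variable names are ours, for illustration): predicted depth in the target frame and the predicted target-to-source pose warp the source view into the target view, and the photometric error between the synthesized and observed target images supervises both networks.

```python
import torch
import torch.nn.functional as F

def synthesize_view(src_img, depth, T, K, K_inv):
    # src_img: (B, 3, H, W) source view; depth: (B, 1, H, W) target depth
    # T: (B, 4, 4) target-to-source pose; K, K_inv: (B, 3, 3) intrinsics
    B, _, H, W = src_img.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=src_img.device),
                            torch.arange(W, device=src_img.device),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(B, -1, -1)          # (B, 3, H*W)
    # back-project pixels to 3D using the predicted depth
    cam = (K_inv @ pix) * depth.view(B, 1, -1)          # (B, 3, H*W)
    cam = torch.cat([cam, torch.ones(B, 1, H * W, device=cam.device)], 1)
    # transform into the source frame and project back to pixels
    proj = K @ (T @ cam)[:, :3]                         # (B, 3, H*W)
    uv = proj[:, :2] / (proj[:, 2:3] + 1e-7)
    # normalize coordinates to [-1, 1] and sample the source image
    u = 2 * uv[:, 0] / (W - 1) - 1
    v = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="border",
                         align_corners=True)
```

A Monodepth-style photometric loss would then compare synthesize_view(I_s, D_t, T, K, K_inv) against the observed target image I_t, typically combining L1 and SSIM terms.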
Thus, applying MIM to depth estimation requires different
considerations than its use for pre-training. Here, we examine
the following questions:
• Would integrating MIM into the depth and/or ego-motion estimation networks result in improved robustness of depth estimation?
• MIM has been shown to work well with either block-wise masking [9] or random masking with a high mask size [10]. Which masking strategy would work better for depth estimation?
• MIM pre-training uses a relatively high mask ratio and