Image Masking for Robust Self-Supervised Monocular Depth Estimation
Hemang Chawla1, Kishaan Jeeveswaran1, Elahe Arani1,2*, and Bahram Zonooz1,2*

Authors are with 1Advanced Research Lab, NavInfo Europe, The Netherlands, and 2Department of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands.
Contact: hemang.chawla@navinfo.eu
*Contributed equally.
Code: https://github.com/NeurAI-Lab/MIMDepth
Abstract— Self-supervised monocular depth estimation is a
salient task for 3D scene understanding. Several methods that learn it jointly with monocular ego-motion estimation have been proposed to predict accurate pixel-wise depth without labeled data. Nevertheless, these methods focus on improving performance under ideal conditions, free of natural or digital corruptions. Even for object-specific depth estimation, objects are generally assumed to be unoccluded. These methods are
also vulnerable to adversarial attacks, which is a pertinent
concern for their reliable deployment in robots and autonomous
driving systems. We propose MIMDepth, a method that adapts
masked image modeling (MIM) for self-supervised monocular
depth estimation. While MIM has been used to learn generalizable features during pre-training, we show how it can be adapted for the direct training of monocular depth estimation.
Our experiments show that MIMDepth is more robust to noise, blur, weather conditions, digital artifacts, and occlusions, as well as to untargeted and targeted adversarial attacks.
I. INTRODUCTION
Depth estimation is an essential component of vision
systems that capture 3D scene structures for applications
in mobile robots, self-driving cars, and augmented reality.
Although expensive and power-hungry LiDARs offer high-
accuracy depth measurements, the ubiquity of low-cost,
energy-efficient cameras makes monocular depth estimation
techniques a popular alternative. Traditional methods often estimate depth from multiple views of the scene [1]. In contrast, deep learning methods have demonstrated depth estimation from a single image. Nevertheless, supervised depth estimation approaches [2], [3] require ground-truth labels, making them difficult to scale. Self-supervised depth estimation approaches, on the other hand, are trained without ground-truth labels
by using concepts from traditional structure-from-motion
and offer the possibility of training on a wide variety of
data [4], [5]. However, deployment requires a focus on the generalizability and robustness of models beyond their performance under ideal conditions [6].
Recently, MT-SfMLearner [7] showed that the transformer
architecture for self-supervised depth estimation results in
higher robustness to image corruptions as well as to adversarial attacks. This is attributed to transformers utilizing the global context of the scene for predictions, unlike convolutional neural networks, which have a limited receptive
field. However, most research in self-supervised monocular
depth estimation focuses primarily on achieving excellent
performance on the independent and identically distributed
(i.i.d.) test set. It is assumed that the images are free from noise (e.g., Gaussian) and blur (e.g., due to ego-motion or moving objects in the scene), are captured in clear daylight weather, and contain no digital artifacts (e.g., pixelation). Even for the
task of object-specific depth estimation [8], it is assumed that
the objects are free of occlusions. Finally, the robustness of these methods against adversarial attacks is not considered, a pertinent safety concern when deploying deep learning models.
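For illustration, robustness to such corruptions can be probed by perturbing the i.i.d. test images before evaluation; the following is a minimal PyTorch sketch of two such perturbations (our own simplified stand-ins, not the exact corruption functions of any particular benchmark):

```python
import torch
import torch.nn.functional as F

def gaussian_noise(img, sigma=0.08):
    # img: float tensor in [0, 1]; additive Gaussian noise corruption
    return (img + sigma * torch.randn_like(img)).clamp(0.0, 1.0)

def box_blur(img, k=5):
    # crude stand-in for motion/defocus blur; img: (B, C, H, W)
    c = img.shape[1]
    weight = torch.ones(c, 1, k, k, device=img.device) / (k * k)
    return F.conv2d(img, weight, padding=k // 2, groups=c)
```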
Since the performance and robustness of the models are
determined by the learned representations, influencing the
encoding of features could lead to more robust estimations.
We hypothesize that integrating Masked Image Modeling
(MIM) into the training of depth estimation would result in
learning features that make the model more robust to natural and digital corruptions, as well as to adversarial attacks, by better modeling the global context. MIM is a technique that has thus far been used for self-supervised representation learning in the pre-training of image transformers [9]–[12]. MIM pre-training involves masking a portion of
image patches and then using the unmasked portion to predict
the masked input [10], [11] or its features [9], [12]. It models
long-range dependencies and focuses on the low-frequency
details present in the images. However, when pre-trained models are fine-tuned on downstream tasks, the general features learned during pre-training may be overwritten.
Instead, adapting MIM for direct training of a task, such as
depth estimation, could lead to richer learned representations
that make the model more robust and generalizable.
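As a concrete illustration of the masking step, below is a minimal PyTorch sketch in the style of SimMIM-like mask-token substitution; the class name, hyperparameter values, and interface are our illustrative assumptions rather than the exact scheme used in MIMDepth:

```python
import torch
import torch.nn as nn

class RandomPatchMasking(nn.Module):
    """Replace a random subset of patch embeddings with a learnable
    mask token, so the network must predict the missing content."""

    def __init__(self, embed_dim=768, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio  # placeholder value, for illustration
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, tokens):
        # tokens: (B, N, D) patch embeddings of an image
        B, N, D = tokens.shape
        num_masked = int(self.mask_ratio * N)
        # random masking: choose patches uniformly at random;
        # block-wise masking would instead mask contiguous groups
        ids = torch.rand(B, N, device=tokens.device).argsort(dim=1)
        mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
        mask.scatter_(1, ids[:, :num_masked], True)
        # substitute the learnable mask token at masked positions
        masked = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, D),
                             tokens)
        return masked, mask
```

In MIM pre-training, the reconstruction loss is then computed only at the masked positions, e.g., an L1 loss between the predicted and true pixels of the masked patches.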
While both MIM and depth estimation are self-supervised
methods, they differ in how they are trained. MIM, used to pre-train a network, learns by reconstructing the masked input image, generally with an autoencoder. In contrast, the self-supervised depth estimation network is trained jointly with a self-supervised ego-motion estimation network, whose output is used to synthesize adjacent images in the training set via the perspective projection transform [13].
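For concreteness, a minimal sketch of this view-synthesis step follows (function and variable names are ours, for illustration): predicted depth in the target frame and the predicted target-to-source pose warp the source view into the target view, and the photometric error between the synthesized and observed target images supervises both networks.

```python
import torch
import torch.nn.functional as F

def synthesize_view(src_img, depth, T, K, K_inv):
    # src_img: (B, 3, H, W) source view; depth: (B, 1, H, W) target depth
    # T: (B, 4, 4) target-to-source pose; K, K_inv: (B, 3, 3) intrinsics
    B, _, H, W = src_img.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=src_img.device),
                            torch.arange(W, device=src_img.device),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(B, -1, -1)          # (B, 3, H*W)
    # back-project pixels to 3D using the predicted depth
    cam = (K_inv @ pix) * depth.view(B, 1, -1)          # (B, 3, H*W)
    cam = torch.cat([cam, torch.ones(B, 1, H * W, device=cam.device)], 1)
    # transform into the source frame and project back to pixels
    proj = K @ (T @ cam)[:, :3]                         # (B, 3, H*W)
    uv = proj[:, :2] / (proj[:, 2:3] + 1e-7)
    # normalize coordinates to [-1, 1] and sample the source image
    u = 2 * uv[:, 0] / (W - 1) - 1
    v = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="border",
                         align_corners=True)
```

A Monodepth-style photometric loss would then compare synthesize_view(I_s, D_t, T, K, K_inv) against the observed target image I_t, typically combining L1 and SSIM terms.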
Thus, applying MIM to depth estimation requires different
considerations than its use for pre-training. Here, we examine
the following questions:
• Would integrating MIM into the depth and/or ego-motion estimation networks result in improved robustness of depth estimation?
• MIM has been shown to work well with either block-wise masking [9] or random masking with a high mask size [10]. Which masking strategy would work better for depth estimation?
• MIM pre-training uses a relatively high mask ratio and