
Self-Supervised Monocular Depth Underwater
Shlomi Amitai, Itzik Klein Senior Member, IEEE, and Tali Treibitz
The Hatter Department of Marine Technologies
Charney School of Marine Sciences, University of Haifa, Haifa, Israel
shlomi.amitai@gmail.com, {kitzik,ttreibitz}@univ.haifa.ac.il
Abstract—Depth estimation is critical for any robotic system.
In recent years, estimation of depth from monocular images
has shown great improvement; however, in the underwater
environment results still lag behind due to appearance
changes caused by the medium. So far, little effort has been
invested in overcoming this. Moreover, underwater there are
more limitations on using high-resolution depth sensors, which
makes generating ground truth for learning methods another
enormous obstacle. Unsupervised methods that have tried to
solve this so far achieved very limited success, as they relied
on domain transfer from datasets acquired in air. We suggest training
on subsequent frames, self-supervised by a reprojection loss, as
was demonstrated successfully above water. We suggest several
additions to the self-supervised framework to cope with the
underwater environment and achieve state-of-the-art results on
a challenging forward-looking underwater dataset.
I. INTRODUCTION
There is a wide range of target applications for depth estima-
tion, from obstacle detection to object measurement and from
3D reconstruction to image enhancement. Underwater depth
estimation (note that here depth refers to the object range, and
not to the depth under water) is important for Autonomous
Underwater Vehicles (AUVs) [15] (Fig. 1), localization and
mapping, motion planning, and image dehazing [6]. As such,
inferring depth from vision systems has been widely investigated
in the last years. There is a range of sensors and imaging setups
that can provide depth, such as stereo, multiple-view, and time-
of-flight (ToF) [11], [12], [23]. Monocular depth estimation is
different from other vision systems in the sense that it uses
a single RGB image with no special setup or hardware, and
as such has many advantages. Because of mechanical design
considerations, in many AUVs it is difficult to place a stereo
setup with a wide enough baseline, so monocular depth is
particularly attractive there and can be combined with other
sensors (e.g., Sonars) to set the scale.
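Since monocular depth is only defined up to an unknown scale, one simple way to combine it with a range sensor is to rescale the predicted map so that it agrees with a sparse metric measurement. The sketch below is illustrative only (the paper does not specify a fusion scheme); the `sonar_range` value and the pixel location are hypothetical inputs:

```python
import numpy as np

def align_scale(depth_pred, sonar_range, pixel):
    """Rescale a scale-ambiguous monocular depth map so that the
    prediction at `pixel` (row, col) matches a metric sonar range."""
    r, c = pixel
    scale = sonar_range / depth_pred[r, c]
    return scale * depth_pred

# Toy example: a relative depth map off by an unknown factor of 2.
depth_pred = np.array([[1.0, 2.0],
                       [3.0, 4.0]])
aligned = align_scale(depth_pred, sonar_range=6.0, pixel=(1, 0))
# After alignment the sonar pixel reads 6.0 m, and every other
# pixel inherits the same metric scale.
```

In practice one would aggregate several range returns (e.g., a median of ratios) for robustness, but the single-measurement version conveys the idea.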
Monocular depth methods can be trained either supervised
or self-supervised. Naturally, supervised methods achieve
higher accuracy; however, they rely on a substantial dataset
with pairs of images and their ground-truth depth. This is very
difficult to achieve underwater as traditional multiple-view
methods struggle with appearance changes and are less stable.
The research was funded by Israel Science Foundation grant #680/18,
the Israeli Ministry of Science and Technology grant #3-15621, the
Israel Data Science Initiative (IDSI) of the Council for Higher Education in
Israel, the Data Science Research Center at the University of Haifa, and the
European Union’s Horizon 2020 research and innovation programme under
grant agreement No. GA 101016958.
Fig. 1: The ALICE autonomous underwater vehicle
(AUV) [15] facing obstacles. Monocular depth maps
can aid obstacle avoidance and decision making.
Additionally, optical properties of water [2] change tempo-
rally and spatially, significantly changing scene appearance.
Thus, for training supervised methods, a ground-truth dataset
is needed for every environment, which is very laborious.
Therefore, we chose to develop a self-supervised method that
requires only a set of consecutive frames for training.
When applying state-of-the-art monocular depth estimation
methods underwater, new problems arise. Visual cues that
are beneficial above water may have exactly the
opposite effect and lead to estimation errors. Handling underwater
scenes requires adding more constraints and using priors.
Understanding the physical characteristics of underwater im-
ages can help reveal new cues for
extracting depth from the images.
We improve self-supervised underwater depth estimation
with the following contributions: 1) Examining how the re-
projection loss changes underwater, 2) Handling background
areas, 3) Adding a photometric prior, and 4) Data augmentation
tailored to the underwater environment. To that end, we employ the FLSea
dataset, published in [27].
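To illustrate the geometry behind the reprojection loss used for self-supervision, the minimal sketch below reprojects a single pixel from a target frame into a source frame using its predicted depth, the camera intrinsics K, and a relative pose (R, t). This is a simplification for exposition: real pipelines warp entire images with bilinear sampling and penalize the photometric difference between the warped source and the target.

```python
import numpy as np

def reproject_pixel(u, v, depth, K, R, t):
    """Back-project pixel (u, v) with its predicted depth, apply the
    relative camera motion (R, t), and project into the source frame."""
    # Back-project to a 3D point in the target camera frame.
    p = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Transform to the source camera frame and project with K.
    q = K @ (R @ p + t)
    return q[0] / q[2], q[1] / q[2]

# With identity motion the pixel maps back onto itself.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
u2, v2 = reproject_pixel(100.0, 80.0, 5.0, K, np.eye(3), np.zeros(3))
# → (100.0, 80.0)
```

If the predicted depth is wrong, the reprojected location lands on a different scene point and the photometric error grows, which is the training signal; underwater, medium-induced appearance changes can corrupt this signal, motivating the additions listed above.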
II. RELATED WORK
A. Supervised Monocular Depth Estimation
In the supervised monocular depth task a deep network is
trained to infer depth from an RGB image using a dataset
of paired images with their ground-truth (GT) depth [7],
[22]. Reference ground truth can be achieved from a depth
sensor or can be generated by classic computer vision methods
such as structure from motion (SFM) or stereo. Li
et al. [20] propose collecting the training data by applying
SFM to multi-view internet photo collections. Their network
architecture is based on an hourglass network structure with
loss functions suited to fine-detail reconstruction in the
arXiv:2210.03206v1 [cs.CV] 6 Oct 2022