Self-Supervised Monocular Depth Underwater

Shlomi Amitai, Itzik Klein, Senior Member, IEEE, and Tali Treibitz
The Hatter Department of Marine Technologies
Charney School of Marine Sciences, University of Haifa, Haifa, Israel
shlomi.amitai@gmail.com, {kitzik,ttreibitz}@univ.haifa.ac.il
Abstract—Depth estimation is critical for any robotic system.
In recent years, depth estimation from monocular images has improved
considerably; however, results in the underwater environment still lag
behind due to appearance changes caused by the medium. So far, little
effort has been invested in overcoming this. Moreover, underwater there
are stronger limitations on using high-resolution depth sensors, which
makes generating ground truth for learning methods another enormous
obstacle. Unsupervised methods that have tried to solve this achieved
only limited success, as they relied on domain transfer from datasets
captured in air. We suggest training on subsequent frames, self-supervised
by a reprojection loss, as was demonstrated successfully above water.
We propose several additions to the self-supervised framework to cope
with the underwater environment, and achieve state-of-the-art results on
a challenging forward-looking underwater dataset.
I. INTRODUCTION
There is a wide range of target applications for depth estima-
tion, from obstacle detection to object measurement and from
3D reconstruction to image enhancement. Underwater depth
estimation (note that here depth refers to the object range, and
not to the depth under water) is important for Autonomous
Underwater Vehicles (AUVs) [15] (Fig. 1), localization and
mapping, motion planning, and image dehazing [6]. As such, inferring
depth from vision systems has been widely investigated in recent years.
There is a range of sensors and imaging setups
that can provide depth, such as stereo, multiple-view, and time-
of-flight (ToF) [11], [12], [23]. Monocular depth estimation is
different from other vision systems in the sense that it uses
a single RGB image with no special setup or hardware, and
as such has many advantages. Because of mechanical design
considerations, in many AUVs it is difficult to place a stereo
setup with a sufficiently wide baseline, so monocular depth is
particularly attractive there and can be combined with other
sensors (e.g., Sonars) to set the scale.
Monocular depth methods can be trained either supervised
or self-supervised. Naturally, supervised methods achieve higher
accuracy; however, they rely on having a substantial dataset
with pairs of images and their ground-truth depth. This is very
difficult to achieve underwater as traditional multiple-view
methods struggle with appearance changes and are less stable.
The research was funded by Israel Science Foundation grant #680/18,
the Israeli Ministry of Science and Technology grant #3-15621, the
Israel Data Science Initiative (IDSI) of the Council for Higher Education in
Israel, the Data Science Research Center at the University of Haifa, and the
European Union’s Horizon 2020 research and innovation programme under
grant agreement No. GA 101016958.
Fig. 1: The ALICE autonomous underwater vehicle
(AUV) [15] facing obstacles. Monocular depth maps
can aid obstacle avoidance and decision making.
Additionally, optical properties of water [2] change tempo-
rally and spatially, significantly changing scene appearance.
Thus, for training supervised methods, a ground-truth dataset
is needed for every environment, which is very laborious.
Therefore, we chose to develop a self-supervised method that requires
only a set of consecutive frames for training.
When applying state-of-the-art monocular depth estimation methods
underwater, new problems arise. Visual cues that are beneficial above
water might have the opposite effect and lead to estimation errors.
Handling underwater scenes requires adding more constraints and using
priors. Understanding the physical characteristics of underwater images
can reveal new cues for extracting depth from the images.
We improve self-supervised underwater depth estimation
with the following contributions: 1) examining how the reprojection
loss changes underwater, 2) handling background areas, 3) adding a
photometric prior, and 4) data augmentation specific to underwater.
To that end, we employ the FLSea dataset, published in [27].
II. RELATED WORK
A. Supervised Monocular Depth Estimation
In the supervised monocular depth task a deep network is
trained to infer depth from an RGB image using a dataset
of paired images with their ground-truth (GT) depth [7],
[22]. Reference ground truth can be obtained from a depth sensor, or
generated by classic computer vision methods such as structure from
motion (SFM) or stereo. Li et al. [20] suggest collecting the training
data by applying SFM on multi-view internet photo collections. Their network
architecture is based on an hourglass network structure with
suitable loss functions for reconstructing fine details in the depth
map. Newer methods [3], [28] use transformers to improve performance.

arXiv:2210.03206v1 [cs.CV] 6 Oct 2022

Fig. 2: Example results on two underwater scenes from the FLSea dataset [27]. a) Input scene, b) ground truth, c) result
of DiffNet [33], and d) our estimated depth map. The magenta rectangle marks a background area where our method
significantly improves the results, and black rectangles mark foreground objects that our method estimates better.
B. Self-Supervised Monocular Depth Estimation
To overcome the hurdle of ground-truth data collection, it
was suggested [12], [34] to use sequential frames for self-
supervised training leveraging the fact that they image the
same scene from different poses. The network estimates both
the depth and the motion between frames. The estimated
camera motion between sequential frames constrains the depth
network to predict depth up to scale, and the estimated depth
constrains the odometry network to predict relative camera
pose. The loss is the photometric reprojection error between
two subsequent frames using the estimated depth and motion.
Monodepth2 [12] proposed to overcome occlusion artifacts
by taking the minimum error between preceding and following
frames. DiffNet [33] builds on Monodepth2 [12] with two major
differences: it replaces the ResNet [18] encoder with the
high-resolution representations of HRNet [31], which was argued to
perform better, and it adds attention modules to the decoder.
DiffNet [33] is the current state-of-the-art (SOTA) method on the
KITTI 2015 stereo dataset [10], the top benchmark for self-supervised
monocular depth, and it also performed best on our underwater images.
Therefore, we base our work on it.
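As a minimal illustration (our own simplified sketch, not the released Monodepth2 code), the per-pixel minimum over reprojection errors can be written as follows, assuming the photometric error maps against the preceding and following frames have already been computed:

```python
import numpy as np

def min_reprojection_loss(err_prev, err_next):
    """Per-pixel minimum of the photometric reprojection errors to the
    preceding and following frames, as proposed in Monodepth2. A pixel
    occluded in one neighboring frame is usually visible in the other,
    so taking the minimum suppresses occlusion artifacts."""
    return np.minimum(err_prev, err_next).mean()
```

In Monodepth2 the error maps themselves combine SSIM and L1 terms; that detail is omitted here for brevity.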
C. Underwater Depth Estimation
Underwater, photometric cues have been used for inferring
depth from single images, as in scattering media the appear-
ance of objects depends on their distance from the camera.
Based on that, several priors have been suggested for simultaneously
estimating depth and restoring scene appearance.
One line of work is based on the dark channel prior
(DCP) [17] and several underwater variants UDCP [5], [8],
and the red channel prior [9]. Some methods use the per-
patch difference between the red channel and the maximum
between the blue and the green as a proxy for distance,
termed the maximum intensity prior (MIP) by Carlevaris-
Bianco et al. [4]. Song et al. [29] suggested the underwater
light attenuation prior (ULAP) that assumes the object distance
is linearly related to the difference between the red channel and
the maximum blue-green. The blurriness prior [25] leverages
the fact that images become blurrier with distance. Peng and
Cosman [24] combined this prior with MIP and suggested the
image blurring and light absorption (IBLA) prior. Bekerman et
al. [2] showed that improving estimation of the scene’s optical
properties improves depth estimation.
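For illustration only, the per-patch red versus blue/green cue behind MIP could be sketched as below. This is our own simplified rendering, not code from the cited papers, and the sign convention of the cue differs between formulations:

```python
import numpy as np

def mip_cue(img, patch=8):
    """MIP-style distance cue: per patch, the difference between the
    maximum red intensity and the maximum blue/green intensity. Red
    light attenuates fastest underwater, so this difference varies
    with the object's distance from the camera.
    img: HxWx3 float array in [0, 1], channels ordered R, G, B."""
    h, w, _ = img.shape
    h, w = h - h % patch, w - w % patch   # crop to whole patches
    r = img[:h, :w, 0].reshape(h // patch, patch, w // patch, patch)
    gb = img[:h, :w, 1:].max(axis=2).reshape(h // patch, patch,
                                             w // patch, patch)
    return r.max(axis=(1, 3)) - gb.max(axis=(1, 3))  # one value per patch
```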
There have also been attempts at unsupervised learning-based
underwater depth estimation. UW-Net [14] uses generative
adversarial training by learning the mapping functions
between unpaired RGB-D terrestrial images and arbitrary un-
derwater images. UW-GAN [16] also used a GAN to generate
depth, using supervision from a synthetic underwater dataset.
These and other methods train on single images; none uses geometric
cues between subsequent frames for self-supervision as we do. As we
show in the results, self-supervision significantly improves performance.
III. SCIENTIFIC BACKGROUND
A. Reprojection Loss
The reprojection loss is the key self-supervision loss. It uses
two sequential frames $[I_{t-1}, I_t]$, where $t$ is the time index,
together with the estimated extrinsic rotation and translation
$\hat{T}_{t \to t-1}$, and $\hat{D}_t$, the estimated depth of frame $I_t$.
These are used to compute the coordinates $\hat{p}_{t-1}$ in $I_{t-1}$
that are the projection of the coordinates $p_t$ in $I_t$ [34]:
$$\hat{p}_{t-1} \sim K \, \hat{T}_{t \to t-1} \, \hat{D}_t(p_t) \, K^{-1} p_t. \tag{1}$$