
Self-Supervised Monocular Depth Underwater
Shlomi Amitai, Itzik Klein Senior Member, IEEE, and Tali Treibitz
The Hatter Department of Marine Technologies
Charney School of Marine Sciences, University of Haifa, Haifa, Israel
shlomi.amitai@gmail.com, {kitzik,ttreibitz}@univ.haifa.ac.il
Abstract—Depth estimation is critical for any robotic system.
In recent years, estimation of depth from monocular images
has shown great improvement; however, in the underwater
environment results still lag behind due to appearance
changes caused by the medium. So far, little effort has been
invested in overcoming this. Moreover, underwater there are
more limitations on using high-resolution depth sensors, which
makes generating ground truth for learning methods another
enormous obstacle. Unsupervised methods that have tried to
solve this so far achieved very limited success, as they relied
on domain transfer from datasets acquired in air. We suggest training
on subsequent frames, self-supervised by a reprojection loss, as
was demonstrated successfully above water. We suggest several
additions to the self-supervised framework to cope with the
underwater environment and achieve state-of-the-art results on
a challenging forward-looking underwater dataset.
I. INTRODUCTION
There is a wide range of target applications for depth estima-
tion, from obstacle detection to object measurement and from
3D reconstruction to image enhancement. Underwater depth
estimation (note that here depth refers to the object range, and
not to the depth under water) is important for Autonomous
Underwater Vehicles (AUVs) [15] (Fig. 1), localization and
mapping, motion planning, and image dehazing [6]. As such,
inferring depth from vision systems has been widely investigated
in the last years. There is a range of sensors and imaging setups
that can provide depth, such as stereo, multiple-view, and time-
of-flight (ToF) [11], [12], [23]. Monocular depth estimation is
different from other vision systems in the sense that it uses
a single RGB image with no special setup or hardware, and
as such has many advantages. Because of mechanical design
considerations, in many AUVs it is difficult to place a stereo
setup with a wide enough baseline, so monocular depth is
particularly attractive there and can be combined with other
sensors (e.g., Sonars) to set the scale.
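Since monocular depth is only defined up to an unknown scale, one simple way to combine it with a range sensor is to rescale the predicted map so that it agrees with a sparse metric measurement. The sketch below is illustrative only (the paper does not specify a fusion scheme); the `sonar_range` value and the pixel location are hypothetical inputs:

```python
import numpy as np

def align_scale(depth_pred, sonar_range, pixel):
    """Rescale a scale-ambiguous monocular depth map so that the
    prediction at `pixel` (row, col) matches a metric sonar range."""
    r, c = pixel
    scale = sonar_range / depth_pred[r, c]
    return scale * depth_pred

# Toy example: a relative depth map off by an unknown factor of 2.
depth_pred = np.array([[1.0, 2.0],
                       [3.0, 4.0]])
aligned = align_scale(depth_pred, sonar_range=6.0, pixel=(1, 0))
# After alignment the sonar pixel reads 6.0 m, and every other
# pixel inherits the same metric scale.
```

In practice one would aggregate several range returns (e.g., a median of ratios) for robustness, but the single-measurement version conveys the idea.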
Monocular depth methods can be trained either supervised
or self-supervised. Naturally, supervised methods achieve
higher accuracy; however, they rely on a substantial dataset
with pairs of images and their ground-truth depth. This is very
difficult to achieve underwater as traditional multiple-view
methods struggle with appearance changes and are less stable.
The research was funded by Israel Science Foundation grant #680/18,
the Israeli Ministry of Science and Technology grant #3-15621, the
Israel Data Science Initiative (IDSI) of the Council for Higher Education in
Israel, the Data Science Research Center at the University of Haifa, and the
European Union’s Horizon 2020 research and innovation programme under
grant agreement No. GA 101016958.
Fig. 1: The ALICE autonomous underwater vehicle
(AUV) [15] facing obstacles. Monocular depth maps
can aid obstacle avoidance and decision making.
Additionally, optical properties of water [2] change tempo-
rally and spatially, significantly changing scene appearance.
Thus, for training supervised methods, a ground-truth dataset
is needed for every environment, which is very laborious.
Therefore, we chose to develop a self-supervised method that
requires only a set of consecutive frames for training.
When applying state-of-the-art monocular depth estimation
methods underwater, new problems arise. Visual cues that
are beneficial above water may have exactly the
opposite effect and lead to estimation errors. Handling underwater
scenes requires adding more constraints and using priors.
Understanding the physical characteristics of underwater im-
ages can help reveal new cues for
extracting depth from the images.
We improve self-supervised underwater depth estimation
with the following contributions: 1) Examining how the re-
projection loss changes underwater, 2) Handling background
areas, 3) Adding a photometric prior, and 4) Data augmentation
tailored to the underwater environment. To that end, we employ the FLSea
dataset, published in [27].
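To illustrate the geometry behind the reprojection loss used for self-supervision, the minimal sketch below reprojects a single pixel from a target frame into a source frame using its predicted depth, the camera intrinsics K, and a relative pose (R, t). This is a simplification for exposition: real pipelines warp entire images with bilinear sampling and penalize the photometric difference between the warped source and the target.

```python
import numpy as np

def reproject_pixel(u, v, depth, K, R, t):
    """Back-project pixel (u, v) with its predicted depth, apply the
    relative camera motion (R, t), and project into the source frame."""
    # Back-project to a 3D point in the target camera frame.
    p = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Transform to the source camera frame and project with K.
    q = K @ (R @ p + t)
    return q[0] / q[2], q[1] / q[2]

# With identity motion the pixel maps back onto itself.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
u2, v2 = reproject_pixel(100.0, 80.0, 5.0, K, np.eye(3), np.zeros(3))
# → (100.0, 80.0)
```

If the predicted depth is wrong, the reprojected location lands on a different scene point and the photometric error grows, which is the training signal; underwater, medium-induced appearance changes can corrupt this signal, motivating the additions listed above.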
II. RELATED WORK
A. Supervised Monocular Depth Estimation
In the supervised monocular depth task a deep network is
trained to infer depth from an RGB image using a dataset
of paired images with their ground-truth (GT) depth [7],
[22]. Reference ground truth can be achieved from a depth
sensor or can be generated by classic computer vision methods
such as structure from motion (SFM) or stereo. Li
et al. [20] propose collecting the training data by applying
SFM to multi-view internet photo collections. Their network
architecture is based on an hourglass network structure with
loss functions suited to fine-detail reconstruction in the
arXiv:2210.03206v1 [cs.CV] 6 Oct 2022