
FreDSNet: Joint Monocular Depth and Semantic
Segmentation with Fast Fourier Convolutions
Bruno Berenguel-Baeta, Jesús Bermúdez-Cameo and Jose J. Guerrero
Abstract— In this work we present FreDSNet, a deep learning solution that obtains a semantic 3D understanding of indoor environments from single panoramas. Omnidirectional images offer task-specific advantages for scene understanding problems due to the 360-degree contextual information they provide about the entire environment. However, the inherent characteristics of omnidirectional images make it harder to obtain an accurate detection and segmentation of objects or a good depth estimation. To overcome these problems, we exploit convolutions in the frequency domain, which obtain a wider receptive field in each convolutional layer. These convolutions allow us to leverage the whole contextual information from omnidirectional images. FreDSNet is the first network that jointly provides monocular depth estimation and semantic segmentation from a single panoramic image exploiting fast Fourier convolutions. Our experiments show that FreDSNet performs comparably to specific state-of-the-art methods for semantic segmentation and depth estimation. The FreDSNet code is publicly available at https://github.com/Sbrunoberenguel/FreDSNet
I. INTRODUCTION
Understanding 3D indoor environments is a hot topic in computer vision and robotics research [15][31]. The scene understanding field has different branches that focus on different key aspects of the environment. The layout recovery problem has been in the spotlight for many years, with great results obtained using standard and omnidirectional cameras [2][7][17][22]. This layout information is useful for constraining the movement of autonomous robots [19][21] or for virtual and augmented reality systems. Another line of research focuses on detecting and identifying objects and their classes in the scene. There are many methods for conventional cameras [4][10][20] that provide great results; however, conventional cameras are limited by their narrow field of view. In recent years, works that use panoramas, usually in the equirectangular projection, have been increasing [5][9], providing a better understanding of the whole environment. Moreover, the combination of semantic and depth information helps to generate richer representations of indoor environments [13][27].
In this work, we focus on obtaining, from equirectangular
panoramas, two of the main pillars of scene understanding:
semantic segmentation and monocular depth estimation.
All authors are with the Instituto de Investigacion en Ingenieria de Aragon, University of Zaragoza, Spain. Corresponding author: berenguel@unizar.es. The final version of this article can be found at https://doi.org/10.1109/ICRA48891.2023.10161142
Fig. 1: Overview of our proposal. From a single RGB panorama (top left), we compute a semantic segmentation (top right) and estimate a depth map (bottom left) of an indoor environment. With this information we are able to reconstruct the whole environment in 3D (bottom right).
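The 3D reconstruction in Fig. 1 (bottom right) follows directly from the two network outputs: each pixel of the equirectangular panorama defines a ray on the unit sphere, the estimated depth scales that ray, and the RGB values or semantic labels color the resulting points. Below is a minimal sketch of this back-projection, assuming the depth map stores the Euclidean distance along each ray; the helper name is our own and this is illustrative, not the authors' released code.

```python
# Hedged sketch: back-project an equirectangular depth map to a
# colored 3D point cloud (assumes depth = distance along each ray).
import numpy as np

def equirect_to_pointcloud(depth, rgb):
    """depth: (H, W) metric depth; rgb: (H, W, 3) color image."""
    H, W = depth.shape
    # Longitude in [-pi, pi) and latitude in (pi/2, -pi/2) per pixel.
    lon = (np.arange(W) + 0.5) / W * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(H) + 0.5) / H * np.pi
    lon, lat = np.meshgrid(lon, lat)                 # both (H, W)
    # Unit ray direction for each pixel on the sphere.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    rays = np.stack([x, y, z], axis=-1)              # (H, W, 3)
    # Scale each unit ray by the estimated depth.
    points = rays * depth[..., None]                 # (H, W, 3)
    return points.reshape(-1, 3), rgb.reshape(-1, 3)
```

Replacing rgb with the per-pixel semantic labels yields the semantically colored reconstruction shown in the figure.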
Without adequate sensors, navigating autonomous vehicles in unknown environments is an extremely challenging task. Nowadays there is a great variety of sensors that provide accurate and diverse information about the environment (LIDARs, cameras, microphones, etc.). Among these possibilities, we choose to explore omnidirectional cameras, which have become increasingly popular as the main sensor for navigation and interactive applications. These cameras provide RGB information of all their surroundings and, combined with computer vision or deep learning algorithms, provide rich and useful information about an environment.
In this paper, we introduce FreDSNet, a new deep neural network that jointly provides semantic segmentation and depth estimation from a single equirectangular panorama (see Fig. 1). We propose the use of the fast Fourier convolution (FFC) [3] to leverage the wider receptive field of these convolutions and take advantage of the wide field of view of 360-degree panoramas. In addition, we train semantic segmentation and depth estimation jointly, so that each task can benefit from the other. Semantic segmentation provides information about the distribution of the objects as well as their boundaries, where hard depth discontinuities usually occur. In turn, the depth estimation provides the scene's scale and the location of the objects inside the environment. Together, these outputs are accurate enough for applications such as autonomous vehicle navigation, virtual and augmented reality, and scene reconstruction.
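To illustrate why a frequency-domain convolution yields a wide receptive field: a pointwise convolution applied to the 2D Fourier transform of a feature map mixes information from every spatial location at once, since each frequency bin depends on the whole image. The following is a minimal PyTorch sketch of such a spectral transform, loosely following the FFC of [3]; the full FFC additionally keeps a local convolutional branch and fuses both branches, so this is an illustration of the principle rather than the exact FreDSNet implementation.

```python
# Hedged sketch of the spectral transform at the core of the fast
# Fourier convolution (FFC) [3]; simplified, not the exact network.
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution applied in the frequency domain; real and
        # imaginary parts are stacked along the channel dimension.
        self.conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # Real 2D FFT: every frequency bin aggregates information from
        # the whole image, giving a global receptive field in one layer.
        f = torch.fft.rfft2(x, norm="ortho")       # (b, c, h, w//2+1), complex
        f = torch.cat([f.real, f.imag], dim=1)     # (b, 2c, h, w//2+1)
        f = self.conv(f)
        real, imag = f.chunk(2, dim=1)
        f = torch.complex(real, imag)
        # Back to the spatial domain at the original resolution.
        return torch.fft.irfft2(f, s=(h, w), norm="ortho")
```

This global mixing is particularly attractive for equirectangular panoramas, where the image wraps around horizontally and context from the full 360-degree view is relevant at every pixel.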
The main contribution of this paper is that FreDSNet is the first network to jointly obtain semantic segmentation and monocular depth estimation from single panoramas exploiting the FFC. The main novelties of our work are: We include and exploit the FFC in a new network architecture for visual scene understanding.