FreDSNet: Joint Monocular Depth and Semantic
Segmentation with Fast Fourier Convolutions
Bruno Berenguel-Baeta and Jesús Bermúdez-Cameo and Jose J. Guerrero
Abstract: In this work we present FreDSNet, a deep learning solution which obtains semantic 3D understanding of indoor environments from single panoramas. Omnidirectional images offer task-specific advantages for scene understanding problems due to the 360-degree contextual information they provide about the entire environment. However, the inherent characteristics of omnidirectional images add extra difficulty to obtaining accurate object detection and segmentation or a good depth estimation. To overcome these problems, we exploit convolutions in the frequency domain, obtaining a wider receptive field in each convolutional layer. These convolutions allow leveraging the whole contextual information from omnidirectional images. FreDSNet is the first network that jointly provides monocular depth estimation and semantic segmentation from a single panoramic image exploiting fast Fourier convolutions. Our experiments show that FreDSNet performs comparably to specialized state-of-the-art methods for semantic segmentation and depth estimation. The FreDSNet code is publicly available at
https://github.com/Sbrunoberenguel/FreDSNet
I. INTRODUCTION
Understanding 3D indoor environments is a hot topic in
computer vision and robotics research [15][31]. The scene
understanding field has different branches which focus on
different key aspects of the environment. The layout recovery problem has been in the spotlight for many years, achieving great results with both standard and omnidirectional cameras [2][7][17][22]. This layout information is useful for constraining the movement of autonomous robots [19][21] or for building virtual and augmented reality systems. Another line of research focuses on detecting and identifying objects and their classes in the scene. There are many methods for conventional cameras [4][10][20] which provide great results; however, conventional cameras are limited by their narrow field of view.
In recent years, the number of works that use panoramas, usually in the equirectangular projection, has been increasing [5][9], providing a better understanding of the whole environment. Besides, combining semantic and depth information helps to generate richer representations of indoor environments [13][27].
In this work, we focus on obtaining, from equirectangular
panoramas, two of the main pillars of scene understanding:
semantic segmentation and monocular depth estimation.
Without an adequate sensor, navigating autonomous vehicles through unknown environments is extremely challenging. Nowadays there is a great variety of sensors that provide
All authors are with the Instituto de Investigación en Ingeniería de Aragón, University of Zaragoza, Spain.
Corresponding author: berenguel@unizar.es
A final version of this article can be found at https://doi.org/10.1109/ICRA48891.2023.10161142
Fig. 1: Overview of our proposal. From a single RGB panorama (top left), we obtain a semantic segmentation (top right) and estimate a depth map (bottom left) of an indoor environment. With this information we can reconstruct the whole environment in 3D (bottom right).
accurate and diverse information about the environment (LIDARs, cameras, microphones, etc.). Among these possibilities, we choose to explore omnidirectional cameras, which have become increasingly popular as the main sensor for navigation and interactive applications. These cameras provide RGB information of all the surroundings and, combined with computer vision or deep learning algorithms, yield rich and useful information about an environment.
In this paper, we introduce FreDSNet, a new deep neural
network which jointly provides semantic segmentation and
depth estimation from a single equirectangular panorama (see
Fig. 1). We propose the use of the fast Fourier convolution (FFC) [3] to leverage the wider receptive field of these convolutions and to take advantage of the wide field of view of 360° panoramas. Besides, we train semantic segmentation and depth estimation jointly, so that each task can benefit from the other. Semantic segmentation provides information about the distribution of the objects as well as their boundaries, where hard depth discontinuities usually occur. Depth estimation, in turn, provides the scene's scale and the location of the objects inside the environment. Together, this is accurate enough for applications such as autonomous vehicle navigation, virtual and augmented reality, and scene reconstruction.
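As an illustration of this joint objective, the following is a minimal sketch assuming an L1 depth loss plus per-pixel cross-entropy with simple scalar weights; the paper does not specify its losses or weights here, so all choices below are assumptions:

```python
import torch
import torch.nn.functional as F

def joint_loss(depth_pred, depth_gt, sem_logits, sem_gt,
               w_depth=1.0, w_sem=1.0):
    """Hypothetical joint objective: L1 depth regression plus
    per-pixel cross-entropy for semantics. The weights w_depth
    and w_sem are illustrative, not the paper's values."""
    l_depth = F.l1_loss(depth_pred, depth_gt)
    l_sem = F.cross_entropy(sem_logits, sem_gt)
    return w_depth * l_depth + w_sem * l_sem

# Toy tensors: batch of 2, a 256x512 equirectangular resolution, 10 classes
depth_pred = torch.rand(2, 1, 256, 512)
depth_gt = torch.rand(2, 1, 256, 512)
sem_logits = torch.randn(2, 10, 256, 512)
sem_gt = torch.randint(0, 10, (2, 256, 512))
loss = joint_loss(depth_pred, depth_gt, sem_logits, sem_gt)
```

Both terms are differentiable with respect to shared encoder features, which is what lets each task regularize the other during training.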
The main contribution of this paper is that FreDSNet is the
first to jointly obtain semantic segmentation and monocular
depth estimation from single panoramas exploiting the FFC.
The main novelties of our work are: we include and exploit the FFC in a new network architecture for visual scene understanding; and we present a fully convolutional neural network that jointly obtains semantic segmentation and depth estimation from single panoramas.
arXiv:2210.01595v2 [cs.CV] 5 Feb 2024
Fig. 2: Architecture of our Frequential Depth estimation and Semantic segmentation Network (FreDSNet). The encoder part is formed by a feature extractor (ResNet) and four encoder blocks. The decoder part is formed by six decoding blocks and two branches that predict depth and semantic segmentation. The skip connections from the encoder to the decoder use learned weights.
II. RELATED WORKS
Semantic segmentation Semantic segmentation on perspective images is a well-studied field. We can find many works on object detection [20], semantic segmentation [10][29] or both tasks [4][8] from perspective cameras. However, omnidirectional images pose a harder problem to tackle, and only a few works are able to perform object detection or semantic segmentation on omnidirectional images [5][9][23]. Since omnidirectional images present heavy distortions (e.g. in spherical projections, such as equirectangular images, the distortion is most accentuated around the poles), these kinds of images are difficult to annotate manually. Nevertheless, due to their wide field of view (e.g. the spherical projection captures all the surroundings in a single image), the use of omnidirectional images for semantic segmentation is an active field of study, since a complete semantic understanding of the environment can be obtained from a single image.
Depth estimation Monocular depth estimation is a research topic that has been in the spotlight in recent years. With the rise of deep learning methods, many works on depth estimation from conventional cameras have appeared for diverse applications [6][12][14][18]. Almost at the same time, works on depth estimation from panoramic images started to appear for indoor scene understanding purposes [16][23][26][30]. Each work presents its particular approach to monocular depth estimation, which remains an open field of study with great interest and many applications.
Network architecture Many recent works on semantic segmentation or depth estimation rely on convolutional encoder-decoder architectures with a recurrent [16] or attention [23] mechanism as the hidden representation of the environment. This kind of architecture reduces the spatial resolution of the input image while increasing the number of feature maps in the encoder, relates the general context of the environment in the hidden representation, and up-samples it in the decoder to obtain the desired information. However, traditional encoder-decoder architectures that rely on standard convolutions [30] or geometrical approximations [5] suffer from slow growth of the effective receptive field of the convolutions, losing the general context information that omnidirectional images provide.
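This slow receptive-field growth is easy to quantify: a stack of n stride-1 k-by-k convolutions has a theoretical receptive field of only n(k-1)+1 pixels, so covering a 1024-pixel-wide panorama with plain 3x3 layers alone would take over 500 of them. A small helper illustrates the computation:

```python
def receptive_field(num_layers, kernel=3, stride=1):
    """Theoretical receptive field of a stack of identical conv layers."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= stride             # stride compounds the effective step size
    return rf

# With stride-1 3x3 convolutions the field grows by only 2 pixels per layer:
assert receptive_field(10) == 21  # 10 layers cover just 21 pixels
```

Strided layers grow the field faster but at the cost of spatial resolution, which is why frequency-domain convolutions are attractive for panoramas.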
In this work, we propose an encoder-decoder architecture for our network. However, we use the fast Fourier convolution presented in [3], which we denominate Fourier Block since we modify the behaviour of the block. These convolutions have been shown to 'see' the whole image at once, obtaining a higher effective receptive field from early layers. This is a key feature for our proposal: being aware of the full context of the scene, which can only be captured with omnidirectional images, improves the understanding of, and interaction with, the environment.
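The core idea of the FFC's global path can be sketched as a 1x1 convolution applied in the frequency domain: since each frequency bin depends on every pixel, a single such layer already has an image-wide receptive field. The sketch below is a simplified illustration of that mechanism (the real FFC also splits channels into local and global paths) and is not the exact block used in FreDSNet:

```python
import torch
import torch.nn as nn

class FourierUnit(nn.Module):
    """Simplified global path of a fast Fourier convolution [3]:
    a 1x1 convolution applied to the spectrum of the feature map,
    so one layer mixes information from the entire image."""
    def __init__(self, channels):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.act = nn.PReLU()  # PReLU, as in FreDSNet's Fourier Block

    def forward(self, x):
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm='ortho')          # complex spectrum
        spec = torch.cat([spec.real, spec.imag], dim=1)  # (b, 2c, h, w//2+1)
        spec = self.act(self.conv(spec))                 # mix in frequency domain
        real, imag = spec.chunk(2, dim=1)
        spec = torch.complex(real, imag)
        return torch.fft.irfft2(spec, s=(h, w), norm='ortho')  # back to spatial
```

Because the output size is recovered by the inverse FFT, the unit is a drop-in replacement for a spatial convolution of the same channel count.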
III. FREDSNET: MONOCULAR DEPTH AND SEMANTIC SEGMENTATION
Our network follows an encoder-decoder architecture with ResNet [11] as the initial feature extractor and two separate branches for depth estimation and semantic segmentation (see Fig. 2). It is inspired by BlitzNet [4] and PanoBlitzNet [9], using multi-resolution encoding and decoding to obtain a multi-scale representation of the scene, and skip connections, which make the training process more stable. Each branch takes intermediate feature maps from the decoder to provide an output from the multi-scale decoded information. The key novelty of our architecture is how the encoder and decoder blocks are composed and how these parts are interconnected.
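The learned-weight skip connections can be sketched as follows. The text only states that the weights are learned, so the per-channel parameterization below is our assumption for illustration:

```python
import torch
import torch.nn as nn

class WeightedSkip(nn.Module):
    """Skip connection whose contribution is scaled by a learned
    per-channel weight before being added to the decoder feature.
    The per-channel choice is an illustrative assumption."""
    def __init__(self, channels):
        super().__init__()
        # Initialized to 1 so the block starts as a plain additive skip
        self.w = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, decoder_feat, encoder_feat):
        return decoder_feat + self.w * encoder_feat

skip = WeightedSkip(64)
dec = torch.randn(1, 64, 32, 64)
enc = torch.randn(1, 64, 32, 64)
out = skip(dec, enc)
```

Letting the network learn how much encoder detail to inject at each scale is a cheap way to balance fine spatial detail against the decoded global context.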
A. Architecture
The proposed encoder blocks (FBC-N) are formed by a Fourier Block (FB) followed by a down-scaling (N) and a set of standard convolutions (W-conv), as shown in Fig. 3a. The Fourier Block has the same structure as the FFC implemented in [24]; however, we differ in the choice of activation function (AF). The original work, which proposes an in-painting method, uses a ReLU activation in the FFC. However, recent works such as [16] have shown that ReLU is not well suited for depth estimation, since it is prone to making gradients vanish. Instead, we use PReLU as the activation function, which is more stable when training for monocular depth estimation [16]. The same AF change has been made in the Spectral block [3] of the FB in order to homogenize the behaviour
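A minimal sketch of such an FBC-N encoder block, under the assumptions that the Fourier Block reduces to a single frequency-domain 1x1 convolution and that the down-scaling is bilinear (channel counts and kernel sizes are also illustrative), might look like:

```python
import torch
import torch.nn as nn

class FBCBlock(nn.Module):
    """Sketch of an FBC-N encoder block: a Fourier Block stand-in
    (frequency-domain 1x1 convolution), down-scaling by factor N,
    and a standard convolution, with PReLU activations as described
    in the text. All hyperparameters are illustrative assumptions."""
    def __init__(self, channels, scale=0.5):
        super().__init__()
        self.freq_conv = nn.Conv2d(2 * channels, 2 * channels, 1)
        self.down = nn.Upsample(scale_factor=scale, mode='bilinear',
                                align_corners=False)
        self.w_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.PReLU(),  # PReLU instead of ReLU to avoid vanishing gradients
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        # Fourier Block stand-in: convolve the stacked real/imag spectrum
        spec = torch.fft.rfft2(x, norm='ortho')
        spec = torch.cat([spec.real, spec.imag], dim=1)
        spec = self.freq_conv(spec)
        real, imag = spec.chunk(2, dim=1)
        x = torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm='ortho')
        x = self.down(x)      # down-scaling (N)
        return self.w_conv(x) # standard W-conv
```

With scale=0.5 the block halves the spatial resolution, matching the FBC-0.5 stages shown in the architecture diagram.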