
FreDSNet: Joint Monocular Depth and Semantic
Segmentation with Fast Fourier Convolutions
Bruno Berenguel-Baeta, Jesús Bermúdez-Cameo and Jose J. Guerrero
Abstract— In this work we present FreDSNet, a deep learning solution that obtains a semantic 3D understanding of indoor environments from single panoramas. Omnidirectional images offer task-specific advantages for scene understanding problems due to the 360-degree contextual information they provide about the entire environment. However, the inherent characteristics of omnidirectional images make it harder to obtain an accurate detection and segmentation of objects or a good depth estimation. To overcome these problems, we exploit convolutions in the frequency domain, which obtain a wider receptive field in each convolutional layer. These convolutions allow us to leverage the whole contextual information from omnidirectional images. FreDSNet is the first network that jointly provides monocular depth estimation and semantic segmentation from a single panoramic image exploiting fast Fourier convolutions. Our experiments show that FreDSNet performs comparably to specific state-of-the-art methods for semantic segmentation and depth estimation. The FreDSNet code is publicly available at https://github.com/Sbrunoberenguel/FreDSNet
I. INTRODUCTION
Understanding 3D indoor environments is a hot topic in computer vision and robotics research [15][31]. The scene understanding field has different branches that focus on different key aspects of the environment. The layout recovery problem has been in the spotlight for many years, with great results obtained using standard and omnidirectional cameras [2][7][17][22]. This layout information is useful for constraining the movement of autonomous robots [19][21] or for virtual and augmented reality systems. Another line of research focuses on detecting and identifying objects and their classes in the scene. There are many methods for conventional cameras [4][10][20] that provide great results; however, conventional cameras are limited by their narrow field of view. In recent years, works that use panoramas, usually in the equirectangular projection, have been increasing [5][9], providing a better understanding of the whole environment. Moreover, the combination of semantic and depth information helps to generate richer representations of indoor environments [13][27].
In this work, we focus on obtaining, from equirectangular
panoramas, two of the main pillars of scene understanding:
semantic segmentation and monocular depth estimation.
All authors are with the Instituto de Investigacion en Ingenieria de Aragon, University of Zaragoza, Spain. Corresponding author: berenguel@unizar.es. The final version of this article can be found at https://doi.org/10.1109/ICRA48891.2023.10161142
Fig. 1: Overview of our proposal. From a single RGB panorama (top left), we compute a semantic segmentation (top right) and estimate a depth map (bottom left) of an indoor environment. With this information we are able to reconstruct the whole environment in 3D (bottom right).
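The 3D reconstruction in Fig. 1 (bottom right) follows directly from the two network outputs: each pixel of the equirectangular panorama defines a ray on the unit sphere, the estimated depth scales that ray, and the RGB values or semantic labels color the resulting points. Below is a minimal sketch of this back-projection, assuming the depth map stores the Euclidean distance along each ray; the helper name is our own and this is illustrative, not the authors' released code.

```python
# Hedged sketch: back-project an equirectangular depth map to a
# colored 3D point cloud (assumes depth = distance along each ray).
import numpy as np

def equirect_to_pointcloud(depth, rgb):
    """depth: (H, W) metric depth; rgb: (H, W, 3) color image."""
    H, W = depth.shape
    # Longitude in [-pi, pi) and latitude in (pi/2, -pi/2) per pixel.
    lon = (np.arange(W) + 0.5) / W * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(H) + 0.5) / H * np.pi
    lon, lat = np.meshgrid(lon, lat)                 # both (H, W)
    # Unit ray direction for each pixel on the sphere.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    rays = np.stack([x, y, z], axis=-1)              # (H, W, 3)
    # Scale each unit ray by the estimated depth.
    points = rays * depth[..., None]                 # (H, W, 3)
    return points.reshape(-1, 3), rgb.reshape(-1, 3)
```

Replacing rgb with the per-pixel semantic labels yields the semantically colored reconstruction shown in the figure.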
Without adequate sensors, navigating autonomous vehicles in unknown environments is an extremely challenging task. Nowadays there is a great variety of sensors that provide accurate and diverse information about the environment (LIDARs, cameras, microphones, etc.). Among these possibilities, we choose to explore omnidirectional cameras, which have become increasingly popular as the main sensor for navigation and interactive applications. These cameras provide RGB information of all their surroundings and, combined with computer vision or deep learning algorithms, provide rich and useful information about an environment.
In this paper, we introduce FreDSNet, a new deep neural network that jointly provides semantic segmentation and depth estimation from a single equirectangular panorama (see Fig. 1). We propose the use of the fast Fourier convolution (FFC) [3] to leverage the wider receptive field of these convolutions and take advantage of the wide field of view of 360-degree panoramas. In addition, we train semantic segmentation and depth estimation jointly, so that each task can benefit from the other. Semantic segmentation provides information about the distribution of the objects as well as their boundaries, where hard depth discontinuities usually occur. In turn, the depth estimation provides the scene's scale and the location of the objects inside the environment. Together, these outputs are accurate enough for applications such as autonomous vehicle navigation, virtual and augmented reality, and scene reconstruction.
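To illustrate why a frequency-domain convolution yields a wide receptive field: a pointwise convolution applied to the 2D Fourier transform of a feature map mixes information from every spatial location at once, since each frequency bin depends on the whole image. The following is a minimal PyTorch sketch of such a spectral transform, loosely following the FFC of [3]; the full FFC additionally keeps a local convolutional branch and fuses both branches, so this is an illustration of the principle rather than the exact FreDSNet implementation.

```python
# Hedged sketch of the spectral transform at the core of the fast
# Fourier convolution (FFC) [3]; simplified, not the exact network.
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution applied in the frequency domain; real and
        # imaginary parts are stacked along the channel dimension.
        self.conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # Real 2D FFT: every frequency bin aggregates information from
        # the whole image, giving a global receptive field in one layer.
        f = torch.fft.rfft2(x, norm="ortho")       # (b, c, h, w//2+1), complex
        f = torch.cat([f.real, f.imag], dim=1)     # (b, 2c, h, w//2+1)
        f = self.conv(f)
        real, imag = f.chunk(2, dim=1)
        f = torch.complex(real, imag)
        # Back to the spatial domain at the original resolution.
        return torch.fft.irfft2(f, s=(h, w), norm="ortho")
```

This global mixing is particularly attractive for equirectangular panoramas, where the image wraps around horizontally and context from the full 360-degree view is relevant at every pixel.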
The main contribution of this paper is that FreDSNet is the first network to jointly obtain semantic segmentation and monocular depth estimation from single panoramas exploiting the FFC. The main novelties of our work are: We include and exploit the FFC in a new network architecture for visual scene understanding.