Based on Convolutional Neural Networks (CNNs), Eigen et al. [20] first demonstrated that scale information can be learned by properly designing the network structure. Since then, a large body of work has followed this direction [21], [22]. Despite their success, several critical issues remain to be addressed:
•Many methods do not consider contextual information and treat all pixels equally. This may result in grid artifacts [23], and the edges in depth maps may be distorted or blurry [24], as shown in Figure 1.
•Depth estimation is often deeply integrated into industrial applications, which require real-time operation with limited computational resources. To achieve higher accuracy, however, deeper networks and more complex mechanisms with more parameters have been developed [5]. The conflict between real-time requirements and expensive computational overhead urgently needs to be mitigated.
•In traditional CNN architectures, such as fully convolutional networks (FCNs), depth features can be severely lost after multiple layers of processing, which may lead to low accuracy that cannot meet practical requirements [5].
To alleviate these issues, this paper presents a new approach for monocular depth estimation from a single image. The main contributions are summarized as follows:
•We propose an attention-based encoder-decoder network that effectively generates the corresponding depth map from a single image and avoids grid artifacts with the least possible overhead. To leverage contextual information and locate the focal regions of an image, we design a convolutional attention mechanism block (CAMB) that combines channel attention and spatial attention sequentially, and we insert these CAMBs into the skip connections (a minimal sketch is given after this list). Unlike many previous methods, our attention module is lightweight and therefore more suitable for resource-constrained applications.
•We design a novel loss function that combines the depth values, the gradients along three directions (i.e., the X-axis, the Y-axis and the diagonal) and the structural similarity index measure (SSIM); a schematic form is sketched after this list. In addition, we introduce pixel blocks, instead of single pixels, to save computational resources when calculating the loss.
•We conduct comprehensive experiments on two large-scale datasets, i.e., KITTI and NYU-v2. The results show that our approach outperforms several representative baseline methods, which verifies the effectiveness of our approach.
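To make the first contribution concrete, the following is a minimal sketch of a sequential channel-then-spatial attention block in PyTorch. It is illustrative only: the class names, the reduction ratio and the 7x7 kernel size are our assumptions in the spirit of CBAM-style attention, not the exact CAMB specification, which is detailed in Section IV.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Weights each channel using pooled global statistics (illustrative sketch)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling -> (b, c)
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling -> (b, c)
        weights = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * weights


class SpatialAttention(nn.Module):
    """Weights each spatial location using channel-pooled maps."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )                                    # (b, 2, h, w)
        return x * torch.sigmoid(self.conv(pooled))


class CAMB(nn.Module):
    """Channel attention followed by spatial attention, applied sequentially."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spatial_att(self.channel_att(x))
```

Such a block adds only two small linear layers and one two-channel convolution per skip connection, so its overhead is negligible relative to the encoder-decoder backbone, which is what makes the design attractive for resource-constrained settings.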
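Similarly, the second contribution can be summarized schematically. With $\hat{d}$ the predicted depth map, $d$ the ground truth, and $\lambda_d$, $\lambda_g$, $\lambda_s$ hypothetical weighting coefficients (the exact per-term definitions and weights are given in Section IV), the combined loss takes the form

$$\mathcal{L} = \lambda_d\,\mathcal{L}_{\mathrm{depth}}(\hat{d}, d) + \lambda_g \sum_{k \in \{x,\, y,\, \mathrm{diag}\}} \mathcal{L}_k\big(\nabla_k \hat{d}, \nabla_k d\big) + \lambda_s \big(1 - \mathrm{SSIM}(\hat{d}, d)\big),$$

where $\nabla_x$, $\nabla_y$ and $\nabla_{\mathrm{diag}}$ denote finite differences along the X-axis, the Y-axis and the diagonal, and every term is evaluated over pixel blocks rather than individual pixels to reduce computation.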
The remainder of this paper is organized as follows. After
a brief review of related work in Section II, we present the
monocular depth estimation problem in Section III. Section IV
proposes our new approach. We evaluate the qualitative and
quantitative performance on KITTI and NYU-v2 in Section V.
Finally, conclusions are drawn in Section VI.
II. RELATED WORK
Recently, numerous methods have been proposed for image-based depth estimation. These methods can be roughly divided into two categories: geometry-based methods and monocular methods.
A. Geometry-based methods
Recovering 3D structure from geometric constraints is one option for estimating depth information. This kind of method relies either on consecutive frames captured by a single camera or on stereo matching with a binocular camera. For the former, structure from motion (SfM) [25] is a representative method that matches features across frames and estimates the camera motion, but its performance heavily depends on the quality of the image sequences [9]. To alleviate this problem, a variety of SfM strategies have been proposed to deal with uncalibrated or unordered images [12]. For example, incremental SfM approaches [26], [27] add one image at a time to grow the reconstruction, global methods [28] consider the entire view graph simultaneously, and hierarchical methods [29] divide the images into multiple clusters, reconstruct each cluster separately and merge the partial models into a complete model. However, these methods still suffer from monocular scale ambiguity and high computational complexity [9]. As for the latter, stereo matching computes the disparity maps [30] of image pairs through a cost function, and its bottleneck is the accuracy of matching pixels across different images [31]. Different from SfM, the scale information is contained in the depth estimates because the cameras are calibrated in advance [32]. However, in addition to the high consumption of computation and memory, calibration drift is also an issue [8].
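For reference, once the disparity $d$ of a pixel has been matched, the depth $z$ under a rectified binocular rig follows from the standard triangulation relation

$$z = \frac{f\,B}{d},$$

where $f$ is the focal length and $B$ the baseline between the two cameras; this also makes clear why errors in the calibrated $f$, $B$ or the rectification translate directly into depth errors.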
B. Monocular Methods from a Single Image
Since only a single image needs to be processed, depth estimation from one image can effectively reduce the computational complexity and memory overhead [8]. Numerous methods for estimating depth information from a single image have been proposed in recent years. Herein, we briefly review the relevant studies.
This problem was first studied by Eigen et al. [20], who regard it as a regression problem and propose a CNN architecture composed of a global coarse-scale network and a local fine-scale network to generate depth maps. Taking advantage of 3D geometric constraints, Yin et al. [21] implement 'virtual normal' constraints [17] and propose a supervised framework to obtain high-quality depth estimates. Praful et al. [33] utilize a UW-GAN to estimate depth information; their network includes two modules: a generator that predicts depth maps and a discriminator that assesses the quality of the maps. Fu et al. [34] introduce a spacing-increasing discretization (SID) strategy to discretize depth and recast depth network learning as an ordinal regression problem to generate depth maps.
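For reference, the SID strategy of [34] (our paraphrase, with $[\alpha, \beta]$ the depth range and $K$ the number of intervals) places the discretization thresholds uniformly in log-depth space:

$$t_i = e^{\log \alpha + \frac{i}{K} \log(\beta/\alpha)}, \qquad i = 0, 1, \ldots, K,$$

so the intervals widen with increasing depth, reflecting the larger estimation uncertainty at greater distances.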
Xu et al. [22] propose a conditional random field (CRF) based model over multi-scale features to estimate fine-grained depth maps. Although these fully convolutional network (FCN)