Depth Monocular Estimation with Attention-based
Encoder-Decoder Network from Single Image
1st Xin Zhang
University of South Carolina
Columbia, United States
xz8@email.sc.edu
2nd Rabab Abdelfattah
University of South Carolina
Columbia, United States
rabab@email.sc.edu
3rd Yuqi Song
University of South Carolina
Columbia, United States
yuqis@email.sc.edu
4th Samuel A. Dauchert
University of South Carolina
Columbia, United States
dauchert@email.sc.edu
5th Xiaofeng Wang
University of South Carolina
Columbia, United States
wangxi@cec.sc.edu
Abstract—Depth information is the foundation of perception, essential for autonomous driving, robotics, and other resource-constrained applications. Promptly obtaining accurate and efficient depth information allows for a rapid response in dynamic environments. Sensor-based methods using LIDAR and RADAR obtain high precision at the cost of high power consumption, price, and volume, whereas vision-based approaches, which have recently received much attention owing to advances in deep learning, can overcome these drawbacks. In this work, we explore an extreme scenario in vision-based settings: estimating a depth map from a single monocular image, a task severely plagued by grid artifacts and blurry edges. To address this scenario, we first design a convolutional attention mechanism block (CAMB), which consists of channel attention and spatial attention applied sequentially, and insert these CAMBs into the skip connections. As a result, our novel approach can find the focus of the current image with minimal overhead and avoid losing depth features. Next, by combining the depth value, the gradients along the X-axis, Y-axis, and diagonal directions, and the structural similarity index measure (SSIM), we propose a novel loss function. Moreover, we utilize pixel blocks to accelerate the computation of the loss function. Finally, we show, through comprehensive experiments on two large-scale image datasets, i.e., KITTI and NYU-V2, that our method outperforms several representative baselines.
Index Terms—computer vision, deep learning, monocular depth estimation, encoder-decoder, attention-based
I. INTRODUCTION
Perception is one of the key technologies in many areas,
such as autonomous driving, virtual reality, and robotics [1],
which helps to detect, understand, and interpret the surround-
ing environments, including dynamic and static obstacles. The
performance of perception usually relies on the accuracy of
depth information estimation [2]. For example, autonomous driving requires estimating the inter-vehicle distance and warning of potential rear-end collisions [3], robotic arms cannot grasp a target without accurate depth information [4], and so on.
There exist many strategies to infer depth information.
In general, these strategies can be classified into two cate-
gories: sensor-based methods and image-based methods [3],
[5].

Fig. 1. Generated depth maps from different methods. The upper left is an image from the KITTI dataset. The upper right and lower left are generated by [18] and [19], respectively; the objects in these two images (cars and poles, framed by black boxes) are clearly incomplete and blurred. The lower right is ours.

Sensor-based strategies, such as LIDAR, RGB-D cameras, and other active sensors [6], are able to collect depth information accurately. However, this type of method usually places heavy burdens on manpower and computation [7]. In addition, there can be strict conditions on applying these methods. For instance, LIDAR estimates depth accurately only at sparse locations [8], and RGB-D cameras suffer from a limited measurement range and sensitivity to outdoor sunlight [9]. Alternatively, image-based
methods can overcome these issues and be applied in a wide
range of applications [10], [11]. The conventional image-
based depth estimation methods heavily rely on multi-view
geometry [12]–[14], such as stereo images [15], [16] and
consecutive frames. Nevertheless, these methods introduce issues such as calibration drift over time [2], [8], as well as high demands on computational resources and memory [17]. Therefore, using
a monocular camera becomes an alternative low-cost, effi-
cient, and attractive solution with light maintenance require-
ments for autonomous driving, robotics, and other resource-
constrained applications [10].
This paper studies the extreme case in monocular depth
estimation, which is to estimate the depth map from one
image. This is an ill-posed problem, as there is an inherent ambiguity in the scale of the depth [17]. Owing to the
release of publicly available datasets and the advancement of
Convolutional Neural Networks (CNNs), Eigen et al. [20] first prove that the scale information can be learned by properly designing the network structure. Since then, there has been a great deal of work in this direction [21], [22]. Despite their
success, there are still some critical issues to be addressed:
• Many methods do not consider contextual information and treat all pixels equally. This may result in grid artifacts [23], and the edges in depth maps may be distorted or blurry [24], as shown in Figure 1.
• Depth estimation is often deeply integrated with industrial applications, which require real-time operation with limited computational resources. To achieve higher accuracy, however, deeper networks and more complex mechanisms with more parameters are developed [5]. The conflict between real-time requirements and expensive computational overhead urgently needs to be mitigated.
• In traditional CNN architectures, such as the fully connected network (FCN), depth features can be severely lost after multiple layers of processing, which may lead to accuracy too low to meet practical requirements [5].
To alleviate these issues, this paper presents a new approach for monocular depth estimation from a single image. The main contributions are summarized as follows:
• We propose an attention-based encoder-decoder network to effectively generate the corresponding depth map from a single image and avoid grid artifacts with the least possible overhead. To leverage contextual information and find the focus of an image, we design a convolutional attention mechanism block (CAMB), which combines channel attention and spatial attention sequentially, and insert these CAMBs into the skip connections; a sketch of such a block is given after this list. Unlike many previous methods, our attention module is lightweight and therefore more suitable for resource-constrained applications.
• We design a novel loss function by combining the depth value, the gradients along three directions (i.e., the X-axis, the Y-axis, and the diagonal), and the structural similarity index measure (SSIM). In addition, we introduce pixel blocks, instead of single pixels, to save computational resources when calculating the loss; a sketch of this loss also follows the list.
• We conduct comprehensive experiments on two large-scale datasets, i.e., KITTI and NYU-V2. The results show that our approach outperforms several representative baseline methods, which verifies the effectiveness of our approach.
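To make the CAMB idea concrete, here is a minimal PyTorch sketch. The text above fixes only the ordering (channel attention followed by spatial attention inside a skip connection), so the pooling choices, reduction ratio, and convolution kernel size below are illustrative assumptions in the style of CBAM, not the authors' exact design:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze spatial dims, then weight channels (reduction ratio assumed)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        # Combine average- and max-pooled channel descriptors.
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) +
                          self.mlp(x.amax(dim=(2, 3))))
        return x * w.view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Weight spatial locations from pooled channel statistics."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(stats))

class CAMB(nn.Module):
    """Channel attention followed by spatial attention, applied to an
    encoder feature map before it travels through a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, skip_feat):
        return self.sa(self.ca(skip_feat))
```

In an encoder-decoder, each skip tensor would pass through its own CAMB instance before being concatenated with the upsampled decoder features, which keeps the added overhead to a few linear layers and one small convolution per skip connection.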
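Similarly, the following is a hedged sketch of the loss just described: an L1 depth term, finite-difference gradient terms along the X, Y, and diagonal directions, and a simplified single-scale SSIM term, all computed over average-pooled pixel blocks. The block size and the weights w_grad and w_ssim are assumptions, and a windowed SSIM would replace the global statistics used here:

```python
import torch
import torch.nn.functional as F

def depth_loss(pred: torch.Tensor, gt: torch.Tensor,
               block: int = 4, w_grad: float = 1.0,
               w_ssim: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    # Pixel blocks: average-pool the (B, 1, H, W) maps so every term
    # operates on block statistics instead of individual pixels.
    p = F.avg_pool2d(pred, block)
    g = F.avg_pool2d(gt, block)

    # Depth term: mean absolute error over blocks.
    l_depth = torch.mean(torch.abs(p - g))

    # Gradient terms: finite differences along x, y, and the diagonal.
    def grads(t):
        gx = t[..., :, 1:] - t[..., :, :-1]
        gy = t[..., 1:, :] - t[..., :-1, :]
        gd = t[..., 1:, 1:] - t[..., :-1, :-1]
        return gx, gy, gd
    l_grad = sum(torch.mean(torch.abs(a - b))
                 for a, b in zip(grads(p), grads(g)))

    # Simplified SSIM term using global block statistics.
    mu_p, mu_g = p.mean(), g.mean()
    var_p, var_g = p.var(), g.var()
    cov = ((p - mu_p) * (g - mu_g)).mean()
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2) + eps)
    l_ssim = (1.0 - ssim) / 2.0

    return l_depth + w_grad * l_grad + w_ssim * l_ssim
```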
The remainder of this paper is organized as follows. After
a brief review of related work in Section II, we present the
monocular depth estimation problem in Section III. Section IV
proposes our new approach. We evaluate the qualitative and
quantitative performance on KITTI and NYU-v2 in Section V.
Finally, conclusions are drawn in Section VI.
II. RELATED WORK
Recently, numerous methods have been proposed for
image-based depth estimation. We can roughly divide these
methods into two categories: geometry-based and monocular.
A. Geometry-based methods
Recovering 3D structures based on geometric constraints is one option for estimating depth information. This kind of method relies on consecutive frames taken by one camera or on stereo matching with a binocular camera. For the former, structure from motion (SfM) [25] is a representative method that matches features across frames and estimates the camera motion, but its performance heavily relies on the quality of the image sequences [9]. To alleviate this problem, a variety of SfM strategies have been proposed to deal with uncalibrated or unordered images [12]. For example, incremental SfM approaches [26], [27] add one image at a time to grow the reconstruction, global methods [28] consider the entire view graph at the same time, and hierarchical methods [29] divide the images into multiple clusters, reconstruct each cluster separately, and merge the partial models into a complete model. However, these methods still suffer from monocular scale ambiguity and high computational complexity [9]. The latter calculates disparity maps [30] between images through a cost function, and its bottleneck is the accuracy of matching pixels across images [31]. Unlike SfM, scale information is included in the depth estimate because the cameras are calibrated in advance [32]. However, in addition to the high consumption of computation and memory, calibration drift is also an issue [8].
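As a concrete reference for why calibrated stereo recovers metric depth, here is a minimal sketch of the standard pinhole-stereo triangulation relation; it is not taken from any of the surveyed methods, and the parameter names are illustrative:

```python
import numpy as np

def disparity_to_depth(disparity: np.ndarray,
                       focal_px: float,
                       baseline_m: float,
                       eps: float = 1e-6) -> np.ndarray:
    """Classic pinhole-stereo relation: depth Z = f * B / d, valid once
    the camera pair is calibrated (focal length f in pixels, baseline B
    in meters). Small disparities are clamped to avoid division by zero."""
    return focal_px * baseline_m / np.maximum(disparity, eps)
```

The focal length and baseline carry physical units, which is exactly the scale information a single uncalibrated camera lacks.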
B. Monocular methods from Single Image
Since only a single image needs to be processed, depth estimation from one image can effectively reduce the computational complexity and memory overhead [8]. Numerous methods for estimating depth information from one image have been proposed in recent years. Herein, we briefly review the relevant studies.
This problem was first studied by Eigen et al. [20]. They regard it as a regression problem and propose a CNN architecture composed of a global coarse-scale network and a local fine-scale network to generate depth maps. By taking advantage of 3D geometric constraints, Yin et al. [21] implement 'virtual normal' constraints [17] and propose a supervised framework to obtain high-quality depth estimates. Praful et al. [33] utilize UW-GAN to estimate depth information; their network includes two modules: a generator that predicts depth maps and a discriminator that judges the quality of those maps. Fu et al. [34] introduce a spacing-increasing discretization (SID) strategy to discretize depth and recast depth network learning as an ordinal regression problem to generate depth maps.
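For reference, a minimal sketch of SID-style bin edges over an assumed depth range [alpha, beta]; the exact thresholds and the ordinal-regression head are specified in [34], so treat this only as the binning idea:

```python
import numpy as np

def sid_bin_edges(alpha: float, beta: float, K: int) -> np.ndarray:
    """K+1 bin edges spaced uniformly in log-depth over [alpha, beta],
    so bins widen with distance and near-range depth is discretized
    more finely than far-range depth."""
    i = np.arange(K + 1)
    return np.exp(np.log(alpha) + np.log(beta / alpha) * i / K)

# e.g. sid_bin_edges(1.0, 80.0, 5) -> edges growing roughly geometrically
```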
Xu et al. [22] propose a conditional random field (CRF)-based model over multi-scale features to estimate fine-grained depth maps. Although these fully connected network (FCN)