Based on Convolutional Neural Networks (CNNs), Eigen et al. [20] first demonstrated that scale information can be learned by properly designing the network structure. Since then, a large body of work has followed this direction [21], [22]. Despite their success, several critical issues remain to be addressed:
•Many methods do not consider contextual information and treat all pixels equally. This may result in grid artifacts [23], and the edges in depth maps may be distorted or blurry [24], as shown in Figure 1.
•Depth estimation is often deeply integrated into industrial applications, which require real-time operation with limited computational resources. To achieve higher accuracy, however, deeper networks and more complex mechanisms with more parameters have been developed [5]. The conflict between real-time requirements and expensive computational overhead urgently needs to be mitigated.
•In traditional CNN architectures, such as fully convolutional networks (FCNs), depth features can be severely lost after multiple layers of processing, which may lead to low accuracy that cannot meet practical requirements [5].
To alleviate these issues, this paper presents a new approach for monocular depth estimation from a single image. The main contributions are summarized as follows:
•We propose an attention-based encoder-decoder network that effectively generates the corresponding depth map from a single image and avoids grid artifacts with the least possible overhead. To leverage contextual information and locate the focal regions of an image, we design a convolutional attention mechanism block (CAMB) that combines channel attention and spatial attention sequentially, and we insert these CAMBs into the skip connections (a minimal sketch is given after this list). Unlike many previous methods, our attention module is lightweight and therefore more suitable for resource-constrained applications.
•We design a novel loss function that combines the depth values, the gradients along three directions (i.e., the X-axis, the Y-axis and the diagonal) and the structural similarity index measure (SSIM); a schematic form is sketched after this list. In addition, we introduce pixel blocks, instead of single pixels, to save computational resources when calculating the loss.
•We conduct comprehensive experiments on two large-scale datasets, i.e., KITTI and NYU-v2. The results show that our approach outperforms several representative baseline methods, which verifies the effectiveness of our approach.
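To make the first contribution concrete, the following is a minimal sketch of a sequential channel-then-spatial attention block in PyTorch. It is illustrative only: the class names, the reduction ratio and the 7x7 kernel size are our assumptions in the spirit of CBAM-style attention, not the exact CAMB specification, which is detailed in Section IV.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Weights each channel using pooled global statistics (illustrative sketch)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling -> (b, c)
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling -> (b, c)
        weights = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * weights


class SpatialAttention(nn.Module):
    """Weights each spatial location using channel-pooled maps."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )                                    # (b, 2, h, w)
        return x * torch.sigmoid(self.conv(pooled))


class CAMB(nn.Module):
    """Channel attention followed by spatial attention, applied sequentially."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spatial_att(self.channel_att(x))
```

Such a block adds only two small linear layers and one two-channel convolution per skip connection, so its overhead is negligible relative to the encoder-decoder backbone, which is what makes the design attractive for resource-constrained settings.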
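Similarly, the second contribution can be summarized schematically. With $\hat{d}$ the predicted depth map, $d$ the ground truth, and $\lambda_d$, $\lambda_g$, $\lambda_s$ hypothetical weighting coefficients (the exact per-term definitions and weights are given in Section IV), the combined loss takes the form

$$\mathcal{L} = \lambda_d\,\mathcal{L}_{\mathrm{depth}}(\hat{d}, d) + \lambda_g \sum_{k \in \{x,\, y,\, \mathrm{diag}\}} \mathcal{L}_k\big(\nabla_k \hat{d}, \nabla_k d\big) + \lambda_s \big(1 - \mathrm{SSIM}(\hat{d}, d)\big),$$

where $\nabla_x$, $\nabla_y$ and $\nabla_{\mathrm{diag}}$ denote finite differences along the X-axis, the Y-axis and the diagonal, and every term is evaluated over pixel blocks rather than individual pixels to reduce computation.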
The remainder of this paper is organized as follows. After
a brief review of related work in Section II, we present the
monocular depth estimation problem in Section III. Section IV
proposes our new approach. We evaluate the qualitative and
quantitative performance on KITTI and NYU-v2 in Section V.
Finally, conclusions are drawn in Section VI.
II. RELATED WORK
Recently, numerous methods have been proposed for image-based depth estimation. These methods can be roughly divided into two categories: geometry-based methods and monocular methods.
A. Geometry-based methods
Recovering 3D structure from geometric constraints is one option for estimating depth information. This kind of method relies either on consecutive frames captured by a single camera or on stereo matching with a binocular camera. For the former, structure from motion (SfM) [25] is a representative method that matches features across frames and estimates the camera motion, but its performance heavily depends on the quality of the image sequences [9]. To alleviate this problem, a variety of SfM strategies have been proposed to deal with uncalibrated or unordered images [12]. For example, incremental SfM approaches [26], [27] add one image at a time to grow the reconstruction, global methods [28] consider the entire view graph simultaneously, and hierarchical methods [29] divide the images into multiple clusters, reconstruct each cluster separately and merge the partial models into a complete model. However, these methods still suffer from monocular scale ambiguity and high computational complexity [9]. As for the latter, stereo matching computes the disparity maps [30] of image pairs through a cost function, and its bottleneck is the accuracy of matching pixels across different images [31]. Different from SfM, the scale information is contained in the depth estimates because the cameras are calibrated in advance [32]. However, in addition to the high consumption of computation and memory, calibration drift is also an issue [8].
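For reference, once the disparity $d$ of a pixel has been matched, the depth $z$ under a rectified binocular rig follows from the standard triangulation relation

$$z = \frac{f\,B}{d},$$

where $f$ is the focal length and $B$ the baseline between the two cameras; this also makes clear why errors in the calibrated $f$, $B$ or the rectification translate directly into depth errors.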
B. Monocular Methods from a Single Image
Since only a single image needs to be processed, depth estimation from one image can effectively reduce the computational complexity and memory overhead [8]. Numerous methods for estimating depth information from a single image have been proposed in recent years. Herein, we briefly review the relevant studies.
This problem was first studied by Eigen et al. [20], who regard it as a regression problem and propose a CNN architecture composed of a global coarse-scale network and a local fine-scale network to generate depth maps. Taking advantage of 3D geometric constraints, Yin et al. [21] implement 'virtual normal' constraints [17] and propose a supervised framework to obtain high-quality depth estimates. Praful et al. [33] utilize a UW-GAN to estimate depth information; their network includes two modules: a generator that predicts depth maps and a discriminator that assesses the quality of the maps. Fu et al. [34] introduce a spacing-increasing discretization (SID) strategy to discretize depth and recast depth network learning as an ordinal regression problem to generate depth maps.
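For reference, the SID strategy of [34] (our paraphrase, with $[\alpha, \beta]$ the depth range and $K$ the number of intervals) places the discretization thresholds uniformly in log-depth space:

$$t_i = e^{\log \alpha + \frac{i}{K} \log(\beta/\alpha)}, \qquad i = 0, 1, \ldots, K,$$

so the intervals widen with increasing depth, reflecting the larger estimation uncertainty at greater distances.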
Xu et al. [22] propose a conditional random field (CRF) based model over multi-scale features to estimate fine-grained depth maps. Although these fully convolutional network (FCN)