Hierarchical Normalization for Robust Monocular
Depth Estimation
Chi Zhang1, Wei Yin2, Zhibin Wang1, Gang Yu1, Bin Fu1, Chunhua Shen3
1Tencent PCG, China 2DJI Technology, China 3Zhejiang University, China
1{johnczhang, brianfu, skicyyu}@tencent.com; 2yvanwy@outlook.com; 3Chunhua@icloud.com
Abstract
In this paper, we address monocular depth estimation with deep neural networks.
To enable training of deep monocular estimation models with various sources
of datasets, state-of-the-art methods adopt image-level normalization strategies
to generate affine-invariant depth representations. However, learning with
image-level normalization mainly emphasizes the relations of pixel representations
to the global statistics of the image, such as the structure of the scene, while
fine-grained depth differences may be overlooked. In this paper, we propose
a novel multi-scale depth normalization method that hierarchically normalizes
the depth representations based on spatial information and depth distributions.
Compared with previous normalization strategies applied only at the holistic image
level, the proposed hierarchical normalization can effectively preserve the fine-
grained details and improve accuracy. We present two strategies that define the
hierarchical normalization contexts in the depth domain and the spatial domain,
respectively. Our extensive experiments show that the proposed normalization
strategy remarkably outperforms previous normalization methods, and we set new
state-of-the-art on five zero-shot transfer benchmark datasets.
Figure 1: We propose a hierarchical depth normalization strategy to improve the training of monocular depth
estimation models. Compared with the previous normalization strategy SSI [26], our design effectively improves
the details and smoothness of predictions. We visualize the predictions of close regions with heat maps to
observe details.
1 Introduction
Data-driven, deep-learning-based monocular depth estimation has gained wide interest in recent years,
due to its low requirement on sensing devices and its impressive progress. Among various learning
Corresponding author
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.09670v1 [cs.CV] 18 Oct 2022
objectives of deep monocular estimation, zero-shot transfer carries the promise of learning a generic
depth predictor that can generalize well across a variety of scenes. Rather than training and evaluation
on the subsets of individual benchmarks that usually share similar characteristics and biases, zero-shot
transfer expects the models to be deployed for predictions of any in-the-wild images.
To achieve this goal, large-scale datasets with equally high diversity for training are necessary
to enable good generalization. However, collecting data with high-quality depth annotations is
expensive, and existing benchmark datasets often show limitations in scale or diversity. Many recent
works [26, 39] seek mix-dataset training, where datasets captured by various sensing modalities and in
diverse environments can be jointly utilized for model training, which largely alleviates the difficulty
of obtaining diverse annotated depth data at scale. Nevertheless, the mix-data training also comes
with its challenges, as different datasets may demonstrate inconsistency in depth representations,
which causes incompatibility between datasets. For example, the disparity map generated from web
stereo images [34] or 3D movies [26] can only provide depth annotations up to a scale and shift, due
to varied and unknown camera models.
To solve this problem, state-of-the-art methods [26, 39] seek training objectives invariant to the
scale-and-shift changes in the depth representations by normalizing the predictions or depth annotations
based on statistics of the image instance, which largely facilitates the mix-data learning of depth
predictors. However, as the depth is represented by the magnitude of values, normalization based
on the instance inevitably squeezes the fine-grained depth difference, particularly in close regions.
Suppose that an object-centric dataset with depth annotations is available for training, where the
images sometimes include backgrounds with high depth values. Normalizing the depth representations
with global statistics can distinctly separate the foreground and background, but it may meanwhile
lead to overlooking the depth differences within objects, which may be our real interest. As a result,
the learned depth predictor often excels at predicting the relative depth representations of each pixel
location with respect to the entire scene in the image, such as the overall scene structure, but struggles
to capture the fine-grained details.
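The squeezing effect can be seen in a toy example. Below is a minimal NumPy sketch (our own illustration, not code from any prior implementation), using a median/mean-absolute-deviation normalization in the spirit of the SSI transform: once a distant background enters the statistics, the normalized contrast within a close object collapses.

```python
import numpy as np

def ssi_normalize(d):
    """Scale-and-shift normalization: subtract the median,
    divide by the mean absolute deviation."""
    t = np.median(d)
    s = np.mean(np.abs(d - t))
    return (d - t) / s

# A close object with subtle structure (1.0-1.2 m) plus a far background (50 m).
depth = np.array([1.0, 1.1, 1.2, 50.0])

global_norm = ssi_normalize(depth)      # statistics include the background
object_norm = ssi_normalize(depth[:3])  # statistics from the object only

print(global_norm[2] - global_norm[0])  # object contrast ~0.016: squeezed
print(object_norm[2] - object_norm[0])  # object contrast 3.0: preserved
```

Normalized within its own context, the object spans a range of 3; normalized against the whole scene, the same 0.2 m of structure shrinks to roughly 0.016, so the regression loss barely penalizes errors inside the object.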
Motivated by the bias issue in existing depth normalization approaches, we aim to design a training
objective that should have the flexibility to optimize both the overall scene structure and fine-grained
depth difference. Since depth estimation is a dense prediction task, we take inspiration from classic
local normalization approaches, such as local response normalization in AlexNet [16], local histogram
normalization in SIFT [22] and HOG features [6], and local deep features in DeepEMD [42], which
rely on normalized local statistics to enhance local contrast or generate discriminative local descriptors.
By varying the size of the local window, we can control how much context is involved to generate
a normalized representation. With such insights, we present a hierarchical depth normalization
(HDN) strategy that normalizes depth representations with different scopes of contexts for learning.
Intuitively, a large context for normalization emphasizes the depth difference globally, while a small
context focuses on the subtle difference locally. We present two implementations that define the
multi-scale contexts in the spatial domain and the depth domain, respectively. For the strategy in
the spatial domain, we divide the image into several sub-regions and the context is defined as the
pixels in individual cells. In this way, the fine-grained difference between spatially close locations
is emphasized. By varying the grid size, we obtain multiple depth representations of each pixel
that rely on different contexts. For the strategy in the depth domain, we group pixels based on
the ground truth depth values to construct contexts, such that pixels with similar depth values can
be differentiated. Similarly, we can change the number of groups to control the context size. By
combining the normalized depth representations under various contexts for optimization, the learner
emphasizes both the fine-grained details and the global scene structure, as shown in Fig. 1.
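To make the two context constructions concrete, here is a minimal NumPy sketch of such an objective. The per-context median/MAD normalization, the uniform spatial grids, the quantile-based depth grouping, and the plain L1 combination are our assumptions for illustration; the exact grouping and loss details of the full method may differ.

```python
import numpy as np

def ssi_norm(d, mask):
    """Normalize depths within a context (boolean mask) by median and MAD."""
    vals = d[mask]
    t = np.median(vals)
    s = np.mean(np.abs(vals - t)) + 1e-8  # epsilon guards constant contexts
    out = np.zeros_like(d)
    out[mask] = (d[mask] - t) / s
    return out

def spatial_contexts(h, w, grid):
    """Yield boolean masks for a grid x grid partition of the image."""
    ys = np.linspace(0, h, grid + 1, dtype=int)
    xs = np.linspace(0, w, grid + 1, dtype=int)
    for i in range(grid):
        for j in range(grid):
            m = np.zeros((h, w), dtype=bool)
            m[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = True
            yield m

def depth_contexts(gt, n_groups):
    """Yield masks grouping pixels by ground-truth depth quantiles."""
    qs = np.quantile(gt, np.linspace(0, 1, n_groups + 1))
    qs[-1] += 1e-6  # include the maximum in the last group
    for k in range(n_groups):
        yield (gt >= qs[k]) & (gt < qs[k + 1])

def hdn_loss(pred, gt, grids=(1, 2, 4), groups=(1, 2, 4)):
    """Average L1 error between normalized pred and gt over all contexts."""
    h, w = gt.shape
    losses = []
    for g in grids:
        for m in spatial_contexts(h, w, g):
            losses.append(np.mean(np.abs(ssi_norm(pred, m)[m] - ssi_norm(gt, m)[m])))
    for n in groups:
        for m in depth_contexts(gt, n):
            if m.sum() > 1:
                losses.append(np.mean(np.abs(ssi_norm(pred, m)[m] - ssi_norm(gt, m)[m])))
    return float(np.mean(losses))
```

With grid size 1 and a single depth group, the sketch falls back to the plain image-level objective; adding finer grids and more depth groups re-weights the loss toward local contrast.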
To validate the effectiveness of our design, we conduct extensive experiments on various benchmark
datasets. Our empirical results show that the proposed hierarchical depth normalization remark-
ably outperforms the existing instance-level normalization qualitatively and quantitatively. Our
contributions are summarized as follows:
• We propose a hierarchical depth normalization strategy to improve the learning of deep
monocular estimation models.
• We present two implementations to define the multi-scale normalization contexts of HDN
based on spatial information and depth distributions, respectively.
• Experiments on five popular benchmark datasets show that our method significantly outper-
forms the baselines and sets new state-of-the-art results.
Next we review some works that are closest to ours and then present our method in Section 3. In
Section 4, we empirically validate the effectiveness of our method on several public benchmark
datasets.
2 Related Work
Deep monocular depth estimation.
As opposed to early works on monocular depth estimation based
on hand-crafted features, recent studies advocate end-to-end learning based on deep neural networks.
Since the pioneering work [7] first adopted deep neural networks for monocular depth estimation,
significant progress has been made in many aspects, such as network architectures [17, 24, 18],
large-scale and diverse training datasets [36, 40], loss functions [26, 39], multi-task learning [41],
synthesized datasets [8], geometry constraints [36, 24, 37, 38], and various sources of supervision [35, 39].
For supervised training, collecting high-diversity data with ground-truth depth annotations at scale is
expensive. Recent works based on ranking loss [
2
,
35
] and scale-and-shift invariant losses [
26
,
39
]
enable network training with other forms of annotations, such as ordinal depth annotations [
2
,
34
] ,
or relative inverse depth map [
35
,
36
,
26
] generated by uncalibrated stereo images using optical flow
algorithms [
30
]. In particular, scale-and-shift invariant (SSI) loss [
26
] and image-level normalization
loss [
39
] allow data from multiple sources to be learned in a fully supervised depth regression manner,
which largely facilitates large-scale training and improves the generation ability of learning based
depth estimators. The SSI loss removes the major incompatibility between various datasets, i.e., the
scale and shift changes, by transforming the depth representation into a canonical space through
normalization. With such advances, zero-shot transfer is made possible, where the network learned
on a large-scale database with high diversity can be directly evaluated on various benchmarks without
seeing their training samples, which is the focus of this paper.
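As a concrete illustration of scale-and-shift invariance, the sketch below (our own example, not code from [26]) aligns a prediction to the ground truth with a closed-form least-squares scale and shift before measuring the error; a loss built on the aligned residual is unaffected by the unknown scale and shift of, e.g., web-stereo disparity annotations.

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Closed-form least squares: find s, t minimizing ||s*pred + t - gt||^2."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return s, t

pred = np.array([0.2, 0.4, 0.6, 0.8])
gt = 3.0 * pred + 1.5  # "ground truth" known only up to scale and shift

s, t = align_scale_shift(pred, gt)
residual = np.mean(np.abs(s * pred + t - gt))
print(s, t, residual)  # recovers s = 3.0, t = 1.5, residual = 0
```

Because the alignment is re-solved per image, a prediction that is correct up to any affine transform of depth incurs zero loss, which is exactly what makes datasets with unknown camera baselines compatible.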
Furthermore, some literature [1, 49, 10, 28] proposes to solve the monocular depth estimation problem
without sensor-captured ground truth but leverages the training signal from consecutive temporal
frames or stereo videos. However, most of these methods need the camera intrinsic parameters for
supervision.
Normalization in CNNs.
Normalization is widely adopted in deep neural networks, while different
normalization strategies are employed for different purposes. For instance, batch normalization
(BN) [13] normalizes the feature representations along the batch dimension to stabilize training and
accelerate convergence. BN usually prefers large normalization contexts to obtain robust feature
representation. On the other hand, another line of normalization methods relies on local statistics.
For example, instance normalization [31] and its variants [12, 23] based on instance-level statistics
dominate the style transfer task, as they emphasize the unique styles of individual images. A collection
of literature seeks normalization in local regions. For example, the well-known SIFT feature [22]
and HOG features [6] are based on normalized local statistics to generate discriminative local
features. DeepEMD [42, 43] computes the optimal transport between locally normalized deep features
as a distance metric between images. Since the local details and the overall scene structure are both
important for a depth estimator, we incorporate the ideas of both global normalization and local
normalization in the monocular depth estimation models.
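The difference between these normalization contexts reduces to which axes the statistics are pooled over. A minimal NumPy sketch of the two extremes discussed above (our illustration, omitting the learned affine parameters and running statistics of the real layers):

```python
import numpy as np

# Toy feature tensor in (N, C, H, W) layout.
x = np.random.default_rng(0).normal(2.0, 3.0, size=(8, 4, 16, 16))

# Batch norm: one mean/std per channel, pooled over batch and spatial dims.
bn = (x - x.mean(axis=(0, 2, 3), keepdims=True)) / x.std(axis=(0, 2, 3), keepdims=True)

# Instance norm: one mean/std per sample and channel, pooled over spatial dims only.
inn = (x - x.mean(axis=(2, 3), keepdims=True)) / x.std(axis=(2, 3), keepdims=True)
```

Larger pooling contexts (BN) yield robust, shared statistics; per-instance contexts (IN) emphasize what is unique to each image, the same trade-off that hierarchical depth normalization spans for depth values.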
3 Method
In this section, we first briefly summarize the preliminaries of the task. Then we define a unified
form of depth normalization and show that the normalization strategy in scale-and-shift invariant
loss [26] is a special case. Finally, we present two implementations of our proposed hierarchical depth
normalization approaches based on the spatial domain and the depth domain, respectively.
3.1 Preliminaries
We aim to boost the performance of zero-shot monocular depth estimation with diverse training data.
In our pipeline, we input a single RGB image I ∈ R^{H×W×3} to the depth prediction network F(·) to
generate a depth map D ∈ R^{H×W}. Instead of directly regressing the output map with the ground-
truth depth supervision, state-of-the-art methods [36, 26, 39] normalize the depth representations