objectives of monocular depth estimation, zero-shot transfer carries the promise of learning a generic
depth predictor that can generalize well across a variety of scenes. Rather than training and evaluating
on subsets of individual benchmarks, which usually share similar characteristics and biases, zero-shot
transfer expects models to be deployed for prediction on arbitrary in-the-wild images.
To achieve this goal, large-scale training datasets with equally high diversity are necessary
to enable good generalization. However, collecting data with high-quality depth annotations is
expensive, and existing benchmark datasets are often limited in scale or diversity. Many recent
works [26, 39] pursue mixed-dataset training, where datasets captured by various sensing modalities and in
diverse environments can be jointly utilized for model training, which largely alleviates the difficulty
of obtaining diverse annotated depth data at scale. Nevertheless, mixed-dataset training also comes
with its own challenges, as different datasets may exhibit inconsistent depth representations,
making them incompatible with one another. For example, the disparity maps generated from web
stereo images [34] or 3D movies [26] can only provide depth annotations up to a scale and shift, due
to varied and unknown camera models.
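Concretely, such an annotation d relates to the true depth d* only through an unknown per-image affine map, d = s · d* + t, so any supervision derived from it must be invariant to the unknown scale s and shift t.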
To solve this problem, state-of-the-art methods [26, 39] seek training objectives invariant to the scale-
and-shift changes in the depth representations by normalizing the predictions or depth annotations
based on statistics of the whole image instance, which greatly facilitates mixed-dataset learning of depth
predictors. However, since depth is represented by the magnitude of values, instance-level normalization
inevitably squeezes fine-grained depth differences, particularly in close regions.
Suppose that an object-centric dataset with depth annotations is available for training, where the
images sometimes include backgrounds with large depth values. Normalizing the depth representations
with global statistics can distinctly separate the foreground from the background, but it may meanwhile
overlook the depth differences within objects, which may be our real interest. As a result,
the learned depth predictor often excels at predicting the relative depth of each pixel
location with respect to the entire scene, i.e., the overall scene structure, but struggles
to capture fine-grained details.
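To make this squeezing effect concrete, the following is a minimal PyTorch-style sketch of instance-level, scale-and-shift invariant normalization in the spirit of [26]; the function name and the exact robust statistics (median and mean absolute deviation) are illustrative assumptions, not necessarily the authors' exact choices.

import torch

def instance_normalize(depth: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Normalize a depth map with statistics of the whole image instance:
    # subtract the median and divide by the mean absolute deviation,
    # both computed over the valid (annotated) pixels only.
    d = depth[mask]
    shift = d.median()
    scale = (d - shift).abs().mean()
    return (depth - shift) / scale.clamp(min=1e-6)

Because the statistics come from the entire image, distant background pixels dominate the scale estimate, so small depth differences within a nearby object are compressed toward zero after normalization.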
Motivated by this bias in existing depth normalization approaches, we aim to design a training
objective that is flexible enough to optimize both the overall scene structure and fine-grained
depth differences. Since depth estimation is a dense prediction task, we take inspiration from classic
local normalization approaches, such as local response normalization in AlexNet [16], local histogram
normalization in SIFT [22] and HOG features [6], and local deep features in DeepEMD [42], which
rely on normalized local statistics to enhance local contrast or generate discriminative local descriptors.
By varying the size of the local window, we can control how much context is involved in generating
a normalized representation. With these insights, we present a hierarchical depth normalization
(HDN) strategy that normalizes depth representations with different scopes of contexts for learning.
Intuitively, a large context for normalization emphasizes the depth difference globally, while a small
context focuses on the subtle difference locally. We present two implementations that define the
multi-scale contexts in the spatial domain and the depth domain, respectively. For the strategy in
the spatial domain, we divide the image into several sub-regions, and the context is defined as the
pixels within each individual cell. In this way, the fine-grained differences between spatially close
locations are emphasized. By varying the grid size, we obtain multiple depth representations of each
pixel that rely on different contexts. For the strategy in the depth domain, we group pixels based on
the ground-truth depth values to construct contexts, such that pixels with similar depth values can
be differentiated. Similarly, we can change the number of groups to control the context size. By
combining the normalized depth representations under various contexts for optimization, the learner
emphasizes both the fine-grained details and the global scene structure, as shown in Fig. 1;
illustrative sketches of both constructions follow this paragraph.
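As an illustration of the spatial-domain strategy, the sketch below reuses instance_normalize from above and normalizes depth independently within each cell of a g x g grid; the grid sizes are our illustrative assumptions rather than the paper's exact configuration, and the image height and width are assumed divisible by g for brevity.

def spatial_contexts(depth: torch.Tensor, mask: torch.Tensor, grid_sizes=(1, 2, 4)):
    # For each grid size g, split the image into g x g cells and normalize
    # depth within every cell, producing one normalized map per scale.
    H, W = depth.shape
    outputs = []
    for g in grid_sizes:
        out = torch.zeros_like(depth)
        hs, ws = H // g, W // g
        for i in range(g):
            for j in range(g):
                cell = (slice(i * hs, (i + 1) * hs), slice(j * ws, (j + 1) * ws))
                if mask[cell].any():  # skip cells without valid pixels
                    out[cell] = instance_normalize(depth[cell], mask[cell])
        outputs.append(out)
    return outputs

With g = 1 this reduces to the instance-level normalization above, while larger g restricts each context to a small neighborhood and highlights local depth differences.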
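Similarly, a sketch of the depth-domain strategy could bin the valid pixels into k groups by their ground-truth depth and normalize within each group; equal-frequency binning via quantiles is our assumption here, and the exact grouping scheme may differ from the paper's.

def depth_domain_contexts(gt_depth: torch.Tensor, mask: torch.Tensor, num_groups=(1, 2, 4)):
    # For each k, split the valid pixels into k depth ranges with roughly
    # equal pixel counts and collect one boolean context mask per range.
    # (Boundary pixels may fall into two adjacent groups; fine for illustration.)
    d = gt_depth[mask]
    contexts = []
    for k in num_groups:
        edges = torch.quantile(d, torch.linspace(0, 1, k + 1))
        for b in range(k):
            group = mask & (gt_depth >= edges[b]) & (gt_depth <= edges[b + 1])
            contexts.append(group)
    return contexts

A hierarchical training loss can then average a scale-and-shift invariant error over all such contexts, spatial and depth-domain alike, so that the global structure (k = 1) and fine-grained local differences (larger k) both contribute to the objective.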
To validate the effectiveness of our design, we conduct extensive experiments on various benchmark
datasets. Our empirical results show that the proposed hierarchical depth normalization remarkably
outperforms the existing instance-level normalization both qualitatively and quantitatively. Our
contributions are summarized as follows:
• We propose a hierarchical depth normalization strategy to improve the learning of deep monocular depth estimation models.
• We present two implementations to define the multi-scale normalization contexts of HDN based on spatial information and depth distributions, respectively.
• Experiments on five popular benchmark datasets show that our method significantly outperforms the baselines and sets new state-of-the-art results.