objectives of monocular depth estimation, zero-shot transfer carries the promise of learning a generic
depth predictor that can generalize well across a variety of scenes. Rather than training and evaluating
on subsets of individual benchmarks, which usually share similar characteristics and biases, zero-shot
transfer expects models to be deployed for prediction on arbitrary in-the-wild images.
To achieve this goal, large-scale training datasets with equally high diversity are necessary
to enable good generalization. However, collecting data with high-quality depth annotations is
expensive, and existing benchmark datasets are often limited in scale or diversity. Many recent
works [26, 39] pursue mixed-dataset training, where datasets captured by various sensing modalities and in
diverse environments can be jointly utilized for model training, which largely alleviates the difficulty
of obtaining diverse annotated depth data at scale. Nevertheless, mixed-dataset training also comes
with its own challenges, as different datasets may exhibit inconsistent depth representations,
making them incompatible with one another. For example, the disparity maps generated from web
stereo images [34] or 3D movies [26] can only provide depth annotations up to a scale and shift, due
to varied and unknown camera models.
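Concretely, such an annotation d relates to the true depth d* only through an unknown per-image affine map, d = s · d* + t, so any supervision derived from it must be invariant to the unknown scale s and shift t.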
To solve this problem, state-of-the-art methods [26, 39] seek training objectives invariant to the scale-
and-shift changes in the depth representations by normalizing the predictions or depth annotations
based on statistics of the whole image instance, which greatly facilitates mixed-dataset learning of depth
predictors. However, since depth is represented by the magnitude of values, instance-level normalization
inevitably squeezes fine-grained depth differences, particularly in close regions.
Suppose that an object-centric dataset with depth annotations is available for training, where the
images sometimes include backgrounds with large depth values. Normalizing the depth representations
with global statistics can distinctly separate the foreground from the background, but it may meanwhile
overlook the depth differences within objects, which may be our real interest. As a result,
the learned depth predictor often excels at predicting the relative depth of each pixel
location with respect to the entire scene, i.e., the overall scene structure, but struggles
to capture fine-grained details.
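To make this squeezing effect concrete, the following is a minimal PyTorch-style sketch of instance-level, scale-and-shift invariant normalization in the spirit of [26]; the function name and the exact robust statistics (median and mean absolute deviation) are illustrative assumptions, not necessarily the authors' exact choices.

import torch

def instance_normalize(depth: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Normalize a depth map with statistics of the whole image instance:
    # subtract the median and divide by the mean absolute deviation,
    # both computed over the valid (annotated) pixels only.
    d = depth[mask]
    shift = d.median()
    scale = (d - shift).abs().mean()
    return (depth - shift) / scale.clamp(min=1e-6)

Because the statistics come from the entire image, distant background pixels dominate the scale estimate, so small depth differences within a nearby object are compressed toward zero after normalization.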
Motivated by this bias in existing depth normalization approaches, we aim to design a training
objective that is flexible enough to optimize both the overall scene structure and fine-grained
depth differences. Since depth estimation is a dense prediction task, we take inspiration from classic
local normalization approaches, such as local response normalization in AlexNet [16], local histogram
normalization in SIFT [22] and HOG features [6], and local deep features in DeepEMD [42], which
rely on normalized local statistics to enhance local contrast or generate discriminative local descriptors.
By varying the size of the local window, we can control how much context is involved in generating
a normalized representation. With these insights, we present a hierarchical depth normalization
(HDN) strategy that normalizes depth representations with different scopes of contexts for learning.
Intuitively, a large context for normalization emphasizes the depth difference globally, while a small
context focuses on the subtle difference locally. We present two implementations that define the
multi-scale contexts in the spatial domain and the depth domain, respectively. For the strategy in
the spatial domain, we divide the image into several sub-regions, and the context is defined as the
pixels within each individual cell. In this way, the fine-grained differences between spatially close
locations are emphasized. By varying the grid size, we obtain multiple depth representations of each
pixel that rely on different contexts. For the strategy in the depth domain, we group pixels based on
the ground-truth depth values to construct contexts, such that pixels with similar depth values can
be differentiated. Similarly, we can change the number of groups to control the context size. By
combining the normalized depth representations under various contexts for optimization, the learner
emphasizes both the fine-grained details and the global scene structure, as shown in Fig. 1;
illustrative sketches of both constructions follow this paragraph.
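As an illustration of the spatial-domain strategy, the sketch below reuses instance_normalize from above and normalizes depth independently within each cell of a g x g grid; the grid sizes are our illustrative assumptions rather than the paper's exact configuration, and the image height and width are assumed divisible by g for brevity.

def spatial_contexts(depth: torch.Tensor, mask: torch.Tensor, grid_sizes=(1, 2, 4)):
    # For each grid size g, split the image into g x g cells and normalize
    # depth within every cell, producing one normalized map per scale.
    H, W = depth.shape
    outputs = []
    for g in grid_sizes:
        out = torch.zeros_like(depth)
        hs, ws = H // g, W // g
        for i in range(g):
            for j in range(g):
                cell = (slice(i * hs, (i + 1) * hs), slice(j * ws, (j + 1) * ws))
                if mask[cell].any():  # skip cells without valid pixels
                    out[cell] = instance_normalize(depth[cell], mask[cell])
        outputs.append(out)
    return outputs

With g = 1 this reduces to the instance-level normalization above, while larger g restricts each context to a small neighborhood and highlights local depth differences.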
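Similarly, a sketch of the depth-domain strategy could bin the valid pixels into k groups by their ground-truth depth and normalize within each group; equal-frequency binning via quantiles is our assumption here, and the exact grouping scheme may differ from the paper's.

def depth_domain_contexts(gt_depth: torch.Tensor, mask: torch.Tensor, num_groups=(1, 2, 4)):
    # For each k, split the valid pixels into k depth ranges with roughly
    # equal pixel counts and collect one boolean context mask per range.
    # (Boundary pixels may fall into two adjacent groups; fine for illustration.)
    d = gt_depth[mask]
    contexts = []
    for k in num_groups:
        edges = torch.quantile(d, torch.linspace(0, 1, k + 1))
        for b in range(k):
            group = mask & (gt_depth >= edges[b]) & (gt_depth <= edges[b + 1])
            contexts.append(group)
    return contexts

A hierarchical training loss can then average a scale-and-shift invariant error over all such contexts, spatial and depth-domain alike, so that the global structure (k = 1) and fine-grained local differences (larger k) both contribute to the objective.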
To validate the effectiveness of our design, we conduct extensive experiments on various benchmark
datasets. Our empirical results show that the proposed hierarchical depth normalization remarkably
outperforms the existing instance-level normalization both qualitatively and quantitatively. Our
contributions are summarized as follows:
• We propose a hierarchical depth normalization strategy to improve the learning of deep monocular depth estimation models.
• We present two implementations to define the multi-scale normalization contexts of HDN based on spatial information and depth distributions, respectively.
• Experiments on five popular benchmark datasets show that our method significantly outperforms the baselines and sets new state-of-the-art results.