been studied intensively due to its robustness to radiometric differences and its computational
efficiency (Chen et al., 2019; Ma et al., 2013; Xia et al., 2018; Zabih & Woodfill, 1994). On the other hand, recently
developed deep learning (DL) algorithms have shown promising performance that is assumed to surpass classical
cost metrics like Census (Chen et al., 2019; Hamid et al., 2020). DL algorithms are implemented in stereo DIM in
two main ways: 1) learning-based cost metrics and 2) end-to-end (E2E) learning. A
learning-based cost metric learns similarity from image patches given examples of similar and dissimilar patches;
such methods were first introduced by Žbontar & LeCun (2015), who used convolutional neural networks (MC-CNN)
as learnable feature extractors. Like Census, such patch-based cost metrics do not handle texture-less regions well,
and often require multistage optimization in which cost computation is followed by cost aggregation to regularize and
smooth the disparity map (Bobick & Intille, 1999; Hirschmuller, 2005, 2008; Kolmogorov & Zabih, 2001).
Cost aggregation methods range from local to semi-global and global approaches. A well-known
example is the semi-global matching (SGM) algorithm (Hirschmuller, 2005), which is known to balance
accuracy and speed effectively and is therefore now used as a standard cost aggregation approach. However, even with
cost aggregation, studies found that MC-CNN still performs poorly in ill-posed regions with
occlusion, highly reflective surfaces, lack of texture, and repetitive patterns (Chen et al., 2019; Ma et al., 2013; Xia et
al., 2018; Zabih & Woodfill, 1994), demanding further postprocessing and refinement. Alternatively, E2E learning
methods directly generate disparity from rectified stereo image pairs without additional optimization steps. They
have become a popular line of stereo DIM algorithms because they can directly predict highly accurate disparity
maps by learning from geometry and context (e.g., cues such as shading, illumination, and objects) rather than
low-level features (Chang & Chen, 2018; Cheng et al., 2020; Gu et al., 2020; Kendall et al., 2017; Xu & Zhang,
2020; Zhang, Prisacariu, et al., 2019). Examples of this type of work are the Geometry and Context Network (GCNet),
the pyramid stereo matching network (PSMNet), and LEAStereo, which have become popular in the field due to their
high accessibility and performance (i.e., achieving the top rank on the KITTI 2012 and 2015 leaderboards; Chang &
Chen, 2018; Cheng et al., 2020; Kendall et al., 2017).
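To make the two-stage pipeline discussed above concrete (cost computation followed by cost aggregation), the following is a minimal Python sketch of a Census cost with SGM-style aggregation along a single scanline path. It is an illustration under simplifying assumptions, not the implementation evaluated in this paper: the function names, the 3x3 window, the penalties p1 and p2, and the invalid_cost placeholder are all hypothetical choices, and a full SGM implementation sums the aggregated costs over several path directions rather than one.

```python
def census_transform(img, radius=1):
    """Census codes (as ints) for a 2D list of intensities.

    Each bit records whether a neighbour is darker than the centre pixel.
    Border pixels, whose window would leave the image, keep the code 0.
    """
    h, w = len(img), len(img[0])
    codes = [[0] * w for _ in range(h)]
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            code = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx == 0:
                        continue
                    code = (code << 1) | int(img[y + dy][x + dx] < img[y][x])
            codes[y][x] = code
    return codes


def census_cost_row(left_codes, right_codes, y, max_disp, invalid_cost=64):
    """costs[x][d]: Hamming distance between left (x, y) and right (x - d, y)."""
    costs = []
    for x in range(len(left_codes[y])):
        row = []
        for d in range(max_disp + 1):
            if x - d < 0:
                # No counterpart inside the right image (assumed placeholder cost).
                row.append(invalid_cost)
            else:
                diff = left_codes[y][x] ^ right_codes[y][x - d]
                row.append(bin(diff).count("1"))
        costs.append(row)
    return costs


def aggregate_scanline(costs, p1=1, p2=4):
    """SGM recurrence along one left-to-right path over costs[x][d].

    p1 penalises disparity changes of one level, p2 penalises larger jumps.
    """
    n_disp = len(costs[0])
    agg = [list(costs[0])]
    for x in range(1, len(costs)):
        prev = agg[-1]
        prev_min = min(prev)
        row = []
        for d in range(n_disp):
            best = min(
                prev[d],
                prev[d - 1] + p1 if d > 0 else float("inf"),
                prev[d + 1] + p1 if d + 1 < n_disp else float("inf"),
                prev_min + p2,
            )
            # Subtracting prev_min keeps the aggregated values bounded.
            row.append(costs[x][d] + best - prev_min)
        agg.append(row)
    return agg


def winner_take_all(agg):
    """Pick, for each pixel, the disparity with the lowest aggregated cost."""
    return [row.index(min(row)) for row in agg]
```

For a rectified pair given as 2D lists, the disparities of row y would be obtained by chaining census_transform, census_cost_row, aggregate_scanline, and winner_take_all. Because both the cost and the smoothness penalties operate on local patches, a sketch like this also shows why texture-less or repetitive regions remain ambiguous: many disparities produce near-identical costs.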
However, DL algorithms often suffer from generalization problems and also face challenges in processing
large-volume and large-format remote sensing images, which limits their practical value (Pang et al., 2018; Song et al.,
2021; Zhang, Qi, et al., 2019). The degree of generalization may vary with both the DL model and the task. For instance,
studies indicate that most E2E approaches benefit from deep feature representations, which, however, may encode
scene-specific information (Pang et al., 2018; Song et al., 2021; Zhang, Qi, et al., 2019) and may therefore perform poorly on
unseen scenes or data (Song et al., 2021). Most of the existing DL stereo DIM algorithms are trained using
computer vision (CV) benchmark datasets such as KITTI or Middlebury (Geiger et al., 2012; Hamid et al.,
2020), which mainly consist of images of everyday scenes. These images are distinctly different from satellite
images in terms of scene content, view perspective, and object granularity, so DL models trained on these CV
datasets may not perform well when applied to satellite datasets. However, acquiring a large number of high-quality
satellite datasets for training is a challenge, mainly due to their high cost. Additionally, the scene content of the earth's
surface is extremely diverse across the globe, so it is difficult to create a representative dataset using only
data from a few regions, even if these regions are as large as an entire city. A few remote sensing benchmarks
such as the Intelligence Advanced Research Projects Activity (IARPA) Multi-View Stereo 3D Mapping Challenge
(Bosch et al., 2016) and the 2019 Institute of Electrical and Electronics Engineers (IEEE) Geoscience and Remote
Sensing Society (GRSS) Data Fusion Contest (DFC) (Le Saux, 2019) provide ground-truth data in the form of
LiDAR point clouds and raster DSMs, which must be post-processed and converted to ground-truth disparity for
training. Unfortunately, this step may introduce errors into the training data due to geometric errors in the
orientation parameters, inconsistencies in the temporal and spatial resolutions, or other operations such as
triangulation, projection, and interpolation (Bosch et al., 2016; Cournet et al., 2020; Patil et al., 2019; Wu et al.,
2021).
In addition, a significantly under-evaluated criterion for DIM algorithms on satellite images is their robustness to
varying stereo configurations, such as sun illumination, intersection angle, and differing sensors. A DIM
algorithm that is robust to these factors is extremely important because it allows us to make full use of the already
limited satellite images (as compared to everyday images). Ultimately, we want these algorithms to be agnostic and
less selective to the stereo configuration of the data. It has been reported that the result of a typical stereo algorithm, i.e.,
Census with SGM, directly correlates with geometrical acquisition factors such as sun angle difference and
intersection angle (Qin, 2019). Unfortunately, the same analysis has not been carried out for DL algorithms. Since DL
methods (especially E2E ones) can learn context, it is of interest to understand their performance under varying
stereo configurations.
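As an illustration of the geometric factors mentioned above, the sketch below computes the angle between two directions given as azimuth and elevation, which is one common way to obtain an intersection angle (from the two satellite view directions) or a sun angle difference (from the two sun directions). It assumes that each image's metadata provides azimuth (clockwise from north) and elevation (above the horizon) in degrees; the function names are hypothetical, and the exact definitions used in Qin (2019) may differ.

```python
import math


def direction_vector(azimuth_deg, elevation_deg):
    """Unit vector (east, north, up) pointing from the ground to the sensor or sun."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.sin(az),
        math.cos(el) * math.cos(az),
        math.sin(el),
    )


def angle_between(az1_deg, el1_deg, az2_deg, el2_deg):
    """Angle in degrees between two directions given as azimuth and elevation."""
    v1 = direction_vector(az1_deg, el1_deg)
    v2 = direction_vector(az2_deg, el2_deg)
    dot = sum(a * b for a, b in zip(v1, v2))
    # Clamp to [-1, 1] to guard against floating-point round-off before acos.
    dot = max(-1.0, min(1.0, dot))
    return math.degrees(math.acos(dot))
```

For example, a nadir view (elevation 90 degrees) paired with a view at 60 degrees elevation gives an intersection angle of about 30 degrees, and applying the same function to the two sun positions gives the sun angle difference for the pair.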
In this paper, we aim to comprehensively explore the limitations and strengths of recent stereo DIM algorithms for
satellite datasets. We consider representative DIM algorithms of three main categories: 1) traditional approaches