A Comparative Study on Deep-Learning Methods
for Dense Image Matching of Multi-angle and
Multi-date Remote Sensing Stereo Images
Hessah Albanwan1,2 and Rongjun Qin1,2,3,4,*
1Geospatial Data Analytics Laboratory, The Ohio State University, 218B Bolz Hall, 2036 Neil Avenue, Columbus, OH 43210, USA
2Department of Civil, Environmental and Geodetic Engineering, The Ohio State University, 218B Bolz Hall, 2036 Neil Avenue, Columbus, OH
43210, USA
3Department of Electrical and Computer Engineering, The Ohio State University, 205 Dreese Lab, 2036 Neil Avenue, Columbus, OH 43210,
USA
4Translational Data Analytics Institute, The Ohio State University, USA
Email: Albanwan.1@osu.edu, Qin.324@osu.edu
* Corresponding author
Abstract: Deep learning (DL) stereo matching methods have gained great attention for remote sensing satellite datasets.
However, most existing studies base their assessments on only one or a few stereo pairs and lack a
systematic evaluation of how robust DL methods are on satellite stereo images with varying radiometric and
geometric configurations. This paper evaluates four DL stereo matching methods on hundreds
of multi-date, multi-site satellite stereo pairs with varying geometric configurations, against the traditional, well-practiced
Census-SGM (semi-global matching), to comprehensively understand their accuracy, robustness,
generalization capabilities, and practical potential. The DL methods include a learning-based cost metric
through convolutional neural networks (MC-CNN) followed by SGM, and three end-to-end (E2E) learning models:
Geometry and Context Network (GCNet), Pyramid Stereo Matching Network (PSMNet), and LEAStereo. Our
experiments show that E2E algorithms can achieve the upper limits of geometric accuracy but may not generalize
well to unseen data. The learning-based cost metric and Census-SGM are rather robust and consistently achieve
acceptable results. All DL algorithms are robust to the geometric configurations of stereo pairs and are less sensitive to them than
Census-SGM, while learning-based cost metrics can generalize to satellite images when trained
on different datasets (airborne or ground-view).
Keywords: Dense Image Matching (DIM), Deep Learning (DL), Convolutional Neural Network (MC-CNN), Pyramid
Stereo Matching Network (PSMNet), Geometry and Context Network (GCNet), LEAStereo
I. INTRODUCTION
Stereo dense image matching (DIM) has been an active area of study over the years, as it offers a cost-effective and
efficient approach to generate digital surface models (DSM) for applications such as 3D modeling, forestry
mapping, and change detection (Albanwan et al., 2020; Albanwan & Qin, 2020; Furukawa & Hernández, 2015;
Navarro et al., 2018; Nebiker et al., 2014; Tian et al., 2014; Wang et al., 2017). This is especially relevant when
these applications are fueled by satellite-based 3D reconstructions due to their wide data coverage and consistent
data collection over time (Albanwan & Qin, 2020; Qin, 2017). However, it has been noted (Brown et al., 2003; Qin,
2019; Seitz et al., 2006) that the quality of the DSM not only depends on a specific DIM algorithm, but also on
various factors of stereo pairs including: 1) sensor and image characteristics (e.g., spatial and radiometric
resolutions), 2) acquisition conditions including the atmosphere and position and orientation of the sun, camera, and
objects, 3) scene structure and texture across different geographical regions, for instance, different land patterns
including urban, suburban, or forest areas, water surfaces, and parallaxes formed by stereo pairs. Although
studies have evaluated the achievable quality of these satellite-based reconstructions using different DIM
algorithms, most of them use only one or a few pairs for quality assessment, so their conclusions are often
not comprehensive enough to cover the variety of stereo configurations that appear in practice.
A typical stereo DIM algorithm performs disparity estimation on rectified stereo images (i.e., epipolar images),
which generally follows several key steps, including cost matching, aggregation, disparity optimization, and filtering
(Scharstein et al., 2001). Cost matching is the main step for measuring feature similarities for disparity computation.
For the last decade, Census has been known as the classical cost matching metric for stereo DIM problems and has
been studied intensively because it is robust to radiometric differences and can be computed
efficiently (Chen et al., 2019; Ma et al., 2013; Xia et al., 2018; Zabih & Woodfill, 1994). On the other hand, recently
developed deep learning (DL) algorithms have shown promising performance that is assumed to surpass classical
cost metrics like Census (Chen et al., 2019; Hamid et al., 2020). DL algorithms are mainly
implemented in stereo DIM in two ways: 1) learning-based cost metrics and 2) end-to-end (E2E) learning. The
learning-based cost metric learns similarity from image patches given examples of similar and dissimilar patches;
such methods were first introduced by Žbontar & LeCun (2015) through convolutional neural networks (MC-CNN)
as learnable feature extractors. Like Census, such patch-based cost metrics do not handle texture-less regions well,
and often require multistage optimization after cost computation, followed by cost aggregation for regularization and
smoothness of the disparity map (Bobick & Intille, 1999; Hirschmuller, 2005, 2008; Kolmogorov & Zabih, 2001).
Cost aggregation methods range from local to semi-global and global approaches. A well-known
example is the semi-global matching (SGM) algorithm (Hirschmuller, 2005), which is known to effectively balance
accuracy and speed, and is therefore now used as a standard cost aggregation approach. However, even with
cost aggregation, studies found that MC-CNN still suffers from poor performance in ill-posed regions with
occlusion, high-reflective surfaces, lack of texture, and repetitive patterns (Chen et al., 2019; Ma et al., 2013; Xia et
al., 2018; Zabih & Woodfill, 1994) demanding further postprocessing and refinement. Alternatively, E2E learning
methods directly generate disparity from stereo pair rectified images without additional optimization steps. They
have become a popular line of stereo DIM algorithms because they can directly predict highly accurate disparity
maps through learning from geometry and context (e.g., cues such as shading, illumination, objects, etc.) rather than
low-level features (Chang & Chen, 2018; Cheng et al., 2020; Gu et al., 2020; Kendall et al., 2017; Xu & Zhang,
2020; Zhang, Prisacariu, et al., 2019). Examples of this type of work are Geometry and Context Network (GCNet),
Pyramid Stereo Matching Network (PSMNet), and LEAStereo, which have become popular in the field because of their
accessibility and performance (i.e., ranking at the top of the KITTI 2012 and 2015 leaderboards) (Chang & Chen, 2018; Cheng et
al., 2020; Kendall et al., 2017).
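To make the cost matching step concrete, the following is a minimal sketch of the Census transform and its Hamming-distance matching cost (Zabih & Woodfill, 1994). It is an illustration only, not the implementation evaluated in this paper; the function names, the window radius, and the edge padding are assumptions chosen for brevity.

```python
import numpy as np

def census_transform(img, radius=1):
    """Encode each pixel as a bit string: a bit is 1 where the neighbour is darker than the centre.

    Assumption: edge padding at the borders; radius <= 3 so the code fits in 64 bits.
    """
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    codes = np.zeros((h, w), dtype=np.uint64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # the centre pixel is not compared with itself
            neighbour = padded[radius + dy: radius + dy + h,
                               radius + dx: radius + dx + w]
            bit = (neighbour < img).astype(np.uint64)
            codes = (codes << np.uint64(1)) | bit
    return codes

def hamming_cost(codes_a, codes_b):
    """Matching cost: number of differing bits between two arrays of Census codes."""
    diff = np.bitwise_xor(codes_a, codes_b)
    count = np.zeros(diff.shape, dtype=np.uint8)
    while np.any(diff):
        count += (diff & np.uint64(1)).astype(np.uint8)
        diff = diff >> np.uint64(1)
    return count
```

Because the code depends only on the ordering of intensities inside the window, a change in gain or offset between the two images leaves it unchanged, which is the robustness to radiometric differences noted above.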
However, DL algorithms often suffer from generalization problems and also face challenges in processing
large-volume and large-format remote sensing images, which limits their practical value (Pang et al., 2018; Song et al.,
2021; Zhang, Qi, et al., 2019). The degree of generalization may vary with both DL models and tasks. For instance,
studies indicate that most E2E approaches rely on deep feature representations, which may encode scene-specific
information (Pang et al., 2018; Song et al., 2021; Zhang, Qi, et al., 2019) and thus may perform poorly on
unseen scenes or data (Song et al., 2021). Most existing DL stereo DIM algorithms are trained using
computer vision (CV) benchmark datasets such as KITTI or Middlebury (Geiger et al., 2012; Hamid et al.,
2020), which mainly consist of images of everyday scenes. These images are distinctly different from satellite
images in terms of scene content, view perspectives, and object granularity, so DL models trained on these CV
datasets may not perform well when applied to satellite datasets. However, acquiring a large number of high-quality
satellite datasets for training is a challenge, mainly due to the high cost. Additionally, the scene content of the Earth's
surface is extremely diverse across the globe, so it is difficult to create a representative dataset using only
data from a few regions, even when these regions are as large as an entire city. A few remote sensing benchmarks
such as the IARPA (The Intelligence Advanced Research Projects Activity) Multiview stereo 3D mapping challenge
(Bosch et al., 2016) and the 2019 Institute of Electrical and Electronics Engineers (IEEE) Geoscience and Remote
Sensing Society (GRSS) Data Fusion Contest (DFC) (Le Saux, 2019) provide the ground-truth dataset in the form of
LiDAR point clouds and raster DSMs, which must be post-processed and converted to the ground-truth disparity for
training. Unfortunately, this step may generate undesired errors in the training data due to geometric errors in the
orientation parameters, inconsistencies in the temporal and spatial resolutions, or other operations such as
triangulation, projection, and interpolation (Bosch et al., 2016; Cournet et al., 2020; Patil et al., 2019; Wu et al.,
2021).
In addition, a significantly under-evaluated criterion for DIM algorithms on satellite images is their robustness to
varying stereo configurations such as sun illumination, intersection angles, and different sensors. A DIM
algorithm that is robust to these factors is extremely important because it allows us to make full use of the already
limited satellite images (as compared to everyday images). Ultimately, we want these algorithms to be agnostic to and
less selective about the stereo configurations of the data. It has been reported that the result of a typical stereo algorithm, i.e.,
Census with SGM, directly correlates with geometrical acquisition factors such as sun angle difference and
intersection angle (Qin, 2019). Unfortunately, the same analysis has not been covered for DL algorithms. Since DL
methods (especially E2E ones) can learn context, it is of interest to understand their performance under varying
stereo configurations.
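As an illustration of the two geometric factors named above, the sketch below computes the angle between two directions given as azimuth and elevation, which can stand for either the intersection angle (two viewing directions) or the sun angle difference (two sun directions). The function names and the east-north-up axis convention are assumptions, and the exact definitions used in Qin (2019) may differ.

```python
import numpy as np

def unit_vector(azimuth_deg, elevation_deg):
    """Unit direction in an east-north-up frame (assumed convention: azimuth clockwise from north)."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    return np.array([np.cos(el) * np.sin(az),
                     np.cos(el) * np.cos(az),
                     np.sin(el)])

def angle_between(az1, el1, az2, el2):
    """Angle in degrees between two directions, e.g. two satellite views or two sun positions."""
    cos_angle = np.dot(unit_vector(az1, el1), unit_vector(az2, el2))
    cos_angle = np.clip(cos_angle, -1.0, 1.0)  # guard against rounding just outside [-1, 1]
    return float(np.degrees(np.arccos(cos_angle)))
```

For example, two views at 80 degrees elevation from opposite azimuths (0 and 180 degrees) intersect at 20 degrees, while a nadir view and a view at 60 degrees elevation intersect at 30 degrees.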
In this paper, we aim to comprehensively explore the limitations and strengths of recent stereo DIM algorithms for
satellite datasets. We consider representative DIM algorithms of three main categories: 1) traditional approaches
(e.g., Census cost metric with SGM), 2) deep learning-based cost metrics (e.g., MC-CNN cost matching metric with
SGM), and 3) three E2E learning methods (e.g., GCNet, PSMNet, and LEAStereo). Since “Census+SGM” has been
well studied and widely used in satellite stereo-photogrammetry (Qin, 2019), it serves as a baseline method in this
study. The deep learning-based cost metric is regarded as a simpler task than E2E learning for DIM because it learns
patch-level similarity as a binary classification problem (similar or not similar). The representative algorithm in this
category is MC-CNN, which applies a Siamese network to learn the similarities (Žbontar & LeCun, 2015). Although
many E2E methods exist in the CV community (Laga, 2019), we choose three state-of-the-art (SOTA) methods
that are frequently used by the community, perform well on the leaderboards (Chang & Chen, 2018; Cheng et
al., 2020; Kendall et al., 2017), and have well-organized code available. To achieve a comprehensive evaluation
and analysis, we use nine satellite datasets from different locations, and each dataset contains ~100-500 stereo pairs
with their respective ground truth LiDAR data. We train and test the models on the same and different datasets, and
analyze the results to understand their performance, robustness, and generalization. To be more specific, this paper
presents three contributions:
1. We comprehensively evaluate five stereo DIM algorithms (including four DL approaches) on satellite stereo
images using hundreds of pairs from nine test-sites, to inform the community of the performance of such DIM
algorithms under varying configurations.
2. We analyze the accuracy of the evaluated DL methods and study their robustness against stereo configurations
of data that were reported to be critical for the resulting accuracy of DSMs for traditional methods (Qin, 2019).
3. We study the generalization capability (or transferability) of these DL stereo DIM methods trained on and
applied to datasets across different geographical regions and resolution/sensors (including satellite, airborne and
ground-view images).
The remainder of the paper is organized as follows: Section II introduces relevant work including a brief review of
stereo DIM algorithms and existing comparative studies. Section III describes the dataset, preprocessing methods, and
experimental data and setup as used in this work. Section IV describes in detail the dense stereo DIM algorithms and
the evaluation method. Section V presents the results, evaluation, and discussion. Finally, the conclusion, limitations,
and potential future directions are discussed in Section VI.
II. RELATED WORKS
A. Stereo DIM algorithms
Stereo DIM algorithms have developed tremendously over the years and are broadly classified into
traditional and DL methods (Zhou et al., 2020). Traditional methods are the earliest algorithms, with basic cost
matching metrics such as the sum of absolute differences (SAD), normalized cross-correlation (NCC), mutual
information (MI), and the Census transform (Brown et al., 2003; Seitz et al., 2006). With the development of DL
methods, the cost-matching pipeline was replaced by convolutional neural networks (CNN) (Žbontar & LeCun,
2015). DL-based algorithms have attracted great attention due to their superior performances in benchmark testing
(Geiger et al., 2012). Depending on the task of learning, DL stereo methods can be further categorized into learning-
based cost metrics (Žbontar & LeCun, 2015, 2016) and E2E learning (Chang & Chen, 2018; Cheng et al., 2020;
Kendall et al., 2017; Xu & Zhang, 2020; Zhang, Prisacariu, et al., 2019). The learning-based cost metric was first
introduced by Žbontar & LeCun (2015) to learn similarities from image patches. Both traditional and learning-based
algorithms process low-level features such as intensity or gradient patches to indicate similarity. As a result, their
performance is limited in regions with repetitive patterns or little texture. Thus, post-processing such as cost aggregation,
optimization, and refinement based on these metrics is necessary to enhance the quality of the disparity map
(Hirschmuller, 2005, 2008; Scharstein et al., 2001). DL methods rapidly evolved into E2E learning algorithms,
whose main contribution is to replace the classical multistage optimization with a trainable network that directly predicts
the disparity from stereo images (Hamid et al., 2020; Laga, 2019). The underlying concept is that these neural
networks can directly capture more global features and hence may perform better (Chang & Chen, 2018; Cheng et
al., 2020; Kendall et al., 2017). In addition to these intensively studied methods, there are a few methods that
perform context learning for part of the traditional pipeline but do not fully fall into either of these DL categories,
for example, SGM-Net (Seki & Pollefeys, 2017) learns the per-pixel smoothness penalty and GA-Net (Zhang,
Prisacariu, et al., 2019) learns networks to guide the cost-aggregation process.
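The cost aggregation that patch-based metrics such as Census and MC-CNN depend on can be illustrated with the path-wise recurrence of SGM (Hirschmuller, 2005): the aggregated cost at a pixel is its own matching cost plus the cheapest transition from the previous pixel on the path, where a disparity change of one level is penalized by P1 and a larger jump by P2. The sketch below aggregates along a single left-to-right path of one image row. The function name and the penalty values are illustrative assumptions, and a full SGM implementation sums such paths over several directions.

```python
import numpy as np

def aggregate_left_to_right(cost, p1=1.0, p2=8.0):
    """Aggregate a (width, n_disparities) cost slice along one left-to-right path.

    p1 penalizes a disparity change of one level and p2 penalizes larger jumps;
    both default values are illustrative assumptions.
    """
    width, _ = cost.shape
    agg = np.empty_like(cost, dtype=np.float64)
    agg[0] = cost[0]
    for x in range(1, width):
        prev = agg[x - 1]
        prev_min = prev.min()
        from_lower = np.concatenate(([np.inf], prev[:-1])) + p1   # previous pixel had d - 1
        from_upper = np.concatenate((prev[1:], [np.inf])) + p1    # previous pixel had d + 1
        best = np.minimum(np.minimum(prev, from_lower),
                          np.minimum(from_upper, prev_min + p2))
        # Subtracting prev_min keeps the aggregated values from growing along the path.
        agg[x] = cost[x] + best - prev_min
    return agg
```

In a row where the middle pixel has a flat, ambiguous cost (as in a texture-less region) and its left neighbour clearly prefers one disparity, the aggregated cost of the middle pixel inherits that preference, which is how aggregation regularizes the disparity map.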
B. Existing comparative studies of stereo dense matching algorithms
Most existing review papers on stereo DIM algorithms use only one or a few stereo pairs for evaluation
(Hamid et al., 2020; Laga, 2019; Xia et al., 2020; Zhou et al., 2020), which may be insufficient to provide an
accurate and conclusive evaluation. Only a few studies use more pairs to study
DIM algorithms; they indicate that the quality of the DSM is significantly correlated with the radiometric and
geometric characteristics and the configurations of the stereo pairs (Facciolo et al., 2017; Qin, 2019; Yan et al.,
2016; X. Zhou & Boulanger, 2012). For instance, Facciolo et al. (2017) found that selecting stereo pairs based on
specific heuristics such as minimum seasonal differences improves DIM and reduces the uncertainties in the DSM.
Qin (2019) observed a direct relation between the accuracy of DSMs and the geometric configurations at the time of
acquisition, such as the sun angle difference and intersection angle (base-height ratio). In addition, the spatial
distribution of objects in space (e.g., building density, sizes, and distances) and land cover types (e.g., trees, roads,
water surfaces, etc.) are highly diverse across different test-sites (Chi et al., 2016) and hence may impact the
performance of stereo DIM algorithms. With satellite images being rich in content and acquisition configurations, it
is necessary to analyze the performance of stereo DIM algorithms on a large number of datasets covering a variety
of regions, complexities, and configurations to understand their practical values in various applications.
C. Deep learning training models and generalization
Despite the superiority of DL algorithms in DIM, generalization remains a major challenge. Learning
across different domains, such as data collected from different sensors, at different locations, or with different
spatial resolutions, often leads to a sharp drop in DIM accuracy because the models are unable to predict
disparity from unseen data (Pang et al., 2018; Song et al., 2021). One possible solution is to assemble a large
number of training datasets covering all scenarios and instances that a network may encounter (Najafabadi et al.,
2015). However, in practice, a large training dataset with ground truth (such as LiDAR) is costly to obtain and
often unavailable for large-scale areas (Chi et al., 2016). It also requires manual work or post-processing to convert to the
ground-truth disparity, which may produce systematic errors (Cournet et al., 2020; Patil et al., 2019; Song et al.,
2021). Although some benchmark datasets are available (Bosch et al., 2016; Le Saux, 2019; Rottensteiner et al.,
2012), existing evaluations are mostly performed on a dataset-by-dataset basis. Moreover, these datasets, although
seemingly large in data volume, mostly present typical urban scenes of a few major cities and remain very sparse
when considering generalization in terms of scene contexts around the globe, sensor types, and resolutions.
III. DATASET, PREPROCESSING, AND EXPERIMENTAL SETUP
In this section, we will first describe the datasets used in this work. Then, we will describe in detail the
preprocessing methods and experimental setup applied to comprehensively train and evaluate the DL-based stereo
matching algorithms.
A. Dataset description
In this work, we have two types of data that are used for training and evaluation purposes. All our data are collected
from publicly available benchmarks. First, we will describe our evaluation datasets and then we will describe the
training datasets.
The evaluation is based on satellite images from IARPA (Bosch et al., 2016) and the 2019 DFC (track 3) (Le Saux,
2019) benchmarks. They provide stereo images from the WorldView-3 satellite sensor with a spatial resolution of
0.3 meters. In addition, they provide airborne LiDAR data, which we convert to the ground-truth DSM and use for
evaluation. The IARPA benchmark provides 50 overlapping images for a 100 km² area near San Fernando, Argentina
(ARG) collected between January 2015 and January 2016. The 2019 DFC benchmark provides 16 to 39 overlapping
images for 100 km² of Omaha, NE, USA (OMA) and Jacksonville, FL, USA (JAX) collected from September 2014
to November 2015. We select three sub-regions as test-sites from each of the OMA, JAX, and ARG datasets. Our selection
covers test-sites with a variety of densities, complexities, land types, and land covers. The 16-50 images provided
for the test-sites yield approximately 6,278 stereo pairs. After omitting stereo pairs with extremely small
intersection angles (i.e., < 2 degrees) and those that fail during feature matching for relative orientation (Kuschk et al.,
2014) due to large radiometric variations and clouds, we kept 2,861 stereo pairs for the experimental comparison.
The stereo pair images have a wide range of geometric configurations. The sun angle difference ranges between 0
and 50 degrees and the intersection angle ranges between 0 and 67 degrees. Figure 1(a) provides detailed information
about the evaluation dataset and test-sites.
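The pair selection described above can be sketched as follows: every unordered combination of two overlapping images is a candidate stereo pair, and candidates below an intersection-angle threshold are discarded. This is a simplified illustration under stated assumptions. The function names and the dictionary of precomputed angles are hypothetical, the source does not state how pairs were enumerated, and the second criterion (failure of feature matching during relative orientation) is not modelled.

```python
from itertools import combinations

def candidate_pairs(image_ids):
    """All unordered pairs of images; n images give n * (n - 1) / 2 candidates."""
    return list(combinations(image_ids, 2))

def keep_pairs(pairs, intersection_angle_deg, min_angle_deg=2.0):
    """Drop pairs whose intersection angle is below the threshold (2 degrees in the text).

    intersection_angle_deg is a hypothetical mapping from a pair to its angle in degrees.
    """
    return [pair for pair in pairs if intersection_angle_deg[pair] >= min_angle_deg]

# Usage with three hypothetical image identifiers and made-up angles.
pairs = candidate_pairs(["img_a", "img_b", "img_c"])
angles = {("img_a", "img_b"): 1.5, ("img_a", "img_c"): 10.0, ("img_b", "img_c"): 2.0}
selected = keep_pairs(pairs, angles)
```

Under this enumeration, a site with 50 overlapping images has 1,225 candidate pairs and a site with 16 images has 120, before any filtering.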
The training dataset is used to train the DL algorithms. We include a variety of datasets from satellite, airborne, and
ground-view sensors. For training with a satellite dataset, we use another set from the 2019 DFC benchmark known