A Comparative Study on Deep-Learning Methods

for Dense Image Matching of Multi-angle and

Multi-date Remote Sensing Stereo Images

Hessah Albanwan1,2 and Rongjun Qin1, 2, 3,4,*

1Geospatial Data Analytics Laboratory, The Ohio State University, 218B Bolz Hall, 2036 Neil Avenue, Columbus, OH 43210, USA

2Department of Civil, Environmental and Geodetic Engineering, The Ohio State University, 218B Bolz Hall, 2036 Neil Avenue, Columbus, OH

43210, USA

3Department of Electrical and Computer Engineering, The Ohio State University, 205 Dreese Lab, 2036 Neil Avenue, Columbus, OH 43210,

USA

4Translational Data Analytics Institute, The Ohio State University, USA

Email: Albanwan.1@osu.edu, Qin.324@osu.edu

* Corresponding author

Abstract: Deep learning (DL) stereo matching methods have gained great attention for remote sensing satellite datasets.

However, most existing studies base their assessments on only a single or a few stereo pairs, lacking a

systematic evaluation on how robust DL methods are on satellite stereo images with varying radiometric and

geometric configurations. This paper provides an evaluation of four DL stereo matching methods through hundreds

of multi-date multi-site satellite stereo pairs with varying geometric configurations, against the traditional well-

practiced Census-SGM (Semi-global matching), to comprehensively understand their accuracy, robustness,

generalization capabilities, and their practical potential. The DL methods include a learning-based cost metric

through convolutional neural networks (MC-CNN) followed by SGM, and three end-to-end (E2E) learning models

using Geometry and Context Network (GCNet), Pyramid Stereo Matching Network (PSMNet), and LEAStereo. Our

experiments show that E2E algorithms can achieve the upper limits of geometric accuracy, but may not generalize

well for unseen data. The learning-based cost metric and Census-SGM are rather robust and can consistently achieve

acceptable results. All DL algorithms are robust to geometric configurations of stereo pairs and are less sensitive in

comparison to the Census-SGM, while learning-based cost metrics can generalize on satellite images when trained

on different datasets (airborne or ground-view).

Keywords: Dense Image Matching (DIM), Deep Learning (DL), Convolutional Neural Network (MC-CNN), Pyramid

Stereo Matching Network (PSMNet), Geometry and Context Network (GCNet), LEAStereo

I. INTRODUCTION

Stereo dense image matching (DIM) has been an active area of study over the years, as it offers a cost-effective and

efficient approach to generate digital surface models (DSM) for applications such as 3D modeling, forestry

mapping, and change detection (Albanwan et al., 2020; Albanwan & Qin, 2020; Furukawa & Hernández, 2015;

Navarro et al., 2018; Nebiker et al., 2014; Tian et al., 2014; Wang et al., 2017). This is especially relevant when

these applications are fueled by satellite-based 3D reconstructions due to their wide data coverage and consistent

data collection over time (Albanwan & Qin, 2020; Qin, 2017). However, it has been noted (Brown et al., 2003; Qin,

2019; Seitz et al., 2006) that the quality of the DSM not only depends on a specific DIM algorithm, but also on

various factors of stereo pairs including: 1) sensor and image characteristics (e.g., spatial and radiometric

resolutions), 2) acquisition conditions including the atmosphere and position and orientation of the sun, camera, and

objects, 3) scene structure and texture across different geographical regions, for instance, different land patterns

including urban, suburban, or forest areas, water surfaces, and parallaxes formed by stereo pairs, etc. Although there

have been studies evaluating the achievable quality of these satellite-based reconstructions using different DIM

algorithms, most of them only take one or a few pairs for quality assessment, thus the resulting conclusions are often

not comprehensive enough to cover data with the various stereo configurations appearing in practice.

A typical stereo DIM algorithm performs disparity estimation on rectified stereo images (i.e., epipolar images),

which generally follows several key steps, including cost matching, aggregation, disparity optimization, and filtering

(Scharstein et al., 2001). Cost matching is the main step for measuring feature similarities for disparity computation.

For the last decade, Census has been known as the classical cost matching metric for stereo DIM problems and has

been studied intensively due to its robustness to radiometric differences and because it can be computed

efficiently (Chen et al., 2019; Ma et al., 2013; Xia et al., 2018; Zabih & Woodfill, 1994). On the other hand, recently

developed deep learning (DL) algorithms have shown a promising performance that is assumed to surpass classical

cost metrics like Census (Chen et al., 2019; Hamid et al., 2020). There are mainly two ways that DL algorithms are

implemented in stereo DIM, namely 1) learning-based cost metrics and 2) end-to-end (E2E) learning. The

learning-based cost metric learns similarity from image patches given examples of similar and dissimilar patches;

such methods were first introduced by Žbontar & LeCun (2015) through convolutional neural networks (MC-CNN)

as learnable feature extractors. Like Census, such patch-based cost metrics do not handle texture-less regions well,

and often require multistage optimization after cost computation, followed by cost aggregation for regularization and

smoothness of the disparity map (Bobick & Intille, 1999; Hirschmuller, 2005, 2008; Kolmogorov & Zabih, 2001).

There are plenty of cost aggregation methods, ranging from local to semi-global and global approaches. A well-known

example is the semi-global matching (SGM) algorithm (Hirschmuller, 2005), which is known to balance

accuracy and speed well and is therefore now used as a standard cost aggregation approach. However, even with

cost aggregation, studies found that MC-CNN still suffers from poor performance in ill-posed regions with

occlusion, high-reflective surfaces, lack of texture, and repetitive patterns (Chen et al., 2019; Ma et al., 2013; Xia et

al., 2018; Zabih & Woodfill, 1994) demanding further postprocessing and refinement. Alternatively, E2E learning

methods directly generate disparity from stereo pair rectified images without additional optimization steps. They

have become a popular line of stereo DIM algorithms because they can directly predict highly accurate disparity

maps through learning from geometry and context (e.g., cues such as shading, illumination, objects, etc.) rather than

low-level features (Chang & Chen, 2018; Cheng et al., 2020; Gu et al., 2020; Kendall et al., 2017; Xu & Zhang,

2020; Zhang, Prisacariu, et al., 2019). Examples of this type of work are Geometry and Context Network (GCNet),

Pyramid Stereo Matching Network (PSMNet), and LEAStereo, which, due to their high accessibility and

performance (i.e., top ranks on the KITTI 2012 and 2015 leaderboards (Chang & Chen, 2018; Cheng et

al., 2020; Kendall et al., 2017)), have become popular in the field.

However, DL algorithms often suffer from generalization problems and, at the same time, face challenges in processing

large-volume and large-format remote sensing images, limiting their practical value (Pang et al., 2018; Song et al.,

2021; Zhang, Qi, et al., 2019). The degree of generalization may vary with both DL models and tasks. For instance,

studies indicate that most E2E approaches enjoy deep feature representations, which, however, may encode scene-

specific information (Pang et al., 2018; Song et al., 2021; Zhang, Qi, et al., 2019) and thus may perform poorly on

unseen scenes or data (Song et al., 2021). Most of the existing DL stereo DIM algorithms are trained using

Computer Vision (CV) benchmark datasets such as KITTI or Middlebury, etc. (Geiger et al., 2012; Hamid et al.,

2020), which mainly consist of images of everyday scenes. These images are distinctively different from satellite

images in terms of scene content, view perspectives, and object granularity, thus DL models trained from these CV

datasets may not perform well when applied to satellite datasets. However, acquiring a large number of high-quality

satellite datasets for training is a challenge mainly due to its high cost. Additionally, the scene content of the earth's

surface is extremely diverse across the entire globe, thus it is difficult to create a representative dataset by only using

data from a few regions, although these regions can be as large as an entire city. A few remote sensing benchmarks

such as the IARPA (The Intelligence Advanced Research Projects Activity) Multiview stereo 3D mapping challenge

(Bosch et al., 2016) and the 2019 Institute of Electrical and Electronics Engineers (IEEE) Geoscience and Remote

Sensing Society (GRSS) Data Fusion Contest (DFC) (Le Saux, 2019) provide the ground-truth dataset in the form of

LiDAR point clouds and raster DSMs, which must be post-processed and converted to the ground-truth disparity for

training. Unfortunately, this step may generate undesired errors in the training data due to geometric errors in the

orientation parameters, inconsistencies in the temporal and spatial resolutions, or other operations such as

triangulation, projection, and interpolation (Bosch et al., 2016; Cournet et al., 2020; Patil et al., 2019; Wu et al.,

2021).

In addition, a significantly under-evaluated criterion for DIM algorithms on satellite images is their robustness to

varying stereo configurations such as sun illuminations, intersection angles, different sensors, etc. A DIM

algorithm that is robust to these factors is extremely important because it will allow us to make full use of the already

limited satellite images (as compared to everyday images). Ultimately, we want these algorithms to be agnostic and

less selective to stereo configurations of data. It has been reported that the result of a typical stereo algorithm, i.e.,

Census with SGM, directly correlates with geometrical acquisition factors such as sun angle difference and

intersection angle (Qin, 2019). Unfortunately, the same analysis has not been covered for DL algorithms. Since DL

methods (especially E2E ones) can learn context, it is of interest to understand their performance under varying

stereo configurations.

In this paper, we aim to comprehensively explore the limitations and strengths of recent stereo DIM algorithms for

satellite datasets. We consider representative DIM algorithms of three main categories: 1) traditional approaches

(e.g., Census cost metric with SGM), 2) deep learning-based cost metrics (e.g., MC-CNN cost matching metric with

SGM), and 3) three E2E learning methods (e.g., GCNet, PSMNet, and LEAStereo). Since “Census+SGM” has been

well studied and widely used in satellite stereo-photogrammetry (Qin, 2019), it serves as a baseline method in this

study. Deep learning-based cost metric is regarded as a simpler task than E2E learning for DIM because it learns

patch-level similarity as a binary classification problem (similar or not similar). The representative algorithm in this

category is MC-CNN, which applies a Siamese network to learn the similarities (Žbontar & LeCun, 2015). Despite

the many E2E methods in the CV community (Laga, 2019), we choose three state-of-the-art (SOTA) methods

that are frequently used by the community, perform well on the leaderboards (Chang & Chen, 2018; Cheng et

al., 2020; Kendall et al., 2017), and have well-organized code available. To achieve a comprehensive evaluation

and analysis, we use nine satellite datasets from different locations, and each dataset contains ~100-500 stereo pairs

with their respective ground truth LiDAR data. We train and test the models on the same and different datasets, and

analyze the results to understand their performance, robustness, and generalization. To be more specific, this paper

presents three contributions:

1. We comprehensively evaluate five stereo DIM algorithms (including four DL approaches) on satellite stereo

images using hundreds of pairs from nine test-sites, to inform the community of the performance of such DIM

algorithms under varying configurations.

2. We analyze the accuracy of the evaluated DL methods and study their robustness against stereo configurations

of data that were reported to be critical for the resulting accuracy of DSMs for traditional methods (Qin, 2019).

3. We study the generalization capability (or transferability) of these DL stereo DIM methods trained on and

applied to datasets across different geographical regions and resolution/sensors (including satellite, airborne and

ground-view images).

The remainder of the paper is organized as follows: Section II introduces relevant work including a brief review of

stereo DIM algorithms and existing comparative studies. Section III describes the dataset, preprocessing methods, and

experimental data and setup as used in this work. Section IV describes in detail the dense stereo DIM algorithms and

the evaluation method. Section V presents the results, evaluation, and discussion. Finally, the conclusion, limitations,

and potential future directions are discussed in Section VI.

II. RELATED WORKS

A. Stereo DIM algorithms

There has been tremendous development in stereo DIM algorithms over the years; they are broadly classified into

traditional and DL methods (Zhou et al., 2020). Traditional methods are the early algorithms with basic cost

matching metrics such as the sum of absolute differences (SAD), normalized cross-correlation (NCC), mutual

information (MI), and Census transformation (Brown et al., 2003; Seitz et al., 2006). With the development of DL

methods, the cost-matching pipeline was replaced by convolutional neural networks (CNN) (Žbontar & LeCun,

2015). DL-based algorithms have attracted great attention due to their superior performances in benchmark testing

(Geiger et al., 2012). Depending on the task of learning, DL stereo methods can be further categorized into learning-

based cost metrics (Žbontar & LeCun, 2015, 2016) and E2E learning (Chang & Chen, 2018; Cheng et al., 2020;

Kendall et al., 2017; Xu & Zhang, 2020; Zhang, Prisacariu, et al., 2019). Learning-based cost metric was first

introduced by Žbontar & LeCun (2015) to learn similarities from image patches. Both traditional and learning-

based cost metrics process low-level features such as intensity or gradient patches to indicate similarity. As a result, their

performance is limited in regions with repetitive patterns or little texture. Thus, post-processing such as cost aggregation,

optimization, and refinement based on these metrics is necessary to enhance the quality of the disparity map

(Hirschmuller, 2005, 2008; Scharstein et al., 2001). DL methods rapidly evolved to E2E learning algorithms, where

their main contribution is to replace the classical multistage optimization with a trainable network to directly predict

the disparity from stereo images (Hamid et al., 2020; Laga, 2019). The underlying concept is that these neural

networks can directly capture more global features and hence may perform better (Chang & Chen, 2018; Cheng et

al., 2020; Kendall et al., 2017). In addition to these intensively studied methods, there are a few methods that

perform context learning for part of the traditional pipeline but do not fully fall into either of these DL categories,

for example, SGM-Net (Seki & Pollefeys, 2017) learns the per-pixel smoothness penalty and GA-Net (Zhang,

Prisacariu, et al., 2019) learns networks to guide the cost-aggregation process.
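
As an illustration of the cost aggregation step discussed above, the sketch below implements the SGM path recursion (Hirschmuller, 2005) along a single left-to-right path. It is a simplified sketch under stated assumptions: the function name and the penalty values `p1` and `p2` are illustrative choices, and a full SGM implementation sums the aggregated costs over several path directions (commonly 8 or 16) before selecting the disparity.

```python
import numpy as np

def aggregate_left_to_right(cost, p1=1.0, p2=8.0):
    """Aggregate a (D, H, W) cost volume along the left-to-right scanline path.

    L(x, d) = C(x, d) + min(L(x-1, d),
                            L(x-1, d-1) + p1,
                            L(x-1, d+1) + p1,
                            min_k L(x-1, k) + p2) - min_k L(x-1, k)
    """
    cost = cost.astype(np.float64)
    n_disp, height, width = cost.shape
    agg = np.empty_like(cost)
    agg[:, :, 0] = cost[:, :, 0]  # the path starts at the left image border
    for x in range(1, width):
        prev = agg[:, :, x - 1]            # shape (D, H)
        prev_min = prev.min(axis=0)        # best previous cost per row, shape (H,)
        from_lower = np.full_like(prev, np.inf)
        from_lower[1:] = prev[:-1]         # previous cost at disparity d - 1
        from_upper = np.full_like(prev, np.inf)
        from_upper[:-1] = prev[1:]         # previous cost at disparity d + 1
        small_jump = np.minimum(from_lower, from_upper) + p1
        best = np.minimum(np.minimum(prev, small_jump), prev_min + p2)
        # Subtracting prev_min keeps the aggregated values bounded along the path.
        agg[:, :, x] = cost[:, :, x] + best - prev_min
    return agg
```

The small penalty `p1` tolerates disparity changes of one level (slanted surfaces), while the larger `p2` discourages jumps unless the matching cost strongly supports them. A method such as SGM-Net replaces these fixed penalties with per-pixel values predicted by a network.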

B. Existing comparative studies of stereo dense matching algorithms

Most of the existing review papers on stereo DIM algorithms rely on one or a few stereo pairs for evaluation

(Hamid et al., 2020; Laga, 2019; Xia et al., 2020; Zhou et al., 2020), which may be insufficient to provide an

accurate and conclusive evaluation. There are a few but limited studies concerning the use of more pairs to study

DIM algorithms: they indicate that the quality of the DSM is significantly correlated with the radiometric and

geometric characteristics and the configurations of the stereo pairs (Facciolo et al., 2017; Qin, 2019; Yan et al.,

2016; X. Zhou & Boulanger, 2012). For instance, Facciolo et al. (2017) found that selecting stereo pairs based on

specific heuristics such as minimum seasonal differences improves DIM and reduces the uncertainties in the DSM.

Qin (2019) observed a direct relation between the accuracy of DSMs and the geometric configurations at the time of acquisition, such as

the sun angle difference and intersection angle (base-height ratio). In addition, the spatial

distribution of objects in space (e.g., building density, sizes, and distances) and land cover types (e.g., trees, roads,

water surfaces, etc.) are highly diverse across different test-sites (Chi et al., 2016) and hence may impact the

performance of stereo DIM algorithms. With satellite images being rich in content and acquisition configurations, it

is necessary to analyze the performance of stereo DIM algorithms on a large number of datasets covering a variety

of regions, complexities, and configurations to understand their practical values in various applications.

C. Deep learning training models and generalization

Despite the superiority of DL algorithms in DIM, the generalization issue remains a major challenge. Learning

across different domains such as data collected from different sensors, data of different locations, data with different

spatial resolutions, etc., often leads to a sharp drop in the accuracy of DIM because the models are unable to predict

disparity from unseen data (Pang et al., 2018; Song et al., 2021). One possible solution is to encapsulate a large

number of training datasets covering all scenarios and instances that a network may encounter (Najafabadi et al.,

2015). However, in practice, obtaining a large training dataset with ground truth (such as LiDAR) is costly, and such data are

often unavailable for large-scale areas (Chi et al., 2016); it also requires manual work or post-processing to convert to the

ground-truth disparity which may produce systematic errors (Cournet et al., 2020; Patil et al., 2019; Song et al.,

2021). Although some benchmark datasets are available (Bosch et al., 2016; Le Saux, 2019; Rottensteiner et al.,

2012), existing evaluations are mostly performed on a dataset-by-dataset basis. Moreover, these datasets, although

seemingly large in data volume, mostly present typical urban scenes of a few major cities and remain very sparse

when considering the generalization in terms of scene contexts around the globe, sensor types, and resolutions.

III. DATASET, PREPROCESSING, AND EXPERIMENTAL SETUP

In this section, we will first describe the datasets used in this work. Then, we will describe in detail the

preprocessing methods and experimental setup applied to comprehensively train and evaluate the DL-based stereo

matching algorithms.

A. Dataset description

In this work, we have two types of data that are used for training and evaluation purposes. All our data are collected

from publicly available benchmarks. First, we will describe our evaluation datasets and then we will describe the

training datasets.

The evaluation is based on satellite images from IARPA (Bosch et al., 2016) and the 2019 DFC (track 3) (Le Saux,

2019) benchmarks. They provide stereo images from the WorldView-3 satellite sensor with a spatial resolution of

0.3 meters. In addition, they provide airborne LiDAR data, which we convert to the ground-truth DSM and use for

evaluation. The IARPA benchmark provides 50 overlapping images for a 100 km² area near San Fernando, Argentina

(ARG) collected between January 2015 and January 2016. The 2019 DFC benchmark provides 16 to 39 overlapping

images for 100 km² of Omaha, NE, USA (OMA) and Jacksonville, FL, USA (JAX) collected from September 2014

to November 2015. We select three sub-regions as test-sites from each of the OMA, JAX, and ARG datasets. Our selection

covers test-sites with a variety of densities, complexities, and land types and covers. The 16-50 images provided

for the test-sites yield approximately 6,278 stereo pairs. After omitting stereo pairs with extremely small

intersection angles (i.e., < 2 degrees) and those that fail during feature matching for relative orientation (Kuschk et al.,

2014) due to large radiometric variations and clouds, we kept 2,861 stereo pairs for the experimental comparison.

The stereo pair images have a wide range of geometric configurations. The sun angle difference ranges between 0

and 50 degrees and the intersection angle ranges between 0 and 67 degrees. Figure 1(a) provides detailed information

about the evaluation dataset and test-sites.
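
The pair selection described above can be sketched as follows. This is a hypothetical illustration rather than the procedure used to build the evaluation set: it assumes each image comes with a satellite azimuth and elevation angle (the tuple layout and function names are invented for the example), approximates the intersection angle as the angle between the two viewing directions, and does not model the feature-matching failures that removed further pairs.

```python
import itertools
import math

def view_vector(azimuth_deg, elevation_deg):
    """Unit vector pointing from the ground towards the satellite (east, north, up)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.sin(az),
            math.cos(el) * math.cos(az),
            math.sin(el))

def intersection_angle(view_a, view_b):
    """Angle in degrees between two viewing directions given as (azimuth, elevation)."""
    va, vb = view_vector(*view_a), view_vector(*view_b)
    dot = sum(a * b for a, b in zip(va, vb))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))

def select_pairs(views, min_angle_deg=2.0):
    """Enumerate all image pairs and keep those with a large enough intersection angle.

    views: dict mapping an image name to (azimuth_deg, elevation_deg).
    Returns a list of (name_a, name_b, angle_deg).
    """
    selected = []
    for (name_a, view_a), (name_b, view_b) in itertools.combinations(views.items(), 2):
        angle = intersection_angle(view_a, view_b)
        if angle >= min_angle_deg:
            selected.append((name_a, name_b, angle))
    return selected
```

With n overlapping images, `itertools.combinations` produces n(n-1)/2 candidate pairs, which is why a few dozen images per site expand into hundreds of stereo pairs before filtering.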

The training dataset is used to train the DL algorithms. We include a variety of datasets from satellite, airborne, and

ground-view sensors. For training with a satellite dataset, we use another set from the 2019 DFC benchmark known
