Fully Transformer Network for Change Detection
of Remote Sensing Images
Tianyu Yan, Zifu Wan, and Pingping Zhang [0000-0003-1206-1444]
School of Artificial Intelligence, Dalian University of Technology, China
{tianyuyan2001,wanzifu2000}@gmail.com;zhpp@dlut.edu.cn
Abstract. Recently, change detection (CD) of remote sensing images
has achieved great progress with the advances of deep learning. How-
ever, current methods generally deliver incomplete CD regions and ir-
regular CD boundaries due to the limited representation ability of the
extracted visual features. To alleviate these issues, in this work we propose
a novel learning framework named Fully Transformer Network (FTN) for
remote sensing image CD, which improves the feature extraction from a
global view and combines multi-level visual features in a pyramid man-
ner. More specifically, the proposed framework first utilizes the advan-
tages of Transformers in long-range dependency modeling. It can help to
learn more discriminative global-level features and obtain complete CD
regions. Then, we introduce a pyramid structure to aggregate multi-level
visual features from Transformers for feature enhancement. The pyra-
mid structure grafted with a Progressive Attention Module (PAM) can
improve the feature representation ability with additional interdepen-
dencies through channel attentions. Finally, to better train the frame-
work, we utilize the deeply-supervised learning with multiple boundary-
aware loss functions. Extensive experiments demonstrate that our pro-
posed method achieves a new state-of-the-art performance on four public
CD benchmarks. For model reproduction, the source code is released at
https://github.com/AI-Zhpp/FTN.
Keywords: Fully Transformer Network · Change Detection · Remote
Sensing Image.
1 Introduction
Change Detection (CD) plays an important role in the field of remote sensing. It
aims to detect the key change regions in dual-phase remote sensing images cap-
tured at different times but over the same area. Remote sensing image CD has
been used in many real-world applications, such as land-use planning, urban expansion management, geological disaster monitoring, and ecological environment protection. However, since change regions can take arbitrary shapes in complex scenarios,
there are still many challenges for high-accuracy CD. In addition, remote sensing
image CD by handcrafted methods is time-consuming and labor-intensive, thus
there is a great need for fully-automatic and highly-efficient CD.
arXiv:2210.00757v1 [cs.CV] 3 Oct 2022
In recent years, deep learning has been widely used in remote sensing image
processing due to its powerful feature representation capabilities, and has shown
great potential in CD. With deep Convolutional Neural Networks (CNN) [12,15,
17], many CD methods extract discriminative features and have demonstrated
good CD performances. However, previous methods still have the following short-
comings: 1) With the resolution improvement of remote sensing images, rich
semantic information contained in high-resolution images is not fully utilized.
As a result, current CD methods are unable to distinguish pseudo changes such
as shadow, vegetation and sunshine in sensitive areas. 2) Boundary information
in complex remote sensing images is often missing. In previous methods, the
extracted changed areas often have regional holes and their boundaries can be
very irregular, resulting in a poor visual effect [28]. 3) The temporal information
contained in dual-phase remote sensing images is not fully utilized, which is also
one of the reasons for the low performance of current CD methods.
To tackle the above issues, in this work we propose a novel learning framework
named Fully Transformer Network (FTN) for remote sensing image CD, which
improves the feature extraction from a global view and combines multi-level vi-
sual features in a pyramid manner. More specifically, the proposed framework
is a three-branch structure whose input is a dual-phase remote sensing image
pair. We first utilize the advantages of Transformers [9, 29, 42] in long-range de-
pendency modeling to learn more discriminative global-level features. Then, to
highlight the change regions, the summation features and difference features are
generated by directly comparing the temporal features of dual-phase remote sens-
ing images. Thus, one can obtain complete CD regions. To improve the boundary
perception ability, we further introduce a pyramid structure to aggregate multi-
level visual features from Transformers. The pyramid structure grafted with a
Progressive Attention Module (PAM) can improve the feature representation
ability with additional interdependencies through channel attentions. Finally, to
better train the framework, we utilize the deeply-supervised learning with multi-
ple boundary-aware loss functions. Extensive experiments show that our method
achieves a new state-of-the-art performance on four public CD benchmarks.
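For intuition on how channel attentions inject inter-channel dependencies, a minimal squeeze-and-excitation-style gate is sketched below (PyTorch; this is a generic channel attention for illustration, not necessarily the exact PAM defined later in the paper):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention: global pooling
    summarizes each channel, then a small MLP predicts per-channel weights."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # (B, C) channel descriptors -> weights
        return x * w.view(b, c, 1, 1)     # reweight feature channels

x = torch.randn(2, 32, 16, 16)
y = ChannelAttention(32)(x)
print(y.shape)
```

Because the gate outputs weights in (0, 1), each channel is attenuated according to its learned importance, which is the mechanism through which channel attentions model interdependencies.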
In summary, the main contributions of this work are as follows:
- We propose a novel learning framework (i.e., FTN) for remote sensing image CD, which can improve the feature extraction from a global view and combine multi-level visual features in a pyramid manner.
- We propose a pyramid structure grafted with a Progressive Attention Module (PAM) to further improve the feature representation ability with additional interdependencies through channel attentions.
- We introduce the deeply-supervised learning with multiple boundary-aware loss functions, to address the irregular boundary problem in CD.
- Extensive experiments on four public CD benchmarks demonstrate that our framework attains better performances than most state-of-the-art methods.
2 Related Work
2.1 Change Detection of Remote Sensing Images
Technically, the task of change detection takes dual-phase remote sensing images
as inputs, and predicts the change regions of the same area. Before deep learning, direct classification-based methods achieved great progress in CD. For
example, Change Vector Analysis (CVA) [16, 48] is powerful in extracting pixel-
level features and is widely utilized in CD. With the rapid improvement in image
resolution, more details of objects have been recorded in remote sensing images.
Therefore, many object-aware methods have been proposed to improve the CD performance. For example, Tang et al. [41] propose an object-oriented CD method
based on the Kolmogorov–Smirnov test. Li et al. [23] propose the object-oriented
CVA to reduce the number of pseudo detection pixels. With multiple classifiers
and multi-scale uncertainty analysis, Tan et al. [40] build an object-based approach for complex scene CD. Although the above methods can generate CD maps
from dual-phase remote sensing images, they generally deliver incomplete CD
regions and irregular CD boundaries due to the limited representation ability of
the extracted visual features.
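For intuition, the core of pixel-level CVA can be sketched in a few lines (NumPy; the toy images and threshold are illustrative, not the exact formulations of the cited works):

```python
import numpy as np

def cva_change_map(img_t1: np.ndarray, img_t2: np.ndarray,
                   threshold: float) -> np.ndarray:
    """Basic Change Vector Analysis: per-pixel magnitude of the spectral
    difference vector, thresholded into a binary change map.

    img_t1, img_t2: (H, W, C) co-registered images of the same area.
    """
    diff = img_t2.astype(np.float64) - img_t1.astype(np.float64)
    magnitude = np.linalg.norm(diff, axis=-1)  # change-vector length per pixel
    return (magnitude > threshold).astype(np.uint8)

# Toy example: one changed pixel in a 2x2, 3-band image pair.
t1 = np.zeros((2, 2, 3))
t2 = np.zeros((2, 2, 3))
t2[0, 0] = [3.0, 4.0, 0.0]               # magnitude = 5.0 at pixel (0, 0)
change = cva_change_map(t1, t2, threshold=1.0)
print(change)                            # pixel (0, 0) flagged as changed
```

Operating purely on per-pixel spectral vectors is exactly why such methods struggle with object completeness and boundary regularity, which motivates the object-aware and deep-learning approaches discussed next.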
With the advances of deep learning, many works improve the CD performance
by extracting more discriminative features. For example, Zhang et al. [52] uti-
lize a Deep Belief Network (DBN) to extract deep features and represent the
change regions by patch differences. Saha et al. [36] combine a pre-trained deep
CNN and traditional CVA to generate certain change regions. Hou et al. [14]
take advantage of deep features and introduce low-rank analysis to
improve the CD results. Peng et al. [33] utilize saliency detection analysis and
pre-trained deep networks to achieve unsupervised CD. Since change regions
may appear in any place, Lei et al. [22] integrate Stacked Denoising AutoEncoders
(SDAE) with the multi-scale superpixel segmentation to realize superpixel-based
CD. Similarly, Lv et al. [31] utilize a Stacked Contractive AutoEncoder (SCAE)
to extract temporal change features from superpixels, then adopt a clustering
method to produce CD maps. Meanwhile, some methods formulate the CD task
into a binary image segmentation task. Thus, CD can be performed in a supervised manner. For example, Alcantarilla et al. [1] first concatenate dual-phase
images as one image with six channels. Then, the six-channel image is fed into a
Fully Convolutional Network (FCN) to realize the CD. Similarly, Peng et al. [34]
combine bi-temporal remote sensing images as one input, which is then fed into
a modified U-Net++ [57] for CD. Daudt et al. [7] utilize Siamese networks to
extract features for each remote sensing image, then predict the CD maps with
fused features. The experimental results prove the effectiveness of Siamese networks.
Furthermore, Guo et al. [11] use a fully convolutional Siamese network with a
contrastive loss to measure the change regions. Zhang et al. [49] propose a deeply-
supervised image fusion network for CD. There are also some works focused on
specific object CD. For example, Liu et al. [28] propose a dual-task constrained
deep Siamese convolutional network for building CD. Jiang et al. [19] propose a
pyramid feature-based attention-guided Siamese network for building CD. Lei et
al. [21] propose a hierarchical paired channel fusion network for street scene CD.
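The supervised formulations above differ mainly in where the two temporal images are fused; the contrast can be sketched as follows (PyTorch; the single-layer convolutions are hypothetical stand-ins for the FCN and Siamese backbones of the cited works):

```python
import torch
import torch.nn as nn

encoder = nn.Conv2d(6, 16, kernel_size=3, padding=1)  # early fusion: 6-channel input
siamese = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # shared weights for both dates

t1 = torch.randn(1, 3, 64, 64)   # image at time 1
t2 = torch.randn(1, 3, 64, 64)   # image at time 2

# Early fusion (Alcantarilla et al. style): concatenate along channels first,
# so the network sees a single six-channel image.
early = encoder(torch.cat([t1, t2], dim=1))

# Siamese (Daudt et al. style): encode each date with shared weights,
# then fuse at the feature level; the absolute difference highlights change.
f1, f2 = siamese(t1), siamese(t2)
late = torch.abs(f1 - f2)

print(early.shape, late.shape)
```

Early fusion entangles the two dates from the first layer, whereas the Siamese design keeps per-date features comparable before differencing, which is one reason feature-level fusion tends to localize changes more cleanly.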
The aforementioned methods have shown great success in feature learning for
CD. However, these methods have limited global representation capabilities and
usually focus on local regions of changed objects. We find that Transformers have
strong characteristics in extracting global features. Thus, different from previous works, we take advantage of Transformers and propose a new learning
framework for more discriminative feature representations.
2.2 Vision Transformers for Change Detection
Recently, Transformers [42] have been applied to many computer vision tasks,
such as image classification [9, 29], object detection [4], semantic segmenta-
tion [44], person re-identification [27, 51] and so on. Inspired by that, Zhang et
al. [50] deploy a Swin Transformer [29] with a U-Net [35] structure for re-
mote sensing image CD. Zheng et al. [56] design a deep Multi-task Encoder-
Transformer-Decoder (METD) architecture for semantic CD. Wang et al. [45]
incorporate a Siamese Vision Transformer (SViT) into a feature difference framework for CD. To combine the advantages of both Transformers and CNNs, Wang et
al. [43] propose to combine a Transformer and a CNN for remote sensing image
CD. Li et al. [24] propose an encoding-decoding hybrid framework for CD, which
has the advantages of both Transformers and U-Net. Bandara et al. [3] unify hier-
archically structured Transformer encoders with Multi-Layer Perception (MLP)
decoders in a Siamese network to efficiently render multi-scale long-range details
for accurate CD. Chen et al. [5] propose a Bitemporal Image Transformer (BIT)
to efficiently and effectively model contexts within the spatial-temporal domain
for CD. Ke et al. [20] propose a hybrid Transformer with token aggregation for
remote sensing image CD. Song et al. [39] combine the multi-scale Swin Trans-
former and a deeply-supervised network for CD. All these methods have shown
that Transformers can model the inter-patch relations for strong feature representations. However, these methods do not exploit the full abilities of Transformers
in multi-level feature learning. Different from existing Transformer-based CD
methods, our proposed approach improves the feature extraction from a global
view and combines multi-level visual features in a pyramid manner.
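At their core, these methods rely on self-attention over patch embeddings to model inter-patch relations; a minimal sketch (PyTorch; the patch size and embedding dimension are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

patch, dim = 8, 64
img = torch.randn(1, 3, 32, 32)                      # a single remote sensing tile

# Split the image into non-overlapping patches and embed each as a token.
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_tokens(img).flatten(2).transpose(1, 2)   # (1, 16 tokens, 64 dims)

# Self-attention lets every patch attend to every other patch,
# giving each token a global receptive field in a single layer.
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
out, weights = attn(tokens, tokens, tokens)

print(out.shape, weights.shape)   # (1, 16, 64) and (1, 16, 16)
```

The (16, 16) attention matrix is exactly the inter-patch relation map; CNN layers, by contrast, only relate patches within their local receptive field.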
3 Proposed Approach
As shown in Fig. 1, the proposed framework includes three key components, i.e.,
Siamese Feature Extraction (SFE), Deep Feature Enhancement (DFE) and Pro-
gressive Change Prediction (PCP). By taking dual-phase remote sensing images
as inputs, SFE first extracts multi-level visual features through two shared Swin
Transformers. Then, DFE utilizes the multi-level visual features to generate sum-
mation features and difference features, which highlight the change regions with
temporal information. Finally, by integrating all above features, PCP introduces
a pyramid structure grafted with a Progressive Attention Module (PAM) for the
final CD prediction. To train our framework, we introduce the deeply-supervised
learning with multiple boundary-aware loss functions.
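A shape-level sketch of this three-component pipeline, with a tiny shared convolution standing in for the Swin Transformer backbone and DFE reduced to its summation/difference step (a hypothetical simplification for illustration, not the full FTN):

```python
import torch
import torch.nn as nn

# SFE: one shared (Siamese) encoder applied to both dates.
shared = nn.Conv2d(3, 32, kernel_size=3, padding=1)

t1 = torch.randn(1, 3, 64, 64)   # remote sensing image at time 1
t2 = torch.randn(1, 3, 64, 64)   # remote sensing image at time 2
f1, f2 = shared(t1), shared(t2)  # same weights for both dates

# DFE: summation and difference features highlight the change regions
# with temporal information.
f_sum = f1 + f2
f_diff = torch.abs(f1 - f2)

# PCP (schematic): fuse the enhanced features and predict a
# single-channel change map at input resolution.
head = nn.Conv2d(32 * 2, 1, kernel_size=1)
change_logits = head(torch.cat([f_sum, f_diff], dim=1))
print(change_logits.shape)       # (1, 1, 64, 64)
```

In the actual framework, the shared encoder produces multi-level features, and PCP aggregates them in a pyramid with the PAM rather than a single 1x1 convolution; the sketch only fixes the tensor shapes flowing between the three components.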