Fully Transformer Network for Change Detection
of Remote Sensing Images
Tianyu Yan, Zifu Wan, and Pingping Zhang [0000-0003-1206-1444]
School of Artificial Intelligence, Dalian University of Technology, China
{tianyuyan2001,wanzifu2000}@gmail.com;zhpp@dlut.edu.cn
Abstract. Recently, change detection (CD) of remote sensing images
has achieved great progress with the advances of deep learning. How-
ever, current methods generally deliver incomplete CD regions and ir-
regular CD boundaries due to the limited representation ability of the
extracted visual features. To alleviate these issues, in this work we propose
a novel learning framework named Fully Transformer Network (FTN) for
remote sensing image CD, which improves the feature extraction from a
global view and combines multi-level visual features in a pyramid man-
ner. More specifically, the proposed framework first utilizes the advan-
tages of Transformers in long-range dependency modeling. It can help to
learn more discriminative global-level features and obtain complete CD
regions. Then, we introduce a pyramid structure to aggregate multi-level
visual features from Transformers for feature enhancement. The pyra-
mid structure grafted with a Progressive Attention Module (PAM) can
improve the feature representation ability with additional interdepen-
dencies through channel attentions. Finally, to better train the frame-
work, we utilize the deeply-supervised learning with multiple boundary-
aware loss functions. Extensive experiments demonstrate that our pro-
posed method achieves a new state-of-the-art performance on four public
CD benchmarks. For model reproduction, the source code is released at
https://github.com/AI-Zhpp/FTN.
Keywords: Fully Transformer Network · Change Detection · Remote
Sensing Image.
1 Introduction
Change Detection (CD) plays an important role in the field of remote sensing. It
aims to detect the key change regions in dual-phase remote sensing images cap-
tured at different times but over the same area. Remote sensing image CD has
been used in many real-world applications, such as land-use planning, urban expansion management, geological disaster monitoring, and ecological environment protection. However, since change regions can take arbitrary shapes in complex scenarios,
there are still many challenges for high-accuracy CD. In addition, remote sensing
image CD by handcrafted methods is time-consuming and labor-intensive, thus
there is a great need for fully-automatic and highly-efficient CD.
arXiv:2210.00757v1 [cs.CV] 3 Oct 2022
In recent years, deep learning has been widely used in remote sensing image
processing due to its powerful feature representation capabilities, and has shown
great potential in CD. With deep Convolutional Neural Networks (CNN) [12,15,
17], many CD methods extract discriminative features and have demonstrated
good CD performances. However, previous methods still have the following short-
comings: 1) With the resolution improvement of remote sensing images, rich
semantic information contained in high-resolution images is not fully utilized.
As a result, current CD methods are unable to distinguish pseudo changes such
as shadow, vegetation and sunshine in sensitive areas. 2) Boundary information
in complex remote sensing images is often missing. In previous methods, the
extracted changed areas often have regional holes and their boundaries can be
very irregular, resulting in a poor visual effect [28]. 3) The temporal information
contained in dual-phase remote sensing images is not fully utilized, which is also
one of the reasons for the low performance of current CD methods.
To tackle the above issues, in this work we propose a novel learning framework
named Fully Transformer Network (FTN) for remote sensing image CD, which
improves the feature extraction from a global view and combines multi-level vi-
sual features in a pyramid manner. More specifically, the proposed framework
is a three-branch structure whose input is a dual-phase remote sensing image
pair. We first utilize the advantages of Transformers [9, 29, 42] in long-range de-
pendency modeling to learn more discriminative global-level features. Then, to
highlight the change regions, the summation features and difference features are
generated by directly comparing the temporal features of dual-phase remote sens-
ing images. Thus, one can obtain complete CD regions. To improve the boundary
perception ability, we further introduce a pyramid structure to aggregate multi-
level visual features from Transformers. The pyramid structure grafted with a
Progressive Attention Module (PAM) can improve the feature representation
ability with additional interdependencies through channel attentions. Finally, to
better train the framework, we utilize the deeply-supervised learning with multi-
ple boundary-aware loss functions. Extensive experiments show that our method
achieves a new state-of-the-art performance on four public CD benchmarks.
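For intuition on how channel attentions inject inter-channel dependencies, a minimal squeeze-and-excitation-style gate is sketched below (PyTorch; this is a generic channel attention for illustration, not necessarily the exact PAM defined later in the paper):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention: global pooling
    summarizes each channel, then a small MLP predicts per-channel weights."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # (B, C) channel descriptors -> weights
        return x * w.view(b, c, 1, 1)     # reweight feature channels

x = torch.randn(2, 32, 16, 16)
y = ChannelAttention(32)(x)
print(y.shape)
```

Because the gate outputs weights in (0, 1), each channel is attenuated according to its learned importance, which is the mechanism through which channel attentions model interdependencies.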
In summary, the main contributions of this work are as follows:
- We propose a novel learning framework (i.e., FTN) for remote sensing image CD, which can improve the feature extraction from a global view and combine multi-level visual features in a pyramid manner.
- We propose a pyramid structure grafted with a Progressive Attention Module (PAM) to further improve the feature representation ability with additional interdependencies through channel attentions.
- We introduce the deeply-supervised learning with multiple boundary-aware loss functions, to address the irregular boundary problem in CD.
- Extensive experiments on four public CD benchmarks demonstrate that our framework attains better performances than most state-of-the-art methods.
2 Related Work
2.1 Change Detection of Remote Sensing Images
Technically, the task of change detection takes dual-phase remote sensing images
as inputs, and predicts the change regions of the same area. Before deep learning, direct classification-based methods achieved great progress in CD. For
example, Change Vector Analysis (CVA) [16, 48] is powerful in extracting pixel-
level features and is widely utilized in CD. With the rapid improvement in image
resolution, more details of objects have been recorded in remote sensing images.
Therefore, many object-aware methods have been proposed to improve the CD performance. For example, Tang et al. [41] propose an object-oriented CD method
based on the Kolmogorov–Smirnov test. Li et al. [23] propose the object-oriented
CVA to reduce the number of pseudo detection pixels. With multiple classifiers
and multi-scale uncertainty analysis, Tan et al. [40] build an object-based approach for complex scene CD. Although the above methods can generate CD maps
from dual-phase remote sensing images, they generally deliver incomplete CD
regions and irregular CD boundaries due to the limited representation ability of
the extracted visual features.
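For intuition, the core of pixel-level CVA can be sketched in a few lines (NumPy; the toy images and threshold are illustrative, not the exact formulations of the cited works):

```python
import numpy as np

def cva_change_map(img_t1: np.ndarray, img_t2: np.ndarray,
                   threshold: float) -> np.ndarray:
    """Basic Change Vector Analysis: per-pixel magnitude of the spectral
    difference vector, thresholded into a binary change map.

    img_t1, img_t2: (H, W, C) co-registered images of the same area.
    """
    diff = img_t2.astype(np.float64) - img_t1.astype(np.float64)
    magnitude = np.linalg.norm(diff, axis=-1)  # change-vector length per pixel
    return (magnitude > threshold).astype(np.uint8)

# Toy example: one changed pixel in a 2x2, 3-band image pair.
t1 = np.zeros((2, 2, 3))
t2 = np.zeros((2, 2, 3))
t2[0, 0] = [3.0, 4.0, 0.0]               # magnitude = 5.0 at pixel (0, 0)
change = cva_change_map(t1, t2, threshold=1.0)
print(change)                            # pixel (0, 0) flagged as changed
```

Operating purely on per-pixel spectral vectors is exactly why such methods struggle with object completeness and boundary regularity, which motivates the object-aware and deep-learning approaches discussed next.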
With the advances of deep learning, many works improve the CD performance
by extracting more discriminative features. For example, Zhang et al. [52] uti-
lize a Deep Belief Network (DBN) to extract deep features and represent the
change regions by patch differences. Saha et al. [36] combine a pre-trained deep
CNN and traditional CVA to generate certain change regions. Hou et al. [14]
take advantage of deep features and introduce low-rank analysis to
improve the CD results. Peng et al. [33] utilize saliency detection analysis and
pre-trained deep networks to achieve unsupervised CD. Since change regions
may appear in any place, Lei et al. [22] integrate Stacked Denoising AutoEncoders
(SDAE) with the multi-scale superpixel segmentation to realize superpixel-based
CD. Similarly, Lv et al. [31] utilize a Stacked Contractive AutoEncoder (SCAE)
to extract temporal change features from superpixels, then adopt a clustering
method to produce CD maps. Meanwhile, some methods formulate the CD task
into a binary image segmentation task. Thus, CD can be performed in a supervised manner. For example, Alcantarilla et al. [1] first concatenate dual-phase
images as one image with six channels. Then, the six-channel image is fed into a
Fully Convolutional Network (FCN) to realize the CD. Similarly, Peng et al. [34]
combine bi-temporal remote sensing images as one input, which is then fed into
a modified U-Net++ [57] for CD. Daudt et al. [7] utilize Siamese networks to
extract features for each remote sensing image, then predict the CD maps with
fused features. The experimental results prove the effectiveness of Siamese networks.
Furthermore, Guo et al. [11] use a fully convolutional Siamese network with a
contrastive loss to measure the change regions. Zhang et al. [49] propose a deeply-
supervised image fusion network for CD. There are also some works focused on
specific object CD. For example, Liu et al. [28] propose a dual-task constrained
deep Siamese convolutional network for building CD. Jiang et al. [19] propose a
pyramid feature-based attention-guided Siamese network for building CD. Lei et
al. [21] propose a hierarchical paired channel fusion network for street scene CD.
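The supervised formulations above differ mainly in where the two temporal images are fused; the contrast can be sketched as follows (PyTorch; the single-layer convolutions are hypothetical stand-ins for the FCN and Siamese backbones of the cited works):

```python
import torch
import torch.nn as nn

encoder = nn.Conv2d(6, 16, kernel_size=3, padding=1)  # early fusion: 6-channel input
siamese = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # shared weights for both dates

t1 = torch.randn(1, 3, 64, 64)   # image at time 1
t2 = torch.randn(1, 3, 64, 64)   # image at time 2

# Early fusion (Alcantarilla et al. style): concatenate along channels first,
# so the network sees a single six-channel image.
early = encoder(torch.cat([t1, t2], dim=1))

# Siamese (Daudt et al. style): encode each date with shared weights,
# then fuse at the feature level; the absolute difference highlights change.
f1, f2 = siamese(t1), siamese(t2)
late = torch.abs(f1 - f2)

print(early.shape, late.shape)
```

Early fusion entangles the two dates from the first layer, whereas the Siamese design keeps per-date features comparable before differencing, which is one reason feature-level fusion tends to localize changes more cleanly.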
The aforementioned methods have shown great success in feature learning for
CD. However, these methods have limited global representation capabilities and
usually focus on local regions of changed objects. We find that Transformers have
strong characteristics in extracting global features. Thus, different from previous works, we take advantage of Transformers and propose a new learning
framework for more discriminative feature representations.
2.2 Vision Transformers for Change Detection
Recently, Transformers [42] have been applied to many computer vision tasks,
such as image classification [9, 29], object detection [4], semantic segmenta-
tion [44], person re-identification [27, 51] and so on. Inspired by that, Zhang et
al. [50] deploy a Swin Transformer [29] with a U-Net [35] structure for re-
mote sensing image CD. Zheng et al. [56] design a deep Multi-task Encoder-
Transformer-Decoder (METD) architecture for semantic CD. Wang et al. [45]
incorporate a Siamese Vision Transformer (SViT) into a feature difference framework for CD. To combine the advantages of both Transformers and CNNs, Wang et
al. [43] propose to combine a Transformer and a CNN for remote sensing image
CD. Li et al. [24] propose an encoding-decoding hybrid framework for CD, which
has the advantages of both Transformers and U-Net. Bandara et al. [3] unify hier-
archically structured Transformer encoders with Multi-Layer Perception (MLP)
decoders in a Siamese network to efficiently render multi-scale long-range details
for accurate CD. Chen et al. [5] propose a Bitemporal Image Transformer (BIT)
to efficiently and effectively model contexts within the spatial-temporal domain
for CD. Ke et al. [20] propose a hybrid Transformer with token aggregation for
remote sensing image CD. Song et al. [39] combine the multi-scale Swin Trans-
former and a deeply-supervised network for CD. All these methods have shown
that Transformers can model the inter-patch relations for strong feature representations. However, these methods do not exploit the full abilities of Transformers
in multi-level feature learning. Different from existing Transformer-based CD
methods, our proposed approach improves the feature extraction from a global
view and combines multi-level visual features in a pyramid manner.
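At their core, these methods rely on self-attention over patch embeddings to model inter-patch relations; a minimal sketch (PyTorch; the patch size and embedding dimension are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

patch, dim = 8, 64
img = torch.randn(1, 3, 32, 32)                      # a single remote sensing tile

# Split the image into non-overlapping patches and embed each as a token.
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_tokens(img).flatten(2).transpose(1, 2)   # (1, 16 tokens, 64 dims)

# Self-attention lets every patch attend to every other patch,
# giving each token a global receptive field in a single layer.
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
out, weights = attn(tokens, tokens, tokens)

print(out.shape, weights.shape)   # (1, 16, 64) and (1, 16, 16)
```

The (16, 16) attention matrix is exactly the inter-patch relation map; CNN layers, by contrast, only relate patches within their local receptive field.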
3 Proposed Approach
As shown in Fig. 1, the proposed framework includes three key components, i.e.,
Siamese Feature Extraction (SFE), Deep Feature Enhancement (DFE) and Pro-
gressive Change Prediction (PCP). By taking dual-phase remote sensing images
as inputs, SFE first extracts multi-level visual features through two shared Swin
Transformers. Then, DFE utilizes the multi-level visual features to generate sum-
mation features and difference features, which highlight the change regions with
temporal information. Finally, by integrating all above features, PCP introduces
a pyramid structure grafted with a Progressive Attention Module (PAM) for the
final CD prediction. To train our framework, we introduce the deeply-supervised
learning with multiple boundary-aware loss functions.
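A shape-level sketch of this three-component pipeline, with a tiny shared convolution standing in for the Swin Transformer backbone and DFE reduced to its summation/difference step (a hypothetical simplification for illustration, not the full FTN):

```python
import torch
import torch.nn as nn

# SFE: one shared (Siamese) encoder applied to both dates.
shared = nn.Conv2d(3, 32, kernel_size=3, padding=1)

t1 = torch.randn(1, 3, 64, 64)   # remote sensing image at time 1
t2 = torch.randn(1, 3, 64, 64)   # remote sensing image at time 2
f1, f2 = shared(t1), shared(t2)  # same weights for both dates

# DFE: summation and difference features highlight the change regions
# with temporal information.
f_sum = f1 + f2
f_diff = torch.abs(f1 - f2)

# PCP (schematic): fuse the enhanced features and predict a
# single-channel change map at input resolution.
head = nn.Conv2d(32 * 2, 1, kernel_size=1)
change_logits = head(torch.cat([f_sum, f_diff], dim=1))
print(change_logits.shape)       # (1, 1, 64, 64)
```

In the actual framework, the shared encoder produces multi-level features, and PCP aggregates them in a pyramid with the PAM rather than a single 1x1 convolution; the sketch only fixes the tensor shapes flowing between the three components.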