
MULTIMODAL IMAGE FUSION BASED ON HYBRID CNN-TRANSFORMER AND
NON-LOCAL CROSS-MODAL ATTENTION
Yu Yuan1, Jiaqi Wu2, Zhongliang Jing1, Henry Leung3, Han Pan1
1School of AA, Shanghai Jiao Tong University, Shanghai 200240, China
2School of ICE, University of Electronic Science and Technology of China, Chengdu 610097, China
3Department of ECE, University of Calgary, Calgary, AB T2N 1N4, Canada
ABSTRACT
The fusion of images taken by heterogeneous sensors helps
to enrich information and improve imaging quality. In this
article, we present a hybrid model consisting of
a convolutional encoder and a Transformer-based decoder to
fuse multimodal images. In the encoder, a non-local cross-
modal attention block is proposed to capture both local and
global dependencies of multiple source images. A branch
fusion module is designed to adaptively fuse the features of
the two branches. We embed a Transformer module with lin-
ear complexity in the decoder to enhance the reconstruction
capability of the proposed network. Qualitative and quanti-
tative experiments demonstrate the effectiveness of the pro-
posed method by comparing it with existing state-of-the-art
fusion models. The source code of our work is available at
https://github.com/pandayuanyu/HCFusion.
Index Terms—Multimodal Image Fusion, Hybrid CNN-
Transformer Architecture, Non-local Cross-modal Attention
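As a rough illustration of the non-local cross-modal attention idea summarized in the abstract, the PyTorch sketch below lets every spatial position of one modality's feature map attend to every position of the other's, followed by a residual connection. The class name, channel sizes, and single-head design are our own assumptions for exposition, not the released implementation (see the repository linked above for the authors' code).

```python
import torch
import torch.nn as nn


class NonLocalCrossModalAttention(nn.Module):
    """Hypothetical sketch of a non-local cross-modal attention block.

    Queries come from one modality's feature map and keys/values from the
    other's, so every spatial position of x_a can attend to every position
    of x_b (global, cross-modal dependencies).
    """

    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        inner = channels // reduction
        self.query = nn.Conv2d(channels, inner, kernel_size=1)  # from modality A
        self.key = nn.Conv2d(channels, inner, kernel_size=1)    # from modality B
        self.value = nn.Conv2d(channels, inner, kernel_size=1)  # from modality B
        self.out = nn.Conv2d(inner, channels, kernel_size=1)    # back to input width

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x_a.shape
        q = self.query(x_a).flatten(2).transpose(1, 2)             # (B, HW, C')
        k = self.key(x_b).flatten(2)                               # (B, C', HW)
        v = self.value(x_b).flatten(2).transpose(1, 2)             # (B, HW, C')
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)   # (B, HW, HW)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)        # (B, C', H, W)
        return x_a + self.out(y)  # residual connection around the attention path
```

For example, with 64-channel encoder features from the visible and infrared branches, NonLocalCrossModalAttention(64)(feat_vis, feat_ir) would return visible-branch features enriched with global infrared context, and swapping the arguments gives the symmetric direction. Note that this dense HW-by-HW attention is quadratic in the number of pixels; the linear-complexity Transformer mentioned above would require a different formulation.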
1. INTRODUCTION
Multimodal image fusion refers to combining complementary
information from multi-source images to generate a fused im-
age of improved quality [1, 2]. For example, visible images
focus on the details of the scene textures, while infrared
images reflect the temperature information of objects. The
fusion of visible and infrared images produces an image with
rich details as well as salient objects. In the field of medical
imaging, the fusion of images of lesions taken by multiple
instruments can contribute to a more accurate diagnosis.
Feature extraction and fusion rules are two core issues
in multimodal image fusion. In the past few decades, many
traditional methods have been proposed to solve these two
problems. These methods can be divided into multi-scale
transform [3, 4], sparse representation [5], subspace [6], hy-
brid models [7], saliency-based [8], and other methods [1].
Although these conventional methods can achieve satisfactory
results in many cases, they tend to weaken the essential
characteristics of the source images [see Fig. 1(c)].
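For concreteness, the sketch below shows one generic multi-scale transform baseline of the kind cited in [3, 4]: Laplacian-pyramid fusion with a max-absolute rule for the detail bands and averaging of the low-pass residual. It is a textbook illustration under our own simplifying assumptions (registered single-channel 8-bit inputs, OpenCV pyramids), not a reimplementation of the cited methods; averaging the residual is exactly the kind of step that can wash out modality-specific characteristics.

```python
import cv2
import numpy as np


def laplacian_pyramid(img: np.ndarray, levels: int = 4) -> list:
    """Build a Laplacian pyramid; the last element is the low-pass residual."""
    pyr, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        pyr.append(cur - up)   # detail (band-pass) coefficients at this scale
        cur = down
    pyr.append(cur)            # low-pass residual
    return pyr


def fuse_laplacian(img_a: np.ndarray, img_b: np.ndarray, levels: int = 4) -> np.ndarray:
    """Fuse two registered grayscale images with a classic multi-scale rule."""
    pyr_a, pyr_b = laplacian_pyramid(img_a, levels), laplacian_pyramid(img_b, levels)
    # Max-absolute rule keeps the stronger detail coefficient from either modality;
    # the low-pass residuals are simply averaged.
    fused = [np.where(np.abs(a) >= np.abs(b), a, b)
             for a, b in zip(pyr_a[:-1], pyr_b[:-1])]
    fused.append(0.5 * (pyr_a[-1] + pyr_b[-1]))
    # Collapse the fused pyramid back into a single image.
    out = fused[-1]
    for band in reversed(fused[:-1]):
        out = cv2.pyrUp(out, dstsize=(band.shape[1], band.shape[0])) + band
    return np.clip(out, 0, 255).astype(np.uint8)
```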
Fig. 1. Fusion results for MRI and SPECT images using different methods: (a) MRI, (b) SPECT, (c) Tan et al. [4], (d) Lahoud et al. [13], (e) ours. Tan et al. weakens the features of SPECT, and Lahoud et al. over-enhances the textures of MRI, while our method generates image (e) with smoother transitions between modalities while retaining appropriate details.
Benefiting from its strong ability of feature extraction
and representation, deep learning has many applications in
multimodal image fusion. Prabhakar et al. [9] proposed a
convolution-based network termed DeepFuse for multi-
exposure image fusion (MEF). Building on DeepFuse, Li et al.
[10] proposed the visible and infrared image fusion network
DenseFuse, by replacing the feature extraction layer with
dense blocks and redesigning the fusion strategy. Ma et al.
[11] proposed an infrared and visible image fusion frame-
work based on a generative adversarial network (GAN). Xu
et al. [12] used a weight block to measure the importance
of information from different source images. Lahoud et al.
[13] proposed to extract feature maps of multimodal medical
images using a pre-trained network and designed a weight
fusion strategy with these feature maps. Recently, several
studies have proposed combining image fusion with high-level vision
tasks. Tang et al. [14] proposed SeAFusion, which bridges
the gap between image fusion and semantic features. While