MULTIMODAL IMAGE FUSION BASED ON HYBRID CNN-TRANSFORMER AND
NON-LOCAL CROSS-MODAL ATTENTION
Yu Yuan1, Jiaqi Wu2, Zhongliang Jing1, Henry Leung3, Han Pan1
1School of AA, Shanghai Jiao Tong University, Shanghai 200240, China
2School of ICE, University of Electronic Science and Technology of China, Chengdu 610097, China
3Department of ECE, University of Calgary, Calgary, AB T2N 1N4, Canada
ABSTRACT
The fusion of images taken by heterogeneous sensors helps to enrich the information and improve the quality of imaging. In this article, we present a hybrid model consisting of a convolutional encoder and a Transformer-based decoder to fuse multimodal images. In the encoder, a non-local cross-modal attention block is proposed to capture both local and global dependencies of multiple source images. A branch fusion module is designed to adaptively fuse the features of the two branches. We embed a Transformer module with linear complexity in the decoder to enhance the reconstruction capability of the proposed network. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed method by comparing it with existing state-of-the-art fusion models. The source code of our work is available at https://github.com/pandayuanyu/HCFusion.
Index Terms: Multimodal Image Fusion, Hybrid CNN-Transformer Architecture, Non-local Cross-modal Attention
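To make the encoder's key component concrete, below is a minimal PyTorch sketch of one plausible form of a non-local cross-modal attention block, in which one modality queries the other so that every spatial position of the first branch can attend to every position of the second. The class name, channel sizes, and residual placement are illustrative assumptions, not the authors' exact design; see the repository above for the real implementation.

import torch
import torch.nn as nn

class NonLocalCrossModalAttention(nn.Module):
    """Minimal sketch: modality A queries modality B globally."""

    def __init__(self, channels, inter_channels=None):
        super().__init__()
        inter_channels = inter_channels or max(channels // 2, 1)
        self.query = nn.Conv2d(channels, inter_channels, 1)   # from modality A
        self.key = nn.Conv2d(channels, inter_channels, 1)     # from modality B
        self.value = nn.Conv2d(channels, inter_channels, 1)   # from modality B
        self.proj = nn.Conv2d(inter_channels, channels, 1)

    def forward(self, feat_a, feat_b):
        b, _, h, w = feat_a.shape
        q = self.query(feat_a).flatten(2).transpose(1, 2)     # (B, HW, C')
        k = self.key(feat_b).flatten(2)                       # (B, C', HW)
        v = self.value(feat_b).flatten(2).transpose(1, 2)     # (B, HW, C')
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, -1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return feat_a + self.proj(out)  # residual, as in non-local blocks

# Example: let 64-channel infrared features attend over visible features.
block = NonLocalCrossModalAttention(64)
fused = block(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))

Note that the (HW x HW) attention map is quadratic in the number of pixels; the linear-complexity Transformer module mentioned for the decoder exists precisely to avoid this cost at full resolution.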
1. INTRODUCTION
Multimodal image fusion refers to combining complementary information from multi-source images to generate a fused image of improved quality [1, 2]. For example, visible images focus on the details of the scene textures, while infrared images reflect the temperature information of objects. The fusion of visible and infrared images produces an image with rich details as well as salient objects. In the field of medical imaging, the fusion of images of lesions taken by multiple instruments can contribute to a more accurate diagnosis.
Feature extraction and fusion rules are two core issues in multimodal image fusion. In the past few decades, many traditional methods have been proposed to solve these two problems. These methods can be divided into multi-scale transform [3, 4], sparse representation [5], subspace [6], hybrid models [7], saliency-based [8], and other methods [1]. Although these conventional methods could achieve satisfactory results in many cases, they tend to weaken the essential characteristics of source images [see Fig. 1(c)].
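As a concrete instance of the multi-scale transform family mentioned above, the sketch below fuses two registered, same-size grayscale images with a Laplacian pyramid, selecting detail coefficients by maximum absolute value and averaging the coarse base layers. This is a generic textbook rule for illustration, not a reimplementation of any cited method; the number of levels and the fusion rules are arbitrary choices.

import cv2
import numpy as np

def laplacian_pyramid(img, levels=4):
    """Decompose an image into detail layers plus a coarse base."""
    pyr, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        pyr.append(cur - up)   # band-pass detail layer
        cur = down
    pyr.append(cur)            # low-pass base layer
    return pyr

def fuse_pyramids(pyr_a, pyr_b):
    """Max-abs rule on detail layers, averaging on the base layer."""
    fused = [np.where(np.abs(a) >= np.abs(b), a, b)
             for a, b in zip(pyr_a[:-1], pyr_b[:-1])]
    fused.append(0.5 * (pyr_a[-1] + pyr_b[-1]))
    return fused

def reconstruct(pyr):
    """Collapse a Laplacian pyramid back into a single image."""
    cur = pyr[-1]
    for detail in reversed(pyr[:-1]):
        cur = cv2.pyrUp(cur, dstsize=(detail.shape[1], detail.shape[0])) + detail
    return np.clip(cur, 0, 255).astype(np.uint8)

# fused = reconstruct(fuse_pyramids(laplacian_pyramid(ir), laplacian_pyramid(vis)))

Such hand-crafted rules operate on fixed transform coefficients, which is one reason they can suppress modality-specific structure as noted above.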
Fig. 1. The fusion results for MRI and SPECT images using different methods: (a) MRI, (b) SPECT, (c) Tan et al. [4], (d) Lahoud et al. [13], (e) ours. Tan et al. [4] weakens the features of SPECT; Lahoud et al. [13] over-enhances the textures of MRI; our method generates image (e) with smoother transitions between modalities while retaining appropriate details.

Benefiting from its strong ability of feature extraction and representation, deep learning has many applications in multimodal image fusion.
Prabhakar et al. [9] proposed a convolution-based network, termed DeepFuse, for multi-exposure image fusion (MEF). Based on DeepFuse, Li et al. [10] proposed the visible and infrared image fusion network DenseFuse by replacing the feature extraction layer with dense blocks and redesigning the fusion strategy. Ma et al. [11] proposed an infrared and visible image fusion framework based on a generative adversarial network (GAN). Xu et al. [12] used a weight block to measure the importance of information from different source images. Lahoud et al. [13] proposed to extract feature maps of multimodal medical images using a pre-trained network and designed a weight fusion strategy based on these feature maps. Recently, several studies have combined fusion with high-level vision tasks: Tang et al. [14] proposed SeAFusion, which bridged the gap between image fusion and semantic features.
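To make the weight-map idea concrete, here is a minimal PyTorch sketch loosely in the spirit of [12, 13]: per-pixel activity measured on shallow features of a pre-trained VGG-19 is turned into softmax blending weights for the two sources. The layer choice, the l1-norm activity measure, and the single-scale blending are illustrative assumptions, not the published strategies.

import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen shallow feature extractor (conv1_1 through relu1_2 of VGG-19).
extractor = vgg19(weights="IMAGENET1K_V1").features[:4].eval()
for p in extractor.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def activity_map(img):
    """Per-pixel activity: l1 norm over channels of shallow VGG features."""
    x = img.repeat(1, 3, 1, 1) if img.shape[1] == 1 else img  # VGG expects RGB
    act = extractor(x).abs().sum(dim=1, keepdim=True)         # (B, 1, H, W)
    return F.interpolate(act, size=img.shape[-2:], mode="bilinear",
                         align_corners=False)

@torch.no_grad()
def weighted_fusion(img_a, img_b):
    """Blend two registered sources with softmax weights over activity."""
    acts = torch.stack([activity_map(img_a), activity_map(img_b)])
    w_a, w_b = torch.softmax(acts, dim=0)   # weights sum to 1 per pixel
    return w_a * img_a + w_b * img_b

Usage assumes (B, C, H, W) tensors with values already normalized; the first call downloads the ImageNet weights.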