
MULTIMODAL IMAGE FUSION BASED ON HYBRID CNN-TRANSFORMER AND
NON-LOCAL CROSS-MODAL ATTENTION
Yu Yuan1, Jiaqi Wu2, Zhongliang Jing1, Henry Leung3, Han Pan1
1School of AA, Shanghai Jiao Tong University, Shanghai 200240, China
2School of ICE, University of Electronic Science and Technology of China, Chengdu 610097, China
3Department of ECE, University of Calgary, Calgary, AB T2N 1N4, Canada
ABSTRACT
The fusion of images taken by heterogeneous sensors helps
to enrich information and improve imaging quality. In this
article, we present a hybrid model consisting of
a convolutional encoder and a Transformer-based decoder to
fuse multimodal images. In the encoder, a non-local cross-
modal attention block is proposed to capture both local and
global dependencies of multiple source images. A branch
fusion module is designed to adaptively fuse the features of
the two branches. We embed a Transformer module with lin-
ear complexity in the decoder to enhance the reconstruction
capability of the proposed network. Qualitative and quanti-
tative experiments demonstrate the effectiveness of the pro-
posed method by comparing it with existing state-of-the-art
fusion models. The source code of our work is available at
https://github.com/pandayuanyu/HCFusion.
Index Terms—Multimodal Image Fusion, Hybrid CNN-
Transformer Architecture, Non-local Cross-modal Attention
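As a rough illustration of the non-local cross-modal attention idea summarized in the abstract, the PyTorch sketch below lets every spatial position of one modality's feature map attend to every position of the other's, followed by a residual connection. The class name, channel sizes, and single-head design are our own assumptions for exposition, not the released implementation (see the repository linked above for the authors' code).

```python
import torch
import torch.nn as nn


class NonLocalCrossModalAttention(nn.Module):
    """Hypothetical sketch of a non-local cross-modal attention block.

    Queries come from one modality's feature map and keys/values from the
    other's, so every spatial position of x_a can attend to every position
    of x_b (global, cross-modal dependencies).
    """

    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        inner = channels // reduction
        self.query = nn.Conv2d(channels, inner, kernel_size=1)  # from modality A
        self.key = nn.Conv2d(channels, inner, kernel_size=1)    # from modality B
        self.value = nn.Conv2d(channels, inner, kernel_size=1)  # from modality B
        self.out = nn.Conv2d(inner, channels, kernel_size=1)    # back to input width

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x_a.shape
        q = self.query(x_a).flatten(2).transpose(1, 2)             # (B, HW, C')
        k = self.key(x_b).flatten(2)                               # (B, C', HW)
        v = self.value(x_b).flatten(2).transpose(1, 2)             # (B, HW, C')
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)   # (B, HW, HW)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)        # (B, C', H, W)
        return x_a + self.out(y)  # residual connection around the attention path
```

For example, with 64-channel encoder features from the visible and infrared branches, NonLocalCrossModalAttention(64)(feat_vis, feat_ir) would return visible-branch features enriched with global infrared context, and swapping the arguments gives the symmetric direction. Note that this dense HW-by-HW attention is quadratic in the number of pixels; the linear-complexity Transformer mentioned above would require a different formulation.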
1. INTRODUCTION
Multimodal image fusion refers to combining complementary
information from multi-source images to generate a fused im-
age of improved quality [1, 2]. For example, visible images
focus on the details of the scene textures, while infrared
images reflect the temperature information of objects. The
fusion of visible and infrared images produces an image with
rich details as well as salient objects. In the field of medical
imaging, the fusion of images of lesions taken by multiple
instruments can contribute to a more accurate diagnosis.
Feature extraction and fusion rules are two core issues
in multimodal image fusion. In the past few decades, many
traditional methods have been proposed to solve these two
problems. These methods can be divided into multi-scale
transform [3, 4], sparse representation [5], subspace [6], hy-
brid models [7], saliency-based [8], and other methods [1].
Although these conventional methods can achieve satisfactory
results in many cases, they tend to weaken the essential
characteristics of the source images [see Fig. 1(c)].
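For concreteness, the sketch below shows one generic multi-scale transform baseline of the kind cited in [3, 4]: Laplacian-pyramid fusion with a max-absolute rule for the detail bands and averaging of the low-pass residual. It is a textbook illustration under our own simplifying assumptions (registered single-channel 8-bit inputs, OpenCV pyramids), not a reimplementation of the cited methods; averaging the residual is exactly the kind of step that can wash out modality-specific characteristics.

```python
import cv2
import numpy as np


def laplacian_pyramid(img: np.ndarray, levels: int = 4) -> list:
    """Build a Laplacian pyramid; the last element is the low-pass residual."""
    pyr, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        pyr.append(cur - up)   # detail (band-pass) coefficients at this scale
        cur = down
    pyr.append(cur)            # low-pass residual
    return pyr


def fuse_laplacian(img_a: np.ndarray, img_b: np.ndarray, levels: int = 4) -> np.ndarray:
    """Fuse two registered grayscale images with a classic multi-scale rule."""
    pyr_a, pyr_b = laplacian_pyramid(img_a, levels), laplacian_pyramid(img_b, levels)
    # Max-absolute rule keeps the stronger detail coefficient from either modality;
    # the low-pass residuals are simply averaged.
    fused = [np.where(np.abs(a) >= np.abs(b), a, b)
             for a, b in zip(pyr_a[:-1], pyr_b[:-1])]
    fused.append(0.5 * (pyr_a[-1] + pyr_b[-1]))
    # Collapse the fused pyramid back into a single image.
    out = fused[-1]
    for band in reversed(fused[:-1]):
        out = cv2.pyrUp(out, dstsize=(band.shape[1], band.shape[0])) + band
    return np.clip(out, 0, 255).astype(np.uint8)
```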
Fig. 1. Fusion results for MRI and SPECT images using different methods: (a) MRI, (b) SPECT, (c) Tan et al. [4], (d) Lahoud et al. [13], (e) ours. Tan et al. weakens the features of SPECT, and Lahoud et al. over-enhances the textures of MRI, while our method generates image (e) with smoother transitions between modalities while retaining appropriate details.
Benefiting from its strong ability of feature extraction
and representation, deep learning has many applications in
multimodal image fusion. Prabhakar et al. [9] proposed a
convolution-based network termed DeepFuse for multi-
exposure image fusion (MEF). Building on DeepFuse, Li et al.
[10] proposed the visible and infrared image fusion network
DenseFuse, by replacing the feature extraction layer with
dense blocks and redesigning the fusion strategy. Ma et al.
[11] proposed an infrared and visible image fusion frame-
work based on a generative adversarial network (GAN). Xu
et al. [12] used a weight block to measure the importance
of information from different source images. Lahoud et al.
[13] proposed to extract feature maps of multimodal medical
images using a pre-trained network and designed a weight
fusion strategy with these feature maps. Recently, several
studies have proposed combining image fusion with high-level vision
tasks. Tang et al. [14] proposed SeAFusion, which bridges
the gap between image fusion and semantic features. While