diagnosing the same anatomical structure. Generally, T1-, T2-, PD-, and FS-PD-weighted images are produced together in the acquisition of MRI. Clinically, PD-weighted images have shorter repetition and echo times than T2-weighted images [17].
Inspired by this, some Ref-SR based methods leverage HR PD-weighted images to recover HR T2 images from LR T2 images. For Ref-SR, some studies simply complement LR image features with Ref image features through element-wise addition or concatenation, which yields only a limited improvement in LR image quality. Subsequently, a series of methods adopt deformable convolution [18], [14] to fuse Ref image features and LR image features. Existing state-of-the-art (SOTA) feature fusion methods further unfold the images into patches and adopt a Transformer-based cross-attention (CA) mechanism [14], [15] to calculate the correlation between patches of the LR and Ref images. These methods have verified that feature alignment, i.e., matching valuable information from Ref image patches to the corresponding LR image patches, strongly impacts the reconstruction of HR images.
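To make the patch-matching step concrete, the following minimal PyTorch-style sketch illustrates how a correlation matrix between unfolded LR and Ref feature patches can be computed. The helper name, feature shapes, and cosine-similarity formulation are illustrative assumptions, not the exact procedure of [14], [15].

```python
import torch
import torch.nn.functional as F

def patch_correlation(lr_feat, ref_feat, patch_size=3, stride=1):
    """Correlation between LR and Ref feature patches (illustrative sketch).

    lr_feat, ref_feat: (B, C, H, W) feature maps.
    Returns a (B, N_lr, N_ref) cosine-similarity matrix and, for every LR
    patch, the index of its most relevant Ref patch (hard attention).
    """
    # Unfold feature maps into flattened patches: (B, C*k*k, N).
    lr_patches = F.unfold(lr_feat, kernel_size=patch_size,
                          stride=stride, padding=patch_size // 2)
    ref_patches = F.unfold(ref_feat, kernel_size=patch_size,
                           stride=stride, padding=patch_size // 2)

    # Normalize so the inner product becomes a cosine similarity.
    lr_patches = F.normalize(lr_patches, dim=1)
    ref_patches = F.normalize(ref_patches, dim=1)

    # Correlation matrix between every LR patch and every Ref patch.
    corr = torch.bmm(lr_patches.transpose(1, 2), ref_patches)  # (B, N_lr, N_ref)

    # Index of the most relevant Ref patch per LR patch.
    best_idx = corr.argmax(dim=-1)                              # (B, N_lr)
    return corr, best_idx

# Toy usage with assumed shapes.
lr = torch.randn(1, 64, 40, 40)
ref = torch.randn(1, 64, 40, 40)
corr, idx = patch_correlation(lr, ref)
```

In typical Ref-SR pipelines, such indices are then used to gather the corresponding Ref features onto the LR grid before fusion.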
Distinct from natural images, MRI images are monochromatic and their object boundaries are more ambiguous. Relevant experiments indicate that, despite the existence of authentic high-frequency details in the Ref images, the network cannot completely transfer these details to the HR images. We divide MRI images into two parts: foreground and background. Specifically, the foreground contains the tissues and textures of interest, which are important for SR. The background consists of less important regions, such as the black regions and the skeleton, where the pixel intensities are nearly uniform. In principle, the cross-attention (CA) methods only search for the most relevant regions but ignore the variety of foreground scales. Through extensive experiments and observations, we find that the flexibility of patches has a significant effect on feature alignment. Specifically, we consider two cases:
1) When the foregrounds of the LR and Ref images have different scales, a patch of fixed size covers different portions of the foreground in each image. However, since the two foregrounds share the same semantic information, this semantic similarity leads to mismatches between LR and Ref image patches, as illustrated in Fig. 1(a).
2) Assuming that the foreground scales of the LR and Ref images are the same, the cross-attention (CA) method ignores the harmony between the patch size and the scale of the foreground. A fixed patch size can hardly adapt to the various foreground scales. For example, if the foreground is smaller than the patch, the patch will contain a large amount of extraneous information, which further interferes with the calculation of the correlation matrix, as illustrated in Fig. 1(b). If the patch size is too small, different patches will be mismatched due to the similarity of local features, as illustrated in Fig. 1(c).
In fact, scale diversity has been shown to be important for feature representation [19] and image restoration [20], [21]. Small-scale feature maps can provide more complete semantic information, while large-scale feature maps can provide texture details. Based on this consideration, the core requirements on the patch size can be stated as follows: (1) the patch should contain sufficient foreground information, which contributes to the alignment; (2) at the same time, the patch should not contain too much disturbing background information. To meet these demands, the receptive field of a patch should be adjustable.
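As a concrete but deliberately simplified illustration of adjustable receptive fields, the sketch below matches patches at several sizes and keeps, for each LR position, the scale whose best Ref match is strongest. It reuses the patch_correlation helper from the previous sketch; the specific patch sizes and the winner-take-all selection are assumptions and not the actual S-A/M-A design.

```python
import torch

def multiscale_patch_match(lr_feat, ref_feat, patch_sizes=(3, 5, 9)):
    """Match LR and Ref patches at several patch sizes (receptive fields)
    and, per LR position, keep the scale with the strongest best match.

    With stride=1 and padding=k//2 (odd k), N_lr is the same for every
    patch size, so the per-position comparison across scales is valid.
    """
    best_sim = best_idx = best_scale = None
    for s, k in enumerate(patch_sizes):
        corr, idx = patch_correlation(lr_feat, ref_feat, patch_size=k)
        sim = corr.max(dim=-1).values              # (B, N_lr) best similarity at this scale
        if best_sim is None:
            best_sim, best_idx = sim, idx
            best_scale = torch.zeros_like(idx)
        else:
            better = sim > best_sim                # positions better served by this patch size
            best_sim = torch.where(better, sim, best_sim)
            best_idx = torch.where(better, idx, best_idx)
            best_scale = torch.where(better, torch.full_like(idx, s), best_scale)
    return best_sim, best_idx, best_scale
```

This winner-take-all selection is only meant to show that the effective receptive field can differ per position; the proposed S-A and M-A modules realize this flexibility with pyramid alignment instead.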
Therefore, we propose the Flexible Alignment (FA) module, which generates various patch sizes and receptive fields to improve the precision of feature alignment. Specifically, FA contains the Single-Multi Pyramid Alignment module (S-A) and the Multi-Multi Pyramid Alignment module (M-A), which serve Case 1 and Case 2, respectively. S-A leverages various receptive fields to ensure the completeness of the foreground information. M-A dynamically adjusts the patch sizes of the LR and Ref images to reduce the influence of the background. Additionally, we fuse the multi-scale features with the Cross-Hierarchical Progressive Fusion (CHPF) module, further improving the image quality. Furthermore, a Fourier loss function is introduced to optimize the model (an illustrative formulation is sketched after the contribution list below). Our contributions can be summarized as follows:
• We propose the FASR-Net to transfer the textural information of high-resolution PD-weighted images to low-resolution T2-weighted images and make the recovered textures more realistic.
• Our model combines the Multi-Multi Pyramid Alignment module (M-A) and the Single-Multi Pyramid Alignment module (S-A) to endow feature alignment with flexibility.
• We introduce an effective feature fusion backbone, Cross-Hierarchical Progressive Fusion (CHPF), which takes advantage of the textural information and details of multi-scale features.
Our code will be available at FASR-Net.
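The Fourier loss is not specified in detail above; as a minimal illustrative sketch (an assumed formulation, not necessarily the one used by FASR-Net), a Fourier-domain term can be written as an L1 penalty on the 2-D FFT spectra of the SR output and the HR target:

```python
import torch

def fourier_l1_loss(sr, hr):
    """L1 distance between the 2-D Fourier spectra of SR and HR images.

    sr, hr: (B, C, H, W) tensors. Assumed formulation; the exact Fourier
    loss of FASR-Net may weight amplitude and phase differently.
    """
    sr_fft = torch.fft.fft2(sr, norm="ortho")  # complex spectrum of the SR image
    hr_fft = torch.fft.fft2(hr, norm="ortho")  # complex spectrum of the HR image
    # The modulus of the complex difference penalizes both amplitude and phase errors.
    return (sr_fft - hr_fft).abs().mean()
```

Such frequency-domain terms are commonly used to complement pixel-wise losses, which tend to under-penalize high-frequency errors.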
II. RELATED WORK
A. Single Image Super-Resolution
In recent years, deep learning-based SISR methods have achieved impressive performance. Some coarse-to-fine works [20], [21] have produced attractive results. Cai et al. [20] proposed a novel Transformer-based method, the coarse-to-fine sparse Transformer (CST). Specifically, CST uses a spectra-aware screening mechanism (SASM) for coarse patch selection; the selected patches are then fed into spectra-aggregation hashing multi-head self-attention (SAH-MSA) for fine pixel clustering and self-similarity capturing. Liang et al. [21] focus on speeding up high-resolution photorealistic image-to-image translation (I2IT) tasks based on closed-form Laplacian pyramid decomposition and reconstruction.
Apart from coarse-to-fine methods, Dong et al. propose SRCNN [4], which introduces deep convolutional neural networks to the field of image super-resolution. Thereafter, residual blocks [6] and attention mechanisms [22], [5], [3] are introduced to deepen the network. However, these approaches improve image quality only to a limited extent because of the difficulty of recovering high-frequency details. Ledig et al. [23] first adopt a Generative Adversarial Network (GAN) for SR tasks. According to Ledig et al., minimizing the mean squared error (MSE) loss often fails to recover high-frequency details. Thus, [23] utilizes a perceptual loss which consists of a content loss and an adversarial loss. Wang et al. [24] further propose an enhanced generator and discriminator, obtaining more perceptually convincing results. Yan et al. [25]