Single Image Super-Resolution Using Lightweight Networks Based on Swin
Transformer
Bolong Zhang¹, Juan Chen¹, Quan Wen¹
¹University of Electronic Science and Technology of China
Abstract
Image super-resolution reconstruction is an important task in the field of image processing, which restores a low-resolution image to a high-quality, high-resolution image. In recent years, deep learning has been applied to image super-resolution reconstruction. With the continuous development of deep neural networks, the quality of reconstructed images has been greatly improved, but model complexity has also increased. In this paper, we propose two lightweight models based on Swin Transformer, named MSwinSR and UGSwinSR. The most important structure in MSwinSR is the Multi-size Swin Transformer Block (MSTB), which mainly contains four parallel multi-head self-attention (MSA) blocks. UGSwinSR combines U-Net and GAN with Swin Transformer. Both models reduce model complexity, but MSwinSR achieves higher objective quality, while UGSwinSR achieves higher perceptual quality. The experimental results demonstrate that MSwinSR increases PSNR by 0.07 dB compared with the state-of-the-art model SwinIR, while the number of parameters is reduced by 30.68% and the computational cost by 9.936%. UGSwinSR effectively reduces the computational cost of the network, by 90.92% compared with SwinIR.
1. Introduction
Super-resolution (SR) is a computer vision and image processing technology that reconstructs a high-resolution (HR) image from its low-resolution (LR) counterpart [3,4,5]. It can improve the clarity of the displayed image without changing the physical properties of the imaging equipment. Therefore, this technology can be applied to medical imaging [6], security monitoring [7], remote sensing image processing [8] and other fields, reducing the cost of upgrading imaging equipment to meet the demand for higher image quality. In addition, it can also be applied in the early stage of data preprocessing [9], where it can effectively improve the performance of edge detection, semantic segmentation, digit recognition and scene recognition.
Due to the success of convolutional neural networks (CNNs) in the field of computer vision, they have also been applied to SR [10,11,12,13,14], and the quality of reconstructed images has been greatly improved compared with traditional machine learning algorithms. However, CNNs also have some disadvantages, because the convolution kernel cannot adapt well to the content of the image. This deficiency can be remedied by replacing the CNN with a Transformer [15], which uses a self-attention mechanism to capture the global content information of images and has shown promising performance in several computer vision tasks [16,17,18,19]. However, the Transformer incurs a large computational cost. Recently, the Swin Transformer [18] has been introduced to reduce this computation, and it can capture local information of the image content just like a CNN.
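As a concrete reminder of why the window-based design reduces computation (these complexity estimates come from the Swin Transformer paper [18], not from this work), consider an $h \times w$ feature map with $C$ channels and window size $M$. The costs of global and window-based multi-head self-attention are

$$\Omega(\mathrm{MSA}) = 4hwC^2 + 2(hw)^2C,$$
$$\Omega(\mathrm{W\text{-}MSA}) = 4hwC^2 + 2M^2hwC,$$

so the window-based variant scales linearly rather than quadratically with the number of tokens $hw$ when $M$ is fixed.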
SwinIR [1] applied the Swin Transformer to SR for the first time. Compared with CNNs and networks with attention mechanisms, SwinIR has the advantages of fewer parameters and higher objective quality of reconstructed images. However, SwinIR also has the following drawbacks: (1) Since the attention mechanism is computed over the global information of the source image, the overall reconstructed image is relatively smooth, and some local details are difficult to recover. This has little effect on higher-resolution images, but it greatly reduces the perceptual quality of small-size images. (2) Besides the Swin Transformer blocks, SwinIR also uses a large number of convolutional layers, which increases the amount of computation in the network; if these convolutional layers are removed, the reconstruction quality of the image drops sharply. (3) To adapt to the specific problem of SR, SwinIR removes the downsampling operation of the Swin Transformer. Although this reduces the number of parameters, it also increases the computational cost of the model and makes it difficult to extract deeper features from the images.
In this paper, we propose two models to solve the problems mentioned above. The first model, named Multi-size Swin SR (MSwinSR), uses multiple blocks with different