Single Image Super-Resolution Using Lightweight Networks Based on Swin
Transformer
Bolong Zhang¹, Juan Chen¹, Quan Wen¹
¹University of Electronic Science and Technology of China
arXiv:2210.11019v1 [eess.IV] 20 Oct 2022
Abstract
Image super-resolution reconstruction is an important task in the field of image processing, which restores a low-resolution image to a high-quality, high-resolution image. In recent years, deep learning has been applied to image super-resolution reconstruction. With the continuous development of deep neural networks, the quality of the reconstructed images has been greatly improved, but model complexity has also increased. In this paper, we propose two lightweight models based on the Swin Transformer, named MSwinSR and UGSwinSR. The most important structure in MSwinSR is the Multi-size Swin Transformer Block (MSTB), which mainly contains four parallel multi-head self-attention (MSA) blocks. UGSwinSR combines U-Net and GAN with the Swin Transformer. Both models reduce model complexity, but MSwinSR achieves higher objective quality, while UGSwinSR achieves higher perceptual quality. The experimental results demonstrate that MSwinSR increases PSNR by 0.07 dB compared with the state-of-the-art model SwinIR, while the number of parameters is reduced by 30.68% and the computational cost by 9.936%. UGSwinSR can effectively reduce the computational cost of the network, by 90.92% compared with SwinIR.
1. Introduction
Super-resolution (SR) is a computer vision and image processing technique that reconstructs a high-resolution (HR) image from its low-resolution (LR) counterpart [3,4,5]. It can improve the clarity of the displayed image without changing the physical properties of the imaging equipment. Therefore, this technology can be applied to medical imaging [6], security monitoring [7], remote sensing image processing [8] and other fields, reducing the cost of upgrading imaging equipment when higher image quality is required. In addition, it can also be applied as an early stage of data preprocessing [9], which can effectively improve the performance of edge detection, semantic segmentation, digit recognition and scene recognition.
Due to the successful performance of convolutional neural networks (CNNs) in computer vision, they have also been used in the field of SR [10,11,12,13,14], and the quality of reconstructed images has been greatly improved compared with traditional machine learning algorithms. However, CNNs have a notable disadvantage: the convolution kernel cannot adapt well to the content of the image. This deficiency can be addressed by replacing the CNN with a Transformer [15], which uses a self-attention mechanism to capture the global content information of images and has shown promising performance in several computer vision tasks [16,17,18,19]. However, the Transformer incurs a large computational cost. Recently, the Swin Transformer [18] has been introduced to reduce this computation, and it can capture local information of image content just like a CNN.
SwinIR [1] applied the Swin Transformer to SR for the first time. Compared with CNNs and networks with attention mechanisms, SwinIR has fewer parameters and produces reconstructed images of higher objective quality. But SwinIR also has the following drawbacks: (1) Since the attention mechanism is computed from the global information of the source image, the reconstructed image is overly smooth and some local details are difficult to recover. This has little effect on higher-resolution images, but it greatly reduces the perceptual quality of small-size images. (2) Besides the Swin Transformer blocks, SwinIR also uses a large number of convolutional layers, which increases the computational cost of the network; if these convolutional layers are removed, the reconstruction quality of the image drops greatly. (3) To suit the specific requirements of SR, SwinIR removes the downsampling operation of the Swin Transformer. Although this reduces the number of parameters, it also increases the computational cost of the model and makes it difficult to extract deeper features of the images.
[Figure 1: Comparison of the reconstructed image (×4) generated by different models on CelebA image 086750.jpg; panels: HR, LR, SwinIR [1], SRGAN [2], MSwinSR (ours), UGSwinSR (ours). SwinIR [1] and MSwinSR use the Swin Transformer with high objective quality; SRGAN [2] and UGSwinSR use a GAN with high perceptual quality.]

In this paper, we propose two models to solve the problems mentioned above. The first model, named Multi-size Swin SR (MSwinSR), uses multiple blocks with different
attention windows to process feature maps in parallel, so that a single multilayer perceptron (MLP) block can process the information obtained by multiple Swin Transformer blocks simultaneously. It can therefore reduce the number of MLP blocks, building a lightweight network with both less computation and fewer parameters. In other words, the depth of the original network is reduced while its width increases. As a result, the performance of MSwinSR does not change greatly compared with SwinIR, since it uses the same number of Swin Transformer blocks, which play an important role in SR.
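To make the structure concrete, the following is a minimal PyTorch sketch of the idea described above: four parallel window self-attention branches with different window sizes feeding one shared MLP. The class name, window sizes, MLP width and the averaging fusion are our assumptions; the paper's exact MSTB definition is not shown in this preview.

```python
import torch
import torch.nn as nn


class MSTB(nn.Module):
    """Sketch of a Multi-size Swin Transformer Block: four parallel
    window self-attention branches with different window sizes whose
    fused output is processed by a single shared MLP block."""

    def __init__(self, dim=64, num_heads=4, window_sizes=(2, 4, 8, 16)):
        super().__init__()
        self.window_sizes = window_sizes
        self.norm1 = nn.LayerNorm(dim)
        # One multi-head self-attention module per window size
        # (dim must be divisible by num_heads).
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in window_sizes
        )
        self.norm2 = nn.LayerNorm(dim)
        # A single MLP handles the information from all four branches.
        self.mlp = nn.Sequential(
            nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim)
        )

    def forward(self, x, h, w):
        # x: (B, h*w, C) token sequence; h and w must be divisible
        # by every window size.
        shortcut = x
        x = self.norm1(x)
        branch_outs = []
        for ws, attn in zip(self.window_sizes, self.attns):
            windows = self._partition(x, h, w, ws)   # (B*nW, ws*ws, C)
            y, _ = attn(windows, windows, windows)   # window self-attention
            branch_outs.append(self._merge(y, h, w, ws))
        # Fuse the parallel branches (a simple average here, purely
        # illustrative) and apply the shared MLP with residuals.
        x = shortcut + sum(branch_outs) / len(branch_outs)
        return x + self.mlp(self.norm2(x))

    @staticmethod
    def _partition(x, h, w, ws):
        b, _, c = x.shape
        x = x.reshape(b, h // ws, ws, w // ws, ws, c)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

    @staticmethod
    def _merge(x, h, w, ws):
        b = x.shape[0] // ((h // ws) * (w // ws))
        c = x.shape[-1]
        x = x.reshape(b, h // ws, w // ws, ws, ws, c)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(b, h * w, c)
```

For example, on a 48×48 LR patch with dim=64 the input tokens have shape (B, 2304, 64), and all four window sizes divide 48 evenly. Whether the real MSTB averages, concatenates, or linearly projects the branch outputs before the MLP is not specified in this preview.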
The second model named as U-net GAN Swin SR
(UGSwinSR). We add the downsampling operation into
SwinIR, so that the deep features of the image could be
extracted. However, due to the particularity of SR, up-
sampling operation is required after downsampling to re-
store the features. Therefore, we refer to the design of
U-net[20,21] and removed the convolutional layer, which
greatly reduces the computation of the network. Due to the
downsampling operation, the original image information is
destroyed, hence this structure can only be used for extract-
ing the deep features. But for the original image, other
methods that can restore LR to HR are required. In order
to reduce the size of the model, we use BICUBIC[22], a
simple interpolation, to obtain HR. Generative Adversarial
Network (GAN)[14] is also used in this model for higher
perception quality. The results show that we can obtain
promising image perception quality with very low compu-
tational cost.
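The preview does not spell out how UGSwinSR combines these two paths, but the description suggests roughly the following shape. In this minimal PyTorch sketch, the class name, the swin_unet argument, and the residual combination of the bicubic base with the deep-feature detail are all our assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F


class UGSwinSRGenerator(nn.Module):
    """High-level sketch of a UGSwinSR-style generator: a
    convolution-free, U-Net-style Swin encoder/decoder (downsampling
    then upsampling) extracts deep features, while plain bicubic
    interpolation restores the original content that downsampling
    would otherwise destroy."""

    def __init__(self, swin_unet, scale=4):
        super().__init__()
        # swin_unet: assumed to map a (B, 3, H, W) LR image to a
        # (B, 3, H, W) residual built from its deep features.
        self.swin_unet = swin_unet
        self.scale = scale

    def forward(self, lr):
        # Bicubic interpolation (not a learned layer) supplies the HR base.
        base = F.interpolate(
            lr, scale_factor=self.scale, mode="bicubic", align_corners=False
        )
        # The Swin U-Net supplies deep-feature detail, upsampled to match.
        detail = F.interpolate(
            self.swin_unet(lr), scale_factor=self.scale,
            mode="bicubic", align_corners=False
        )
        return base + detail
```

Training would pair such a generator with a discriminator and an adversarial loss in the spirit of SRGAN [2,14]; the residual addition above is only one plausible reading of how the interpolated image and the deep features are combined.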
2. Related Work
2.1. Image Super-Resolution
Image super-resolution was first proposed by Harris [23] in the 1960s. Early image super-resolution techniques were mainly based on interpolation methods, such as nearest-neighbor interpolation, bilinear interpolation and bicubic interpolation [22]. With the continuous development of machine learning, Freeman et al. [24] introduced machine learning to the field of SR for the first time in 2000. Subsequently, a variety of reconstruction algorithms based on machine learning emerged, such as algorithms based on neighborhood embedding [25], sparse representation [26] and local linear regression [27].
However, traditional super-resolution reconstruction algorithms and machine learning algorithms mostly make use of the low-level features of the image, so the reconstruction performance is greatly limited, and it is difficult to reconstruct the edges, contours, textures and other details of the high-resolution image. Therefore, in order to extract deep features of images, deep learning has been applied to the field of super-resolution.

Dong et al. [10] applied a convolutional neural network to the field of image super-resolution for the first time in 2014. This work inspired researchers to apply neural networks to super-resolution, and a large number of deep-learning-based super-resolution reconstruction algorithms have since been proposed [11,12,13,14].
2.2. Swin Transformer
The attention mechanism was first used in the field of natural language processing to address the problems that recurrent neural networks [28] cannot perform parallel computation and suffer from long-term dependence on sequential information. The most famous attention-based model is the Transformer [15], which is mainly used in natural language processing and has achieved great success there. Vision Transformer (ViT) [17] introduced the Transformer into the field of computer vision for the first time and produced promising results. The Swin Transformer [18] improves on ViT by introducing shifted windows: on the one hand, it gains the ability to process local information, and on the other hand, it requires less computation than ViT. SwinIR [1] uses the Swin Transformer model and currently performs well in the field of image restoration.
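To make the computational advantage concrete, the standard complexity comparison given in the Swin Transformer paper [18] (restated here for reference) is, for an $h \times w$ feature map with channel dimension $C$ and window size $M$:

$$\Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C, \qquad \Omega(\mathrm{W\text{-}MSA}) = 4hwC^{2} + 2M^{2}hwC.$$

Global multi-head self-attention (MSA) is quadratic in the number of tokens $hw$, while window-based self-attention (W-MSA) is linear in $hw$ for a fixed $M$, which is what makes the Swin Transformer tractable on large feature maps.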