Single Image Super-Resolution Using Lightweight Networks Based on Swin
Transformer
Bolong Zhang¹, Juan Chen¹, Quan Wen¹
¹University of Electronic Science and Technology of China
arXiv:2210.11019v1 [eess.IV] 20 Oct 2022
Abstract
Image super-resolution reconstruction is an important task in the field of image processing, which restores a low-resolution image to a high-quality, high-resolution image. In recent years, deep learning has been applied to image super-resolution reconstruction. With the continuous development of deep neural networks, the quality of the reconstructed images has been greatly improved, but model complexity has also increased. In this paper, we propose two lightweight models based on the Swin Transformer, named MSwinSR and UGSwinSR. The most important structure in MSwinSR is the Multi-size Swin Transformer Block (MSTB), which mainly contains four parallel multi-head self-attention (MSA) blocks. UGSwinSR combines U-Net and GAN with the Swin Transformer. Both models reduce model complexity, but MSwinSR achieves higher objective quality, while UGSwinSR achieves higher perceptual quality. The experimental results demonstrate that MSwinSR increases PSNR by 0.07 dB compared with the state-of-the-art model SwinIR, while the number of parameters is reduced by 30.68% and the computational cost by 9.936%. UGSwinSR can effectively reduce the computational cost of the network, by 90.92% compared with SwinIR.
1. Introduction
Super-resolution (SR) is a computer vision and image processing technique that reconstructs a high-resolution (HR) image from its low-resolution (LR) counterpart [3,4,5]. It can improve the clarity of the displayed image without changing the physical properties of the imaging equipment. Therefore, this technology can be applied to medical imaging [6], security monitoring [7], remote sensing image processing [8] and other fields, reducing the cost of upgrading imaging equipment when higher image quality is required. In addition, it can also be applied as an early stage of data preprocessing [9], which can effectively improve the performance of edge detection, semantic segmentation, digit recognition and scene recognition.
Due to the successful performance of convolutional neural networks (CNNs) in computer vision, they have also been used in the field of SR [10,11,12,13,14], and the quality of reconstructed images has been greatly improved compared with traditional machine learning algorithms. However, CNNs have a notable disadvantage: the convolution kernel cannot adapt well to the content of the image. This deficiency can be addressed by replacing the CNN with a Transformer [15], which uses a self-attention mechanism to capture the global content information of images and has shown promising performance in several computer vision tasks [16,17,18,19]. However, the Transformer incurs a large computational cost. Recently, the Swin Transformer [18] has been introduced to reduce this computation, and it can capture local information of image content just like a CNN.
SwinIR [1] applied the Swin Transformer to SR for the first time. Compared with CNNs and networks with attention mechanisms, SwinIR has fewer parameters and produces reconstructed images of higher objective quality. But SwinIR also has the following drawbacks: (1) Since the attention mechanism is computed from the global information of the source image, the reconstructed image is overly smooth and some local details are difficult to recover. This has little effect on higher-resolution images, but it greatly reduces the perceptual quality of small-size images. (2) Besides the Swin Transformer blocks, SwinIR also uses a large number of convolutional layers, which increases the computational cost of the network; if these convolutional layers are removed, the reconstruction quality of the image drops greatly. (3) To suit the specific requirements of SR, SwinIR removes the downsampling operation of the Swin Transformer. Although this reduces the number of parameters, it also increases the computational cost of the model and makes it difficult to extract deeper features of the images.
[Figure 1: Comparison of the reconstructed image (×4) generated by different models on CelebA image 086750.jpg; panels: HR, LR, SwinIR [1], SRGAN [2], MSwinSR (ours), UGSwinSR (ours). SwinIR [1] and MSwinSR use the Swin Transformer with high objective quality; SRGAN [2] and UGSwinSR use a GAN with high perceptual quality.]

In this paper, we propose two models to solve the problems mentioned above. The first model, named Multi-size Swin SR (MSwinSR), uses multiple blocks with different
attention windows to process feature maps in parallel, so that a single multilayer perceptron (MLP) block can process the information obtained by multiple Swin Transformer blocks simultaneously. It can therefore reduce the number of MLP blocks, building a lightweight network with both less computation and fewer parameters. In other words, the depth of the original network is reduced while its width increases. As a result, the performance of MSwinSR does not change greatly compared with SwinIR, since it uses the same number of Swin Transformer blocks, which play an important role in SR.
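To make the structure concrete, the following is a minimal PyTorch sketch of the idea described above: four parallel window self-attention branches with different window sizes feeding one shared MLP. The class name, window sizes, MLP width and the averaging fusion are our assumptions; the paper's exact MSTB definition is not shown in this preview.

```python
import torch
import torch.nn as nn


class MSTB(nn.Module):
    """Sketch of a Multi-size Swin Transformer Block: four parallel
    window self-attention branches with different window sizes whose
    fused output is processed by a single shared MLP block."""

    def __init__(self, dim=64, num_heads=4, window_sizes=(2, 4, 8, 16)):
        super().__init__()
        self.window_sizes = window_sizes
        self.norm1 = nn.LayerNorm(dim)
        # One multi-head self-attention module per window size
        # (dim must be divisible by num_heads).
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in window_sizes
        )
        self.norm2 = nn.LayerNorm(dim)
        # A single MLP handles the information from all four branches.
        self.mlp = nn.Sequential(
            nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim)
        )

    def forward(self, x, h, w):
        # x: (B, h*w, C) token sequence; h and w must be divisible
        # by every window size.
        shortcut = x
        x = self.norm1(x)
        branch_outs = []
        for ws, attn in zip(self.window_sizes, self.attns):
            windows = self._partition(x, h, w, ws)   # (B*nW, ws*ws, C)
            y, _ = attn(windows, windows, windows)   # window self-attention
            branch_outs.append(self._merge(y, h, w, ws))
        # Fuse the parallel branches (a simple average here, purely
        # illustrative) and apply the shared MLP with residuals.
        x = shortcut + sum(branch_outs) / len(branch_outs)
        return x + self.mlp(self.norm2(x))

    @staticmethod
    def _partition(x, h, w, ws):
        b, _, c = x.shape
        x = x.reshape(b, h // ws, ws, w // ws, ws, c)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

    @staticmethod
    def _merge(x, h, w, ws):
        b = x.shape[0] // ((h // ws) * (w // ws))
        c = x.shape[-1]
        x = x.reshape(b, h // ws, w // ws, ws, ws, c)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(b, h * w, c)
```

For example, on a 48×48 LR patch with dim=64 the input tokens have shape (B, 2304, 64), and all four window sizes divide 48 evenly. Whether the real MSTB averages, concatenates, or linearly projects the branch outputs before the MLP is not specified in this preview.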
The second model named as U-net GAN Swin SR
(UGSwinSR). We add the downsampling operation into
SwinIR, so that the deep features of the image could be
extracted. However, due to the particularity of SR, up-
sampling operation is required after downsampling to re-
store the features. Therefore, we refer to the design of
U-net[20,21] and removed the convolutional layer, which
greatly reduces the computation of the network. Due to the
downsampling operation, the original image information is
destroyed, hence this structure can only be used for extract-
ing the deep features. But for the original image, other
methods that can restore LR to HR are required. In order
to reduce the size of the model, we use BICUBIC[22], a
simple interpolation, to obtain HR. Generative Adversarial
Network (GAN)[14] is also used in this model for higher
perception quality. The results show that we can obtain
promising image perception quality with very low compu-
tational cost.
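The preview does not spell out how UGSwinSR combines these two paths, but the description suggests roughly the following shape. In this minimal PyTorch sketch, the class name, the swin_unet argument, and the residual combination of the bicubic base with the deep-feature detail are all our assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F


class UGSwinSRGenerator(nn.Module):
    """High-level sketch of a UGSwinSR-style generator: a
    convolution-free, U-Net-style Swin encoder/decoder (downsampling
    then upsampling) extracts deep features, while plain bicubic
    interpolation restores the original content that downsampling
    would otherwise destroy."""

    def __init__(self, swin_unet, scale=4):
        super().__init__()
        # swin_unet: assumed to map a (B, 3, H, W) LR image to a
        # (B, 3, H, W) residual built from its deep features.
        self.swin_unet = swin_unet
        self.scale = scale

    def forward(self, lr):
        # Bicubic interpolation (not a learned layer) supplies the HR base.
        base = F.interpolate(
            lr, scale_factor=self.scale, mode="bicubic", align_corners=False
        )
        # The Swin U-Net supplies deep-feature detail, upsampled to match.
        detail = F.interpolate(
            self.swin_unet(lr), scale_factor=self.scale,
            mode="bicubic", align_corners=False
        )
        return base + detail
```

Training would pair such a generator with a discriminator and an adversarial loss in the spirit of SRGAN [2,14]; the residual addition above is only one plausible reading of how the interpolated image and the deep features are combined.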
2. Related Work
2.1. Image Super-Resolution
Image super-resolution was first proposed by Harris [23] in the 1960s. Early image super-resolution techniques were mainly based on interpolation methods, such as nearest-neighbor interpolation, bilinear interpolation and bicubic interpolation [22]. With the continuous development of machine learning, Freeman et al. [24] introduced machine learning to the field of SR for the first time in 2000. Subsequently, a variety of reconstruction algorithms based on machine learning emerged, such as algorithms based on neighborhood embedding [25], sparse representation [26] and local linear regression [27].
However, traditional super-resolution reconstruction algorithms and machine learning algorithms mostly make use of the low-level features of the image, so the reconstruction performance is greatly limited, and it is difficult to reconstruct the edges, contours, textures and other details of the high-resolution image. Therefore, in order to extract deep features of images, deep learning has been applied to the field of super-resolution.

Dong et al. [10] applied a convolutional neural network to the field of image super-resolution for the first time in 2014. This work inspired researchers to apply neural networks to super-resolution, and a large number of deep-learning-based super-resolution reconstruction algorithms have since been proposed [11,12,13,14].
2.2. Swin Transformer
The attention mechanism was first used in the field of natural language processing to address the problems that recurrent neural networks [28] cannot perform parallel computation and suffer from long-term dependence on sequential information. The most famous attention-based model is the Transformer [15], which is mainly used in natural language processing and has achieved great success there. Vision Transformer (ViT) [17] introduced the Transformer into the field of computer vision for the first time and produced promising results. The Swin Transformer [18] improves on ViT by introducing shifted windows: on the one hand, it gains the ability to process local information, and on the other hand, it requires less computation than ViT. SwinIR [1] uses the Swin Transformer model and currently performs well in the field of image restoration.
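To make the computational advantage concrete, the standard complexity comparison given in the Swin Transformer paper [18] (restated here for reference) is, for an $h \times w$ feature map with channel dimension $C$ and window size $M$:

$$\Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C, \qquad \Omega(\mathrm{W\text{-}MSA}) = 4hwC^{2} + 2M^{2}hwC.$$

Global multi-head self-attention (MSA) is quadratic in the number of tokens $hw$, while window-based self-attention (W-MSA) is linear in $hw$ for a fixed $M$, which is what makes the Swin Transformer tractable on large feature maps.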