Efficient Image Super-Resolution using
Vast-Receptive-Field Attention
Lin Zhou1,∗, Haoming Cai1,∗, Jinjin Gu2,3, Zheyuan Li1, Yingqi Liu1,
Xiangyu Chen1,2,4, Yu Qiao1,2, and Chao Dong1,2,†
1ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime
Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
2Shanghai AI Laboratory, Shanghai, China
3The University of Sydney 4University of Macau
{zhougrace885, helmut.choy, chxy95}@gmail.com, jinjin.gu@sydney.edu.au
{zy.li3, yq.liu3, yu.qiao, chao.dong}@siat.ac.cn
Abstract. The attention mechanism plays a pivotal role in designing
advanced super-resolution (SR) networks. In this work, we design an ef-
ficient SR network by improving the attention mechanism. We start from
a simple pixel attention module and gradually modify it to achieve bet-
ter super-resolution performance with reduced parameters. The specific
approaches include: (1) increasing the receptive field of the attention
branch, (2) replacing large dense convolution kernels with depth-wise
separable convolutions, and (3) introducing pixel normalization. These
approaches paint a clear evolutionary roadmap for the design of atten-
tion mechanisms. Based on these observations, we propose VapSR, the
VAst-receptive-field Pixel attention network. Experiments demonstrate
the superior performance of VapSR. VapSR outperforms existing
lightweight networks with even fewer parameters, and the light version
of VapSR uses only 21.68% and 28.18% of the parameters of IMDB and
RFDN, respectively, to achieve comparable performance. The code and
models are available at https://github.com/zhoumumu/VapSR.
Keywords: Image Super-Resolution, Deep Convolution Network, At-
tention Mechanism
1 Introduction
Single image Super-Resolution (SISR) is a fundamental low-level vision problem
that aims at recovering a high-resolution (HR) image from its low-resolution
(LR) observations. SISR has attracted increasing attention in both the research
community and industry. Since SRCNN [13] introduced deep learning into SR,
deep networks have become the de facto approach for advanced SR algorithms
due to their ease of use and high performance. However, deep SR networks rely
on a large number of parameters that can provide sufficiently complex capacity
to map LR images to HR images. These parameters and high computation costs
limit the application of SR networks. The design of SR networks with efficiency
as the primary goal has gradually become an important issue.

∗ Equal Contributions, † Corresponding Author.
arXiv:2210.05960v1 [eess.IV] 12 Oct 2022
Among the numerous SR networks, the studies related to the attention mech-
anism have achieved a lot of success. The channel attention brought by RCAN
[54] makes it practical to train very deep high-performance SR networks. PAN
[56] has achieved good progress in designing a lightweight SR network using pixel
attention. After image processing entered the Transformer era, the application
of the attention mechanism underwent great changes. Vision Transformers [15]
rely on attention mechanisms to achieve excellent performance. Many works have
proved that introducing large receptive fields and local windows [9,44] in the at-
tention branch improves the SR effect. However, many advanced design ideas
have not been verified in designing the attention mechanism for convolutional
lightweight SR networks. In this paper, we start from a basic pixel attention
module to explore better attention mechanisms designed for efficient SR.
The first effort we made in this paper was to introduce the large receptive
field design into the attention mechanism. This is in line with other recent design
trends using large kernel sizes [16], as well as the design principles of transformers
[9,36,44]. We show the advantages of using large kernel convolutions in the at-
tention branch. Secondly, we use depth-wise separable convolution to split dense
large convolution kernels. A large receptive field is achieved in the attention
branch using a depth-wise and a depth-wise dilated convolution. We also replace
the 3×3 convolutions in the backbone network with 1×1 convolutions to reduce
the number of parameters. Thirdly, we present a novel pixel normalization that
can make the training less prone to crashing.
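A back-of-the-envelope count shows what the second step buys. The sketch below takes the 64-channel width from the architecture diagram, and assumes (for illustration, following VAN's decomposition recipe rather than this paper's exact configuration) a dense 21×21 kernel as the baseline and a dilation rate of 3 for the 7×7 depth-wise dilated convolution:

```python
# Parameter counts for one convolutional layer (bias terms omitted).
C = 64  # channel width used throughout the network

def dense_params(k, c_in=C, c_out=C):
    # every output channel mixes all input channels over a k x k window
    return c_in * c_out * k * k

def depthwise_params(k, c=C):
    # one k x k filter per channel, no cross-channel mixing
    return c * k * k

# A dense 21x21 convolution: a "vast" receptive field done naively.
dense = dense_params(21)

# The decomposition: 5x5 depth-wise + 7x7 depth-wise dilated + 1x1 point-wise.
# With dilation 3 (an assumption), the dilated 7x7 spans (7-1)*3+1 = 19 pixels,
# so the stack covers 5 + 19 - 1 = 23 >= 21 pixels.
split = depthwise_params(5) + depthwise_params(7) + dense_params(1)

print(dense)  # 1806336
print(split)  # 64*25 + 64*49 + 64*64 = 8832
```

Under these assumptions, the decomposed attention branch reaches the same receptive field with under 0.5% of the dense kernel's parameters.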
Along the above footprints, we demonstrate a novel path to an efficient SR
architecture called VapSR (VAst-receptive-field Pixel attention network). Com-
pared with current state-of-the-art algorithms, the proposed VapSR substantially
reduces the parameter count while improving SR quality. For example, compared
with the champion of the NTIRE 2022 efficient SR competition [27], VapSR improves
PSNR by more than 0.1 dB with 185K fewer parameters.
Our experiments demonstrate the effectiveness of the proposed method.
2 Related Work
Deep Networks for SR. Since SRCNN [13] was proposed as the pioneer-
ing work for employing a three-layer convolutional neural network for the SR
task, numerous methods [14,24,30,37,45,43] have been proposed to achieve bet-
ter performance. FSRCNN [14] proposes a pipeline that upsamples features at
the end of the network, which boosts the performance while keeping the model
lightweight. VDSR [24] introduces skip connections for residual learning to in-
crease the depth of the SR network. DRCN [25] and DRRN [45] both adopt
the recursive structure to improve the reconstruction performance. SRDenseNet
[47] and RDN [55] prove that the dense connection is beneficial to improving the
capacity of SR models. RCAN [54] employs a channel attention scheme to bring
the attention mechanism to the SR methods. SAN [11] proposes a second-order
attention module, which brings further performance improvement. SwinIR [36]
promotes Swin Transformer [39] for the SR task. HAT [9] refreshes state-of-the-
art performance through hybrid attention schemes and pre-training strategy.
Attention Schemes for SR. The attention mechanism can be interpreted
as a way to bias the allocation of available resources towards the most informa-
tive parts of an input signal. There are approximately four attention schemes:
channel attention, spatial attention, combined channel-spatial attention and self-
attention mechanism. RCAN [54], inspired by SENet [18], reweights the channel-
wise features according to their respective weight responses. SelNet [10] and PAN
[56] employ the spatial attention mechanism, which calculates the weight for each
element. Regarding combined channel-spatial attention, HAN [42] additionally
proposes a spatial attention module via a 3D convolutional operation. The self-
attention mechanism was adopted from natural language processing to model the
long-range dependence [50]. IPT [7] is the first Transformer-based SR method
based on the ViT [15]. It relies on a large model scale (over 115.5M parameters).
SwinIR [36] calculates the window-based self-attention to save the computations.
HAT [9] further proposes multiple attention schemes to improve the window-
based self-attention and introduces channel-wise attention to the SR Transformer.
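To make the distinction between these schemes concrete, the SE-style channel attention that RCAN adapts can be sketched in a few lines. This toy pure-Python version is a simplification, not the RCAN module: it omits SENet's two fully-connected bottleneck layers and gates each channel directly by a sigmoid of its pooled response.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(x):
    """x: a C x H x W feature map as nested lists.
    Squeeze: global average pool each channel to one scalar.
    Excitation: gate the whole channel by a sigmoid of that scalar
    (SENet's two FC layers are omitted for brevity)."""
    out = []
    for channel in x:
        pooled = sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        g = sigmoid(pooled)  # a single weight shared by every pixel of the channel
        out.append([[v * g for v in row] for row in channel])
    return out

# a 2-channel 2x2 toy input: channel 0 is strongly activated, channel 1 is flat
feat = [[[4.0, 4.0], [4.0, 4.0]],
        [[0.0, 0.0], [0.0, 0.0]]]
att = channel_attention(feat)
print(att[0][0][0])  # 4 * sigmoid(4), the strong channel is largely kept
print(att[1][0][0])  # 0.0, the flat channel stays suppressed
```

The key property is granularity: here one gate reweights an entire channel, whereas spatial attention produces one gate per location and pixel attention (discussed in Section 3) one gate per element.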
Efficient SR Models. Efficient SR design aims to reduce the model
complexity and latency of SR networks [2,21,20,38,28,35,8]. CARN [2] employs a
combination of group convolution and 1×1 convolution to save computations.
After IDN [21] proposed the residual feature distillation structure, a series of
works [20,38,35] followed this micro-architecture design. IMDN
[20] improves IDN via an information multi-distillation block by using a chan-
nel splitting strategy. RFDN [38] rethinks the channel splitting operation and
introduces the progressive refinement module as an equivalent architecture. In
NTIRE 2022 Efficient SR Challenge [34], RLFN [28] won the championship in
the runtime track by abandoning the multi-branch design of RFDN and introducing
a contrastive loss for faster computation and better performance. BSRN [35] won
the first place in the model complexity track by replacing the standard convolu-
tion with a well-designed depth-wise separable convolution to save computations
and utilizing two effective attention schemes to enhance the model ability.
Large Kernel Design. CNNs used to be the common choice for computer
vision tasks. However, CNNs have been greatly challenged by Transformers re-
cently [15,6,39,31], and Transformer-based methods have also shown leading per-
formances on the SR task [7,36,32,9]. In Transformer, self-attention is designed
to be either global [15,7] or local, both accompanied by larger kernels [39,36,9].
Thus, information can be gathered from a large region. Inspired by this char-
acteristic of Transformer, a series of works have been proposed to design better
CNNs [48,40,12,16]. ConvMixer [48] utilizes large kernel convolutions to build
the model and achieves performance competitive with the ViT [15]. ConvNeXt
[40] proves that a well-designed CNN with large kernel convolutions can obtain
performance similar to Swin Transformer [39]. RepLKNet [12] scales up the filter
kernel size to 31×31 and outperforms state-of-the-art Transformer-based
[Fig. 1 shows the roadmap: on the left, block diagrams from the baseline (a 9×9 attention convolution with 3×3 backbone convolutions and GELU) through the final design (1×1 convolutions, GELU, pixel normalization, and an attention branch of a depth-wise 5×5 and a depth-wise dilated 7×7 convolution, all with 64 channels); on the right, a parameters-vs-PSNR plot (28.30–29.00 dB) on the DIV2K validation set comparing VapSR and VapSR-S against IMDB, RFDN (AIM 20 1st), RLFN (NTIRE 22 1st), and regular pixel attention, annotated with the successive steps: enlarging the attention kernel (13, then 19), parameter reduction (3×3→1×1 convolutions, dsd attention convolution), pixel normalization, and switching the attention module to the middle.]

Fig. 1. The evolutionary design roadmap of the proposed method. The figures on the
left are the key architectural milestones. The plot on the right shows the main models'
parameters and PSNR performance on the DIV2K validation set. Every evolution and
modification of the main design stages is marked with red boxes on the left and
described with text on the right. We omit some micro designs in this plot; they will
be elaborated later in Section 5.3.
methods. VAN [16] conducts an analysis of the visual attention and proposes
the large kernel attention based on the depth-wise convolution.
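The receptive-field arithmetic behind such decompositions is easy to check: a k-tap convolution with dilation d spans d·(k−1)+1 input pixels, and a stride-1 stack adds spans minus their overlaps. A small sketch (dilation 3 is VAN's choice for decomposing a 21×21 kernel; the rates used elsewhere may differ):

```python
def effective_kernel(k, d=1):
    # a k-tap filter with dilation d spans d*(k-1)+1 input pixels
    return d * (k - 1) + 1

def stacked_rf(kernels):
    # receptive field of a stride-1 stack of (kernel, dilation) convolutions
    rf = 1
    for k, d in kernels:
        rf += effective_kernel(k, d) - 1
    return rf

# VAN-style decomposition of a 21x21 kernel: 5x5 depth-wise, then
# 7x7 depth-wise with dilation 3, then a 1x1 point-wise convolution.
print(effective_kernel(7, 3))                # 19
print(stacked_rf([(5, 1), (7, 3), (1, 1)]))  # 23, covering the 21x21 target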
3 Motivation
The attention mechanism has been proven effective in SR networks. In particular,
an efficient SR model, PAN [56], achieves good performance using pixel attention
while greatly reducing the number of parameters. Pixel attention performs an
attention operation on each element of the features. Compared with channel at-
tention and spatial attention, pixel attention is a more general form of attention
operation and thus provides a good baseline for our further exploration.
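Pixel attention in PAN's formulation produces one sigmoid gate per element of the feature map. The toy sketch below replaces the learned 1×1 convolution with a single per-channel weight (a simplification; the real module learns a full cross-channel mixing) to show that every element receives its own gate:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pixel_attention(x, w):
    """x: a C x H x W feature map as nested lists; w: one scalar per channel,
    standing in for a learned 1x1 convolution. Each element is multiplied by
    its own sigmoid gate, unlike channel attention (one gate per channel)
    or spatial attention (one gate per location)."""
    return [[[v * sigmoid(w[c] * v) for v in row]
             for row in x[c]]
            for c in range(len(x))]

feat = [[[2.0, -2.0],
         [0.0,  5.0]]]
out = pixel_attention(feat, w=[1.0])
# each of the four elements is scaled by a different gate
print(out[0][0][0])  # 2 * sigmoid(2)   ~  1.76
print(out[0][0][1])  # -2 * sigmoid(-2) ~ -0.24
```

Because the gate map has the same shape as the features, pixel attention subsumes the coarser schemes, which is why it serves as the baseline for the exploration that follows.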
Inspired by recent advances in self-attention [50] and vision transformers [15],
we believe that there is still considerable room for improvement even for the at-
tention mechanism based on convolutional operations. In this section, we show
the process of improving SR network attention through three design criteria in
pixel attention. First, we show the advantages of using large kernel convolu-
tions in the attention branch. Then we use well-designed depth-wise separable
large kernel convolutions to reduce the huge computational burden brought by
large kernel convolutions. We demonstrate the potential of this network topol-
ogy design for efficient SR. Finally, inspired by vision Transformers, we introduce
a pixel-wise normalization operation in the convolutional network to train SR