attention module, which brings further performance improvement. SwinIR [36]
adapts the Swin Transformer [39] to the SR task. HAT [9] refreshes the state-of-the-art performance through hybrid attention schemes and a pre-training strategy.
Attention Schemes for SR. The attention mechanism can be interpreted
as a way to bias the allocation of available resources towards the most informa-
tive parts of an input signal. Attention schemes can be broadly grouped into four categories: channel attention, spatial attention, combined channel-spatial attention, and self-attention. RCAN [54], inspired by SENet [18], reweights channel-wise features according to their global responses. SelNet [10] and PAN [56] employ spatial attention, which computes a weight for each element of the feature map. For combined channel-spatial attention, HAN [42] additionally proposes a spatial attention module built on a 3D convolution. The self-attention mechanism was adopted from natural language processing to model long-range dependencies [50]. IPT [7] is the first Transformer-based SR method, built on ViT [15], and relies on a large model scale (over 115.5M parameters). SwinIR [36] computes window-based self-attention to reduce computation. HAT [9] further proposes multiple attention schemes to improve window-based self-attention and introduces channel-wise attention to the SR Transformer.
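To make the first two schemes concrete, the following is a minimal PyTorch sketch of SE-style channel attention (in the spirit of SENet/RCAN) and per-element pixel attention (in the spirit of PAN). The module names, channel counts, and reduction ratio are illustrative assumptions and do not reproduce the cited implementations.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """SE-style channel attention: squeeze spatial dimensions with global average
    pooling, then predict one multiplicative weight per channel."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                   # B x C x 1 x 1
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                              # weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.body(x)      # broadcast the per-channel weights spatially


class PixelAttention(nn.Module):
    """Per-element (spatial) attention: a 1x1 conv predicts a weight for every
    position of every channel, which reweights the features element-wise."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.conv(x))


# Example: both modules keep the feature shape unchanged.
feat = torch.randn(1, 64, 48, 48)
assert ChannelAttention(64)(feat).shape == feat.shape
assert PixelAttention(64)(feat).shape == feat.shape
```

The key difference is the granularity of the predicted weights: one scalar per channel in the first case, one scalar per spatial position and channel in the second.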
Efficient SR Models. Efficient SR design aims to reduce the model complexity and latency of SR networks [2,21,20,38,28,35,8]. CARN [2] employs a combination of group convolution and 1×1 convolution to save computation. After IDN [21] proposed the residual feature distillation structure, a series of works [20,38,35] followed this micro-architecture design. IMDN [20] improves upon IDN with an information multi-distillation block that uses a channel splitting strategy. RFDN [38] rethinks the channel splitting operation and introduces the progressive refinement module as an equivalent architecture. In the NTIRE 2022 Efficient SR Challenge [34], RLFN [28] won the runtime track by discarding the multi-branch design of RFDN and introducing a contrastive loss, yielding faster inference and better performance. BSRN [35] won first place in the model complexity track by replacing standard convolutions with well-designed depth-wise separable convolutions to save computation and by utilizing two effective attention schemes to strengthen the model.
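The sketch below illustrates the two ideas recurring in this line of work: a depth-wise separable convolution and a channel-splitting distillation block. It is a simplified example, not the official implementation of any cited method; the distillation ratio, layer counts, and activation choice are assumptions made for brevity.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depth-wise 3x3 conv followed by a point-wise 1x1 conv (cheaper than a full 3x3 conv)."""
    def __init__(self, channels: int):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


class SplitDistillationBlock(nn.Module):
    """Simplified channel-splitting block: at each step a slice of the features is
    'distilled' (kept aside) and the remainder is refined further; the distilled
    slices are fused by a 1x1 conv and added back to the input."""
    def __init__(self, channels: int = 64, distill_ratio: float = 0.25):
        super().__init__()
        self.dc = int(channels * distill_ratio)      # distilled channels
        self.rc = channels - self.dc                 # remaining channels to refine
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(self.rc, channels, 3, padding=1)
        self.conv3 = nn.Conv2d(self.rc, self.dc, 3, padding=1)
        self.fuse = nn.Conv2d(self.dc * 3, channels, 1)
        self.act = nn.LeakyReLU(0.05, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d1, r1 = torch.split(self.act(self.conv1(x)), [self.dc, self.rc], dim=1)
        d2, r2 = torch.split(self.act(self.conv2(r1)), [self.dc, self.rc], dim=1)
        d3 = self.act(self.conv3(r2))
        return self.fuse(torch.cat([d1, d2, d3], dim=1)) + x   # local residual connection
```

For 64 channels, a full 3×3 convolution has 64×64×3×3 = 36,864 weights (ignoring biases), while the depth-wise separable variant needs only 64×3×3 + 64×64 = 4,672, which is where most of the complexity savings in such designs come from.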
Large Kernel Design. CNNs were long the default choice for computer vision tasks. Recently, however, they have been strongly challenged by Transformers [15,6,39,31], and Transformer-based methods have also shown leading performance on the SR task [7,36,32,9]. In Transformers, self-attention is computed either globally [15,7] or within large local windows [39,36,9]. Thus, information can be gathered from a large region. Inspired by this characteristic of Transformers, a series of works has been proposed to design better CNNs [48,40,12,16]. ConvMixer [48] builds its model with large kernel convolutions and achieves performance competitive with ViT [15]. ConvNeXt [40] shows that a well-designed CNN with large kernel convolutions can obtain performance similar to the Swin Transformer [39]. RepLKNet [12] scales up the filter kernel size to 31×31 and outperforms the state-of-the-art Transformer-based