attention module, which brings further performance improvement. SwinIR [36]
adapts the Swin Transformer [39] to the SR task. HAT [9] refreshes the state-of-the-art performance through hybrid attention schemes and a pre-training strategy.
Attention Schemes for SR. The attention mechanism can be interpreted
as a way to bias the allocation of available resources towards the most informa-
tive parts of an input signal. Attention schemes can be broadly grouped into four categories: channel attention, spatial attention, combined channel-spatial attention, and self-attention. RCAN [54], inspired by SENet [18], reweights channel-wise features according to their global responses. SelNet [10] and PAN [56] employ spatial attention, which computes a weight for each element of the feature map. For combined channel-spatial attention, HAN [42] additionally proposes a spatial attention module built on a 3D convolution. The self-attention mechanism was adopted from natural language processing to model long-range dependencies [50]. IPT [7] is the first Transformer-based SR method, built on ViT [15], and relies on a large model scale (over 115.5M parameters). SwinIR [36] computes window-based self-attention to reduce computation. HAT [9] further proposes multiple attention schemes to improve window-based self-attention and introduces channel-wise attention to the SR Transformer.
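To make the first two schemes concrete, the following is a minimal PyTorch sketch of SE-style channel attention (in the spirit of SENet/RCAN) and per-element pixel attention (in the spirit of PAN). The module names, channel counts, and reduction ratio are illustrative assumptions and do not reproduce the cited implementations.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """SE-style channel attention: squeeze spatial dimensions with global average
    pooling, then predict one multiplicative weight per channel."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                   # B x C x 1 x 1
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                              # weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.body(x)      # broadcast the per-channel weights spatially


class PixelAttention(nn.Module):
    """Per-element (spatial) attention: a 1x1 conv predicts a weight for every
    position of every channel, which reweights the features element-wise."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.conv(x))


# Example: both modules keep the feature shape unchanged.
feat = torch.randn(1, 64, 48, 48)
assert ChannelAttention(64)(feat).shape == feat.shape
assert PixelAttention(64)(feat).shape == feat.shape
```

The key difference is the granularity of the predicted weights: one scalar per channel in the first case, one scalar per spatial position and channel in the second.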
Efficient SR Models. Efficient SR design aims to reduce the model complexity and latency of SR networks [2,21,20,38,28,35,8]. CARN [2] employs a combination of group convolution and 1×1 convolution to save computation. After IDN [21] proposed the residual feature distillation structure, a series of works [20,38,35] followed this micro-architecture design. IMDN [20] improves upon IDN with an information multi-distillation block that uses a channel splitting strategy. RFDN [38] rethinks the channel splitting operation and introduces the progressive refinement module as an equivalent architecture. In the NTIRE 2022 Efficient SR Challenge [34], RLFN [28] won the runtime track by discarding the multi-branch design of RFDN and introducing a contrastive loss, yielding faster inference and better performance. BSRN [35] won first place in the model complexity track by replacing standard convolutions with well-designed depth-wise separable convolutions to save computation and by utilizing two effective attention schemes to strengthen the model.
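The sketch below illustrates the two ideas recurring in this line of work: a depth-wise separable convolution and a channel-splitting distillation block. It is a simplified example, not the official implementation of any cited method; the distillation ratio, layer counts, and activation choice are assumptions made for brevity.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depth-wise 3x3 conv followed by a point-wise 1x1 conv (cheaper than a full 3x3 conv)."""
    def __init__(self, channels: int):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


class SplitDistillationBlock(nn.Module):
    """Simplified channel-splitting block: at each step a slice of the features is
    'distilled' (kept aside) and the remainder is refined further; the distilled
    slices are fused by a 1x1 conv and added back to the input."""
    def __init__(self, channels: int = 64, distill_ratio: float = 0.25):
        super().__init__()
        self.dc = int(channels * distill_ratio)      # distilled channels
        self.rc = channels - self.dc                 # remaining channels to refine
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(self.rc, channels, 3, padding=1)
        self.conv3 = nn.Conv2d(self.rc, self.dc, 3, padding=1)
        self.fuse = nn.Conv2d(self.dc * 3, channels, 1)
        self.act = nn.LeakyReLU(0.05, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d1, r1 = torch.split(self.act(self.conv1(x)), [self.dc, self.rc], dim=1)
        d2, r2 = torch.split(self.act(self.conv2(r1)), [self.dc, self.rc], dim=1)
        d3 = self.act(self.conv3(r2))
        return self.fuse(torch.cat([d1, d2, d3], dim=1)) + x   # local residual connection
```

For 64 channels, a full 3×3 convolution has 64×64×3×3 = 36,864 weights (ignoring biases), while the depth-wise separable variant needs only 64×3×3 + 64×64 = 4,672, which is where most of the complexity savings in such designs come from.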
Large Kernel Design. CNNs were long the default choice for computer vision tasks. Recently, however, they have been strongly challenged by Transformers [15,6,39,31], and Transformer-based methods have also shown leading performance on the SR task [7,36,32,9]. In Transformers, self-attention is computed either globally [15,7] or within large local windows [39,36,9]. Thus, information can be gathered from a large region. Inspired by this characteristic of Transformers, a series of works has been proposed to design better CNNs [48,40,12,16]. ConvMixer [48] builds its model with large kernel convolutions and achieves performance competitive with ViT [15]. ConvNeXt [40] shows that a well-designed CNN with large kernel convolutions can obtain performance similar to the Swin Transformer [39]. RepLKNet [12] scales up the filter kernel size to 31×31 and outperforms the state-of-the-art Transformer-based