CPSAA: Accelerating Sparse Attention using
Crossbar-based Processing-In-Memory Architecture
Huize Li, Member, IEEE, Hai Jin, Fellow, IEEE, Long Zheng, Member, IEEE, Xiaofei Liao, Member, IEEE, Yu
Huang, Member, IEEE, Cong Liu, Student Member, IEEE, Jiahong Xu, Student Member, IEEE, Zhuohui
Duan, Member, IEEE, Dan Chen, Student Member, IEEE, Chuangyi Gui, Student Member, IEEE
Abstract—Attention-based neural networks attract great interest due to their excellent accuracy. However, the attention mechanism spends enormous computational effort on unnecessary calculations, significantly limiting system performance. To reduce these unnecessary calculations, researchers propose sparse attention, which converts some dense-dense matrix multiplication (DDMM) operations into sampled dense-dense matrix multiplication (SDDMM) and sparse matrix multiplication (SpMM) operations. However, current sparse attention solutions introduce massive off-chip random memory accesses since the sparse attention matrix is generally unstructured.
We propose CPSAA, a novel crossbar-based processing-in-
memory (PIM)-featured sparse attention accelerator to eliminate
off-chip data transmissions. First, we present a novel attention
calculation mode to balance the crossbar writing and crossbar
processing latency. Second, we design a novel PIM-based sparsity
pruning architecture to eliminate the pruning phase’s off-chip
data transfers. Finally, we present novel crossbar-based SDDMM
and SpMM methods to process unstructured sparse attention
matrices by coupling two types of crossbar arrays. Experimental
results show that CPSAA achieves average speedups of 89.6×, 32.2×, 17.8×, 3.39×, and 3.84×, and energy savings of 755.6×, 55.3×, 21.3×, 5.7×, and 4.9×, compared with GPU, FPGA, SANGER, ReBERT, and ReTransformer, respectively.
Index Terms—processing-in-memory, domain-specific acceler-
ator, attention mechanism, ReRAM.
I. INTRODUCTION
Attention-based neural networks show accuracy leaps in machine learning applications, e.g., natural language processing (NLP) [10] and computer vision [8]. Different from
the commonly used Convolutional Neural Network (CNN) or
Recurrent Neural Network (RNN) models, Transformer [30]
adopts a pure attention-based neural network to better identify
the dependencies between tokens of the input sequence. Fol-
lowing this design, Transformer and its variants achieve great
accuracy improvements in NLP tasks [10], such as machine translation [30] and question answering [6]. Attention is also widely used in computer vision tasks [8], including image classification [1] and object detection [18].
The authors are with the National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China (e-mail: {huizeli, hjin, longzh, xfliao, yuh, congliu, jhxu, zhduan, cdhust, chygui}@hust.edu.cn). Accepted by TCAD in 01/05/2023.
This work is supported by the NSFC (No. 61832006). The correspondence of this paper should be addressed to Yu Huang.
The vanilla attention mechanism [30] is usually implemented as DDMM and softmax operations. By computing an
attention score matrix, the attention mechanism can focus on the relevant token pairs. However, computing the full score matrix also processes the many irrelevant token pairs, creating overwhelming computational pressure and intolerable execution time [7]. Researchers therefore propose
sparse attention by adding a sparsity pruning phase before
the attention calculation to reduce irrelevant calculations [7],
[19], since most tokens in the input sequence are unrelated
to the current query. There are two types of sparse attention
designs, i.e., software-based and software-hardware co-design
methods [19]. Software-based methods [29], [38] propose various optimization algorithms that reduce computational overhead by increasing sparsity. Software-hardware co-design
solutions accelerate sparse attention by taking advantage of
high-parallelism hardware, such as Field Programmable Gate
Array (FPGA) [16], [40] and Application Specific Integrated
Circuit (ASIC) [7], [19], [32].
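To make the two regimes concrete, the NumPy sketch below contrasts vanilla attention, i.e., two DDMM operations separated by a softmax, with a sparse variant in which a pruning mask confines the score computation to the retained query-key pairs (SDDMM) and the subsequent weighted sum to the surviving entries (SpMM). It only illustrates the dataflow and is not any particular accelerator's algorithm: the thresholded low-precision score estimate that produces the mask, and the threshold value itself, are placeholders for whatever pruning method a real design uses, and the dense product is computed and then masked purely for clarity.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(Q, K, V):
    # vanilla attention: DDMM -> softmax -> DDMM
    S = Q @ K.T / np.sqrt(Q.shape[-1])   # n x n dense score matrix
    return softmax(S) @ V                # weighted sum over all value rows

def sparse_attention(Q, K, V, threshold=0.0):
    d = Q.shape[-1]
    # sparsity pruning phase: a cheap low-precision estimate picks the
    # query-key pairs worth computing (illustrative rule only)
    mask = (np.round(Q) @ np.round(K).T / np.sqrt(d)) > threshold
    mask |= np.eye(Q.shape[0], dtype=bool)  # keep the diagonal so no row is empty
    # SDDMM: exact scores are needed only at the unmasked positions
    # (the dense product is masked here for brevity; a real kernel would
    # evaluate just the sampled entries)
    S = np.where(mask, Q @ K.T / np.sqrt(d), -np.inf)
    # SpMM: the sparse probability matrix multiplies the dense value matrix
    return softmax(S) @ V

n, d = 6, 4
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(dense_attention(Q, K, V).shape, sparse_attention(Q, K, V).shape)

In the sparse case, the number of exact score evaluations and value-row accumulations shrinks with the mask density, which is the source of the savings that sparse attention targets.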
The above solutions only achieve limited speedups since
both the sparsity pruning and attention calculation phases
involve many off-chip data transfers. Emerging crossbar-based architectures, such as Resistive Random Access Memory (ReRAM) and ReRAM-based content addressable memory (ReCAM) [12], are promising for solving the off-chip data transmission problem. ReCAM is well suited to highly parallel comparison, the core operation of the content-based similarity search in the attention mechanism. ReRAM is ideal for the vector-matrix multiplication (VMM) operation and therefore excels at the DDMM operations of attention-based neural networks.
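As a rough numerical illustration of why crossbars excel at VMM (an idealized model, not this paper's circuit), the sketch below programs matrix weights as cell conductances and applies the input vector as wordline voltages; each bitline current then accumulates all of its products in one analog step, following Kirchhoff's current law. The conductance range g_max and the number of programmable levels are assumed values, and signed weights are folded into one array for brevity, whereas real designs typically use a differential pair of arrays.

import numpy as np

def crossbar_vmm(weights, x, g_max=1e-4, levels=16):
    # Idealized ReRAM crossbar VMM model: weights -> cell conductances,
    # input vector -> wordline voltages, bitline currents -> dot products.
    w_max = np.abs(weights).max() + 1e-12
    # quantize to a limited number of conductance levels (assumed precision)
    g = np.round(weights / w_max * (levels - 1)) / (levels - 1) * g_max
    i_out = x @ g                 # all multiply-accumulates happen in the array
    return i_out / g_max * w_max  # scale bitline currents back to numbers

W, x = np.random.randn(64, 16), np.random.randn(64)
print("max deviation from digital VMM:", np.abs(crossbar_vmm(W, x) - x @ W).max())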
Utilizing the in-situ processing ability of ReRAM arrays, ReRAM-based PIM-featured solutions have emerged to accelerate traditional neural networks [27] and the dense attention mechanism [11], [36]. However, these PIM-based solutions can hardly be extended to accelerate sparse attention, for the following reasons.
First, the ReRAM array's write overhead cannot be ignored as it can be in ReRAM-based CNN and RNN accelerators. Solving the ReRAM write overhead is urgent because many matrices in the attention mechanism cannot be reused and must be written at runtime. Second, sparse attention involves a sparsity pruning phase, which is not considered by current dense attention accelerators. Existing software-based pruning algorithms could be combined with a PIM-based attention accelerator, but they perform poorly because they need to load all input matrices from off-chip memory to the processor. Moreover, the sparsity pattern of the attention mechanism is highly unstructured, which introduces a large amount of off-chip random memory access