CPSAA: Accelerating Sparse Attention using
Crossbar-based Processing-In-Memory Architecture
Huize Li, Member, IEEE, Hai Jin, Fellow, IEEE, Long Zheng, Member, IEEE, Xiaofei Liao, Member, IEEE, Yu
Huang, Member, IEEE, Cong Liu, Student Member, IEEE, Jiahong Xu, Student Member, IEEE, Zhuohui
Duan, Member, IEEE, Dan Chen, Student Member, IEEE, Chuangyi Gui, Student Member, IEEE
Abstract—The attention-based neural network attracts great interest due to its excellent accuracy enhancement. However, the attention mechanism spends huge computational effort on unnecessary calculations, significantly limiting the system’s performance. To reduce the unnecessary calculations, researchers propose sparse attention, which converts some dense-dense matrix multiplication (DDMM) operations to sampled dense-dense matrix multiplication (SDDMM) and sparse matrix multiplication (SpMM) operations. However, current sparse attention solutions introduce massive off-chip random memory accesses since the sparse attention matrix is generally unstructured.
We propose CPSAA, a novel crossbar-based processing-in-memory (PIM)-featured sparse attention accelerator that eliminates off-chip data transmissions. First, we present a novel attention calculation mode to balance the crossbar writing and crossbar processing latency. Second, we design a novel PIM-based sparsity pruning architecture to eliminate the pruning phase’s off-chip data transfers. Finally, we present novel crossbar-based SDDMM and SpMM methods that process unstructured sparse attention matrices by coupling two types of crossbar arrays. Experimental results show that CPSAA achieves an average of 89.6×, 32.2×, 17.8×, 3.39×, and 3.84× performance improvement and 755.6×, 55.3×, 21.3×, 5.7×, and 4.9× energy saving compared with GPU, FPGA, SANGER, ReBERT, and ReTransformer, respectively.
Index Terms—processing-in-memory, domain-specific accelerator, attention mechanism, ReRAM.
The authors are with the National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China (e-mail: {huizeli, hjin, longzh, xfliao, yuh, congliu, jhxu, zhduan, cdhust, chygui}@hust.edu.cn). Accepted by TCAD on 01/05/2023.
This work is supported by the NSFC (No. 61832006). The correspondence of this paper should be addressed to Yu Huang.
I. INTRODUCTION
Attention-based neural networks show accuracy leaps in machine learning applications, e.g., natural language processing (NLP) [10] and computer vision [8]. Different from the commonly used Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) models, Transformer [30] adopts a pure attention-based neural network to better identify the dependencies between tokens of the input sequence. Following this design, Transformer and its variants achieve great accuracy improvements in NLP tasks [10], such as machine translation [30] and question answering [6]. Attention is also widely used in computer vision tasks [8], including image classification [1] and object detection [18].
The vanilla attention mechanism [30] is usually implemented as DDMM and softmax operations. By computing an attention score matrix, the attention mechanism can pay attention to the relevant token pairs. However, processing the irrelevant token pairs imposes overwhelming computation pressure, leading to intolerable execution time [7]. Researchers propose
sparse attention by adding a sparsity pruning phase before
the attention calculation to reduce irrelevant calculations [7],
[19], since most tokens in the input sequence are unrelated
to the current query. There are two types of sparse attention
designs, i.e., software-based and software-hardware co-design
methods [19]. Software-based methods [29], [38] aim to pro-
pose various optimization algorithms to reduce computational
overhead by increasing sparsity. Software-hardware co-design
solutions accelerate sparse attention by taking advantage of
high-parallelism hardware, such as Field Programmable Gate
Array (FPGA) [16], [40] and Application Specific Integrated
Circuit (ASIC) [7], [19], [32].
The above solutions only achieve limited speedups since
both the sparsity pruning and attention calculation phases
involve many off-chip data transfers. Emerging crossbar-based
architectures are promising to solve the off-chip data trans-
mission problem, such as Resistive Random Access Memory
(ReRAM) and ReRAM-based content addressable memory
(ReCAM) [12]. ReCAM is suitable for high parallel com-
parison, the core operation of content-based similarity search
in the attention mechanism. ReRAM is ideal for vector-
matrix multiplication (VMM) operation, which has superior
performance handling the DDMM operations of attention-
based neural network. Utilizing the in-situ processing ability
of ReRAM arrays, there emerge ReRAM-based PIM-featured
solutions to accelerate traditional neural network [27] and the
dense attention mechanism [11], [36]. However, these PIM-
based solutions can hardly extend to accelerate the sparse
attention for the following reasons.
First, the ReRAM array’s write overhead cannot be ignored as it can be in ReRAM-based CNN and RNN accelerators. Solving the ReRAM write overhead is urgent because many matrices in the attention mechanism cannot be reused and must be written at runtime. Second, sparse attention involves a sparsity pruning phase, which is not considered by current dense attention accelerators. Existing software-based pruning algorithms could be adopted to serve a PIM-based attention accelerator; however, these software-based pruning methods have poor performance because they need to load all input matrices from off-chip memory to the processor. Moreover, the sparsity of the attention mechanism is highly unstructured, which introduces lots of off-chip random memory accesses
to the attention calculation phase. Finally, sparse attention involves SDDMM and SpMM operations. Directly applying current ReRAM-based sparse methods [17], [28] to ReRAM-based SDDMM and SpMM operations achieves inferior performance (see § IV-C and § IV-D for details).

Fig. 1. Dataflow of the attention mechanism: (a) computation stages of the dense attention mechanism; (b) sparse attention calculation
Given this landscape, we propose CPSAA, a novel Crossbar-based PIM-featured Sparse Attention Accelerator. First, we design a new attention calculation mode to increase the parallelism of the CPSAA dataflow while hiding the ReRAM writing overhead. Second, we design a novel PIM-based sparsity pruning architecture that supports our calculation mode while removing off-chip data transmissions. Finally, we couple ReRAM and ReCAM arrays and propose new PIM-based SDDMM and SpMM methods to exploit the unstructured sparsity. The main contributions of this paper are as follows:
• We propose CPSAA, a PIM-based sparse attention accelerator, to speed up both the pruning and attention calculation phases of neural network inference.
• We design the pruning phase as a novel PIM architecture to eliminate off-chip data transfers. The attention calculation architecture involves novel ReRAM-based SDDMM and SpMM methods to increase the parallelism of sparse attention.
• We evaluate and compare CPSAA with state-of-the-art attention mechanism solutions. The experimental results show that CPSAA achieves substantial improvements in both speedup and energy saving.
II. BACKGROUND AND MOTIVATION
A. Sparse Attention
The vanilla attention mechanism maps a query matrix Q and a key-value matrix pair K-V to an output matrix Z. Figure 1 (a) depicts the calculation procedure of the dense attention mechanism. First, the query (Q), key (K), and value (V) matrices are obtained by multiplying the embedded input sequence X with the corresponding weight matrices WQ, WK, and WV. Next, the score matrix S is calculated by multiplying Q with K^T. Then, the S matrix is normalized with a row-wise softmax function Softmax(S). Finally, S is multiplied by the V matrix to get the output Z.
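To make this dataflow concrete, the sketch below walks through the same steps in NumPy; the matrix shapes, the random toy inputs, and the absence of any scaling factor are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def dense_attention(X, WQ, WK, WV):
    """Dense attention as described above: every step is a dense matrix product."""
    Q = X @ WQ                                    # queries
    K = X @ WK                                    # keys
    V = X @ WV                                    # values
    S = Q @ K.T                                   # score matrix (DDMM)
    S = np.exp(S - S.max(axis=1, keepdims=True))  # row-wise softmax (numerically stable)
    S = S / S.sum(axis=1, keepdims=True)
    return S @ V                                  # output Z (DDMM)

# toy usage: a sequence of 8 tokens with embedding dimension 16
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
WQ, WK, WV = (rng.standard_normal((16, 16)) for _ in range(3))
Z = dense_attention(X, WQ, WK, WV)
```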
As Figure 1 (a) shows, S reveals the relevance between the tokens of the Q and K matrices. Researchers find that most tokens in Q are irrelevant to K, making the S matrix inherently sparse [7]. The above fact makes it possible to utilize the sparsity of S to avoid computing these irrelevant token pairs [7].
Thus, people propose sparse attention, which adopts a sparsity pruning phase to save the computational resources of the attention calculation phase.

Fig. 2. (a) One ReRAM cell, (b) the ReRAM crossbar architecture, and (c) the ReCAM array architecture
G[i,j] = binarize(S̃[i,j], θ) = { 1, if S̃[i,j] ≥ θ; 0, otherwise }    (1)
SANGER [19] proposes a prediction-based pruning method and achieves high sparsity. The idea of their pruning method is to calculate the approximate score matrix S̃ using a low-precision DDMM operation, S̃ = Softmax(QU⁻¹(QU(Q) · QU(K^T)) / √d), where d is the dimension of the input embeddings, and QU(x) = round(γx) is a quantization operator that maps inputs to low-bit values by a scaling factor γ and a rounding bit-shift. QU⁻¹(·) is the corresponding de-quantization operator that transforms low-bit outputs back into high precision. Then, the binarization procedure described in equation (1) converts the S̃ matrix to a binary mask matrix G, where θ is the threshold of the binarization function. Because the sparsity of the mask matrix is similar to that of the score matrix S, SANGER can convert the DDMM operation S = Q·K^T to an SDDMM operation. As Figure 1 (b) shows, the SDDMM operation generates the sparse score matrix S from three input matrices (two dense matrices Q and K, and one sparse mask matrix), using the mask matrix to avoid calculating the zero-valued entries (white cells) of the S matrix. Utilizing the sparse S matrix, the DDMM operation Z = S·V can then be converted to an SpMM operation. Figure 1 (b) also visualizes the SpMM operation, which multiplies the sparse matrix S with the dense matrix V.
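A rough functional sketch of this prediction-and-prune flow is given below. It emulates the mask generation and the mask-guided SDDMM/SpMM with dense NumPy kernels, so it illustrates only the arithmetic, not the hardware savings; the scaling factor GAMMA, the threshold theta, and the √d scaling are assumed placeholders rather than SANGER’s actual configuration.

```python
import numpy as np

GAMMA = 8.0   # assumed scaling factor for the quantizer; not SANGER's actual value

def QU(x):
    """Quantization operator QU(x) = round(gamma * x): map values to low-bit integers."""
    return np.round(GAMMA * x)

def QU_inv(x):
    """De-quantization: both operands were scaled by gamma, so divide the product by gamma^2."""
    return x / (GAMMA * GAMMA)

def predict_mask(Q, K, theta=0.02):
    """Approximate score matrix via a low-precision DDMM, then binarize it (Eq. 1)."""
    d = Q.shape[1]
    S_tilde = QU_inv(QU(Q) @ QU(K.T)) / np.sqrt(d)
    S_tilde = np.exp(S_tilde - S_tilde.max(axis=1, keepdims=True))
    S_tilde /= S_tilde.sum(axis=1, keepdims=True)            # row-wise softmax
    return (S_tilde >= theta).astype(np.float64)              # binary mask G

def sparse_attention(Q, K, V, G):
    """SDDMM + SpMM guided by the mask G (emulated here with dense kernels:
    real hardware would skip the masked-out entries instead of zeroing them)."""
    S = (Q @ K.T) * G                                          # SDDMM
    E = np.where(G > 0, np.exp(S - S.max(axis=1, keepdims=True)), 0.0)
    S = E / np.maximum(E.sum(axis=1, keepdims=True), 1e-9)     # masked row-wise softmax
    return S @ V                                               # SpMM
```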
B. ReRAM and ReCAM Basics
Figure 2 (a) shows one ReRAM cell, which is comprised of a top electrode, a metal oxide layer, and a bottom electrode. Following Kirchhoff’s Current Law, a ReRAM crossbar array can process the VMM operation efficiently [33], denoted as I_n = Σ_m V_m × G(m,n), as Figure 2 (b) shows. Utilizing this highly parallel in-situ VMM operation, several ReRAM-based neural network accelerators [27] and graph processing accelerators [22], [28], [42] have been proposed.
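As a point of reference, the following idealized sketch models the crossbar’s analog VMM as a plain matrix-vector product; device non-idealities, ADC quantization, and bit-slicing are deliberately ignored, and the array size is an arbitrary example.

```python
import numpy as np

def crossbar_vmm(V, G):
    """Idealized ReRAM crossbar: word-line voltages V (length m) drive an m x n
    conductance matrix G; each bit line n accumulates I_n = sum_m V[m] * G[m][n]
    in one analog step (Kirchhoff's Current Law)."""
    return V @ G                      # vector of bit-line currents, length n

# toy usage: a 4 x 3 crossbar
G = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
V = np.array([0.5, 1.0, 0.2, 0.8])
I = crossbar_vmm(V, G)                # -> array([1.7, 4.0, 2.8])
```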
The ReCAM array architecture is shown in Figure 2 (c); it is comprised of many two-transistor, two-memristor (2T2R) ReCAM bit-cells [12]. One ReCAM bit-cell contains a pair of ReRAM cells. One ReCAM array contains a Key register, an array of ReCAM cells, drivers (DRV), and tag registers (TAG). ReCAM arrays can perform vector-scalar comparisons in parallel [12]. The TAG latches a ‘1’ signal if its row matches the Key register.
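A functional sketch of this parallel match operation, as described above for the Key register and TAG latches, is shown below; it abstracts away the 2T2R circuit details and is only an assumed behavioral model.

```python
import numpy as np

def recam_search(stored_rows, key):
    """Idealized ReCAM: all stored rows are compared against the Key register in
    parallel; TAG[i] latches '1' when row i matches the key exactly."""
    return np.all(stored_rows == key, axis=1).astype(np.uint8)

# toy usage: four stored rows, the key matches rows 0 and 2
rows = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [1, 2, 3],
                 [7, 8, 9]])
tag = recam_search(rows, np.array([1, 2, 3]))   # -> array([1, 0, 1, 0], dtype=uint8)
```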
TABLE I
CURRENT ATTENTION MECHANISM SOLUTIONS

Features            | Hardware-based            | Software-hardware co-design                                               | Software-based
                    | [11], [36] | CPSAA        | [16], [40]              | [7], [32]               | [19], [23]            | [2], [38]               | [3], [41]
Spar. pattern       | dense      | dynamic      | static                  | dynamic                 | dynamic               | static                  | dynamic
Pruning plt.        | -          | ReRAM        | CPU/GPU                 | CPU/GPU                 | CPU/GPU               | CPU/GPU                 | CPU/GPU
Attention plt.      | ReRAM      | ReRAM        | FPGA                    | ASIC                    | ASIC                  | CPU/GPU                 | CPU/GPU
Sparsity regularity | -          | unstructured | coarse-grain structured | coarse-grain structured | fine-grain structured | coarse-grain structured | unstructured
Sparsity            | -          | high         | low                     | medium                  | high                  | low                     | high
Speedup             | high       | high         | low                     | medium                  | high                  | low                     | low
C. Related Work
Table I lists recent studies of the attention mechanism. Different from SANGER’s classification, we divide these works into three categories, i.e., software-based, software-hardware co-design, and the newly added hardware-based category. The software-based methods can be further divided into static and dynamic sparsity patterns. The static sparsity pattern can only obtain coarse-grained sparsity because it does not depend on the input data [2], [38]. The dynamic sparsity pattern can achieve higher sparsity conditioned on individual input samples at runtime, but its unstructured sparsity increases the random memory access overhead [3], [41].
Software-hardware co-design methods adopt more efficient hardware, such as FPGAs and ASICs. FTRANS [16] is an efficient acceleration framework for transformer-based NLP. Zhang et al. propose a novel pruning method and design an FPGA-based sparse attention accelerator [40]. A3 [7] accelerates the attention mechanism with algorithmic approximation and hardware specialization. SpAtten [32] leverages token sparsity, head sparsity, and quantization opportunities to reduce redundant computation and random access. SANGER [19] presents a prediction-based pruning algorithm to reduce unnecessary calculations and designs a “splitting and packing” algorithm to reduce the massive random memory access overhead. DOTA [23] proposes a dynamic weak-connections detector to avoid unnecessary attention calculations and accelerate sparse attention.
Although these co-design solutions can relieve the random memory access by designing hardware-specific algorithms, many off-chip data transmissions remain because of their separate memory and processor architecture. Taking SANGER as an example, the “splitting and packing” algorithm can reduce random memory access by converting the unstructured sparsity to fine-grained structured sparsity. However, the benefit of the fine-grained structured sparsity comes at the cost of dynamically configured control signals, which make scheduling processing elements and memory accesses at runtime highly complex. In addition, SANGER introduces off-chip random accesses and data dependencies into its sparsity pruning phase. Thus, no current accelerator can solve the off-chip random memory access problem elegantly.
Fig. 3. Response time breakdown of SANGER and DOTA (operations ratio, %, over the MNLI, MRPC, QQP, QNLI, and RTE datasets, broken into MA-GE-M, MA-GE-P, AT-CA-M, and AT-CA-P)

We also list current hardware-based PIM-featured dense
attention accelerators in Table I. These solutions use highly parallel ReRAM arrays to significantly reduce the latency of DDMM operations [11], [36]. However, since the sparsity of the attention matrices is not considered, current PIM-based attention accelerators need to calculate all five DDMM operation steps, wasting lots of computational resources on irrelevant tokens. Therefore, designing a PIM-based sparse attention accelerator that reduces these unnecessary calculations becomes the key to achieving further acceleration.
D. Motivation
We find that current sparse attention accelerators have many
off-chip data transmissions. We design experiments on two
sparse attention accelerators, i.e., SANGER and DOTA, to
confirm this statement. Figure 3 presents the operations ratio of
SANGER [19] and DOTA [23] running five real-world datasets
(details to see Section V). The ratio of the operation comes
from breaking down the response time to the mask matrix
generation (MA-GE) and the attention mechanism calculation
(AT-CA). To assess the impact of off-chip memory access, we
further break down the MA-GE and AT-CA to the processor
execution time (MA-GE-P and AT-CA-P) and the memory
access time (MA-GE-M and AT-CA-M). The memory access
time is obtained by subtracting the average kernel runtime
from the total execution time.
First, the overhead of the mask generation and attention calculation phases cannot be ignored. Figure 3 shows that MA-GE takes an average of 17.9% (14.3% in DOTA) of the response time, while AT-CA takes 82.1% (85.7% in DOTA). The calculation overhead of the pruning phase is