CPSAA: Accelerating Sparse Attention using
Crossbar-based Processing-In-Memory Architecture
Huize Li, Member, IEEE, Hai Jin, Fellow, IEEE, Long Zheng, Member, IEEE, Xiaofei Liao, Member, IEEE, Yu
Huang, Member, IEEE, Cong Liu, Student Member, IEEE, Jiahong Xu, Student Member, IEEE, Zhuohui
Duan, Member, IEEE, Dan Chen, Student Member, IEEE, Chuangyi Gui, Student Member, IEEE
Abstract—The attention-based neural network attracts great interest due to its excellent accuracy enhancement. However, the attention mechanism spends huge computational effort on unnecessary calculations, significantly limiting the system’s performance. To reduce the unnecessary calculations, researchers propose sparse attention, which converts some dense-dense matrix multiplication (DDMM) operations to sampled dense-dense matrix multiplication (SDDMM) and sparse matrix multiplication (SpMM) operations. However, current sparse attention solutions introduce massive off-chip random memory accesses since the sparse attention matrix is generally unstructured.
We propose CPSAA, a novel crossbar-based processing-in-memory (PIM)-featured sparse attention accelerator that eliminates off-chip data transmissions. First, we present a novel attention calculation mode to balance the crossbar writing and crossbar processing latency. Second, we design a novel PIM-based sparsity pruning architecture to eliminate the pruning phase’s off-chip data transfers. Finally, we present novel crossbar-based SDDMM and SpMM methods that process unstructured sparse attention matrices by coupling two types of crossbar arrays. Experimental results show that CPSAA achieves an average of 89.6×, 32.2×, 17.8×, 3.39×, and 3.84× performance improvement and 755.6×, 55.3×, 21.3×, 5.7×, and 4.9× energy saving compared with GPU, FPGA, SANGER, ReBERT, and ReTransformer, respectively.
Index Terms—processing-in-memory, domain-specific accelerator, attention mechanism, ReRAM.
The authors are with the National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China (e-mail: {huizeli, hjin, longzh, xfliao, yuh, congliu, jhxu, zhduan, cdhust, chygui}@hust.edu.cn). Accepted by TCAD on 01/05/2023.
This work is supported by the NSFC (No. 61832006). The correspondence of this paper should be addressed to Yu Huang.
I. INTRODUCTION
Attention-based neural networks show accuracy leaps in machine learning applications, e.g., natural language processing (NLP) [10] and computer vision [8]. Different from the commonly used Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) models, Transformer [30] adopts a pure attention-based neural network to better identify the dependencies between tokens of the input sequence. Following this design, Transformer and its variants achieve great accuracy improvements in NLP tasks [10], such as machine translation [30] and question answering [6]. Attention is also widely used in computer vision tasks [8], including image classification [1] and object detection [18].
The vanilla attention mechanism [30] is usually implemented as DDMM and softmax operations. By computing an attention score matrix, the attention mechanism can pay attention to the relevant token pairs. However, processing the irrelevant token pairs imposes overwhelming computation pressure, leading to intolerable execution time [7]. Researchers propose
sparse attention by adding a sparsity pruning phase before
the attention calculation to reduce irrelevant calculations [7],
[19], since most tokens in the input sequence are unrelated
to the current query. There are two types of sparse attention
designs, i.e., software-based and software-hardware co-design
methods [19]. Software-based methods [29], [38] aim to pro-
pose various optimization algorithms to reduce computational
overhead by increasing sparsity. Software-hardware co-design
solutions accelerate sparse attention by taking advantage of
high-parallelism hardware, such as Field Programmable Gate
Array (FPGA) [16], [40] and Application Specific Integrated
Circuit (ASIC) [7], [19], [32].
The above solutions only achieve limited speedups since
both the sparsity pruning and attention calculation phases
involve many off-chip data transfers. Emerging crossbar-based
architectures are promising to solve the off-chip data trans-
mission problem, such as Resistive Random Access Memory
(ReRAM) and ReRAM-based content addressable memory
(ReCAM) [12]. ReCAM is suitable for high parallel com-
parison, the core operation of content-based similarity search
in the attention mechanism. ReRAM is ideal for vector-
matrix multiplication (VMM) operation, which has superior
performance handling the DDMM operations of attention-
based neural network. Utilizing the in-situ processing ability
of ReRAM arrays, there emerge ReRAM-based PIM-featured
solutions to accelerate traditional neural network [27] and the
dense attention mechanism [11], [36]. However, these PIM-
based solutions can hardly extend to accelerate the sparse
attention for the following reasons.
First, the ReRAM array’s write overhead cannot be ignored as it can be in ReRAM-based CNN and RNN accelerators. Solving the ReRAM write overhead is urgent because many matrices in the attention mechanism cannot be reused and must be written at runtime. Second, sparse attention involves a sparsity pruning phase, which is not considered by current dense attention accelerators. Existing software-based pruning algorithms could be adopted to serve a PIM-based attention accelerator; however, these software-based pruning methods have poor performance because they need to load all input matrices from off-chip memory to the processor. Moreover, the sparsity of the attention mechanism is highly unstructured, which introduces lots of off-chip random memory accesses
to the attention calculation phase. Finally, sparse attention involves SDDMM and SpMM operations. Directly applying current ReRAM-based sparse methods [17], [28] to ReRAM-based SDDMM and SpMM operations achieves inferior performance (see § IV-C and § IV-D for details).

Fig. 1. Dataflow of the attention mechanism: (a) computation stages of the dense attention mechanism; (b) sparse attention calculation
Given this landscape, we propose CPSAA, a novel Crossbar-based PIM-featured Sparse Attention Accelerator. First, we design a new attention calculation mode to increase the parallelism of the CPSAA dataflow while hiding the ReRAM writing overhead. Second, we design a novel PIM-based sparsity pruning architecture that supports our calculation mode while removing off-chip data transmissions. Finally, we couple ReRAM and ReCAM arrays and propose new PIM-based SDDMM and SpMM methods to exploit the unstructured sparsity. The main contributions of this paper are as follows:
• We propose CPSAA, a PIM-based sparse attention accelerator, to speed up both the pruning and attention calculation phases of neural network inference.
• We design the pruning phase as a novel PIM architecture to eliminate off-chip data transfers. The attention calculation architecture involves novel ReRAM-based SDDMM and SpMM methods to increase the parallelism of sparse attention.
• We evaluate and compare CPSAA with state-of-the-art attention mechanism solutions. The experimental results show that CPSAA achieves substantial improvements in both speedup and energy saving.
II. BACKGROUND AND MOTIVATION
A. Sparse Attention
The vanilla attention mechanism maps a query matrix Q and a key-value matrix pair K-V to an output matrix Z. Figure 1 (a) depicts the calculation procedure of the dense attention mechanism. First, the query (Q), key (K), and value (V) matrices are obtained by multiplying the embedded input sequence X with the corresponding weight matrices WQ, WK, and WV. Next, the score matrix S is calculated by multiplying Q with K^T. Then, the S matrix is normalized with a row-wise softmax function Softmax(S). Finally, S is multiplied by the V matrix to get the output Z.
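To make this dataflow concrete, the sketch below walks through the same steps in NumPy; the matrix shapes, the random toy inputs, and the absence of any scaling factor are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def dense_attention(X, WQ, WK, WV):
    """Dense attention as described above: every step is a dense matrix product."""
    Q = X @ WQ                                    # queries
    K = X @ WK                                    # keys
    V = X @ WV                                    # values
    S = Q @ K.T                                   # score matrix (DDMM)
    S = np.exp(S - S.max(axis=1, keepdims=True))  # row-wise softmax (numerically stable)
    S = S / S.sum(axis=1, keepdims=True)
    return S @ V                                  # output Z (DDMM)

# toy usage: a sequence of 8 tokens with embedding dimension 16
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
WQ, WK, WV = (rng.standard_normal((16, 16)) for _ in range(3))
Z = dense_attention(X, WQ, WK, WV)
```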
As Figure 1 (a) shows, S reveals the relevance between the tokens of the Q and K matrices. Researchers find that most tokens in Q are irrelevant to K, making the S matrix inherently sparse [7]. The above fact makes it possible to utilize the sparsity of S to avoid computing these irrelevant token pairs [7].
Thus, people propose sparse attention, which adopts a sparsity pruning phase to save the computational resources of the attention calculation phase.

Fig. 2. (a) One ReRAM cell, (b) the ReRAM crossbar architecture, and (c) the ReCAM array architecture
G[i,j] = binarize(S̃[i,j], θ) = { 1, if S̃[i,j] ≥ θ; 0, otherwise }    (1)
SANGER [19] proposes a prediction-based pruning method and achieves high sparsity. The idea of their pruning method is to calculate the approximate score matrix S̃ using a low-precision DDMM operation, S̃ = Softmax(QU⁻¹(QU(Q) · QU(K^T)) / √d), where d is the dimension of the input embeddings, and QU(x) = round(γx) is a quantization operator that maps inputs to low-bit values by a scaling factor γ and a rounding bit-shift. QU⁻¹(·) is the corresponding de-quantization operator that transforms low-bit outputs back into high precision. Then, the binarization procedure described in equation (1) converts the S̃ matrix to a binary mask matrix G, where θ is the threshold of the binarization function. Because the sparsity of the mask matrix is similar to that of the score matrix S, SANGER can convert the DDMM operation S = Q·K^T to an SDDMM operation. As Figure 1 (b) shows, the SDDMM operation generates the sparse score matrix S from three input matrices (two dense matrices Q and K, and one sparse mask matrix), using the mask matrix to avoid calculating the zero-valued entries (white cells) of the S matrix. Utilizing the sparse S matrix, the DDMM operation Z = S·V can then be converted to an SpMM operation. Figure 1 (b) also visualizes the SpMM operation, which multiplies the sparse matrix S with the dense matrix V.
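A rough functional sketch of this prediction-and-prune flow is given below. It emulates the mask generation and the mask-guided SDDMM/SpMM with dense NumPy kernels, so it illustrates only the arithmetic, not the hardware savings; the scaling factor GAMMA, the threshold theta, and the √d scaling are assumed placeholders rather than SANGER’s actual configuration.

```python
import numpy as np

GAMMA = 8.0   # assumed scaling factor for the quantizer; not SANGER's actual value

def QU(x):
    """Quantization operator QU(x) = round(gamma * x): map values to low-bit integers."""
    return np.round(GAMMA * x)

def QU_inv(x):
    """De-quantization: both operands were scaled by gamma, so divide the product by gamma^2."""
    return x / (GAMMA * GAMMA)

def predict_mask(Q, K, theta=0.02):
    """Approximate score matrix via a low-precision DDMM, then binarize it (Eq. 1)."""
    d = Q.shape[1]
    S_tilde = QU_inv(QU(Q) @ QU(K.T)) / np.sqrt(d)
    S_tilde = np.exp(S_tilde - S_tilde.max(axis=1, keepdims=True))
    S_tilde /= S_tilde.sum(axis=1, keepdims=True)            # row-wise softmax
    return (S_tilde >= theta).astype(np.float64)              # binary mask G

def sparse_attention(Q, K, V, G):
    """SDDMM + SpMM guided by the mask G (emulated here with dense kernels:
    real hardware would skip the masked-out entries instead of zeroing them)."""
    S = (Q @ K.T) * G                                          # SDDMM
    E = np.where(G > 0, np.exp(S - S.max(axis=1, keepdims=True)), 0.0)
    S = E / np.maximum(E.sum(axis=1, keepdims=True), 1e-9)     # masked row-wise softmax
    return S @ V                                               # SpMM
```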
B. ReRAM and ReCAM Basics
Figure 2 (a) shows one ReRAM cell, which is comprised of a top electrode, a metal oxide layer, and a bottom electrode. Following Kirchhoff’s Current Law, a ReRAM crossbar array can process the VMM operation efficiently [33], denoted as I_n = Σ_m V_m × G(m,n), as Figure 2 (b) shows. Utilizing this highly parallel in-situ VMM operation, several ReRAM-based neural network accelerators [27] and graph processing accelerators [22], [28], [42] have been proposed.
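As a point of reference, the following idealized sketch models the crossbar’s analog VMM as a plain matrix-vector product; device non-idealities, ADC quantization, and bit-slicing are deliberately ignored, and the array size is an arbitrary example.

```python
import numpy as np

def crossbar_vmm(V, G):
    """Idealized ReRAM crossbar: word-line voltages V (length m) drive an m x n
    conductance matrix G; each bit line n accumulates I_n = sum_m V[m] * G[m][n]
    in one analog step (Kirchhoff's Current Law)."""
    return V @ G                      # vector of bit-line currents, length n

# toy usage: a 4 x 3 crossbar
G = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
V = np.array([0.5, 1.0, 0.2, 0.8])
I = crossbar_vmm(V, G)                # -> array([1.7, 4.0, 2.8])
```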
The ReCAM array architecture is shown in Figure 2 (c); it is comprised of many two-transistor, two-memristor (2T2R) ReCAM bit-cells [12]. One ReCAM bit-cell contains a pair of ReRAM cells. One ReCAM array contains a Key register, an array of ReCAM cells, drivers (DRV), and tag registers (TAG). ReCAM arrays can perform vector-scalar comparisons in parallel [12]. The TAG latches a ‘1’ signal if its row matches the Key register.
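A functional sketch of this parallel match operation, as described above for the Key register and TAG latches, is shown below; it abstracts away the 2T2R circuit details and is only an assumed behavioral model.

```python
import numpy as np

def recam_search(stored_rows, key):
    """Idealized ReCAM: all stored rows are compared against the Key register in
    parallel; TAG[i] latches '1' when row i matches the key exactly."""
    return np.all(stored_rows == key, axis=1).astype(np.uint8)

# toy usage: four stored rows, the key matches rows 0 and 2
rows = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [1, 2, 3],
                 [7, 8, 9]])
tag = recam_search(rows, np.array([1, 2, 3]))   # -> array([1, 0, 1, 0], dtype=uint8)
```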
TABLE I
CURRENT ATTENTION MECHANISM SOLUTIONS

Features            | Hardware-based            | Software-hardware co-design                                               | Software-based
                    | [11], [36] | CPSAA        | [16], [40]              | [7], [32]               | [19], [23]            | [2], [38]               | [3], [41]
Spar. pattern       | dense      | dynamic      | static                  | dynamic                 | dynamic               | static                  | dynamic
Pruning plt.        | -          | ReRAM        | CPU/GPU                 | CPU/GPU                 | CPU/GPU               | CPU/GPU                 | CPU/GPU
Attention plt.      | ReRAM      | ReRAM        | FPGA                    | ASIC                    | ASIC                  | CPU/GPU                 | CPU/GPU
Sparsity regularity | -          | unstructured | coarse-grain structured | coarse-grain structured | fine-grain structured | coarse-grain structured | unstructured
Sparsity            | -          | high         | low                     | medium                  | high                  | low                     | high
Speedup             | high       | high         | low                     | medium                  | high                  | low                     | low
C. Related Work
Table I lists recent studies of the attention mechanism. Different from SANGER’s classification, we divide these works into three categories, i.e., software-based, software-hardware co-design, and the newly added hardware-based category. The software-based methods can be further divided into static and dynamic sparsity patterns. The static sparsity pattern can only obtain coarse-grained sparsity because it does not depend on the input data [2], [38]. The dynamic sparsity pattern can achieve higher sparsity conditioned on individual input samples at runtime, but its unstructured sparsity increases the random memory access overhead [3], [41].
Software-hardware co-design methods adopt more efficient hardware, such as FPGAs and ASICs. FTRANS [16] is an efficient acceleration framework for transformer-based NLP. Zhang et al. propose a novel pruning method and design an FPGA-based sparse attention accelerator [40]. A3 [7] accelerates the attention mechanism with algorithmic approximation and hardware specialization. SpAtten [32] leverages token sparsity, head sparsity, and quantization opportunities to reduce redundant computation and random access. SANGER [19] presents a prediction-based pruning algorithm to reduce unnecessary calculations and designs a “splitting and packing” algorithm to reduce the massive random memory access overhead. DOTA [23] proposes a dynamic weak-connections detector to avoid unnecessary attention calculations and accelerate sparse attention.
Although these co-design solutions can relieve the random memory access by designing hardware-specific algorithms, many off-chip data transmissions remain because of their separate memory and processor architecture. Taking SANGER as an example, the “splitting and packing” algorithm can reduce random memory access by converting the unstructured sparsity to fine-grained structured sparsity. However, the benefit of the fine-grained structured sparsity comes at the cost of dynamically configured control signals, which make scheduling processing elements and memory accesses at runtime highly complex. In addition, SANGER introduces off-chip random accesses and data dependencies into its sparsity pruning phase. Thus, no current accelerator can solve the off-chip random memory access problem elegantly.
Fig. 3. Response time breakdown of SANGER and DOTA (operations ratio, %, over the MNLI, MRPC, QQP, QNLI, and RTE datasets, broken into MA-GE-M, MA-GE-P, AT-CA-M, and AT-CA-P)

We also list current hardware-based PIM-featured dense
attention accelerators in Table I. These solutions use highly parallel ReRAM arrays to significantly reduce the latency of DDMM operations [11], [36]. However, since the sparsity of the attention matrices is not considered, current PIM-based attention accelerators need to calculate all five DDMM operation steps, wasting lots of computational resources on irrelevant tokens. Therefore, designing a PIM-based sparse attention accelerator that reduces these unnecessary calculations becomes the key to achieving further acceleration.
D. Motivation
We find that current sparse attention accelerators have many
off-chip data transmissions. We design experiments on two
sparse attention accelerators, i.e., SANGER and DOTA, to
confirm this statement. Figure 3 presents the operations ratio of
SANGER [19] and DOTA [23] running five real-world datasets
(details to see Section V). The ratio of the operation comes
from breaking down the response time to the mask matrix
generation (MA-GE) and the attention mechanism calculation
(AT-CA). To assess the impact of off-chip memory access, we
further break down the MA-GE and AT-CA to the processor
execution time (MA-GE-P and AT-CA-P) and the memory
access time (MA-GE-M and AT-CA-M). The memory access
time is obtained by subtracting the average kernel runtime
from the total execution time.
First, the overhead of the mask generation and attention calculation phases cannot be ignored. Figure 3 shows that MA-GE takes an average of 17.9% (14.3% in DOTA) of the response time, while AT-CA takes 82.1% (85.7% in DOTA). The calculation overhead of the pruning phase is