RePAST: A ReRAM-based PIM Accelerator for
Second-order Training of DNN
Yilong Zhao1, Li Jiang1, Mingyu Gao2, Naifeng Jing1, Chengyang Gu1, Qidong Tang1,
Fangxin Liu1, Tao Yang1, Xiaoyao Liang1*
1School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
2Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
sjtuzyl@sjtu.edu.cn jiangli@cs.sjtu.edu.cn gaomy@tsinghua.edu.cn
{sjtuj,zeb19980914,tangqidong,liufangxin,yt594584152}@sjtu.edu.cn liang-xy@cs.sjtu.edu.cn
Abstract—Second-order training methods can converge much faster than first-order optimizers in DNN training, because they use the inverse of the second-order information (SOI) matrix to find a more accurate descent direction and step size. However, the huge SOI matrices bring significant computational and memory overheads on traditional architectures like GPUs and CPUs. On the other hand, ReRAM-based processing-in-memory (PIM) technology is well suited to second-order training for three reasons: First, PIM's computation happens in memory, which reduces data movement overheads; second, ReRAM crossbars can compute the inversion of the SOI in O(1) time; third, if architected properly, ReRAM crossbars can perform both the matrix inversions and the vector-matrix multiplications that are central to second-order training algorithms.
Nevertheless, current ReRAM-based PIM techniques still face a key challenge in accelerating second-order training: the existing ReRAM-based matrix inversion circuitry only supports 8-bit matrix inversion, which is not sufficient for second-order training, where at least 16-bit accuracy is needed. In this work, we propose a method to achieve high-precision matrix inversion based on proven 8-bit matrix inversion (INV) and vector-matrix multiplication (VMM) circuitry, and we design RePAST, a ReRAM-based PIM accelerator architecture for second-order training. Moreover, we propose a software mapping scheme for RePAST that further optimizes performance by fusing VMM and INV crossbars. Experiments show that RePAST achieves an average of 115.8×/11.4× speedup and 41.9×/12.8× energy saving compared to a GPU counterpart and PipeLayer, respectively, on large-scale DNNs.
I. INTRODUCTION
Most prevalent optimizers for neural network training, including Stochastic Gradient Descent (SGD) [38], Adagrad [13], and Adam [23], only use first-order gradient information. However, as the complexity of neural network models increases, these optimizers take much more time to train the network. For example, it takes over 29 hours to train a ResNet-50 on an 8-GPU cluster [17]. To address this
problem, second-order training algorithms have been proposed to accelerate the training process [24], [31], [36]. These algorithms take advantage of the inverse of the second-order information (SOI) matrix to accelerate convergence. For the common second-order optimization methods, e.g., the Newton method and the natural gradient method, the SOI matrices are the Hessian matrix (HM) and the Fisher information matrix (FIM), respectively. For small neural networks on small datasets, such as an autoencoder on MNIST, second-order training can reduce the number of iterations by over 100× [31]. One major challenge of directly applying these algorithms to large-scale DNNs is that the SOI size increases quadratically with the number of parameters. For example, a ResNet-50 network has 2.5×10^7 parameters, so the size of its HM is around 6.3×10^14. The large size of the SOI brings two problems: First, computing with the SOI requires a significant amount of data movement. Second, the complexity of matrix inversion on the SOI is O(n^3), which causes a prohibitive computational cost. As a result, directly applying the second-order algorithms to large DNNs is in practice much slower than the first-order algorithms.
Current second-order training algorithms approximate the SOI with several smaller matrices to reduce its overhead, treating the elements across different layers as 0. The K-FAC algorithm uses Kronecker decomposition to decompose the FIM of each layer into two smaller matrices [31]. However, the size of the SOI remains large; for example, after this approximation, the SOI of ResNet-50 still has about 1.4×10^8 elements. ADAHESSIAN directly approximates the SOI as a diagonal matrix so that the SOI does not need to be inverted [51]. However, this results in a large approximation error, such that little improvement in the convergence rate of training is observed. Some algorithms trade off the SOI overhead against the convergence rate by approximating the SOI as a block-diagonal matrix [7], [49]. However, limited by GPU performance, the optimal block size is usually small; for example, the optimal block size of the THOR algorithm is only 128 [7]. With smaller block sizes, training takes more epochs due to higher approximation errors, leading to longer convergence time and offsetting the algorithmic advantage of second-order training.
ReRAM-based PIM is an emerging technique whose computation happens in memory and which has an inherent ability to accelerate vector-matrix multiplication (VMM) [21]. While each ReRAM cell can only differentiate a limited number of bit levels, which means a ReRAM crossbar by itself can only perform low-precision VMM, high-precision VMM can be carried out by splitting it into multiple low-precision VMMs with a bit-slicing scheme [41]. Based on this,
Song et al. propose PipeLayer, a ReRAM-based architecture to accelerate the first-order training of DNNs [44]. ReRAM crossbars can also accelerate the computation of matrix inversion (INV), such as in the design proposed by Sun et al. [47]. The time complexity of their circuit is O(1), but the precision is limited by the number of bits a ReRAM cell can support. The authors in [14] propose an analog bit-slicing scheme and increase the precision of the INV circuit to 8 bits. They use analog amplifiers to combine the analog signals from the bitlines into a more precise analog signal. However, this method cannot improve the precision further, because the noise level already approaches the precision limit of the analog signals.
These characteristics of ReRAM-based PIM technology make it suitable for accelerating second-order training algorithms: PIM reduces the data movement of the large SOI; it can perform matrix inversion in O(1) time regardless of the matrix size, removing the computational bottleneck of the inversion; and ReRAM-based crossbars can perform high-precision VMM through the bit-slicing technique. However, the key challenge in designing PIM accelerators for second-order training algorithms is that current PIM matrix inversion circuits only support up to 8-bit precision. This cannot meet the requirement of second-order training algorithms, most of which need 16-bit precision or higher [7], [15], [31], [49].
In this work, we address this challenge and propose
RePAST, a ReRAM-based PIM accelerator for the second-
order training. We make the following contributions:
• We propose a novel scheme to implement high-precision matrix inversion based on multiple low-precision matrix inversion circuits. Leveraging a Taylor expansion, a high-precision matrix inversion can be decomposed into a series of low-precision matrix inversions chained with VMM operations, both of which can be implemented with ReRAM-based crossbars (see the sketch after this list).
• We propose RePAST, a ReRAM-based accelerator architecture for second-order training. RePAST has a carefully designed architecture and mapping strategy that fuse INV and VMM operations smoothly, supporting all the typical operations in second-order training algorithms, including VMM and high-precision INV.
• The mapping strategy also reduces the overhead of the SOI matrices. We find that some patterns in the dataflow graph vary in computation latency or memory footprint under different mapping strategies. For each pattern, we propose a method to select the optimal mapping strategy for each layer.
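As a concrete illustration of the first contribution, here is a minimal numpy sketch (our own residual-correction formulation, using a Neumann/Taylor series around an 8-bit approximate inverse; the exact decomposition used in Section III may differ in details). It refines a quantized inverse toward higher precision using only matrix multiplications, i.e., VMM-style operations.

```python
import numpy as np

def quantize(mat, bits=8):
    """Crude uniform quantizer emulating the output of a low-precision INV circuit."""
    scale = np.abs(mat).max() / (2 ** (bits - 1) - 1)
    return np.round(mat / scale) * scale

rng = np.random.default_rng(0)
M = rng.standard_normal((64, 64))
A = M @ M.T + 64 * np.eye(64)             # SPD matrix standing in for one SOI block

X = quantize(np.linalg.inv(A), bits=8)    # stand-in for an 8-bit ReRAM INV result
R = np.eye(64) - A @ X                    # residual of the low-precision inverse

# Taylor/Neumann refinement: A^{-1} = X (I + R + R^2 + ...); every extra term
# needs only matrix multiplications, i.e. VMM-style crossbar operations.
refined = X.copy()
term = X
for _ in range(5):
    term = term @ R
    refined = refined + term

exact = np.linalg.inv(A)
print(np.abs(X - exact).max(), np.abs(refined - exact).max())  # error drops after refinement
```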
The remainder of the paper is organized as follows: Section II introduces the background of second-order training algorithms as well as the ReRAM-based VMM and low-precision INV circuits. In Section III, we introduce a novel method for the
circuits. In Section III, we introduce a novel method for the
high-precision matrix inversion and implement it based on the
ReRAM-based VMM and low-precision INV circuits. Section
IV presents the overview and the hardware design of the
RePAST architecture. In Section V, the mapping scheme is
explained. Section VI presents the evaluation and analysis of
RePAST. Section VII discusses other related works. Section
VIII concludes the paper.
II. BACKGROUND
A. Second-Order Training
First-order training algorithms only utilize the first-order
information to obtain the direction and step for optimization.
The rule for parameter update is:
θ = θ − η · ∇_θ J(θ)    (1)

where θ denotes the parameters of the neural network, J is the loss function, and η is the learning rate.
Second-order training methods take advantage of the SOI matrix H to obtain a better descent direction and step. Since the SOI matrix depicts the surrounding loss landscape, the descent path is closer to the optimal one and can avoid getting stuck at saddle points. As a result, convergence is faster and the number of iterations and epochs can be reduced. For example, the second-order optimizer proposed in [36] only needs half the iterations or epochs to train a ResNet-50 compared to a first-order one. The parameter update rule in second-order training is:

θ = θ − η · H^{-1} ∇_θ J(θ)    (2)
where H is derived from J and θ, as determined by the definition of the SOI. However, this formulation cannot be directly applied to DNNs, because the size of the SOI scales quadratically with the number of parameters, which results in unacceptably large overheads for both computation and storage. Therefore, most algorithms decouple the full SOI matrix by layers, regarding the second-order gradients between layers as 0. A minimal sketch contrasting the two update rules on a toy problem is given below; we then introduce two popular approaches.
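The following toy numerical sketch (ours; a small quadratic loss, not a DNN) contrasts Eqn. (1) and Eqn. (2): the second-order step rescales the gradient by the inverse Hessian and reaches the optimum of the quadratic in a single update.

```python
import numpy as np

H = np.diag([100.0, 1.0])                      # ill-conditioned Hessian of a toy quadratic loss
theta_star = np.array([3.0, -2.0])             # minimizer of J(θ) = 0.5 (θ-θ*)^T H (θ-θ*)
grad = lambda theta: H @ (theta - theta_star)  # ∇_θ J(θ)

theta_first, theta_second = np.zeros(2), np.zeros(2)
eta = 0.009                                    # first-order step must stay below 2/λ_max = 0.02

for _ in range(100):                           # Eqn. (1): 100 first-order updates
    theta_first = theta_first - eta * grad(theta_first)

# Eqn. (2): a single second-order update with η = 1 solves the quadratic exactly
theta_second = theta_second - np.linalg.inv(H) @ grad(theta_second)

print(theta_first)    # still noticeably off along the low-curvature direction
print(theta_second)   # equals θ* up to floating-point error
```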
1) K-FAC Algorithm. K-FAC is derived from the natural gradient (NG) method, whose SOI is the Fisher information matrix (FIM) F [31]. K-FAC decomposes each layer's FIM block into the Kronecker product of two matrices A and G, which are much smaller than the original FIM. After this approximation and decomposition, the storage of the ResNet-50 SOI drops to about 140 million elements, still large but acceptable for a GPU. Moreover, the Kronecker product can be transformed into matrix multiplications, so the parameter update rule becomes:

θ = θ − η · A^{-1} ∇_θ J(θ) G^{-1}    (3)
The matrices A and G are obtained as A = a · a^T and G = g · g^T, where a is the input feature map and g is the gradient of the layer. Therefore, the two SOI factors can be obtained efficiently during backpropagation. In a fully connected (FC) layer, the input feature map a and the gradient g are directly multiplied with themselves when computing A and G. For a convolution layer, a and g are first reshaped to c_in k^2 × hw and c_out × hw respectively, and then multiplied with themselves, where c_in, c_out, k, h and w are the input/output channel counts, the kernel size, and the height and width of the feature map, respectively. Therefore, for a convolution layer, A ∈ R^{c_in k^2 × c_in k^2} and G ∈ R^{c_out × c_out} [36].

Fig. 1. With different block sizes, (a) the step time of training ResNet-50 on a GPU, and (b) the number of epochs needed to train ResNet-50 to 75.6% accuracy. The number of epochs for ResNet-50 training without block approximation is 34 [36].
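Returning to the K-FAC update, the sketch below makes Eqn. (3) and the construction of A and G concrete for a single FC layer (our own simplified illustration, not the paper's implementation: real K-FAC maintains running averages of the factors, and the damping term and hyperparameter values here are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 32, 128, 64
W = rng.standard_normal((d_in, d_out)) * 0.01   # FC weights, oriented to match Eqn. (3)

a = rng.standard_normal((batch, d_in))          # layer inputs a from the forward pass
g = rng.standard_normal((batch, d_out))         # gradients g w.r.t. the layer's pre-activations

A = a.T @ a / batch                             # A = a·a^T averaged over the batch (d_in  x d_in)
G = g.T @ g / batch                             # G = g·g^T averaged over the batch (d_out x d_out)
grad_W = a.T @ g / batch                        # first-order gradient ∇_W J       (d_in  x d_out)

eta, damping = 0.1, 1e-2                        # damping keeps the factors invertible (assumed values)
A_inv = np.linalg.inv(A + damping * np.eye(d_in))
G_inv = np.linalg.inv(G + damping * np.eye(d_out))

W -= eta * A_inv @ grad_W @ G_inv               # Eqn. (3): θ = θ − η·A^{-1} ∇_θ J(θ) G^{-1}
```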
2) Gauss-Newton Algorithm. The Gauss-Newton method is derived from the Newton method, whose SOI is the Hessian matrix (HM) H [24], [49]. Each layer's HM block is approximated by H ≈ ∇_θ J(θ) B ∇_θ J(θ)^T, where the matrix B is determined by the loss function. For example, when using the cross-entropy loss, B becomes the identity matrix.
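As a small illustration of this block approximation (our sketch, with B = I as in the cross-entropy case; forming the block from per-sample gradients averaged over a batch is one common estimator and may differ from the paper's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, p = 32, 256
grads = rng.standard_normal((batch, p))   # per-sample flattened gradients ∇_θ J(θ) of one layer

# H ≈ ∇_θ J(θ) B ∇_θ J(θ)^T with B = I, averaged over the batch → a p x p HM block
H_block = grads.T @ grads / batch
```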
Diagonal block approximation of SOI matrices. Nevertheless, even with the approximations in both algorithms, computing with the SOI still incurs a large overhead. Therefore, some works further approximate the FIM and HM as block-diagonal matrices [7], [49] or even diagonal matrices [51]. The inverse of the SOI matrix can then be obtained by inverting each block, which reduces the computational overhead. Take one convolution layer of ResNet-50's conv5_x as an example. The layer has 512 output channels, so the matrix G is 512 × 512 according to [36]. If we approximate G with four 128 × 128 diagonal blocks, the storage is reduced by 4×. Moreover, as the computational complexity of matrix inversion is O(n^3), the computation after approximation is reduced by 64×. Fig. 1(a) plots the step time of ResNet-50 training with different block sizes. Using a smaller block size reduces the execution time spent on the SOI. However, a small block size also means that more information in the SOI is discarded, so the descent direction and step size may be less accurate. This actually slows down the overall convergence in training, because more epochs are required to converge. As Fig. 1(b) shows, the number of epochs required to train ResNet-50 increases sharply as the block size shrinks. This approximation therefore introduces a trade-off between the convergence rate and the block size. The results in [7] show that a block size of 128 is suitable for a GPU. However, such a block size may slow down the convergence by 29.4% and diminish the advantage of second-order training.
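The block-diagonal approximation and the resulting blockwise inversion can be sketched as follows (our illustration; a random SPD matrix stands in for the 512×512 G factor of the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, nb = 512, 128                              # factor size and block size
M = rng.standard_normal((n, n))
G = M @ M.T / n + np.eye(n)                   # SPD stand-in for the 512x512 factor G

# Keep only the four 128x128 diagonal blocks and invert each block independently.
G_inv_blocks = np.zeros_like(G)
for s in range(0, n, nb):
    blk = G[s:s + nb, s:s + nb]
    G_inv_blocks[s:s + nb, s:s + nb] = np.linalg.inv(blk)

# Storage drops from n^2 to (n/nb)·nb^2 elements (4x here), and each inversion costs
# O(nb^3) instead of O(n^3), at the price of discarding the off-block entries of G.
```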
B. PIM-Based VMM and Matrix Inversion
Current works show that ReRAM-based crossbars offer
great efficiency in computing vector-matrix multiplication
(VMM) with PIM technologies [21]. One such ReRAM-
based VMM accelerator is illustrated in Fig. 2(a).

Fig. 2. ReRAM crossbar circuits for accelerating (a) vector-matrix multiplication (VMM) and (b) matrix inversion (INV).

The matrix element values are programmed (i.e., written) into the crossbar,
and the input vector is applied to the wordlines as voltage levels through DACs. According to Kirchhoff's law, the aggregated currents on the bitlines, after passing through ADCs, become the result of the VMM. DNN inference and training require sufficient precision for VMM operations, e.g., 8-16 bits, whereas a state-of-the-art ReRAM cell can only hold 2-6 bits [20]. Therefore, each element of the matrix is split by bits and written to several adjacent cells. Moreover, since a DAC's complexity grows exponentially with its resolution, each input vector element is also split by bits and applied to a wordline across several cycles to allow for low-bit DACs. The intermediate results across bitlines and cycles are combined into the final result by shift-and-add (S+A) operations. This is the well-known bit-slicing scheme that solves the precision problem in VMM [9], [41].
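A purely digital sketch of this bit-slicing scheme (ours; plain integer arithmetic stands in for the analog crossbar, and the 2-bit cell and 1-bit DAC resolutions are assumptions chosen for illustration):

```python
import numpy as np

CELL_BITS, DAC_BITS = 2, 1                 # assumed ReRAM cell and DAC resolutions
W_BITS, X_BITS = 8, 8                      # target precisions of the matrix and the input

rng = np.random.default_rng(0)
W = rng.integers(0, 2 ** W_BITS, size=(16, 16))   # unsigned 8-bit matrix (stored across cells)
x = rng.integers(0, 2 ** X_BITS, size=16)         # unsigned 8-bit input vector

# Slice the matrix into 2-bit planes (adjacent cells) and the input into 1-bit planes (cycles).
w_slices = [(W >> (CELL_BITS * i)) & (2 ** CELL_BITS - 1) for i in range(W_BITS // CELL_BITS)]
x_slices = [(x >> (DAC_BITS * j)) & (2 ** DAC_BITS - 1) for j in range(X_BITS // DAC_BITS)]

# Each (i, j) pair is one low-precision crossbar pass; shift-and-add (S+A) recombines them.
y = np.zeros(16, dtype=np.int64)
for i, Wi in enumerate(w_slices):
    for j, xj in enumerate(x_slices):
        partial = xj @ Wi                                 # low-precision VMM on one slice pair
        y += partial << (CELL_BITS * i + DAC_BITS * j)    # weight by the slices' bit significance

assert np.array_equal(y, x @ W)            # matches the full-precision VMM
```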
Recent works also use ReRAM crossbars to accelerate
matrix inversion (INV) with the help of extra analog com-
ponents [14], [45], [47]. The circuit is shown in Fig. 2(b).
Suppose a matrix A is quantized to Q_A bits, and a ReRAM cell can hold R_c bits. A is bit-sliced, and each slice is programmed to a separate ReRAM crossbar. The gain of the i-th crossbar's voltage followers (VFs) is set to 2^{-iR_c}. A vector b is applied as voltage levels on the bitline resistances of the crossbars. The currents on the bitlines of the crossbars are aggregated and fed back through open-loop operational amplifiers (OpAmps) to the wordlines. According to the virtual-short and virtual-open characteristics of the OpAmps, the wordline voltages converge to the values that drive both the current and the voltage on the bitlines to 0. The wordline voltages x, the matrix A, and the bitline voltages b then satisfy Eqn. 4, effectively achieving the matrix INV in Eqn. 5:

x · A = b    (4)
x = A^{-1} · b    (5)
The convergence time is determined by the dynamic characteristics of the OpAmps: a wider OpAmp bandwidth leads to faster convergence. If the convergence time can be kept within one cycle, the circuit computes the matrix inversion in O(1) time. For example, the circuit in [14] converges within 50 ns, much shorter than the typical cycle time of PIM architectures (around 100 ns [41]).
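An idealized behavioral model of this feedback loop (ours; a discrete-time stand-in for the OpAmp dynamics, not a circuit-level or noise-aware simulation): the wordline voltages keep adjusting until the bitline residual x·A − b vanishes, at which point Eqn. 4 holds and x equals the solution in Eqn. 5 (A is symmetric here, as SOI blocks are).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T / n + np.eye(n)             # symmetric positive-definite stand-in for one SOI block
b = rng.standard_normal(n)

x = np.zeros(n)                         # wordline voltages, starting from zero
dt, gain = 0.05, 1.0                    # loosely play the role of the OpAmp bandwidth
for _ in range(2000):
    residual = x @ A - b                # bitline imbalance the OpAmps feed back
    x -= gain * dt * residual           # negative feedback drives the residual toward 0

print(np.allclose(x, np.linalg.inv(A) @ b, atol=1e-6))   # Eqn. (5): x = A^{-1}·b
```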