RePAST: A ReRAM-based PIM Accelerator for
Second-order Training of DNN
Yilong Zhao1, Li Jiang1, Mingyu Gao2, Naifeng Jing1, Chengyang Gu1, Qidong Tang1,
Fangxin Liu1, Tao Yang1, Xiaoyao Liang1*
1School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
2Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
sjtuzyl@sjtu.edu.cn jiangli@cs.sjtu.edu.cn gaomy@tsinghua.edu.cn
{sjtuj,zeb19980914,tangqidong,liufangxin,yt594584152}@sjtu.edu.cn liang-xy@cs.sjtu.edu.cn
Abstract—Second-order training methods can converge much faster than first-order optimizers in DNN training, because they use the inverse of the second-order information (SOI) matrix to find a more accurate descent direction and step size. However, the huge SOI matrices bring significant computational and memory overheads on traditional architectures like GPUs and CPUs. On the other hand, ReRAM-based processing-in-memory (PIM) technology is well suited to second-order training for three reasons: First, PIM's computation happens in memory, which reduces data movement overheads; second, ReRAM crossbars can compute the inversion of the SOI in O(1) time; third, if architected properly, ReRAM crossbars can perform both the matrix inversions and the vector-matrix multiplications that are central to second-order training algorithms.
Nevertheless, current ReRAM-based PIM techniques still face a key challenge in accelerating second-order training: the existing ReRAM-based matrix inversion circuitry only supports 8-bit matrix inversion, which is not sufficient for second-order training, where at least 16-bit accuracy is needed. In this work, we propose a method to achieve high-precision matrix inversion based on proven 8-bit matrix inversion (INV) and vector-matrix multiplication (VMM) circuitry, and we design RePAST, a ReRAM-based PIM accelerator architecture for second-order training. Moreover, we propose a software mapping scheme for RePAST that further optimizes performance by fusing VMM and INV crossbars. Experiments show that RePAST achieves an average of 115.8×/11.4× speedup and 41.9×/12.8× energy saving compared to a GPU counterpart and PipeLayer, respectively, on large-scale DNNs.
I. INTRODUCTION
Most prevalent optimizers for neural network training, including Stochastic Gradient Descent (SGD) [38], Adagrad [13], and Adam [23], only use first-order gradient information. However, as the complexity of neural network models increases, these optimizers take much more time to train the network. For example, it takes over 29 hours to train a ResNet-50 on an 8-GPU cluster [17]. To address this
problem, second-order training algorithms have been proposed to accelerate the training process [24], [31], [36]. These algorithms take advantage of the inverse of the second-order information (SOI) matrix to accelerate convergence. For the common second-order optimization methods, e.g., the Newton method and the natural gradient method, the SOI matrices are the Hessian matrix (HM) and the Fisher information matrix (FIM), respectively. For small neural networks on small datasets, such as an autoencoder on MNIST, second-order training can reduce the number of iterations by over 100× [31]. One major challenge of directly applying these algorithms to large-scale DNNs is that the SOI size increases quadratically with the number of parameters. For example, a ResNet-50 network has 2.5×10^7 parameters, so the size of its HM is around 6.3×10^14. The large size of the SOI brings two problems: First, computing with the SOI requires a significant amount of data movement. Second, the complexity of matrix inversion on the SOI is O(n^3), which causes a prohibitive computational cost. As a result, directly applying the second-order algorithms to large DNNs is in practice much slower than the first-order algorithms.
Current second-order training algorithms approximate the SOI with several smaller matrices to reduce its overhead, treating the elements across different layers as 0. The K-FAC algorithm uses Kronecker decomposition to decompose the FIM of each layer into two smaller matrices [31]. However, the size of the SOI remains large; for example, after this approximation, the SOI of ResNet-50 still has about 1.4×10^8 elements. ADAHESSIAN directly approximates the SOI as a diagonal matrix so that the SOI does not need to be inverted [51]. However, this results in a large approximation error, such that little improvement in the convergence rate of training is observed. Some algorithms trade off the SOI overhead against the convergence rate by approximating the SOI as a block-diagonal matrix [7], [49]. However, limited by GPU performance, the optimal block size is usually small; for example, the optimal block size of the THOR algorithm is only 128 [7]. With smaller block sizes, training takes more epochs due to higher approximation errors, leading to longer convergence time and offsetting the algorithmic advantage of second-order training.
ReRAM-based PIM is an emerging technique whose computation happens in memory and which has an inherent ability to accelerate vector-matrix multiplication (VMM) [21]. While each ReRAM cell can only differentiate a limited number of bit levels, which means a ReRAM crossbar by itself can only perform low-precision VMM, high-precision VMM can be carried out by splitting it into multiple low-precision VMMs with a bit-slicing scheme [41]. Based on this,
Song et al. propose PipeLayer, a ReRAM-based architecture to accelerate the first-order training of DNNs [44]. ReRAM crossbars can also accelerate the computation of matrix inversion (INV), such as in the design proposed by Sun et al. [47]. The time complexity of their circuit is O(1), but the precision is limited by the number of bits a ReRAM cell can support. The authors in [14] propose an analog bit-slicing scheme and increase the precision of the INV circuit to 8 bits. They use analog amplifiers to combine the analog signals from the bitlines into a more precise analog signal. However, this method cannot improve the precision further, because the noise level already approaches the precision limit of the analog signals.
These characteristics of ReRAM-based PIM technology make it suitable for accelerating second-order training algorithms: PIM reduces the data movement of the large SOI; it can perform matrix inversion in O(1) time regardless of the matrix size, removing the computational bottleneck of the inversion; and ReRAM-based crossbars can perform high-precision VMM through the bit-slicing technique. However, the key challenge in designing PIM accelerators for second-order training algorithms is that current PIM matrix inversion circuits only support up to 8-bit precision. This cannot meet the requirement of second-order training algorithms, most of which need 16-bit precision or higher [7], [15], [31], [49].
In this work, we address this challenge and propose
RePAST, a ReRAM-based PIM accelerator for the second-
order training. We make the following contributions:
• We propose a novel scheme to implement high-precision matrix inversion based on multiple low-precision matrix inversion circuits. Leveraging a Taylor expansion, a high-precision matrix inversion can be decomposed into a series of low-precision matrix inversions chained with VMM operations, both of which can be implemented with ReRAM-based crossbars (see the sketch after this list).
• We propose RePAST, a ReRAM-based accelerator architecture for second-order training. RePAST has a carefully designed architecture and mapping strategy that fuse INV and VMM operations smoothly, supporting all the typical operations in second-order training algorithms, including VMM and high-precision INV.
• The mapping strategy also reduces the overhead of the SOI matrices. We find that some patterns in the dataflow graph vary in computation latency or memory footprint under different mapping strategies. For each pattern, we propose a method to select the optimal mapping strategy for each layer.
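As a concrete illustration of the first contribution, here is a minimal numpy sketch (our own residual-correction formulation, using a Neumann/Taylor series around an 8-bit approximate inverse; the exact decomposition used in Section III may differ in details). It refines a quantized inverse toward higher precision using only matrix multiplications, i.e., VMM-style operations.

```python
import numpy as np

def quantize(mat, bits=8):
    """Crude uniform quantizer emulating the output of a low-precision INV circuit."""
    scale = np.abs(mat).max() / (2 ** (bits - 1) - 1)
    return np.round(mat / scale) * scale

rng = np.random.default_rng(0)
M = rng.standard_normal((64, 64))
A = M @ M.T + 64 * np.eye(64)             # SPD matrix standing in for one SOI block

X = quantize(np.linalg.inv(A), bits=8)    # stand-in for an 8-bit ReRAM INV result
R = np.eye(64) - A @ X                    # residual of the low-precision inverse

# Taylor/Neumann refinement: A^{-1} = X (I + R + R^2 + ...); every extra term
# needs only matrix multiplications, i.e. VMM-style crossbar operations.
refined = X.copy()
term = X
for _ in range(5):
    term = term @ R
    refined = refined + term

exact = np.linalg.inv(A)
print(np.abs(X - exact).max(), np.abs(refined - exact).max())  # error drops after refinement
```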
The remainder of the paper is organized as follows: Section II introduces the background of second-order training algorithms as well as the ReRAM-based VMM and low-precision INV circuits. In Section III, we introduce a novel method for the
circuits. In Section III, we introduce a novel method for the
high-precision matrix inversion and implement it based on the
ReRAM-based VMM and low-precision INV circuits. Section
IV presents the overview and the hardware design of the
RePAST architecture. In Section V, the mapping scheme is
explained. Section VI presents the evaluation and analysis of
RePAST. Section VII discusses other related works. Section
VIII concludes the paper.
II. BACKGROUND
A. Second-Order Training
First-order training algorithms only utilize the first-order
information to obtain the direction and step for optimization.
The rule for parameter update is:
θ = θ − η · ∇_θ J(θ)    (1)

where θ denotes the parameters of the neural network, J is the loss function, and η is the learning rate.
Second-order training methods take advantage of the SOI matrix H to obtain a better descent direction and step. Since the SOI matrix depicts the surrounding loss landscape, the descent path is closer to the optimal one and can avoid getting stuck at saddle points. As a result, convergence is faster and the number of iterations and epochs can be reduced. For example, the second-order optimizer proposed in [36] only needs half the iterations or epochs to train a ResNet-50 compared to a first-order one. The parameter update rule in second-order training is:

θ = θ − η · H^{-1} ∇_θ J(θ)    (2)
where H is derived from J and θ, as determined by the definition of the SOI. However, this formulation cannot be directly applied to DNNs, because the size of the SOI scales quadratically with the number of parameters, which results in unacceptably large overheads for both computation and storage. Therefore, most algorithms decouple the full SOI matrix by layers, regarding the second-order gradients between layers as 0. A minimal sketch contrasting the two update rules on a toy problem is given below; we then introduce two popular approaches.
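The following toy numerical sketch (ours; a small quadratic loss, not a DNN) contrasts Eqn. (1) and Eqn. (2): the second-order step rescales the gradient by the inverse Hessian and reaches the optimum of the quadratic in a single update.

```python
import numpy as np

H = np.diag([100.0, 1.0])                      # ill-conditioned Hessian of a toy quadratic loss
theta_star = np.array([3.0, -2.0])             # minimizer of J(θ) = 0.5 (θ-θ*)^T H (θ-θ*)
grad = lambda theta: H @ (theta - theta_star)  # ∇_θ J(θ)

theta_first, theta_second = np.zeros(2), np.zeros(2)
eta = 0.009                                    # first-order step must stay below 2/λ_max = 0.02

for _ in range(100):                           # Eqn. (1): 100 first-order updates
    theta_first = theta_first - eta * grad(theta_first)

# Eqn. (2): a single second-order update with η = 1 solves the quadratic exactly
theta_second = theta_second - np.linalg.inv(H) @ grad(theta_second)

print(theta_first)    # still noticeably off along the low-curvature direction
print(theta_second)   # equals θ* up to floating-point error
```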
1) K-FAC Algorithm. K-FAC is derived from the natural gradient (NG) method, whose SOI is the Fisher information matrix (FIM) F [31]. K-FAC decomposes each layer's FIM block into the Kronecker product of two matrices A and G, which are much smaller than the original FIM. After this approximation and decomposition, the storage of the ResNet-50 SOI drops to about 140 million elements, still large but acceptable for a GPU. Moreover, the Kronecker product can be transformed into matrix multiplications, so the parameter update rule becomes:

θ = θ − η · A^{-1} ∇_θ J(θ) G^{-1}    (3)
The matrices A and G are obtained as A = a · a^T and G = g · g^T, where a is the input feature map and g is the gradient of the layer. Therefore, the two SOI factors can be obtained efficiently during backpropagation. In a fully connected (FC) layer, the input feature map a and the gradient g are directly multiplied with themselves when computing A and G. For a convolution layer, a and g are first reshaped to c_in k^2 × hw and c_out × hw respectively, and then multiplied with themselves, where c_in, c_out, k, h and w are the input/output channel counts, the kernel size, and the height and width of the feature map, respectively. Therefore, for a convolution layer, A ∈ R^{c_in k^2 × c_in k^2} and G ∈ R^{c_out × c_out} [36].

Fig. 1. With different block sizes, (a) the step time of training ResNet-50 on a GPU, and (b) the number of epochs needed to train ResNet-50 to 75.6% accuracy. The number of epochs for ResNet-50 training without block approximation is 34 [36].
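Returning to the K-FAC update, the sketch below makes Eqn. (3) and the construction of A and G concrete for a single FC layer (our own simplified illustration, not the paper's implementation: real K-FAC maintains running averages of the factors, and the damping term and hyperparameter values here are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 32, 128, 64
W = rng.standard_normal((d_in, d_out)) * 0.01   # FC weights, oriented to match Eqn. (3)

a = rng.standard_normal((batch, d_in))          # layer inputs a from the forward pass
g = rng.standard_normal((batch, d_out))         # gradients g w.r.t. the layer's pre-activations

A = a.T @ a / batch                             # A = a·a^T averaged over the batch (d_in  x d_in)
G = g.T @ g / batch                             # G = g·g^T averaged over the batch (d_out x d_out)
grad_W = a.T @ g / batch                        # first-order gradient ∇_W J       (d_in  x d_out)

eta, damping = 0.1, 1e-2                        # damping keeps the factors invertible (assumed values)
A_inv = np.linalg.inv(A + damping * np.eye(d_in))
G_inv = np.linalg.inv(G + damping * np.eye(d_out))

W -= eta * A_inv @ grad_W @ G_inv               # Eqn. (3): θ = θ − η·A^{-1} ∇_θ J(θ) G^{-1}
```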
2) Gauss-Newton Algorithm. The Gauss-Newton method is derived from the Newton method, whose SOI is the Hessian matrix (HM) H [24], [49]. Each layer's HM block is approximated by H ≈ ∇_θ J(θ) B ∇_θ J(θ)^T, where the matrix B is determined by the loss function. For example, when using the cross-entropy loss, B becomes the identity matrix.
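As a small illustration of this block approximation (our sketch, with B = I as in the cross-entropy case; forming the block from per-sample gradients averaged over a batch is one common estimator and may differ from the paper's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, p = 32, 256
grads = rng.standard_normal((batch, p))   # per-sample flattened gradients ∇_θ J(θ) of one layer

# H ≈ ∇_θ J(θ) B ∇_θ J(θ)^T with B = I, averaged over the batch → a p x p HM block
H_block = grads.T @ grads / batch
```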
Diagonal block approximation of SOI matrices. Nevertheless, even with the approximations in both algorithms, computing with the SOI still incurs a large overhead. Therefore, some works further approximate the FIM and HM as block-diagonal matrices [7], [49] or even diagonal matrices [51]. The inverse of the SOI matrix can then be obtained by inverting each block, which reduces the computational overhead. Take one convolution layer of ResNet-50's conv5_x as an example. The layer has 512 output channels, so the matrix G is 512 × 512 according to [36]. If we approximate G with four 128 × 128 diagonal blocks, the storage is reduced by 4×. Moreover, as the computational complexity of matrix inversion is O(n^3), the computation after approximation is reduced by 64×. Fig. 1(a) plots the step time of ResNet-50 training with different block sizes. Using a smaller block size reduces the execution time spent on the SOI. However, a small block size also means that more information in the SOI is discarded, so the descent direction and step size may be less accurate. This actually slows down the overall convergence in training, because more epochs are required to converge. As Fig. 1(b) shows, the number of epochs required to train ResNet-50 increases sharply as the block size shrinks. This approximation therefore introduces a trade-off between the convergence rate and the block size. The results in [7] show that a block size of 128 is suitable for a GPU. However, such a block size may slow down the convergence by 29.4% and diminish the advantage of second-order training.
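The block-diagonal approximation and the resulting blockwise inversion can be sketched as follows (our illustration; a random SPD matrix stands in for the 512×512 G factor of the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, nb = 512, 128                              # factor size and block size
M = rng.standard_normal((n, n))
G = M @ M.T / n + np.eye(n)                   # SPD stand-in for the 512x512 factor G

# Keep only the four 128x128 diagonal blocks and invert each block independently.
G_inv_blocks = np.zeros_like(G)
for s in range(0, n, nb):
    blk = G[s:s + nb, s:s + nb]
    G_inv_blocks[s:s + nb, s:s + nb] = np.linalg.inv(blk)

# Storage drops from n^2 to (n/nb)·nb^2 elements (4x here), and each inversion costs
# O(nb^3) instead of O(n^3), at the price of discarding the off-block entries of G.
```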
B. PIM-Based VMM and Matrix Inversion
Current works show that ReRAM-based crossbars offer
great efficiency in computing vector-matrix multiplication
(VMM) with PIM technologies [21]. One such ReRAM-
based VMM accelerator is illustrated in Fig. 2(a).

Fig. 2. ReRAM crossbar circuits for accelerating (a) vector-matrix multiplication (VMM) and (b) matrix inversion (INV).

The matrix element values are programmed (i.e., written) into the crossbar,
and the input vector is applied to the wordlines as voltage levels through DACs. According to Kirchhoff's law, the aggregated currents on the bitlines, after passing through ADCs, become the result of the VMM. DNN inference and training require sufficient precision for VMM operations, e.g., 8-16 bits, whereas a state-of-the-art ReRAM cell can only hold 2-6 bits [20]. Therefore, each element of the matrix is split by bits and written to several adjacent cells. Moreover, since a DAC's complexity grows exponentially with its resolution, each input vector element is also split by bits and applied to a wordline across several cycles to allow for low-bit DACs. The intermediate results across bitlines and cycles are combined into the final result by shift-and-add (S+A) operations. This is the well-known bit-slicing scheme that solves the precision problem in VMM [9], [41].
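A purely digital sketch of this bit-slicing scheme (ours; plain integer arithmetic stands in for the analog crossbar, and the 2-bit cell and 1-bit DAC resolutions are assumptions chosen for illustration):

```python
import numpy as np

CELL_BITS, DAC_BITS = 2, 1                 # assumed ReRAM cell and DAC resolutions
W_BITS, X_BITS = 8, 8                      # target precisions of the matrix and the input

rng = np.random.default_rng(0)
W = rng.integers(0, 2 ** W_BITS, size=(16, 16))   # unsigned 8-bit matrix (stored across cells)
x = rng.integers(0, 2 ** X_BITS, size=16)         # unsigned 8-bit input vector

# Slice the matrix into 2-bit planes (adjacent cells) and the input into 1-bit planes (cycles).
w_slices = [(W >> (CELL_BITS * i)) & (2 ** CELL_BITS - 1) for i in range(W_BITS // CELL_BITS)]
x_slices = [(x >> (DAC_BITS * j)) & (2 ** DAC_BITS - 1) for j in range(X_BITS // DAC_BITS)]

# Each (i, j) pair is one low-precision crossbar pass; shift-and-add (S+A) recombines them.
y = np.zeros(16, dtype=np.int64)
for i, Wi in enumerate(w_slices):
    for j, xj in enumerate(x_slices):
        partial = xj @ Wi                                 # low-precision VMM on one slice pair
        y += partial << (CELL_BITS * i + DAC_BITS * j)    # weight by the slices' bit significance

assert np.array_equal(y, x @ W)            # matches the full-precision VMM
```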
Recent works also use ReRAM crossbars to accelerate
matrix inversion (INV) with the help of extra analog com-
ponents [14], [45], [47]. The circuit is shown in Fig. 2(b).
Suppose a matrix A is quantized to Q_A bits, and a ReRAM cell can hold R_c bits. A is bit-sliced, and each slice is programmed to a separate ReRAM crossbar. The gain of the i-th crossbar's voltage followers (VFs) is set to 2^{-iR_c}. A vector b is applied as voltage levels on the bitline resistances of the crossbars. The currents on the bitlines of the crossbars are aggregated and fed back through open-loop operational amplifiers (OpAmps) to the wordlines. According to the virtual-short and virtual-open characteristics of the OpAmps, the wordline voltages converge to the values that drive both the current and the voltage on the bitlines to 0. The wordline voltages x, the matrix A, and the bitline voltages b then satisfy Eqn. 4, effectively achieving the matrix INV in Eqn. 5:

x · A = b    (4)
x = A^{-1} · b    (5)
The convergence time is determined by the dynamic characteristics of the OpAmps: a wider OpAmp bandwidth leads to faster convergence. If the convergence time can be kept within one cycle, the circuit computes the matrix inversion in O(1) time. For example, the circuit in [14] converges within 50 ns, much shorter than the typical cycle time of PIM architectures (around 100 ns [41]).
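An idealized behavioral model of this feedback loop (ours; a discrete-time stand-in for the OpAmp dynamics, not a circuit-level or noise-aware simulation): the wordline voltages keep adjusting until the bitline residual x·A − b vanishes, at which point Eqn. 4 holds and x equals the solution in Eqn. 5 (A is symmetric here, as SOI blocks are).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T / n + np.eye(n)             # symmetric positive-definite stand-in for one SOI block
b = rng.standard_normal(n)

x = np.zeros(n)                         # wordline voltages, starting from zero
dt, gain = 0.05, 1.0                    # loosely play the role of the OpAmp bandwidth
for _ in range(2000):
    residual = x @ A - b                # bitline imbalance the OpAmps feed back
    x -= gain * dt * residual           # negative feedback drives the residual toward 0

print(np.allclose(x, np.linalg.inv(A) @ b, atol=1e-6))   # Eqn. (5): x = A^{-1}·b
```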