RePAST: A ReRAM-based PIM Accelerator for
Second-order Training of DNN
Yilong Zhao¹, Li Jiang¹, Mingyu Gao², Naifeng Jing¹, Chengyang Gu¹, Qidong Tang¹,
Fangxin Liu¹, Tao Yang¹, Xiaoyao Liang¹*
¹School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
²Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
sjtuzyl@sjtu.edu.cn jiangli@cs.sjtu.edu.cn gaomy@tsinghua.edu.cn
{sjtuj,zeb19980914,tangqidong,liufangxin,yt594584152}@sjtu.edu.cn liang-xy@cs.sjtu.edu.cn
Abstract—Second-order training methods can converge much faster than first-order optimizers in DNN training, because they use the inverse of the second-order information (SOI) matrix to find a more accurate descent direction and step size. However, the huge SOI matrices bring significant computational and memory overheads on traditional architectures such as GPUs and CPUs. ReRAM-based processing-in-memory (PIM) technology, on the other hand, suits second-order training for three reasons: first, PIM's computation happens in memory, which reduces data-movement overhead; second, ReRAM crossbars can compute the inversion of an SOI matrix in O(1) time; third, if architected properly, ReRAM crossbars can perform both the matrix inversions and the vector-matrix multiplications that are central to second-order training algorithms.
Nevertheless, current ReRAM-based PIM techniques still face a key challenge in accelerating second-order training: the existing ReRAM-based matrix-inversion circuitry supports only 8-bit matrix inversion, while second-order training needs at least 16-bit accurate matrix inversion. In this work, we propose a method to achieve high-precision matrix inversion built on proven 8-bit matrix-inversion (INV) circuitry and vector-matrix-multiplication (VMM) circuitry. We design RePAST, a ReRAM-based PIM accelerator architecture for second-order training. Moreover, we propose a software mapping scheme for RePAST that further optimizes performance by fusing VMM and INV crossbars. Experiments show that RePAST achieves an average of 115.8×/11.4× speedup and 41.9×/12.8× energy saving compared to a GPU counterpart and PipeLayer, respectively, on large-scale DNNs.
I. INTRODUCTION
Most prevalent optimizers for neural network training, including Stochastic Gradient Descent (SGD) [38], Adagrad [13], and Adam [23], use only first-order gradient information. However, as neural network models grow more complex, these optimizers take much longer to train them; for example, training ResNet-50 takes over 29 hours on an 8-GPU cluster [17]. To address this problem, second-order training algorithms have been proposed to accelerate training [24], [31], [36]. These algorithms exploit the inverse of the second-order information (SOI) matrix to speed up convergence. For the common second-order optimization methods, namely Newton's method and the natural gradient method, the SOI matrices are the Hessian Matrix (HM) and the Fisher Information Matrix (FIM), respectively. For small neural networks on small datasets, such as an autoencoder on MNIST, second-order training can reduce the number of iterations by over 100× [31]. One major challenge in applying these algorithms directly to large-scale DNNs is that the SOI size grows quadratically with the number of parameters. For example, ResNet-50 has 2.5×10⁷ parameters, so its HM has around 6.3×10¹⁴ entries. The large SOI causes two problems: first, computing with the SOI requires a significant amount of data movement; second, inverting the SOI costs O(n³) time, a prohibitive computational cost. As a result, directly applying second-order algorithms to real-world DNNs is actually much slower than first-order training.
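To make these costs concrete, the following is a minimal numpy sketch of one generic second-order update step; it illustrates the complexity argument only and is not the paper's or any specific optimizer's implementation (the function name, damping term, and toy setup are our own).

```python
import numpy as np

def second_order_step(theta, g, F, lr=1.0, damping=1e-3):
    """One generic second-order update: theta <- theta - lr * F^-1 g.

    theta: flattened parameters (n,); g: first-order gradient (n,);
    F: the SOI matrix (Hessian for Newton's method, FIM for the
    natural gradient method), of size n x n.
    """
    n = theta.size
    # Damping keeps F invertible; common practice in second-order methods.
    F_damped = F + damping * np.eye(n)
    # O(n^3) work: solve F_damped @ d = g rather than forming F^-1.
    d = np.linalg.solve(F_damped, g)
    return theta - lr * d

# Toy usage. For ResNet-50 (n ~ 2.5e7), F alone would hold ~6.3e14
# entries, so even storing the SOI is infeasible without approximation.
rng = np.random.default_rng(0)
n = 100
F = rng.standard_normal((n, n))
F = F @ F.T + np.eye(n)          # symmetric positive definite toy SOI
theta = second_order_step(rng.standard_normal(n), rng.standard_normal(n), F)
```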
Current second-order training algorithms approximate the SOI as several smaller matrices to reduce its overhead, treating the elements that couple different layers as 0. The K-FAC algorithm uses Kronecker decomposition to factor each layer's FIM into two smaller matrices [31]. Even so, the SOI remains large: after this approximation, it still has about 1.4×10⁸ entries for ResNet-50. ADAHESSIAN directly approximates the SOI as a diagonal matrix so that no inversion is needed [51]; however, this incurs a large approximation error, and little improvement in the convergence rate of training is observed. Other algorithms trade off SOI overhead against convergence rate by approximating the SOI as a block-diagonal matrix [7], [49]. However, limited by GPU performance, the optimal block size is usually small; for the THOR algorithm it is only 128 [7]. Smaller blocks mean higher approximation error and thus more training epochs, lengthening convergence time and offsetting the algorithmic advantage of second-order training.
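As a concrete illustration of the Kronecker trick used by K-FAC [31], the sketch below preconditions one layer's gradient using the identity (A ⊗ G)⁻¹ = A⁻¹ ⊗ G⁻¹, so only the two small factors are ever inverted. How the factors A and G are built from activations and backpropagated gradients is omitted, and all names here are ours.

```python
import numpy as np

def kfac_precondition(grad_W, A, G, damping=1e-3):
    """Apply (A (x) G)^-1 to a layer gradient via the small factors.

    grad_W: (out, in) layer gradient; A: (in, in) input-side factor;
    G: (out, out) output-side factor. For symmetric A this equals
    reshaping (A^-1 (x) G^-1) @ vec(grad_W).
    """
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    return G_inv @ grad_W @ A_inv

# Why this helps: for a 4096x4096 fully-connected layer, the exact FIM
# block has (4096*4096)^2 ~ 2.8e14 entries, while the two factors
# together have only 2 * 4096^2 ~ 3.4e7 -- yet, as noted above, the
# per-layer factors of a large model still add up to a sizable SOI.
```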
ReRAM-based PIM is an emerging technique in which computation happens inside the memory itself, giving it an inherent ability to accelerate vector-matrix multiplication (VMM) [21]. Each ReRAM cell can only distinguish a limited number of bit levels, so a ReRAM crossbar by itself performs only low-precision VMM; a high-precision VMM can nevertheless be carried out by splitting it into multiple low-precision VMMs with a bit-slicing scheme [41], as sketched below.
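The following minimal sketch of the bit-slicing idea splits an unsigned 8-bit weight matrix into 2-bit slices (as if each crossbar cell stored 2 bits), runs one low-precision VMM per slice, and recombines the partial results with shifts. The slice widths and names are illustrative assumptions; real designs also handle signed values and bit-serial inputs, which are omitted here.

```python
import numpy as np

def bit_sliced_vmm(x, W, total_bits=8, bits_per_slice=2):
    """Compute x @ W by slicing W into bits_per_slice-bit planes.

    x: integer input vector; W: unsigned integer matrix whose entries
    fit in total_bits bits. Each slice stands in for one crossbar pass.
    """
    result = np.zeros(W.shape[1], dtype=np.int64)
    for s in range(0, total_bits, bits_per_slice):
        # Extract one low-precision slice of the matrix (what a
        # crossbar with bits_per_slice-bit cells could store).
        W_slice = (W >> s) & ((1 << bits_per_slice) - 1)
        # One low-precision VMM per slice, shifted back into place.
        result += (x @ W_slice) << s
    return result

x = np.random.randint(0, 4, size=16)
W = np.random.randint(0, 256, size=(16, 8))
assert np.array_equal(bit_sliced_vmm(x, W), x @ W)
```

Based on this,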