Training Spiking Neural Networks
with Local Tandem Learning
Qu Yang1, Jibin Wu2, Malu Zhang3, Yansong Chua4, Xinchao Wang1, Haizhou Li5,6,1
1National University of Singapore
2The Hong Kong Polytechnic University
3University of Electronic Science and Technology of China
4China Nanhu Academy of Electronics and Information Technology
5The Chinese University of Hong Kong, Shenzhen, China
6Kriston AI, Xiamen, China
Abstract
Spiking neural networks (SNNs) have been shown to be more biologically plausible and
energy efficient than their predecessors. However, there is a lack of an efficient
and generalized training method for deep SNNs, especially for deployment on
analog computing substrates. In this paper, we put forward a generalized learning
rule, termed Local Tandem Learning (LTL). The LTL rule follows the teacher-
student learning approach by mimicking the intermediate feature representations of
a pre-trained ANN. By decoupling the learning of network layers and leveraging
highly informative supervisor signals, we demonstrate rapid network convergence
within five training epochs on the CIFAR-10 dataset while having low computa-
tional complexity. Our experimental results have also shown that the SNNs thus
trained can achieve comparable accuracies to their teacher ANNs on CIFAR-10,
CIFAR-100, and Tiny ImageNet datasets. Moreover, the proposed LTL rule is
hardware friendly. It can be easily implemented on-chip to perform fast parameter
calibration and provide robustness against the notorious device non-ideality issues.
It, therefore, opens up a myriad of opportunities for training and deployment of
SNN on ultra-low-power mixed-signal neuromorphic computing chips.
1 Introduction
Over the last decade, artificial neural networks (ANNs) have improved the perceptual and cognitive
capabilities of machines by leaps and bounds, and have become the de-facto standard for many pattern
recognition tasks including computer vision [30, 35, 52, 53, 62], speech processing [39, 60], language
understanding [3], and robotics [51]. Despite their superior performance, ANNs are computationally
expensive to deploy on ubiquitous mobile and edge computing devices due to their high memory and
computation requirements.
Spiking neural networks (SNNs), the third generation of artificial neural networks, have gained growing
research attention due to their greater biological plausibility and potential to realize the ultra-low-power
computation observed in biological neural networks. Leveraging sparse, spike-driven computation and
fine-grained parallelism, fully digital neuromorphic computing (NC) chips such as TrueNorth [2],
Loihi [11], and Tianjic [42], which support the efficient inference of SNNs, have indeed demonstrated
orders-of-magnitude improvements in power efficiency over GPU-based AI solutions. Moreover, the
emerging in situ mixed-signal NC chips [47, 54], enabled by nascent non-volatile technologies, can
further boost hardware efficiency by a large margin over the aforementioned digital chips.
Corresponding Author: jibin.wu@polyu.edu.hk
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.04532v1 [cs.NE] 10 Oct 2022
Despite remarkable progress in neuromorphic hardware development, how to efficiently and effec-
tively train the core computational model, spiking neural network, remains a challenging research
topic. It, therefore, impedes the development of efficient neuromorphic training chips as well as
the wide adoption of neuromorphic solutions in mainstream AI applications. The existing train-
ing algorithms for deep SNNs can be grouped into two categories: ANN-to-SNN conversion and
gradient-based direct training.
ANN-to-SNN conversion methods propose to reuse network weights from more easily trainable
ANNs. This can be viewed as a specific example of Teacher-Student (T-S) learning that transfers the
knowledge from a teacher ANN to a student SNN in the form of network weights. By properly
determining the neuronal firing thresholds and initial membrane potentials for SNNs, recent studies
show that the activation values of ANNs can be well approximated with the firing rates of spiking
neurons, achieving near-lossless network conversion on a number of challenging AI benchmarks
[4, 5, 7, 12, 14, 21, 22, 34, 46, 49, 63]. Nevertheless, these network conversion methods are developed
solely based on the non-leaky integrate-and-fire (IF) neuron model and typically require a large time
window to reach a reliable firing rate approximation. It is, therefore, neither straightforward nor
efficient to deploy these converted SNNs onto existing neuromorphic chips.
In another vein of research, gradient-based direct training methods explicitly model each spiking
neuron as a self-recurrent neural network and leverage the canonical Backpropagation Through Time
(BPTT) algorithm to optimize the network parameters. The non-differentiable spiking activation
function is typically circumvented with continuous surrogate gradient (SG) functions during error
backpropagation [8, 15, 38, 45, 50, 58, 59, 61]. Despite their compatibility with event-based inputs
and different spiking neuron models, these methods are computationally and memory inefficient
in practice. Moreover, the gradient approximation errors introduced by these SG functions tend to
accumulate over layers, causing significant performance degradation for deep network structures and
short time windows [57].
In general, SNN learning algorithms can be categorized into off-chip learning [16, 64] and on-chip
learning [9, 41, 43]. Almost all of the direct SNN training methods discussed above belong to the
off-chip learning category. Due to the lack of effective ways to exploit the high level of sparsity in
spiking activities and the requirement to store non-local information for credit assignment, these
off-chip methods exhibit very low training efficiency. Moreover, due to notorious device non-ideality
problems [6], the actual network dynamics deviate from the off-chip simulated ones, causing the
accuracy of off-chip trained SNNs to degrade significantly when deployed onto analog computing
substrates [1, 24, 37, 44]. To address these problems, recent work proposes on-chip learning
algorithms in the form of local Hebbian learning [11, 28, 41] and approximations of gradient-based
learning [10, 19, 32, 40], although the effectiveness of these algorithms has only been demonstrated
on simple benchmarks, such as the MNIST and N-MNIST datasets.
To address the aforementioned problems in SNN training and hardware deployment, we put forward
a generalized SNN learning rule in this paper, which we refer to as the Local Tandem Learning
(LTL) rule. The LTL rule takes the best of both ANN-to-SNN conversion and gradient-based training
methods. On one hand, it makes good use of the highly effective intermediate feature representations
of ANNs to supervise the training of SNNs. By doing so, we show that it can achieve rapid
network convergence within five training epochs on the CIFAR-10 dataset with low computational
complexity. On the other hand, the LTL rule adopts the gradient-based approach to perform knowledge
transfer, which can support different neuron models and achieve rapid pattern recognition. By
propagating gradient information locally within a layer, it can also alleviate the compounding gradient
approximation errors of the SG method and lead to near-lossless knowledge transfer on the CIFAR-10,
CIFAR-100, and Tiny ImageNet datasets. Moreover, the LTL rule is designed to be hardware friendly:
it can perform efficient on-chip learning using only local information. Under this on-chip setting,
we demonstrate that the LTL rule is capable of addressing the notorious device non-ideality issues
of analog computing substrates, including device mismatch, quantization noise, thermal noise, and
neuron silencing.
Figure 1: Illustration of the proposed LTL rule and its on-chip implementation. (a) The LTL
rule follows the teacher-student learning approach, whereby the SNN tries to mimic the feature
representation of a pre-trained ANN through local loss functions. (b) Computational graph of the
offline LTL rule. (c) Computational graph of the online LTL rule. (d) Functional block diagram of the
proposed on-chip implementation, where the host computer transfers the control signal and training
data (input spike train, layerwise targets) to NC chips. The proposed SNN on-chip learning circuit
consists of two parts: spiking neuron (green) and learning circuits (red).
2 Methods
2.1 Spiking Neuron Model
To demonstrate that the proposed LTL rule is compatible with different spiking neuron models, we base
our study on both the non-leaky integrate-and-fire (IF) [48] and leaky integrate-and-fire (LIF) [18]
neuron models, whose neuronal dynamics can be described by the following discrete-time formulation:

$$U_i^l[t] = \alpha U_i^l[t-1] + I_i^l[t] - \vartheta S_i^l[t-1] \quad (1)$$

with

$$I_i^l[t] = \sum_j w_{ij}^{l-1} S_j^{l-1}[t-1] + b_i^l \quad (2)$$

where $U_i^l[t]$ and $I_i^l[t]$ refer to the subthreshold membrane potential of and input current to neuron $i$ at
layer $l$, respectively. $\alpha \equiv \exp(-dt/\tau_m)$ is the membrane potential decay constant, wherein $\tau_m$ is
the membrane time constant and $dt$ is the simulation time step. For the IF neuron, $\alpha$ takes a value of 1.
$\vartheta$ denotes the neuronal firing threshold. $w_{ij}^{l-1}$ represents the connection weight from neuron $j$ of the
preceding layer $l-1$, and $b_i^l$ denotes the constant injecting current to neuron $i$. $S_i^l[t-1]$ indicates the
occurrence of an output spike from neuron $i$ at time step $t-1$, which is determined according to the
spiking activation function as per

$$S_i^l[t] = \Theta\left(U_i^l[t] - \vartheta\right) \quad \text{with} \quad \Theta(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ 0, & \text{otherwise} \end{cases} \quad (3)$$
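As a concrete illustration, the discrete-time dynamics of Eqs. (1)-(3) can be simulated in a few lines of NumPy. This is a minimal sketch; the function name and layer interface are ours, not part of the paper:

```python
import numpy as np

def simulate_layer(S_prev, W, b, theta=1.0, alpha=0.9, T=10):
    """Simulate one layer of LIF neurons per Eqs. (1)-(3); alpha=1.0 gives IF neurons.

    S_prev: (T, n_in) binary input spike trains from layer l-1.
    W: (n_out, n_in) connection weights; b: (n_out,) constant injecting current.
    Returns (T, n_out) binary output spike trains.
    """
    n_out = W.shape[0]
    U = np.zeros(n_out)                  # membrane potentials U^l_i
    S = np.zeros(n_out)                  # spikes from the previous step, S^l_i[t-1]
    out = np.zeros((T, n_out))
    for t in range(T):
        I = W @ S_prev[t] + b            # Eq. (2): input current
        U = alpha * U + I - theta * S    # Eq. (1): leak, integrate, soft reset
        S = (U >= theta).astype(float)   # Eq. (3): spike when U reaches threshold
        out[t] = S
    return out
```

For example, an IF neuron (alpha = 1) driven only by a constant current b = 0.5 with threshold 1 fires on every second time step, since the soft reset subtracts the threshold rather than zeroing the potential.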
2.2 Local Tandem Learning
As illustrated in Figure 1(a), the LTL rule follows the T-S learning approach [20], whereby the
intermediate feature representation of a pre-trained ANN is transferred to the SNN through layer-wise
local loss functions. In contrast to the ANN-to-SNN conversion methods, we establish the feature
representation equivalence at the neuron level rather than at the synapse level, which provides the
flexibility for choosing any neuron model to be used in the SNN. On the other hand, with the proposed
spatially local loss function, we simplify the spatial-temporal credit assignment required in the end-
to-end direct training methods, which can dramatically improve the network convergence speed and
meanwhile reduce the computational complexity. In the following, we introduce two versions of the
LTL rule, offline and online, depending on whether the temporal locality constraint is imposed.
Offline Learning
Following the T-S learning approach, we consider the intermediate feature repre-
sentation of a pre-trained ANN as the knowledge and train an SNN to reproduce an equivalent feature
representation via layer-wise loss functions. In particular, we establish an equivalence between the
normalized activation values of an ANN and the global average firing rates of an SNN. To reduce
the discrepancy between these two quantities, we adopt the mean square error (MSE) loss function
and apply it separately for each layer. Thus, for any layer $l$, the local loss function $\mathcal{L}^l$ is defined as
follows:

$$\mathcal{L}^l\left(\hat{y}^l, y^l[T_w]\right) = \left\lVert \frac{\hat{y}^l}{y_{norm}} - \frac{C^l[T_w]}{T_w} \right\rVert_2^2 \quad (4)$$

where $\hat{y}^l$ is the output of ANN layer $l$, and $y_{norm}$ is a normalization constant that takes the value of the
99th or 99.9th percentile across all $\hat{y}_i^l$; this alleviates the effect of outliers compared to using the
maximum activation value [48]. $T_w$ is the time window size, and $C^l[T_w] = \sum_{t=1}^{T_w} S^l[t]$ is the total spike count.
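In code, the layer-wise loss of Eq. (4) is just an MSE between percentile-normalized ANN activations and SNN firing rates. A minimal sketch, assuming the ANN activations and the SNN spike trains are already available as arrays (the function name is ours):

```python
import numpy as np

def local_loss(y_ann, spikes, Tw, percentile=99.0):
    """Layer-wise MSE loss of Eq. (4) between normalized ANN activations
    and SNN global average firing rates."""
    y_norm = np.percentile(y_ann, percentile)  # normalization constant y_norm
    rate = spikes.sum(axis=0) / Tw             # C^l[Tw] / Tw, global firing rate
    err = y_ann / y_norm - rate
    return np.sum(err ** 2)                    # squared L2 norm over neurons
```

Note that, unlike the maximum, the 99th percentile ignores the top 1% of activation outliers, which keeps the normalized targets in a range the firing rate (bounded in [0, 1]) can actually reach.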
As shown in the computational graph in Figure 1(b), we adopt the BPTT algorithm to resolve the
temporal credit assignment problem, and the weight gradients can be derived as

$$\frac{\partial \mathcal{L}^l}{\partial w_{ij}^l} = \sum_{t=1}^{T_w} \frac{\partial \mathcal{L}^l}{\partial U_i^l[t]} \frac{\partial U_i^l[t]}{\partial w_{ij}^l} = \sum_{t=1}^{T_w} \frac{\partial \mathcal{L}^l}{\partial U_i^l[t]} \frac{\partial U_i^l[t]}{\partial I_i^l[t]} \frac{\partial I_i^l[t]}{\partial w_{ij}^l} = \sum_{t=1}^{T_w} \frac{\partial \mathcal{L}^l}{\partial U_i^l[t]} S_j^{l-1}[t-1] \quad (5)$$

with

$$\frac{\partial \mathcal{L}^l}{\partial U_i^l[t]} = \begin{cases} \alpha\,\delta_i^l[t+1]\,\dfrac{\partial S_i^l[t+1]}{\partial U_i^l[t+1]} + \delta_i^l[t]\,\dfrac{\partial S_i^l[t]}{\partial U_i^l[t]}, & \text{if } t < T_w \\[6pt] \delta_i^l[T_w]\,\dfrac{\partial S_i^l[T_w]}{\partial U_i^l[T_w]}, & \text{if } t = T_w \end{cases} \quad (6)$$

where

$$\delta_i^l[t] = \frac{\partial \mathcal{L}^l}{\partial S_i^l[t]} = \begin{cases} -\vartheta\,\delta_i^l[t+1]\,\dfrac{\partial S_i^l[t+1]}{\partial U_i^l[t+1]} + \delta_i^l[T_w], & \text{if } t < T_w \\[6pt] -\dfrac{2}{T_w}\left(\dfrac{\hat{y}_i^l}{y_{norm}} - \dfrac{1}{T_w}\sum_{t=1}^{T_w} S_i^l[t]\right), & \text{if } t = T_w \end{cases} \quad (7)$$
To resolve the problem of the non-differentiable spiking activation function, we apply the surrogate
gradient method, i.e., $\Theta'(x) \approx \theta'(x)$. Specifically, we adopt the boxcar function for $\theta'(x)$, which
supports convenient and efficient on-chip implementation:

$$\frac{\partial S_i^l[t]}{\partial U_i^l[t]} = \theta'\!\left(U_i^l[t] - \vartheta\right) = \frac{1}{p}\,\mathrm{sign}\!\left(\left|U_i^l[t] - \vartheta\right| < \frac{p}{2}\right) \quad (8)$$

where $p$ controls the permissible range of membrane potentials that allow gradients to pass through,
and we tune this hyperparameter separately for each dataset. By substituting Eq. (8) into Eqs. (6) and
(7), we obtain the ultimate form of the weight gradients, and we can update the weights according to
the stochastic gradient descent method or its adaptive variants. See Supplementary Materials Section
A.1 for a more detailed derivation of the gradients with respect to the weight and bias terms.
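The boxcar surrogate of Eq. (8) is simply a scaled indicator function around the firing threshold, which is why it maps onto a single comparator in hardware. An illustrative sketch (the helper name is ours):

```python
import numpy as np

def boxcar_surrogate(U, theta=1.0, p=1.0):
    """Boxcar surrogate derivative of Eq. (8): pass a gradient of 1/p
    only when the membrane potential is within p/2 of the threshold."""
    return (np.abs(U - theta) < p / 2).astype(float) / p
```

Membrane potentials far from the threshold therefore contribute no gradient at all, which is also what gates the parameter updates in the on-chip circuit of Figure 1(d).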
Online Learning
The offline LTL rule requires storing intermediate synaptic and membrane state
variables to be used during error backpropagation, which is prohibitive for on-chip learning
where memory resources are limited. To address this problem, we introduce an online LTL rule,
whose loss function is designed to be both spatially and temporally local. To achieve temporal
locality, we use the moving-average firing rate, which can be calculated at each time step, to replace
the global firing rate used in Eq. (4). This yields the following local loss function:

$$\mathcal{L}^l[t] = \left\lVert \frac{\hat{y}^l}{y_{norm}} - \frac{C^l[t]}{t} \right\rVert_2^2 \quad (9)$$
Compared to the offline version, the gradient update is much simpler now:

$$\frac{\partial \mathcal{L}^l[t]}{\partial w_{ij}^l} = \frac{\partial \mathcal{L}^l[t]}{\partial S_i^l[t]} \frac{\partial S_i^l[t]}{\partial U_i^l[t]} \frac{\partial U_i^l[t]}{\partial w_{ij}^l} = \zeta_i^l[t]\,\frac{\partial S_i^l[t]}{\partial U_i^l[t]}\,S_j^{l-1}[t-1] \quad (10)$$

where $\zeta_i^l[t]$ can be directly computed from Eq. (9):

$$\zeta_i^l[t] = \frac{\partial \mathcal{L}^l[t]}{\partial S_i^l[t]} = -\frac{2}{t}\left(\frac{\hat{y}_i^l}{y_{norm}} - \frac{1}{t}\sum_{k=1}^{t} S_i^l[k]\right) \quad (11)$$
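Because Eqs. (10)-(11) involve only quantities available at the current time step, one online update fits in a few array operations. A sketch under our own assumptions: the plain SGD step, the learning rate, and all variable names (e.g. `spike_count` for $C^l[t]$, `S_prev` for the presynaptic spikes) are ours:

```python
import numpy as np

def online_ltl_step(w, b, y_ann, y_norm, U, S_prev, spike_count, t,
                    theta=1.0, p=1.0, lr=0.1):
    """One online LTL update per Eqs. (10)-(11); local in space and time.
    Illustrative sketch: the plain-SGD step and names are our assumptions."""
    rate = spike_count / t                       # moving-average firing rate C^l[t]/t
    zeta = -(2.0 / t) * (y_ann / y_norm - rate)  # Eq. (11): local error term
    gate = (np.abs(U - theta) < p / 2) / p       # boxcar surrogate gate, Eq. (8)
    grad_w = np.outer(zeta * gate, S_prev)       # Eq. (10): outer product with
                                                 # presynaptic spikes S^{l-1}_j[t-1]
    w = w - lr * grad_w                          # plain SGD step (our choice)
    b = b - lr * zeta * gate
    return w, b
```

Note that a weight only changes when its presynaptic neuron spiked and its postsynaptic potential sits inside the boxcar window, which is the sparsity the on-chip circuit exploits.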
The computational graph of the online LTL rule is provided in Figure 1(c). It is worth noting that the
firing rate calculation is relatively noisy during the first few time steps. Nevertheless, this issue can be
easily addressed by treating the first few steps as a warm-up period, during which parameter updates
are not allowed (see Supplementary Materials Section C for a study on the effect of the warm-up
period). By doing so, it also reduces the overall training cost. As will be discussed in Sections 3.1
and 3.4, this online version can significantly reduce the computational complexity while achieving a
test accuracy comparable to that of the offline version.
On-chip Implementation
To allow a convenient and efficient on-chip implementation of the proposed online LTL rule, we
carefully designed the learning circuits illustrated in Figure 1(d). The output spike count $C^l$ is
updated at the spike accumulator, and it is compared to the local target $\hat{y}^l$ following the layer-wise
loss function defined in Eq. (9). This error term is then fed back to the neuron to update the synaptic
parameters. Note that the synaptic updates are gated by the $\mathrm{sign}(\cdot)$ and $\mathrm{boxcar}(\cdot)$ functions, which
can significantly reduce the overall number of parameter updates.
We would like to highlight that the proposed LTL rule is more hardware-friendly than the recently
introduced hardware-in-the-loop (HIL) training approach [10, 19]. HIL training approaches require
two-way information communication, that is, (1) reading intermediate neuronal states from the NC
chip to the host computer to perform off-chip training and (2) writing the updated weights from the
host computer back to the NC chip. Given the sequential nature of these two processes and the high
implementation cost of reading neuronal states (e.g., requiring costly analog-to-digital converters for
analog spiking neurons), HIL training approaches are expensive to deploy in practice.
In contrast, the LTL rule can be implemented efficiently on-chip by extracting the layerwise targets
from the ANN, running on the host computer, for data batch $i+1$ while simultaneously performing
on-chip SNN training for data batch $i$. This is similar to conventional ANN training, where the data
preprocessing of the next data batch is performed on the CPU while the current data batch is used for
ANN training on the GPU. The only difference is that, for the proposed LTL rule, the input data is
preprocessed by the pre-trained ANN to extract the targets for the intermediate layers. Given that the
inference of the ANN can be performed in parallel on the host computer, the overall training time is
bottlenecked by the NC chip, which operates in a sequential mode where only one sample is processed
at a time. Therefore, our method has much lower hardware and time complexity.
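The batch-level pipelining described above can be sketched as follows. Here `ann_targets_fn` and `chip_train_fn` are hypothetical stand-ins for the host-side ANN inference and the on-chip training call; the sketch runs them sequentially, whereas on real hardware the target extraction for batch $i+1$ would overlap with the chip training on batch $i$:

```python
def pipelined_training(batches, ann_targets_fn, chip_train_fn):
    """Prefetch layerwise targets for the next batch while the chip
    trains on the current one (hypothetical interface, names are ours)."""
    targets = ann_targets_fn(batches[0])  # prefetch targets for the first batch
    for i, batch in enumerate(batches):
        # Host: extract targets for batch i+1 (would run concurrently on hardware).
        next_targets = ann_targets_fn(batches[i + 1]) if i + 1 < len(batches) else None
        # Chip: train the SNN on batch i against the prefetched targets.
        chip_train_fn(batch, targets)
        targets = next_targets            # hand over the prefetched targets
```

This is the same double-buffering pattern as CPU-side data preprocessing feeding GPU training, with the pre-trained ANN playing the role of the preprocessor.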
3 Experiments
In this section, we evaluate the effectiveness of the proposed LTL rule on the image classification task
with CIFAR-10 [
29
], CIFAR-100 [
29
], and Tiny ImageNet [
55
] datasets. We perform a comprehensive
study to demonstrate its superiority in: 1. accurate, rapid, and efficient pattern recognition; 2. rapid
network convergence with low computational complexity; 3. provide robustness against hardware-
related noises. More details about the experimental datasets and implementation details are provided
in the Supplementary Materials Section B, and the source code can be found at2.
3.1 Accurate and Scalable Image Classification
Here, we report the classification results of LTL trained SNNs on CIFAR-10, CIFAR-100 and Tiny
ImageNet datasets against other SNN learning rules, including ANN-to-SNN conversion [
4
,
5
,
12
,
17
,
21
,
22
,
31
,
34
,
46
,
49
,
57
] and direct SNN training [
13
,
45
] methods. Given the network architectures
and data preparation processes vary slightly across different work, therefore, we focus our discussions
on the conversion or transfer errors between the ANNs and SNNs whenever the data is available.
2https://github.com/Aries231/Local_tandem_learning_rule