Despite remarkable progress in neuromorphic hardware development, how to efficiently and effectively train the core computational model, the spiking neural network (SNN), remains a challenging research topic. This, in turn, impedes the development of efficient neuromorphic training chips as well as the wide adoption of neuromorphic solutions in mainstream AI applications. Existing training algorithms for deep SNNs fall into two categories: ANN-to-SNN conversion and gradient-based direct training.
ANN-to-SNN conversion methods reuse network weights from more easily trainable ANNs. This can be viewed as a specific instance of Teacher-Student (T-S) learning that transfers knowledge from a teacher ANN to a student SNN in the form of network weights. By properly determining the neuronal firing thresholds and initial membrane potentials of SNNs, recent studies show that the activation values of ANNs can be well approximated by the firing rates of spiking neurons, achieving near-lossless network conversion on a number of challenging AI benchmarks [4, 5, 7, 12, 14, 21, 22, 34, 46, 49, 63]. Nevertheless, these network conversion methods are developed solely for the non-leaky integrate-and-fire (IF) neuron model and typically require a large time window to reach a reliable firing rate approximation. It is, therefore, neither straightforward nor efficient to deploy the converted SNNs onto existing neuromorphic chips.
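To make the firing-rate approximation concrete, the following minimal sketch simulates a single non-leaky IF neuron with subtraction reset driven by a constant input current; the time window, threshold, and input values are illustrative choices, not parameters from any specific conversion method. Over a long enough window, the firing rate converges to the clipped ReLU activation of the input:

```python
def if_neuron_rate(x, T=1000, threshold=1.0):
    """Simulate a non-leaky integrate-and-fire (IF) neuron driven by a
    constant input current x for T time steps; return its firing rate."""
    v = 0.0          # membrane potential
    spikes = 0
    for _ in range(T):
        v += x                 # integrate the input current
        if v >= threshold:     # fire, then reset by subtraction
            spikes += 1
            v -= threshold
    return spikes / T

# The firing rate approximates the clipped ReLU max(0, min(x, 1)).
for x in (-0.3, 0.25, 0.5, 0.9):
    print(x, if_neuron_rate(x), max(0.0, min(x, 1.0)))
```

The approximation error shrinks as 1/T, which is why conversion methods need large time windows: halving the rate-quantization error doubles the inference latency.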
In another vein of research, gradient-based direct training methods explicitly model each spiking neuron as a self-recurrent neural network and leverage the canonical Backpropagation Through Time (BPTT) algorithm to optimize the network parameters. The non-differentiable spiking activation function is typically circumvented with continuous surrogate gradient (SG) functions during error backpropagation [8, 15, 38, 45, 50, 58, 59, 61]. Despite their compatibility with event-based inputs and different spiking neuron models, these methods are computationally and memory inefficient in practice. Moreover, the gradient approximation error introduced by the SG functions tends to accumulate over layers, causing significant performance degradation for deep network structures and short time windows [57].
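The surrogate-gradient trick can be sketched as follows: the forward pass keeps the exact Heaviside spike function, while the backward pass swaps its derivative (a Dirac delta at the threshold) for a smooth stand-in. The triangular surrogate and its width below are one common but illustrative choice; published SG methods differ in the exact shape:

```python
import numpy as np

def spike(v, threshold=1.0):
    """Forward pass: exact Heaviside step, non-differentiable at threshold."""
    return (v >= threshold).astype(float)

def surrogate_grad(v, threshold=1.0, width=0.5):
    """Backward pass: a triangular surrogate derivative centred on the
    threshold. It integrates to 1, mimicking the Dirac delta it replaces."""
    return np.maximum(0.0, 1.0 - np.abs(v - threshold) / width) / width

v = np.array([0.2, 0.9, 1.0, 1.4])
print(spike(v))           # exact spikes: [0. 0. 1. 1.]
print(surrogate_grad(v))  # nonzero only near the threshold
```

The mismatch between the true (zero-almost-everywhere) derivative and the surrogate is precisely the per-layer approximation error that compounds under BPTT in deep networks.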
In general, SNN learning algorithms can be categorized into off-chip learning [16, 64] and on-chip learning [9, 41, 43]. Almost all of the direct SNN training methods discussed above belong to the off-chip learning category. Due to the lack of effective ways to exploit the high level of sparsity in spiking activities and the requirement to store non-local information for credit assignment, these off-chip methods exhibit very low training efficiency. Moreover, due to notorious device non-ideality problems [6], the actual network dynamics deviate from the off-chip simulated ones, causing the accuracy of off-chip trained SNNs to degrade significantly when deployed onto analog computing substrates [1, 24, 37, 44]. To address these problems, recent work proposes on-chip learning algorithms in the form of local Hebbian learning [11, 28, 41] and approximations of gradient-based learning [10, 19, 32, 40], but the effectiveness of these algorithms has only been demonstrated on simple benchmarks, such as the MNIST and N-MNIST datasets.
To address the aforementioned problems in SNN training and hardware deployment, we put forward a generalized SNN learning rule in this paper, which we refer to as the Local Tandem Learning (LTL) rule. The LTL rule combines the best of ANN-to-SNN conversion and gradient-based training methods. On the one hand, it makes good use of the highly effective intermediate feature representations of ANNs to supervise the training of SNNs. By doing so, we show that it can achieve rapid network convergence within five training epochs on the CIFAR-10 dataset with low computational complexity. On the other hand, the LTL rule adopts a gradient-based approach to perform knowledge transfer, which can support different neuron models and achieve rapid pattern recognition. By propagating gradient information locally within a layer, it also alleviates the compounding gradient approximation errors of the SG method and leads to near-lossless knowledge transfer on the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets. Moreover, the LTL rule is designed to be hardware friendly: it can perform efficient on-chip learning using only local information. Under this on-chip setting, we demonstrate that the LTL rule is capable of addressing the notorious device non-ideality issues of analog computing substrates, including device mismatch, quantization noise, thermal noise, and neuron silencing.
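The layer-local knowledge-transfer idea can be illustrated with a toy single-layer sketch: a frozen ANN layer acts as the teacher, and a perturbed SNN layer is trained to match the teacher's ReLU activations with its IF firing rates, using a layer-wise MSE loss whose gradient never crosses layer boundaries. The loss, the unit surrogate derivative, and all dimensions and learning rates here are simplified illustrations of the general principle, not the paper's exact LTL update:

```python
import numpy as np

rng = np.random.default_rng(0)

def snn_rate(x, W, T=100):
    """Firing rates of a layer of IF neurons (subtraction reset) over T steps."""
    I = x @ W.T                     # constant input current per neuron
    v = np.zeros_like(I)
    s = np.zeros_like(I)
    for _ in range(T):
        v += I
        fired = (v >= 1.0).astype(float)
        s += fired
        v -= fired                  # subtraction reset
    return s / T

W_ann = 0.5 * rng.normal(size=(4, 8))           # frozen teacher weights
W_snn = W_ann + 0.3 * rng.normal(size=(4, 8))   # perturbed student weights

def local_loss(W, x):
    target = np.maximum(0.0, x @ W_ann.T)       # teacher ReLU activations
    return np.mean((snn_rate(x, W) - target) ** 2)

x_eval = rng.uniform(0.0, 0.3, size=(32, 8))
before = local_loss(W_snn, x_eval)

lr = 0.1
for _ in range(300):
    x = rng.uniform(0.0, 0.3, size=(16, 8))
    err = snn_rate(x, W_snn) - np.maximum(0.0, x @ W_ann.T)
    # Approximate d(rate)/d(pre-activation) by 1 (a crude surrogate), so the
    # layer-local MSE gradient is err^T x — no credit assignment across layers.
    W_snn -= lr * err.T @ x / len(x)

after = local_loss(W_snn, x_eval)
print(before, after)   # the layer-local loss should shrink markedly
```

Because the supervisory signal and the weight update both live inside one layer, the SG approximation error cannot compound across depth, and only local quantities need to be stored, which is what makes the scheme amenable to on-chip learning.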