motivate rank-coding for ANN. As such, many existing works train SNNs without exact gradients, using approaches that range from heuristic rules like Hebbian learning [26, 44] and STDP [31, 33] to SNN-ANN conversion [43, 13, 22] and surrogate gradient approximations [37].
In this work, by applying the implicit function theorem (IFT) at the firing times of the neurons in an SNN, we first show that, under fairly general conditions, gradients of the loss w.r.t. the network weights are well-defined. We do this by proving that the conditions of the IFT are always satisfied at firing times. We then provide what we call a forward-propagation (FP) algorithm, which uses the causality structure of the network firing times together with our IFT-based gradient expressions to compute exact gradients of the loss w.r.t. the network weights.
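To make the role of the IFT concrete, the following is a minimal sketch in illustrative notation (u_i for the membrane potential of neuron i, θ for the firing threshold, t_k for a firing time, and L for a loss are placeholders; the precise conditions and resulting expressions are derived in later sections). A firing time is defined implicitly by a threshold-crossing condition, and whenever the potential crosses the threshold with nonzero slope, the IFT yields a well-defined derivative of that firing time with respect to a weight:
\[
u_i(t_k; W) = \theta, \qquad \left.\frac{\partial u_i}{\partial t}\right|_{t = t_k} \neq 0
\quad \Longrightarrow \quad
\frac{\partial t_k}{\partial W_{ji}} = -\left.\frac{\partial u_i/\partial W_{ji}}{\partial u_i/\partial t}\right|_{t = t_k}.
\]
If, for this sketch, the loss depends on the weights only through the firing times, the chain rule then gives
\[
\frac{\partial L}{\partial W_{ji}} = \sum_{k} \frac{\partial L}{\partial t_k}\,\frac{\partial t_k}{\partial W_{ji}}.
\]
In a network, each firing time also depends on earlier firing times, and it is precisely this causal dependency structure that the FP algorithm accumulates forward in time, so the derivatives above are chained in causal order.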
We call this algorithm forward propagation because the intermediate quantities needed for the final gradient are computed forward in time (or forward through the layers for feed-forward networks). We highlight the following features of our method:
• Our method can be applied in networks with arbitrary recurrent connections (up to self-loops) and is agnostic to how the forward pass is implemented. We provide an implementation for computing the firing times in the forward pass, but as long as we can obtain accurate firing times and causality information (for instance, using existing libraries), we can calculate gradients.
• Our method can be seen as an extension of Hebbian learning, as it illustrates that the gradient w.r.t. a weight W_ji connecting neuron j to neuron i is almost an average of the feeding kernel y_ji between these neurons at the firing times. In the context of Hebbian learning (especially from a biological perspective), this is interpreted as the well-known fact that stronger feeding/activation amplifies the association between the neurons [8, 19].
• In our method, the smoothing kernels y_ji arise naturally from applying the IFT at the firing times, resembling the smoothing kernels used in surrogate gradient methods. As a result, (1) our method sheds some light on why surrogate gradient methods may work quite well, and (2) the smoothing kernels y_ji in our method vary according to the firing times between two neurons and can thus be seen as an adaptive version of the fixed smoothing kernels used in surrogate gradient methods.
• Most methods in the literature apply a time-quantized version of the neuron dynamics and convert the continuous-time system into a discrete-time system. While we derive our results in the continuous-time regime, our IFT formulation is also applicable in these discrete-time scenarios. To do so, one needs to treat the weight parameters and all the time-quantized versions of the variables (such as the synaptic and membrane potentials) as separate variables, as sketched below. The number of these state variables, however, grows proportionally to the simulation time and the precision of the time quantization, which is why the continuous-time regime is preferred.
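As an illustration of the last point only (and not of the method proposed in this work), the following hypothetical sketch unrolls a time-quantized leaky integrate-and-fire model in Python; every time step keeps its own copy of the synaptic and membrane state, so the number of state variables grows with the number of steps T, i.e., with the simulation time and the time resolution. All names, constants, and the specific update rule are illustrative assumptions.

import numpy as np

def unrolled_lif(W, in_spikes, dt=1e-3, tau_syn=5e-3, tau_mem=10e-3, theta=1.0):
    """Hypothetical time-quantized LIF unrolling (illustrative only).

    W         : (n_post, n_pre) weight matrix
    in_spikes : (T, n_pre) binary input spike trains
    Every one of the T steps stores its own copy of the synaptic current,
    membrane potential, and output spikes, so the number of state variables
    grows linearly with T (and hence with 1/dt for a fixed simulation time).
    """
    T = in_spikes.shape[0]
    n_post = W.shape[0]
    syn = np.zeros((T, n_post))    # synaptic current, one copy per step
    mem = np.zeros((T, n_post))    # membrane potential, one copy per step
    out = np.zeros((T, n_post))    # emitted spikes, one copy per step
    alpha = np.exp(-dt / tau_syn)  # per-step synaptic decay
    beta = np.exp(-dt / tau_mem)   # per-step membrane decay
    for t in range(1, T):
        syn[t] = alpha * syn[t - 1] + in_spikes[t] @ W.T
        mem[t] = beta * mem[t - 1] * (1.0 - out[t - 1]) + syn[t]  # reset-to-zero after a spike
        out[t] = (mem[t] >= theta).astype(float)
    return syn, mem, out

Treating every entry of syn, mem, and out as a separate variable is what makes a discrete-time IFT formulation possible, at the cost of a variable count proportional to the simulation length.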
1.1 Related Work
A review of learning in deep spiking networks can be found in [48, 40, 42, 49], with [42] also discussing developments in neuromorphic computing in both software (algorithms) and hardware. [37] focuses on surrogate gradient methods, which use smooth activation functions in place of the hard thresholding for compatibility with standard backpropagation and have been used to train SNNs in a variety of settings [16, 3, 23, 51, 47, 45].
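To make the idea of a fixed surrogate concrete, one common choice in that literature is a smooth, fixed pseudo-derivative of the spike nonlinearity as a function of the membrane potential (for example, a fast-sigmoid-shaped kernel). The following minimal sketch shows the general shape only; the function names and the sharpness constant beta are illustrative and not taken from any of the cited works.

import numpy as np

def spike(u, theta=1.0):
    # hard-thresholding spike nonlinearity used in the forward pass
    return (u >= theta).astype(float)

def surrogate_derivative(u, theta=1.0, beta=10.0):
    # fixed, smooth pseudo-derivative substituted for the (almost-everywhere-zero)
    # derivative of the step function during the backward pass; beta controls sharpness
    return 1.0 / (1.0 + beta * np.abs(u - theta)) ** 2

The contrast drawn in this paper is that such a kernel is a fixed function of the membrane potential, whereas the kernels y_ji produced by the IFT depend on the actual firing times.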
A number of works explore backpropagation in SNNs [5, 25, 52]. The SpikeProp [5] framework assumes a linear relationship between the post-synaptic input and the resulting spiking time, an assumption our framework does not rely on. The method in [25] and its RSNN version [52] are limited to a rate-coded loss that depends on spike counts. The continuous “spike time” representation of spikes in our framework is related to temporal coding [36], but in the context of differentiating losses the authors of [36] largely ignore the discontinuities that occur at spike times, stating “the derivative...is discontinuous at such points [but] many feedforward ANNs use activation functions with a discontinuous first derivative”. In contrast with [36], we prove that exact gradients can be calculated despite this discontinuity.