
Towards Accurate, Energy-Efficient, & Low-Latency Spiking LSTMs
Gourav Datta1, Haoqin Deng2*, Robert Aviles1, Peter A. Beerel1
1University of Southern California, USA 2University of Washington, Seattle, USA
*Work done at University of Southern California
Abstract
Spiking Neural Networks (SNNs) have emerged as an attrac-
tive spatio-temporal computing paradigm for complex vision
tasks. However, most existing works yield models that re-
quire many time steps and do not leverage the inherent tempo-
ral dynamics of spiking neural networks, even for sequential
tasks. Motivated by this observation, we propose an optimized
training framework for spiking long short-term memory (LSTM)
networks that involves a novel ANN-to-SNN conversion
step, followed by SNN training. In particular, we pro-
pose novel activation functions in the source LSTM architec-
ture and judiciously select a subset of them for conversion
to integrate-and-fire (IF) activations with optimal bias shifts.
Additionally, we derive the leaky-integrate-and-fire (LIF) ac-
tivation functions converted from their non-spiking LSTM
counterparts, which justifies the need to jointly optimize the
weights, threshold, and leak parameter. We also propose a
pipelined parallel processing scheme which hides the SNN
time steps, significantly improving system latency, especially
for long sequences. The resulting SNNs have high activation
sparsity and require only accumulate operations (AC), in con-
trast to expensive multiply-and-accumulates (MAC) needed
for ANNs, except for the input layer when using direct encod-
ing, yielding significant improvements in energy efficiency.
We evaluate our framework on sequential learning tasks in-
cluding temporal MNIST, Google Speech Commands (GSC),
and UCI Smartphone datasets on different LSTM architec-
tures. We obtain test accuracy of
94.75
% with only
2
time
steps with direct encoding on the GSC dataset with
∼4.1×
lower energy than an iso-architecture standard LSTM.
Introduction & Related Work
In contrast to the neurons in ANNs, the neurons in Spiking
Neural Networks (SNNs) are biologically inspired, receiv-
ing and transmitting information via spikes. SNNs promise
higher energy-efficiency than ANNs due to their high ac-
tivation sparsity and event-driven spike-based computation
(Diehl et al. 2016b) which helps avoid the costly multipli-
cation operations that dominate ANNs. However, to handle multi-bit
inputs, as are typical in traditional datasets and real-life
sensor-based applications, the inputs are often spike-
encoded in the temporal domain using rate coding (Diehl et al.
2016b), temporal coding (Comsa et al. 2020), or rank-order
coding (Kheradpisheh et al. 2020). Alternatively, instead of
spike encoding the inputs, some researchers have explored directly
feeding the analog pixel values into the first convolutional layer,
thereby emitting spikes only in the subsequent layers
(Rathi et al. 2020b). This can dramatically reduce the number
of time steps needed to achieve state-of-the-art accuracy,
but at the cost that the first layer now requires MACs
(Rathi et al. 2020b; Datta et al. 2022; Kundu et al. 2021).
However, all these encoding techniques increase the end-
to-end latency (proportional to the number of time steps)
compared to their non-spiking counterparts.
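For illustration, the following is a minimal PyTorch sketch (ours, not drawn from the cited works) contrasting rate coding, which converts analog intensities into binary spike trains over T time steps, with direct encoding, where the analog input is presented at every time step and spikes are emitted only from subsequent layers; the function names and shapes are illustrative assumptions.

```python
import torch

def rate_encode(x: torch.Tensor, num_steps: int) -> torch.Tensor:
    """Bernoulli/Poisson-style rate coding: spike probability at each
    time step equals the (normalized) pixel intensity in [0, 1]."""
    return torch.bernoulli(x.unsqueeze(0).expand(num_steps, *x.shape))

def direct_encode(x: torch.Tensor, num_steps: int) -> torch.Tensor:
    """Direct encoding: the analog input is repeated at every time step;
    the first (MAC-based) layer then produces spikes for later layers."""
    return x.unsqueeze(0).expand(num_steps, *x.shape).clone()

x = torch.rand(1, 1, 28, 28)            # e.g., a normalized MNIST frame
spikes = rate_encode(x, num_steps=2)     # binary {0, 1} spike trains
analog = direct_encode(x, num_steps=2)   # multi-bit inputs, one copy per step
```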
In addition to accommodating various forms of spike en-
coding, supervised learning algorithms for SNNs, such as
surrogate gradient learning (SGL), have overcome various
roadblocks associated with the discontinuous derivative of
the spike activation function (Lee et al. 2016; Kim and Panda
2021b; Neftci, Mostafa, and Zenke 2019; Panda et al. 2020).
It is also commonly agreed that SNNs following the integrate-
and-fire (IF) compute model can be converted from ANNs
with low error by approximating the activation value of ReLU
neurons with the firing rate of spiking neurons (Sengupta et al.
2019; Rathi et al. 2020a; Diehl et al. 2016b). SNNs trained
using ANN-to-SNN conversion, coupled with SGL, have
been able to perform similarly to SOTA CNNs in terms of test
accuracy in traditional image recognition tasks (Rathi et al.
2020b,a) with significant advantages in compute efficiency.
Previous works (Rathi et al. 2020b; Datta et al. 2021; Kundu
et al. 2021) have adopted SGL to jointly train the threshold
and leak values to improve the accuracy-latency tradeoff but
without any analytical justification.
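As a concrete reference point, the sketch below (an assumption-laden illustration, not the exact formulation of the works above) shows an LIF neuron with a learnable threshold and leak, and a surrogate gradient that replaces the discontinuous derivative of the spike function during backpropagation.

```python
import torch
import torch.nn as nn

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass; triangular surrogate gradient
    in the backward pass, nonzero only near the firing threshold."""
    @staticmethod
    def forward(ctx, membrane_minus_threshold):
        ctx.save_for_backward(membrane_minus_threshold)
        return (membrane_minus_threshold > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (u,) = ctx.saved_tensors
        surrogate = torch.clamp(1.0 - u.abs(), min=0.0)
        return grad_output * surrogate

class LIFNeuron(nn.Module):
    """LIF neuron whose threshold and leak are trained jointly with the weights."""
    def __init__(self, init_threshold: float = 1.0, init_leak: float = 0.9):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(init_threshold))
        self.leak = nn.Parameter(torch.tensor(init_leak))

    def forward(self, input_current: torch.Tensor, membrane: torch.Tensor):
        # Leaky integration of the input current, spike generation, soft reset.
        membrane = self.leak * membrane + input_current
        spike = SpikeFn.apply(membrane - self.threshold)
        membrane = membrane - spike * self.threshold
        return spike, membrane
```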
Despite numerous innovations in SNN training algo-
rithms for static (Panda and Roy 2016; Panda et al. 2020;
Rathi et al. 2020b,a; Kim and Panda 2021b) and dynamic
vision tasks (Kim and Panda 2021a; Li et al. 2022), there has
been relatively little research targeting SNNs for sequence
learning tasks. Among the existing works, some are limited
to the use of spiking inputs (Rezaabad and Vishwanath 2020;
Ponghiran and Roy 2021b), which might not represent many
real-world use cases. Furthermore, some (Deng and Gu 2021;
Moritz, Hori, and Roux 2019; Diehl et al. 2016a) propose
to derive SNNs from vanilla RNNs, which has been shown to
incur a large accuracy drop on large-scale sequence learning
tasks, as vanilla RNNs are unable to model temporal dependencies
for long sequences. Others (Ponghiran and Roy 2021a) use
the same input expansion approach for spike encoding and
yield SNNs that require serial processing for each input
in the sequence, severely increasing total latency. A more
recent work (Ponghiran and Roy 2021b) proposed a more
complex neuron model compared to the popular IF or leaky-
integrate-and-fire (LIF) model, to improve the recurrence
dynamics for sequential learning. Additionally, it lets the
hidden activation maps be multi-bit (as opposed to binary
spikes), which improves training but requires multiplications
that reduce energy efficiency compared to the multiplier-less