
The biological nervous system offers valuable insights for addressing the challenges of implementing high-performance Spiking Transformers with the ANN-to-SNN conversion method. The quantal synaptic failure theory suggests that, under certain conditions, information lost during neuronal signal transmission may not impair the computation performed by the postsynaptic neuron, while reducing energy consumption and heat production [22]. Likewise, in ANN-to-SNN conversion, missing spikes can potentially be compensated for by the correlations between signals in the space and time domains as information propagates over multiple time steps. Furthermore, neural network models contain many redundant connections: prior works show that the redundancy in the self-attention module of Transformers can be pruned without significantly degrading performance [33, 48]. Therefore, eliminating redundant information during the transmission of neuronal signals can potentially reduce the overall energy consumption of the Spiking Transformer model while preserving high performance.
In our work, we propose the Masked Spiking Transformer (MST), which incorporates a Random Spike Masking (RSM) method designed specifically for SNNs. The RSM method randomly selects only a subset of the input spikes, significantly reducing the number of spikes involved in the computation. We evaluate the MST model on both static and neuromorphic datasets, demonstrating its superiority over existing SNN models. Our experiments show that the RSM method reduces energy consumption in both the self-attention and MLP modules of the Transformer, allowing SNNs to combine energy efficiency with high performance. Furthermore, the proposed RSM method is not limited to Transformers but can be extended to other backbones such as ResNet and VGG, highlighting its potential as a general technique for improving SNN efficiency. Our results demonstrate the potential of this approach to open a new direction for developing high-performance and energy-efficient SNN models.
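To make the RSM operation concrete, the following is a minimal PyTorch-style sketch of random spike masking as described above: a Bernoulli mask with a given keep ratio lets only a random subset of incoming spikes enter the subsequent computation. The function name random_spike_masking, the tensor layout, and the absence of any rescaling of the surviving spikes are illustrative assumptions rather than the exact MST implementation.

```python
import torch

def random_spike_masking(spikes: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """Randomly keep a subset of input spikes (illustrative sketch, not the exact MST code).

    spikes:     binary spike tensor, e.g. of shape [T, B, N, D] over T time steps.
    keep_ratio: fraction of spikes allowed to pass into the subsequent module.
    """
    if keep_ratio >= 1.0:
        return spikes
    # Bernoulli mask: 1 keeps a spike, 0 drops it.
    mask = torch.bernoulli(torch.full(spikes.shape, keep_ratio, device=spikes.device))
    # Only the surviving spikes enter the self-attention / MLP computation,
    # which directly reduces the number of synaptic operations (and thus energy).
    return spikes * mask.to(spikes.dtype)
```

For example, applying random_spike_masking(x, keep_ratio=0.75) to the spike input of the self-attention or MLP module would drop roughly a quarter of the spikes at inference time; whether and how the surviving spikes are rescaled is a design choice we do not specify here.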
The main contributions of this paper can be summarized
as follows:
• We propose a Masked Spiking Transformer (MST) built with the ANN-to-SNN conversion method. To the best of our knowledge, this is the first work to realize the self-attention mechanism fully in SNNs via ANN-to-SNN conversion.
• The MST model is evaluated on both static and neuromorphic datasets, and the results show that it outperforms SOTA SNNs on all of them. Specifically, the top-1 accuracy of the MST model is 1.21%, 7.3%, and 3.7% higher than that of the current SOTA SNN models on the CIFAR-10, CIFAR-100, and ImageNet datasets, respectively.
• We design a Random Spike Masking (RSM) method for SNNs trained with the ANN-to-SNN conversion method to prune redundant spikes during inference and reduce energy consumption.
• Extensive experiments show that the proposed RSM is a versatile and general method that can also be applied to other spike-based deep networks, such as ResNet and VGG SNN variants.

Figure 2. Overview of our Masked Spiking Transformer (MST). (a) Schematic of the Swin Transformer architecture, which serves as the backbone of our model. (b) Schematic of the proposed Transformer blocks, in which BN layers replace the original LN layers. (c) Conceptual illustration of the Random Spike Masking (RSM) method, which randomly masks the input spikes. (d-e) The RSM method applied to the self-attention and MLP modules.
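As a rough illustration of the block structure in Figure 2(b), the sketch below shows a Transformer block in which BatchNorm replaces the usual LayerNorm; tokens are folded into the batch dimension so that BatchNorm1d normalizes over channels. The class name BNTransformerBlock, the pre-normalization placement, and the use of a plain ReLU in the MLP (standing in for a conversion-friendly activation whose exact form we do not reproduce here) are assumptions for illustration, not the precise MST architecture.

```python
import torch
import torch.nn as nn

class BNTransformerBlock(nn.Module):
    """Transformer block with BN in place of LN (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.bn1 = nn.BatchNorm1d(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.bn2 = nn.BatchNorm1d(dim)
        hidden = int(dim * mlp_ratio)
        # ReLU stands in for a conversion-friendly activation; the exact choice is not shown here.
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def _bn(self, bn: nn.BatchNorm1d, x: torch.Tensor) -> torch.Tensor:
        # x: [B, N, D]; BatchNorm1d expects [B, D, N], so transpose around the normalization.
        return bn(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self._bn(self.bn1, x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self._bn(self.bn2, x))
        return x
```

In the full model, window-based attention and the spike masking of Figure 2(d-e) would wrap this basic structure; the plain nn.MultiheadAttention above is only a placeholder for that mechanism.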
2. Related Work
Spiking Neural Networks. SNNs have gained popularity in the field of brain-inspired intelligence due to their biological plausibility and compatibility with neuromorphic hardware. With the increasing interest in larger-scale and higher-performance SNNs, recent research has focused on developing novel training algorithms and architectures. Zheng et al. proposed a threshold-dependent batch normalization (tdBN) method based on spatiotemporal backpropagation to train a large-scale SNN model with 50 layers [55]. In addition, Fang et al. proposed the SEW ResNet architecture for residual learning in deep SNNs to overcome the vanishing gradient problem [9], and later introduced a training algorithm that learns the threshold of each spiking neuron to improve SNN performance [10]. However, these methods mainly address SNN models dominated by convolutional layers, such as VGG [41] and ResNet [14] SNN variants. Despite their improvements, the performance of these methods still struggles to match that of their ANN counterparts, limiting the application of SNNs. In this context, our proposed work focuses on implementing