Masked Spiking Transformer
Ziqing Wang1,2*, Yuetong Fang1*, Jiahang Cao1, Qiang Zhang1, Zhongrui Wang3†, Renjing Xu1†
1The Hong Kong University of Science and Technology (Guangzhou)
2North Carolina State University
3The University of Hong Kong
Abstract
The combination of Spiking Neural Networks (SNNs) and Transformers has attracted significant attention due to their potential for both high energy efficiency and high performance. However, existing works on this topic typically rely on direct training, which can lead to suboptimal performance. To address this issue, we propose to leverage the benefits of the ANN-to-SNN conversion method to combine SNNs and Transformers, resulting in significantly improved performance over existing state-of-the-art SNN models. Furthermore, inspired by the quantal synaptic failures observed in the nervous system, which reduce the number of spikes transmitted across synapses, we introduce a novel Masked Spiking Transformer (MST) framework that incorporates a Random Spike Masking (RSM) method to prune redundant spikes and reduce energy consumption without sacrificing performance. Our experimental results demonstrate that the proposed MST model achieves a significant 26.8% reduction in power consumption at a masking ratio of 75%, while maintaining the same level of performance as the unmasked model.
1. Introduction
Spiking neural networks (SNNs), widely regarded as the next generation of neural networks [29], are brain-inspired networks built on the dynamic characteristics of biological neurons [30, 16]. SNNs have attracted significant attention due to their unique ability to handle sparse data, which can yield substantial energy-efficiency benefits on neuromorphic hardware. Owing to these properties, they have been widely applied in various fields, such as classification [31, 17], object detection [3], and tracking [50]. Nevertheless, SNNs currently can hardly match the performance of artificial neural networks (ANNs), especially on complex tasks such as ImageNet [39].
*Equal contribution.
†Corresponding authors: renjingxu@ust.hk, zrwang@eee.hku.hk.
Figure 1. Performance of the Masked Spiking Transformer (MST) and other state-of-the-art (SOTA) SNN models in terms of top-1 accuracy and time steps. Circle and star markers denote the direct training (DT) and ANN-to-SNN conversion methods, respectively, and the marker size corresponds to the model size. The results show that the proposed MST model achieves higher accuracy than the other SNN models.
To improve the performance of SNNs, various training methods have been proposed, broadly categorized into direct training and ANN-to-SNN conversion. Direct training methods leverage a continuous relaxation of the non-smooth spiking mechanism, enabling backpropagation with a surrogate gradient function to handle the non-differentiability [35], but this can lead to unstable gradient propagation and relatively low accuracy compared to leading ANNs [37]. Alternatively, ANN-to-SNN conversion methods convert pre-trained ANNs into SNNs for better performance, but they require more time steps, and thus higher power consumption, to reduce conversion errors [45, 24, 2, 7]. Our focus is on implementing the ANN-to-SNN conversion method to narrow the performance gap between leading ANNs and SNNs, yet the long time steps required pose challenges for reducing energy consumption. Therefore, identifying strategies that decrease power consumption while maintaining excellent performance is crucial.
The biological nervous system offers valuable insights for addressing the challenges of implementing high-performance Spiking Transformers with the ANN-to-SNN conversion method. The quantal synaptic failure theory suggests that, under certain conditions, information lost during neuronal signal transmission may not affect the computation performed by the postsynaptic neuron, while reducing energy consumption and heat production [22]. Likewise, in the ANN-to-SNN conversion process, missing spikes can potentially be compensated for by the correlations between signals in the spatial and temporal domains as information propagates over multiple time steps. Furthermore, neural network models contain many redundant connections: prior works reveal that redundancy in the self-attention module of Transformers can be pruned without significantly impacting performance [33, 48]. Therefore, eliminating redundant information during the transmission of neuronal signals can potentially reduce the overall energy consumption of the Spiking Transformer model while preserving high performance.
In this work, we propose a Masked Spiking Transformer (MST), which incorporates a Random Spike Masking (RSM) method designed specifically for SNNs. The RSM method randomly selects only a subset of the input spikes, significantly reducing the number of spikes involved in computation. We evaluate the MST model on both static and neuromorphic datasets, demonstrating its superiority over existing SNN models. Our experiments show that the RSM method reduces the energy consumption of the self-attention and MLP modules in the Transformer, enabling SNNs to combine energy efficiency with high performance. Furthermore, the proposed RSM method is not limited to Transformers, but can be extended to other backbones such as ResNet and VGG, highlighting its potential as a general technique for improving SNN efficiency. Our results demonstrate that this approach offers a new direction for developing high-performance and energy-efficient SNN models.
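To make the masking idea concrete, the sketch below shows one plausible NumPy implementation of random spike masking applied to a binary spike tensor at inference time. The function name `random_spike_mask` and the decision not to rescale the surviving spikes are illustrative assumptions on our part; the paper's exact masking procedure may differ.

```python
import numpy as np

def random_spike_mask(spikes, mask_ratio=0.75, rng=None):
    """Random spike masking sketch: independently drop a fraction of the binary
    input spikes before the downstream matrix multiplication, so the number of
    synaptic operations (and hence energy) falls roughly in proportion to the
    masking ratio."""
    rng = rng if rng is not None else np.random.default_rng()
    keep = rng.random(spikes.shape) >= mask_ratio   # Bernoulli keep-mask
    return spikes * keep

# Toy usage: binary spike tensor with ~30% firing rate, 75% of spikes masked out
rng = np.random.default_rng(0)
spikes = (rng.random((8, 64)) < 0.3).astype(float)
masked = random_spike_mask(spikes, mask_ratio=0.75, rng=np.random.default_rng(1))
print(int(spikes.sum()), int(masked.sum()))   # roughly one quarter of the spikes survive
```

Because spikes are binary, masking simply zeroes entries, so accumulate-only neuromorphic hardware would perform proportionally fewer synaptic operations on the masked input.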
The main contributions of this paper can be summarized
as follows:
• We propose a Masked Spiking Transformer (MST) using the ANN-to-SNN conversion method. To the best of our knowledge, this is the first exploration of applying the self-attention mechanism fully in SNNs via the ANN-to-SNN conversion method.
• The MST model is evaluated on both static and neuromorphic datasets, and the results show that it outperforms SOTA SNNs on all of them. Specifically, the top-1 accuracy of the MST model is 1.21%, 7.3%, and 3.7% higher than that of the current SOTA SNN models on the CIFAR-10, CIFAR-100, and ImageNet datasets, respectively.
• We design a Random Spike Masking (RSM) method for SNNs trained with the ANN-to-SNN conversion method to prune redundant spikes during inference and reduce energy consumption.
• Extensive experiments show that the proposed RSM is a versatile and general method that can be applied to other spike-based deep networks, such as ResNet and VGG SNN variants.

Figure 2. Overview of our Masked Spiking Transformer (MST). (a) Schematic of the model architecture of the Swin Transformer, which serves as the backbone of our model. (b) Schematic of the proposed Transformer blocks, where BN layers replace the original LN layers. (c) Conceptual illustration of the Random Spike Masking (RSM) method, which randomly masks the input spikes. (d-e) The RSM method in the self-attention and MLP modules.
2. Related Work
Spiking Neural Networks SNNs have gained popularity in the field of brain-inspired intelligence due to their compatibility with neuromorphic hardware and their biological plausibility. With growing interest in larger-scale and higher-performance SNNs, recent research has focused on developing novel training algorithms and architectures. Zheng et al. proposed a threshold-dependent batch normalization (tdBN) method based on spatiotemporal backpropagation to train a large-scale SNN model with 50 layers [55]. Fang et al. proposed the SEW ResNet architecture for residual learning in deep SNNs to overcome the vanishing gradient problem [9], and later introduced a training algorithm that learns the threshold of each spiking neuron to improve SNN performance [10]. However, these methods mainly address SNN models dominated by convolutional layers, such as VGG [41] and ResNet [14] SNN variants. Despite their improvements, the performance of these methods still struggles to match their ANN counterparts, limiting the application of SNNs. In this context, our work focuses on implementing the self-attention mechanism in SNNs to design a Spiking Transformer that improves SNN performance.
Transformer The Transformer [46] was first introduced in Natural Language Processing (NLP) and quickly gained popularity for its remarkable ability to capture long-range dependencies. Its success in NLP has inspired researchers to explore its potential in computer vision. The Vision Transformer (ViT) [8] was the first attempt to apply the Transformer to image classification, achieving impressive results on various computer vision benchmarks and demonstrating the effectiveness of the self-attention mechanism in image understanding. Following the success of ViT, a series of works [28, 13] proposed improvements to the original ViT architecture. Motivated by the success of Transformers and their variants, this paper proposes a new architecture for SNNs that leverages the capacity of the Transformer and the energy efficiency of SNNs.
Spiking Transformer The potential of combining Transformers and SNNs for better performance has been discussed in prior studies, including STNet [53] and Spike-T [54]. These models use separate SNN and Transformer branches for feature extraction, so they cannot run independently on neuromorphic hardware and fail to fully exploit the energy-efficiency benefits of SNNs. In addition, Mueller et al. [34] proposed a Spiking Transformer using the ANN-to-SNN conversion method, but they did not implement the self-attention module in SNNs. The recently proposed Spikformer [56] trains the Transformer directly in SNNs, but still struggles to achieve performance comparable to leading ANNs. To address these limitations, we apply the self-attention mechanism fully in SNNs using the ANN-to-SNN conversion method and propose the RSM method to improve both the performance and the energy efficiency of the Spiking Transformer. Our model offers a new direction for developing high-performance SNNs with the ANN-to-SNN conversion method.
3. Methods
3.1. Spiking Neuron Model
For ANNs, the input $a^{l-1}$ to layer $l$ is mapped to the output $a^l$ by a linear transformation matrix $W^l$ and a nonlinear activation function $f(\cdot)$, that is ($l = 1, 2, 3, \cdots, L$):
$$a^l = f\left(W^l a^{l-1}\right) \tag{1}$$
where $f(\cdot)$ is often set as the ReLU activation function.
In SNNs, the Integrate-and-Fire (IF) spiking neuron model is commonly used for ANN-to-SNN conversion [24, 2, 7]. The dynamics of the IF model are described by:
$$v^l(t) = v^l(t-1) + W^l \theta^{l-1} s^{l-1}(t) - \theta^l s^l(t) \tag{2}$$
where $v^l(t)$ denotes the membrane potential of the neurons in layer $l$ at time step $t$, $W^l$ is the linear transformation matrix, $\theta^l$ is the firing threshold, and $s^{l-1}(t)$ denotes the binary output spikes of the neurons in the previous layer $l-1$. The output spike $s^l(t)$ is defined as:
$$s^l(t) = H\left(u^l(t) - \theta^l\right) \tag{3}$$
where $u^l(t) = v^l(t-1) + W^l \theta^{l-1} s^{l-1}(t)$ denotes the membrane potential of the neurons before a spike is triggered at time step $t$, and $H(\cdot)$ denotes the Heaviside step function. A neuron emits an output spike whenever its membrane potential $u^l(t)$ exceeds the threshold $\theta^l$, and the membrane potential is then reset by subtracting the threshold value to reduce information loss [40].
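For concreteness, a minimal NumPy simulation of these IF dynamics with subtract-based reset is sketched below. The function name `if_neuron_layer`, the zero initial membrane potential, and the assumption that the weighted presynaptic drive is precomputed per time step are ours, introduced only to illustrate Eqs. (2) and (3).

```python
import numpy as np

def if_neuron_layer(weighted_inputs, theta=1.0):
    """Simulate one layer of Integrate-and-Fire neurons over T time steps.

    weighted_inputs: array of shape (T, N), the precomputed presynaptic drive
                     W^l * theta^{l-1} * s^{l-1}(t) for each time step t.
    theta:           firing threshold theta^l of this layer.
    Returns the binary output spikes s^l(t) with shape (T, N).
    """
    T, N = weighted_inputs.shape
    v = np.zeros(N)                       # membrane potential, v^l(0) = 0
    spikes = np.zeros((T, N))
    for t in range(T):
        u = v + weighted_inputs[t]        # u^l(t): potential before spiking
        s = (u >= theta).astype(float)    # Heaviside step H(u^l(t) - theta^l), Eq. (3)
        v = u - theta * s                 # subtract-based reset, Eq. (2)
        spikes[t] = s
    return spikes

# Toy usage: 4 neurons driven for 8 time steps by random sub-threshold inputs
rng = np.random.default_rng(0)
out = if_neuron_layer(rng.uniform(0.0, 0.6, size=(8, 4)), theta=1.0)
print(out.mean(axis=0))   # firing rates approximate the time-averaged input / theta
```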
3.2. ANN-to-SNN conversion
To achieve ANN-to-SNN conversion, a relationship is established between the rectified linear unit (ReLU) activation of the analog neurons in the ANN and the firing rate, or postsynaptic potential, of the spiking neurons in the SNN. This is obtained by summing Eq. 2 from time step 1 to $T$ and dividing both sides by $T$, resulting in the following equation:
$$\frac{v^l(T) - v^l(0)}{T} = \frac{\sum_{t=1}^{T} W^l \theta^{l-1} s^{l-1}(t)}{T} - \frac{\sum_{t=1}^{T} \theta^l s^l(t)}{T} \tag{4}$$
A linear relationship between $\phi^l(T)$ and $\phi^{l-1}(T)$ is then established by defining $\phi^l(T) = \frac{\sum_{t=1}^{T} \theta^l s^l(t)}{T}$ as the average postsynaptic potential:
$$\phi^l(T) = W^l \phi^{l-1}(T) - \frac{v^l(T) - v^l(0)}{T} \tag{5}$$
The equivalence between Eq. 1 and Eq. 5 holds only as $T$ goes to infinity, which otherwise results in a conversion error. To address this issue, we replace the ReLU activation function with the quantization clip-floor-shift (QCFS) [2] function in the ANN.
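As a reference point, the sketch below shows a QCFS-style activation of the form we understand from [2], with quantization level L, threshold lam, and a 0.5 shift inside the floor; this parameterization (including whether lam is learned per layer) is an assumption on our part rather than a detail taken from this excerpt.

```python
import numpy as np

def qcfs(x, lam=1.0, L=8):
    """QCFS-style activation (sketch): quantize a clipped-ReLU response to
    L levels in [0, lam]. Used in place of ReLU during ANN training so that
    the trained activations match the average postsynaptic potential an IF
    neuron can express within L time steps."""
    return lam * np.clip(np.floor(x * L / lam + 0.5) / L, 0.0, 1.0)

# Toy usage: the staircase approximation of a clipped ReLU with 4 levels
x = np.linspace(-0.5, 1.5, 9)
print(qcfs(x, lam=1.0, L=4))
```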
3.3. Model Architecture
An overview of the MST is depicted in Fig. 2, where the
Swin Transformer [28] is adopted as the backbone network.
To convert the original network into a fully-spiking manner,
we incorporate QCFS activation functions after each linear
or regularization layer during the training phase, which are
replaced with Integrate-and-Fire (IF) neurons in the infer-
ence process, resulting in more efficient computation.
The computation in the spiking self-attention module can be formulated as:
$$Q_{\mathrm{spk}}[t] = \mathrm{IF}(X[t] W_q), \qquad K_{\mathrm{spk}}[t] = \mathrm{IF}(X[t] W_k) \tag{6}$$
where $Q_{\mathrm{spk}}$ and $K_{\mathrm{spk}}$ denote the spike matrices of the query and key at time step $t$, $\mathrm{IF}(\cdot)$ is the IF neuron function, and $W_q$, $W_k$ are the corresponding weight matrices.
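To illustrate how these spiking projections can be computed, the sketch below pairs each linear projection with stateful IF neurons. The class name `IFProjection`, the weight initialization, and the toy shapes are hypothetical, and the rest of the attention computation (value projection, attention product, and the RSM masking shown in Fig. 2(d-e)) is omitted because it is not covered in this excerpt.

```python
import numpy as np

class IFProjection:
    """Linear projection followed by stateful IF neurons, a sketch of Eq. (6).
    The membrane potential persists across time steps and firing uses the
    subtract-based reset described in Sec. 3.1."""
    def __init__(self, W, theta=1.0):
        self.W, self.theta = W, theta
        self.v = 0.0                           # membrane potential, broadcast over tokens

    def __call__(self, x_t):
        u = self.v + x_t @ self.W              # accumulate weighted input spikes
        s = (u >= self.theta).astype(float)    # emit binary spikes
        self.v = u - self.theta * s            # subtract-based reset
        return s

# Toy usage: N tokens, embedding dim D, simulated for T time steps (shapes are hypothetical)
rng = np.random.default_rng(0)
N, D, T = 4, 16, 8
proj_q = IFProjection(rng.normal(0.0, 0.3, size=(D, D)))
proj_k = IFProjection(rng.normal(0.0, 0.3, size=(D, D)))
for t in range(T):
    X_t = rng.integers(0, 2, size=(N, D)).astype(float)   # binary input spikes X[t]
    Q_spk, K_spk = proj_q(X_t), proj_k(X_t)                # Q_spk[t] = IF(X[t] W_q), K_spk[t] = IF(X[t] W_k)
print(Q_spk.shape, K_spk.shape)
```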