
The biological nervous system offers valuable insights for addressing the challenges of implementing high-performance Spiking Transformers with the ANN-to-SNN conversion method. The quantal synaptic failure theory suggests that, under certain conditions, information lost during neuronal signal transmission may not impair the computation performed by the postsynaptic neuron, while reducing energy consumption and heat production [22]. Likewise, in ANN-to-SNN conversion, missing spikes can potentially be compensated for by the correlations between signals in the space and time domains as information propagates over multiple time steps. Furthermore, neural network models contain many redundant connections: prior works show that the redundancy in the self-attention module of Transformers can be pruned without significantly degrading performance [33, 48]. Therefore, eliminating redundant information during the transmission of neuronal signals can potentially reduce the overall energy consumption of the Spiking Transformer model while preserving high performance.
In our work, we propose the Masked Spiking Transformer (MST), which incorporates a Random Spike Masking (RSM) method designed specifically for SNNs. The RSM method randomly selects only a subset of the input spikes, significantly reducing the number of spikes involved in the computation. We evaluate the MST model on both static and neuromorphic datasets, demonstrating its superiority over existing SNN models. Our experiments show that the RSM method reduces energy consumption in both the self-attention and MLP modules of the Transformer, allowing SNNs to combine energy efficiency with high performance. Furthermore, the proposed RSM method is not limited to Transformers but can be extended to other backbones such as ResNet and VGG, highlighting its potential as a general technique for improving SNN efficiency. Our results demonstrate the potential of this approach to open a new direction for developing high-performance and energy-efficient SNN models.
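To make the RSM operation concrete, the following is a minimal PyTorch-style sketch of random spike masking as described above: a Bernoulli mask with a given keep ratio lets only a random subset of incoming spikes enter the subsequent computation. The function name random_spike_masking, the tensor layout, and the absence of any rescaling of the surviving spikes are illustrative assumptions rather than the exact MST implementation.

```python
import torch

def random_spike_masking(spikes: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """Randomly keep a subset of input spikes (illustrative sketch, not the exact MST code).

    spikes:     binary spike tensor, e.g. of shape [T, B, N, D] over T time steps.
    keep_ratio: fraction of spikes allowed to pass into the subsequent module.
    """
    if keep_ratio >= 1.0:
        return spikes
    # Bernoulli mask: 1 keeps a spike, 0 drops it.
    mask = torch.bernoulli(torch.full(spikes.shape, keep_ratio, device=spikes.device))
    # Only the surviving spikes enter the self-attention / MLP computation,
    # which directly reduces the number of synaptic operations (and thus energy).
    return spikes * mask.to(spikes.dtype)
```

For example, applying random_spike_masking(x, keep_ratio=0.75) to the spike input of the self-attention or MLP module would drop roughly a quarter of the spikes at inference time; whether and how the surviving spikes are rescaled is a design choice we do not specify here.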
The main contributions of this paper can be summarized
as follows:
• We propose a Masked Spiking Transformer (MST) built with the ANN-to-SNN conversion method. To the best of our knowledge, this is the first work to realize the self-attention mechanism fully in SNNs via ANN-to-SNN conversion.
• The MST model is evaluated on both static and neuromorphic datasets, and the results show that it outperforms SOTA SNNs on all of them. Specifically, the top-1 accuracy of the MST model is 1.21%, 7.3%, and 3.7% higher than that of the current SOTA SNN models on the CIFAR-10, CIFAR-100, and ImageNet datasets, respectively.
• We design a Random Spike Masking (RSM) method for SNNs trained with the ANN-to-SNN conversion method to prune redundant spikes during inference and reduce energy consumption.
• Extensive experiments show that the proposed RSM is a versatile and general method that can also be applied to other spike-based deep networks, such as ResNet and VGG SNN variants.

Figure 2. Overview of our Masked Spiking Transformer (MST). (a) Schematic of the Swin Transformer architecture, which serves as the backbone of our model. (b) Schematic of the proposed Transformer blocks, in which BN layers replace the original LN layers. (c) Conceptual illustration of the Random Spike Masking (RSM) method, which randomly masks the input spikes. (d-e) The RSM method applied to the self-attention and MLP modules.
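As a rough illustration of the block structure in Figure 2(b), the sketch below shows a Transformer block in which BatchNorm replaces the usual LayerNorm; tokens are folded into the batch dimension so that BatchNorm1d normalizes over channels. The class name BNTransformerBlock, the pre-normalization placement, and the use of a plain ReLU in the MLP (standing in for a conversion-friendly activation whose exact form we do not reproduce here) are assumptions for illustration, not the precise MST architecture.

```python
import torch
import torch.nn as nn

class BNTransformerBlock(nn.Module):
    """Transformer block with BN in place of LN (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.bn1 = nn.BatchNorm1d(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.bn2 = nn.BatchNorm1d(dim)
        hidden = int(dim * mlp_ratio)
        # ReLU stands in for a conversion-friendly activation; the exact choice is not shown here.
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def _bn(self, bn: nn.BatchNorm1d, x: torch.Tensor) -> torch.Tensor:
        # x: [B, N, D]; BatchNorm1d expects [B, D, N], so transpose around the normalization.
        return bn(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self._bn(self.bn1, x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self._bn(self.bn2, x))
        return x
```

In the full model, window-based attention and the spike masking of Figure 2(d-e) would wrap this basic structure; the plain nn.MultiheadAttention above is only a placeholder for that mechanism.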
2. Related Work
Spiking Neural Networks. SNNs have gained popularity in the field of brain-inspired intelligence due to their biological plausibility and compatibility with neuromorphic hardware. With the increasing interest in larger-scale and higher-performance SNNs, recent research has focused on developing novel training algorithms and architectures. Zheng et al. proposed a threshold-dependent batch normalization (tdBN) method based on spatiotemporal backpropagation to train a large-scale SNN model with 50 layers [55]. In addition, Fang et al. proposed the SEW ResNet architecture for residual learning in deep SNNs to overcome the vanishing gradient problem [9], and later introduced a training algorithm that learns the threshold of each spiking neuron to improve SNN performance [10]. However, these methods mainly address SNN models dominated by convolutional layers, such as VGG [41] and ResNet [14] SNN variants. Despite their improvements, the performance of these methods still struggles to match that of their ANN counterparts, limiting the application of SNNs. In this context, our proposed work focuses on implementing