
1. Sparsity is a prevalent phenomenon. We show in Section 2 that the emergence of sparsity in the activation maps of T5, as reported in Figure 1, is not an isolated, cherry-picked case. Rather, sparsity is prevalent and occurs broadly in Transformer models: it emerges in all layers of a Transformer, for Transformers trained on both vision and natural language data, for Transformers of various configurations, and for activation maps computed on both training and test data. Moreover, through controlled experiments on the width and depth of Transformers, we reveal that larger models are sparser, as measured by the percentage of nonzero entries. We also show in Appendix B that sparsity emerges with many other architectures and with different optimizers.
2. Sparsity comes from the training dynamics? Toward understanding where sparsity comes from, one argument is that commonly used image and natural language training datasets entail a compact representation, due either to information in the labels or to intrinsic low-dimensional structures of the natural data. Another hypothesis is that sparsity has nothing to do with the commonly used training datasets and instead arises because modern over-parameterized models can fit the training data even when it is generated at random. In Section 3, we design experiments using training datasets with random labels, random images, or an infinite amount of data, and show that none of the above fully explains the emergence of sparsity. Based on our observations, we speculate that sparsity may be attributed to the training dynamics of the optimization process. In particular, we show theoretically, with a simplified model architecture, that the descent direction of the gradient at the beginning of training points toward decreasing the value of the activations.
3. Sparsity improves efficiency, robustness, and calibration. Sparsity of the activation maps in trained Transformers implies that a large proportion of the computation during inference is spent on multiplying values by zero. Hence, FLOPs can be drastically reduced by avoiding all such computations, which we discuss in Section 4.1. Motivated by this observation, and to obtain reduced FLOPs not only after training but throughout training, we introduce the Top-k Transformer in Section 4.2, a simple modification of the Transformer in which a Top-k thresholding is applied to the activation maps² (a code sketch of this thresholding is given after this list). We show that Top-k Transformers with a reasonably sized k perform on par with vanilla Transformers. To demonstrate the computational benefits of Top-k Transformers, we provide proof-of-concept results on wall-time reduction for the task of unbatched decoding on TPUv4 with a large Top-k T5. Meanwhile, we emphasize that this result is far from fully realizing the benefit of sparse activation, due to a lack of hardware support for sparse computation.
While it is straightforward to associate sparsity with computational efficiency, it may be less obvious, and somewhat surprising, that sparsity is also associated with the reliability of the models. We show in Section 4.3 that enforcing explicit sparsity via Top-k Transformers improves model performance in terms of lower sensitivity to noisy training data, lower sensitivity to input corruptions, and better confidence calibration.
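To make the Top-k modification concrete, the following is a minimal sketch of an MLP block in which only the k largest entries of the intermediate activation are kept. This is an illustrative sketch, not the implementation used in our experiments; the function name, shapes, initialization, and the value of k are assumptions made for the example.

```python
# Minimal sketch of Top-k thresholding on the MLP activation map (illustrative only).
import jax
import jax.numpy as jnp

def topk_mlp(x, K, V, k):
    """Two-layer MLP f(x; K, V) = V sigma(K^T x) with Top-k thresholding.

    x: (d_model,) token representation
    K: (d_model, d_ff) first-layer weights
    V: (d_model, d_ff) second-layer weights
    k: number of activation entries to keep
    """
    a = jax.nn.relu(K.T @ x)                          # intermediate activation, shape (d_ff,)
    top_vals, _ = jax.lax.top_k(a, k)                 # k largest activation values
    a_sparse = jnp.where(a >= top_vals[-1], a, 0.0)   # zero out all other entries
    return V @ a_sparse                               # only ~k columns of V contribute

# Illustrative usage with random weights.
d_model, d_ff = 512, 2048
x = jax.random.normal(jax.random.PRNGKey(0), (d_model,))
K = jax.random.normal(jax.random.PRNGKey(1), (d_model, d_ff)) / jnp.sqrt(d_model)
V = jax.random.normal(jax.random.PRNGKey(2), (d_model, d_ff)) / jnp.sqrt(d_ff)
y = topk_mlp(x, K, V, k=128)
```

In a hardware-friendly implementation, one would gather only the selected k entries and the corresponding columns of V before the second matrix multiplication, which is where the FLOP savings discussed in Section 4.1 would come from.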
1.3 Experimental Setup
We study sparsity in the activation maps of two commonly used Transformer models, namely the Text-to-Text Transfer Transformer (T5) and the Vision Transformer (ViT).
• T5 is an encoder-decoder model for natural language processing tasks [7]. We train T5 on the Colossal Clean Crawled Corpus (C4) using the span corruption task, as suggested by [7].
• ViT is an encoder model for vision tasks [12]. Unless specified otherwise, we train ViT on ImageNet-21k [13], an image classification dataset with 14M images and 21k classes. In certain cases we also use ImageNet-1k, which is a subset of ImageNet-21k with 1.3M images and 1k classes.
Beyond T5 and ViT, we also present the results for BERT in the Appendix.
We measure the sparsity level at the intermediate output of the two-layer MLPs in a Transformer. Recall that an MLP performs the following mapping
$$f(x; K, V) \doteq \sum_{i=1}^{d_{\mathrm{ff}}} \sigma(\langle k_i, x\rangle)\cdot v_i, \quad \text{or equivalently,} \quad f(x; K, V) \doteq V\sigma(K^\top x), \tag{1}$$
where $x \in \mathbb{R}^{d_{\mathrm{model}}}$ is the input, $K = [k_1, \ldots, k_{d_{\mathrm{ff}}}] \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{ff}}}$ and $V = [v_1, \ldots, v_{d_{\mathrm{ff}}}] \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{ff}}}$ are learnable layer parameters, and $\sigma(\cdot)$ is a nonlinear activation function. We use ReLU as the activation function $\sigma(\cdot)$ for
² The approach was previously adopted in ConvNets for improving model robustness [10], and more recently in [11] for improving the memory efficiency of Transformers.
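As a minimal illustration of Eq. (1) and of the sparsity level we report, the snippet below computes the MLP output and the percentage of nonzero entries in the intermediate activation $\sigma(K^\top x)$. The shapes and random weights are illustrative assumptions, not our trained models.

```python
# Minimal sketch of the MLP in Eq. (1) and of the sparsity measurement:
# the percentage of nonzero entries in sigma(K^T x). Shapes and weights are illustrative.
import jax
import jax.numpy as jnp

def mlp(x, K, V):
    """f(x; K, V) = V sigma(K^T x), with sigma = ReLU."""
    a = jax.nn.relu(K.T @ x)      # intermediate activation, shape (d_ff,)
    return V @ a, a

def percent_nonzero(a):
    """Sparsity level: percentage of nonzero entries in the activation map."""
    return 100.0 * jnp.mean(a != 0)

d_model, d_ff = 512, 2048
x = jax.random.normal(jax.random.PRNGKey(0), (d_model,))
K = jax.random.normal(jax.random.PRNGKey(1), (d_model, d_ff)) / jnp.sqrt(d_model)
V = jax.random.normal(jax.random.PRNGKey(2), (d_model, d_ff)) / jnp.sqrt(d_ff)
y, a = mlp(x, K, V)
print(f"{percent_nonzero(a):.1f}% of the {d_ff} entries are nonzero")
```

With ReLU and roughly symmetric pre-activations at random initialization, about half of the entries are typically nonzero; the observation studied in this paper is that this percentage drops substantially over the course of training.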