The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

Zonglin Li*, Chong You*, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, and Sanjiv Kumar

Google Research, New York City

June 13, 2023
Abstract
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by "sparse" we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to the MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels, as well as for other architectures including MLP-mixers and 2-layer MLPs. We show that sparsity also emerges using training datasets with random labels, or with random inputs, or with an infinite amount of data, demonstrating that sparsity is not a result of a specific family of datasets. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate, perhaps surprisingly, that enforcing an even sparser activation via Top-k thresholding with a small value of k brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration for their prediction confidence.
1 Introduction
The great success of modern machine learning for applications in computer vision, natural language processing,
game playing, and beyond is driven primarily by the computational model known as deep neural networks
(DNNs) [1]. With inspirations drawn from information processing in biological brains, DNNs are artificial
neural networks constructed from distributed computational nodes (a.k.a. neurons) with inter-connections
learned from data. Compared to shallow machine learning models, DNNs possess superior learning capacity
and hence can handle complex real-world tasks.
Although artificial neural networks are motivated by biological brains, the two differ at a very fundamental level in how they operate. One such difference is in the sparsity of computation. Evidence from neuroscience suggests that neural activity in biological brains is sparse, namely, only a small percentage of all neurons fire at any given time [2, 3, 4, 5]. Sparse firing suggests that despite having billions of neurons, only a small fraction of the brain participates in computation at each time. This may explain why brains can operate at a very low energy cost. In contrast, learning and inference with DNNs rely primarily on dense computations where all neurons are involved for every input. In fact, modern computational hardware for deep neural networks, such as GPUs and TPUs, is designed to facilitate massive-scale dense computations. Even with such dedicated hardware, DNNs are still notoriously resource-demanding to train and deploy. Aside from computational efficiency, artificial neural networks also lag far behind biological ones in terms of robustness to input perturbations, error correction for erroneous training labels, confidence calibration for their predictions, etc.
*Equal contribution
Figure 1: Percentage of nonzero entries (y-axis, log scale) in the activation map as a function of the number of training steps (x-axis) for a T5-Base model trained with the span corruption objective on the C4 dataset. Left (a): layers (from shallow to deep) of the T5 encoder. Right (b): layers of the T5 decoder.
1.1 An Intriguing Observation: Activations are Sparse in Trained Transformers
This paper provides an extensive study of a surprising observation: despite performing dense computations, DNNs produce very sparse activations in their intermediate layers once trained¹. Specifically, we study the Transformer [6], a DNN model architecture that has become a workhorse for modern applications. Transformers are constructed by interweaving self-attention modules with multi-layer perceptrons (MLPs) of depth 2, and the focus of this paper is on the activation map in the intermediate output of the MLPs (after the activation function). Figure 1 shows the sparsity of the activation maps on the training data, measured by the percentage of nonzeros, in all MLP layers of a T5-Base model, which is a Transformer-based encoder-decoder model for natural language processing [7]. We see that the percentage of nonzero entries is around 50% at initialization, which is expected: randomly initialized weights produce roughly equal numbers of positive and negative entries in the pre-activation map, resulting in 50% nonzeros after the ReLU. However, at the end of training the percentage of nonzero entries reduces drastically: the average value across all encoder and decoder layers is 2.7%, with the largest being 12.0% and the smallest being only 1.1%. The emergence of sparse activation in Transformers bears a similarity to the sparsity of neural activities in biological brains, revealing an interesting connection between artificial and biological neural networks. Moreover, unlike classical sparse methods where such a connection is established via explicit sparse regularization [8], the sparsity observed in Transformers is emergent without any explicit design.
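To make the 50%-at-initialization argument concrete, here is a minimal JAX sketch (our own illustration, not the paper's code) that draws a random first-layer MLP weight matrix with the T5-Base shapes quoted in Section 1.3 and checks that roughly half of the post-ReLU entries are nonzero; the Gaussian inputs and the 1/sqrt(d_model) initializer scale are assumptions.

```python
# Minimal sketch (assumed Gaussian inputs/weights): at random initialization the
# pre-activation Kx has roughly as many positive as negative entries, so ReLU
# leaves about 50% of the activation map nonzero.
import jax
import jax.numpy as jnp

d_model, d, n_tokens = 768, 3072, 1024          # T5-Base MLP shapes (Section 1.3)
key_x, key_k = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(key_x, (n_tokens, d_model))                 # random layer inputs
K = jax.random.normal(key_k, (d_model, d)) / jnp.sqrt(d_model)    # random MLP weights

pre = x @ K                                     # pre-activation map
post = jax.nn.relu(pre)                         # activation map
print(f"positive pre-activations: {float(jnp.mean(pre > 0)):.3f}")   # ~0.5
print(f"nonzero after ReLU:       {float(jnp.mean(post > 0)):.3f}")  # ~0.5
```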
It is worth noting that the observation that Transformers produce sparse activations was previously reported in [9]. Our paper significantly extends the results in [9] to demonstrate that sparsity emerges prevalently at all layers of Transformers, for both language and vision tasks, on both training and evaluation data, and for some architectures beyond Transformers. We also examine the activation of individual neurons to show that sparsity is not caused by "dead" neurons and that the percentage of activation has a long-tailed distribution. In addition, through experiments with specially designed datasets, as well as a theoretical analysis of the gradient at the beginning of training, we show that the emergence of sparsity may be due to the training dynamics rather than particular choices of datasets. Finally, our paper provides empirical evidence that sparsity is positively correlated with model robustness and calibration.
1.2 Prevalence, Causes, and Benefits of Sparsity
This paper provides a study of the aforementioned phenomenon of sparse activation in trained Transformer models, with a focus on answering the following three questions. First, is the phenomenon in Figure 1 a corner case, or does it occur more broadly? Second, what are the causes of the emergence of sparsity? Third, why should we care about sparsity in DNNs, other than the appeal of its similarity to biological brains? Our main results along these lines are summarized below.
¹This implies, as we explain in detail later, that much of the computation is spent in vain multiplying values by zero.
1. Sparsity is a prevalent phenomenon. We show in Section 2 that the emergence of sparsity in activation maps of T5 as reported in Figure 1 is not an isolated, cherry-picked case. Rather, sparsity is prevalent and occurs broadly in Transformer models: it emerges in all layers of a Transformer, for Transformers trained on both vision and natural language data, for Transformers of various configurations, and for activation maps computed on both training and test data. Moreover, through controlled experiments on the width and depth of Transformers, we reveal that larger models are sparser, as measured by the percentage of nonzero entries. We also show in Appendix B that sparsity emerges with many other architectures and with different optimizers.
2. Sparsity comes from the training dynamics? Towards understanding where sparsity comes from, one argument is that commonly used image and natural language training datasets admit a compact representation, due to information in the labels or intrinsic low-dimensional structures of the natural data. Another hypothesis is that sparsity has nothing to do with commonly used training datasets and arises because modern over-parameterized models are able to fit the training data even when it is generated at random. In Section 3, we design experiments using training datasets with random labels, random images, or an infinite amount of data, to show that none of the above fully explains the emergence of sparsity. Based on our observations, we speculate that the sparsity may be attributed to the training dynamics of the optimization process. In particular, we show theoretically, with a simplified model architecture, that the descent direction of the gradient at the beginning of training points toward decreasing the values of the activations.
3. Sparsity improves efficiency, robustness, and calibration. Sparsity of the activation maps in trained Transformers implies that a large proportion of the computation during inference is spent multiplying values by zero. Hence, FLOPs can be drastically reduced by avoiding all such computations, which we discuss in Section 4.1. Motivated by this observation, and to obtain reduced FLOPs not only after training but throughout training, we introduce the Top-k Transformer in Section 4.2, a simple modification of Transformers where a Top-k thresholding is applied to the activation maps² (see the sketch after this list). We show that Top-k Transformers with a reasonably sized k have on-par performance with vanilla Transformers. To demonstrate the computational benefits of Top-k Transformers, we provide proof-of-concept results on wall-time reduction for the task of unbatched decoding on TPUv4 with a large Top-k T5. Meanwhile, we emphasize that this result is far from fully realizing the benefit of sparse activation, due to a lack of hardware support for sparse computation.
While it is straightforward to associate sparsity with computational efficiency, it may be less obvious and somewhat surprising that sparsity is also associated with the reliability of the models. We show in Section 4.3 that enforcing explicit sparsity via Top-k Transformers improves model performance in terms of less sensitivity to noisy training data, less sensitivity to input corruptions, and better confidence calibration.
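As a rough illustration of the Top-k thresholding introduced above and of the FLOP savings that sparsity enables, the JAX sketch below zeroes out all but the k largest entries of the hidden activation, and then computes the same output by multiplying only the k selected columns of the second MLP weight matrix. This is our own sketch rather than the paper's Top-k Transformer implementation; the function names (topk_mlp, sparse_mlp), the choice k = 128, and the random weights are illustrative assumptions.

```python
# Sketch of Top-k thresholding on the MLP activation map, and of the equivalent
# "gather then small matmul" form that shows where the FLOP reduction comes from.
import jax
import jax.numpy as jnp

def topk_mlp(x, K, V, k=128):
    """Two-layer MLP with Top-k thresholding on the hidden activation."""
    a = jax.nn.relu(x @ K)                      # activation map, shape (d,)
    thresh = jax.lax.top_k(a, k)[0][-1]         # k-th largest activation value
    a = jnp.where(a >= thresh, a, 0.0)          # keep the top-k entries, zero the rest
    return a @ V.T                              # same as V @ a for column-stacked V

def sparse_mlp(x, K, V, k=128):
    """Same output, but only k columns of V participate in the matmul."""
    a = jax.nn.relu(x @ K)
    vals, idx = jax.lax.top_k(a, k)             # values and indices of the top-k neurons
    return V[:, idx] @ vals                     # gather + small matmul: O(k * d_model)

d_model, d = 768, 3072
kx, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(kx, (d_model,))
K = jax.random.normal(kk, (d_model, d)) / jnp.sqrt(d_model)
V = jax.random.normal(kv, (d_model, d)) / jnp.sqrt(d)
print(jnp.allclose(topk_mlp(x, K, V), sparse_mlp(x, K, V), atol=1e-4))  # True
```

The first form keeps a dense tensor and is the kind of drop-in change one could make during training; the second makes explicit why the second matmul's cost drops from roughly d * d_model to k * d_model multiplications when only k activations are nonzero.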
1.3 Experimental Setup
We study the sparsity in activation maps of Transformers with two commonly used Transformer models,
namely Text-to-Text Transfer Transformer (i.e., T5) and Vision Transformer (i.e., ViT).
T5 is an encoder-decoder model for natural language processing tasks [7]. We train T5 on the Colossal Clean Crawled Corpus (C4) using the span corruption task as suggested by [7].
ViT is an encoder model for vision tasks [12]. Unless specified otherwise, we train ViT on ImageNet-21k [13], an image classification dataset with 14M images and 21k classes. For certain cases we also use ImageNet-1k, which is a subset of ImageNet-21k with 1.3M images and 1k classes.
Beyond T5 and ViT, we also present the results for BERT in the Appendix.
We measure the sparsity level at the intermediate output of the two-layer MLPs in a Transformer. Recall that an MLP performs the following mapping

$f(x; K, V) := \sum_{i=1}^{d} \sigma(\langle k_i, x \rangle) \cdot v_i$, or equivalently, $f(x; K, V) := V\sigma(Kx)$,   (1)

where $x \in \mathbb{R}^{d_{\text{model}}}$ is the input, $K = [k_1, \ldots, k_d]^\top \in \mathbb{R}^{d \times d_{\text{model}}}$ and $V = [v_1, \ldots, v_d] \in \mathbb{R}^{d_{\text{model}} \times d}$ are learnable layer parameters, and $\sigma(\cdot)$ is a nonlinear activation function. We use ReLU as the activation function $\sigma(\cdot)$ for both T5 and ViT³.
²The approach was previously adopted in ConvNets for improving model robustness [10], and more recently in [11] for improving the memory efficiency of Transformers.
Figure 2: Percentage of nonzero entries across different layers of trained Transformers: (a) for both language data with T5 and vision data with ViT; (b) on both training and evaluation data; (c) for ViT trained on ImageNet at two different scales (21k vs. 1k classes); (d) for ViT of varying configurations; and (e, f) for the encoder and decoder of T5 with varying configurations. Please note that the y-axis is in log scale. Sparsity emerges in all cases.
A two-layer MLP may be regarded as having $d$ neurons, where the $i$-th neuron performs the computation $\sigma(\langle k_i, x \rangle) \cdot v_i$, and the final layer output is the sum of the outputs of all neurons. Each neuron is called activated if $\sigma(\langle k_i, x \rangle)$ is strictly positive. Hence, the sparsity of neuron activation can be measured by the number of nonzero entries in the feature map

$a := \sigma(Kx)$,   (2)

which is a vector of dimension $d$. Throughout the paper, the sparsity level is computed on the training set unless otherwise specified.
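Below is a minimal JAX sketch of Eqs. (1)-(2) and of the sparsity metric used throughout the paper, i.e., the percentage of nonzero entries of a = σ(Kx) averaged over inputs. It is our own illustration, not the T5X or Scenic measurement code; the Gaussian weights and the batch of 4,096 random token inputs are assumptions.

```python
# Sketch of the two-layer MLP of Eq. (1) and the sparsity metric of Eq. (2).
import jax
import jax.numpy as jnp

def mlp(x, K, V):
    """Eq. (1): f(x; K, V) = V @ relu(K @ x); K has shape (d, d_model), V (d_model, d)."""
    a = jax.nn.relu(K @ x)           # Eq. (2): activation map a, shape (d,)
    return V @ a                     # output, shape (d_model,)

def percent_nonzero(X, K):
    """Sparsity level: average percentage of nonzero entries of a over a batch of inputs X."""
    A = jax.nn.relu(X @ K.T)         # (n_inputs, d) activation maps
    return 100.0 * float(jnp.mean(A > 0))

d_model, d = 768, 3072               # Base configuration
kx, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
X = jax.random.normal(kx, (4096, d_model))                    # batch of token inputs
K = jax.random.normal(kk, (d, d_model)) / jnp.sqrt(d_model)   # rows are the k_i
V = jax.random.normal(kv, (d_model, d)) / jnp.sqrt(d)         # columns are the v_i

print(mlp(X[0], K, V).shape)                       # (768,) = d_model
print(f"{percent_nonzero(X, K):.1f}% nonzero")     # ~50% for random K; ~3% for a trained T5-Base
```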
Both T5 and ViT come in several configurations for $d_{\text{model}}$, $d$, the number of layers, etc. Unless specified otherwise, we use the Base models (i.e., T5-Base and ViT-B/16), which have $d_{\text{model}} = 768$, $d = 3072$, and 12 layers (for ViT) or 12 encoder layers + 12 decoder layers (for T5). Our experiments with T5 use the T5X codebase [14], and our experiments with ViT use the Scenic codebase [15]. More training details of T5 and ViT are provided in Appendix A.
2 Prevalence of Sparsity in Learned Transformers
This section shows thorough experiments on commonly used Transformers that sparsity in activation maps
is a prevalent phenomenon. We also show through some controlled experiments that deeper and wider
Transformers tend to be sparser measured by percentage of nonzero entries in activation maps.
³ViT originally uses GeLU as its activation function, as in [12]. Here we switch to ReLU as it allows us to more easily measure the sparsity level using the number of nonzero entries, with a very small performance drop (e.g., 47.78% with GeLU vs. 47.58% with ReLU for Top-1 evaluation accuracy on ImageNet-21k).
2.1 Sparsity is a Ubiquitous Phenomenon
We start by providing experimental evidence that the emergence of sparse activation in trained Transformers
is a ubiquitous phenomenon. To this end, we plot the percentage of nonzero entries of activation maps in
different Transformers, and present the results in Figure 2. Such results demonstrate the following.
Sparsity emerges for both Vision and NLP tasks. Figure 2a shows the percentage of nonzero entries of trained T5 and ViT models evaluated on their respective training datasets. We see that both the encoder and decoder of T5, as well as ViT, exhibit sparsity.
Sparsity emerges on both training and evaluation data. Figure 2b shows the percentage of nonzero entries in a
trained T5 evaluated on both the training data and the evaluation data. We see that the property of sparsity
generalizes very well to evaluation data as the curves for training and evaluation data align very closely
with each other.
Sparsity emerges on datasets of varying scale. Figure 2c shows the percentage of nonzero entries in ViT trained on both ImageNet-21k and ImageNet-1k, where the former is a superset of the latter with approximately 10× more images and 21× more classes. We see that the scale of the data does not affect the sparsity level much.
Sparsity emerges on Transformers of varying configurations. Figure 2d shows the percentage of nonzero entries
for ViT of varying configurations in model size. Figure 2e and 2f show the percentage of nonzero entries for
encoder and decoder, respectively, of T5 with varying configurations in model size. We see that sparsity
persists for all cases.
Sparsity emerges across all layers of a Transformer. Finally, all plots in Figure 2 show that sparsity emerges in all layers of a Transformer. Moreover, in all cases the first few and last few layers tend to be denser than the intermediate layers.
Figure 3: Percentage of times that each neuron in the first MLP layer of a trained T5 is activated on the C4 dataset.
The presence of sparsity in activation maps does not rule out the possibility that a small percentage of the neurons are always activated for all inputs, whereas the rest of the neurons are never activated. To illustrate that this is not the case, we experiment with a pretrained T5-Base model⁴ and plot the percentage of layer inputs for which each of the d neurons is activated, evaluated on 800 examples taken from the C4 dataset with the span corruption task. Note that there are 800 × 512 = 409,600 samples, as the MLP activation is computed per token. The results are presented in Figure 3, with the x-axis being the indices of neurons in the first encoder layer of T5 sorted in descending order according to the percentage of layer inputs on which they are activated. It can be seen that while a few neurons are activated around 50% of the time, the vast majority of neurons (around 93.5%) are activated less than 10% of the time. Moreover, there are no dead neurons that are never activated: the least activated neuron is activated around 0.001% of the time, and 99% of the neurons are activated more than 1% of the time. Finally, while the results here are for neurons in the first MLP layer of a pretrained T5-Base encoder, all other MLP layers show qualitatively similar behavior.
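The per-neuron statistic behind Figure 3 can be sketched as follows: collect the activation maps over many layer inputs (tokens), compute for each of the d neurons the fraction of inputs on which it is strictly positive, and sort the neurons by that fraction in descending order. The JAX sketch below is illustrative only; random shifted-Gaussian activations stand in for the real activations of a pretrained T5 layer, and the shift of 2.0 that makes them sparse is an arbitrary assumption.

```python
# Sketch of the per-neuron activation-frequency statistic plotted in Figure 3.
import jax
import jax.numpy as jnp

def neuron_activation_rates(A):
    """A: (n_tokens, d) activation maps. Returns per-neuron activation rates, sorted descending."""
    rates = jnp.mean(A > 0, axis=0)      # fraction of layer inputs activating each neuron
    return jnp.sort(rates)[::-1]         # descending order, as on the x-axis of Figure 3

# Stand-in data; in the paper, A is collected from a pretrained T5 encoder layer over
# 800 C4 examples x 512 tokens = 409,600 layer inputs.
A = jax.nn.relu(jax.random.normal(jax.random.PRNGKey(0), (4096, 3072)) - 2.0)
rates = neuron_activation_rates(A)
print(f"most active neuron: {float(rates[0]):.3f}, "
      f"median neuron: {float(jnp.median(rates)):.3f}")
```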
2.2 The Larger, the Sparser
We next examine the effect of model size on the sparsity level of activation maps. Note that Figure 2e and Figure 2f provide evidence with T5 of varying configurations that larger models tend to be sparser. Here we perform controlled experiments to examine the effect of model depth, measured by the number of Transformer layers, and the effect of model width, measured by the dimension of the MLP activation map (i.e., $d$), separately. To that end, we take a standard T5 model and vary the depth and width, respectively, while keeping the rest of the configuration fixed, and examine the sparsity level after training. The results are presented in Figure 4 for the encoder; we omit the results for the decoder as they are qualitatively the same as those for the encoder.
It can be seen from Figure 4a that deeper Transformers are arguably sparser. For example, many of the
middle layers of the 32-layer model have less than 1% nonzero entries while all shallower models have more
⁴https://github.com/google-research/t5x/blob/main/docs/models.md#t5-checkpoints