
1. Sparsity is a prevalent phenomenon. We show in Section 2 that the emergence of sparsity in the activation maps of T5, as reported in Figure 1, is not an isolated, cherry-picked case. Rather, sparsity is prevalent and occurs broadly in Transformer models: it emerges in all layers of a Transformer, for Transformers trained on both vision and natural language data, for Transformers of various configurations, and for activation maps computed on both training and test data. Moreover, through controlled experiments on the width and depth of Transformers, we reveal that larger models are sparser, as measured by the percentage of nonzero entries. We also show in Appendix B that sparsity emerges with many other architectures and with different optimizers.
2. Sparsity comes from the training dynamics? Toward understanding where sparsity comes from, one argument is that commonly used image and natural language training datasets entail a compact representation, due either to information in the labels or to intrinsic low-dimensional structures of the natural data. Another hypothesis is that sparsity has nothing to do with the commonly used training datasets and instead arises because modern over-parameterized models can fit the training data even when it is generated at random. In Section 3, we design experiments using training datasets with random labels, random images, or an infinite amount of data, and show that none of the above fully explains the emergence of sparsity. Based on our observations, we speculate that sparsity may be attributed to the training dynamics of the optimization process. In particular, we show theoretically, with a simplified model architecture, that the descent direction of the gradient at the beginning of training points toward decreasing the value of the activations.
3. Sparsity improves efficiency, robustness, and calibration. Sparsity of the activation maps in trained Transformers implies that a large proportion of the computation during inference is spent on multiplying values by zero. Hence, FLOPs can be drastically reduced by avoiding all such computations, which we discuss in Section 4.1. Motivated by this observation, and to obtain reduced FLOPs not only after training but throughout training, we introduce the Top-k Transformer in Section 4.2, a simple modification of the Transformer in which a Top-k thresholding is applied to the activation maps² (a code sketch of this thresholding is given after this list). We show that Top-k Transformers with a reasonably sized k perform on par with vanilla Transformers. To demonstrate the computational benefits of Top-k Transformers, we provide proof-of-concept results on wall-time reduction for the task of unbatched decoding on TPUv4 with a large Top-k T5. Meanwhile, we emphasize that this result is far from fully realizing the benefit of sparse activation, due to a lack of hardware support for sparse computation.
While it is straightforward to associate sparsity with computational efficiency, it may be less obvious, and somewhat surprising, that sparsity is also associated with the reliability of the models. We show in Section 4.3 that enforcing explicit sparsity via Top-k Transformers improves model performance in terms of lower sensitivity to noisy training data, lower sensitivity to input corruptions, and better confidence calibration.
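To make the Top-k modification concrete, the following is a minimal sketch of an MLP block in which only the k largest entries of the intermediate activation are kept. This is an illustrative sketch, not the implementation used in our experiments; the function name, shapes, initialization, and the value of k are assumptions made for the example.

```python
# Minimal sketch of Top-k thresholding on the MLP activation map (illustrative only).
import jax
import jax.numpy as jnp

def topk_mlp(x, K, V, k):
    """Two-layer MLP f(x; K, V) = V sigma(K^T x) with Top-k thresholding.

    x: (d_model,) token representation
    K: (d_model, d_ff) first-layer weights
    V: (d_model, d_ff) second-layer weights
    k: number of activation entries to keep
    """
    a = jax.nn.relu(K.T @ x)                          # intermediate activation, shape (d_ff,)
    top_vals, _ = jax.lax.top_k(a, k)                 # k largest activation values
    a_sparse = jnp.where(a >= top_vals[-1], a, 0.0)   # zero out all other entries
    return V @ a_sparse                               # only ~k columns of V contribute

# Illustrative usage with random weights.
d_model, d_ff = 512, 2048
x = jax.random.normal(jax.random.PRNGKey(0), (d_model,))
K = jax.random.normal(jax.random.PRNGKey(1), (d_model, d_ff)) / jnp.sqrt(d_model)
V = jax.random.normal(jax.random.PRNGKey(2), (d_model, d_ff)) / jnp.sqrt(d_ff)
y = topk_mlp(x, K, V, k=128)
```

In a hardware-friendly implementation, one would gather only the selected k entries and the corresponding columns of V before the second matrix multiplication, which is where the FLOP savings discussed in Section 4.1 would come from.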
1.3 Experimental Setup
We study sparsity in the activation maps of two commonly used Transformer models, namely the Text-to-Text Transfer Transformer (T5) and the Vision Transformer (ViT).
• T5 is an encoder-decoder model for natural language processing tasks [7]. We train T5 on the Colossal Clean Crawled Corpus (C4) using the span corruption task, as suggested by [7].
• ViT is an encoder model for vision tasks [12]. Unless specified otherwise, we train ViT on ImageNet-21k [13], an image classification dataset with 14M images and 21k classes. In certain cases we also use ImageNet-1k, which is a subset of ImageNet-21k with 1.3M images and 1k classes.
Beyond T5 and ViT, we also present the results for BERT in the Appendix.
We measure the sparsity level at the intermediate output of the two-layer MLPs in a Transformer. Recall that an MLP performs the following mapping
$$f(x; K, V) \doteq \sum_{i=1}^{d_{\mathrm{ff}}} \sigma(\langle k_i, x\rangle)\cdot v_i, \quad \text{or equivalently,} \quad f(x; K, V) \doteq V\sigma(K^\top x), \tag{1}$$
where $x \in \mathbb{R}^{d_{\mathrm{model}}}$ is the input, $K = [k_1, \ldots, k_{d_{\mathrm{ff}}}] \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{ff}}}$ and $V = [v_1, \ldots, v_{d_{\mathrm{ff}}}] \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{ff}}}$ are learnable layer parameters, and $\sigma(\cdot)$ is a nonlinear activation function. We use ReLU as the activation function $\sigma(\cdot)$ for
² The approach was previously adopted in ConvNets for improving model robustness [10], and more recently in [11] for improving the memory efficiency of Transformers.
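As a minimal illustration of Eq. (1) and of the sparsity level we report, the snippet below computes the MLP output and the percentage of nonzero entries in the intermediate activation $\sigma(K^\top x)$. The shapes and random weights are illustrative assumptions, not our trained models.

```python
# Minimal sketch of the MLP in Eq. (1) and of the sparsity measurement:
# the percentage of nonzero entries in sigma(K^T x). Shapes and weights are illustrative.
import jax
import jax.numpy as jnp

def mlp(x, K, V):
    """f(x; K, V) = V sigma(K^T x), with sigma = ReLU."""
    a = jax.nn.relu(K.T @ x)      # intermediate activation, shape (d_ff,)
    return V @ a, a

def percent_nonzero(a):
    """Sparsity level: percentage of nonzero entries in the activation map."""
    return 100.0 * jnp.mean(a != 0)

d_model, d_ff = 512, 2048
x = jax.random.normal(jax.random.PRNGKey(0), (d_model,))
K = jax.random.normal(jax.random.PRNGKey(1), (d_model, d_ff)) / jnp.sqrt(d_model)
V = jax.random.normal(jax.random.PRNGKey(2), (d_model, d_ff)) / jnp.sqrt(d_ff)
y, a = mlp(x, K, V)
print(f"{percent_nonzero(a):.1f}% of the {d_ff} entries are nonzero")
```

With ReLU and roughly symmetric pre-activations at random initialization, about half of the entries are typically nonzero; the observation studied in this paper is that this percentage drops substantially over the course of training.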