
as importance metrics.
Structured pruning, or removing entire structural components, is motivated by computational benefits from hardware optimizations. In the case of Transformers, most of this pruning work targets removal of attention heads, either through static ranking (Michel et al., 2019) or through iterative training (Voita et al., 2019; Prasanna et al., 2020; Xia et al., 2022). These pruning methods have also been used to study model behavior, but methods with iterative finetuning are highly unstable, as many sub-networks can deliver the same level of performance once trained (Prasanna et al., 2020).
Our work studies pruning without updating model parameters, which aligns with Michel et al. (2019), who maintained reasonable model performance even when removing 50% of all attention heads. Furthermore, Kovaleva et al. (2019) found that pruning attention heads could sometimes improve model performance without further finetuning. We build on this to develop a methodology for consistently identifying pruned models that improve performance.
3 Methods
To identify and remove interference, we need a metric which can distinguish harmful, unimportant, and beneficial attention heads. Prior work (Michel et al., 2019; Ma et al., 2021) utilized the magnitude of gradients as an importance metric. However, this metric measures the sensitivity of the loss function to the masking of a particular head regardless of the direction of that sensitivity. Since the loss function is sensitive to the removal of both harmful and beneficial heads, we develop a simple yet effective method which separates these classes.
Shapley Values (Shapley, 1953) have often been applied in model interpretability since they are the only attribution method to abide by the theoretical properties of local accuracy, missingness, and consistency laid out by Lundberg and Lee (2017). In our setting, Shapley Values have two advantages over gradient-based importance metrics. First, gradient-based approaches require differentiable relaxations of evaluation functions and masking, but Shapley Values do not, so we can use the evaluation functions and binary masks directly. Second, Shapley Values are meaningfully signed, which allows them to distinguish beneficial, unimportant, and harmful heads rather than just important and unimportant heads. This latter property is essential for our goal of identifying interference.
We apply Shapley Values to the task of structural pruning. In order to compute Shapley Values for each head, we first formalize the forward pass of a Transformer as a coalitional game between attention heads. Then, we describe a methodology to efficiently approximate Shapley Values using Monte Carlo simulation combined with truncation and multi-armed bandit search. Finally, we propose a simple pruning algorithm using the resulting values to evaluate the practical utility of this theoretically grounded importance metric.
3.1 Attention Head Shapley Values
We formalize a Transformer performing a task as a coalitional game. Our set of players $A$ are the attention heads of the model. In order to remove self-attention heads from the game without retraining, we follow Michel et al. (2019), which augments multi-headed attention with an added gate $G_h \in \{0, 1\}$ for each head $\mathrm{Att}_h$ in a layer with $N_h$ heads as follows:

$$\mathrm{MHAtt}(x, q) = \sum_{h=1}^{N_h} G_h \, \mathrm{Att}_h(x, q) \quad (1)$$
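As an illustrative sketch of Equation 1 (not our exact implementation), the gates can be applied to stacked per-head outputs; the tensor layout and function name here are assumptions:

```python
import torch

def gated_mhatt(head_outputs: torch.Tensor, gates: torch.Tensor) -> torch.Tensor:
    """Gated multi-head attention (Equation 1), a minimal sketch.

    head_outputs: per-head outputs Att_h(x, q), shape (N_h, batch, seq_len, d_model)
    gates:        binary gate vector G_h, shape (N_h,)
    """
    # Heads with G_h = 0 contribute nothing to the summed output.
    return (gates.view(-1, 1, 1, 1) * head_outputs).sum(dim=0)
```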
With $G_h = 0$, that attention head does not contribute to the output of the Transformer and is therefore considered removed from the active coalition. Our characteristic function $V(A)$ is the task evaluation metric $M_v(A)$ over a set of validation data within a target language, adjusted by the evaluation metric with all heads removed to abide by the $V(\emptyset) = 0$ property of coalitional games:

$$V(A) = M_v(A) - M_v(\emptyset) \quad (2)$$
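The characteristic function can be sketched as follows, assuming a hypothetical `evaluate` callable that maps a binary gate vector to the validation metric $M_v$:

```python
import numpy as np

def characteristic_value(active_heads: frozenset, evaluate, num_heads: int) -> float:
    """V(A) = M_v(A) - M_v(empty set), as in Equation 2."""
    gates = np.zeros(num_heads)
    gates[list(active_heads)] = 1.0
    # Subtract the all-heads-removed baseline so that V(empty set) = 0.
    return evaluate(gates) - evaluate(np.zeros(num_heads))
```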
With these established, the Shapley Value $\phi_h$ for an attention head $\mathrm{Att}_h$ is the mean performance improvement from switching gate $G_h$ from $0$ to $1$ across all $P$ permutations of the other gates:

$$\phi_h = \frac{1}{|P|} \sum_{A \in P} V(A \cup \{h\}) - V(A) \quad (3)$$
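Equation 3 amounts to averaging the marginal contribution of head $h$ over orderings of the heads, where $A$ is the set of heads preceding $h$ in an ordering. A brute-force sketch, feasible only for a handful of heads and assuming a hypothetical `value` callable for $V(\cdot)$:

```python
from itertools import permutations

def exact_shapley(num_heads: int, value) -> list[float]:
    """Exact Shapley Values phi_h via enumeration of all head orderings (Equation 3)."""
    phi = [0.0] * num_heads
    orderings = list(permutations(range(num_heads)))
    for order in orderings:
        active, prev = set(), value(frozenset())
        for h in order:
            active.add(h)
            curr = value(frozenset(active))
            phi[h] += curr - prev  # marginal contribution V(A ∪ {h}) - V(A)
            prev = curr
    return [p / len(orderings) for p in phi]
```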
3.2 Approximating Shapley Values
However, the exact computation of Shapley Values for $N$ attention heads requires $2^N$ evaluations of our validation metric, which is intractable for the number of heads used in most architectures. This intractability is often addressed by using Monte Carlo simulation as an approximation (Castro et al., 2009). This replaces $P$ in Equation 3 with randomly constructed permutations, rather than the full set of all possible permutations.
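A sketch of this Monte Carlo estimator, which samples a fixed number of random orderings rather than enumerating all of them (the `value` callable and the sample count are illustrative assumptions):

```python
import random

def monte_carlo_shapley(num_heads: int, value, num_samples: int = 200) -> list[float]:
    """Approximate Shapley Values by sampling random head orderings (Castro et al., 2009)."""
    phi = [0.0] * num_heads
    for _ in range(num_samples):
        order = random.sample(range(num_heads), num_heads)  # one random permutation
        active, prev = set(), value(frozenset())
        for h in order:
            active.add(h)
            curr = value(frozenset(active))
            phi[h] += curr - prev
            prev = curr
    return [p / num_samples for p in phi]
```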