Shapley Head Pruning:
Identifying and Removing Interference in Multilingual Transformers
William Held
wheld3@gatech.edu
Diyi Yang
diyiy@stanford.edu
Abstract
Multilingual transformer-based models demonstrate remarkable zero- and few-shot transfer across languages by learning and reusing language-agnostic features. However, as a fixed-size model acquires more languages, its performance across all languages degrades, a phenomenon termed interference. Often attributed to limited model capacity, interference is commonly addressed by adding additional parameters despite evidence that transformer-based models are overparameterized. In this work, we show that it is possible to reduce interference by instead identifying and pruning language-specific parameters. First, we use Shapley Values, a credit allocation metric from coalitional game theory, to identify attention heads that introduce interference. Then, we show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction, seeing gains as large as 24.7%. Finally, we provide insights on language-agnostic and language-specific attention heads using attention visualization.
1 Introduction
Cross-lingual transfer learning aims to utilize a natural language processing system trained on a source language to improve results for the same task in a different target language. The core goal is to maintain relevant learned patterns from the source while disregarding those which are inapplicable to the target. Multilingual pretraining of transformer language models has recently become a widespread method for cross-lingual transfer, demonstrating remarkable zero- and few-shot performance across languages when finetuned on monolingual data (Pires et al., 2019; Conneau et al., 2019; Xue et al., 2021).
However, adding languages beyond a threshold begins to harm cross-lingual transfer in a fixed-size model, as shown in prior work (Conneau et al., 2019; Xue et al., 2021). This has been addressed with additional parameters, both language-specific (Pfeiffer et al., 2020) and broadly (Conneau et al., 2019; Xue et al., 2021). Wang et al. (2020) justify this by showing that competition over limited capacity drives interference. However, limited capacity seems unlikely to be the sole driver, as Chen et al. (2020) show that pretrained language models are highly overparameterized.
We offer an alternate hypothesis: interference is caused by components that are specialized to language-specific patterns and introduce noise when applied to other languages. To test this hypothesis, we introduce a methodology for identifying language-specific components and removing them, which is expected to improve model performance without updating or adding language-specific parameters. Our work builds on prior research studying monolingual models, which shows that they can be pruned aggressively (Michel et al., 2019; Voita et al., 2019).
We leverage Shapley Values, the mean marginal contribution of a player to a collaborative reward, to identify attention heads that cause interference. Shapley Values are consistent, are gradient-free (which allows for binary removal), and map harmful components clearly to negative values. While the process of identifying and pruning language-specific structures is agnostic to the attribution methodology, these properties make Shapley Values particularly well-suited to the task. To make computation tractable, we follow prior work on vision models (Ghorbani and Zou, 2020) and approximate Shapley Values more efficiently using truncation and multi-armed bandit sampling. We contribute the following:
1. Attention Head Language Affinity: Even when computed from aligned sentences, Attention Head Shapley Values vary based on the language of input. This highlights that a subset of attention heads has language-specific importance, while others are language-agnostic, as shown in Figure 1.
Figure 1: Attention Head Shapley Values of 3 languages for XLM-R finetuned on English XNLI. Each value indicates the mean marginal effect an attention head has on accuracy for the XNLI test set in that language.
2. Improving through Pruning: Model pruning according to Shapley Values improves performance, without updating parameters, on the Cross-Lingual Natural Language Inference corpus (Conneau et al., 2018) and the Universal Dependencies part-of-speech corpus (Nivre et al., 2020). This opens a path of work to reduce interference by removing parameters rather than adding them.
3. Interpreting Multilingual Heads: In a qualitative study, we find that the most language-agnostic heads identified have a human-interpretable, language-agnostic function, while language-specific heads have varied behavior.
2 Related Work
2.1 Multilingual Learning
A large amount of work has studied both the theoretical underpinnings of learning common structures for language and their applications to cross-lingual transfer. Early works exploited commonality through the use of pivot representations, created either by translation (Mann and Yarowsky, 2001; Tiedemann et al., 2014; Mayhew et al., 2017) or language-agnostic task formulations (Zeman, 2008; McDonald et al., 2011).
As NLP has increasingly used representation learning, dense embedding spaces replaced explicit pivots. This led to methods that identified the commonalities of embedding spaces and ways to align them (Joulin et al., 2018; Artetxe et al., 2018; Artetxe and Schwenk, 2019). Recently, many works have trained multilingual transformer models (Pires et al., 2019; Conneau et al., 2019; Liu et al., 2020; Xue et al., 2021; Hu et al., 2021) as the basis for cross-lingual transfer. These models both implicitly and explicitly perform alignment, although they empirically achieve stronger alignment between closely related languages (Artetxe et al., 2020; Conneau et al., 2020).
With language-specific data, further work has studied how to reduce interference by adding a small number of language-specific parameters. These works adapt a model for the target language by training only Adapters (Wang et al., 2020; Pfeiffer et al., 2020; Ansell et al., 2021), prompts (Zhao and Schütze, 2021), or subsets of model parameters (Ansell et al., 2022).
Ma et al. (2021) previously investigated pruning in multilingual models, using gradient-based importance metrics to study variability across attention heads. However, they used a process of iterative pruning and language-specific finetuning. This iterative process is largely unstable, since there are many trainable subnetworks within large models (Prasanna et al., 2020). Our method is the first to address interference and improve cross-lingual performance purely by pruning, without updating or adding language-specific parameters.
2.2 Model Pruning
Model pruning has largely focused on reducing the onerous memory and computation requirements of large models. These techniques are broken into two approaches: structured and unstructured pruning. Unstructured pruning aims to remove individual parameters, which allows for more fine-grained removal; this often has minimal effect on performance even at extremely high degrees of sparsity. To efficiently prune a large number of parameters, many techniques propose using gradients or parameter magnitude (Sundararajan et al., 2017; Lee et al., 2019; Frankle and Carbin, 2019; Chen et al., 2020) as importance metrics.
Structured pruning, or removing entire structural components, is motivated by computational benefits from hardware optimizations. In the case of Transformers, most of this pruning work targets removal of attention heads, either through static ranking (Michel et al., 2019) or through iterative training (Voita et al., 2019; Prasanna et al., 2020; Xia et al., 2022). These pruning methods have also been used to study model behavior, but methods with iterative finetuning are highly unstable, as many sub-networks can deliver the same level of performance once trained (Prasanna et al., 2020).
Our work studies pruning without updating model parameters, which aligns with Michel et al. (2019), who were able to maintain reasonable model performance even when removing 50% of total attention heads. Furthermore, Kovaleva et al. (2019) found that pruning attention heads could sometimes improve model performance without further finetuning. We build on this to develop a methodology for consistently identifying pruned models which improve performance.
3 Methods
To identify and remove interference, we need a metric which can distinguish harmful, unimportant, and beneficial attention heads. Prior work (Michel et al., 2019; Ma et al., 2021) utilized the magnitude of gradients as an importance metric. However, this metric measures the sensitivity of the loss function to the masking of a particular head regardless of the direction of that sensitivity. Since the loss function is sensitive to the removal of both harmful and beneficial heads, we develop a simple yet effective method which separates these classes.
Shapley Values (Shapley, 1953) have often been applied in model interpretability since they are the only attribution method to abide by the theoretical properties of local accuracy, missingness, and consistency laid out by Lundberg and Lee (2017). In our setting, Shapley Values have two advantages over gradient-based importance metrics. Firstly, gradient-based approaches require differentiable relaxations of evaluation functions and masking, but Shapley Values do not; therefore, we can use the evaluation functions and binary masks directly. Secondly, Shapley Values are meaningfully signed, which allows them to distinguish beneficial, unimportant, and harmful heads rather than just important and unimportant heads. This latter property is essential for our goal of identifying interference.
We apply Shapley Values to the task of structural pruning. In order to compute Shapley Values for each head, we first formalize the forward pass of a Transformer as a coalitional game between attention heads. Then, we describe a methodology to efficiently approximate Shapley Values using Monte Carlo simulation combined with truncation and multi-armed bandit search. Finally, we propose a simple pruning algorithm using the resulting values to evaluate the practical utility of this theoretically grounded importance metric.
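As a concrete preview, the listing below is a minimal sketch of such a pruning rule, assuming per-head Shapley Values have already been estimated as described in Sections 3.1 and 3.2. Selecting heads whose value is negative is one natural criterion for removing interference; it is an illustrative simplification rather than a definitive statement of the selection rule evaluated in the experiments.

```python
# Hypothetical pruning rule: heads whose estimated Shapley Value is negative
# (i.e., heads that, on average, hurt the target-language metric) are marked
# for removal; all other heads are kept.
def heads_to_prune(shapley_values, threshold=0.0):
    """Return indices of heads whose estimated Shapley Value falls below `threshold`."""
    return [h for h, phi in enumerate(shapley_values) if phi < threshold]
```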
3.1 Attention Head Shapley Values
We formalize a Transformer performing a task as a coalitional game. Our set of players $A$ is the set of attention heads of the model. In order to remove self-attention heads from the game without retraining, we follow Michel et al. (2019), which augments multi-headed attention with an added gate $G_h \in \{0, 1\}$ for each head $\mathrm{Att}_h$ in a layer with $N_h$ heads as follows:
$$\mathrm{MHAtt}(x, q) = \sum_{h=1}^{N_h} G_h \, \mathrm{Att}_h(x, q) \tag{1}$$
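For concreteness, the following is a minimal sketch of the gated multi-headed attention of Equation 1 in PyTorch. This is an assumed implementation, not the one used in the paper: the class name, projection layout, and dimensions are illustrative. The relevant detail is that each head's output is multiplied by its binary gate $G_h$ before the output projection, so setting a gate to 0 removes that head from the coalition without any retraining.

```python
import torch
import torch.nn as nn


class GatedMultiHeadAttention(nn.Module):
    """Multi-head attention with a binary gate G_h per head (cf. Equation 1)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # One gate per head; 1 keeps the head, 0 removes it from the coalition.
        self.register_buffer("gates", torch.ones(n_heads))

    def forward(self, x: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # x: keys/values source (B, S, d_model); q: queries (B, T, d_model).
        B, S, _ = x.shape
        T = q.shape[1]

        def split(t: torch.Tensor, length: int) -> torch.Tensor:
            return t.view(B, length, self.n_heads, self.d_head).transpose(1, 2)

        Q = split(self.q_proj(q), T)             # (B, H, T, d_head)
        K = split(self.k_proj(x), S)             # (B, H, S, d_head)
        V = split(self.v_proj(x), S)             # (B, H, S, d_head)
        att = torch.softmax(Q @ K.transpose(-1, -2) / self.d_head ** 0.5, dim=-1)
        heads = att @ V                          # (B, H, T, d_head)
        # Zero out gated-off heads so they contribute nothing to the output.
        heads = heads * self.gates.view(1, self.n_heads, 1, 1)
        heads = heads.transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(heads)
```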
With $G_h = 0$, that attention head does not contribute to the output of the transformer and is therefore considered removed from the active coalition. Our characteristic function $V(A)$ is the task evaluation metric $M_v(A)$ over a set of validation data within a target language, adjusted by the evaluation metric with all heads removed to abide by the $V(\varnothing) = 0$ property of coalitional games:
$$V(A) = M_v(A) - M_v(\varnothing) \tag{2}$$
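A small sketch of the characteristic function in Equation 2 is shown below. Here `evaluate_with_mask` is a hypothetical helper (not from the paper) that applies a binary gate mask to the model and returns the task metric $M_v$ on target-language validation data.

```python
def make_characteristic(evaluate_with_mask, n_heads):
    """Build V(A) = M_v(A) - M_v(empty set), per Equation 2."""
    # Score with every gate closed; subtracting it enforces V(empty set) = 0.
    empty_score = evaluate_with_mask([0] * n_heads)

    def v(active_heads):
        mask = [1 if h in active_heads else 0 for h in range(n_heads)]
        return evaluate_with_mask(mask) - empty_score

    return v
```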
With these established, the Shapley Value $\phi_h$ for an attention head $\mathrm{Att}_h$ is the mean performance improvement from switching gate $G_h$ from $0$ to $1$ across all $P$ permutations of the other gates:
$$\phi_h = \frac{1}{|P|} \sum_{A \in P} \left[ V(A \cup \{h\}) - V(A) \right] \tag{3}$$
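Read literally, Equation 3 averages the marginal gain from enabling head $h$ over orderings of the remaining heads. The sketch below computes this exactly by enumerating every permutation, which is only feasible for a toy number of heads; it uses the characteristic function `v` from the previous sketch and is illustrative rather than the paper's implementation.

```python
from itertools import permutations


def exact_shapley_values(n_heads, v):
    """Exact per-head Shapley Values by enumerating all head permutations."""
    phis = [0.0] * n_heads
    perms = list(permutations(range(n_heads)))
    for perm in perms:
        active = set()
        prev = v(active)            # empty coalition, 0 by construction
        for h in perm:
            active.add(h)
            curr = v(active)
            phis[h] += curr - prev  # gain from switching G_h from 0 to 1
            prev = curr
    return [phi / len(perms) for phi in phis]
```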
3.2 Approximating Shapley Values
However, the exact computation of Shapley Values for $N$ attention heads requires $2^N$ evaluations of our validation metric, which is intractable for the number of heads used in most architectures. This intractability is often addressed by using Monte Carlo simulation as an approximation (Castro et al., 2009). This replaces $P$ in Equation 3 with randomly constructed permutations, rather than the full set of all possible permutations.
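A sketch of this Monte Carlo variant is given below: random permutations are sampled instead of being enumerated, and marginal contributions are averaged over the samples. The truncation and multi-armed bandit refinements of Ghorbani and Zou (2020) are omitted for brevity, so this is a simplified illustration rather than the exact procedure used in the paper.

```python
import random


def monte_carlo_shapley(n_heads, v, n_permutations=200, seed=0):
    """Approximate per-head Shapley Values from sampled head permutations."""
    rng = random.Random(seed)
    phis = [0.0] * n_heads
    for _ in range(n_permutations):
        perm = list(range(n_heads))
        rng.shuffle(perm)
        active = set()
        prev = v(active)
        for h in perm:
            active.add(h)
            curr = v(active)
            phis[h] += curr - prev
            prev = curr
    return [phi / n_permutations for phi in phis]
```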