
as importance metrics.
Structured pruning, or removing entire structural components, is motivated by computational benefits from hardware optimizations. In the case of Transformers, most of this pruning work targets removal of attention heads, either through static ranking (Michel et al., 2019) or through iterative training (Voita et al., 2019; Prasanna et al., 2020; Xia et al., 2022). These pruning methods have also been used to study model behavior, but methods with iterative finetuning are highly unstable, as many sub-networks can deliver the same level of performance once trained (Prasanna et al., 2020).
Our work studies pruning without updating model parameters, which aligns with Michel et al. (2019), who maintained reasonable model performance even when removing 50% of all attention heads. Furthermore, Kovaleva et al. (2019) found that pruning attention heads could sometimes improve model performance without further finetuning. We build on this to develop a methodology for consistently identifying pruned models that improve performance.
3 Methods
To identify and remove interference, we need a metric which can distinguish harmful, unimportant, and beneficial attention heads. Prior work (Michel et al., 2019; Ma et al., 2021) utilized the magnitude of gradients as an importance metric. However, this metric measures the sensitivity of the loss function to the masking of a particular head regardless of the direction of that sensitivity. Since the loss function is sensitive to the removal of both harmful and beneficial heads, we develop a simple yet effective method which separates these classes.
Shapley Values (Shapley, 1953) have often been applied in model interpretability since they are the only attribution method to abide by the theoretical properties of local accuracy, missingness, and consistency laid out by Lundberg and Lee (2017). In our setting, Shapley Values have two advantages over gradient-based importance metrics. First, gradient-based approaches require differentiable relaxations of evaluation functions and masking, but Shapley Values do not, so we can use the evaluation functions and binary masks directly. Second, Shapley Values are meaningfully signed, which allows them to distinguish beneficial, unimportant, and harmful heads rather than just important and unimportant heads. This latter property is essential for our goal of identifying interference.
We apply Shapley Values to the task of structural pruning. In order to compute Shapley Values for each head, we first formalize the forward pass of a Transformer as a coalitional game between attention heads. Then, we describe a methodology to efficiently approximate Shapley Values using Monte Carlo simulation combined with truncation and multi-armed bandit search. Finally, we propose a simple pruning algorithm using the resulting values to evaluate the practical utility of this theoretically grounded importance metric.
3.1 Attention Head Shapley Values
We formalize a Transformer performing a task as a coalitional game. Our set of players $A$ are the attention heads of the model. In order to remove self-attention heads from the game without retraining, we follow Michel et al. (2019), which augments multi-headed attention with an added gate $G_h \in \{0, 1\}$ for each head $\mathrm{Att}_h$ in a layer with $N_h$ heads as follows:

$$\mathrm{MHAtt}(x, q) = \sum_{h=1}^{N_h} G_h \, \mathrm{Att}_h(x, q) \quad (1)$$
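As an illustrative sketch of Equation 1 (not our exact implementation), the gates can be applied to stacked per-head outputs; the tensor layout and function name here are assumptions:

```python
import torch

def gated_mhatt(head_outputs: torch.Tensor, gates: torch.Tensor) -> torch.Tensor:
    """Gated multi-head attention (Equation 1), a minimal sketch.

    head_outputs: per-head outputs Att_h(x, q), shape (N_h, batch, seq_len, d_model)
    gates:        binary gate vector G_h, shape (N_h,)
    """
    # Heads with G_h = 0 contribute nothing to the summed output.
    return (gates.view(-1, 1, 1, 1) * head_outputs).sum(dim=0)
```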
With $G_h = 0$, that attention head does not contribute to the output of the Transformer and is therefore considered removed from the active coalition. Our characteristic function $V(A)$ is the task evaluation metric $M_v(A)$ over a set of validation data within a target language, adjusted by the evaluation metric with all heads removed to abide by the $V(\emptyset) = 0$ property of coalitional games:

$$V(A) = M_v(A) - M_v(\emptyset) \quad (2)$$
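The characteristic function can be sketched as follows, assuming a hypothetical `evaluate` callable that maps a binary gate vector to the validation metric $M_v$:

```python
import numpy as np

def characteristic_value(active_heads: frozenset, evaluate, num_heads: int) -> float:
    """V(A) = M_v(A) - M_v(empty set), as in Equation 2."""
    gates = np.zeros(num_heads)
    gates[list(active_heads)] = 1.0
    # Subtract the all-heads-removed baseline so that V(empty set) = 0.
    return evaluate(gates) - evaluate(np.zeros(num_heads))
```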
With these established, the Shapley Value $\phi_h$ for an attention head $\mathrm{Att}_h$ is the mean performance improvement from switching gate $G_h$ from $0$ to $1$ across all $P$ permutations of the other gates:

$$\phi_h = \frac{1}{|P|} \sum_{A \in P} V(A \cup \{h\}) - V(A) \quad (3)$$
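Equation 3 amounts to averaging the marginal contribution of head $h$ over orderings of the heads, where $A$ is the set of heads preceding $h$ in an ordering. A brute-force sketch, feasible only for a handful of heads and assuming a hypothetical `value` callable for $V(\cdot)$:

```python
from itertools import permutations

def exact_shapley(num_heads: int, value) -> list[float]:
    """Exact Shapley Values phi_h via enumeration of all head orderings (Equation 3)."""
    phi = [0.0] * num_heads
    orderings = list(permutations(range(num_heads)))
    for order in orderings:
        active, prev = set(), value(frozenset())
        for h in order:
            active.add(h)
            curr = value(frozenset(active))
            phi[h] += curr - prev  # marginal contribution V(A ∪ {h}) - V(A)
            prev = curr
    return [p / len(orderings) for p in phi]
```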
3.2 Approximating Shapley Values
However, the exact computation of Shapley Values for $N$ attention heads requires $2^N$ evaluations of our validation metric, which is intractable for the number of heads used in most architectures. This intractability is often addressed by using Monte Carlo simulation as an approximation (Castro et al., 2009). This replaces $P$ in Equation 3 with randomly constructed permutations, rather than the full set of all possible permutations.
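A sketch of this Monte Carlo estimator, which samples a fixed number of random orderings rather than enumerating all of them (the `value` callable and the sample count are illustrative assumptions):

```python
import random

def monte_carlo_shapley(num_heads: int, value, num_samples: int = 200) -> list[float]:
    """Approximate Shapley Values by sampling random head orderings (Castro et al., 2009)."""
    phi = [0.0] * num_heads
    for _ in range(num_samples):
        order = random.sample(range(num_heads), num_heads)  # one random permutation
        active, prev = set(), value(frozenset())
        for h in order:
            active.add(h)
            curr = value(frozenset(active))
            phi[h] += curr - prev
            prev = curr
    return [p / num_samples for p in phi]
```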