
randomly sparsifying a small portion of parameters (e.g., with PyTorch's torch.nn.utils.prune.random_unstructured, https://pytorch.org/docs/stable/generated/torch.nn.utils.prune.random_unstructured). This indicates that the parameters responsible for student-unfriendliness can instead be removed to improve the student-friendliness of the teacher without sacrificing too much of its expressiveness.
Motivated by this finding, we propose a sparse teacher trick (STARK for short) guided by an overall knowledgeable score for each teacher parameter, which reflects, via interpolation, not only the expressiveness but also the student-friendliness of the parameter. The aim is to retain the expressive parameters while removing the student-unfriendly ones. Specifically, we introduce a three-stage procedure consisting of 1) trial distillation, 2) parameter sparsification, and 3) actual distillation. The trial distillation distills the dense teacher into the student to obtain a trial student. The parameter sparsification first estimates the expressiveness score and the student-friendliness score of each teacher parameter via feedback from the teacher itself and the trial student, respectively, and then sparsifies the teacher by removing the parameters with sufficiently low interpolated knowledgeable scores. The actual distillation distills the sparsified teacher into the student to obtain the actual student, where the student is initialized in the same manner as in the trial distillation, following the commonly used rewinding technique (Frankle and Carbin, 2019).
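To make the sparsification step concrete, the sketch below illustrates how an interpolated knowledgeable score could drive the pruning decision. It is a minimal sketch under our own assumptions: the per-parameter expressiveness and student-friendliness scores are taken as given (in our procedure they come from the teacher itself and the trial student, respectively), and the interpolation weight lam, the sparsity ratio, and the function name are illustrative rather than the exact STARK implementation.

```python
import torch

def sparsify_teacher_param(param: torch.Tensor,
                           expressiveness: torch.Tensor,
                           friendliness: torch.Tensor,
                           lam: float = 0.5,
                           sparsity: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch: zero out the teacher parameters whose
    interpolated knowledgeable scores are the lowest."""
    # Overall knowledgeable score: interpolation of expressiveness
    # (teacher feedback) and student-friendliness (trial-student feedback).
    knowledgeable = lam * expressiveness + (1.0 - lam) * friendliness

    # Remove the `sparsity` fraction of parameters with the lowest scores.
    k = int(sparsity * knowledgeable.numel())
    if k == 0:
        return param
    threshold = torch.kthvalue(knowledgeable.flatten(), k).values
    mask = (knowledgeable > threshold).to(param.dtype)
    return param * mask
```

The sparsified teacher would then be used in the actual distillation stage, with the student rewound to its trial-distillation initialization.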
We conduct an extensive set of experiments on the GLUE benchmark. Experimental results demonstrate that sparse teachers can be dense with knowledge and lead to remarkable student performance compared with a series of competitive baselines.
2 Background
2.1 BERT Architecture
BERT (Devlin et al., 2019) is composed of several stacked transformer encoder layers (Vaswani et al., 2017). Each encoder layer contains two blocks: a multi-head self-attention (MHA) block and a feed-forward network (FFN) block, each wrapped with a residual connection and a layer normalization.
Given an $l$-length sequence of $d$-dimensional input vectors $X \in \mathbb{R}^{l \times d}$, the output of the MHA block with $A$ independent heads can be represented as:
$$\mathrm{MHA}(X) = \sum_{i=1}^{A} \mathrm{Attn}(X, W^{(i)}_Q, W^{(i)}_K, W^{(i)}_V)\, W^{(i)}_O,$$
where the $i$-th head is parameterized by $W^{(i)}_Q, W^{(i)}_K, W^{(i)}_V \in \mathbb{R}^{d \times d_A}$ and $W^{(i)}_O \in \mathbb{R}^{d_A \times d}$. On the other hand, the output of the FFN block is:
$$\mathrm{FFN}(X) = \mathrm{GELU}(X W_1) W_2,$$
where the two fully-connected layers are parameterized by $W_1 \in \mathbb{R}^{d \times d_I}$ and $W_2 \in \mathbb{R}^{d_I \times d}$, respectively.
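To make the notation concrete, here is a minimal PyTorch sketch of the two blocks as written above, with per-head output projections summed as in the MHA equation; biases and the residual/normalization wrappers are omitted, and the scaled dot-product form of Attn follows Vaswani et al. (2017). The module and variable names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHA(nn.Module):
    """Sum over A heads of Attn(X, W_Q, W_K, W_V) W_O, as in the equation above."""
    def __init__(self, d: int, A: int):
        super().__init__()
        d_A = d // A  # per-head dimension d_A
        self.heads = nn.ModuleList([
            nn.ModuleDict({
                "W_Q": nn.Linear(d, d_A, bias=False),
                "W_K": nn.Linear(d, d_A, bias=False),
                "W_V": nn.Linear(d, d_A, bias=False),
                "W_O": nn.Linear(d_A, d, bias=False),
            })
            for _ in range(A)
        ])

    def forward(self, X: torch.Tensor) -> torch.Tensor:  # X: (l, d)
        out = torch.zeros_like(X)
        for h in self.heads:
            Q, K, V = h["W_Q"](X), h["W_K"](X), h["W_V"](X)
            # Standard scaled dot-product attention over the sequence.
            attn = F.softmax(Q @ K.transpose(-2, -1) / K.size(-1) ** 0.5, dim=-1)
            out = out + h["W_O"](attn @ V)
        return out

class FFN(nn.Module):
    """FFN(X) = GELU(X W_1) W_2."""
    def __init__(self, d: int, d_I: int):
        super().__init__()
        self.W_1 = nn.Linear(d, d_I, bias=False)
        self.W_2 = nn.Linear(d_I, d, bias=False)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        return self.W_2(F.gelu(self.W_1(X)))
```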
2.2 Knowledge Distillation
Knowledge distillation (Hinton et al., 2015) aims to transfer knowledge from a large-scale teacher to a smaller-scale student, and was originally proposed to supervise the student with the teacher logits. Given its prevalence, a tremendous amount of work has investigated transferring various kinds of knowledge from the teacher to the student (Romero et al., 2015; Zagoruyko and Komodakis, 2017; Sun et al., 2019; Jiao et al., 2020; Park et al., 2021b; Li et al., 2020; Wang et al., 2020). PKD (Sun et al., 2019) introduces a patient distillation scheme where the student learns multiple intermediate-layer representations and logits from the teacher. Moreover, attention distributions (Sun et al., 2020; Jiao et al., 2020; Li et al., 2020; Wang et al., 2020) and even high-order relations (Park et al., 2021b) are considered to further boost the performance.
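For reference, the logit-based objective of Hinton et al. (2015) matches temperature-softened output distributions; a minimal sketch is shown below, where the temperature T is illustrative, the T**2 rescaling is the commonly used correction discussed in that work, and the function name is ours.

```python
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence between temperature-softened teacher and student
    output distributions (logit-matching objective)."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)
```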
Since a large capacity gap between the teacher and the student can lead to inferior distillation quality, TAKD (Mirzadeh et al., 2020) proposes to insert teacher assistants of intermediate scales between the teacher and the student so that the gap is narrowed (Zhang et al., 2022). More recently, teachers with student-friendly architectures have demonstrated the significance of student-friendliness (Park et al., 2021a). MetaKD (Zhou et al., 2022) adopts meta-learning to optimize the student-friendliness of the teacher according to the student preference. DKD (Zhao et al., 2022) decouples the student-friendly knowledge from the rest and amplifies it. Distinguished from these student-friendly teachers, which are achieved by altering teacher scales, architectures, parameters, or knowledge representations, our work is, to the best of our knowledge, the first to suggest that teacher parameters can produce both