Sparse Teachers Can Be Dense with Knowledge
Yi Yang, Chen Zhang, Dawei Song
Beijing Institute of Technology
{yang.yi,czhang,dwsong}@bit.edu.cn
Abstract
Recent advances in distilling pretrained language models have discovered that, besides the expressiveness of knowledge, the student-friendliness should be taken into consideration to realize a truly knowledgeable teacher. Based on a pilot study, we find that over-parameterized teachers can produce expressive yet student-unfriendly knowledge and are thus limited in overall knowledgeableness. To remove the parameters that result in student-unfriendliness, we propose a sparse teacher trick under the guidance of an overall knowledgeable score for each teacher parameter. The knowledgeable score is essentially an interpolation of the expressiveness and student-friendliness scores. The aim is to ensure that the expressive parameters are retained while the student-unfriendly ones are removed. Extensive experiments on the GLUE benchmark show that the proposed sparse teachers can be dense with knowledge and lead to students with compelling performance in comparison with a series of competitive baselines.¹
1 Introduction
Pretrained language models (LMs) built upon transformers (Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020) have achieved great successes. However, the appealing performance is usually accompanied by expensive computational costs and memory footprints, which can be alleviated by model compression (Ganesh et al., 2021). Knowledge distillation (Hinton et al., 2015), as a dominant method in model compression, concentrates on transferring knowledge from a teacher of large scale to a student of smaller scale.
Yi Yang and Chen Zhang contributed equally to this work, and the order is determined alphabetically. Dawei Song is the corresponding author, who is also with The Open University, UK.
¹Code is available at https://github.com/GeneZC/StarK.
Figure 1: Performance and confidence on RTE (Bentivogli et al., 2009) of BERTbase and BERTlarge at small sparsity levels. Task metric and output distribution variance are used as the measures of performance and confidence, respectively. Distribution variance is comparatively equivalent to distribution negative entropy as employed in Pereyra et al. (2017). Proof of the equivalence can be found in Appendix A.
Conventional studies (Sun et al., 2019; Jiao et al., 2020) mainly expect that the expressive knowledge would be well transferred, yet they largely neglect the existence of student-unfriendly knowledge. Recent attempts (Zhou et al., 2022; Zhao et al., 2022) have been made to adapt the teacher toward more student-friendly knowledge and have yielded performance gains. Based on these observations, we posit that over-parameterized LMs, on the one hand, can produce expressive knowledge due to over-parameterization, but on the other hand can also produce student-unfriendly knowledge due to over-confidence (Hinton et al., 2015; Pereyra et al., 2017). From the pilot study shown in Figure 1, we find that LMs of large scale tend to have good performance and high confidence, and that both performance and confidence can be degraded by randomly sparsifying a small portion of parameters.² This indicates that some parameters resulting in student-unfriendliness can be removed, so as to improve the student-friendliness of the teacher without sacrificing too much of its expressiveness.

²https://pytorch.org/docs/stable/generated/torch.nn.utils.prune.random_unstructured
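For concreteness, the pilot study's random sparsification can be reproduced in spirit with PyTorch's pruning utility referenced in footnote 2; the checkpoint name, the 5% sparsity level, and the variance-based confidence measure below are illustrative assumptions rather than the exact pilot setup.

```python
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification

# Illustrative assumptions: checkpoint name and 5% sparsity are placeholders.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Randomly zero out a small portion of every linear weight matrix,
# mirroring the random sparsification in the pilot study.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.random_unstructured(module, name="weight", amount=0.05)
        prune.remove(module, "weight")  # make the pruning permanent

# Confidence proxy: variance of the output distribution (cf. Figure 1).
@torch.no_grad()
def confidence(logits: torch.Tensor) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    return probs.var(dim=-1).mean()
```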
Motivated by this finding, we propose a sparse teacher trick (in short, STARK) under the guidance of an overall knowledgeable score for each teacher parameter, which accords not only with the expressiveness but also with the student-friendliness of the parameter by interpolation. The aim is to retain the expressive parameters while removing the student-unfriendly ones. Specifically, we introduce a three-stage procedure consisting of 1) trial distillation, 2) parameter sparsification, and 3) actual distillation. The trial distillation distills the dense teacher to the student so that a trial student is obtained. The parameter sparsification first estimates the expressiveness score and student-friendliness score of each teacher parameter via feedback respectively from the teacher itself and the trial student, and then sparsifies the teacher by removing the parameters associated with adequately low interpolated knowledgeable scores. The actual distillation distills the sparsified teacher to the student so that an actual student is obtained, where the student is initialized in the same manner as that used in trial distillation, following the commonly-used rewinding technique (Frankle and Carbin, 2019).
We conduct an extensive set of experiments on the GLUE benchmark. Experimental results demonstrate that the sparse teachers can be dense with knowledge and lead to a remarkable performance of students compared with a series of competitive baselines.
2 Background
2.1 BERT Architecture
BERT (Devlin et al., 2019) is composed of several stacked encoder layers of transformers (Vaswani et al., 2017). There are two blocks in every encoder layer: a multi-head self-attention block (MHA) and a feed-forward network block (FFN), with a residual connection and a normalization layer around each.
Given an $l$-length sequence of $d$-dimensional input vectors $\mathbf{X} \in \mathbb{R}^{l \times d}$, the output of the MHA block with $A$ independent heads can be represented as:

$$\mathrm{MHA}(\mathbf{X}) = \sum_{i=1}^{A} \mathrm{Attn}(\mathbf{X}, \mathbf{W}_Q^{(i)}, \mathbf{W}_K^{(i)}, \mathbf{W}_V^{(i)})\,\mathbf{W}_O^{(i)},$$

where the $i$-th head is parameterized by $\mathbf{W}_Q^{(i)}, \mathbf{W}_K^{(i)}, \mathbf{W}_V^{(i)} \in \mathbb{R}^{d \times d_A}$ and $\mathbf{W}_O^{(i)} \in \mathbb{R}^{d_A \times d}$. On the other hand, the output of the FFN block is:

$$\mathrm{FFN}(\mathbf{X}) = \mathrm{GELU}(\mathbf{X}\mathbf{W}_1)\,\mathbf{W}_2,$$

where the two fully-connected layers are parameterized by $\mathbf{W}_1 \in \mathbb{R}^{d \times d_I}$ and $\mathbf{W}_2 \in \mathbb{R}^{d_I \times d}$, respectively.
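For readers who prefer code to notation, the following is a minimal PyTorch sketch of the two blocks as formulated above; it omits residual connections, layer normalization, attention masking, and bias terms, and the class and argument names are ours rather than those of any released implementation.

```python
import torch
import torch.nn as nn

class MHA(nn.Module):
    """Multi-head self-attention written as a sum of per-head terms,
    matching MHA(X) = sum_i Attn(X, W_Q^i, W_K^i, W_V^i) W_O^i."""
    def __init__(self, d: int, num_heads: int):
        super().__init__()
        self.d_head = d // num_heads
        self.W_Q = nn.ModuleList([nn.Linear(d, self.d_head, bias=False) for _ in range(num_heads)])
        self.W_K = nn.ModuleList([nn.Linear(d, self.d_head, bias=False) for _ in range(num_heads)])
        self.W_V = nn.ModuleList([nn.Linear(d, self.d_head, bias=False) for _ in range(num_heads)])
        self.W_O = nn.ModuleList([nn.Linear(self.d_head, d, bias=False) for _ in range(num_heads)])

    def forward(self, X: torch.Tensor) -> torch.Tensor:  # X: (l, d)
        out = 0.0
        for W_Q, W_K, W_V, W_O in zip(self.W_Q, self.W_K, self.W_V, self.W_O):
            scores = W_Q(X) @ W_K(X).transpose(-2, -1) / self.d_head ** 0.5
            out = out + W_O(torch.softmax(scores, dim=-1) @ W_V(X))
        return out

class FFN(nn.Module):
    """Feed-forward block FFN(X) = GELU(X W_1) W_2."""
    def __init__(self, d: int, d_inner: int):
        super().__init__()
        self.W_1 = nn.Linear(d, d_inner, bias=False)
        self.W_2 = nn.Linear(d_inner, d, bias=False)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        return self.W_2(torch.nn.functional.gelu(self.W_1(X)))
```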
2.2 Knowledge Distillation
Knowledge distillation (Hinton et al., 2015) aims to transfer the knowledge from a large-scale teacher to a smaller-scale student, and was originally proposed to supervise the student with the teacher logits. With its prevalence, a tremendous amount of work has investigated transferring various kinds of knowledge from the teacher to the student (Romero et al., 2015; Zagoruyko and Komodakis, 2017; Sun et al., 2019; Jiao et al., 2020; Park et al., 2021b; Li et al., 2020; Wang et al., 2020). PKD (Sun et al., 2019) introduces a patient distillation scheme where the student learns multiple intermediate layer representations and logits from the teacher. Moreover, attention distributions (Sun et al., 2020; Jiao et al., 2020; Li et al., 2020; Wang et al., 2020) and even high-order relations (Park et al., 2021b) are considered to further boost the performance.

Since a large capacity gap between the teacher and the student can lead to an inferior distillation quality, TAKD (Mirzadeh et al., 2020) proposes to insert teacher assistants of possible intermediate scales between the teacher and the student so that the gap is narrowed (Zhang et al., 2022). More recently, teachers with student-friendly architectures have clearly shown the significance of student-friendliness (Park et al., 2021a). MetaKD (Zhou et al., 2022) adopts meta-learning to optimize the student-friendliness of the teacher according to the student preference. DKD (Zhao et al., 2022) decouples and amplifies student-friendly knowledge in contrast to other kinds of knowledge. Distinguished from these student-friendly teachers that are achieved by altering teacher scales, architectures, parameters, or knowledge representations, our work, to the best of our knowledge, is the first one to suggest that teacher parameters can produce both student-friendly and student-unfriendly knowledge, and to aim at finding the sparse teacher with the best student-friendliness.
2.3 Model Pruning
Model pruning is imposed to remove the less expressive parameters for model compression. Previous work applies either structured (Li et al., 2017; Luo et al., 2017; He et al., 2017; Yang et al., 2022) or unstructured pruning (Han et al., 2015; Park et al., 2017; Louizos et al., 2018; Lee et al., 2019) to transformers. Unstructured pruning focuses on pruning individual parameters based on zero-order decisions derived from magnitudes (Gordon et al., 2020) or first-order decisions computed from both gradients and magnitudes (Sanh et al., 2020). In contrast, structured pruning prunes module-level parameters such as MHA heads (Michel et al., 2019) and FFN layers (Prasanna et al., 2020), guided by an expressive score (Michel et al., 2019). It is noteworthy that while some pruning methods leverage post-training pruning (Hou et al., 2020), others take advantage of training-time pruning (Xia et al., 2022). Although training-time pruning can result in slightly better performance, it consumes much more time to reach convergence. Our work mainly exploits structured pruning to obtain sparse teachers, yet also explores the use of unstructured pruning, in a post-training style.
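To make the contrast between the two families of decisions concrete, the sketch below computes a zero-order (magnitude) score and a first-order (gradient-times-magnitude) score for a toy linear layer; the layer, data, loss, and the 30% threshold are illustrative placeholders, not the exact criteria of the cited methods.

```python
import torch
import torch.nn as nn

# Illustrative placeholders: a toy layer, random data, and a squared-error loss.
layer = nn.Linear(16, 4)
x, target = torch.randn(8, 16), torch.randn(8, 4)
loss = ((layer(x) - target) ** 2).mean()
loss.backward()

W = layer.weight
# Zero-order decision: keep parameters with large magnitude.
magnitude_score = W.detach().abs()
# First-order decision: importance estimated from gradient and magnitude,
# in the spirit of sensitivity/movement-based criteria.
first_order_score = (W.detach() * W.grad).abs()

# Prune, e.g., the 30% of weights with the lowest chosen score.
threshold = first_order_score.flatten().kthvalue(int(0.3 * W.numel())).values
mask = (first_order_score > threshold).float()
```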
3 Sparse Teacher Trick
Our trick involves three stages in the student learning procedure, as shown in Figure 2. First, we distill a trial student from the dense teacher on a specific task (trial distillation). Then, we sparsify the parameters of the dense teacher that are associated with adequately low knowledgeable scores (parameter sparsification). Finally, rewinding is applied, where the student is reset to the initialization exactly used in the trial distillation stage and is learned from the sparse teacher (actual distillation).
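The three stages can be summarized by the following high-level sketch, where distill, knowledgeable_scores, and sparsify are hypothetical callables standing in for the procedures detailed in Sections 3.1 and 3.2 rather than parts of a released API.

```python
import copy

def sparse_teacher_trick(teacher, student_init, task_data,
                         distill, knowledgeable_scores, sparsify, sparsity=0.1):
    """High-level sketch of the three-stage procedure.

    `distill`, `knowledgeable_scores`, and `sparsify` are hypothetical
    helpers; `sparsity` is an illustrative placeholder value.
    """
    # Stage 1: trial distillation from the dense teacher.
    trial_student = distill(teacher, copy.deepcopy(student_init), task_data)

    # Stage 2: parameter sparsification guided by the knowledgeable score,
    # an interpolation of expressiveness and student-friendliness.
    scores = knowledgeable_scores(teacher, trial_student, task_data)
    sparse_teacher = sparsify(teacher, scores, sparsity)

    # Stage 3: actual distillation; the student is rewound to the same
    # initialization used in the trial stage (Frankle and Carbin, 2019).
    actual_student = distill(sparse_teacher, copy.deepcopy(student_init), task_data)
    return actual_student
```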
3.1 Trial and Actual Distillations
Trial distillation and actual distillation share the same distillation regime. We employ the widely-used logits distillation (Hinton et al., 2015) as the distillation objective, as depicted below:

$$\mathcal{L}_{\mathrm{KD}} = -\mathrm{softmax}(\mathbf{z}_t/\tau) \cdot \log \mathrm{softmax}(\mathbf{z}_s/\tau),$$

$$\mathcal{L}_{\mathrm{TK}} = -\mathbf{y} \cdot \log \mathbf{y}_s,$$

$$\mathcal{L} = \mathcal{L}_{\mathrm{KD}} + \alpha \cdot \mathcal{L}_{\mathrm{TK}},$$
where $\mathbf{z}_t$ and $\mathbf{z}_s$ separately stand for the logits of the teacher and the student, and $\mathbf{y}_s$ and $\mathbf{y}$ separately stand for the predicted normalized probabilities of the student and the ground-truth one-hot probabilities. The two subscripts $\mathrm{KD}$ and $\mathrm{TK}$ indicate the distillation and task losses, respectively. $\tau$ is a temperature controlling the smoothness of the logits (Hinton et al., 2015), and $\alpha$ is a term balancing the two losses.
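A minimal PyTorch rendering of this objective could look as follows; the default values of tau and alpha are illustrative placeholders rather than the hyperparameters used in our experiments.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      tau: float = 2.0, alpha: float = 1.0):
    """L = L_KD + alpha * L_TK, with temperature-smoothed soft targets.
    The values of tau and alpha here are illustrative placeholders."""
    # L_KD: cross-entropy between softened teacher and student distributions.
    teacher_probs = F.softmax(teacher_logits / tau, dim=-1)
    student_log_probs = F.log_softmax(student_logits / tau, dim=-1)
    loss_kd = -(teacher_probs * student_log_probs).sum(dim=-1).mean()

    # L_TK: standard task loss against the ground-truth (one-hot) labels.
    loss_tk = F.cross_entropy(student_logits, labels)

    return loss_kd + alpha * loss_tk
```

Both trial and actual distillation would use the same loss; only the teacher (dense or sparsified) changes between the two stages.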
The trial distillation and actual distillation also reuse the same initialization of the student for better convergence, which is known as the rewinding technique (Frankle and Carbin, 2019).
3.2 Parameter Sparsification
For parameter sparsification, we design a knowledgeable score, which is essentially an interpolation of the already-proposed expressive score (Molchanov et al., 2017) and our proposed student-friendly score, to measure the knowledgeableness of each teacher parameter. Thanks to the knowledgeable score, we can safely exclude student-unfriendly parameters without harming expressive parameters too much.

We mainly sparsify the attention heads of MHA blocks and the intermediate neurons of FFN blocks in the teacher. Following the literature on structured pruning in a post-training style (Michel et al., 2019; Hou et al., 2020), we attach a set of variables $\xi^{(i)}$ and $\boldsymbol{\nu}$ to the attention heads and the intermediate neurons respectively, to record the parameter sensitivities for a specific task through accumulated absolute gradients, as shown below:
$$\mathrm{MHA}(\mathbf{X}) = \sum_{i=1}^{A} \xi^{(i)}\,\mathrm{Attn}(\mathbf{X}, \mathbf{W}_Q^{(i)}, \mathbf{W}_K^{(i)}, \mathbf{W}_V^{(i)})\,\mathbf{W}_O^{(i)},$$

$$\mathrm{FFN}(\mathbf{X}) = \mathrm{GELU}(\mathbf{X}\mathbf{W}_1)\,\mathrm{diag}(\boldsymbol{\nu})\,\mathbf{W}_2,$$
where $\xi^{(i)} \in \mathbb{R}^{1}$ and $\boldsymbol{\nu} \in \mathbb{R}^{d_I}$. We set the values of $\xi^{(i)}$ and $\boldsymbol{\nu}$ to ones to ensure that the functionalities of the corresponding heads and neurons are retained.
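The recording mechanism can be sketched as follows: gates initialized to ones scale the per-head MHA terms and the FFN hidden activations, and their absolute gradients are accumulated over a few batches as sensitivities. The tensors and loss below are schematic placeholders; only the gating and accumulation mirror the formulation above.

```python
import torch

# Illustrative sizes; in practice these are the number of heads A and
# the FFN intermediate width d_I of the teacher.
A, d_I = 4, 16

# Gate variables, initialized to ones so the blocks behave as before.
xi = torch.ones(A, requires_grad=True)
nu = torch.ones(d_I, requires_grad=True)

xi_sens = torch.zeros(A)    # accumulated |grad| per attention head
nu_sens = torch.zeros(d_I)  # accumulated |grad| per intermediate neuron

# Schematic forward passes: head_outputs and ffn_hidden stand in for the
# per-head MHA terms and the GELU(X W_1) activations of the real model.
for _ in range(8):  # a few batches
    head_outputs = torch.randn(A, 8)      # placeholder per-head contributions
    ffn_hidden = torch.randn(8, d_I)      # placeholder FFN hidden states
    mha_out = (xi[:, None] * head_outputs).sum(dim=0)
    ffn_out = ffn_hidden * nu             # GELU(X W_1) diag(nu), before W_2
    loss = mha_out.sum() + ffn_out.sum()  # placeholder task/distillation loss
    g_xi, g_nu = torch.autograd.grad(loss, [xi, nu])
    xi_sens += g_xi.abs()
    nu_sens += g_nu.abs()
```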
The implementation is mathematically equivalent to the prevalent first-order Taylor expansion of the absolute variation between before and after removing a module (i.e., a head or a neuron), akin to Molchanov et al. (2017). Taking the $i$-th attention head as an example, its parameter sensitivity can