
randomly sparsifying a small portion of parameters (e.g., with PyTorch's torch.nn.utils.prune.random_unstructured, https://pytorch.org/docs/stable/generated/torch.nn.utils.prune.random_unstructured). This indicates that the parameters responsible for student-unfriendliness can instead be removed to improve the student-friendliness of the teacher without sacrificing too much of its expressiveness.
Motivated by this finding, we propose a sparse teacher trick (STARK for short) guided by an overall knowledgeable score for each teacher parameter, which reflects, via interpolation, not only the expressiveness but also the student-friendliness of the parameter. The aim is to retain the expressive parameters while removing the student-unfriendly ones. Specifically, we introduce a three-stage procedure consisting of 1) trial distillation, 2) parameter sparsification, and 3) actual distillation. The trial distillation distills the dense teacher into the student to obtain a trial student. The parameter sparsification first estimates the expressiveness score and the student-friendliness score of each teacher parameter via feedback from the teacher itself and the trial student, respectively, and then sparsifies the teacher by removing the parameters with sufficiently low interpolated knowledgeable scores. The actual distillation distills the sparsified teacher into the student to obtain the actual student, where the student is initialized in the same manner as in the trial distillation, following the commonly used rewinding technique (Frankle and Carbin, 2019).
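To make the sparsification step concrete, the sketch below illustrates how an interpolated knowledgeable score could drive the pruning decision. It is a minimal sketch under our own assumptions: the per-parameter expressiveness and student-friendliness scores are taken as given (in our procedure they come from the teacher itself and the trial student, respectively), and the interpolation weight lam, the sparsity ratio, and the function name are illustrative rather than the exact STARK implementation.

```python
import torch

def sparsify_teacher_param(param: torch.Tensor,
                           expressiveness: torch.Tensor,
                           friendliness: torch.Tensor,
                           lam: float = 0.5,
                           sparsity: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch: zero out the teacher parameters whose
    interpolated knowledgeable scores are the lowest."""
    # Overall knowledgeable score: interpolation of expressiveness
    # (teacher feedback) and student-friendliness (trial-student feedback).
    knowledgeable = lam * expressiveness + (1.0 - lam) * friendliness

    # Remove the `sparsity` fraction of parameters with the lowest scores.
    k = int(sparsity * knowledgeable.numel())
    if k == 0:
        return param
    threshold = torch.kthvalue(knowledgeable.flatten(), k).values
    mask = (knowledgeable > threshold).to(param.dtype)
    return param * mask
```

The sparsified teacher would then be used in the actual distillation stage, with the student rewound to its trial-distillation initialization.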
We conduct an extensive set of experiments on the GLUE benchmark. Experimental results demonstrate that sparse teachers can be dense with knowledge and lead to remarkable student performance compared with a series of competitive baselines.
2 Background
2.1 BERT Architecture
BERT (Devlin et al., 2019) is composed of several stacked transformer encoder layers (Vaswani et al., 2017). Each encoder layer contains two blocks: a multi-head self-attention (MHA) block and a feed-forward network (FFN) block, each wrapped with a residual connection and a layer normalization.
Given an $l$-length sequence of $d$-dimensional input vectors $X \in \mathbb{R}^{l \times d}$, the output of the MHA block with $A$ independent heads can be represented as:
$$\mathrm{MHA}(X) = \sum_{i=1}^{A} \mathrm{Attn}(X, W^{(i)}_Q, W^{(i)}_K, W^{(i)}_V)\, W^{(i)}_O,$$
where the $i$-th head is parameterized by $W^{(i)}_Q, W^{(i)}_K, W^{(i)}_V \in \mathbb{R}^{d \times d_A}$ and $W^{(i)}_O \in \mathbb{R}^{d_A \times d}$. On the other hand, the output of the FFN block is:
$$\mathrm{FFN}(X) = \mathrm{GELU}(X W_1) W_2,$$
where the two fully-connected layers are parameterized by $W_1 \in \mathbb{R}^{d \times d_I}$ and $W_2 \in \mathbb{R}^{d_I \times d}$, respectively.
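To make the notation concrete, here is a minimal PyTorch sketch of the two blocks as written above, with per-head output projections summed as in the MHA equation; biases and the residual/normalization wrappers are omitted, and the scaled dot-product form of Attn follows Vaswani et al. (2017). The module and variable names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHA(nn.Module):
    """Sum over A heads of Attn(X, W_Q, W_K, W_V) W_O, as in the equation above."""
    def __init__(self, d: int, A: int):
        super().__init__()
        d_A = d // A  # per-head dimension d_A
        self.heads = nn.ModuleList([
            nn.ModuleDict({
                "W_Q": nn.Linear(d, d_A, bias=False),
                "W_K": nn.Linear(d, d_A, bias=False),
                "W_V": nn.Linear(d, d_A, bias=False),
                "W_O": nn.Linear(d_A, d, bias=False),
            })
            for _ in range(A)
        ])

    def forward(self, X: torch.Tensor) -> torch.Tensor:  # X: (l, d)
        out = torch.zeros_like(X)
        for h in self.heads:
            Q, K, V = h["W_Q"](X), h["W_K"](X), h["W_V"](X)
            # Standard scaled dot-product attention over the sequence.
            attn = F.softmax(Q @ K.transpose(-2, -1) / K.size(-1) ** 0.5, dim=-1)
            out = out + h["W_O"](attn @ V)
        return out

class FFN(nn.Module):
    """FFN(X) = GELU(X W_1) W_2."""
    def __init__(self, d: int, d_I: int):
        super().__init__()
        self.W_1 = nn.Linear(d, d_I, bias=False)
        self.W_2 = nn.Linear(d_I, d, bias=False)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        return self.W_2(F.gelu(self.W_1(X)))
```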
2.2 Knowledge Distillation
Knowledge distillation (Hinton et al., 2015) aims to transfer knowledge from a large-scale teacher to a smaller-scale student, and was originally proposed to supervise the student with the teacher logits. Given its prevalence, a tremendous amount of work has investigated transferring various kinds of knowledge from the teacher to the student (Romero et al., 2015; Zagoruyko and Komodakis, 2017; Sun et al., 2019; Jiao et al., 2020; Park et al., 2021b; Li et al., 2020; Wang et al., 2020). PKD (Sun et al., 2019) introduces a patient distillation scheme where the student learns multiple intermediate-layer representations and logits from the teacher. Moreover, attention distributions (Sun et al., 2020; Jiao et al., 2020; Li et al., 2020; Wang et al., 2020) and even high-order relations (Park et al., 2021b) are considered to further boost the performance.
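For reference, the logit-based objective of Hinton et al. (2015) matches temperature-softened output distributions; a minimal sketch is shown below, where the temperature T is illustrative, the T**2 rescaling is the commonly used correction discussed in that work, and the function name is ours.

```python
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence between temperature-softened teacher and student
    output distributions (logit-matching objective)."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)
```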
Since a large capacity gap between the teacher and the student can lead to inferior distillation quality, TAKD (Mirzadeh et al., 2020) proposes to insert teacher assistants of intermediate scales between the teacher and the student so that the gap is narrowed (Zhang et al., 2022). More recently, teachers with student-friendly architectures have demonstrated the significance of student-friendliness (Park et al., 2021a). MetaKD (Zhou et al., 2022) adopts meta-learning to optimize the student-friendliness of the teacher according to the student preference. DKD (Zhao et al., 2022) decouples the student-friendly knowledge from the rest and amplifies it. Distinguished from these student-friendly teachers, which are achieved by altering teacher scales, architectures, parameters, or knowledge representations, our work is, to the best of our knowledge, the first to suggest that teacher parameters can produce both