effective KD one should not only focus on improving the teacher performance. Instead,
rethinking what the “dark knowledge” is and how to enrich it is highly valuable.
• We demonstrate the connection between our observations and the IB theory, providing a new perspective for understanding KD and explaining the “dark knowledge”.
• Based on our observations and analyses, we propose a simple yet effective algorithm to find the optimal intermediate teacher and achieve better distillation performance. Experiments validate its effectiveness and adaptability.
2 Related Work
Knowledge distillation.
Hinton et al. [2] proposed to transfer the “dark knowledge” of a high-capacity teacher model to a compact student model by minimizing the Kullback-Leibler divergence between the soft targets of the two models. Since then, many variants of KD have been proposed to improve the distillation performance [16], such as FitNets [17], AT [18], CCM [19], FSP [20], SP [21], CCKD [22], and TC3KD [23]. As a promising technique to improve model generalization, ensemble learning is often combined with knowledge distillation to improve the distillation performance [3, 11, 4].
In the online knowledge distillation framework [24], efforts were made to boost the distillation performance by increasing the diversity among multiple models, thereby improving the ensemble performance [6, 7]. Most existing methods commonly assume that a high-performing teacher is preferred for KD.
On the contrary, some researchers argued that the model capacity gap between strong teachers and small students usually degrades knowledge transfer [8, 9, 12, 25]. Some of them experimentally verified that poor teachers can also perform KD tasks well [12, 25]. Several methods have been proposed to narrow this gap by introducing an assistant network [8, 9] or designing a student-friendly teacher [10, 11]. However, they did not explain theoretically why the gap exists or how it affects KD. In the self-distillation framework [26, 27], it is essentially an intermediate model that is used as the teacher, but there is no theoretical explanation of why the intermediate model works. In this paper, we link KD and IB theory through extensive observations and experiments. From the perspective of mutual information, we explain why intermediate models serve as better teachers than full models, and how to select a suitable intermediate model to reduce the negative impact of the model gap.
Information bottleneck.
Tishby et al. [28] first proposed the information bottleneck concept and provided a tabular method to numerically solve the IB Lagrangian (Eq. (3)). Later, Tishby and Zaslavsky [14] proposed to interpret deep learning with the IB principle. Following this idea, Shwartz-Ziv and Tishby [29] studied the IB principle to explain the training dynamics of deep networks. This has motivated many studies that apply the IB principle to interpret and improve deep neural networks (DNNs) [30, 31, 32]. Recently, some researchers successfully introduced the IB principle to deep reinforcement learning [33, 34, 35]. To the best of our knowledge, we are the first to introduce the IB principle to interpret knowledge distillation.
3 Exploratory Experiments
In this section, we first formally describe the KD and ensemble KD methods used in the paper, then design two exploratory experiments to show how intermediate models are surprisingly valuable for KD, despite their lower accuracies due to incomplete training.
3.1 Formulation
In the classical KD setting, a fully converged teacher model (full teacher for short) $T^{full}$ is used to distill a student model $S$. Define $P_{T^{full}}$ as the softmax output of the teacher model, $P_S$ as the softmax output of the student model, and $Y_{true}$ as the true labels. The student model is trained to optimize the following loss function:
$$L_{KD} = \alpha H(Y_{true}, P_S) + (1-\alpha) H(P^{\tau}_{T^{full}}, P^{\tau}_{S}), \qquad (1)$$
where $H$ refers to the cross-entropy, $\alpha$ is the trade-off parameter, and $\tau$ is the temperature. Conducting KD with an intermediate teacher model $T^{inter}$ means using $T^{inter}$ instead of $T^{full}$ in Eq. (1).
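To make the formulation concrete, the snippet below sketches how Eq. (1) could be computed in PyTorch; `teacher_logits` may come from either the full teacher $T^{full}$ or an intermediate teacher $T^{inter}$. The framework choice, the function name `kd_loss`, and the default values of $\alpha$ and $\tau$ are illustrative assumptions rather than details taken from the paper.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, true_labels, alpha=0.5, tau=4.0):
    """Sketch of Eq. (1): alpha * H(Y_true, P_S) + (1 - alpha) * H(P_T^tau, P_S^tau)."""
    # Hard-label term H(Y_true, P_S): standard cross-entropy with the ground truth.
    hard_term = F.cross_entropy(student_logits, true_labels)

    # Soft-label term H(P_T^tau, P_S^tau): cross-entropy between the
    # temperature-softened teacher and student distributions.
    # (In practice this term is often scaled by tau**2; Eq. (1) does not show that factor.)
    teacher_probs = F.softmax(teacher_logits / tau, dim=1)
    student_log_probs = F.log_softmax(student_logits / tau, dim=1)
    soft_term = -(teacher_probs * student_log_probs).sum(dim=1).mean()

    return alpha * hard_term + (1.0 - alpha) * soft_term
```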
In the standard ensemble KD setting, there are $M$ ($M \geq 2$) full teachers $T^{full}_1, T^{full}_2, \ldots, T^{full}_M$, which have the same network structure and training strategy but different initial parameters. The student