
4 Proposed Methods
This section first decomposes KD into three parts and defines several quantitative metrics. Then, we present a theoretical analysis to explain why larger networks cannot teach well. Finally, we propose a more appropriate temperature scaling approach as an alternative.
4.1 KD Decomposition
We omit the coefficient $\lambda\tau^2$ in Eq. 1, and define $e_{q^T(\tau)} = \frac{1}{C-1}\sum_{j=1, j\neq y}^{C} p_j^T(\tau)$, where $q^T(\tau) = [p_c^T(\tau)]_{c\neq y}$. Then, we have the following decomposition:
$$\ell_{kd} = \underbrace{-\,p_y^T(\tau)\log p_y^S(\tau)}_{\text{Correct Guidance}} \; \underbrace{-\sum_{c\neq y} e_{q^T(\tau)} \log p_c^S(\tau)}_{\text{Smooth Regularization}} \; \underbrace{-\sum_{c\neq y} \bigl(p_c^T(\tau) - e_{q^T(\tau)}\bigr) \log p_c^S(\tau)}_{\text{Class Discriminability}}. \tag{2}$$
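The decomposition can be verified numerically: the three terms must sum back to the full (rescaled) KD cross-entropy $-\sum_c p_c^T(\tau)\log p_c^S(\tau)$. Below is a minimal NumPy sketch for a single sample; the helper names `softmax` and `kd_decomposition` are ours, not from the paper.

```python
import numpy as np

def softmax(logits, tau):
    """Temperature-scaled softmax SF(f; tau)."""
    z = logits / tau
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_decomposition(f_t, f_s, y, tau):
    """Split the (tau^2-rescaled) KD loss into its three components.

    f_t, f_s: teacher / student logit vectors of length C
    y: index of the target class
    Returns (correct_guidance, smooth_regularization, class_discriminability).
    """
    p_t, p_s = softmax(f_t, tau), softmax(f_s, tau)
    wrong = np.arange(len(f_t)) != y
    e_q = p_t[wrong].mean()  # average wrong-class teacher probability

    correct_guidance = -p_t[y] * np.log(p_s[y])
    smooth_regularization = -np.sum(e_q * np.log(p_s[wrong]))
    class_discriminability = -np.sum((p_t[wrong] - e_q) * np.log(p_s[wrong]))
    return correct_guidance, smooth_regularization, class_discriminability
```

Since $\sum_{c\neq y} p_c^T(\tau) = \sum_{c\neq y}\bigl(e_{q^T(\tau)} + (p_c^T(\tau) - e_{q^T(\tau)})\bigr)$, the three returned values add up exactly to the full teacher–student cross-entropy.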
(I) Correct Guidance: this term guarantees correctness during teaching. The decomposition in [9] also contains this term, where it is explained as importance weighting. This term works similarly to the cross-entropy loss and can be handled separately when applying temperature scaling.
(II) Smooth Regularization: some previous works [55, 62, 63] attribute the success of KD to the efficacy of regularization and study its relation to label smoothing (LS). The combination of this term with correct guidance works similarly to LS. Notably, $e_{q^T(\tau)}$ differs across samples, implying that the strength of smoothing is instance-specific, which is similar to the analysis in [62].
(III) Class Discriminability: this term tells the student the affinity of wrong classes to the correct class. Transferring the knowledge of class similarities to students has been the mainstream hypothesis about the "dark knowledge" in KD [13, 38]. Ideally, a good teacher should be as discriminating as possible in telling students which classes are more related to the correct class.
Illustrations of the decomposition are presented in the left of Fig. 1. Clearly, an appropriate temperature should simultaneously preserve the efficacy of all three terms, e.g., the shaded row in Fig. 1. A temperature that is too high or too low reduces class discriminability, making the guidance less distinguishable among wrong classes, which weakens the distillation performance in practice. Among these three terms, we advocate that class discriminability is the most fundamental in KD and present more discussion in Appendix A (verified in Fig. 2 and Fig. 3).
To measure these three terms quantitatively, we use the target class probability (i.e., $p_y$), the average of wrong-class probabilities (i.e., $e(q) = \frac{1}{C-1}\sum_{j\neq y} p_j$), and the variance of wrong-class probabilities (i.e., $v(q) = \frac{1}{C-1}\sum_{j\neq y}\bigl(p_j - e(q)\bigr)^2$) as estimators. $e(\cdot)$ and $v(\cdot)$ calculate the mean and variance of the elements in a vector. In some cases, we use the standard deviation as an estimator for the third term, i.e., $\sigma(q) = v^{1/2}(q)$. Because the latter two metrics are calculated after applying softmax to the complete logit vector, we define them as Derived Average (DA) and Derived Variance (DV), respectively. In experiments, we calculate these metrics for all training samples and sometimes report the average or standard deviation across these samples.
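As a concrete illustration of these estimators, the sketch below computes $p_y$, DA, DV, and $\sigma(q)$ for one softened probability vector (NumPy assumed; the helper name `kd_metrics` is ours):

```python
import numpy as np

def kd_metrics(p, y):
    """Return (p_y, DA, DV, sigma) for a probability vector p and target y.

    DA = e(q): mean of the wrong-class probabilities
    DV = v(q): variance of the wrong-class probabilities
    """
    q = np.delete(p, y)            # wrong-class probabilities
    e_q = q.mean()                 # Derived Average
    v_q = np.mean((q - e_q) ** 2)  # Derived Variance
    return p[y], e_q, v_q, np.sqrt(v_q)
```

Note that because $p$ sums to one, DA is fully determined by $p_y$ via $e(q) = (1 - p_y)/(C-1)$; DV, by contrast, genuinely reflects how discriminative the wrong-class probabilities are.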
4.2 Theoretical Analysis
We analyze the mean and variance of the softened probability vector, i.e., the teacher’s label
pT(τ)
used in KD. We defer the proofs of Lemma 4.1 and Proposition 4.3, 4.4 to Appendix B.
Lemma 4.1 (Variance of Softened Probabilities). Given a logit vector $f \in \mathbb{R}^C$ and the softened probability vector $p = \mathrm{SF}(f; \tau)$, $\tau \in (0, \infty)$, $v(p)$ monotonically decreases as $\tau$ increases.
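Lemma 4.1 is easy to sanity-check numerically: sweeping $\tau$ upward for a fixed logit vector should yield a strictly decreasing variance of the softened probabilities. A small NumPy sketch, using an arbitrary logit vector of our own choosing:

```python
import numpy as np

def softmax(f, tau):
    """Temperature-scaled softmax SF(f; tau)."""
    z = f / tau
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

f = np.array([2.0, -1.0, 0.5, 3.0, 0.0])  # illustrative fixed logits
taus = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
# v(p) should strictly decrease as tau increases (Lemma 4.1)
variances = [np.var(softmax(f, t)) for t in taus]
```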
As $\tau$ increases, $p(\tau)$ becomes more uniform, i.e., its entropy increases. However, we especially focus on the wrong classes, where the mean and variance are more intuitive to calculate and analyze.
Assumption 4.2. The target logit is higher than the other classes' logits, i.e., $f_y \geq f_c, \forall c \neq y$.
Assumption 4.2 is rational because well-trained teachers usually achieve high accuracy (e.g., >95%) on the training set, and most training samples meet this requirement.
Proposition 4.3. Under Assumption 4.2, $p_y$ monotonically decreases as $\tau$ increases, and $e(q)$ monotonically increases as $\tau$ increases. As $\tau \to \infty$, $e(q) \to \frac{1}{C}$.
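Proposition 4.3 can likewise be checked numerically: for logits satisfying Assumption 4.2, $p_y$ falls and $e(q)$ rises toward $1/C$ as $\tau$ grows. A sketch with illustrative logits of our own choosing:

```python
import numpy as np

def softmax(f, tau):
    """Temperature-scaled softmax SF(f; tau)."""
    z = f / tau
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

f = np.array([4.0, 2.0, 1.0, 0.5, -1.0])  # f_y is the largest (Assumption 4.2)
y, C = 0, len(f)

taus = [0.5, 1.0, 2.0, 4.0, 16.0, 256.0]
p_y = [softmax(f, t)[y] for t in taus]
e_q = [np.delete(softmax(f, t), y).mean() for t in taus]
```

With a very large $\tau$ (here 256), the softened probabilities are nearly uniform, so $e(q)$ sits close to $1/C = 0.2$.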