Asymmetric Temperature Scaling Makes Larger
Networks Teach Well Again
Xin-Chun Li1, Wen-Shu Fan1, Shaoming Song2, Yinchuan Li2
Bingshuai Li2, Yunfeng Shao2, De-Chuan Zhan1
1State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
2Huawei Noah’s Ark Lab, Beijing, China
{lixc, fanws}@lamda.nju.edu.cn, zhandc@nju.edu.cn
{shaoming.song, liyinchuan, libingshuai, shaoyunfeng}@huawei.com
Abstract
Knowledge Distillation (KD) aims at transferring the knowledge of a well-
performed neural network (the teacher) to a weaker one (the student). A peculiar
phenomenon is that a more accurate model doesn't necessarily teach better, nor can temperature adjustment alleviate the capacity mismatch. To explain this, we decompose the efficacy of KD into three parts: correct guidance, smooth regularization, and class discriminability. The last term describes the distinctness of the wrong class probabilities that the teacher provides in KD. Complex teachers tend to be over-confident, and traditional temperature scaling limits the efficacy of class discriminability, resulting in less discriminative wrong class probabilities. Therefore, we propose Asymmetric Temperature Scaling (ATS), which separately applies a higher/lower temperature to the correct/wrong class. ATS enlarges the variance of wrong class probabilities in the teacher's label and makes the students grasp the absolute affinities of the wrong classes to the target class as discriminatively as possible. Both theoretical analysis and extensive experimental results demonstrate the effectiveness of ATS. The demo developed in MindSpore is available at https://gitee.com/lxcnju/ats-mindspore and will be available at https://gitee.com/mindspore/models/tree/master/research/cv/ats.
1 Introduction
Although large-scale deep neural networks have achieved overwhelming successes in many real-world applications [22, 11, 60], the vast capacity hinders them from being deployed on portable devices with limited computation and storage resources [3]. Some efficient architectures, e.g., MobileNets [14, 37] and ShuffleNets [59, 29], have been proposed for lightweight deployment, while their performances are usually constrained. Fortunately, knowledge distillation (KD) [46, 13] could transfer the knowledge of a more complex and well-performed network (i.e., the teacher) to them.

The original KD [13] forces the student to mimic the teacher's behavior via minimizing the Kullback-Leibler (KL) divergence between their output probabilities. Recent studies generalize KD to various types of knowledge [36, 57, 17, 12, 33, 1, 34, 44, 52, 27, 54, 45, 50, 26, 23] or various distillation schemes [61, 2, 58, 20]. An intuitive sense after the proposal of KD [13] is that larger teachers could teach students better because their accuracies are higher. A recent work [6] first points out that the teacher accuracy is a poor predictor of the student's performance. That is, more accurate neural networks don't necessarily teach better. Until now, this phenomenon is still counter-intuitive [51], surprising [31], and unexplored [24]. Different from some existing empirical studies and theoretical analyses [40, 18, 30, 35, 55, 63, 6, 28, 15], we investigate the miraculous phenomenon in detail and aim to answer the following questions: What's the real reason that more complex teachers can't teach well? Is it really impossible to make larger teachers teach better through simple operations, such as temperature scaling?
Figure 1: Left: Decomposition of a teacher's label. The first class is the target. As temperature increases, correct guidance is weaker, smooth regularization is stronger, while class discriminability (measured by the variance of wrong class probabilities) will first increase and then decrease. Right: Larger/smaller teachers' logits are consistent in relative class affinities, i.e., logit values of the four wrong classes are in the same order of magnitude. However, larger teachers are over-confident and give a larger target logit or a smaller inherent variance, leading to a smaller derived variance under traditional temperature scaling, i.e., less distinct wrong class probabilities after softmax.
To answer the first question, we focus on analyzing the distinctness of wrong class probabilities
that a teacher provides in KD. We decompose the teacher’s label into three parts (see Sect. 4.1): (I)
Correct Guidance: the correct class’s probability; (II) Smooth Regularization: the average probability
of wrong classes; (III) Class Discriminability: the variance of wrong class probabilities (defined as
derived variance). The commonly utilized temperature scaling could control the efficacy of these
three terms (the left of Fig. 1). More complex teachers are over-confident and assign a larger score for
the correct class or less varied scores for the wrong classes. If we use a uniform temperature to scale
their logits, the class discriminability of the larger teacher is less effective (theoretically analyzed in
Sect. 4.2), i.e., the probabilities of wrong classes are less distinct (the right of Fig. 1).
As to the second question, we focus on enlarging the variance of wrong class probabilities (i.e.,
derived variance) that a teacher provides to make the distillation process more discriminative. To
specifically enhance the distinctness of wrong class probabilities, we separately apply a higher/lower
temperature to the correct/wrong class’s logit instead of a uniform temperature (see Sect. 4.3). We
name our method Asymmetric Temperature Scaling (ATS), and abundant experimental studies verify
that utilizing this simple operation could make larger teachers teach well again.
2 Related Works
KD with Larger Teacher: Although KD has been a general technique for knowledge transfer in various applications [13, 61, 42, 25], could any student learn from any teacher? [6] first studies KD's dependence on student and teacher architectures. They find that larger models do not often make better teachers and propose the early-stopped teacher as a solution. [31] introduces a multi-step KD process, employing an intermediate-sized network (the teacher assistant) to bridge the capacity gap. [51] formulates KD as a multi-task learning problem with several knowledge transfer losses; a transfer loss is utilized only when its gradient direction is consistent with that of the cross-entropy loss. [10, 24] define the knowledge gap as a residual, which is used to teach a residual student, and then take the ensemble of the student and the residual student for inference. These works attribute the worse teaching performance to capacity mismatch, i.e., weaker students can't completely mimic the excellent teachers. However, they don't explain this peculiar phenomenon in detail.
Understanding of KD: Quite a few works focus on understanding the advantages of KD from a principled perspective. [28] unifies KD and privileged information into generalized distillation. [35, 18] utilize gradient flow and the neural tangent kernel to analyze the convergence property of KD under deep linear networks and infinitely wide networks. [5] explains KD via quantifying the task-relevant and task-irrelevant visual concepts. [7] casts KD as a semiparametric inference problem and proposes corresponding enhancements. Our work is more related to KD decompositions. [9] treats the teacher's correct/wrong outputs differently, respectively explaining them as importance weighting and class similarities. [40] further decomposes the "dark knowledge" into universal knowledge, domain knowledge, and gradient rescaling. [30] establishes a bias-variance tradeoff to quantify the divergence of a teacher from the Bayes teacher. [63] utilizes bias-variance decomposition to analyze KD and discovers regularization samples that could increase bias and decrease variance. Our work is also related to label smoothing (LS). [55] points out that the regularization effect in KD is similar to LS. [32] finds that training a teacher with LS could degrade its teaching quality, and attributes this to the fact that LS erases relative information between teacher logits. Recently, [38] further studies this problem and proposes a metric to quantitatively measure the degree of erased information. Our work also decomposes KD into several effects to study why more complex teachers can't teach well. Detailed relatedness to these works is presented in Sect. 4.1.

Table 1: The notations used in this paper. The definitions of Derived Average, Derived Variance, and Inherent Variance are only for the wrong classes (Sect. 4.1 and Sect. 4.2).

  Quantity                              All Classes             Wrong Classes
  Logit                                 $f$                     $g = [f_c]_{c \neq y}$
  Probability                           $p = \mathrm{SF}(f)$    $q = [p_c]_{c \neq y}$
  Derived Average of Probabilities      -                       $e(q) = \sum_j q_j / (C-1)$
  Derived Variance of Probabilities     -                       $v(q) = \sum_j (q_j - e(q))^2 / (C-1)$
  Inherent Variance of Probabilities    -                       $\tilde{q} = \mathrm{SF}(g)$, $v(\tilde{q}) = \sum_j (\tilde{q}_j - e(\tilde{q}))^2 / (C-1)$
3 Background
We consider a $C$-class classification problem with $\mathcal{Y} = [C] = \{1, 2, \ldots, C\}$. Given a neural network and a sample pair $(x, y)$, we could obtain the "logits" as $f(x) \in \mathbb{R}^{C}$. We denote the softmax function with temperature $\tau$ as $\mathrm{SF}(\cdot\,; \tau)$, i.e., $p_c(\tau) = \exp(f_c(x)/\tau)/Z(\tau)$ and $Z(\tau) = \sum_{j=1}^{C}\exp(f_j(x)/\tau)$, where $p(\tau)$ is the softened probability vector that a network outputs and $c$ is the index of a class. Later, we may omit the dependence on $x$ and $\tau$ if there is no ambiguity. We use $f_y$ and $p_y$ to denote the correct class's logit and probability, while we use $g$ and $q$ to represent the vectors of wrong classes' logits and probabilities, i.e., $g = [f_c]_{c \neq y}$ and $q = [p_c]_{c \neq y}$. The notations can be found in Tab. 1.
The most standard KD [13] contains two stages of training. The first stage trains complex teachers, and then the second stage transfers the knowledge from teachers to a smaller student via minimizing the KL divergence between softened probabilities. Usually, the loss function during the second stage (i.e., the student's learning objective) is a combination of cross-entropy loss and distillation loss:

$$ \ell \;=\; \underbrace{-(1-\lambda)\log p^{S}_{y}(1)}_{\text{CE Loss}} \;\; \underbrace{-\,\lambda\tau^{2}\sum_{c=1}^{C} p^{T}_{c}(\tau)\log p^{S}_{c}(\tau)}_{\text{KD Loss}}, \tag{1} $$

where the superscript "T"/"S" denotes "Teacher"/"Student", respectively. Commonly, a default temperature of 1 is utilized for the CE loss, and the student could also take a temperature of 1 for the KD loss, e.g., $p^{S}_{c}(\tau = 1)$ [13, 31, 45, 44].
Suppose we have two teachers, denoted as $T_{\text{large}}$ and $T_{\text{small}}$, and the larger teacher performs better on both training and test data. If we use them to teach the same student $S$, we could find that the student's performance is worse when mimicking the larger teacher's outputs. Adjusting the temperature cannot make the larger teacher teach well either. The details of this phenomenon can be found in [6, 31] and Fig. 9. Obviously, $p^{T_{\text{large}}}$ could differ a lot from $p^{T_{\text{small}}}$, which is the only difference in the loss function when teaching the student. Hence, we focus on analyzing what probability distributions teachers with different capacities tend to provide.
4 Proposed Methods
This section first decomposes KD into three parts and defines several quantitative metrics. Then, we
present a theoretical analysis to demonstrate why larger networks can't teach well. Finally, we propose a more appropriate temperature scaling approach as an alternative.
4.1 KD Decomposition
We omit the coefficient $\lambda\tau^{2}$ in Eq. 1 and define $e_{q^{T}(\tau)} = \frac{1}{C-1}\sum_{j=1, j\neq y}^{C} p^{T}_{j}(\tau)$, where $q^{T}(\tau) = [p^{T}_{c}(\tau)]_{c \neq y}$. Then, we have the following decomposition:

$$ \ell_{\mathrm{kd}} \;=\; \underbrace{-\,p^{T}_{y}(\tau)\log p^{S}_{y}(\tau)}_{\text{Correct Guidance}} \;\; \underbrace{-\sum_{c\neq y} e_{q^{T}(\tau)}\log p^{S}_{c}(\tau)}_{\text{Smooth Regularization}} \;\; \underbrace{-\sum_{c\neq y}\bigl(p^{T}_{c}(\tau)-e_{q^{T}(\tau)}\bigr)\log p^{S}_{c}(\tau)}_{\text{Class Discriminability}}. \tag{2} $$
(I) Correct Guidance: this term guarantees correctness during teaching. The decomposition in [9] also contains this term, which is explained as importance weighting. This term works similarly to the cross-entropy loss and can be dealt with separately when applying temperature scaling.

(II) Smooth Regularization: some previous works [55, 62, 63] attribute the success of KD to the efficacy of regularization and study its relation to label smoothing (LS). The combination of this term with correct guidance works similarly to LS. Notably, $e_{q^{T}(\tau)}$ differs across samples, implying that the strength of smoothing is instance-specific, which is similar to the analysis in [62].

(III) Class Discriminability: this term tells the student the affinity of the wrong classes to the correct class. Transferring the knowledge of class similarities to students has been the mainstream guess of the "dark knowledge" in KD [13, 38]. Ideally, a good teacher should be as discriminating as possible in telling students which classes are more related to the correct class.

Illustrations of the decomposition are presented in the left of Fig. 1. Obviously, an appropriate temperature should simultaneously retain the efficacy of all three terms, e.g., the shaded row in Fig. 1. A too high or too low temperature could lead to smaller class discriminability, making the guidance less distinct among the wrong classes, which weakens the distillation performance in practice. Among these three terms, we advocate that class discriminability is more fundamental in KD and present more discussions in Appendix A (verified in Fig. 2 and Fig. 3).
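To see the decomposition concretely, here is a minimal single-sample sketch (PyTorch; the helper name `decompose_kd` is ours, not from the paper) that splits the softened cross-entropy of Eq. 2 into the three terms and checks that they sum back to the undecomposed loss.

```python
import torch
import torch.nn.functional as F

def decompose_kd(teacher_logits, student_logits, y, tau=4.0):
    """Split -sum_c p^T_c(tau) log p^S_c(tau) into the three terms of Eq. 2 (1-D logits)."""
    p_t = F.softmax(teacher_logits / tau, dim=0)        # teacher label p^T(tau)
    log_p_s = F.log_softmax(student_logits / tau, dim=0)
    wrong = torch.arange(p_t.numel()) != y
    e_q = p_t[wrong].mean()                              # derived average e(q^T)
    correct_guidance = -p_t[y] * log_p_s[y]
    smooth_regularization = -(e_q * log_p_s[wrong]).sum()
    class_discriminability = -((p_t[wrong] - e_q) * log_p_s[wrong]).sum()
    total = -(p_t * log_p_s).sum()
    assert torch.allclose(correct_guidance + smooth_regularization
                          + class_discriminability, total)
    return correct_guidance, smooth_regularization, class_discriminability
```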
To measure these three terms quantitatively, we use the target class probability (i.e., $p_y$), the average of wrong class probabilities (i.e., $e(q) = \frac{1}{C-1}\sum_{j\neq y} p_j$), and the variance of wrong class probabilities (i.e., $v(q) = \frac{1}{C-1}\sum_{j\neq y}(p_j - e(q))^2$) as estimators. $e(\cdot)$ and $v(\cdot)$ calculate the mean and variance of the elements in a vector. In some cases, we use the standard deviation as an estimator for the third term, i.e., $\sigma(q) = v^{1/2}(q)$. Because the latter two quantities are calculated after applying softmax to the complete logit vector, we define them as the Derived Average (DA) and the Derived Variance (DV), respectively. In experiments, we calculate these metrics for all training samples and sometimes report the average or standard deviation across these samples.
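These statistics can be computed directly from a logit vector; a minimal NumPy sketch under the paper's definitions (the function names are ours):

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()                                # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_statistics(logits, y, tau=4.0):
    """Derived Average, Derived Variance, and Inherent Variance of the wrong classes."""
    p = softmax(logits, tau)                       # softmax over all classes
    q = np.delete(p, y)                            # wrong-class probabilities
    da = q.mean()                                  # e(q)
    dv = np.mean((q - da) ** 2)                    # v(q)
    q_tilde = softmax(np.delete(logits, y), tau)   # softmax over wrong logits only
    iv = np.mean((q_tilde - q_tilde.mean()) ** 2)  # v(q_tilde)
    return da, dv, iv
```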
4.2 Theoretical Analysis
We analyze the mean and variance of the softened probability vector, i.e., the teacher's label $p^{T}(\tau)$ used in KD. We defer the proofs of Lemma 4.1 and Propositions 4.3 and 4.4 to Appendix B.

Lemma 4.1 (Variance of Softened Probabilities). Given a logit vector $f \in \mathbb{R}^{C}$ and the softened probability vector $p = \mathrm{SF}(f; \tau)$, $\tau \in (0, \infty)$, $v(p)$ monotonically decreases as $\tau$ increases.

As $\tau$ increases, $p(\tau)$ becomes more uniform, i.e., its entropy increases. However, we especially focus on the wrong classes, whose mean and variance are more intuitive to calculate and analyze.

Assumption 4.2. The target logit is higher than the other classes' logits, i.e., $f_y \geq f_c, \forall c \neq y$.

Assumption 4.2 is rational because well-performed teachers usually achieve a high accuracy (e.g., > 95%) on the training set, and most training samples meet this requirement.

Proposition 4.3. Under Assumption 4.2, $p_y$ monotonically decreases as $\tau$ increases, and $e(q)$ monotonically increases as $\tau$ increases. As $\tau \to \infty$, $e(q) \to 1/C$.

Proposition 4.3 implies that increasing the temperature could lead to a higher derived average (empirically, see Fig. 7) and strengthen the smooth regularization term in Eq. 2.
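A quick numerical check of these monotonicity claims (NumPy; the example logits are arbitrary):

```python
import numpy as np

logits = np.array([5.0, 1.0, 0.5, -0.5, -1.0])   # arbitrary example logits, target class y = 0
y = 0
for tau in [1.0, 2.0, 4.0, 8.0, 16.0]:
    z = logits / tau
    p = np.exp(z - z.max()); p /= p.sum()
    q = np.delete(p, y)
    # Lemma 4.1: v(p) shrinks; Proposition 4.3: p_y shrinks while e(q) grows toward 1/C = 0.2.
    print(f"tau={tau:5.1f}  v(p)={p.var():.4f}  p_y={p[y]:.3f}  e(q)={q.mean():.4f}")
```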
Before we analyze the class discriminability term, we define $\tilde{q}(\tau)$ as the result of applying softmax only to the wrong logits with temperature $\tau$, i.e., $\tilde{q}(\tau) = \mathrm{SF}(g; \tau)$. For the element index $c$ of $q$, we have

$$ \tilde{q}_{c}(\tau) \;=\; \frac{\exp(g_{c}/\tau)}{\sum_{j}\exp(g_{j}/\tau)}. \tag{3} $$

Notably, $\tilde{q}$ differs from $q$ a lot. Specifically, the former satisfies $\sum_{c}\tilde{q}_{c} = 1$, while the summation of the latter is $\sum_{c\neq y} p_c = 1 - p_y$. The former does not depend on the correct class's logit while the latter does. We name $v(\tilde{q})$ the Inherent Variance (IV) because it only depends on the wrong classes' logits.
Proposition 4.4 (Derived Variance vs. Inherent Variance). The derived variance is determined by the square of the derived average and the inherent variance via:

$$ \underbrace{v(q)}_{\text{DV}} \;=\; (C-1)^{2}\,\underbrace{e^{2}(q)}_{\text{DA}^{2}}\,\underbrace{v(\tilde{q})}_{\text{IV}}. \tag{4} $$

As $\tau$ increases, $e(q)$ increases (Proposition 4.3) while $v(\tilde{q})$ decreases (Lemma 4.1); hence, it is not so easy to judge the specific monotonicity of $v(q)$ w.r.t. $\tau$. Empirically, we observe that the derived variance first increases and then decreases (see Fig. 7), which conforms to the change of the class discriminability illustrated in Fig. 1.
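As a sanity check, the identity in Eq. 4 can be verified numerically for any logit vector; a short NumPy sketch with arbitrary example logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([7.0, 1.2, 0.4, -0.3, -1.1])    # arbitrary logits, target class 0
y, tau = 0, 4.0
C = logits.size

p = softmax(logits / tau)
q = np.delete(p, y)
e_q = q.mean()                                    # DA
v_q = np.mean((q - e_q) ** 2)                     # DV
q_tilde = softmax(np.delete(logits, y) / tau)
v_qt = np.mean((q_tilde - q_tilde.mean()) ** 2)   # IV

# Eq. 4: DV = (C-1)^2 * DA^2 * IV
assert np.isclose(v_q, (C - 1) ** 2 * e_q ** 2 * v_qt)
```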
We could use Proposition 4.4 to clearly analyze why larger teacher networks can't teach well. Before this, we present another two properties and a corollary without detailed proof.

Remark 4.5. Fixing $g$ and $\tau$, a higher target logit $f_y$ leads to a higher $p_y$, i.e., a smaller derived average $e(q)$.

Remark 4.6. Fixing $\tau$, less varied wrong logits $g$ lead to a less varied $\tilde{q}$, i.e., a smaller inherent variance $v(\tilde{q})$.
Corollary 4.7. Suppose we have two teachers $T_1$ and $T_2$, and their logit vectors for the same sample are $f^{T_1}$ and $f^{T_2}$.

• If $f^{T_1}_{y} \geq f^{T_2}_{y}$ while $g^{T_1}$ and $g^{T_2}$ are nearly the same, then $p^{T_1}_{y} \geq p^{T_2}_{y}$ (Remark 4.5) while $v(\tilde{q}^{T_1}) \approx v(\tilde{q}^{T_2})$. Hence, $v(q^{T_1}) \leq v(q^{T_2})$.

• If $f^{T_1}_{y} \approx f^{T_2}_{y}$ while $v(g^{T_1}) \leq v(g^{T_2})$, then $p^{T_1}_{y} \approx p^{T_2}_{y}$ while $v(\tilde{q}^{T_1}) \leq v(\tilde{q}^{T_2})$ (Remark 4.6). Hence, $v(q^{T_1}) \leq v(q^{T_2})$.
This corollary explains why a larger teacher can't teach better. Because the larger teacher tends to be over-confident, the target logit $f_y$ may be larger or the variance of wrong logits $v(g)$ may be smaller. These are illustrated in Fig. 1 and empirically verified in Fig. 4. Then the derived variance $v(q)$ may be smaller, limiting the efficacy of class discriminability in Eq. 2. Empirical results are in Fig. 7.
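To make the corollary concrete, the following NumPy sketch recomputes the derived variance $v(q)$ (per the definition in Sect. 4.1) for the example logits of Fig. 1 (right) at temperature 4.0; both over-confident variants of the larger teacher end up with a smaller derived variance than the smaller teacher.

```python
import numpy as np

def derived_variance(logits, y, tau):
    z = np.asarray(logits, dtype=float) / tau
    p = np.exp(z - z.max()); p /= p.sum()
    q = np.delete(p, y)
    return np.mean((q - q.mean()) ** 2)            # v(q) as defined in Sect. 4.1

tau, y = 4.0, 0
smaller  = [ 9.0, -0.6, -0.4, -0.2, -1.0]          # smaller teacher (Fig. 1, right)
larger_a = [12.0, -0.6, -0.4, -0.2, -1.0]          # larger teacher: larger target logit
larger_b = [ 9.0, -0.3, -0.2, -0.1, -0.5]          # larger teacher: less varied wrong logits

for name, f in [("smaller", smaller), ("larger_a", larger_a), ("larger_b", larger_b)]:
    print(name, f"v(q) = {derived_variance(f, y, tau):.2e}")
# Both over-confident variants yield a smaller v(q) than the smaller teacher.
```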
Notably, we focus on analyzing the variance of wrong class probabilities instead of all classes. Maximizing the variance of all classes' probabilities does not mean maximizing the variance of the wrong classes'. For example, although a very low temperature can maximize the variance of all classes' probabilities, the resulting teacher's label is nearly one-hot and shows no distinctness between the wrong classes. In other words, the effectiveness of KD should be more related to the distinctness between wrong classes rather than all classes. However, traditional temperature scaling applies a uniform temperature to all classes and thus cannot handle the wrong classes separately.
4.3 Asymmetric Temperature Scaling
We conclude from the above analysis: if a larger teacher makes an over-confident prediction, the wrong class probabilities it provides may not be discriminative enough. Utilizing a uniform temperature cannot enlarge the derived variance as much as possible due to the interference of the target class's logit (see the middle of Fig. 7). Thanks to the decomposition in Eq. 2, the correct guidance term works similarly to the cross-entropy loss and allows us to deal with it separately. Hence, we propose a novel temperature scaling approach:

$$ p_{c}(\tau_{1},\tau_{2}) \;=\; \frac{\exp(f_{c}/\tau_{c})}{\sum_{j\in[C]}\exp(f_{j}/\tau_{j})}, \qquad \tau_{i} = \mathbb{I}\{i=y\}\,\tau_{1} + \mathbb{I}\{i\neq y\}\,\tau_{2}, \;\; \forall i \in [C], \tag{5} $$
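A minimal sketch of Eq. 5 in code (PyTorch; the function name `ats_softmax` and the default temperatures are illustrative, with the higher temperature $\tau_1$ on the correct class and the lower $\tau_2$ on the wrong classes, as described in the abstract):

```python
import torch

def ats_softmax(teacher_logits, labels, tau1=6.0, tau2=3.0):
    """Asymmetric Temperature Scaling (Eq. 5): per-class temperatures before softmax.

    teacher_logits: (batch, C) logits; labels: (batch,) target class indices.
    The correct class's logit is divided by tau1, wrong classes' logits by tau2.
    """
    temps = torch.full_like(teacher_logits, tau2)
    temps.scatter_(1, labels.view(-1, 1), tau1)    # tau_i = tau1 if i == y else tau2
    return torch.softmax(teacher_logits / temps, dim=1)

# A plausible usage: replace the teacher's softened label p^T(tau) in the KD loss (Eq. 1)
# with ats_softmax(teacher_logits, labels).
```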