Asymmetric Temperature Scaling Makes Larger
Networks Teach Well Again
Xin-Chun Li1, Wen-Shu Fan1, Shaoming Song2, Yinchuan Li2
Bingshuai Li2, Yunfeng Shao2, De-Chuan Zhan1
1State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
2Huawei Noah’s Ark Lab, Beijing, China
{lixc, fanws}@lamda.nju.edu.cn, zhandc@nju.edu.cn
{shaoming.song, liyinchuan, libingshuai, shaoyunfeng}@huawei.com
Abstract
Knowledge Distillation (KD) aims at transferring the knowledge of a well-
performed neural network (the teacher) to a weaker one (the student). A peculiar
phenomenon is that a more accurate model doesn't necessarily teach better, nor can temperature adjustment alleviate the capacity mismatch. To explain this, we decompose the efficacy of KD into three parts: correct guidance, smooth regularization, and class discriminability. The last term describes the distinctness of the wrong class probabilities that the teacher provides in KD. Complex teachers tend to be over-confident, and traditional temperature scaling limits the efficacy of class discriminability, resulting in less discriminative wrong class probabilities. Therefore, we propose Asymmetric Temperature Scaling (ATS), which separately applies a higher/lower temperature to the correct/wrong class. ATS enlarges the variance of wrong class probabilities in the teacher's label and makes the students grasp the absolute affinities of the wrong classes to the target class as discriminatively as possible. Both theoretical analysis and extensive experimental results demonstrate the effectiveness of ATS. The demo developed in MindSpore is available at https://gitee.com/lxcnju/ats-mindspore and will be available at https://gitee.com/mindspore/models/tree/master/research/cv/ats.
1 Introduction
Although large-scale deep neural networks have achieved overwhelming successes in many real-world applications [22, 11, 60], the vast capacity hinders them from being deployed on portable devices with limited computation and storage resources [3]. Some efficient architectures, e.g., MobileNets [14, 37] and ShuffleNets [59, 29], have been proposed for lightweight deployment, while their performances are usually constrained. Fortunately, knowledge distillation (KD) [46, 13] could transfer the knowledge of a more complex and well-performed network (i.e., the teacher) to them.

The original KD [13] forces the student to mimic the teacher's behavior via minimizing the Kullback-Leibler (KL) divergence between their output probabilities. Recent studies generalize KD to various types of knowledge [36, 57, 17, 12, 33, 1, 34, 44, 52, 27, 54, 45, 50, 26, 23] or various distillation schemes [61, 2, 58, 20]. An intuitive sense after the proposal of KD [13] is that larger teachers could teach students better because their accuracies are higher. A recent work [6] first points out that the teacher accuracy is a poor predictor of the student's performance. That is, more accurate neural networks don't necessarily teach better. Until now, this phenomenon is still counter-intuitive [51], surprising [31], and unexplored [24]. Different from some existing empirical studies and theoretical analyses [40, 18, 30, 35, 55, 63, 6, 28, 15], we investigate the miraculous phenomenon in detail and aim to answer the following questions: What's the real reason that more complex teachers can't teach well? Is it really impossible to make larger teachers teach better through simple operations, such as temperature scaling?
Figure 1: Left: Decomposition of a teacher's label. The first class is the target. As temperature increases, correct guidance is weaker, smooth regularization is stronger, while class discriminability (measured by the variance of wrong class probabilities) will first increase and then decrease. Right: Larger/smaller teachers' logits are consistent in relative class affinities, i.e., logit values of the four wrong classes are in the same order of magnitude. However, larger teachers are over-confident and give a larger target logit or a smaller inherent variance, leading to a smaller derived variance under traditional temperature scaling, i.e., less distinct wrong class probabilities after softmax.
To answer the first question, we focus on analyzing the distinctness of wrong class probabilities
that a teacher provides in KD. We decompose the teacher’s label into three parts (see Sect. 4.1): (I)
Correct Guidance: the correct class’s probability; (II) Smooth Regularization: the average probability
of wrong classes; (III) Class Discriminability: the variance of wrong class probabilities (defined as
derived variance). The commonly utilized temperature scaling could control the efficacy of these
three terms (the left of Fig. 1). More complex teachers are over-confident and assign a larger score for
the correct class or less varied scores for the wrong classes. If we use a uniform temperature to scale
their logits, the class discriminability of the larger teacher is less effective (theoretically analyzed in
Sect. 4.2), i.e., the probabilities of wrong classes are less distinct (the right of Fig. 1).
As to the second question, we focus on enlarging the variance of wrong class probabilities (i.e.,
derived variance) that a teacher provides to make the distillation process more discriminative. To
specifically enhance the distinctness of wrong class probabilities, we separately apply a higher/lower
temperature to the correct/wrong class’s logit instead of a uniform temperature (see Sect. 4.3). We
name our method Asymmetric Temperature Scaling (ATS), and abundant experimental studies verify
that utilizing this simple operation could make larger teachers teach well again.
2 Related Works
KD with Larger Teacher: Although KD has been a general technique for knowledge transfer in various applications [13, 61, 42, 25], could any student learn from any teacher? [6] first studies KD's dependence on student and teacher architectures. They find that larger models do not often make better teachers and propose the early-stopped teacher as a solution. [31] introduces a multi-step KD process, employing an intermediate-sized network (the teacher assistant) to bridge the capacity gap. [51] formulates KD as a multi-task learning problem with several knowledge transfer losses; a transfer loss is utilized only when its gradient direction is consistent with that of the cross-entropy loss. [10, 24] define the knowledge gap as a residual, which is used to teach a residual student, and then take the ensemble of the student and the residual student for inference. These works attribute the worse teaching performance to capacity mismatch, i.e., weaker students can't completely mimic the excellent teachers. However, they don't explain this peculiar phenomenon in detail.
Understanding of KD: Quite a few works focus on understanding the advantages of KD from a principled perspective. [28] unifies KD and privileged information into generalized distillation. [35, 18] utilize gradient flow and the neural tangent kernel to analyze the convergence property of KD under deep linear networks and infinitely wide networks. [5] explains KD via quantifying the task-relevant and task-irrelevant visual concepts. [7] casts KD as a semiparametric inference problem and proposes corresponding enhancements. Our work is more related to KD decompositions. [9] treats the teacher's correct/wrong outputs differently, respectively explaining them as importance weighting and class similarities. [40] further decomposes the "dark knowledge" into universal knowledge, domain knowledge, and gradient rescaling. [30] establishes a bias-variance tradeoff to quantify the divergence of a teacher from the Bayes teacher. [63] utilizes bias-variance decomposition to analyze KD and discovers regularization samples that could increase bias and decrease variance. Our work is also related to label smoothing (LS). [55] points out that the regularization effect in KD is similar to LS. [32] finds that training a teacher with LS could degrade its teaching quality, and attributes this to the fact that LS erases relative information between teacher logits. Recently, [38] further studies this problem and proposes a metric to quantitatively measure the degree of erased information. Our work also decomposes KD into several effects to study why more complex teachers can't teach well. Detailed relatedness to these works is presented in Sect. 4.1.

Table 1: The notations used in this paper. The definitions of Derived Average, Derived Variance, and Inherent Variance are only for the wrong classes (Sect. 4.1 and Sect. 4.2).

  Quantity                              All Classes             Wrong Classes
  Logit                                 $f$                     $g = [f_c]_{c \neq y}$
  Probability                           $p = \mathrm{SF}(f)$    $q = [p_c]_{c \neq y}$
  Derived Average of Probabilities      -                       $e(q) = \sum_j q_j / (C-1)$
  Derived Variance of Probabilities     -                       $v(q) = \sum_j (q_j - e(q))^2 / (C-1)$
  Inherent Variance of Probabilities    -                       $\tilde{q} = \mathrm{SF}(g)$, $v(\tilde{q}) = \sum_j (\tilde{q}_j - e(\tilde{q}))^2 / (C-1)$
3 Background
We consider a $C$-class classification problem with $\mathcal{Y} = [C] = \{1, 2, \ldots, C\}$. Given a neural network and a sample pair $(x, y)$, we could obtain the "logits" as $f(x) \in \mathbb{R}^{C}$. We denote the softmax function with temperature $\tau$ as $\mathrm{SF}(\cdot\,; \tau)$, i.e., $p_c(\tau) = \exp(f_c(x)/\tau)/Z(\tau)$ and $Z(\tau) = \sum_{j=1}^{C}\exp(f_j(x)/\tau)$, where $p(\tau)$ is the softened probability vector that a network outputs and $c$ is the index of a class. Later, we may omit the dependence on $x$ and $\tau$ if there is no ambiguity. We use $f_y$ and $p_y$ to denote the correct class's logit and probability, while we use $g$ and $q$ to represent the vectors of wrong classes' logits and probabilities, i.e., $g = [f_c]_{c \neq y}$ and $q = [p_c]_{c \neq y}$. The notations can be found in Tab. 1.
The most standard KD [13] contains two stages of training. The first stage trains complex teachers, and then the second stage transfers the knowledge from teachers to a smaller student via minimizing the KL divergence between softened probabilities. Usually, the loss function during the second stage (i.e., the student's learning objective) is a combination of cross-entropy loss and distillation loss:

$$ \ell \;=\; \underbrace{-(1-\lambda)\log p^{S}_{y}(1)}_{\text{CE Loss}} \;\; \underbrace{-\,\lambda\tau^{2}\sum_{c=1}^{C} p^{T}_{c}(\tau)\log p^{S}_{c}(\tau)}_{\text{KD Loss}}, \tag{1} $$

where the superscript "T"/"S" denotes "Teacher"/"Student", respectively. Commonly, a default temperature of 1 is utilized for the CE loss, and the student could also take a temperature of 1 for the KD loss, e.g., $p^{S}_{c}(\tau = 1)$ [13, 31, 45, 44].
Suppose we have two teachers, denoted as $T_{\text{large}}$ and $T_{\text{small}}$, and the larger teacher performs better on both training and test data. If we use them to teach the same student $S$, we could find that the student's performance is worse when mimicking the larger teacher's outputs. Adjusting the temperature cannot make the larger teacher teach well either. The details of this phenomenon can be found in [6, 31] and Fig. 9. Obviously, $p^{T_{\text{large}}}$ could differ a lot from $p^{T_{\text{small}}}$, which is the only difference in the loss function when teaching the student. Hence, we focus on analyzing what probability distributions teachers with different capacities tend to provide.
4 Proposed Methods
This section first decomposes KD into three parts and defines several quantitative metrics. Then, we
present a theoretical analysis to demonstrate why larger networks can't teach well. Finally, we propose a more appropriate temperature scaling approach as an alternative.
4.1 KD Decomposition
We omit the coefficient $\lambda\tau^{2}$ in Eq. 1 and define $e_{q^{T}(\tau)} = \frac{1}{C-1}\sum_{j=1, j\neq y}^{C} p^{T}_{j}(\tau)$, where $q^{T}(\tau) = [p^{T}_{c}(\tau)]_{c \neq y}$. Then, we have the following decomposition:

$$ \ell_{\mathrm{kd}} \;=\; \underbrace{-\,p^{T}_{y}(\tau)\log p^{S}_{y}(\tau)}_{\text{Correct Guidance}} \;\; \underbrace{-\sum_{c\neq y} e_{q^{T}(\tau)}\log p^{S}_{c}(\tau)}_{\text{Smooth Regularization}} \;\; \underbrace{-\sum_{c\neq y}\bigl(p^{T}_{c}(\tau)-e_{q^{T}(\tau)}\bigr)\log p^{S}_{c}(\tau)}_{\text{Class Discriminability}}. \tag{2} $$
(I) Correct Guidance: this term guarantees correctness during teaching. The decomposition in [9] also contains this term, which is explained as importance weighting. This term works similarly to the cross-entropy loss and can be dealt with separately when applying temperature scaling.

(II) Smooth Regularization: some previous works [55, 62, 63] attribute the success of KD to the efficacy of regularization and study its relation to label smoothing (LS). The combination of this term with correct guidance works similarly to LS. Notably, $e_{q^{T}(\tau)}$ differs across samples, implying that the strength of smoothing is instance-specific, which is similar to the analysis in [62].

(III) Class Discriminability: this term tells the student the affinity of the wrong classes to the correct class. Transferring the knowledge of class similarities to students has been the mainstream guess of the "dark knowledge" in KD [13, 38]. Ideally, a good teacher should be as discriminating as possible in telling students which classes are more related to the correct class.

Illustrations of the decomposition are presented in the left of Fig. 1. Obviously, an appropriate temperature should simultaneously retain the efficacy of all three terms, e.g., the shaded row in Fig. 1. A too high or too low temperature could lead to smaller class discriminability, making the guidance less distinct among the wrong classes, which weakens the distillation performance in practice. Among these three terms, we advocate that class discriminability is more fundamental in KD and present more discussions in Appendix A (verified in Fig. 2 and Fig. 3).
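To see the decomposition concretely, here is a minimal single-sample sketch (PyTorch; the helper name `decompose_kd` is ours, not from the paper) that splits the softened cross-entropy of Eq. 2 into the three terms and checks that they sum back to the undecomposed loss.

```python
import torch
import torch.nn.functional as F

def decompose_kd(teacher_logits, student_logits, y, tau=4.0):
    """Split -sum_c p^T_c(tau) log p^S_c(tau) into the three terms of Eq. 2 (1-D logits)."""
    p_t = F.softmax(teacher_logits / tau, dim=0)        # teacher label p^T(tau)
    log_p_s = F.log_softmax(student_logits / tau, dim=0)
    wrong = torch.arange(p_t.numel()) != y
    e_q = p_t[wrong].mean()                              # derived average e(q^T)
    correct_guidance = -p_t[y] * log_p_s[y]
    smooth_regularization = -(e_q * log_p_s[wrong]).sum()
    class_discriminability = -((p_t[wrong] - e_q) * log_p_s[wrong]).sum()
    total = -(p_t * log_p_s).sum()
    assert torch.allclose(correct_guidance + smooth_regularization
                          + class_discriminability, total)
    return correct_guidance, smooth_regularization, class_discriminability
```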
To measure these three terms quantitatively, we use the target class probability (i.e., $p_y$), the average of wrong class probabilities (i.e., $e(q) = \frac{1}{C-1}\sum_{j\neq y} p_j$), and the variance of wrong class probabilities (i.e., $v(q) = \frac{1}{C-1}\sum_{j\neq y}(p_j - e(q))^2$) as estimators. $e(\cdot)$ and $v(\cdot)$ calculate the mean and variance of the elements in a vector. In some cases, we use the standard deviation as an estimator for the third term, i.e., $\sigma(q) = v^{1/2}(q)$. Because the latter two quantities are calculated after applying softmax to the complete logit vector, we define them as the Derived Average (DA) and the Derived Variance (DV), respectively. In experiments, we calculate these metrics for all training samples and sometimes report the average or standard deviation across these samples.
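These statistics can be computed directly from a logit vector; a minimal NumPy sketch under the paper's definitions (the function names are ours):

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()                                # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_statistics(logits, y, tau=4.0):
    """Derived Average, Derived Variance, and Inherent Variance of the wrong classes."""
    p = softmax(logits, tau)                       # softmax over all classes
    q = np.delete(p, y)                            # wrong-class probabilities
    da = q.mean()                                  # e(q)
    dv = np.mean((q - da) ** 2)                    # v(q)
    q_tilde = softmax(np.delete(logits, y), tau)   # softmax over wrong logits only
    iv = np.mean((q_tilde - q_tilde.mean()) ** 2)  # v(q_tilde)
    return da, dv, iv
```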
4.2 Theoretical Analysis
We analyze the mean and variance of the softened probability vector, i.e., the teacher's label $p^{T}(\tau)$ used in KD. We defer the proofs of Lemma 4.1 and Propositions 4.3 and 4.4 to Appendix B.

Lemma 4.1 (Variance of Softened Probabilities). Given a logit vector $f \in \mathbb{R}^{C}$ and the softened probability vector $p = \mathrm{SF}(f; \tau)$, $\tau \in (0, \infty)$, $v(p)$ monotonically decreases as $\tau$ increases.

As $\tau$ increases, $p(\tau)$ becomes more uniform, i.e., its entropy increases. However, we especially focus on the wrong classes, whose mean and variance are more intuitive to calculate and analyze.

Assumption 4.2. The target logit is higher than the other classes' logits, i.e., $f_y \geq f_c, \forall c \neq y$.

Assumption 4.2 is rational because well-performed teachers usually achieve a high accuracy (e.g., > 95%) on the training set, and most training samples meet this requirement.

Proposition 4.3. Under Assumption 4.2, $p_y$ monotonically decreases as $\tau$ increases, and $e(q)$ monotonically increases as $\tau$ increases. As $\tau \to \infty$, $e(q) \to 1/C$.

Proposition 4.3 implies that increasing the temperature could lead to a higher derived average (empirically, see Fig. 7) and strengthen the smooth regularization term in Eq. 2.
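A quick numerical check of these monotonicity claims (NumPy; the example logits are arbitrary):

```python
import numpy as np

logits = np.array([5.0, 1.0, 0.5, -0.5, -1.0])   # arbitrary example logits, target class y = 0
y = 0
for tau in [1.0, 2.0, 4.0, 8.0, 16.0]:
    z = logits / tau
    p = np.exp(z - z.max()); p /= p.sum()
    q = np.delete(p, y)
    # Lemma 4.1: v(p) shrinks; Proposition 4.3: p_y shrinks while e(q) grows toward 1/C = 0.2.
    print(f"tau={tau:5.1f}  v(p)={p.var():.4f}  p_y={p[y]:.3f}  e(q)={q.mean():.4f}")
```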
Before we analyze the class discriminability term, we define $\tilde{q}(\tau)$ as the result of applying softmax only to the wrong logits with temperature $\tau$, i.e., $\tilde{q}(\tau) = \mathrm{SF}(g; \tau)$. For the element index $c$ of $q$, we have

$$ \tilde{q}_{c}(\tau) \;=\; \frac{\exp(g_{c}/\tau)}{\sum_{j}\exp(g_{j}/\tau)}. \tag{3} $$

Notably, $\tilde{q}$ differs from $q$ a lot. Specifically, the former satisfies $\sum_{c}\tilde{q}_{c} = 1$, while the summation of the latter is $\sum_{c\neq y} p_c = 1 - p_y$. The former does not depend on the correct class's logit while the latter does. We name $v(\tilde{q})$ the Inherent Variance (IV) because it only depends on the wrong classes' logits.
Proposition 4.4 (Derived Variance vs. Inherent Variance). The derived variance is determined by the square of the derived average and the inherent variance via:

$$ \underbrace{v(q)}_{\text{DV}} \;=\; (C-1)^{2}\,\underbrace{e^{2}(q)}_{\text{DA}^{2}}\,\underbrace{v(\tilde{q})}_{\text{IV}}. \tag{4} $$

As $\tau$ increases, $e(q)$ increases (Proposition 4.3) while $v(\tilde{q})$ decreases (Lemma 4.1); hence, it is not so easy to judge the specific monotonicity of $v(q)$ w.r.t. $\tau$. Empirically, we observe that the derived variance first increases and then decreases (see Fig. 7), which conforms to the change of the class discriminability illustrated in Fig. 1.
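As a sanity check, the identity in Eq. 4 can be verified numerically for any logit vector; a short NumPy sketch with arbitrary example logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([7.0, 1.2, 0.4, -0.3, -1.1])    # arbitrary logits, target class 0
y, tau = 0, 4.0
C = logits.size

p = softmax(logits / tau)
q = np.delete(p, y)
e_q = q.mean()                                    # DA
v_q = np.mean((q - e_q) ** 2)                     # DV
q_tilde = softmax(np.delete(logits, y) / tau)
v_qt = np.mean((q_tilde - q_tilde.mean()) ** 2)   # IV

# Eq. 4: DV = (C-1)^2 * DA^2 * IV
assert np.isclose(v_q, (C - 1) ** 2 * e_q ** 2 * v_qt)
```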
We could use Proposition 4.4 to clearly analyze why larger teacher networks can't teach well. Before this, we present another two properties and a corollary without detailed proof.

Remark 4.5. Fixing $g$ and $\tau$, a higher target logit $f_y$ leads to a higher $p_y$, i.e., a smaller derived average $e(q)$.

Remark 4.6. Fixing $\tau$, less varied wrong logits $g$ lead to a less varied $\tilde{q}$, i.e., a smaller inherent variance $v(\tilde{q})$.
Corollary 4.7. Suppose we have two teachers $T_1$ and $T_2$, and their logit vectors for the same sample are $f^{T_1}$ and $f^{T_2}$.

• If $f^{T_1}_{y} \geq f^{T_2}_{y}$ while $g^{T_1}$ and $g^{T_2}$ are nearly the same, then $p^{T_1}_{y} \geq p^{T_2}_{y}$ (Remark 4.5) while $v(\tilde{q}^{T_1}) \approx v(\tilde{q}^{T_2})$. Hence, $v(q^{T_1}) \leq v(q^{T_2})$.

• If $f^{T_1}_{y} \approx f^{T_2}_{y}$ while $v(g^{T_1}) \leq v(g^{T_2})$, then $p^{T_1}_{y} \approx p^{T_2}_{y}$ while $v(\tilde{q}^{T_1}) \leq v(\tilde{q}^{T_2})$ (Remark 4.6). Hence, $v(q^{T_1}) \leq v(q^{T_2})$.
This corollary explains why a larger teacher can't teach better. Because the larger teacher tends to be over-confident, the target logit $f_y$ may be larger or the variance of wrong logits $v(g)$ may be smaller. These are illustrated in Fig. 1 and empirically verified in Fig. 4. Then the derived variance $v(q)$ may be smaller, limiting the efficacy of class discriminability in Eq. 2. Empirical results are in Fig. 7.
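To make the corollary concrete, the following NumPy sketch recomputes the derived variance $v(q)$ (per the definition in Sect. 4.1) for the example logits of Fig. 1 (right) at temperature 4.0; both over-confident variants of the larger teacher end up with a smaller derived variance than the smaller teacher.

```python
import numpy as np

def derived_variance(logits, y, tau):
    z = np.asarray(logits, dtype=float) / tau
    p = np.exp(z - z.max()); p /= p.sum()
    q = np.delete(p, y)
    return np.mean((q - q.mean()) ** 2)            # v(q) as defined in Sect. 4.1

tau, y = 4.0, 0
smaller  = [ 9.0, -0.6, -0.4, -0.2, -1.0]          # smaller teacher (Fig. 1, right)
larger_a = [12.0, -0.6, -0.4, -0.2, -1.0]          # larger teacher: larger target logit
larger_b = [ 9.0, -0.3, -0.2, -0.1, -0.5]          # larger teacher: less varied wrong logits

for name, f in [("smaller", smaller), ("larger_a", larger_a), ("larger_b", larger_b)]:
    print(name, f"v(q) = {derived_variance(f, y, tau):.2e}")
# Both over-confident variants yield a smaller v(q) than the smaller teacher.
```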
Notably, we focus on analyzing the variance of wrong class probabilities instead of all classes. Maximizing the variance of all classes' probabilities does not mean maximizing the variance of the wrong classes'. For example, although a very low temperature can maximize the variance of all classes' probabilities, the resulting teacher's label is nearly one-hot and shows no distinctness between the wrong classes. In other words, the effectiveness of KD should be more related to the distinctness between wrong classes rather than all classes. However, traditional temperature scaling applies a uniform temperature to all classes and thus cannot handle the wrong classes separately.
4.3 Asymmetric Temperature Scaling
We conclude from the above analysis: if a larger teacher makes an over-confident prediction, the wrong class probabilities it provides may not be discriminative enough. Utilizing a uniform temperature cannot enlarge the derived variance as much as possible due to the interference of the target class's logit (see the middle of Fig. 7). Thanks to the decomposition in Eq. 2, the correct guidance term works similarly to the cross-entropy loss and allows us to deal with it separately. Hence, we propose a novel temperature scaling approach:

$$ p_{c}(\tau_{1},\tau_{2}) \;=\; \frac{\exp(f_{c}/\tau_{c})}{\sum_{j\in[C]}\exp(f_{j}/\tau_{j})}, \qquad \tau_{i} = \mathbb{I}\{i=y\}\,\tau_{1} + \mathbb{I}\{i\neq y\}\,\tau_{2}, \;\; \forall i \in [C], \tag{5} $$
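A minimal sketch of Eq. 5 in code (PyTorch; the function name `ats_softmax` and the default temperatures are illustrative, with the higher temperature $\tau_1$ on the correct class and the lower $\tau_2$ on the wrong classes, as described in the abstract):

```python
import torch

def ats_softmax(teacher_logits, labels, tau1=6.0, tau2=3.0):
    """Asymmetric Temperature Scaling (Eq. 5): per-class temperatures before softmax.

    teacher_logits: (batch, C) logits; labels: (batch,) target class indices.
    The correct class's logit is divided by tau1, wrong classes' logits by tau2.
    """
    temps = torch.full_like(teacher_logits, tau2)
    temps.scatter_(1, labels.view(-1, 1), tau1)    # tau_i = tau1 if i == y else tau2
    return torch.softmax(teacher_logits / temps, dim=1)

# A plausible usage: replace the teacher's softened label p^T(tau) in the KD loss (Eq. 1)
# with ats_softmax(teacher_logits, labels).
```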