Efficient Knowledge Distillation from Model Checkpoints
Chaofei Wang, Qisen Yang, Rui Huang, Shiji Song, Gao Huang
Department of Automation, Tsinghua University, China
{wangcf18, yangqs19, hr20}@mails.tsinghua.edu.cn
{shijis, gaohuang}@tsinghua.edu.cn
Abstract
Knowledge distillation is an effective approach to learning compact models (students) under the supervision of large and strong models (teachers). Since there empirically exists a strong correlation between the performance of teacher and student models, it is commonly believed that a high-performing teacher is preferred. Consequently, practitioners tend to use a well-trained network or an ensemble of them as the teacher. In this paper, we observe that an intermediate model, i.e., a checkpoint in the middle of the training procedure, often serves as a better teacher than the fully converged model, although the former has much lower accuracy. More surprisingly, a weak snapshot ensemble of several intermediate models from the same training trajectory can outperform a strong ensemble of independently trained and fully converged models when they are used as teachers. We show that this phenomenon can be partially explained by the information bottleneck principle: the feature representations of intermediate models can have higher mutual information with the input, and thus contain more “dark knowledge” for effective distillation. We further propose an optimal intermediate teacher selection algorithm based on maximizing the total task-related mutual information. Experiments verify its effectiveness and applicability. Our code is available at https://github.com/LeapLabTHU/CheckpointKD.
1 Introduction
Knowledge distillation (KD) [1, 2] has proven to be an effective technique for improving the performance of a low-capacity model by transferring “dark knowledge” from a large teacher model. Empirically, there usually exists a strong correlation between the performance of the teacher model and the student model. For this reason, it is standard practice to use a well-trained network or an ensemble of multiple well-trained networks as the teacher [3, 4, 5], and some studies attempt to improve distillation performance by boosting the ensemble performance [6, 7]. The underlying assumption is that high-performing teachers lead to better student models.
However, this viewpoint has been challenged by some recent works [8, 9, 10, 11, 12], which observe that a large model capacity gap between the teacher and student may have a negative effect on knowledge transfer. To address this issue, researchers have proposed to employ an intermediate-size network [8] or an assistant network [9] to improve distillation performance in such scenarios. In [10], a "tolerant" teacher model is designed by using a softened loss function. In [11], Park et al. proposed to learn a student-friendly teacher by plugging in student branches during the training procedure. Nevertheless, there is no clear theoretical explanation for the gap between teacher and student, and the search for a substitute teacher is not straightforward.
Equal contribution. Corresponding author: Gao Huang.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.06458v1 [cs.LG] 12 Oct 2022
Figure 1: A sketch map of the two counterintuitive observations. (a) Intermediate Teacher vs. Full Teacher: a weak intermediate model can serve as a better teacher than the strong fully converged model (knowledge distillation performance vs. training epochs of the teacher). (b) Snapshot Ensemble vs. Full Ensemble: a weak Snapshot Ensemble [13] can serve as a better teacher than the strong Full Ensemble (knowledge distillation performance vs. ensemble size of the teacher).
In this paper, we make an intriguing observation that further supports the viewpoint that high-performing models may not necessarily be good teachers, but from a novel perspective. Specifically, we find that an unconverged intermediate model from the middle of the training procedure often serves as a better teacher than the final converged model, although the former has much lower accuracy (as illustrated in Figure 1(a)). Moreover, a weak snapshot ensemble of intermediate teacher models along the same optimization path (denoted as Snapshot Ensemble, a variant of [13]³) can outperform the standard ensemble of an equal number of independently trained teacher models (denoted as Full Ensemble). This surprising phenomenon is illustrated in Figure 1(b), in which a Snapshot Ensemble can have better distillation performance than a Full Ensemble, although the accuracy of the former is significantly lower.

³In this paper, we adopt a normal cosine learning rate instead of the cyclic learning rate.
To understand the above phenomenon, we show that there is a strong connection between KD and the information bottleneck (IB) theory [14]. Therein, it has been observed that during the training of deep neural networks, the mutual information between the learned features $F$ and the target $Y$, denoted as $I(Y;F)$, increases monotonically as a function of the training epochs, while the mutual information between $F$ and the input $X$, denoted as $I(X;F)$, grows in the early training stage but then decreases gradually after a certain number of epochs. We note that maximizing the mutual information $I(Y;F)$ is helpful for improving the teacher model itself, but not always necessary for KD, because the ground truth target $Y$ is already included in the KD objective function. In contrast, the mutual information $I(X;F)$ can, to some extent, be viewed as a type of dark knowledge that is desired for effective KD. For example, consider an image of a man driving a car: although it may be uniquely labeled into the “car” category, it still contains features of the “people” category. Such weak but non-negligible features extracted from the input (measured by $I(X;F)$) are in fact the most valuable knowledge for distilling student models. Not surprisingly, most KD algorithms apply a high temperature to soften the network prediction in order to reveal this information from a teacher model. However, as shown by IB theory, a fully converged model tends to be overconfident and may already have collapsed representations for non-targeted classes. Therefore, simply scaling the temperature cannot effectively recover the suppressed knowledge. On the contrary, an intermediate model, although it does not reach its top accuracy due to non-optimal $I(Y;F)$, may have a larger $I(X;F)$ that benefits KD. This partially explains our observation that intermediate models can be better teachers. More detailed and formal analyses are provided in the following sections.
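To make the role of the temperature concrete, the following minimal sketch contrasts the softened outputs of an overconfident converged teacher and a less collapsed intermediate checkpoint. The logits are invented for a hypothetical three-class problem and are not taken from the paper's experiments.

```python
import torch
import torch.nn.functional as F

# Invented logits for one "car" image from two teachers (illustrative values only).
logits_full = torch.tensor([9.0, 1.0, 0.5])    # fully converged teacher: non-target scores collapsed
logits_inter = torch.tensor([4.0, 2.5, 1.0])   # intermediate checkpoint: richer non-target structure

for tau in (1.0, 4.0):
    p_full = F.softmax(logits_full / tau, dim=0)
    p_inter = F.softmax(logits_inter / tau, dim=0)
    # The probability mass assigned to non-target classes is a rough proxy for the
    # "dark knowledge" a student can absorb; the intermediate teacher retains more of it,
    # and raising the temperature does not fully recover what the converged teacher has lost.
    print(f"tau={tau}: non-target mass, full teacher={(1 - p_full[0]).item():.3f}, "
          f"intermediate teacher={(1 - p_inter[0]).item():.3f}")
```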
Further, we propose an optimal intermediate teacher selection algorithm based on the IB theory. From the perspective of entropy, the teacher model's representation can be decomposed into the information with respect to the input, the information with respect to the output, and some nuisance [15]. The proposed algorithm aims to find the most informative intermediate teacher model, i.e., the one on a training trajectory that carries the least nuisance (a rough sketch of this selection idea follows the contribution list below). Experiments verify its applicability in various distillation scenarios. Our contributions are summarized as follows:

• By designing two exploratory experiments, we observe the phenomenon that intermediate models can serve as better teachers than fully converged models. This suggests that for effective KD one should not only focus on improving the teacher performance. Instead, rethinking what the “dark knowledge” is and how to enrich it is highly valuable.

• We demonstrate the connection between our observations and the IB theory, providing a new perspective for understanding KD and explaining the “dark knowledge”.

• Based on our observations and analyses, a novel, simple but effective algorithm is proposed to find the optimal intermediate teacher and achieve better distillation performance. Experiments validate its effectiveness and adaptability.
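As a rough illustration of the selection idea described above (not the paper's exact procedure, which is detailed in later sections), the sketch below scores each saved checkpoint by the sum of estimated mutual-information terms and returns the highest-scoring one. The estimator callables and the additive scoring rule are assumptions made for this sketch; in practice one would plug in a concrete mutual-information estimator.

```python
from typing import Callable, List, Tuple

def select_intermediate_teacher(
    checkpoints: List[str],
    estimate_mi_xf: Callable[[str], float],  # assumed estimator of I(X; F) for a checkpoint
    estimate_mi_yf: Callable[[str], float],  # assumed estimator of I(Y; F) for a checkpoint
) -> Tuple[str, float]:
    """Pick the checkpoint whose representation carries the most task-related information.

    Following the decomposition sketched above (input information + output information +
    nuisance), maximizing the two mutual-information terms amounts to keeping the
    checkpoint with the smallest nuisance share on the training trajectory.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    best_ckpt, best_score = checkpoints[0], float("-inf")
    for ckpt in checkpoints:
        score = estimate_mi_xf(ckpt) + estimate_mi_yf(ckpt)
        if score > best_score:
            best_ckpt, best_score = ckpt, score
    return best_ckpt, best_score
```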
2 Related Work
Knowledge distillation. Hinton et al. [2] proposed to transfer the “dark knowledge” of a high-capacity teacher model to a compact student model by minimizing the Kullback-Leibler divergence between the soft targets of the two models. Since then, many variants of KD have been proposed to improve distillation performance [16], such as FitNets [17], AT [18], CCM [19], FSP [20], SP [21], CCKD [22], and TC3KD [23]. As a promising technique to improve model generalization, ensemble learning is often combined with knowledge distillation to improve distillation performance [3, 11, 4]. In the online knowledge distillation framework [24], efforts were made to boost distillation performance by increasing the diversity between multiple models so as to improve the ensemble performance [6, 7]. Most existing methods commonly assume that a high-performing teacher is preferred for KD. On the contrary, some researchers argue that the model capacity gap between strong teachers and small students usually degrades knowledge transfer [8, 9, 12, 25]. Some of them have experimentally verified that poor teachers can also perform KD tasks well [12, 25]. Several methods have been proposed to close this gap by introducing an assistant network [8, 9] or designing a student-friendly teacher [10, 11]. However, they did not explain theoretically why the gap exists and how it affects KD. In the self-distillation framework [26, 27], it is essentially an intermediate model that is used as the teacher, but there is no theoretical explanation for why the intermediate model works. In this paper, we link KD and IB theory through extensive observations and experiments. From the perspective of mutual information, we explain why intermediate models serve as better teachers than full models, and how to select a suitable intermediate model to reduce the negative impact of the model gap.
Information bottleneck. Tishby et al. [28] first proposed the information bottleneck concept and provided a tabular method to numerically solve the IB Lagrangian (Eq. (3)). Later, Tishby and Zaslavsky [14] proposed to interpret deep learning with the IB principle. Following this idea, Shwartz-Ziv and Tishby [29] used the IB principle to explain the training dynamics of deep networks. This has motivated many studies applying the IB principle to interpret and improve deep neural networks (DNNs) [30, 31, 32]. Recently, some researchers successfully introduced the IB principle to deep reinforcement learning [33, 34, 35]. As far as we know, we are the first to introduce the IB principle to interpret knowledge distillation.
3 Exploratory Experiments
In this section, we first formally describe the KD and ensemble KD methods used in the paper, and then design two exploratory experiments to show how intermediate models are surprisingly valuable for KD, despite their lower accuracies due to incomplete training.
3.1 Formulation
In the classical KD setting, a fully converged teacher model (full teacher for short) $T^{full}$ is used to distill a student model $S$. Define $P_{T^{full}}$ as the softmax output of the teacher model, $P_S$ as the softmax output of the student model, and $Y_{true}$ as the true labels. The student model is trained to optimize the following loss function:
$$L_{KD} = \alpha H(Y_{true}, P_S) + (1-\alpha)\, H\big(P^{\tau}_{T^{full}}, P^{\tau}_S\big), \tag{1}$$
where $H$ refers to the cross-entropy, $\alpha$ is the trade-off parameter, and $\tau$ is the temperature. Conducting KD with an intermediate teacher model $T^{inter}$ means using $T^{inter}$ instead of $T^{full}$ in Eq. (1).
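A minimal PyTorch-style sketch of Eq. (1) is given below; the function name, default hyperparameter values, and tensor shapes are our own placeholders rather than the paper's tuned settings (many implementations also scale the soft term by $\tau^2$ to balance gradient magnitudes, which Eq. (1) does not show).

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            targets: torch.Tensor,
            alpha: float = 0.5,
            tau: float = 4.0) -> torch.Tensor:
    """Eq. (1): cross-entropy on hard labels plus cross-entropy to the softened teacher output."""
    # H(Y_true, P_S): standard cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    # H(P^tau_T, P^tau_S): cross-entropy between temperature-softened distributions.
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    p_t = F.softmax(teacher_logits / tau, dim=1)
    soft = -(p_t * log_p_s).sum(dim=1).mean()
    return alpha * hard + (1 - alpha) * soft

# Distilling from an intermediate checkpoint only changes which teacher produces
# `teacher_logits`; the loss itself is unchanged.
```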
In the standard ensemble KD setting, there are $M$ ($M \geq 2$) full teachers $T^{full}_1, T^{full}_2, \ldots, T^{full}_M$, which have the same network structure and training strategy but different initial parameters. The student model needs to mimic the average softened softmax output of all teacher models. We call this method Full Ensemble KD. The loss function is as follows:
$$L_{EKD} = \alpha H(Y_{true}, P_S) + (1-\alpha)\, H\Big(\frac{1}{M}\sum_{i=1}^{M} P^{\tau}_{T^{full}_i},\; P^{\tau}_S\Big). \tag{2}$$
The Snapshot Ensemble aggregates $M$ ($M \geq 2$) intermediate teachers $T^{inter}_1, T^{inter}_2, \ldots, T^{inter}_M$ from one training trajectory. Conducting KD with a Snapshot Ensemble means using $T^{inter}_i$ instead of $T^{full}_i$ in Eq. (2). We call it Snapshot Ensemble KD.
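Under the same assumptions as the sketch after Eq. (1), Eq. (2) can be written as follows; `teacher_logits_list` would hold the outputs of $M$ independently trained full teachers for Full Ensemble KD, or of $M$ checkpoints from one trajectory for Snapshot Ensemble KD.

```python
from typing import List

import torch
import torch.nn.functional as F

def ensemble_kd_loss(student_logits: torch.Tensor,
                     teacher_logits_list: List[torch.Tensor],
                     targets: torch.Tensor,
                     alpha: float = 0.5,
                     tau: float = 4.0) -> torch.Tensor:
    """Eq. (2): the student mimics the average softened softmax output of M teachers."""
    hard = F.cross_entropy(student_logits, targets)
    # Average the temperature-softened teacher distributions (the (1/M) * sum term).
    p_t_mean = torch.stack(
        [F.softmax(t / tau, dim=1) for t in teacher_logits_list]
    ).mean(dim=0)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    soft = -(p_t_mean * log_p_s).sum(dim=1).mean()
    return alpha * hard + (1 - alpha) * soft
```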
3.2 Experimental design and setups
To examine the common assumption that “high-performing teachers lead to better student models” and to explore the value of intermediate models, we design two experiments. 1) Standard KD trains a full teacher model to distill a student model; what if we adopt intermediate teacher models instead? 2) Standard ensemble KD trains multiple full teacher models independently and averages their outputs to distill a single student model; what if we adopt the Snapshot Ensemble instead of the Full Ensemble? We name the first experiment “Intermediate Teacher vs. Full Teacher” and the second “Snapshot Ensemble vs. Full Ensemble”. For generality, we conduct experiments on the CIFAR-100 [36], Tiny-ImageNet [37], and ImageNet [38] datasets with various teacher-student pairs. The distillation loss functions follow Eqs. (1) and (2). For a fair comparison, we search for the optimal hyperparameters (i.e., the loss ratio $\alpha$ and the temperature $\tau$) for each teacher-student pair. Top-1 accuracy is averaged over five independent runs.
The “Intermediate Teacher vs. Full Teacher” experiment is conducted on CIFAR-100 and ImageNet. On CIFAR-100, we adopt WRN-40-2 [39] and ResNet-110 [40] as teacher models, and WRN-40-1 [39], ResNet-32 [40], and MobileNetV2 [41] (width multiplier 0.75) as student models. We train each teacher model for 200 epochs to ensure convergence. We save the intermediate models at the 20th, 40th, ..., 180th epochs as intermediate teachers, and the models at the 200th epoch as full teachers. On ImageNet, we adopt ResNet-50 [40] and ResNet-34 [40] as teacher models, and MobileNetV2 [41] and ResNet-18 [40] as student models. We follow the standard PyTorch practice but train teacher models for 120 epochs to guarantee convergence. We save the intermediate models at the 60th epoch as intermediate teachers, and the models at the 120th epoch as full teachers. The “Snapshot Ensemble vs. Full Ensemble” experiment is conducted on CIFAR-100 and Tiny-ImageNet. We train models for 150 epochs on Tiny-ImageNet to ensure convergence. We save the intermediate models at the 75th epoch as intermediate teachers, and the models at the 150th epoch as full teachers. We adopt WRN-40-1 [39], ResNet-32 [40], and MobileNetV2 [41] as student models, and WRN-40-2 [39] and ResNet-110 [40] as teacher models. Due to the page limit, we include the introduction of the datasets and detailed experimental settings in Appendix A.1.
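Collecting the intermediate teachers amounts to ordinary periodic checkpointing during the teacher's training run. The sketch below is illustrative only: `train_one_epoch` and the checkpoint path format are placeholders for whatever training code and storage layout are actually used.

```python
from typing import Callable

import torch

def train_teacher_with_checkpoints(model: torch.nn.Module,
                                   optimizer: torch.optim.Optimizer,
                                   train_one_epoch: Callable[[torch.nn.Module, torch.optim.Optimizer], None],
                                   num_epochs: int = 200,
                                   save_every: int = 20,
                                   path_fmt: str = "teacher_epoch_{}.pth") -> None:
    """Train a teacher and periodically save checkpoints for later use as intermediate teachers.

    With num_epochs=200 and save_every=20 this yields the epoch-20, 40, ..., 180
    intermediate teachers and the epoch-200 full teacher described above for CIFAR-100.
    """
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(model, optimizer)  # one epoch of standard supervised training
        if epoch % save_every == 0:
            torch.save(model.state_dict(), path_fmt.format(epoch))
```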
3.3 Intermediate Teacher vs. Full Teacher
Figure 2: Ablation experiments on the training epochs of $T^{inter}$ on CIFAR-100 (test accuracy of the student vs. training epochs of the teacher, for the WRN-40-2/WRN-40-1, ResNet-110/ResNet-32, WRN-40-2/MobileNetV2, and ResNet-110/MobileNetV2 teacher/student pairs).
Firstly, we simply compare the half-way teachers with the full teachers on CIFAR-100 and ImageNet. That is, the intermediate models at the 100th epoch are adopted as $T^{inter}$ on CIFAR-100, and the intermediate models at the 60th epoch are adopted as $T^{inter}$ on ImageNet. The training cost of all intermediate teachers is only half that of the full teachers. Table 1 shows the comparison results. Specifically, on CIFAR-100, for WRN-40-2, the accuracy of the intermediate model is 13.54% lower than that of the full model, but its distillation performance is comparable (0.08% higher) and superior (0.96% higher). For ResNet-110, the accuracy of the intermediate model is 13.98% lower than that of the full model, but its distillation performance is still comparable (0.01% higher and 0.16% higher). On ImageNet, the accuracy of the inter-