Efficient Knowledge Distillation from Model Checkpoints
Chaofei Wang, Qisen Yang, Rui Huang, Shiji Song, Gao Huang
Department of Automation, Tsinghua University, China
{wangcf18, yangqs19, hr20}@mails.tsinghua.edu.cn
{shijis, gaohuang}@tsinghua.edu.cn
Abstract
Knowledge distillation is an effective approach to learning compact models (students) under the supervision of large and strong models (teachers). Since there empirically exists a strong correlation between the performance of teacher and student models, it is commonly believed that a high-performing teacher is preferred. Consequently, practitioners tend to use a well-trained network or an ensemble of them as the teacher. In this paper, we observe that an intermediate model, i.e., a checkpoint in the middle of the training procedure, often serves as a better teacher than the fully converged model, although the former has much lower accuracy. More surprisingly, a weak snapshot ensemble of several intermediate models from the same training trajectory can outperform a strong ensemble of independently trained and fully converged models when they are used as teachers. We show that this phenomenon can be partially explained by the information bottleneck principle: the feature representations of intermediate models can have higher mutual information with the input, and thus contain more “dark knowledge” for effective distillation. We further propose an optimal intermediate teacher selection algorithm based on maximizing the total task-related mutual information. Experiments verify its effectiveness and applicability. Our code is available at https://github.com/LeapLabTHU/CheckpointKD.
1 Introduction
Knowledge distillation (KD) [1, 2] has proven to be an effective technique for improving the performance of a low-capacity model by transferring “dark knowledge” from a large teacher model. Empirically, there usually exists a strong correlation between the performance of the teacher model and the student model. For this reason, it is standard practice to use a well-trained network or an ensemble of multiple well-trained networks as the teacher [3, 4, 5], and some studies attempt to improve distillation performance by boosting the ensemble performance [6, 7]. The underlying assumption is that high-performing teachers lead to better student models.
However, this viewpoint has been challenged by some recent works [8, 9, 10, 11, 12], which observe that a large model capacity gap between the teacher and student may have a negative effect on knowledge transfer. To address this issue, researchers have proposed to employ an intermediate-size network [8] or an assistant network [9] to improve distillation performance in such scenarios. In [10], a "tolerant" teacher model is designed by using a softened loss function. In [11], Park et al. proposed to learn a student-friendly teacher by plugging in student branches during the training procedure. Nevertheless, there is no clear theoretical explanation for the gap between teacher and student, and the search for a substitute teacher is not straightforward.
Equal contribution. Corresponding author: Gao Huang.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.06458v1 [cs.LG] 12 Oct 2022
Figure 1: A sketch map of the two counterintuitive observations. (a) Intermediate Teacher vs. Full Teacher: a weak intermediate model can serve as a better teacher than the strong fully converged model (knowledge distillation performance vs. training epochs of the teacher). (b) Snapshot Ensemble vs. Full Ensemble: a weak Snapshot Ensemble [13] can serve as a better teacher than the strong Full Ensemble (knowledge distillation performance vs. ensemble size of the teacher).
In this paper, we make an intriguing observation that further supports the viewpoint that high-performing models may not necessarily be good teachers, but from a novel perspective. Specifically, we find that an unconverged intermediate model from the middle of the training procedure often serves as a better teacher than the final converged model, although the former has much lower accuracy (as illustrated in Figure 1(a)). Moreover, a weak snapshot ensemble of intermediate teacher models along the same optimization path (denoted as Snapshot Ensemble, a variant of [13]³) can outperform the standard ensemble of an equal number of independently trained teacher models (denoted as Full Ensemble). This surprising phenomenon is illustrated in Figure 1(b), in which a Snapshot Ensemble can have better distillation performance than a Full Ensemble, although the accuracy of the former is significantly lower.

³In this paper, we adopt a normal cosine learning rate instead of the cyclic learning rate.
To understand the above phenomenon, we show that there is a strong connection between KD and the information bottleneck (IB) theory [14]. Therein, it has been observed that during the training of deep neural networks, the mutual information between the learned features $F$ and the target $Y$, denoted as $I(Y;F)$, increases monotonically as a function of the training epochs, while the mutual information between $F$ and the input $X$, denoted as $I(X;F)$, grows in the early training stage but then decreases gradually after a certain number of epochs. We note that maximizing the mutual information $I(Y;F)$ is helpful for improving the teacher model itself, but not always necessary for KD, because the ground truth target $Y$ is already included in the KD objective function. In contrast, the mutual information $I(X;F)$ can, to some extent, be viewed as a type of dark knowledge that is desired for effective KD. For example, consider an image of a man driving a car: although it may be uniquely labeled into the “car” category, it still contains features of the “people” category. Such weak but non-negligible features extracted from the input (measured by $I(X;F)$) are in fact the most valuable knowledge for distilling student models. Not surprisingly, most KD algorithms apply a high temperature to soften the network prediction in order to reveal this information from a teacher model. However, as shown by IB theory, a fully converged model tends to be overconfident and may already have collapsed representations for non-targeted classes. Therefore, simply scaling the temperature cannot effectively recover the suppressed knowledge. On the contrary, an intermediate model, although it does not reach its top accuracy due to non-optimal $I(Y;F)$, may have a larger $I(X;F)$ that benefits KD. This partially explains our observation that intermediate models can be better teachers. More detailed and formal analyses are provided in the following sections.
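To make the role of the temperature concrete, the following minimal sketch contrasts the softened outputs of an overconfident converged teacher and a less collapsed intermediate checkpoint. The logits are invented for a hypothetical three-class problem and are not taken from the paper's experiments.

```python
import torch
import torch.nn.functional as F

# Invented logits for one "car" image from two teachers (illustrative values only).
logits_full = torch.tensor([9.0, 1.0, 0.5])    # fully converged teacher: non-target scores collapsed
logits_inter = torch.tensor([4.0, 2.5, 1.0])   # intermediate checkpoint: richer non-target structure

for tau in (1.0, 4.0):
    p_full = F.softmax(logits_full / tau, dim=0)
    p_inter = F.softmax(logits_inter / tau, dim=0)
    # The probability mass assigned to non-target classes is a rough proxy for the
    # "dark knowledge" a student can absorb; the intermediate teacher retains more of it,
    # and raising the temperature does not fully recover what the converged teacher has lost.
    print(f"tau={tau}: non-target mass, full teacher={(1 - p_full[0]).item():.3f}, "
          f"intermediate teacher={(1 - p_inter[0]).item():.3f}")
```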
Further, we propose an optimal intermediate teacher selection algorithm based on the IB theory. From the perspective of entropy, the teacher model's representation can be decomposed into the information with respect to the input, the information with respect to the output, and some nuisance [15]. The proposed algorithm aims to find the most informative intermediate teacher model, i.e., the one on a training trajectory that carries the least nuisance (a rough sketch of this selection idea follows the contribution list below). Experiments verify its applicability in various distillation scenarios. Our contributions are summarized as follows:

• By designing two exploratory experiments, we observe the phenomenon that intermediate models can serve as better teachers than fully converged models. This suggests that for effective KD one should not only focus on improving the teacher performance. Instead, rethinking what the “dark knowledge” is and how to enrich it is highly valuable.

• We demonstrate the connection between our observations and the IB theory, providing a new perspective for understanding KD and explaining the “dark knowledge”.

• Based on our observations and analyses, a novel, simple but effective algorithm is proposed to find the optimal intermediate teacher and achieve better distillation performance. Experiments validate its effectiveness and adaptability.
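As a rough illustration of the selection idea described above (not the paper's exact procedure, which is detailed in later sections), the sketch below scores each saved checkpoint by the sum of estimated mutual-information terms and returns the highest-scoring one. The estimator callables and the additive scoring rule are assumptions made for this sketch; in practice one would plug in a concrete mutual-information estimator.

```python
from typing import Callable, List, Tuple

def select_intermediate_teacher(
    checkpoints: List[str],
    estimate_mi_xf: Callable[[str], float],  # assumed estimator of I(X; F) for a checkpoint
    estimate_mi_yf: Callable[[str], float],  # assumed estimator of I(Y; F) for a checkpoint
) -> Tuple[str, float]:
    """Pick the checkpoint whose representation carries the most task-related information.

    Following the decomposition sketched above (input information + output information +
    nuisance), maximizing the two mutual-information terms amounts to keeping the
    checkpoint with the smallest nuisance share on the training trajectory.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    best_ckpt, best_score = checkpoints[0], float("-inf")
    for ckpt in checkpoints:
        score = estimate_mi_xf(ckpt) + estimate_mi_yf(ckpt)
        if score > best_score:
            best_ckpt, best_score = ckpt, score
    return best_ckpt, best_score
```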
2 Related Work
Knowledge distillation. Hinton et al. [2] proposed to transfer the “dark knowledge” of a high-capacity teacher model to a compact student model by minimizing the Kullback-Leibler divergence between the soft targets of the two models. Since then, many variants of KD have been proposed to improve distillation performance [16], such as FitNets [17], AT [18], CCM [19], FSP [20], SP [21], CCKD [22], and TC3KD [23]. As a promising technique to improve model generalization, ensemble learning is often combined with knowledge distillation to improve distillation performance [3, 11, 4]. In the online knowledge distillation framework [24], efforts were made to boost distillation performance by increasing the diversity between multiple models so as to improve the ensemble performance [6, 7]. Most existing methods commonly assume that a high-performing teacher is preferred for KD. On the contrary, some researchers argue that the model capacity gap between strong teachers and small students usually degrades knowledge transfer [8, 9, 12, 25]. Some of them have experimentally verified that poor teachers can also perform KD tasks well [12, 25]. Several methods have been proposed to close this gap by introducing an assistant network [8, 9] or designing a student-friendly teacher [10, 11]. However, they did not explain theoretically why the gap exists and how it affects KD. In the self-distillation framework [26, 27], it is essentially an intermediate model that is used as the teacher, but there is no theoretical explanation for why the intermediate model works. In this paper, we link KD and IB theory through extensive observations and experiments. From the perspective of mutual information, we explain why intermediate models serve as better teachers than full models, and how to select a suitable intermediate model to reduce the negative impact of the model gap.
Information bottleneck. Tishby et al. [28] first proposed the information bottleneck concept and provided a tabular method to numerically solve the IB Lagrangian (Eq. (3)). Later, Tishby and Zaslavsky [14] proposed to interpret deep learning with the IB principle. Following this idea, Shwartz-Ziv and Tishby [29] used the IB principle to explain the training dynamics of deep networks. This has motivated many studies applying the IB principle to interpret and improve deep neural networks (DNNs) [30, 31, 32]. Recently, some researchers successfully introduced the IB principle to deep reinforcement learning [33, 34, 35]. As far as we know, we are the first to introduce the IB principle to interpret knowledge distillation.
3 Exploratory Experiments
In this section, we first formally describe the KD and ensemble KD methods used in the paper, and then design two exploratory experiments to show how intermediate models are surprisingly valuable for KD, despite their lower accuracies due to incomplete training.
3.1 Formulation
In the classical KD setting, a fully converged teacher model (full teacher for short) $T^{full}$ is used to distill a student model $S$. Define $P_{T^{full}}$ as the softmax output of the teacher model, $P_S$ as the softmax output of the student model, and $Y_{true}$ as the true labels. The student model is trained to optimize the following loss function:
$$L_{KD} = \alpha H(Y_{true}, P_S) + (1-\alpha)\, H\big(P^{\tau}_{T^{full}}, P^{\tau}_S\big), \tag{1}$$
where $H$ refers to the cross-entropy, $\alpha$ is the trade-off parameter, and $\tau$ is the temperature. Conducting KD with an intermediate teacher model $T^{inter}$ means using $T^{inter}$ instead of $T^{full}$ in Eq. (1).
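A minimal PyTorch-style sketch of Eq. (1) is given below; the function name, default hyperparameter values, and tensor shapes are our own placeholders rather than the paper's tuned settings (many implementations also scale the soft term by $\tau^2$ to balance gradient magnitudes, which Eq. (1) does not show).

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            targets: torch.Tensor,
            alpha: float = 0.5,
            tau: float = 4.0) -> torch.Tensor:
    """Eq. (1): cross-entropy on hard labels plus cross-entropy to the softened teacher output."""
    # H(Y_true, P_S): standard cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    # H(P^tau_T, P^tau_S): cross-entropy between temperature-softened distributions.
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    p_t = F.softmax(teacher_logits / tau, dim=1)
    soft = -(p_t * log_p_s).sum(dim=1).mean()
    return alpha * hard + (1 - alpha) * soft

# Distilling from an intermediate checkpoint only changes which teacher produces
# `teacher_logits`; the loss itself is unchanged.
```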
In the standard ensemble KD setting, there are $M$ ($M \geq 2$) full teachers $T^{full}_1, T^{full}_2, \ldots, T^{full}_M$, which have the same network structure and training strategy but different initial parameters. The student model needs to mimic the average softened softmax output of all teacher models. We call this method Full Ensemble KD. The loss function is as follows:
$$L_{EKD} = \alpha H(Y_{true}, P_S) + (1-\alpha)\, H\Big(\frac{1}{M}\sum_{i=1}^{M} P^{\tau}_{T^{full}_i},\; P^{\tau}_S\Big). \tag{2}$$
The Snapshot Ensemble aggregates $M$ ($M \geq 2$) intermediate teachers $T^{inter}_1, T^{inter}_2, \ldots, T^{inter}_M$ from one training trajectory. Conducting KD with a Snapshot Ensemble means using $T^{inter}_i$ instead of $T^{full}_i$ in Eq. (2). We call it Snapshot Ensemble KD.
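Under the same assumptions as the sketch after Eq. (1), Eq. (2) can be written as follows; `teacher_logits_list` would hold the outputs of $M$ independently trained full teachers for Full Ensemble KD, or of $M$ checkpoints from one trajectory for Snapshot Ensemble KD.

```python
from typing import List

import torch
import torch.nn.functional as F

def ensemble_kd_loss(student_logits: torch.Tensor,
                     teacher_logits_list: List[torch.Tensor],
                     targets: torch.Tensor,
                     alpha: float = 0.5,
                     tau: float = 4.0) -> torch.Tensor:
    """Eq. (2): the student mimics the average softened softmax output of M teachers."""
    hard = F.cross_entropy(student_logits, targets)
    # Average the temperature-softened teacher distributions (the (1/M) * sum term).
    p_t_mean = torch.stack(
        [F.softmax(t / tau, dim=1) for t in teacher_logits_list]
    ).mean(dim=0)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    soft = -(p_t_mean * log_p_s).sum(dim=1).mean()
    return alpha * hard + (1 - alpha) * soft
```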
3.2 Experimental design and setups
To examine the common assumption that “high-performing teachers lead to better student models” and to explore the value of intermediate models, we design two experiments. 1) Standard KD trains a full teacher model to distill a student model; what if we adopt intermediate teacher models instead? 2) Standard ensemble KD trains multiple full teacher models independently and averages their outputs to distill a single student model; what if we adopt the Snapshot Ensemble instead of the Full Ensemble? We name the first experiment “Intermediate Teacher vs. Full Teacher” and the second “Snapshot Ensemble vs. Full Ensemble”. For generality, we conduct experiments on the CIFAR-100 [36], Tiny-ImageNet [37], and ImageNet [38] datasets with various teacher-student pairs. The distillation loss functions follow Eqs. (1) and (2). For a fair comparison, we search for the optimal hyperparameters (i.e., the loss ratio $\alpha$ and the temperature $\tau$) for each teacher-student pair. Top-1 accuracy is averaged over five independent runs.
The “Intermediate Teacher vs. Full Teacher” experiment is conducted on CIFAR-100 and ImageNet. On CIFAR-100, we adopt WRN-40-2 [39] and ResNet-110 [40] as teacher models, and WRN-40-1 [39], ResNet-32 [40], and MobileNetV2 [41] (width multiplier 0.75) as student models. We train each teacher model for 200 epochs to ensure convergence. We save the intermediate models at the 20th, 40th, ..., 180th epochs as intermediate teachers, and the models at the 200th epoch as full teachers. On ImageNet, we adopt ResNet-50 [40] and ResNet-34 [40] as teacher models, and MobileNetV2 [41] and ResNet-18 [40] as student models. We follow the standard PyTorch practice but train teacher models for 120 epochs to guarantee convergence. We save the intermediate models at the 60th epoch as intermediate teachers, and the models at the 120th epoch as full teachers. The “Snapshot Ensemble vs. Full Ensemble” experiment is conducted on CIFAR-100 and Tiny-ImageNet. We train models for 150 epochs on Tiny-ImageNet to ensure convergence. We save the intermediate models at the 75th epoch as intermediate teachers, and the models at the 150th epoch as full teachers. We adopt WRN-40-1 [39], ResNet-32 [40], and MobileNetV2 [41] as student models, and WRN-40-2 [39] and ResNet-110 [40] as teacher models. Due to the page limit, we include the introduction of the datasets and detailed experimental settings in Appendix A.1.
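Collecting the intermediate teachers amounts to ordinary periodic checkpointing during the teacher's training run. The sketch below is illustrative only: `train_one_epoch` and the checkpoint path format are placeholders for whatever training code and storage layout are actually used.

```python
from typing import Callable

import torch

def train_teacher_with_checkpoints(model: torch.nn.Module,
                                   optimizer: torch.optim.Optimizer,
                                   train_one_epoch: Callable[[torch.nn.Module, torch.optim.Optimizer], None],
                                   num_epochs: int = 200,
                                   save_every: int = 20,
                                   path_fmt: str = "teacher_epoch_{}.pth") -> None:
    """Train a teacher and periodically save checkpoints for later use as intermediate teachers.

    With num_epochs=200 and save_every=20 this yields the epoch-20, 40, ..., 180
    intermediate teachers and the epoch-200 full teacher described above for CIFAR-100.
    """
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(model, optimizer)  # one epoch of standard supervised training
        if epoch % save_every == 0:
            torch.save(model.state_dict(), path_fmt.format(epoch))
```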
3.3 Intermediate Teacher vs. Full Teacher
Figure 2: Ablation experiments on the training epochs of $T^{inter}$ on CIFAR-100 (test accuracy of the student vs. training epochs of the teacher, for the WRN-40-2/WRN-40-1, ResNet-110/ResNet-32, WRN-40-2/MobileNetV2, and ResNet-110/MobileNetV2 teacher/student pairs).
Firstly, we simply compare the half-way teachers with the full teachers on CIFAR-100 and ImageNet. That is, the intermediate models at the 100th epoch are adopted as $T^{inter}$ on CIFAR-100, and the intermediate models at the 60th epoch are adopted as $T^{inter}$ on ImageNet. The training cost of all intermediate teachers is only half that of the full teachers. Table 1 shows the comparison results. Specifically, on CIFAR-100, for WRN-40-2, the accuracy of the intermediate model is 13.54% lower than that of the full model, but its distillation performance is comparable (0.08% higher) and superior (0.96% higher). For ResNet-110, the accuracy of the intermediate model is 13.98% lower than that of the full model, but its distillation performance is still comparable (0.01% higher and 0.16% higher). On ImageNet, the accuracy of the inter-