Respecting Transfer Gap in Knowledge Distillation
Yulei Niu1, Long Chen1, Chang Zhou2, Hanwang Zhang3
1Columbia University 2Damo Academy, Alibaba Group 3Nanyang Technological University
{yn.yuleiniu,zjuchenlong}@gmail.com zhouchang.zc@alibaba-inc.com
hanwangzhang@ntu.edu.sg
Abstract
Knowledge distillation (KD) is essentially a process of transferring a teacher model's behavior, e.g., network response, to a student model. The network response serves as additional supervision to formulate the machine domain (machine for short), which uses the data collected from the human domain (human for short) as a transfer set. Traditional KD methods hold an underlying assumption that the data collected in the human domain and in the machine domain are both independent and identically distributed (IID). We point out that this naïve assumption is unrealistic and that there is indeed a transfer gap between the two domains. Although the gap offers the student model external knowledge from the machine domain, the imbalanced teacher knowledge would make us incorrectly estimate how much to transfer from teacher to student per sample on the non-IID transfer set. To tackle this challenge, we propose Inverse Probability Weighting Distillation (IPWD), which estimates the propensity score of a training sample belonging to the machine domain and assigns its inverse amount to compensate for under-represented samples. Experiments on CIFAR-100 and ImageNet demonstrate the effectiveness of IPWD for both two-stage distillation and one-stage self-distillation.
1 Introduction
Knowledge distillation (KD) [21] transfers knowledge from a teacher model, e.g., a big, cumbersome, and energy-inefficient network, to a student model, e.g., a small, light, and energy-efficient network, to improve the performance of the student model. A common intuition is that a teacher with better performance will teach a stronger student. However, recent studies find that the teacher's accuracy is not a good indicator of the resultant student performance [8]. For example, a poorly-trained teacher with early stopping can still teach a better student [8, 11, 77]; a teacher with a smaller model size than the student can also be a good teacher [77]; and a teacher with the same architecture as the student helps to improve the student, i.e., self-distillation [13, 82, 81, 27].
If we view KD from the perspective of domain transfer [12, 63], we can better understand the above counter-intuitive findings. From Figure 1, we can see that teacher predictions and ground-truth labels indeed behave differently. Although the teacher is trained on a balanced dataset, its predicted probability distribution over the dataset is imbalanced. Even on the same training set with the same model parameters, teachers with different temperatures τ yield "soft label" distributions that differ from the ground-truth ones. This implies that human and teacher knowledge come from different domains, and there is a transfer gap that drives the "dark knowledge" [21] transfer from teacher to student; regardless of "strong" or "weak" teachers, the transfer is valid as long as there is a gap.
However, the transfer gap affects the distillation performance on the under-represented classes, i.e., classes on the tail of the teacher's predictions, which is overlooked in recent studies. Take CIFAR-100 as an example. We rank and divide the 100 classes into 4 groups according to the ranks of predicted probability.
Work done when Yulei was at Nanyang Technological University.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Figure 1: Illustration of the distribution discrepancies between ground-truth annotations and teacher predictions. Left: CIFAR-100 with a ResNet-110 teacher; right: ImageNet with a ResNet-50 teacher. Each panel plots the average predicted probability multiplied by the total number of samples against the ranked class index. Although the teacher model is trained on balanced data (blue dashed line), its prediction distributions are imbalanced under various temperatures.
As shown in Table 1, compared to vanilla training, KD achieves better performance in all subgroups. However, the improvement on the top 25 classes is much larger than that on the last 25 classes, i.e., 5.14% vs. 0.85% on average. We ask: what causes the gap in the first place; or, more specifically, why do the teacher's non-uniformly distributed predictions imply the gap? We answer from an invariance vs. equivariance learning point of view [4, 69]:
Table 1: Improvement of KD over the vanilla student for different class groups. The metric is macro-average recall.

Teacher -> Student           Top 1-25   Top 26-50   Top 51-75   Top 76-100
ResNet50 -> MobileNetV2        +4.96       +5.92       +1.76       +1.20
resnet32x4 -> ShuffleNetV1     +5.80       +2.68       +2.52       +0.84
resnet32x4 -> ShuffleNetV2     +4.72       +1.92       +2.24       +0.76
WRN-40-2 -> ShuffleNetV1       +5.08       +7.20       +4.48       +0.60
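To make the grouping above concrete, the group-wise metric can be computed as in the following sketch. It assumes predicted labels, ground-truth labels, and a ranking of classes by the teacher's average predicted probability are already available; the function name and arguments are illustrative rather than the paper's code.

```python
import numpy as np

def grouped_macro_recall(preds, labels, class_order, group_size=25):
    """Macro-average recall within consecutive groups of ranked classes.

    `class_order` lists class indices sorted by the teacher's average predicted
    probability (the ranking used in Table 1). Assumes every class appears at
    least once in `labels`.
    """
    preds, labels = np.asarray(preds), np.asarray(labels)
    # Per-class recall, listed in ranked order.
    recalls = np.array([(preds[labels == c] == c).mean() for c in class_order])
    # Macro-average recall of each group of `group_size` ranked classes.
    return [float(recalls[i:i + group_size].mean())
            for i in range(0, len(class_order), group_size)]
```

The numbers reported in Table 1 are then the differences of these group-wise recalls between the KD-trained student and the vanilla student.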
Human domain: context invariance. Discriminative generalization is the ability to learn both context-invariant and class-equivariant information from the diverse training samples per class. The human domain only provides context-invariant class-specific information, i.e., hard targets. We normally collect a balanced dataset to formulate the human domain.
Machine domain: context equivariance. Teacher models often use a temperature variable to preserve the context. The temperature allows the teacher to represent a sample not only by its context-invariant class-specific information, but also by its context-equivariant information. For example, a dog image with soft label 0.8·dog + 0.2·wolf may imply that the dog has wolf-like contextual attributes such as "fluffy coat" and "upright ears". Although the context-invariance (i.e., class) is balanced in the training data, the context-equivariance (i.e., context) is imbalanced, because context balance is not considered in class-specific data collection [67]. To construct the transfer set for the machine domain, the teacher model annotates each sample after seeing the others, i.e., after being pre-trained on the whole set. Interestingly, the diverse context results in a long-tailed imbalanced distribution, which is exactly what Figure 1 reflects. In other words, the teacher's knowledge is imbalanced even though the teacher is trained on a class-balanced dataset.
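As a rough sketch of how the imbalance in Figure 1 can be measured, the snippet below accumulates the teacher's softened probability mass per class over the training set. It assumes a trained teacher model and a data loader over the class-balanced training set are available; the function and its arguments are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_class_mass(teacher, loader, num_classes, tau=4.0, device="cpu"):
    """Per-class mass of the teacher's softened predictions over the training
    set, i.e., average probability per class times the number of samples
    (the quantity plotted in Figure 1)."""
    teacher.eval().to(device)
    mass = torch.zeros(num_classes, device=device)
    for x, _ in loader:                       # ground-truth labels are ignored
        p = F.softmax(teacher(x.to(device)) / tau, dim=1)
        mass += p.sum(dim=0)                  # accumulate probability mass per class
    # Sorting by rank reveals the long-tailed shape of the machine domain.
    return mass.sort(descending=True).values
```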
Now we are ready to point out how the transfer gap is not properly addressed in conventional KD methods. Conventional KD calculates the Cross-Entropy (CE) loss between the ground-truth label and the student's prediction, and the Kullback–Leibler (KL) divergence [33] loss between the teacher's and the student's predictions, where a constant weight is assigned to the two losses. This is essentially based on the underlying assumption that the data in both the human and machine domains are IID. Based on the analysis of context equivariance, we argue that this assumption is unrealistic, i.e., the teacher's knowledge is imbalanced. Therefore, a constant sample weight for the KL loss becomes a bottleneck. In this paper, we propose a simple yet effective method, Inverse Probability Weighting Distillation (IPWD), which compensates for the training samples that are under-weighted in the machine domain. For each training sample x, we first estimate its machine-domain propensity score P(x|machine) by comparing class-aware and context-aware predictions. A sample with a low propensity score would have a high confidence under class-aware predictions and a low confidence under context-aware predictions. Then, IPWD assigns the inverse probability 1/P(x|machine) as the sample weight for the KL loss to highlight the under-represented samples. In this way, IPWD generates a pseudo-population [37, 26] to deal with the imbalanced knowledge.
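For intuition, the weighting step can be sketched as below. The propensity estimate P(x|machine) is treated as a given input (the comparison of class-aware and context-aware predictions mentioned above is not reproduced here), and the clamping and mean-normalization are illustrative design choices rather than the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def ipw_distill_loss(z_s, z_t, propensity, tau=4.0):
    """Distillation loss with per-sample inverse-probability weights (sketch).

    z_s, z_t: student / teacher logits of shape (B, C);
    propensity: estimated P(x | machine) per sample, shape (B,).
    """
    log_p_s = F.log_softmax(z_s / tau, dim=1)
    p_t = F.softmax(z_t / tau, dim=1)
    # Per-sample tau^2-scaled KL(p_t || p_s), as in vanilla KD.
    kl = (p_t * (torch.log(p_t + 1e-12) - log_p_s)).sum(dim=1) * tau ** 2

    # Under-represented samples (low propensity) receive larger weights;
    # normalizing by the mean keeps the loss on a scale comparable to the
    # unweighted version (one possible choice, assumed here for illustration).
    w = 1.0 / propensity.clamp_min(1e-6)
    w = w / w.mean()
    return (w * kl).mean()
```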
We evaluate our proposed IPWD on two typical knowledge distillation settings: two-stage teacher-student distillation and one-stage self-distillation. Experiments conducted on CIFAR-100 [32] and ImageNet [10] demonstrate the effectiveness and generality of our IPWD.
Our contributions are three-fold:
• We formulate KD as a domain transfer problem and argue that the naïve IID assumption on the machine domain neglects the imbalanced knowledge caused by the transfer gap.
• We propose Inverse Probability Weighting Distillation (IPWD), which compensates for the samples that are overlooked in the machine domain to tackle the imbalanced knowledge brought by the transfer gap.
• Experiments on CIFAR-100 and ImageNet, for both two-stage distillation and one-stage self-distillation, show that properly handling the transfer gap is a promising direction in KD.
2 Related Work
Knowledge distillation (KD) was first introduced to transfer the knowledge from an effective but cumbersome model to a smaller and more efficient model [21]. The knowledge can be formulated in either the output space [21, 28, 35, 78, 77, 43, 61, 85, 31] or the representation space [54, 25, 79, 30, 50, 19, 66, 7, 27]. KD has attracted wide interest in theory, methodology, and applications [15]. For applications, KD has shown great potential in various areas, including but not limited to classification [36, 53, 39, 23], detection [34, 59, 70], and segmentation [18, 44, 38] for visual recognition tasks, and visual question answering [46, 1, 48], video captioning [49, 84], and text-to-image synthesis [64] for vision-language tasks. Recent studies further discussed how and why KD works. Specifically, Müller et al. [45] and Shen et al. [58] empirically analyzed the effect of label smoothing on KD. Cho et al. [8], Dong et al. [11], and Yuan et al. [77] pointed out that early stopping is a good regularization for a better teacher. Yuan et al. [77] further found that a poorly trained teacher, even a model smaller than the student, can improve the performance of the student. Besides, Memon et al. [41] and Zhou et al. [85] proposed a bias-variance trade-off perspective on KD. In this paper, we point out that existing KD methods hold an underlying assumption that the IID training samples are also IID in the machine domain, which overlooks the transfer gap.
Self-distillation is a special case of KD that uses the student network itself as the teacher instead of a cumbersome model, i.e., the teacher and student models have the same architecture [13, 82, 81, 27]. This process can be executed in iterations and produce a stronger ensemble model [13]. Similar to KD, traditional self-distillation follows a two-stage process: first pre-training a student model as the teacher, and then distilling the knowledge from the pre-trained model to a new student model. In order to perform the teacher-student optimization in one generation, recent studies [75, 31] proposed one-stage self-distillation, which adopts student models at earlier epochs as teacher models. These one-stage self-distillation methods outperform vanilla students by large margins. In this paper, we also evaluate the effectiveness of our IPWD as a plug-in for one-stage self-distillation.
Inverse Probability Weighting (IPW) [55, 37, 26, 5], also known as inverse probability of treatment weighting or inverse propensity weighting, was proposed to correct the selection bias when the observations are non-IID. IPW reweights each sample by the inverse of the probability (i.e., the propensity score) that the individual would be assigned to the treatment group. Propensity-weighting techniques have been widely applied and studied in many areas [57], such as causal inference [26], complete-case analysis [37], machine learning [9, 6, 62], and recommendation systems [57, 72, 3]. In this paper, we view the distillation process as a domain transfer problem and adopt IPW to dynamically assign the weight to each training sample for the distillation loss.
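As a self-contained toy illustration of the IPW idea (not the estimator used in this paper), the snippet below shows how a confounded treatment assignment biases the naive difference of means, while reweighting by the inverse of the (here, known) propensity recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A confounder z influences both treatment assignment and the outcome.
z = rng.normal(size=n)
p_treat = 1.0 / (1.0 + np.exp(-z))               # true propensity P(T=1 | z)
t = rng.binomial(1, p_treat)                      # observed (biased) assignment
y = 2.0 * t + z + rng.normal(scale=0.1, size=n)   # outcome; true effect = 2.0

# Naive difference of means is biased because z is imbalanced across groups.
naive = y[t == 1].mean() - y[t == 0].mean()

# IPW: weight each sample by the inverse probability of the treatment it
# received, creating a pseudo-population where treatment is independent of z.
w = t / p_treat + (1 - t) / (1 - p_treat)
ipw = (np.sum(w * t * y) / np.sum(w * t)
       - np.sum(w * (1 - t) * y) / np.sum(w * (1 - t)))

print(f"naive: {naive:.2f}, IPW: {ipw:.2f} (true effect = 2.0)")
```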
3 Analysis
3.1 Knowledge Distillation (KD)
We view knowledge distillation from a perspective of domain transfer, and take the image classification task as the case study. Suppose that the training data $\mathcal{D} = \{\mathcal{X}, \mathcal{Y}\} = \{(x, y)\}$ contains $x$ as the input (e.g., image) and $y \in \mathbb{R}^C$ as its ground-truth annotation (e.g., one-hot label), where $C$ denotes the number of classes. A standard solution to train the classifier $\theta$ uses the cross-entropy loss as the objective:
$$\mathcal{L}_{cls}(\text{human}; \theta) = \mathbb{E}_{(x,y)\sim P_{\text{human}}}\left[\ell_{cls}(x, y; \theta)\right] \approx \frac{1}{|\mathcal{D}|}\sum_{(x,y)\in\mathcal{D}} \ell_{cls}(x, y; \theta) \triangleq \mathcal{L}_{cls}(\mathcal{D}; \theta), \tag{1}$$
where $\ell_{cls}(x, y) = H(y^s, y)$ is the classification loss for sample $x$, $H(p, q) = -\sum_{i=1}^{C} q_i \log p_i$ denotes the cross entropy between $p$ and $q$, and $y^s = f(x; \theta)$ denotes the model's output probability given $x$, i.e., $y^s_k = \frac{\exp(z^s_k)}{\sum_{i=1}^{C}\exp(z^s_i)}$, where $z^s$ are the output logits of the model. The hard targets provide context-invariant class-specific information from the human domain. An assumption held behind Eq. (1) is that the samples are independent and identically distributed (IID) in the training and test sets.
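As a quick sanity check of Eq. (1) (a minimal sketch; the batch size and class count are arbitrary), the per-sample loss $H(y^s, y)$ with one-hot targets is exactly the standard cross-entropy on logits:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
z_s = torch.randn(8, 100)            # student logits: batch of 8, C = 100
y = torch.randint(0, 100, (8,))      # hard targets from the human domain

y_s = F.softmax(z_s, dim=1)          # y^s_k = exp(z^s_k) / sum_i exp(z^s_i)
y_onehot = F.one_hot(y, num_classes=100).float()
l_cls = -(y_onehot * y_s.log()).sum(dim=1).mean()   # H(y^s, y), batch average

# Matches PyTorch's built-in cross-entropy, which operates on logits directly.
assert torch.allclose(l_cls, F.cross_entropy(z_s, y), atol=1e-6)
```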
KD adopts a teacher model $\theta^t$ to generate soft targets as extra supervision, i.e., context-equivariant information. To formulate the machine domain, traditional KD methods commonly use the training set $\mathcal{D}$ to construct the transfer set $\mathcal{D}^t$ using the same copy of $\mathcal{X}$, i.e., $\mathcal{D}^t = \{(x, y^t)\}$, where $y^t = f(x; \theta^t)$ and $x \in \mathcal{X}$. Traditional KD approaches use the KL divergence [33] loss for knowledge transfer:
$$\mathcal{L}_{dist}(\text{machine}; \theta) = \mathbb{E}_{(x,y)\sim P_{\text{machine}}}\left[\ell_{dist}(x, y; \theta)\right] \approx \frac{1}{|\mathcal{D}^t|}\sum_{(x,y^t)\in\mathcal{D}^t} \ell_{dist}(x, y^t; \theta) \triangleq \mathcal{L}_{dist}(\mathcal{D}^t; \theta), \tag{2}$$
where $\ell_{dist}(x, y^t; \theta) = \tau^2\cdot\left[H(y^s_\tau, y^t_\tau) - H(y^t_\tau, y^t_\tau)\right]$ denotes the distillation loss for sample $x$. Normally, the outputs of the student and teacher are softened using a temperature $\tau$, i.e., $y^s_{\tau,k} = \frac{\exp(z^s_k/\tau)}{\sum_{i=1}^{C}\exp(z^s_i/\tau)}$ and $y^t_{\tau,k} = \frac{\exp(z^t_k/\tau)}{\sum_{i=1}^{C}\exp(z^t_i/\tau)}$. The overall objective combines $\mathcal{L}_{cls}$ and $\mathcal{L}_{dist}$ as:
$$\mathcal{L}_{kd} = \alpha\cdot\mathcal{L}_{cls} + \beta\cdot\mathcal{L}_{dist}, \tag{3}$$
where $\alpha$ and $\beta$ are hyper-parameters. The underlying assumption of traditional KD behind Eq. (2) is that the transfer set $\mathcal{D}^t$ is an unbiased approximation of the machine domain. However, the observed long-tailed and temperature-sensitive distributions of the teacher's predictions in Figure 1 rationally challenge this assumption. As a result, samples with lower $P(x\,|\,\text{machine})$ are under-represented during the distillation process, which affects the unbiasedness of knowledge transfer. This analysis indicates that Eq. (2) is not optimal for utilizing the teacher's imbalanced knowledge.
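For reference, Eqs. (1)-(3) correspond to the following standard KD objective. This is a sketch; the values of `tau`, `alpha`, and `beta` are placeholders rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def kd_loss(z_s, z_t, y, tau=4.0, alpha=1.0, beta=1.0):
    """Vanilla KD objective of Eq. (3): alpha * L_cls + beta * L_dist.

    z_s, z_t: student and teacher logits of shape (B, C); y: labels of shape (B,).
    """
    # Eq. (1): cross-entropy with the hard targets (human domain).
    l_cls = F.cross_entropy(z_s, y)

    # Eq. (2): tau^2-scaled KL between temperature-softened teacher and student
    # predictions (machine domain), averaged with a constant per-sample weight.
    log_p_s = F.log_softmax(z_s / tau, dim=1)
    p_t = F.softmax(z_t / tau, dim=1)
    l_dist = F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau

    return alpha * l_cls + beta * l_dist
```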
3.2 Transfer Gap in KD
Figure 2: Causal graph for KD, over the image $X$, the training data $\mathcal{D}$, the teacher's parameters $\theta^t$, the teacher's output $Y^t$, and the transfer set $\mathcal{D}^t$.
We interpret the transfer gap and its confounding effect from the perspective of causal inference. Figure 2 illustrates the causal relations between the image $X$, the training data $\mathcal{D} = \{(x, y)\}$, the teacher's parameters $\theta^t$, and the teacher's output $Y^t$ in KD. Overall, $\mathcal{D}$ and $\theta^t$ jointly act as the confounder of $X$ and $Y^t$ in the transfer set. First, the training set $\mathcal{D}$ and the teacher's transfer set $\mathcal{D}^t = \{(x, y^t)\}$ share the same image set, and $X = x$ is sampled from the image set of $\mathcal{D}$, i.e., $\mathcal{D}$ serves as the cause of $X$. Second, the teacher $\theta^t$ is trained on $\mathcal{D}$, and $y^t$ is calculated based on $\theta^t$ and $x$, i.e., $y^t = f(x; \theta^t)$. Therefore, $X$ and $\theta^t$ are the causes of $Y^t$. Note that the transfer set is constructed based on the images in $\mathcal{D}$ and the teacher model $\theta^t$. Therefore, we regard the transfer set $\mathcal{D}^t$, the joint of $\mathcal{D}$ and $\theta^t$, as the confounder of $X$ and $Y^t$.
Although $\mathcal{D}$ is balanced in terms of the context-invariant class-specific information, the context information (e.g., attributes) is overlooked, which makes $\mathcal{D}$ imbalanced in terms of context. As shown in Figure 1, such imbalanced context leads to an imbalanced transfer set $\mathcal{D}^t$ and further affects the distillation performance of the teacher's knowledge.
To overcome such a confounding effect, a commonly used technique is intervention via $P(y^t\,|\,do(x))$ instead of $P(y^t\,|\,x)$, which is formulated as
$$P(y^t\,|\,do(x)) = \sum_{\mathcal{D}^t} P(y^t\,|\,x, \mathcal{D}^t)\,P(\mathcal{D}^t) = \sum_{\mathcal{D}^t} \frac{P(x, y^t, \mathcal{D}^t)}{P(x\,|\,\mathcal{D}^t)}.$$
This transformation suggests that we can use the inverse of the propensity score, $1/P(x\,|\,\mathcal{D}^t)$, as the sample weight to implement the intervention and overcome the confounding effect. Thanks to causality-based theory [55, 5], we can use the Inverse Probability Weighting (IPW) technique to overcome the confounding effect brought by the transfer gap.
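The algebra above can be checked numerically on a toy discrete model with the same causal structure ($\mathcal{D}^t \rightarrow X$ and $(X, \mathcal{D}^t) \rightarrow Y^t$). This is only an illustration of the identity between backdoor adjustment and inverse-probability weighting, not part of the method itself.

```python
import numpy as np

rng = np.random.default_rng(0)
nD, nX, nY = 3, 4, 5                                       # toy cardinalities

P_D = rng.dirichlet(np.ones(nD))                           # P(D)
P_X_given_D = rng.dirichlet(np.ones(nX), size=nD)          # P(X | D), (nD, nX)
P_Y_given_XD = rng.dirichlet(np.ones(nY), size=(nD, nX))   # P(Y | X, D), (nD, nX, nY)

# Joint distribution P(x, y, D) = P(D) P(x | D) P(y | x, D).
P_joint = P_D[:, None, None] * P_X_given_D[:, :, None] * P_Y_given_XD

x = 1
# Backdoor adjustment: P(y | do(x)) = sum_D P(y | x, D) P(D).
p_do_adjust = (P_Y_given_XD[:, x, :] * P_D[:, None]).sum(axis=0)
# Equivalent IPW form:   P(y | do(x)) = sum_D P(x, y, D) / P(x | D).
p_do_ipw = (P_joint[:, x, :] / P_X_given_D[:, x, None]).sum(axis=0)

assert np.allclose(p_do_adjust, p_do_ipw)
```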