Respecting Transfer Gap in Knowledge Distillation
Yulei Niu1, Long Chen1, Chang Zhou2, Hanwang Zhang3
1Columbia University 2Damo Academy, Alibaba Group 3Nanyang Technological University
{yn.yuleiniu,zjuchenlong}@gmail.com zhouchang.zc@alibaba-inc.com
hanwangzhang@ntu.edu.sg
Abstract
Knowledge distillation (KD) is essentially a process of transferring a teacher model's behavior, e.g., network response, to a student model. The network response serves as additional supervision to formulate the machine domain (machine for short), which uses the data collected from the human domain (human for short) as a transfer set. Traditional KD methods hold an underlying assumption that the data collected in the human domain and in the machine domain are both independent and identically distributed (IID). We point out that this naïve assumption is unrealistic and that there is indeed a transfer gap between the two domains. Although the gap offers the student model external knowledge from the machine domain, the imbalanced teacher knowledge would make us incorrectly estimate how much to transfer from teacher to student per sample on the non-IID transfer set. To tackle this challenge, we propose Inverse Probability Weighting Distillation (IPWD), which estimates the propensity score of a training sample belonging to the machine domain and assigns its inverse amount to compensate for under-represented samples. Experiments on CIFAR-100 and ImageNet demonstrate the effectiveness of IPWD for both two-stage distillation and one-stage self-distillation.
1 Introduction
Knowledge distillation (KD) [21] transfers knowledge from a teacher model, e.g., a big, cumbersome, and energy-inefficient network, to a student model, e.g., a small, light, and energy-efficient network, to improve the performance of the student model. A common intuition is that a teacher with better performance will teach a stronger student. However, recent studies find that the teacher's accuracy is not a good indicator of the resultant student performance [8]. For example, a poorly-trained teacher with early stopping can still teach a better student [8, 11, 77]; a teacher with a smaller model size than the student can also be a good teacher [77]; and a teacher with the same architecture as the student helps to improve the student, i.e., self-distillation [13, 82, 81, 27].
If we view KD from the perspective of domain transfer [12, 63], we can better understand the above counter-intuitive findings. From Figure 1, we can see that teacher predictions and ground-truth labels indeed behave differently. Although the teacher is trained on a balanced dataset, its predicted probability distribution over the dataset is imbalanced. Even on the same training set with the same model parameters, teachers with different temperatures τ yield "soft label" distributions that differ from the ground-truth ones. This implies that human and teacher knowledge come from different domains, and there is a transfer gap that drives the "dark knowledge" [21] transfer from teacher to student; regardless of "strong" or "weak" teachers, the transfer is valid as long as there is a gap.
However, the transfer gap affects the distillation performance on the under-represented classes, i.e., classes on the tail of the teacher's predictions, which is overlooked in recent studies. Take CIFAR-100 as an example. We rank and divide the 100 classes into 4 groups according to the ranks of predicted probability.
Work done when Yulei was at Nanyang Technological University.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Figure 1: Illustration of the distribution discrepancies between ground-truth annotations and teacher predictions. Left: CIFAR-100 with a ResNet-110 teacher; right: ImageNet with a ResNet-50 teacher. Each panel plots the average predicted probability multiplied by the total number of samples against the ranked class index. Although the teacher model is trained on balanced data (blue dashed line), its prediction distributions are imbalanced under various temperatures.
As shown in Table 1, compared to vanilla training, KD achieves better performance in all subgroups. However, the improvement on the top 25 classes is much larger than that on the last 25 classes, i.e., 5.14% vs. 0.85% on average. We ask: what causes the gap in the first place; or, more specifically, why do the teacher's non-uniformly distributed predictions imply the gap? We answer from an invariance vs. equivariance learning point of view [4, 69]:
Table 1: Improvement of KD over the vanilla student for different class groups. The metric is macro-average recall.

Teacher -> Student           Top 1-25   Top 26-50   Top 51-75   Top 76-100
ResNet50 -> MobileNetV2        +4.96       +5.92       +1.76       +1.20
resnet32x4 -> ShuffleNetV1     +5.80       +2.68       +2.52       +0.84
resnet32x4 -> ShuffleNetV2     +4.72       +1.92       +2.24       +0.76
WRN-40-2 -> ShuffleNetV1       +5.08       +7.20       +4.48       +0.60
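To make the grouping above concrete, the group-wise metric can be computed as in the following sketch. It assumes predicted labels, ground-truth labels, and a ranking of classes by the teacher's average predicted probability are already available; the function name and arguments are illustrative rather than the paper's code.

```python
import numpy as np

def grouped_macro_recall(preds, labels, class_order, group_size=25):
    """Macro-average recall within consecutive groups of ranked classes.

    `class_order` lists class indices sorted by the teacher's average predicted
    probability (the ranking used in Table 1). Assumes every class appears at
    least once in `labels`.
    """
    preds, labels = np.asarray(preds), np.asarray(labels)
    # Per-class recall, listed in ranked order.
    recalls = np.array([(preds[labels == c] == c).mean() for c in class_order])
    # Macro-average recall of each group of `group_size` ranked classes.
    return [float(recalls[i:i + group_size].mean())
            for i in range(0, len(class_order), group_size)]
```

The numbers reported in Table 1 are then the differences of these group-wise recalls between the KD-trained student and the vanilla student.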
Human domain: context invariance. Discriminative generalization is the ability to learn both context-invariant and class-equivariant information from the diverse training samples per class. The human domain only provides context-invariant class-specific information, i.e., hard targets. We normally collect a balanced dataset to formulate the human domain.
Machine domain: context equivariance. Teacher models often use a temperature variable to preserve the context. The temperature allows the teacher to represent a sample not only by its context-invariant class-specific information, but also by its context-equivariant information. For example, a dog image with soft label 0.8·dog + 0.2·wolf may imply that the dog has wolf-like contextual attributes such as "fluffy coat" and "upright ears". Although the context-invariance (i.e., class) is balanced in the training data, the context-equivariance (i.e., context) is imbalanced, because context balance is not considered in class-specific data collection [67]. To construct the transfer set for the machine domain, the teacher model annotates each sample after seeing the others, i.e., after being pre-trained on the whole set. Interestingly, the diverse context results in a long-tailed imbalanced distribution, which is exactly what Figure 1 reflects. In other words, the teacher's knowledge is imbalanced even though the teacher is trained on a class-balanced dataset.
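As a rough sketch of how the imbalance in Figure 1 can be measured, the snippet below accumulates the teacher's softened probability mass per class over the training set. It assumes a trained teacher model and a data loader over the class-balanced training set are available; the function and its arguments are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_class_mass(teacher, loader, num_classes, tau=4.0, device="cpu"):
    """Per-class mass of the teacher's softened predictions over the training
    set, i.e., average probability per class times the number of samples
    (the quantity plotted in Figure 1)."""
    teacher.eval().to(device)
    mass = torch.zeros(num_classes, device=device)
    for x, _ in loader:                       # ground-truth labels are ignored
        p = F.softmax(teacher(x.to(device)) / tau, dim=1)
        mass += p.sum(dim=0)                  # accumulate probability mass per class
    # Sorting by rank reveals the long-tailed shape of the machine domain.
    return mass.sort(descending=True).values
```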
Now we are ready to point out how the transfer gap is not properly addressed in conventional KD methods. Conventional KD calculates the Cross-Entropy (CE) loss between the ground-truth label and the student's prediction, and the Kullback–Leibler (KL) divergence [33] loss between the teacher's and the student's predictions, where a constant weight is assigned to the two losses. This is essentially based on the underlying assumption that the data in both the human and machine domains are IID. Based on the analysis of context equivariance, we argue that this assumption is unrealistic, i.e., the teacher's knowledge is imbalanced. Therefore, a constant sample weight for the KL loss becomes a bottleneck. In this paper, we propose a simple yet effective method, Inverse Probability Weighting Distillation (IPWD), which compensates for the training samples that are under-weighted in the machine domain. For each training sample x, we first estimate its machine-domain propensity score P(x|machine) by comparing class-aware and context-aware predictions. A sample with a low propensity score would have a high confidence under class-aware predictions and a low confidence under context-aware predictions. Then, IPWD assigns the inverse probability 1/P(x|machine) as the sample weight for the KL loss to highlight the under-represented samples. In this way, IPWD generates a pseudo-population [37, 26] to deal with the imbalanced knowledge.
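For intuition, the weighting step can be sketched as below. The propensity estimate P(x|machine) is treated as a given input (the comparison of class-aware and context-aware predictions mentioned above is not reproduced here), and the clamping and mean-normalization are illustrative design choices rather than the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def ipw_distill_loss(z_s, z_t, propensity, tau=4.0):
    """Distillation loss with per-sample inverse-probability weights (sketch).

    z_s, z_t: student / teacher logits of shape (B, C);
    propensity: estimated P(x | machine) per sample, shape (B,).
    """
    log_p_s = F.log_softmax(z_s / tau, dim=1)
    p_t = F.softmax(z_t / tau, dim=1)
    # Per-sample tau^2-scaled KL(p_t || p_s), as in vanilla KD.
    kl = (p_t * (torch.log(p_t + 1e-12) - log_p_s)).sum(dim=1) * tau ** 2

    # Under-represented samples (low propensity) receive larger weights;
    # normalizing by the mean keeps the loss on a scale comparable to the
    # unweighted version (one possible choice, assumed here for illustration).
    w = 1.0 / propensity.clamp_min(1e-6)
    w = w / w.mean()
    return (w * kl).mean()
```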
We evaluate our proposed IPWD on two typical knowledge distillation settings: two-stage teacher-student distillation and one-stage self-distillation. Experiments conducted on CIFAR-100 [32] and ImageNet [10] demonstrate the effectiveness and generality of our IPWD.
Our contributions are three-fold:
• We formulate KD as a domain transfer problem and argue that the naïve IID assumption on the machine domain neglects the imbalanced knowledge caused by the transfer gap.
• We propose Inverse Probability Weighting Distillation (IPWD), which compensates for the samples that are overlooked in the machine domain to tackle the imbalanced knowledge brought by the transfer gap.
• Experiments on CIFAR-100 and ImageNet, for both two-stage distillation and one-stage self-distillation, show that properly handling the transfer gap is a promising direction in KD.
2 Related Work
Knowledge distillation (KD) was first introduced to transfer the knowledge from an effective but cumbersome model to a smaller and more efficient model [21]. The knowledge can be formulated in either the output space [21, 28, 35, 78, 77, 43, 61, 85, 31] or the representation space [54, 25, 79, 30, 50, 19, 66, 7, 27]. KD has attracted wide interest in theory, methodology, and applications [15]. For applications, KD has shown great potential in various areas, including but not limited to classification [36, 53, 39, 23], detection [34, 59, 70], and segmentation [18, 44, 38] for visual recognition tasks, and visual question answering [46, 1, 48], video captioning [49, 84], and text-to-image synthesis [64] for vision-language tasks. Recent studies further discussed how and why KD works. Specifically, Müller et al. [45] and Shen et al. [58] empirically analyzed the effect of label smoothing on KD. Cho et al. [8], Dong et al. [11], and Yuan et al. [77] pointed out that early stopping is a good regularization for a better teacher. Yuan et al. [77] further found that a poorly trained teacher, even a model smaller than the student, can improve the performance of the student. Besides, Memon et al. [41] and Zhou et al. [85] proposed a bias-variance trade-off perspective on KD. In this paper, we point out that existing KD methods hold an underlying assumption that the IID training samples are also IID in the machine domain, which overlooks the transfer gap.
Self-distillation is a special case of KD that uses the student network itself as the teacher instead of a cumbersome model, i.e., the teacher and student models have the same architecture [13, 82, 81, 27]. This process can be executed in iterations and produce a stronger ensemble model [13]. Similar to KD, traditional self-distillation follows a two-stage process: first pre-training a student model as the teacher, and then distilling the knowledge from the pre-trained model to a new student model. In order to perform the teacher-student optimization in one generation, recent studies [75, 31] proposed one-stage self-distillation, which adopts student models at earlier epochs as teacher models. These one-stage self-distillation methods outperform vanilla students by large margins. In this paper, we also evaluate the effectiveness of our IPWD as a plug-in for one-stage self-distillation.
Inverse Probability Weighting (IPW) [55, 37, 26, 5], also known as inverse probability of treatment weighting or inverse propensity weighting, was proposed to correct the selection bias when the observations are non-IID. IPW reweights each sample by the inverse of the probability (i.e., the propensity score) that the individual would be assigned to the treatment group. Propensity-weighting techniques have been widely applied and studied in many areas [57], such as causal inference [26], complete-case analysis [37], machine learning [9, 6, 62], and recommendation systems [57, 72, 3]. In this paper, we view the distillation process as a domain transfer problem and adopt IPW to dynamically assign the weight to each training sample for the distillation loss.
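As a self-contained toy illustration of the IPW idea (not the estimator used in this paper), the snippet below shows how a confounded treatment assignment biases the naive difference of means, while reweighting by the inverse of the (here, known) propensity recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A confounder z influences both treatment assignment and the outcome.
z = rng.normal(size=n)
p_treat = 1.0 / (1.0 + np.exp(-z))               # true propensity P(T=1 | z)
t = rng.binomial(1, p_treat)                      # observed (biased) assignment
y = 2.0 * t + z + rng.normal(scale=0.1, size=n)   # outcome; true effect = 2.0

# Naive difference of means is biased because z is imbalanced across groups.
naive = y[t == 1].mean() - y[t == 0].mean()

# IPW: weight each sample by the inverse probability of the treatment it
# received, creating a pseudo-population where treatment is independent of z.
w = t / p_treat + (1 - t) / (1 - p_treat)
ipw = (np.sum(w * t * y) / np.sum(w * t)
       - np.sum(w * (1 - t) * y) / np.sum(w * (1 - t)))

print(f"naive: {naive:.2f}, IPW: {ipw:.2f} (true effect = 2.0)")
```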
3 Analysis
3.1 Knowledge Distillation (KD)
We view knowledge distillation from a perspective of domain transfer, and take the image classification task as the case study. Suppose that the training data $\mathcal{D} = \{\mathcal{X}, \mathcal{Y}\} = \{(x, y)\}$ contains $x$ as the input (e.g., image) and $y \in \mathbb{R}^C$ as its ground-truth annotation (e.g., one-hot label), where $C$ denotes the number of classes. A standard solution to train the classifier $\theta$ uses the cross-entropy loss as the objective:
$$\mathcal{L}_{cls}(\text{human}; \theta) = \mathbb{E}_{(x,y)\sim P_{\text{human}}}\left[\ell_{cls}(x, y; \theta)\right] \approx \frac{1}{|\mathcal{D}|}\sum_{(x,y)\in\mathcal{D}} \ell_{cls}(x, y; \theta) \triangleq \mathcal{L}_{cls}(\mathcal{D}; \theta), \tag{1}$$
where $\ell_{cls}(x, y) = H(y^s, y)$ is the classification loss for sample $x$, $H(p, q) = -\sum_{i=1}^{C} q_i \log p_i$ denotes the cross entropy between $p$ and $q$, and $y^s = f(x; \theta)$ denotes the model's output probability given $x$, i.e., $y^s_k = \frac{\exp(z^s_k)}{\sum_{i=1}^{C}\exp(z^s_i)}$, where $z^s$ are the output logits of the model. The hard targets provide context-invariant class-specific information from the human domain. An assumption held behind Eq. (1) is that the samples are independent and identically distributed (IID) in the training and test sets.
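As a quick sanity check of Eq. (1) (a minimal sketch; the batch size and class count are arbitrary), the per-sample loss $H(y^s, y)$ with one-hot targets is exactly the standard cross-entropy on logits:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
z_s = torch.randn(8, 100)            # student logits: batch of 8, C = 100
y = torch.randint(0, 100, (8,))      # hard targets from the human domain

y_s = F.softmax(z_s, dim=1)          # y^s_k = exp(z^s_k) / sum_i exp(z^s_i)
y_onehot = F.one_hot(y, num_classes=100).float()
l_cls = -(y_onehot * y_s.log()).sum(dim=1).mean()   # H(y^s, y), batch average

# Matches PyTorch's built-in cross-entropy, which operates on logits directly.
assert torch.allclose(l_cls, F.cross_entropy(z_s, y), atol=1e-6)
```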
KD adopts a teacher model $\theta^t$ to generate soft targets as extra supervision, i.e., context-equivariant information. To formulate the machine domain, traditional KD methods commonly use the training set $\mathcal{D}$ to construct the transfer set $\mathcal{D}^t$ using the same copy of $\mathcal{X}$, i.e., $\mathcal{D}^t = \{(x, y^t)\}$, where $y^t = f(x; \theta^t)$ and $x \in \mathcal{X}$. Traditional KD approaches use the KL divergence [33] loss for knowledge transfer:
$$\mathcal{L}_{dist}(\text{machine}; \theta) = \mathbb{E}_{(x,y)\sim P_{\text{machine}}}\left[\ell_{dist}(x, y; \theta)\right] \approx \frac{1}{|\mathcal{D}^t|}\sum_{(x,y^t)\in\mathcal{D}^t} \ell_{dist}(x, y^t; \theta) \triangleq \mathcal{L}_{dist}(\mathcal{D}^t; \theta), \tag{2}$$
where $\ell_{dist}(x, y^t; \theta) = \tau^2\cdot\left[H(y^s_\tau, y^t_\tau) - H(y^t_\tau, y^t_\tau)\right]$ denotes the distillation loss for sample $x$. Normally, the outputs of the student and teacher are softened using a temperature $\tau$, i.e., $y^s_{\tau,k} = \frac{\exp(z^s_k/\tau)}{\sum_{i=1}^{C}\exp(z^s_i/\tau)}$ and $y^t_{\tau,k} = \frac{\exp(z^t_k/\tau)}{\sum_{i=1}^{C}\exp(z^t_i/\tau)}$. The overall objective combines $\mathcal{L}_{cls}$ and $\mathcal{L}_{dist}$ as:
$$\mathcal{L}_{kd} = \alpha\cdot\mathcal{L}_{cls} + \beta\cdot\mathcal{L}_{dist}, \tag{3}$$
where $\alpha$ and $\beta$ are hyper-parameters. The underlying assumption of traditional KD behind Eq. (2) is that the transfer set $\mathcal{D}^t$ is an unbiased approximation of the machine domain. However, the observed long-tailed and temperature-sensitive distributions of the teacher's predictions in Figure 1 rationally challenge this assumption. As a result, samples with lower $P(x\,|\,\text{machine})$ are under-represented during the distillation process, which affects the unbiasedness of knowledge transfer. This analysis indicates that Eq. (2) is not optimal for utilizing the teacher's imbalanced knowledge.
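For reference, Eqs. (1)-(3) correspond to the following standard KD objective. This is a sketch; the values of `tau`, `alpha`, and `beta` are placeholders rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def kd_loss(z_s, z_t, y, tau=4.0, alpha=1.0, beta=1.0):
    """Vanilla KD objective of Eq. (3): alpha * L_cls + beta * L_dist.

    z_s, z_t: student and teacher logits of shape (B, C); y: labels of shape (B,).
    """
    # Eq. (1): cross-entropy with the hard targets (human domain).
    l_cls = F.cross_entropy(z_s, y)

    # Eq. (2): tau^2-scaled KL between temperature-softened teacher and student
    # predictions (machine domain), averaged with a constant per-sample weight.
    log_p_s = F.log_softmax(z_s / tau, dim=1)
    p_t = F.softmax(z_t / tau, dim=1)
    l_dist = F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau

    return alpha * l_cls + beta * l_dist
```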
3.2 Transfer Gap in KD
Figure 2: Causal graph for KD, over the image $X$, the training data $\mathcal{D}$, the teacher's parameters $\theta^t$, the teacher's output $Y^t$, and the transfer set $\mathcal{D}^t$.
We interpret the transfer gap and its confounding effect from the perspective of causal inference. Figure 2 illustrates the causal relations between the image $X$, the training data $\mathcal{D} = \{(x, y)\}$, the teacher's parameters $\theta^t$, and the teacher's output $Y^t$ in KD. Overall, $\mathcal{D}$ and $\theta^t$ jointly act as the confounder of $X$ and $Y^t$ in the transfer set. First, the training set $\mathcal{D}$ and the teacher's transfer set $\mathcal{D}^t = \{(x, y^t)\}$ share the same image set, and $X = x$ is sampled from the image set of $\mathcal{D}$, i.e., $\mathcal{D}$ serves as the cause of $X$. Second, the teacher $\theta^t$ is trained on $\mathcal{D}$, and $y^t$ is calculated based on $\theta^t$ and $x$, i.e., $y^t = f(x; \theta^t)$. Therefore, $X$ and $\theta^t$ are the causes of $Y^t$. Note that the transfer set is constructed based on the images in $\mathcal{D}$ and the teacher model $\theta^t$. Therefore, we regard the transfer set $\mathcal{D}^t$, the joint of $\mathcal{D}$ and $\theta^t$, as the confounder of $X$ and $Y^t$.
Although $\mathcal{D}$ is balanced in terms of the context-invariant class-specific information, the context information (e.g., attributes) is overlooked, which makes $\mathcal{D}$ imbalanced in terms of context. As shown in Figure 1, such imbalanced context leads to an imbalanced transfer set $\mathcal{D}^t$ and further affects the distillation performance of the teacher's knowledge.
To overcome such a confounding effect, a commonly used technique is intervention via $P(y^t\,|\,do(x))$ instead of $P(y^t\,|\,x)$, which is formulated as
$$P(y^t\,|\,do(x)) = \sum_{\mathcal{D}^t} P(y^t\,|\,x, \mathcal{D}^t)\,P(\mathcal{D}^t) = \sum_{\mathcal{D}^t} \frac{P(x, y^t, \mathcal{D}^t)}{P(x\,|\,\mathcal{D}^t)}.$$
This transformation suggests that we can use the inverse of the propensity score, $1/P(x\,|\,\mathcal{D}^t)$, as the sample weight to implement the intervention and overcome the confounding effect. Thanks to causality-based theory [55, 5], we can use the Inverse Probability Weighting (IPW) technique to overcome the confounding effect brought by the transfer gap.
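The algebra above can be checked numerically on a toy discrete model with the same causal structure ($\mathcal{D}^t \rightarrow X$ and $(X, \mathcal{D}^t) \rightarrow Y^t$). This is only an illustration of the identity between backdoor adjustment and inverse-probability weighting, not part of the method itself.

```python
import numpy as np

rng = np.random.default_rng(0)
nD, nX, nY = 3, 4, 5                                       # toy cardinalities

P_D = rng.dirichlet(np.ones(nD))                           # P(D)
P_X_given_D = rng.dirichlet(np.ones(nX), size=nD)          # P(X | D), (nD, nX)
P_Y_given_XD = rng.dirichlet(np.ones(nY), size=(nD, nX))   # P(Y | X, D), (nD, nX, nY)

# Joint distribution P(x, y, D) = P(D) P(x | D) P(y | x, D).
P_joint = P_D[:, None, None] * P_X_given_D[:, :, None] * P_Y_given_XD

x = 1
# Backdoor adjustment: P(y | do(x)) = sum_D P(y | x, D) P(D).
p_do_adjust = (P_Y_given_XD[:, x, :] * P_D[:, None]).sum(axis=0)
# Equivalent IPW form:   P(y | do(x)) = sum_D P(x, y, D) / P(x | D).
p_do_ipw = (P_joint[:, x, :] / P_X_given_D[:, x, None]).sum(axis=0)

assert np.allclose(p_do_adjust, p_do_ipw)
```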