
useful information than the general binary labels for optimization. Motivated by the success of KD, various logit-based methods have been proposed for further improvement. For example, Deep Mutual Learning (DML) [37] proposes to replace the pre-trained teacher with an ensemble of students so that the distillation mechanism does not need to train a large network in advance. Teacher Assistant Knowledge Distillation (TAKD) [23] observes that a better teacher may distill a worse student due to the large performance gap between them; a teacher assistant network is therefore introduced to alleviate this problem. Another technical route of projector-free methods is similarity-based distillation. Unlike the logit-based methods, which aim to exploit the category information hidden in the predictions of the teacher, similarity-based methods try to explore the latent relationships between samples in the feature space. For example, Similarity-Preserving (SP) distillation [33] first constructs the similarity matrices of the student and the teacher by computing the inner products between features, and then minimizes the discrepancy between the obtained similarity matrices. Similarly, Correlation Congruence (CC) [25] forms the similarity matrices with a kernel function. Although the logit-based and similarity-based methods do not require an extra projector during training, they are generally less effective than feature-based methods, as shown in recent research [6, 35].
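To make the similarity-based idea concrete, the following is a minimal PyTorch-style sketch of an SP-like objective; the function name, the assumption that features are stored as (batch, dim) matrices, and the row-wise normalization and MSE choices are illustrative simplifications rather than the exact formulation of [33]:

```python
import torch
import torch.nn.functional as F

def sp_style_loss(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """Similarity-preserving style loss (illustrative sketch).

    f_s: student features of shape (b, d); f_t: teacher features of shape (b, m).
    """
    # Pairwise similarity matrices via inner products, each of shape (b, b)
    G_s = f_s @ f_s.t()
    G_t = f_t @ f_t.t()
    # Row-wise L2 normalization keeps the two matrices on a comparable scale
    G_s = F.normalize(G_s, p=2, dim=1)
    G_t = F.normalize(G_t, p=2, dim=1)
    # Penalize the discrepancy between the student and teacher similarity matrices
    return F.mse_loss(G_s, G_t)
```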
Projector-dependent methods.
Feature distillation methods aim to make student and teacher features as similar as possible. Therefore, a projector is essential to map features into a common space. The first feature distillation method, FitNets [26], minimizes the L2 distance between student and teacher feature maps produced by the intermediate layers of the networks. Furthermore, Contrastive Representation Distillation (CRD) [32], Softmax Regression Representation Learning (SRRL) [35] and Comprehensive Interventional Distillation (CID) [6] show that the last feature representations of the networks are more suitable for distillation. One potential reason is that the last feature representations are closer to the classifier and directly affect the classification performance [35]. The aforementioned feature distillation methods mainly focus on the design of loss functions, such as introducing contrastive learning [32] and imposing causal intervention [6]; a simple 1×1 convolutional kernel or a linear projection is adopted to transform features in these methods. We note that the effect of projectors has been largely ignored. Previous works such as Factor Transfer (FT) [18] and Overhaul of Feature Distillation (OFD) [13] try to improve the architecture of projectors by introducing an auto-encoder and modifying the activation function. However, their performance is not competitive when compared to state-of-the-art methods [35, 6]. Instead, this paper proposes a simple distillation framework that combines distillation of the last features with a projector ensemble.
3 Improved Feature Distillation
We first define the notations used in the following sections. In line with observations in recent research [32, 6], we apply the feature distillation loss to the layer before the classifier. $S = \{s_1, s_2, \ldots, s_i, \ldots, s_b\} \in \mathbb{R}^{d \times b}$ denotes the last student features, where $d$ and $b$ are the feature dimension and the batch size, respectively. The corresponding teacher features are represented by $T = \{t_1, t_2, \ldots, t_i, \ldots, t_b\} \in \mathbb{R}^{m \times b}$, where $m$ is the feature dimension. To match the dimensions of $S$ and $T$, a projector $g(\cdot)$ is required to transform the student or teacher features. We experimentally find that imposing the projector on the teacher is less effective since the original and more informative feature distribution from the teacher would be disrupted. Therefore, in the proposed distillation framework, a projector will be added to the student as $g(s_i) = \sigma(W s_i)$ during training and be removed after training, where $\sigma(\cdot)$ is the ReLU function and $W \in \mathbb{R}^{m \times d}$ is a weighting matrix.
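As a concrete reference, a minimal PyTorch sketch of this student-side projector is given below; the class name StudentProjector and the choice of a bias-free linear layer are assumptions for illustration, not a prescribed implementation:

```python
import torch
import torch.nn as nn

class StudentProjector(nn.Module):
    """Maps student features (dimension d) into the teacher's feature space (dimension m),
    implementing g(s_i) = ReLU(W s_i). Used only during training and discarded afterwards."""

    def __init__(self, d: int, m: int):
        super().__init__()
        self.W = nn.Linear(d, m, bias=False)  # weighting matrix W in R^{m x d}
        self.relu = nn.ReLU()                 # sigma(.) in the text

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: student features of shape (b, d); output: shape (b, m)
        return self.relu(self.W(s))
```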
3.1 Feature Distillation as Multi-task Learning
In recent work, SRRL and CID combine the feature-based loss with the logit-based loss to improve performance. Since distillation methods are sensitive to hyper-parameters and to changes in teacher-student combinations, the additional objectives increase the training cost of coefficient adjustment. To alleviate this problem, we simply use the following Direction Alignment (DA) loss [19, 3, 10] for feature distillation:
$$\mathcal{L}_{DA} = \frac{1}{2b} \sum_{i=1}^{b} \left\lVert \frac{g(s_i)}{\lVert g(s_i) \rVert_2} - \frac{t_i}{\lVert t_i \rVert_2} \right\rVert_2^2 = 1 - \frac{1}{b} \sum_{i=1}^{b} \frac{\langle g(s_i), t_i \rangle}{\lVert g(s_i) \rVert_2 \, \lVert t_i \rVert_2}, \qquad (1)$$
where $\lVert \cdot \rVert_2$ indicates the L2-norm and $\langle \cdot, \cdot \rangle$ represents the inner product of two vectors.
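For concreteness, a short PyTorch sketch of Eq. (1) is shown below, using the cosine-similarity form on the right-hand side; the function name and the assumption that features are stored as (b, m) row-major tensors are ours rather than part of the paper:

```python
import torch
import torch.nn.functional as F

def direction_alignment_loss(g_s: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Direction Alignment loss of Eq. (1) (sketch).

    g_s: projected student features g(s_i), shape (b, m)
    t:   teacher features t_i, shape (b, m)
    """
    # For unit vectors, (1/2)||a - b||^2 = 1 - <a, b>, so the normalized L2 form
    # in Eq. (1) reduces to one minus the mean cosine similarity.
    cos = F.cosine_similarity(g_s, t, dim=1)  # shape (b,)
    return 1.0 - cos.mean()
```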
By convention [14, 32, 35], the distillation loss is coupled with the cross-entropy loss to train a student. As mentioned