Improved Feature Distillation via Projector Ensemble
Yudong Chen¹,², Sen Wang¹, Jiajun Liu², Xuwei Xu¹,², Frank de Hoog², Zi Huang¹
¹The University of Queensland  ²CSIRO Data61
{yudong.chen,sen.wang,xuwei.xu}@uq.edu.au
{jiajun.liu,frank.dehoog}@csiro.au
huang@itee.uq.edu.au
Abstract
In knowledge distillation, previous feature distillation methods mainly focus on
the design of loss functions and the selection of the distilled layers, while the
effect of the feature projector between the student and the teacher remains under-
explored. In this paper, we first discuss a plausible mechanism of the projector
with empirical evidence and then propose a new feature distillation method based
on a projector ensemble for further performance improvement. We observe that
the student network benefits from a projector even if the feature dimensions of
the student and the teacher are the same. Training a student backbone without a
projector can be considered as a multi-task learning process, namely achieving
discriminative feature extraction for classification and feature matching between
the student and the teacher for distillation at the same time. We hypothesize and
empirically verify that without a projector, the student network tends to overfit the teacher's feature distributions despite having a different architecture and weight initialization. This leads to degradation of the quality of the student's deep features that are eventually used in classification. Adding a projector, on the other hand, disentangles the two learning tasks and helps the student network focus better on the main feature extraction task while still being able to utilize teacher features as guidance through the projector. Motivated by the positive effect of the projector
in feature distillation, we propose an ensemble of projectors to further improve the
quality of student features. Experimental results on different datasets with a series
of teacher-student pairs illustrate the effectiveness of the proposed method. Code
is available at https://github.com/chenyd7/PEFD.
1 Introduction
The last decade has witnessed the rapid development of Convolutional Neural Networks (CNNs) [21, 31, 11, 22, 4]. The resulting increases in performance, however, have come with substantial increases in network size, and this largely limits the applications of CNNs on edge devices [15].
To alleviate this problem, knowledge distillation has been proposed for network compression. The
key idea of distillation is to use the knowledge obtained by the large network (teacher) to guide the
optimization of the lightweight network (student) [14, 26, 33].
Existing methods can be roughly categorized into logit-based, feature-based and similarity-based distillation [9]. Recent research shows that feature-based methods generally distill a better student network compared to the other two groups [32, 6]. We conjecture that the process of mimicking the teacher's features provides a clearer optimization direction for the training of the student network.
Despite the promising performance of feature distillation, it is still challenging to narrow the gap
between the student and teacher’s feature spaces. To improve the feature learning ability of the
student, various feature distillation methods have been developed by designing more powerful objective functions [32, 38, 35, 6] and determining more effective links between the layers of the student and the teacher [2, 17, 1].

Figure 1: Illustration of (a) feature distillation without a projector when the feature dimensions of the student and the teacher are the same, (b) the general feature-based distillation with a single projector [6, 35] and (c) the proposed method with multiple projectors, where $L_{CE}$ and $L_{FD}$ are the cross-entropy loss and the feature distillation loss, respectively.
We have found that the feature projection process from the student to the teacher’s feature space plays
a key part in feature distillation and can be redesigned to improve the performance. Since the feature
dimensions of student networks are not always the same as those of teacher networks, a projector is
often required to map features into a common space for matching. As shown in Table 1, imposing
a projector on the student network can improve the distillation performance even if the feature
dimensions of the student and the teacher are the same. We hypothesize that adding a projector for
distillation helps to mitigate the overfitting problem when minimizing the feature discrepancy between
the student and the teacher. As shown in Figure 1(a), distillation without a projector can be regarded
as a multi-task learning process, including feature learning for classification and feature matching
for distillation. In this case, the student network may overfit the teacher’s feature distributions and
the generated features are less distinguishable for classification. Our empirical results in Section 3
support this hypothesis to some extent. In addition, inspired by the effectiveness of adding a projector for feature distillation, we propose an ensemble of projectors for further improvement. Our intuition is that projectors with different initializations generate diverse transformed features. Therefore, it is helpful to improve the generalization of the student by using multiple projectors, according to the theory behind ensemble learning [39, 34, 5]. Figure 1 compares existing distillation methods with our method.
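As a rough illustration of this intuition, the PyTorch sketch below maps a student feature into the teacher's dimension with several independently initialized linear-ReLU projectors and averages their outputs. The module name and the simple averaging scheme are illustrative assumptions for this sketch, not necessarily the exact formulation used in our experiments.

```python
import torch
import torch.nn as nn

class ProjectorEnsemble(nn.Module):
    """Maps d-dimensional student features to the m-dimensional teacher space
    with q independently initialized linear-ReLU projectors and averages
    their outputs (illustrative sketch)."""

    def __init__(self, d: int, m: int, q: int = 3):
        super().__init__()
        self.projectors = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, m), nn.ReLU()) for _ in range(q)]
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (batch, d) student features -> (batch, m) averaged projection
        return torch.stack([p(s) for p in self.projectors], dim=0).mean(dim=0)
```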
Our contributions are three-fold:
• We investigate the phenomenon that the student benefits from a projector during feature distillation even when the student and the teacher have identical feature dimensionalities.
• Technically, we propose an ensemble of feature projectors to improve the performance of feature distillation. The proposed method is extremely simple and easy to implement.
• Experimentally, we conduct comprehensive comparisons between different methods on benchmark datasets with a wide variety of teacher-student pairs. It is shown that the proposed method consistently outperforms state-of-the-art feature distillation methods.
2 Related Work
Since this paper mainly focuses on the design of the projector, we divide the existing methods into two categories in terms of the usage of the projector, as follows:
Projector-free methods.
As the most representative distillation method, Knowledge Distillation (KD) [14] proposes to utilize the logits generated by the pre-trained teacher as additional targets for the student. The intuition of KD is that the generated logits provide more useful information for optimization than the binary labels. Motivated by the success of KD, various logit-based methods have been proposed for further improvement. For example, Deep Mutual Learning (DML) [37] proposes to replace the pre-trained teacher with an ensemble of students so that the distillation mechanism does not need to train a large network in advance. Teacher Assistant Knowledge Distillation (TAKD) [23] observes that a better teacher may distill a worse student due to the large performance gap between them. Therefore, a teacher assistant network is introduced to alleviate this problem. Another technical route of projector-free methods is similarity-based distillation. Unlike the logit-based methods that aim to exploit the category information hidden in the predictions of the teacher, similarity-based methods try to explore the latent relationships between samples in the feature space. For example, Similarity-Preserving (SP) distillation [33] first constructs the similarity matrices of the student and the teacher by computing the inner products between features and then minimizes the discrepancy between the obtained similarity matrices. Similarly, Correlation Congruence (CC) [25] forms the similarity matrices with a kernel function. Although the logit-based and similarity-based methods do not require an extra projector during training, they are generally less effective than the feature-based methods, as shown in recent research [6, 35].
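For reference, a rough sketch of the similarity-matching idea described above is given below; the row normalization and squared-error penalty are one common choice and do not reproduce the exact formulations of SP or CC.

```python
import torch
import torch.nn.functional as F

def similarity_matching_loss(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """Build (batch x batch) similarity matrices from student and teacher
    features via inner products and penalize their discrepancy."""
    g_s = F.normalize(f_s @ f_s.t(), p=2, dim=1)  # student similarities, row-normalized
    g_t = F.normalize(f_t @ f_t.t(), p=2, dim=1)  # teacher similarities, row-normalized
    return (g_s - g_t).pow(2).mean()
```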
Projector-dependent methods.
Feature distillation methods aim to make student and teacher features as similar as possible. Therefore, a projector is essential to map features into a common space. The first feature distillation method, FitNets [26], minimizes the L2 distance between the student and teacher feature maps produced by intermediate layers of the networks. Furthermore, Contrastive Representation Distillation (CRD) [32], Softmax Regression Representation Learning (SRRL) [35] and Comprehensive Interventional Distillation (CID) [6] show that the last feature representations of networks are more suitable for distillation. One potential reason is that the last feature representations are closer to the classifier and will directly affect the classification performance [35]. The aforementioned feature distillation methods mainly focus on the design of loss functions, such as introducing contrastive learning [32] and imposing causal intervention [6]. A simple 1×1 convolutional kernel or a linear projection is adopted to transform features in these methods. We note that the effect of projectors is largely ignored. Previous works such as Factor Transfer (FT) [18] and Overhaul of Feature Distillation (OFD) [13] try to improve the architecture of projectors by introducing an auto-encoder and modifying the activation function. However, their performance is not competitive when compared to the state-of-the-art methods [35, 6]. Instead, this paper proposes a simple distillation framework by combining the ideas of distilling the last features and the projector ensemble.
3 Improved Feature Distillation
We first define the notations used in the following sections. In line with observations in recent research [32, 6], we apply the feature distillation loss to the layer before the classifier. $S = \{s_1, s_2, \ldots, s_i, \ldots, s_b\} \in \mathbb{R}^{d \times b}$ denotes the last student features, where $d$ and $b$ are the feature dimension and the batch size, respectively. The corresponding teacher features are represented by $T = \{t_1, t_2, \ldots, t_i, \ldots, t_b\} \in \mathbb{R}^{m \times b}$, where $m$ is the feature dimension. To match the dimensions of $S$ and $T$, a projector $g(\cdot)$ is required to transform the student or teacher features. We experimentally find that imposing the projector on the teacher is less effective, since the original and more informative feature distribution of the teacher would be disrupted. Therefore, in the proposed distillation framework, a projector is added to the student as $g(s_i) = \sigma(W s_i)$ during training and removed after training, where $\sigma(\cdot)$ is the ReLU function and $W \in \mathbb{R}^{m \times d}$ is a weighting matrix.
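A minimal PyTorch sketch of such a projector is shown below; the bias term is omitted here to mirror $g(s_i) = \sigma(W s_i)$, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class StudentProjector(nn.Module):
    """g(s) = ReLU(W s): maps d-dimensional student features to the
    m-dimensional teacher space. Attached to the student only during
    training and discarded afterwards."""

    def __init__(self, d: int, m: int):
        super().__init__()
        self.linear = nn.Linear(d, m, bias=False)  # W in R^{m x d}

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (batch, d) -> (batch, m)
        return torch.relu(self.linear(s))
```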
3.1 Feature Distillation as Multi-task Learning
In recent work, SRRL and CID combine the feature-based loss with the logit-based loss to improve performance. Since distillation methods are sensitive to hyper-parameters and to changes in teacher-student combinations, the additional objectives increase the training cost of coefficient adjustment. To alleviate this problem, we simply use the following Direction Alignment (DA) loss [19, 3, 10] for feature distillation:
$$\mathcal{L}_{DA} = \frac{1}{2b}\sum_{i=1}^{b}\left\|\frac{g(s_i)}{\|g(s_i)\|_2} - \frac{t_i}{\|t_i\|_2}\right\|_2^2 = 1 - \frac{1}{b}\sum_{i=1}^{b}\frac{\langle g(s_i), t_i\rangle}{\|g(s_i)\|_2\,\|t_i\|_2}, \qquad (1)$$
where $\|\cdot\|_2$ indicates the L2-norm and $\langle\cdot,\cdot\rangle$ represents the inner product of two vectors. By convention [14, 32, 35], the distillation loss is coupled with the cross-entropy loss to train a student. As mentioned
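For reference, a minimal PyTorch sketch of the DA loss in Eq. (1) is given below, assuming the projected student features $g(s_i)$ and the teacher features $t_i$ are provided as (batch, dimension) tensors.

```python
import torch
import torch.nn.functional as F

def direction_alignment_loss(g_s: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """L_DA of Eq. (1): mean squared distance between L2-normalized projected
    student features and teacher features, which equals 1 minus the mean
    cosine similarity."""
    g_s = F.normalize(g_s, p=2, dim=1)  # (b, m), each row unit-norm
    t = F.normalize(t, p=2, dim=1)      # (b, m)
    return 0.5 * (g_s - t).pow(2).sum(dim=1).mean()
    # equivalently: 1.0 - (g_s * t).sum(dim=1).mean()
```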