
useful information than the general binary labels for optimization. Motivated by the success of KD, various logit-based methods have been proposed for further improvement. For example, Deep Mutual Learning (DML) [37] proposes to replace the pre-trained teacher with an ensemble of students so that the distillation mechanism does not need to train a large network in advance. Teacher Assistant Knowledge Distillation (TAKD) [23] observes that a better teacher may distill a worse student due to the large performance gap between them; a teacher assistant network is therefore introduced to alleviate this problem. Another technical route of projector-free methods is similarity-based distillation. Unlike the logit-based methods, which aim to exploit the category information hidden in the predictions of the teacher, similarity-based methods try to explore the latent relationships between samples in the feature space. For example, Similarity-Preserving (SP) distillation [33] first constructs the similarity matrices of the student and the teacher by computing the inner products between features, and then minimizes the discrepancy between the obtained similarity matrices. Similarly, Correlation Congruence (CC) [25] forms the similarity matrices with a kernel function. Although the logit-based and similarity-based methods do not require an extra projector during training, they are generally less effective than feature-based methods, as shown in recent research [6, 35].
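To make the similarity-based idea concrete, the following is a minimal PyTorch-style sketch of an SP-like objective; the function name, the assumption that features are stored as (batch, dim) matrices, and the row-wise normalization and MSE choices are illustrative simplifications rather than the exact formulation of [33]:

```python
import torch
import torch.nn.functional as F

def sp_style_loss(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """Similarity-preserving style loss (illustrative sketch).

    f_s: student features of shape (b, d); f_t: teacher features of shape (b, m).
    """
    # Pairwise similarity matrices via inner products, each of shape (b, b)
    G_s = f_s @ f_s.t()
    G_t = f_t @ f_t.t()
    # Row-wise L2 normalization keeps the two matrices on a comparable scale
    G_s = F.normalize(G_s, p=2, dim=1)
    G_t = F.normalize(G_t, p=2, dim=1)
    # Penalize the discrepancy between the student and teacher similarity matrices
    return F.mse_loss(G_s, G_t)
```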
Projector-dependent methods.
Feature distillation methods aim to make student and teacher features as similar as possible. Therefore, a projector is essential to map features into a common space. The first feature distillation method, FitNets [26], minimizes the L2 distance between student and teacher feature maps produced by the intermediate layers of the networks. Furthermore, Contrastive Representation Distillation (CRD) [32], Softmax Regression Representation Learning (SRRL) [35] and Comprehensive Interventional Distillation (CID) [6] show that the last feature representations of the networks are more suitable for distillation. One potential reason is that the last feature representations are closer to the classifier and directly affect the classification performance [35]. The aforementioned feature distillation methods mainly focus on the design of loss functions, such as introducing contrastive learning [32] and imposing causal intervention [6]; a simple 1×1 convolutional kernel or a linear projection is adopted to transform features in these methods. We note that the effect of projectors has been largely ignored. Previous works such as Factor Transfer (FT) [18] and Overhaul of Feature Distillation (OFD) [13] try to improve the architecture of projectors by introducing an auto-encoder and modifying the activation function. However, their performance is not competitive when compared to state-of-the-art methods [35, 6]. Instead, this paper proposes a simple distillation framework that combines distillation of the last features with a projector ensemble.
3 Improved Feature Distillation
We first define the notations used in the following sections. In line with observations in recent research [32, 6], we apply the feature distillation loss to the layer before the classifier. $S = \{s_1, s_2, \ldots, s_i, \ldots, s_b\} \in \mathbb{R}^{d \times b}$ denotes the last student features, where $d$ and $b$ are the feature dimension and the batch size, respectively. The corresponding teacher features are represented by $T = \{t_1, t_2, \ldots, t_i, \ldots, t_b\} \in \mathbb{R}^{m \times b}$, where $m$ is the feature dimension. To match the dimensions of $S$ and $T$, a projector $g(\cdot)$ is required to transform the student or teacher features. We experimentally find that imposing the projector on the teacher is less effective since the original and more informative feature distribution from the teacher would be disrupted. Therefore, in the proposed distillation framework, a projector will be added to the student as $g(s_i) = \sigma(W s_i)$ during training and be removed after training, where $\sigma(\cdot)$ is the ReLU function and $W \in \mathbb{R}^{m \times d}$ is a weighting matrix.
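As a concrete reference, a minimal PyTorch sketch of this student-side projector is given below; the class name StudentProjector and the choice of a bias-free linear layer are assumptions for illustration, not a prescribed implementation:

```python
import torch
import torch.nn as nn

class StudentProjector(nn.Module):
    """Maps student features (dimension d) into the teacher's feature space (dimension m),
    implementing g(s_i) = ReLU(W s_i). Used only during training and discarded afterwards."""

    def __init__(self, d: int, m: int):
        super().__init__()
        self.W = nn.Linear(d, m, bias=False)  # weighting matrix W in R^{m x d}
        self.relu = nn.ReLU()                 # sigma(.) in the text

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: student features of shape (b, d); output: shape (b, m)
        return self.relu(self.W(s))
```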
3.1 Feature Distillation as Multi-task Learning
In recent work, SRRL and CID combine the feature-based loss with the logit-based loss to improve performance. Since distillation methods are sensitive to hyper-parameters and to changes in teacher-student combinations, the additional objectives increase the training cost of coefficient adjustment. To alleviate this problem, we simply use the following Direction Alignment (DA) loss [19, 3, 10] for feature distillation:
$$\mathcal{L}_{DA} = \frac{1}{2b} \sum_{i=1}^{b} \left\lVert \frac{g(s_i)}{\lVert g(s_i) \rVert_2} - \frac{t_i}{\lVert t_i \rVert_2} \right\rVert_2^2 = 1 - \frac{1}{b} \sum_{i=1}^{b} \frac{\langle g(s_i), t_i \rangle}{\lVert g(s_i) \rVert_2 \, \lVert t_i \rVert_2}, \qquad (1)$$
where $\lVert \cdot \rVert_2$ indicates the L2-norm and $\langle \cdot, \cdot \rangle$ represents the inner product of two vectors.
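For concreteness, a short PyTorch sketch of Eq. (1) is shown below, using the cosine-similarity form on the right-hand side; the function name and the assumption that features are stored as (b, m) row-major tensors are ours rather than part of the paper:

```python
import torch
import torch.nn.functional as F

def direction_alignment_loss(g_s: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Direction Alignment loss of Eq. (1) (sketch).

    g_s: projected student features g(s_i), shape (b, m)
    t:   teacher features t_i, shape (b, m)
    """
    # For unit vectors, (1/2)||a - b||^2 = 1 - <a, b>, so the normalized L2 form
    # in Eq. (1) reduces to one minus the mean cosine similarity.
    cos = F.cosine_similarity(g_s, t, dim=1)  # shape (b,)
    return 1.0 - cos.mean()
```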
By convention [14, 32, 35], the distillation loss is coupled with the cross-entropy loss to train a student. As mentioned