
[Figure: two panels. Left: Dataset: CIFAR-100, Teacher: ResNet-110; Right: Dataset: ImageNet, Teacher: ResNet-50. x-axis: Ranked Class Index; y-axis: Avg. Prob. × # Total Samples.]
Figure 1: Illustration of the distribution discrepancies between ground-truth annotations and teacher
predictions. Although the teacher model is trained on balanced data (blue dashed), its prediction
distributions are imbalanced under various temperatures.
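For concreteness, the y-axis statistic in Figure 1 can be computed as below; a minimal PyTorch sketch, assuming a pre-trained teacher and a data loader over the class-balanced training set (the function name, signature, and default temperature are ours, not the paper's).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ranked_class_mass(teacher, loader, num_classes, T=4.0, device="cuda"):
    """Sum of temperature-softened teacher probabilities per class,
    i.e., avg. prob. x # total samples, ranked as in Figure 1."""
    teacher.eval().to(device)
    mass = torch.zeros(num_classes, device=device)
    for images, _ in loader:
        probs = F.softmax(teacher(images.to(device)) / T, dim=1)
        mass += probs.sum(dim=0)  # accumulate probability mass per class
    return mass.sort(descending=True).values
```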
probability. As shown in Table 1, compared to vanilla training, KD achieves better performance in all
the subgroups. However, the improvement in the top 25 classes is much larger than that in the last 25
classes, i.e., 5.14% vs. 0.85% on average. We ask: what causes this gap in the first place; or more
specifically, why do the teacher's non-uniformly distributed predictions imply the gap? We answer
from an invariance vs. equivariance learning point of view [4, 69]:
Table 1: Improvement of KD over vanilla student for different classes. The metric is macro-average
recall (%).

Teacher → Student            Top 1-25   Top 26-50   Top 51-75   Top 76-100
ResNet50 → MobileNetV2         +4.96       +5.92       +1.76        +1.20
resnet32x4 → ShuffleNetV1      +5.80       +2.68       +2.52        +0.84
resnet32x4 → ShuffleNetV2      +4.72       +1.92       +2.24        +0.76
WRN-40-2 → ShuffleNetV1        +5.08       +7.20       +4.48        +0.60
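The subgroup numbers in Table 1 follow from macro-average recall computed within each block of 25 ranked classes; a sketch of that computation, assuming label and prediction arrays and a class ranking as in Figure 1 (all names here are ours):

```python
import numpy as np

def grouped_macro_recall(y_true, y_pred, class_order, group_size=25):
    """Macro-average recall within each consecutive block of ranked classes.
    Assumes every class in `class_order` appears at least once in `y_true`."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = np.array([(y_pred[y_true == c] == c).mean() for c in class_order])
    return [per_class[i:i + group_size].mean()
            for i in range(0, len(class_order), group_size)]

# Table 1 reports the difference between the KD student and the vanilla student:
# gains = np.subtract(grouped_macro_recall(y, pred_kd, order),
#                     grouped_macro_recall(y, pred_vanilla, order)) * 100
```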
Human domain: context invariance. The discriminative generalization is the ability to learn both
context-invariant and class-equivariant information from the diverse training samples per class. The
human domain only provides context-invariant class-specific information, i.e., hard targets. We
normally collect a balanced dataset to form the human domain.
Machine domain: context equivariance. Teacher models often use a temperature variable to
preserve the context. The temperature allows the teacher to represent a sample not only by its
context-invariant class-specific information, but also by its context-equivariant information. For
example, a dog image with soft label 0.8·dog + 0.2·wolf may imply that the dog has wolf-like
contextual attributes such as "fluffy coat" and "upright ears". Although the context-invariance (i.e.,
class) is balanced in the training data, the context-equivariance (i.e., context) is imbalanced because
the context balance is not considered in class-specific data collection [67]. To construct the transfer
set for the machine domain, the teacher model annotates each sample after seeing the others, i.e.,
being pre-trained on the whole set. Interestingly, the diverse context results in a long-tailed imbalanced
distribution, which is exactly reflected in Figure 1. In other words, the teacher's knowledge is
imbalanced even though the teacher is trained on a class-balanced dataset.
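To make the effect of the temperature concrete, a toy sketch with made-up logits (the class set and values are purely illustrative):

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one image over [dog, wolf, cat].
logits = torch.tensor([4.0, 2.6, 0.5])

print(F.softmax(logits, dim=0))        # T=1: sharp, close to a hard target
print(F.softmax(logits / 4.0, dim=0))  # T=4: softer; wolf keeps non-trivial
                                       # mass, preserving contextual attributes
```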
Now we are ready to point out how the transfer gap is not properly addressed in conventional KD
methods. Conventional KD calculates the Cross-Entropy (CE) loss between the ground-truth label
and the student's prediction, and the Kullback–Leibler (KL) divergence [33] loss between the teacher's
and student's predictions, where a constant weight is assigned to the two losses. This is essentially
based on the underlying assumption that the data in both the human and machine domains are IID.
Based on the analysis of context equivariance, we argue that this assumption is unrealistic, i.e., the
teacher's knowledge is imbalanced. Therefore, a constant sample weight for the KL loss would be a
bottleneck.
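In code, the conventional objective looks roughly as follows; a minimal sketch of the standard CE + KL formulation with a constant trade-off weight (alpha and T are hypothetical hyperparameter names):

```python
import torch.nn.functional as F

def conventional_kd_loss(student_logits, teacher_logits, target, T=4.0, alpha=0.9):
    """CE on hard targets plus temperature-scaled KL on soft targets,
    combined with the same constant weight for every sample."""
    ce = F.cross_entropy(student_logits, target)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    return (1 - alpha) * ce + alpha * kl
```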
In this paper, we propose a simple yet effective method, Inverse Probability Weighting
Distillation (IPWD), which compensates for the training samples that are under-weighted in the
machine domain. For each training sample x, we first estimate its machine-domain propensity
score P(x|machine) by comparing class-aware and context-aware predictions. A sample with a low
propensity score would have a high confidence from class-aware predictions and a low confidence
from context-aware predictions. Then, IPWD assigns the inverse probability 1/P(x|machine) as
the sample weight for the KL loss to highlight the under-represented samples. In this way, IPWD
generates a pseudo-population [37, 26] to deal with the imbalanced knowledge.
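A schematic sketch of the weighted KL term, assuming the per-sample propensity scores P(x|machine) have already been estimated as described above; the clamping and weight normalization are our additions for numerical stability, not details from the paper:

```python
import torch
import torch.nn.functional as F

def ipwd_kl_loss(student_logits, teacher_logits, propensity, T=4.0, eps=1e-6):
    """Per-sample KL weighted by the inverse propensity 1 / P(x|machine)."""
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="none").sum(dim=1) * (T * T)  # per-sample KL
    w = 1.0 / propensity.clamp_min(eps)  # under-represented samples weigh more
    w = w / w.mean()                     # keep the overall loss scale stable
    return (w * kl).mean()
```

Re-weighting the KL term this way simulates drawing from a pseudo-population in which the machine-domain samples are balanced.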