
[Figure 4 plots: validation accuracy [%] vs. dataset size [samples] for ResNet18, FR-ResNet18, MLPMixer, and ViT-B16; (a) CIFAR10 test accuracy (1k–10k samples), (b) CIFAR100 test accuracy (5k–14k samples).]
Figure 4: Comparison with MLPMixer [14] and ViT [12]. We compare our method with MLPMixer and ViT on the CIFAR10 (4a) and the CIFAR100 (4b) datasets. Our method significantly outperforms both architectures in the low-data regime.
0.9, weight decay 5e−4, and divide the learning rate by 10 after 80% of the epochs. We use cross-entropy loss as supervision. For the more complex Caltech datasets, we start with an ImageNet-pre-trained backbone and reduce the dimensionality in our FR only to 256. We use the same training setup for a full fine-tuning, except that we reduce the initial learning rate to 1e−3 and train for only 100 epochs.
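For concreteness, this optimization setup can be written as a short PyTorch sketch. The momentum, weight decay, /10 learning-rate step after 80% of the epochs, and cross-entropy loss follow the text; the base learning rate, epoch count, backbone, and data loader below are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision.models import resnet18

# Stand-in for a CIFAR10 DataLoader (assumption; replace with the real dataset).
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(128, 3, 32, 32),
                                   torch.randint(0, 10, (128,))),
    batch_size=64, shuffle=True)

model = resnet18(num_classes=10)          # illustrative backbone
criterion = nn.CrossEntropyLoss()         # supervision as in the text
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,       # base LR assumed
                            momentum=0.9, weight_decay=5e-4)  # as in the text

epochs = 200                               # assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[int(0.8 * epochs)], gamma=0.1)  # /10 after 80% of epochs

for epoch in range(epochs):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()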
3.1 Supervised Learning
Comparisons with ResNet18 [8]. We compare the results of our method with those of ResNet18. As shown in Figure 3a, on the first training cycle (1000 labels), our method outperforms ResNet18 by 7.6 percentage points (pp). On the second cycle, we outperform ResNet18 by more than 10pp. We keep outperforming ResNet18 until the seventh cycle, where our improvement is half a percentage point. For the remaining iterations, both methods reach the same accuracy.
On the CIFAR100 dataset (see Figure 3b), we start by outperforming ResNet18 by 5.7pp, and in the second cycle, we are better by almost 7pp. We continue outperforming ResNet18 in all ten cycles, being better by half a percentage point in the last one. We see similar behavior on Caltech101 (see Figure 3c) and Caltech256 (see Figure 3d).
A common tendency across all datasets is that, as the number of labeled samples increases, the gap between our method and the baseline shrinks. Therefore, dropping the fully-connected layers when a large labeled dataset is available does not cause any disadvantage, as was found in [6]. However, that work did not analyze this question in the low-data regime, where using FC layers after CNN architectures is clearly beneficial.
Comparisons with MLPMixer [14] and ViT [12]. We also compare the results of our method with those of two non-convolutional networks, MLPMixer and ViT. Similar to our method, MLPMixer draws its strength from fully-connected layers. ViT, on the other hand, is a transformer-based architecture in which each Transformer block contains an attention block and a block of fully-connected layers. Unlike these methods, our method uses both convolutional and fully-connected layers to leverage the advantages of both the high-level convolutional features and the global interrelations captured by the fully-connected layers (see the sketch below). In Figure 4a, we compare to MLPMixer and ViT on the CIFAR10 dataset. On the first cycle, we outperform MLPMixer by 4.4pp and ViT by circa 6pp. We keep outperforming both methods in all other cycles, including the last one, where we do better than them by circa 13pp. As we can see, MLPMixer and ViT do not perform well even in the latest training cycles and require training on massive datasets. We show a similar comparison for CIFAR100 in Figure 4b.
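A hypothetical sketch of this hybrid design is given below. The exact FR head is not specified in this section, so its composition here (a small MLP reducing the pooled ResNet18 features to 256 dimensions before classification, echoing the 256-dimensional reduction used for Caltech) is an illustrative assumption, not the authors' implementation.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class ConvPlusFCHead(nn.Module):
    """Convolutional backbone followed by a fully-connected (FR-style) head."""

    def __init__(self, num_classes: int = 10, fr_dim: int = 256):
        super().__init__()
        backbone = resnet18()
        # Keep the convolutional feature extractor, drop ResNet18's own classifier.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        # Fully-connected head on the pooled features (illustrative layer sizes).
        self.fr_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512, fr_dim),
            nn.ReLU(inplace=True),
            nn.Linear(fr_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fr_head(self.features(x))

# Example: a CIFAR10-sized batch.
logits = ConvPlusFCHead(num_classes=10)(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])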
Comparisons with Knowledge Distillation baselines. To show that our method's main strength comes from our FR head, we compare to several knowledge distillation (KD) methods. DML [21]