The Unreasonable Effectiveness of Fully-Connected
Layers for Low-Data Regimes
Peter Kocsis
Technical University of Munich
peter.kocsis@tum.de
Peter Súkeník
Technical University of Munich
peter.sukenik@trojsten.sk
Guillem Brasó
Technical University of Munich
guillem.braso@tum.de
Matthias Nießner
Technical University of Munich
niessner@tum.de
Laura Leal-Taixé
Technical University of Munich
leal.taixe@tum.de
Ismail Elezi
Technical University of Munich
ismail.elezi@tum.de
peter-kocsis.github.io/LowDataGeneralization/
Abstract
Convolutional neural networks were the standard for solving many computer vision tasks until recently, when Transformer- or MLP-based architectures started to show competitive performance. These architectures typically have a vast number of weights and need to be trained on massive datasets; hence, they are not suitable for use in low-data regimes. In this work, we propose a simple yet effective framework to improve generalization from small amounts of data. We augment modern CNNs with fully-connected (FC) layers and show the massive impact this architectural change has in low-data regimes. We further present an online joint knowledge-distillation method to utilize the extra FC layers at train time but avoid them during test time. This allows us to improve the generalization of a CNN-based model without any increase in the number of weights at test time. We perform classification experiments for a large range of network backbones and several standard datasets on supervised learning and active learning. Our models significantly outperform networks without fully-connected layers, reaching a relative improvement of up to 16% in validation accuracy in the supervised setting without adding any extra parameters during inference.
1 Introduction
Convolutional neural networks (CNNs) [1, 2] have been the dominant architecture in the field of computer vision. Traditionally, CNNs consisted of convolutional (often called cross-correlation) and pooling layers, followed by several fully-connected layers [3-5]. The need for fully-connected layers was challenged in an influential paper [6], and recent modern CNN architectures [7-11] discarded them without a noticeable loss of performance and with a drastic decrease in the number of trainable parameters.
Figure 1: Feature Refiner architecture. Our network takes the features (dimension d_bbf) extracted by the backbone network. We apply a dimension reduction to d_frf with a single FC layer to reduce the model parameters, followed by a symmetric 2-layer multi-layer perceptron wrapped in normalization layers (LayerNorm, FC, ReLU, FC, LayerNorm). A final FC + softmax head produces the predictions over n_c classes.
Recently, the "reign" of all-convolutional networks has been challenged in several papers [
12
16
],
where CNNs were either replaced (or augmented) by vision transformers or replaced by multi-layer
perceptrons (MLPs). These methods remove the inductive biases of CNNs, leaving more learning
freedom to the network. While showing competitive performance and often outperforming CNNs,
these methods come with some major disadvantages. Because of their typically large number of
weights and the removed inductive biases, they need to be trained on massive datasets to reach
top performance. As a consequence, this leads to long training times and the need for massive
computational resources. For example, MLPMixer [
14
] requires a thousand TPU days to be trained
on the ImageNet dataset [17].
In this paper, instead of entirely replacing convolutional layers, we go back to the basics and augment modern CNNs with fully-connected neural networks, combining the best of both worlds. Contrary to new alternative architectures [14, 15] that usually require huge training sets, we focus our study on the opposite scenario: the low-data regime, where the number of labeled samples is very low to moderately low. Remarkably, adding fully-connected layers yields a significant improvement on several standard vision datasets. In addition, our experiments show that this improvement is agnostic to the underlying network architecture and that fully-connected layers are required to achieve the best results. Furthermore, we extend our study with two other settings that typically deal with a low-data regime: active and semi-supervised learning. We find that the same pattern holds in both cases.
An obvious explanation for the performance increase would be that adding fully-connected layers largely increases the number of learnable parameters. To disprove this theory, we use knowledge distillation based on a gradient-gating mechanism that reduces the number of weights used during inference to equal the number of weights of the original network, e.g., ResNet18. We show in our experiments that this reduced network achieves the same test accuracy as the larger (teacher) network and thus significantly outperforms an equivalent architecture that does not use our method.
In summary, our contributions are the following:
• We show that adding fully-connected layers is beneficial for the generalization of convolutional networks in tasks operating in the low-data regime.
• We present a novel online joint knowledge distillation method (OJKD), which allows us to utilize additional final fully-connected layers during training but drop them during inference without a noticeable loss in performance. Doing so, we keep the same number of weights during test time.
• We show state-of-the-art results in supervised learning and active learning, outperforming all-convolutional networks by up to 16% in the low-data regime.
2 Methodology
We propose a simple yet effective framework for improving the generalization from a small amount
of data. In our work, we bring back fully-connected layers at the end of CNN-based architectures.
Figure 2: Online joint knowledge distillation pipeline. Besides the baseline network's classification head (FC + softmax), we append an extra head with our Feature Refiner. The network is trained with a loss composed of the two heads' cross-entropy losses (L_CE^Original and L_CE^FR). During training, the gradients coming from the single fully-connected layer are blocked by the Gradient Gate. As a result, the backbone is updated only by the second head with our Feature Refiner. At test time, this extra head can be neglected without any noticeable performance loss.
We show that by adding as little as 0.37% extra parameters during training, we can significantly improve the generalization in the low-data regime. Our network architecture consists of two main parts: a convolutional backbone network and our proposed Feature Refiner (FR) based on multi-layer perceptrons. Our method is task- and model-agnostic and can be applied to many convolutional networks.
In our method, we extract features with the convolutional backbone network. Then, we apply our FR followed by a task-specific head. More precisely, we first reduce the feature dimension d_bbf to d_frf with a single linear layer to reduce the number of extra parameters. Then we apply a symmetric two-layer multi-layer perceptron wrapped around by normalization layers. We present the precise architecture of our Feature Refiner in Figure 1.
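As a concrete illustration, below is a minimal PyTorch sketch of a module matching this description; the module and parameter names are ours, and the exact placement of the normalization layers follows our reading of Figure 1, so it should be treated as an assumption rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class FeatureRefiner(nn.Module):
    """Sketch of the Feature Refiner: dimension reduction followed by a
    symmetric two-layer MLP wrapped in LayerNorm layers (see Figure 1)."""

    def __init__(self, d_bbf: int = 512, d_frf: int = 64):
        super().__init__()
        self.reduce = nn.Linear(d_bbf, d_frf)   # single linear layer: d_bbf -> d_frf
        self.mlp = nn.Sequential(
            nn.LayerNorm(d_frf),
            nn.Linear(d_frf, d_frf),
            nn.ReLU(),
            nn.Linear(d_frf, d_frf),
            nn.LayerNorm(d_frf),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.reduce(features))

# Hypothetical usage: refine ResNet18 features (d_bbf = 512) and classify into 10 classes.
refiner = FeatureRefiner(d_bbf=512, d_frf=64)
classifier = nn.Linear(64, 10)   # FC head; the softmax is folded into the cross-entropy loss
logits = classifier(refiner(torch.randn(8, 512)))
```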
2.1 Online Joint Knowledge Distillation
One could argue that using more parameters can improve the performance just because of the
increased expressivity of the network. To disprove this argument, we develop an online joint
knowledge distillation (OJKD) method. Our OJKD enables us to use the exact same architecture as
our baseline networks during inference and utilizes our FR solely during training.
We base our training pipeline on the baseline network’s architecture. We split the baseline network
into two parts, the convolutional backbone for feature extraction and the final fully-connected
classification head. We append an additional head with our Feature Refiner. We devise the final
loss as the sum of the two head’s losses, making sure both heads are trained in parallel (online) and
enforcing that they share the same network backbone (joint). During inference, we drop the additional
head and use only the original one, resulting in the exact same test time architecture as our baseline.
In other words, our FR head is the teacher network that shares the backbone with the student original
head, and we distill the knowledge of the teacher head into the student head.
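Concretely, writing z = f_backbone(x) for the backbone features of an input x with label y, and h_orig and h_FR for the original and Feature Refiner heads (this notation is ours, introduced only for illustration), the training objective is simply the sum of the two heads' cross-entropy losses:

$$\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}}\big(h_{\mathrm{orig}}(z),\, y\big) \;+\; \mathcal{L}_{\mathrm{CE}}\big(h_{\mathrm{FR}}(z),\, y\big), \qquad z = f_{\mathrm{backbone}}(x).$$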
However, the key ingredient of our OJKD is the gradient-gating mechanism, which we call the Gradient Gate (GG). This gating mechanism blocks the gradient of the original head during training, making the backbone depend only on our FR head. We implement this functionality with a single layer. During the forward pass, the GG acts as the identity and simply forwards the input without any modification. However, during the backward pass, it sets the gradient to zero. This way, the original head's gradients are backpropagated only up to the GG. While the original head gets optimized, it does not influence the training of the backbone but only adapts to it. Consequently, the backbone is trained only with the gradients of our FR head. Furthermore, the original head can still fit to the backbone and reach a similar performance as the FR head. We find that we can still improve upon the baseline without our gating mechanism, but we reach lower accuracy than when we use it. We show the pipeline of our OJKD in Figure 2.
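The Gradient Gate is straightforward to express as a custom autograd function. The sketch below is our own minimal PyTorch reading of the description above, not the authors' code; an equivalent effect can also be obtained by feeding detached features to the original head.

```python
import torch

class GradientGate(torch.autograd.Function):
    """Identity in the forward pass; zero gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # Block the original head's gradients from reaching the backbone.
        return torch.zeros_like(grad_output)

def ojkd_loss(backbone, original_head, fr_head, images, labels):
    """Sketch of the OJKD training objective: sum of the two heads' CE losses.
    The original head sees gated features, so only the FR head trains the backbone."""
    features = backbone(images)
    logits_orig = original_head(GradientGate.apply(features))
    logits_fr = fr_head(features)
    ce = torch.nn.functional.cross_entropy
    return ce(logits_orig, labels) + ce(logits_fr, labels)
```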
3 Experiments
Figure 3: Comparisons with ResNet18. Validation accuracy [%] against dataset size [samples] for ResNet18 and FR-ResNet18 on (a) CIFAR10, (b) CIFAR100, (c) Caltech101, and (d) Caltech256. We compare our approach (FR) to the baseline network ResNet18 in supervised learning. Our method significantly outperforms the baseline, especially in the more challenging earlier stages, when we have a small amount of data.
In this section, we demonstrate the substantial effectiveness of our simple approach in improving the performance of neural networks in low-data regimes.
Datasets and the number of labels.
For all experiments, we report accuracy as the primary metric and use four public datasets: CIFAR10 [18], CIFAR100 [18], Caltech101 [19], and Caltech256 [19]. We use the predefined train/test split for the CIFAR datasets, while we split the Caltech datasets into 70% training and 30% testing, maintaining the class distribution. For the simpler datasets, CIFAR10 and Caltech101, we start with an initial labeled pool of 1000 images, while for the more complicated CIFAR100 and Caltech256, we start with 5000 labeled images. In both cases, we incrementally add 1000 samples to the labeled pool in each cycle and evaluate the performance with increasingly larger training datasets. We use the active learning terminology for a 'cycle', where a cycle is defined as a complete training loop.
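As an illustration of this protocol, the following sketch (with hypothetical function and variable names, and a placeholder training routine) shows how the labeled pool could be grown by 1000 samples per cycle under the random labeling strategy used in the supervised setting:

```python
import random
from torch.utils.data import Subset

def run_cycles(full_train_set, initial_size=1000, step=1000, num_cycles=10, train_and_eval=None):
    """Grow the labeled pool by `step` samples each cycle and re-train from scratch.
    `train_and_eval` stands in for a full training loop returning validation accuracy."""
    indices = list(range(len(full_train_set)))
    random.shuffle(indices)
    accuracies = []
    for cycle in range(num_cycles):
        pool_size = initial_size + cycle * step
        labeled_set = Subset(full_train_set, indices[:pool_size])
        accuracies.append(train_and_eval(labeled_set))
    return accuracies
```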
CNN backbones.
For most experiments, we use ResNet18 [8] as our backbone (d_bbf = 512) with a reduced feature size d_frf = 64. We compare the results of our method with those of pure ResNet18 on both the supervised and active learning setups. Note that the supervised case (Figure 3) is equivalent to a random labeling strategy in an active learning setup. We compare our method with various active learning strategies. We also compare to a non-convolutional network, the MLPMixer [14]. Finally, to show the generalizability of our method, we also experiment with other backbone networks: ResNet34, EfficientNet, and DenseNet. We run each experiment 5 times and report the mean and standard deviation. We train each network on a single GPU. We summarize the results in plots and refer to our supplementary material for exact numbers and complete implementation details.
Training details.
For the CIFAR experiments, we follow the training procedure of [20]. More precisely, we train our networks for 200 epochs using the SGD optimizer with learning rate 0.1, momentum 0.9, and weight decay 5e-4, and we divide the learning rate by 10 after 80% of the epochs. We use cross-entropy loss as supervision.
Figure 4: Comparison with MLPMixer [14] and ViT [12]. Test accuracy against dataset size for ResNet18, FR-ResNet18, MLPMixer, and ViT-B16 on (a) CIFAR10 and (b) CIFAR100. Our method significantly outperforms both architectures in the low-data regime.
For the more complex Caltech datasets, we start with an ImageNet-pretrained backbone and reduce the dimensionality in our FR only to 256. We use the same training setup for a full fine-tuning, except that we reduce the initial learning rate to 1e-3 and train for only 100 epochs.
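For reference, a minimal PyTorch sketch of this CIFAR training configuration could look as follows; the optimizer and scheduler choices are spelled out from the text above, and the milestone of 160 epochs is simply 80% of the 200 training epochs (the helper name is ours).

```python
import torch

def make_optimizer_and_scheduler(model, epochs=200):
    # SGD with lr 0.1, momentum 0.9, and weight decay 5e-4, as in the CIFAR setup.
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4
    )
    # Divide the learning rate by 10 after 80% of the epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[int(0.8 * epochs)], gamma=0.1
    )
    criterion = torch.nn.CrossEntropyLoss()
    return optimizer, scheduler, criterion
```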
3.1 Supervised Learning
Comparisons with ResNet18 [8].
We compare the results of our method with those of ResNet18. As shown in Figure 3a, on the first training cycle (1000 labels), our method outperforms ResNet18 by 7.6 percentage points (pp). On the second cycle, we outperform ResNet18 by more than 10pp. We keep outperforming ResNet18 until the seventh cycle, where our improvement is half a percentage point. For the remaining iterations, both methods reach the same accuracy.
On the CIFAR100 dataset (see Figure 3b), we start by outperforming ResNet18 by 5.7pp, and in the second cycle, we are better by almost 7pp. We continue outperforming ResNet18 in all ten cycles, in the last one being better by half a percentage point. We see similar behavior on Caltech101 (see Figure 3c) and Caltech256 (see Figure 3d).
A common tendency for all datasets is that with an increasing number of labeled samples, the gap between our method and the baseline shrinks. Therefore, dropping the fully-connected layers in the case of a large labeled dataset does not cause any disadvantage, as was found in [6]. However, that work did not analyze this question in the low-data regime, where using FC layers after CNN architectures is clearly beneficial.
Comparisons with MLPMixer [14] and ViT [12].
We also compare the results of our method with those of two non-convolutional networks, MLPMixer and ViT. Similar to our method, the power of MLPMixer rests on the strengths of fully-connected layers. On the other hand, ViT is a transformer-based architecture, where each Transformer block contains an attention block and a block consisting of fully-connected layers. Unlike these methods, our method uses both convolutional and fully-connected layers to leverage the advantages of both the high-level convolutional features and the global interrelations captured by the fully-connected layers. In Figure 4a, we compare to MLPMixer and ViT on the CIFAR10 dataset. On the first cycle, we outperform MLPMixer by 4.4pp and ViT by circa 6pp. We keep outperforming both methods in all other cycles, including the last one, where we do better than them by circa 13pp. As we can see, MLPMixer and ViT do not perform well even in the latest training cycles and need to be trained on massive datasets. We show a similar comparison for CIFAR100 in Figure 4b.
Comparisons with Knowledge Distillation baselines.
To show that our method's main strength comes from our FR head, we compare to several knowledge distillation (KD) methods. DML [21]