The Unreasonable Effectiveness of Fully-Connected
Layers for Low-Data Regimes
Peter Kocsis
Technical University of Munich
peter.kocsis@tum.de
Peter Súkeník
Technical University of Munich
peter.sukenik@trojsten.sk
Guillem Brasó
Technical University of Munich
guillem.braso@tum.de
Matthias Nießner
Technical University of Munich
niessner@tum.de
Laura Leal-Taixé
Technical University of Munich
leal.taixe@tum.de
Ismail Elezi
Technical University of Munich
ismail.elezi@tum.de
peter-kocsis.github.io/LowDataGeneralization/
Abstract
Convolutional neural networks were the standard for solving many computer vision tasks until recently, when Transformer- or MLP-based architectures started to show competitive performance. These architectures typically have a vast number of weights and need to be trained on massive datasets; hence, they are not suitable for use in low-data regimes. In this work, we propose a simple yet effective framework to improve generalization from small amounts of data. We augment modern CNNs with fully-connected (FC) layers and show the massive impact this architectural change has in low-data regimes. We further present an online joint knowledge-distillation method to utilize the extra FC layers at train time but avoid them during test time. This allows us to improve the generalization of a CNN-based model without any increase in the number of weights at test time. We perform classification experiments for a large range of network backbones and several standard datasets on supervised learning and active learning. Our models significantly outperform networks without fully-connected layers, reaching a relative improvement of up to 16% in validation accuracy in the supervised setting without adding any extra parameters during inference.
1 Introduction
Convolutional neural networks (CNNs) [1, 2] have been the dominant architecture in the field of computer vision. Traditionally, CNNs consisted of convolutional (often called cross-correlation) and pooling layers, followed by several fully-connected layers [3-5]. The need for fully-connected layers was challenged in an influential paper [6], and recent modern CNN architectures [7-11] discarded them without a noticeable loss of performance and with a drastic decrease in the number of trainable parameters.
Figure 1: Feature Refiner architecture. Our network takes the features (dimension d_bbf) extracted by the backbone network. We apply a dimension reduction to d_frf with a single FC layer to reduce the model parameters, followed by a symmetric 2-layer multi-layer perceptron wrapped in normalization layers (LayerNorm, FC, ReLU, FC, LayerNorm). A final FC + softmax head produces the predictions over n_c classes.
Recently, the "reign" of all-convolutional networks has been challenged in several papers [
12
16
],
where CNNs were either replaced (or augmented) by vision transformers or replaced by multi-layer
perceptrons (MLPs). These methods remove the inductive biases of CNNs, leaving more learning
freedom to the network. While showing competitive performance and often outperforming CNNs,
these methods come with some major disadvantages. Because of their typically large number of
weights and the removed inductive biases, they need to be trained on massive datasets to reach
top performance. As a consequence, this leads to long training times and the need for massive
computational resources. For example, MLPMixer [
14
] requires a thousand TPU days to be trained
on the ImageNet dataset [17].
In this paper, instead of entirely replacing convolutional layers, we go back to the basics and augment modern CNNs with fully-connected neural networks, combining the best of both worlds. Contrary to new alternative architectures [14, 15] that usually require huge training sets, we focus our study on the opposite scenario: the low-data regime, where the number of labeled samples is very low to moderately low. Remarkably, adding fully-connected layers yields a significant improvement on several standard vision datasets. In addition, our experiments show that this improvement is agnostic to the underlying network architecture and that fully-connected layers are required to achieve the best results. Furthermore, we extend our study with two other settings that typically deal with a low-data regime: active and semi-supervised learning. We find that the same pattern holds in both cases.
An obvious explanation for the performance increase would be that adding fully-connected layers largely increases the number of learnable parameters. To disprove this theory, we use knowledge distillation based on a gradient-gating mechanism that reduces the number of weights used during inference to equal the number of weights of the original network, e.g., ResNet18. We show in our experiments that this reduced network achieves the same test accuracy as the larger (teacher) network and thus significantly outperforms an equivalent architecture that does not use our method.
In summary, our contributions are the following:
• We show that adding fully-connected layers is beneficial for the generalization of convolutional networks in tasks operating in the low-data regime.
• We present a novel online joint knowledge distillation method (OJKD), which allows us to utilize additional final fully-connected layers during training but drop them during inference without a noticeable loss in performance. Doing so, we keep the same number of weights during test time.
• We show state-of-the-art results in supervised learning and active learning, outperforming all-convolutional networks by up to 16% in the low-data regime.
2 Methodology
We propose a simple yet effective framework for improving the generalization from a small amount
of data. In our work, we bring back fully-connected layers at the end of CNN-based architectures.
Figure 2: Online joint knowledge distillation pipeline. Besides the baseline network's classification head (FC + softmax), we append an extra head with our Feature Refiner. The network is trained with a loss composed of the two heads' cross-entropy losses (L_CE^Original and L_CE^FR). During training, the gradients coming from the single fully-connected layer are blocked by the Gradient Gate. As a result, the backbone is updated only by the second head with our Feature Refiner. At test time, this extra head can be neglected without any noticeable performance loss.
We show that by adding as little as 0.37% extra parameters during training, we can significantly improve the generalization in the low-data regime. Our network architecture consists of two main parts: a convolutional backbone network and our proposed Feature Refiner (FR) based on multi-layer perceptrons. Our method is task- and model-agnostic and can be applied to many convolutional networks.
In our method, we extract features with the convolutional backbone network. Then, we apply our FR followed by a task-specific head. More precisely, we first reduce the feature dimension d_bbf to d_frf with a single linear layer to reduce the number of extra parameters. Then we apply a symmetric two-layer multi-layer perceptron wrapped around by normalization layers. We present the precise architecture of our Feature Refiner in Figure 1.
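As a concrete illustration, below is a minimal PyTorch sketch of a module matching this description; the module and parameter names are ours, and the exact placement of the normalization layers follows our reading of Figure 1, so it should be treated as an assumption rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class FeatureRefiner(nn.Module):
    """Sketch of the Feature Refiner: dimension reduction followed by a
    symmetric two-layer MLP wrapped in LayerNorm layers (see Figure 1)."""

    def __init__(self, d_bbf: int = 512, d_frf: int = 64):
        super().__init__()
        self.reduce = nn.Linear(d_bbf, d_frf)   # single linear layer: d_bbf -> d_frf
        self.mlp = nn.Sequential(
            nn.LayerNorm(d_frf),
            nn.Linear(d_frf, d_frf),
            nn.ReLU(),
            nn.Linear(d_frf, d_frf),
            nn.LayerNorm(d_frf),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.reduce(features))

# Hypothetical usage: refine ResNet18 features (d_bbf = 512) and classify into 10 classes.
refiner = FeatureRefiner(d_bbf=512, d_frf=64)
classifier = nn.Linear(64, 10)   # FC head; the softmax is folded into the cross-entropy loss
logits = classifier(refiner(torch.randn(8, 512)))
```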
2.1 Online Joint Knowledge Distillation
One could argue that using more parameters can improve the performance just because of the
increased expressivity of the network. To disprove this argument, we develop an online joint
knowledge distillation (OJKD) method. Our OJKD enables us to use the exact same architecture as
our baseline networks during inference and utilizes our FR solely during training.
We base our training pipeline on the baseline network’s architecture. We split the baseline network
into two parts, the convolutional backbone for feature extraction and the final fully-connected
classification head. We append an additional head with our Feature Refiner. We devise the final
loss as the sum of the two head’s losses, making sure both heads are trained in parallel (online) and
enforcing that they share the same network backbone (joint). During inference, we drop the additional
head and use only the original one, resulting in the exact same test time architecture as our baseline.
In other words, our FR head is the teacher network that shares the backbone with the student original
head, and we distill the knowledge of the teacher head into the student head.
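Concretely, writing z = f_backbone(x) for the backbone features of an input x with label y, and h_orig and h_FR for the original and Feature Refiner heads (this notation is ours, introduced only for illustration), the training objective is simply the sum of the two heads' cross-entropy losses:

$$\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}}\big(h_{\mathrm{orig}}(z),\, y\big) \;+\; \mathcal{L}_{\mathrm{CE}}\big(h_{\mathrm{FR}}(z),\, y\big), \qquad z = f_{\mathrm{backbone}}(x).$$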
However, the key ingredient of our OJKD is the gradient-gating mechanism, which we call the Gradient Gate (GG). This gating mechanism blocks the gradient of the original head during training, making the backbone depend only on our FR head. We implement this functionality with a single layer. During the forward pass, the GG acts as the identity and simply forwards the input without any modification. However, during the backward pass, it sets the gradient to zero. This way, the original head's gradients are backpropagated only up to the GG. While the original head gets optimized, it does not influence the training of the backbone but only adapts to it. Consequently, the backbone is trained only with the gradients of our FR head. Furthermore, the original head can still fit to the backbone and reach a similar performance as the FR head. We find that we can still improve upon the baseline without our gating mechanism, but we reach lower accuracy than when we use it. We show the pipeline of our OJKD in Figure 2.
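The Gradient Gate is straightforward to express as a custom autograd function. The sketch below is our own minimal PyTorch reading of the description above, not the authors' code; an equivalent effect can also be obtained by feeding detached features to the original head.

```python
import torch

class GradientGate(torch.autograd.Function):
    """Identity in the forward pass; zero gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # Block the original head's gradients from reaching the backbone.
        return torch.zeros_like(grad_output)

def ojkd_loss(backbone, original_head, fr_head, images, labels):
    """Sketch of the OJKD training objective: sum of the two heads' CE losses.
    The original head sees gated features, so only the FR head trains the backbone."""
    features = backbone(images)
    logits_orig = original_head(GradientGate.apply(features))
    logits_fr = fr_head(features)
    ce = torch.nn.functional.cross_entropy
    return ce(logits_orig, labels) + ce(logits_fr, labels)
```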
3 Experiments
Figure 3: Comparisons with ResNet18. Validation accuracy [%] against dataset size [samples] for ResNet18 and FR-ResNet18 on (a) CIFAR10, (b) CIFAR100, (c) Caltech101, and (d) Caltech256. We compare our approach (FR) to the baseline network ResNet18 in supervised learning. Our method significantly outperforms the baseline, especially in the more challenging earlier stages, when we have a small amount of data.
In this section, we demonstrate the substantial effectiveness of our simple approach in improving the performance of neural networks in low-data regimes.
Datasets and the number of labels.
For all experiments, we report accuracy as the primary metric and use four public datasets: CIFAR10 [18], CIFAR100 [18], Caltech101 [19], and Caltech256 [19]. We use the predefined train/test split for the CIFAR datasets, while we split the Caltech datasets into 70% training and 30% testing, maintaining the class distribution. For the simpler datasets, CIFAR10 and Caltech101, we start with an initial labeled pool of 1000 images, while for the more complicated CIFAR100 and Caltech256, we start with 5000 labeled images. In both cases, we incrementally add 1000 samples to the labeled pool in each cycle and evaluate the performance with increasingly larger training datasets. We use the active learning terminology for a 'cycle', where a cycle is defined as a complete training loop.
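As an illustration of this protocol, the following sketch (with hypothetical function and variable names, and a placeholder training routine) shows how the labeled pool could be grown by 1000 samples per cycle under the random labeling strategy used in the supervised setting:

```python
import random
from torch.utils.data import Subset

def run_cycles(full_train_set, initial_size=1000, step=1000, num_cycles=10, train_and_eval=None):
    """Grow the labeled pool by `step` samples each cycle and re-train from scratch.
    `train_and_eval` stands in for a full training loop returning validation accuracy."""
    indices = list(range(len(full_train_set)))
    random.shuffle(indices)
    accuracies = []
    for cycle in range(num_cycles):
        pool_size = initial_size + cycle * step
        labeled_set = Subset(full_train_set, indices[:pool_size])
        accuracies.append(train_and_eval(labeled_set))
    return accuracies
```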
CNN backbones.
For most experiments, we use ResNet18 [8] as our backbone (d_bbf = 512) with a reduced feature size d_frf = 64. We compare the results of our method with those of pure ResNet18 on both the supervised and active learning setups. Note that the supervised case (Figure 3) is equivalent to a random labeling strategy in an active learning setup. We compare our method with various active learning strategies. We also compare to a non-convolutional network, the MLPMixer [14]. Finally, to show the generalizability of our method, we also experiment with other backbone networks: ResNet34, EfficientNet, and DenseNet. We run each experiment 5 times and report the mean and standard deviation. We train each network on a single GPU. We summarize the results in plots and refer to our supplementary material for exact numbers and complete implementation details.
Training details.
For the CIFAR experiments, we follow the training procedure of [20]. More precisely, we train our networks for 200 epochs using the SGD optimizer with learning rate 0.1, momentum 0.9, and weight decay 5e-4, and we divide the learning rate by 10 after 80% of the epochs. We use cross-entropy loss as supervision.
Figure 4: Comparison with MLPMixer [14] and ViT [12]. Test accuracy against dataset size for ResNet18, FR-ResNet18, MLPMixer, and ViT-B16 on (a) CIFAR10 and (b) CIFAR100. Our method significantly outperforms both architectures in the low-data regime.
For the more complex Caltech datasets, we start with an ImageNet-pretrained backbone and reduce the dimensionality in our FR only to 256. We use the same training setup for a full fine-tuning, except that we reduce the initial learning rate to 1e-3 and train for only 100 epochs.
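For reference, a minimal PyTorch sketch of this CIFAR training configuration could look as follows; the optimizer and scheduler choices are spelled out from the text above, and the milestone of 160 epochs is simply 80% of the 200 training epochs (the helper name is ours).

```python
import torch

def make_optimizer_and_scheduler(model, epochs=200):
    # SGD with lr 0.1, momentum 0.9, and weight decay 5e-4, as in the CIFAR setup.
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4
    )
    # Divide the learning rate by 10 after 80% of the epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[int(0.8 * epochs)], gamma=0.1
    )
    criterion = torch.nn.CrossEntropyLoss()
    return optimizer, scheduler, criterion
```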
3.1 Supervised Learning
Comparisons with ResNet18 [8].
We compare the results of our method with those of ResNet18. As shown in Figure 3a, on the first training cycle (1000 labels), our method outperforms ResNet18 by 7.6 percentage points (pp). On the second cycle, we outperform ResNet18 by more than 10pp. We keep outperforming ResNet18 until the seventh cycle, where our improvement is half a percentage point. For the remaining iterations, both methods reach the same accuracy.
On the CIFAR100 dataset (see Figure 3b), we start by outperforming ResNet18 by 5.7pp, and in the second cycle, we are better by almost 7pp. We continue outperforming ResNet18 in all ten cycles, in the last one being better by half a percentage point. We see similar behavior on Caltech101 (see Figure 3c) and Caltech256 (see Figure 3d).
A common tendency for all datasets is that with an increasing number of labeled samples, the gap between our method and the baseline shrinks. Therefore, dropping the fully-connected layers in the case of a large labeled dataset does not cause any disadvantage, as was found in [6]. However, that work did not analyze this question in the low-data regime, where using FC layers after CNN architectures is clearly beneficial.
Comparisons with MLPMixer [14] and ViT [12].
We also compare the results of our method with those of two non-convolutional networks, MLPMixer and ViT. Similar to our method, the power of MLPMixer rests on the strengths of fully-connected layers. On the other hand, ViT is a transformer-based architecture, where each Transformer block contains an attention block and a block consisting of fully-connected layers. Unlike these methods, our method uses both convolutional and fully-connected layers to leverage the advantages of both the high-level convolutional features and the global interrelations captured by the fully-connected layers. In Figure 4a, we compare to MLPMixer and ViT on the CIFAR10 dataset. On the first cycle, we outperform MLPMixer by 4.4pp and ViT by circa 6pp. We keep outperforming both methods in all other cycles, including the last one, where we do better than them by circa 13pp. As we can see, MLPMixer and ViT do not perform well even in the latest training cycles and need to be trained on massive datasets. We show a similar comparison for CIFAR100 in Figure 4b.
Comparisons with Knowledge Distillation baselines.
To show that our method's main strength comes from our FR head, we compare to several knowledge distillation (KD) methods. DML [21]