Bridging the Gap Between Vision Transformers and
Convolutional Neural Networks on Small Datasets
Zhiying Lu, Hongtao Xie, Chuanbin Liu, Yongdong Zhang
University of Science and Technology of China, Hefei, China
arieseirack@mail.ustc.edu.cn,{htxie,liucb92,zhyd73}@ustc.edu.cn
Abstract
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is commonly attributed to the lack of inductive bias. In this paper, we further investigate this problem and point out two weaknesses of ViTs in inductive biases, namely spatial relevance and diverse channel representation. First, on the spatial aspect, objects are locally compact and relevant, so fine-grained features need to be extracted from a token and its neighbors; however, the lack of data hinders ViTs from attending to this spatial relevance. Second, on the channel aspect, representation exhibits diversity across different channels, but scarce data does not enable ViTs to learn representations strong enough for accurate recognition. To this end, we propose the Dynamic Hybrid Vision Transformer (DHVT) as a solution that enhances these two inductive biases. On the spatial aspect, we adopt a hybrid structure in which convolution is integrated into the patch embedding and multi-layer perceptron modules, forcing the model to capture token features as well as their neighboring features. On the channel aspect, we introduce a dynamic feature aggregation module in the MLP and a brand new "head token" design in the multi-head self-attention module to help re-calibrate channel representations and make the representations of different channel groups interact with each other. The fusion of weak channel representations forms a representation strong enough for classification. With this design, we successfully eliminate the performance gap between CNNs and ViTs, and our DHVT achieves a series of state-of-the-art results with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters. Code is available at https://github.com/ArieSeirack/DHVT.
1 Introduction
Convolutional Neural Networks (CNNs) have dominated the Computer Vision (CV) field as the backbone for various tasks like classification [1, 2, 3, 4, 5, 6, 7], object detection [8, 9, 10] and segmentation [11, 12, 13]. Recent years have witnessed the rapid growth of another promising alternative architecture paradigm, Vision Transformers (ViTs). They have already exhibited great performance in common tasks such as classification [14, 15, 16, 17, 18, 19], object detection [20, 21, 22] and segmentation [23, 24].
ViT [14] is the pioneering model that brings the Transformer architecture [25] from Natural Language Processing (NLP) into CV. It has a higher performance upper bound than standard CNNs, but at the cost of expensive computation and an extremely large amount of training data. The vanilla ViT needs to first be pre-trained on the huge JFT-300M dataset [14] and then fine-tuned on the common dataset ImageNet-1K [26]. Under this experimental setting, it shows higher performance than standard CNNs. However, when training from scratch on ImageNet-1K only, the accuracy is
much lower. From a practical perspective, most datasets are even smaller than ImageNet-1K, and not all researchers can afford the cost of pre-training their own model on large datasets and then fine-tuning it on the target small datasets. Thus, an effective architecture for training ViTs from scratch on small datasets is in demand.
Recent works [27, 28, 29] explore the reasons for the difference in data efficiency between ViTs and CNNs, and attribute it to the lack of inductive bias. [27] points out that without enough data, ViT does not learn to attend locally in earlier layers. [28] observes that the stronger the inductive biases, the stronger the representations, that large datasets tend to help ViT learn strong representations, and that locality constraints improve the performance of ViT. Meanwhile, recent work [29] demonstrates that convolutional constraints enable strongly sample-efficient training in the small-data regime. Since insufficient training data makes it hard for ViT to derive the inductive bias of attending locally, many recent works strive to introduce local inductive bias by integrating convolution into ViTs [18, 15, 30, 31, 32] and modifying them into hierarchical structures [33, 34, 16, 17, 35], making ViTs more like traditional CNNs. This style of hybrid structure shows performance comparable to strong CNNs when training from scratch on the medium-sized dataset ImageNet-1K only. But the performance gap on much smaller datasets still remains.
Here, we consider that scarce training data weakens the inductive biases in ViTs. Two kinds of inductive bias need to be enhanced and better exploited to improve data efficiency: spatial relevance and diverse channel representation. On the spatial aspect, tokens are relevant and objects are locally compact. Important fine-grained low-level features need to be extracted from a token and its neighbors at the earlier layers. Rethinking the feature extraction framework in ViTs, the module for feature representation is the multi-layer perceptron (MLP), whose receptive field can be seen as only the token itself. ViTs therefore depend on the multi-head self-attention (MHSA) module to model and capture the relations between tokens. As pointed out in [27], with less training data, lower attention layers do not learn to attend locally. In other words, they do not focus on neighboring tokens and aggregate local information in the early stage. As is known, capturing local features in lower layers facilitates the whole representation pipeline: the deep layers sequentially process low-level texture features into high-level semantic features for final recognition. Thus ViTs show an extreme performance gap compared with CNNs when training from scratch on small datasets. On the channel aspect, feature representation exhibits diversity across channels. ViT has its own inductive bias that different channel groups encode different feature representations of the object, and the whole token vector forms the representation of the object. As pointed out in [28], large datasets tend to help ViT learn strong representations. Insufficient data cannot enable ViTs to learn strong enough representations, so the overall representation is too weak for accurate classification.
In this paper, we address the performance gap between CNNs and ViTs when training from scratch on small datasets and provide a hybrid architecture called Dynamic Hybrid Vision Transformer (DHVT) as a substitute. We first introduce a hybrid model to address the issue on the spatial aspect. The proposed hybrid model integrates a sequence of convolution layers into the patch embedding stage to eliminate the non-overlapping problem and preserve fine-grained low-level features, and it involves depth-wise convolution [36] in the MLP for local feature extraction. In addition, we design two modules to make feature representation stronger and solve the problem on the channel aspect. To be specific, in the MLP, depth-wise convolution is adopted for the patch tokens, and the class token is passed through identically without any computation. We then leverage the output patch tokens to produce channel weights, in the style of Squeeze-Excitation (SE) [4], for the class token. This operation helps re-calibrate each channel of the class token to reinforce its feature representation. Moreover, to enhance interaction among the different semantic representations of different channel groups, and owing to the variable length of the token sequence in the vision transformer structure, we devise a brand new token mechanism called the "head token". The number of head tokens is the same as the number of attention heads in MHSA. Head tokens are generated by segmenting and projecting input tokens along the channel dimension. The head tokens are concatenated with all other tokens and passed through the MHSA. Each channel group in the corresponding attention head of the MHSA is now able to interact with the others. Though the representation in each channel and channel group may be poor for classification on account of insufficient training data, the head tokens help re-calibrate each learned feature pattern and enable a stronger integral representation of the object, which is beneficial to final recognition.
We conduct experiments training from scratch on various small datasets, namely the common dataset CIFAR-100 and the small domain datasets Clipart, Painting and Sketch from DomainNet [37], to examine the performance of our model. On CIFAR-100, our proposed models show a significant performance margin over strong CNNs like ResNeXt, DenseNet and Res2Net. The Tiny model achieves 83.54% with only 5.8M parameters, and our Small model reaches the state-of-the-art 85.68% accuracy with only 22.8M parameters, outperforming a series of strong CNNs. Therefore, we eliminate the gap between CNNs and ViTs, providing an alternative architecture that can be trained from scratch on small datasets. We also evaluate the performance of DHVT when training from scratch on ImageNet-1K. Our proposed DHVT-S achieves a competitive 82.3% accuracy with only 24.0M parameters, which is, as far as we know, the state-of-the-art non-hierarchical vision transformer structure, demonstrating the effectiveness of our model on larger datasets. In summary, our main contributions are:
1. We conclude that data efficiency on small datasets can be addressed by strengthening two inductive biases in ViTs: spatial relevance and diverse channel representation.
2. On the spatial aspect, we adopt a hybrid model integrated with convolution, preserving fine-grained low-level features at the earlier stage and forcing the model to extract token features as well as their neighboring features.
3. On the channel aspect, we leverage the output patch tokens to re-calibrate the class token channel-wise, producing better feature representation. We further introduce the "head token", a novel design that helps fuse the diverse feature representations encoded in different channel groups into a stronger integral representation.
2 Related Work
Vision Transformers. Convolutional Neural Networks [38, 39, 1, 40, 2, 41] dominated the computer vision field in the past decade, with intrinsic inductive biases designed for image recognition. The past two years have witnessed the rise of Vision Transformer models in various vision tasks [42, 43, 23, 20, 44, 45]. Although previous works introduced attention mechanisms into CNNs [4, 46, 47], the pioneering full transformer architectures in computer vision are iGPT [48] and ViT [14]. ViT is widely adopted as the architecture paradigm for vision tasks, especially image recognition. It processes the image as a token sequence and exploits relations among tokens. It uses a "class token", as in BERT [49], to exchange information at every layer and for final classification. It performs well when pre-trained on huge datasets. But when training from scratch on ImageNet-1K only, it underperforms ResNets, demonstrating a data-hungry problem.
Data-efficient ViTs. Many subsequent modifications of ViT strive for a more data-efficient architecture that can perform well without pre-training on larger datasets. The methods can be divided into different groups. [42, 50] use knowledge distillation and stronger data-augmentation strategies to enable training from scratch. [51] points out that using convolution in the patch embedding stage greatly benefits ViT training. [52, 15, 18, 53, 54] leverage convolution for patch embedding to eliminate the discontinuity brought by non-overlapping patch embedding in vanilla ViT, and such a design has become a paradigm in subsequent works. To further introduce inductive bias into ViT, [15, 30, 34, 55, 56] integrate depth-wise convolution into the feed forward network, resulting in a hybrid architecture combining self-attention and convolution. To make ViTs more similar to standard CNNs, [16, 54, 17, 34, 33, 35, 57, 32] re-design the spatial and channel dimensions of vanilla ViT, producing a series of hierarchical-style vision transformers. [31, 58, 59] design a parallel convolution branch and enable interaction with the self-attention branch, making the two branches complement each other. The above architectures introduce strong inductive bias and become data-efficient when training from scratch on ImageNet-1K. In addition, works like [60, 61, 62] investigate channel-wise representation by conducting self-attention channel-wise, whereas we enhance channel representation by dynamically aggregating patch token features to strengthen the class token channel-wise and by compatibly involving channel group-wise head tokens in vanilla self-attention. Finally, works like [63, 64, 65] suggest that the number of tokens can be variable.
ViTs for small datasets. Several works address the training-from-scratch problem on small datasets. Though the above modified vision transformers perform well when trained on ImageNet-1K, they fail to compete with standard CNNs when training on much smaller datasets like CIFAR-100. [66] introduces a self-supervised training strategy and a loss function to help train ViTs on small datasets. CCT [67] adopts a convolutional tokenization module and replaces the class token with a final sequence pooling operation. SL-ViT [68] adopts a shifted patch tokenization module and modifies self-attention to make it focus more locally. Though these previous works reduce the performance gap with standard CNNs such as ResNets [1], they remain sub-optimal when compared with strong CNNs. Our proposed method leverages local constraints and enhances representation interaction, successfully bridging the performance gap on small datasets.
Figure 1: Overview of the proposed Dynamic Hybrid Vision Transformer (DHVT). DHVT follows a non-hierarchical structure, where each encoder layer contains two pre-norm and shortcut connections, a Head-Interacted Multi-Head Self-Attention (HI-MHSA) and a Dynamic Aggregation Feed Forward (DAFF).
3 Methods
3.1 Overview of DHVT
As shown in Fig. 1, the framework of our proposed DHVT is similar to vanilla ViT. We choose a non-hierarchical structure, where every encoder block shares the same parameter setting and processes features of the same shape. Under this structure, we can deal with a variable-length token sequence. We keep the design of using a class token to interact with all the patch tokens and for final prediction. In the patch embedding module, the input image is first split into patches. Given an input image with resolution H×W and target patch size P, the resulting length of the patch token sequence is N = HW/P^2. Our modified patch embedding is called Sequential Overlapping Patch Embedding (SOPE), which contains several successive 3×3 convolution layers with stride s = 2, Batch Normalization and GELU [69] activation. The relation between the number of convolution layers k and the patch size is P = 2^k. SOPE is able to eliminate the discontinuity brought by the vanilla patch embedding module, preserving important low-level features, and it provides position information to some extent. We also adopt two affine transformations before and after the series of convolution layers. This operation rescales and shifts the input feature, acting like normalization, and makes training more stable on small datasets. The whole process of SOPE is formulated as follows.
Aff(x) = Diag(α)x + β    (1)
G_i(x) = GELU(BN(Conv(x))),  i = 1, ..., k    (2)
SOPE(x) = Reshape(Aff(G_k(· · · (G_2(G_1(Aff(x)))))))    (3)
In Eq. 1, α and β are learnable parameters, initialized to 1 and 0 respectively. After the sequence of convolution layers, the feature maps are reshaped into patch tokens and concatenated with a class token. The token sequence is then fed into the encoder layers, each of which contains Layer Normalization [70], multi-head self-attention and a feed forward network. Here we modify the MHSA into Head-Interacted Multi-Head Self-Attention (HI-MHSA) and the feed forward network into Dynamic Aggregation Feed Forward (DAFF), which we introduce in the following sections. After the final encoder layer, the output class token is fed into the linear head for the final prediction.
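To make the SOPE pipeline concrete, the following PyTorch-style sketch implements Eqs. 1-3 as described above. It is a minimal illustration rather than the authors' implementation: the module names (`Affine`, `SOPE`) and the intermediate convolution channel widths are our own assumptions, since the text only fixes the kernel size, stride, normalization, activation and the relation P = 2^k.

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Channel-wise affine transform Aff(x) = Diag(alpha) x + beta (Eq. 1)."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, dim, 1, 1))   # initialized to 1
        self.beta = nn.Parameter(torch.zeros(1, dim, 1, 1))   # initialized to 0

    def forward(self, x):
        return self.alpha * x + self.beta

class SOPE(nn.Module):
    """Sequential Overlapping Patch Embedding: k stride-2 3x3 conv layers (P = 2^k),
    each followed by BatchNorm and GELU, wrapped by two affine transforms (Eqs. 1-3)."""
    def __init__(self, in_chans=3, embed_dim=192, patch_size=16):
        super().__init__()
        k = patch_size.bit_length() - 1                       # P = 2^k
        # intermediate channel widths are an assumption; the paper does not fix them here
        dims = [in_chans] + [embed_dim // 2 ** (k - 1 - i) for i in range(k)]
        self.pre_affine = Affine(in_chans)
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(dims[i], dims[i + 1], kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(dims[i + 1]),
                nn.GELU(),
            )
            for i in range(k)
        ])
        self.post_affine = Affine(embed_dim)

    def forward(self, x):                                     # x: (B, 3, H, W)
        x = self.post_affine(self.convs(self.pre_affine(x)))
        return x.flatten(2).transpose(1, 2)                   # (B, N, D) with N = HW / P^2
```

In DHVT, the resulting patch tokens are then concatenated with the class token before entering the encoder layers, as described above.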
3.2 Dynamic Aggregation Feed Forward
Figure 2: The structure of Dynamic Aggregation Feed Forward (DAFF).
The vanilla feed forward network (FFN) in ViT is formed by two fully-connected layers and GELU activation. All tokens, whether patch tokens or the class token, are processed by the FFN. Here we integrate depth-wise convolution [36] (DWCONV) into the FFN, resulting in a hybrid model. Such a hybrid model is similar to standard CNNs because it can be seen as using convolution for feature representation. With the inductive bias brought by depth-wise convolution, the model is forced to capture neighboring features, solving the problem on the spatial aspect. It greatly reduces the performance gap when training from scratch on small datasets, and converges faster than standard CNNs. However, such a structure still performs worse than stronger CNNs. More investigation is required to solve the problem on the channel aspect.
We propose two methods that make the whole model more dynamic and learn stronger feature representations under insufficient data. The first proposed module is Dynamic Aggregation Feed Forward (DAFF). We aggregate the features of the patch tokens into the class token in a channel attention way, similar to the Squeeze-Excitation operation in SENet [4], as shown in Fig. 2. The class token is split off before the projection layers. The patch tokens then go through a multi-layer perceptron with integrated depth-wise convolution and a shortcut inside. The output patch tokens are averaged into a weight vector W. After the squeeze-excitation operation, the output weight vector is multiplied with the class token channel-wise. The re-calibrated class token is then concatenated with the output patch tokens to restore the token sequence. We use X_c and X_p to denote the class token and the patch tokens respectively. The process can be formulated as:
W = Linear(GELU(Linear(Average(X_p))))    (4)
X_c = X_c ⊙ W    (5)
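A minimal PyTorch-style sketch of DAFF consistent with Fig. 2 and Eqs. 4-5 is given below. The hidden expansion ratio, the squeeze-excitation reduction factor and the exact placement of the inner shortcut are assumptions on our part; the sketch illustrates the described data flow, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DAFF(nn.Module):
    """Dynamic Aggregation Feed Forward: depth-wise conv MLP for patch tokens plus
    SE-style channel re-calibration of the class token (Eqs. 4-5)."""
    def __init__(self, dim, hidden_dim=None, se_ratio=4):
        super().__init__()
        hidden_dim = hidden_dim or 4 * dim                  # assumed expansion ratio
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1, groups=hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.act = nn.GELU()
        # squeeze-excitation branch producing the channel weight W for the class token
        self.se = nn.Sequential(nn.Linear(dim, dim // se_ratio), nn.GELU(),
                                nn.Linear(dim // se_ratio, dim))

    def forward(self, x):                  # x: (B, N+1, D); token 0 is the class token
        cls, patches = x[:, :1], x[:, 1:]  # class token skips the MLP entirely
        B, N, D = patches.shape
        side = int(N ** 0.5)               # assumes a square grid of patch tokens
        h = self.act(self.fc1(patches))    # (B, N, hidden)
        spatial = h.transpose(1, 2).reshape(B, -1, side, side)
        h = h + self.act(self.dwconv(spatial)).flatten(2).transpose(1, 2)  # DWCONV + inner shortcut
        patches = self.fc2(h)              # (B, N, D)
        w = self.se(patches.mean(dim=1))   # Eq. 4: W = Linear(GELU(Linear(Average(X_p))))
        cls = cls * w.unsqueeze(1)         # Eq. 5: channel-wise re-calibration of the class token
        return torch.cat([cls, patches], dim=1)   # restore the full token sequence
```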
3.3 Head Token
The second design to enhance feature representation is the "head token", which is a brand new mechanism as far as we know. There are two reasons why we introduce head tokens. First, in the original MHSA module, the attention heads do not interact with one another; each head attends only within itself when calculating attention. Second, channel groups in different heads are responsible for different feature representations, which is an inductive bias of ViTs. As we pointed out above, the lack of training data prevents models from learning strong representations, and under this circumstance the representation in each channel group is too weak for recognition. After introducing head tokens into the attention calculation, the channel group in each head is able to interact with those in other heads, and the different representations can be fused into an integral representation of the object. The representation learned from insufficient data may be poor in each channel, but their combination produces a strong enough representation. The structure of the vision transformer also accommodates this mechanism, because the length of the input token sequence is variable, except for hierarchical vision transformers with window attention such as [17, 35].
The process of generating head tokens is shown in Fig. 3(a). We denote the number of patch tokens as N, so the length of the input sequence is N + 1. According to the pre-defined number of heads h, each D-dimensional token, including the class token, is reshaped into h parts. Each part contains d channels, where D = d × h. We average all the separated tokens within their own parts, yielding h intermediate tokens, each of dimension d.