much lower. From a practical perspective, most datasets are even smaller than ImageNet-1K, and not every researcher can afford the cost of pre-training a model on a large dataset and then fine-tuning it on the target small dataset. Thus, an effective architecture for training ViTs from scratch on small datasets is needed.
Recent works [27, 28, 29] explore the reasons for the difference in data efficiency between ViTs and CNNs and attribute it to the lack of inductive bias. [27] points out that, without enough data, ViT does not learn to attend locally in its earlier layers. [28] finds that the stronger the inductive biases, the stronger the representations, that large datasets tend to help ViT learn strong representations, and that locality constraints improve the performance of ViT. Meanwhile, the recent work [29] demonstrates that convolutional constraints enable strongly sample-efficient training in the small-data regime. Because insufficient training data makes it hard for ViT to derive the inductive bias of attending locally, many recent works strive to introduce local inductive bias by integrating convolution into ViTs [18, 15, 30, 31, 32] and by adopting hierarchical structures [33, 34, 16, 17, 35], making ViTs more like traditional CNNs. Such hybrid structures show performance comparable to strong CNNs when trained from scratch on the medium-sized ImageNet-1K alone, but the performance gap on much smaller datasets remains.
Here, we argue that scarce training data weakens the inductive biases in ViTs. Two kinds of inductive bias need to be enhanced and better exploited to improve data efficiency: spatial relevance and diverse channel representation.
On the spatial aspect, tokens are relevant to one another and objects are locally compact. Important fine-grained low-level features need to be extracted from each token and its neighbors in the earlier layers. Rethinking the feature extraction framework of ViTs, the module responsible for feature representation is the multi-layer perceptron (MLP), whose receptive field covers only the token itself, so ViTs depend on the multi-head self-attention (MHSA) module to model and capture the relations between tokens. As pointed out in [27], with less training data, lower attention layers do not learn to attend locally; in other words, they do not focus on neighboring tokens or aggregate local information in the early stage. Capturing local features in lower layers facilitates the whole representation pipeline, since the deeper layers progressively turn low-level texture features into high-level semantic features for final recognition. Thus, ViTs show an extreme performance gap compared with CNNs when trained from scratch on small datasets.
On the channel aspect, feature representations exhibit diversity across channels. ViT has its own inductive bias that different channel groups encode different feature representations of the object, and the whole token vector forms the representation of the object. As pointed out in [28], large datasets tend to help ViT learn strong representations; insufficient data does not allow ViTs to learn sufficiently strong representations, so the overall representation is too weak for accurate classification.
In this paper, we close the performance gap between CNNs and ViTs when training from scratch on small datasets and provide a hybrid architecture, the Dynamic Hybrid Vision Transformer (DHVT), as a substitute. We first introduce a hybrid model to address the spatial aspect. The proposed model integrates a sequence of convolution layers in the patch embedding stage to eliminate the non-overlapping problem and preserve fine-grained low-level features, and it adopts depth-wise convolution [36] in the MLP for local feature extraction.
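As a rough illustration of such a convolutional patch-embedding stem (the layer widths, kernel sizes, and normalization below are our own assumptions for the sketch, not the exact DHVT configuration), overlapping strided convolutions can replace the usual non-overlapping patch projection:

```python
import torch
import torch.nn as nn

class ConvPatchEmbed(nn.Module):
    """Illustrative convolutional stem: overlapping 3x3 convolutions replace the
    non-overlapping patch projection, preserving fine-grained low-level detail."""
    def __init__(self, in_chans=3, embed_dim=192):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
            nn.GELU(),
        )

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.stem(x)                       # (B, embed_dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)    # (B, N, embed_dim) patch tokens

# A 32x32 CIFAR image becomes an 8x8 grid of overlapping-patch tokens.
tokens = ConvPatchEmbed()(torch.randn(1, 3, 32, 32))
print(tokens.shape)                            # torch.Size([1, 64, 192])
```

Because the 3x3 convolutions overlap, neighboring patches share fine-grained low-level information before the tokens enter the transformer blocks.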
In addition, we design two modules that strengthen the feature representation to address the channel aspect. Specifically, in the MLP, depth-wise convolution is applied to the patch tokens, while the class token is passed through unchanged, without any computation. We then leverage the output patch tokens to produce channel weights, in the style of Squeeze-and-Excitation (SE) [4], for the class token. This operation re-calibrates each channel of the class token and reinforces its feature representation.
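A minimal sketch of this channel re-calibration, assuming our own layer sizes and SE reduction ratio (the actual DHVT module may organize these operations differently): the patch tokens pass through a depth-wise-convolution-enhanced MLP, and SE-style weights pooled from the resulting patch tokens re-scale the class token channel by channel.

```python
import torch
import torch.nn as nn

class DWConvMLP(nn.Module):
    """Illustrative MLP with depth-wise convolution on patch tokens and
    SE-style channel re-calibration of the class token (dimensions assumed)."""
    def __init__(self, dim=192, hidden=768, se_ratio=4):
        super().__init__()
        self.fc1, self.fc2, self.act = nn.Linear(dim, hidden), nn.Linear(hidden, dim), nn.GELU()
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # depth-wise
        self.se = nn.Sequential(                       # squeeze-excitation for the class token
            nn.Linear(dim, dim // se_ratio), nn.GELU(),
            nn.Linear(dim // se_ratio, dim), nn.Sigmoid())

    def forward(self, x):                              # x: (B, 1 + H*W, C), token 0 is [CLS]
        cls_tok, patches = x[:, :1], x[:, 1:]
        B, N, _ = patches.shape
        H = W = int(N ** 0.5)
        h = self.act(self.fc1(patches))                # (B, N, hidden)
        h = h.transpose(1, 2).reshape(B, -1, H, W)     # to feature map for local mixing
        h = self.dwconv(h).flatten(2).transpose(1, 2)  # depth-wise conv, back to tokens
        patches = self.fc2(self.act(h))                # (B, N, C)
        weights = self.se(patches.mean(dim=1))         # pool patch tokens -> channel weights
        cls_tok = cls_tok * weights.unsqueeze(1)       # re-calibrate class token channels
        return torch.cat([cls_tok, patches], dim=1)

out = DWConvMLP()(torch.randn(2, 65, 192))             # 1 class token + 8x8 patch tokens
print(out.shape)                                       # torch.Size([2, 65, 192])
```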
Moreover, to enhance the interaction among the semantic representations of different channel groups, and exploiting the variable token-sequence length of the vision transformer structure, we devise a new token mechanism called the "head token". The number of head tokens equals the number of attention heads in MHSA. Head tokens are generated by segmenting the input tokens along the channel dimension and projecting each group; they are then concatenated with all the other tokens and passed through MHSA, so that each channel group in the corresponding attention head can interact with the others. Even though the representation in each channel or channel group may be weak for classification due to insufficient training data, the head tokens re-calibrate each learned feature pattern and enable a stronger integral representation of the object, which benefits final recognition.
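The head-token idea can be sketched roughly as follows; the pooling, the per-group projections, and the choice to simply discard the head tokens after attention are illustrative assumptions, not the paper's exact design. Each per-head channel group of the input tokens is summarized into one extra token, and the extended sequence goes through standard MHSA so that channel groups can exchange information.

```python
import torch
import torch.nn as nn

class HeadTokenAttention(nn.Module):
    """Illustrative 'head token' mechanism: one extra token per attention head,
    formed from the corresponding channel group of the input tokens, is appended
    to the sequence before standard MHSA (generation and fusion details assumed)."""
    def __init__(self, dim=192, num_heads=3):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        # one small projection per channel group -> one head token of full width
        self.head_proj = nn.ModuleList(nn.Linear(self.head_dim, dim) for _ in range(num_heads))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                        # x: (B, N, C)
        B, N, C = x.shape
        groups = x.reshape(B, N, self.num_heads, self.head_dim)  # split channels per head
        head_tokens = torch.stack(
            [proj(groups[:, :, i].mean(dim=1))                   # pool each channel group
             for i, proj in enumerate(self.head_proj)], dim=1)   # (B, num_heads, C)
        seq = torch.cat([x, head_tokens], dim=1)                 # append head tokens
        out, _ = self.attn(seq, seq, seq)                        # standard MHSA over all tokens
        return out[:, :N]                                        # drop head tokens afterwards

y = HeadTokenAttention()(torch.randn(2, 65, 192))
print(y.shape)                                                   # torch.Size([2, 65, 192])
```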
We conduct experiments training from scratch on various small datasets, namely the common dataset CIFAR-100 and the small domain datasets Clipart, Painting, and Sketch from DomainNet [37], to examine the performance of our model. On CIFAR-100, our proposed models show a significant performance