much lower. From a practical perspective, most datasets are even smaller than ImageNet-1K, and not every researcher can afford the cost of pre-training a model on a large dataset and then fine-tuning it on the target small dataset. Thus, an effective architecture for training ViTs from scratch on small datasets is needed.
Recent works [27, 28, 29] explore the reasons for the difference in data efficiency between ViTs and CNNs and attribute it to the lack of inductive bias. [27] points out that, without enough data, ViT does not learn to attend locally in its earlier layers. [28] finds that the stronger the inductive biases, the stronger the representations, that large datasets tend to help ViT learn strong representations, and that locality constraints improve the performance of ViT. Meanwhile, the recent work [29] demonstrates that convolutional constraints enable strongly sample-efficient training in the small-data regime. Because insufficient training data makes it hard for ViT to derive the inductive bias of attending locally, many recent works strive to introduce local inductive bias by integrating convolution into ViTs [18, 15, 30, 31, 32] and by adopting hierarchical structures [33, 34, 16, 17, 35], making ViTs more like traditional CNNs. Such hybrid structures show performance comparable to strong CNNs when trained from scratch on the medium-sized ImageNet-1K alone, but the performance gap on much smaller datasets remains.
Here, we argue that scarce training data weakens the inductive biases in ViTs. Two kinds of inductive bias need to be enhanced and better exploited to improve data efficiency: spatial relevance and diverse channel representation.
On the spatial aspect, tokens are relevant to one another and objects are locally compact. Important fine-grained low-level features need to be extracted from each token and its neighbors in the earlier layers. Rethinking the feature extraction framework of ViTs, the module responsible for feature representation is the multi-layer perceptron (MLP), whose receptive field covers only the token itself, so ViTs depend on the multi-head self-attention (MHSA) module to model and capture the relations between tokens. As pointed out in [27], with less training data, lower attention layers do not learn to attend locally; in other words, they do not focus on neighboring tokens or aggregate local information in the early stage. Capturing local features in lower layers facilitates the whole representation pipeline, since the deeper layers progressively turn low-level texture features into high-level semantic features for final recognition. Thus, ViTs show an extreme performance gap compared with CNNs when trained from scratch on small datasets.
On the channel aspect, feature representations exhibit diversity across channels. ViT has its own inductive bias that different channel groups encode different feature representations of the object, and the whole token vector forms the representation of the object. As pointed out in [28], large datasets tend to help ViT learn strong representations; insufficient data does not allow ViTs to learn sufficiently strong representations, so the overall representation is too weak for accurate classification.
In this paper, we close the performance gap between CNNs and ViTs when training from scratch on small datasets and provide a hybrid architecture, the Dynamic Hybrid Vision Transformer (DHVT), as a substitute. We first introduce a hybrid model to address the spatial aspect. The proposed model integrates a sequence of convolution layers in the patch embedding stage to eliminate the non-overlapping problem and preserve fine-grained low-level features, and it adopts depth-wise convolution [36] in the MLP for local feature extraction.
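As a rough illustration of such a convolutional patch-embedding stem (the layer widths, kernel sizes, and normalization below are our own assumptions for the sketch, not the exact DHVT configuration), overlapping strided convolutions can replace the usual non-overlapping patch projection:

```python
import torch
import torch.nn as nn

class ConvPatchEmbed(nn.Module):
    """Illustrative convolutional stem: overlapping 3x3 convolutions replace the
    non-overlapping patch projection, preserving fine-grained low-level detail."""
    def __init__(self, in_chans=3, embed_dim=192):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
            nn.GELU(),
        )

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.stem(x)                       # (B, embed_dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)    # (B, N, embed_dim) patch tokens

# A 32x32 CIFAR image becomes an 8x8 grid of overlapping-patch tokens.
tokens = ConvPatchEmbed()(torch.randn(1, 3, 32, 32))
print(tokens.shape)                            # torch.Size([1, 64, 192])
```

Because the 3x3 convolutions overlap, neighboring patches share fine-grained low-level information before the tokens enter the transformer blocks.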
In addition, we design two modules that strengthen the feature representation to address the channel aspect. Specifically, in the MLP, depth-wise convolution is applied to the patch tokens, while the class token is passed through unchanged, without any computation. We then leverage the output patch tokens to produce channel weights, in the style of Squeeze-and-Excitation (SE) [4], for the class token. This operation re-calibrates each channel of the class token and reinforces its feature representation.
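A minimal sketch of this channel re-calibration, assuming our own layer sizes and SE reduction ratio (the actual DHVT module may organize these operations differently): the patch tokens pass through a depth-wise-convolution-enhanced MLP, and SE-style weights pooled from the resulting patch tokens re-scale the class token channel by channel.

```python
import torch
import torch.nn as nn

class DWConvMLP(nn.Module):
    """Illustrative MLP with depth-wise convolution on patch tokens and
    SE-style channel re-calibration of the class token (dimensions assumed)."""
    def __init__(self, dim=192, hidden=768, se_ratio=4):
        super().__init__()
        self.fc1, self.fc2, self.act = nn.Linear(dim, hidden), nn.Linear(hidden, dim), nn.GELU()
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # depth-wise
        self.se = nn.Sequential(                       # squeeze-excitation for the class token
            nn.Linear(dim, dim // se_ratio), nn.GELU(),
            nn.Linear(dim // se_ratio, dim), nn.Sigmoid())

    def forward(self, x):                              # x: (B, 1 + H*W, C), token 0 is [CLS]
        cls_tok, patches = x[:, :1], x[:, 1:]
        B, N, _ = patches.shape
        H = W = int(N ** 0.5)
        h = self.act(self.fc1(patches))                # (B, N, hidden)
        h = h.transpose(1, 2).reshape(B, -1, H, W)     # to feature map for local mixing
        h = self.dwconv(h).flatten(2).transpose(1, 2)  # depth-wise conv, back to tokens
        patches = self.fc2(self.act(h))                # (B, N, C)
        weights = self.se(patches.mean(dim=1))         # pool patch tokens -> channel weights
        cls_tok = cls_tok * weights.unsqueeze(1)       # re-calibrate class token channels
        return torch.cat([cls_tok, patches], dim=1)

out = DWConvMLP()(torch.randn(2, 65, 192))             # 1 class token + 8x8 patch tokens
print(out.shape)                                       # torch.Size([2, 65, 192])
```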
Moreover, to enhance the interaction among the semantic representations of different channel groups, and exploiting the variable token-sequence length of the vision transformer structure, we devise a new token mechanism called the "head token". The number of head tokens equals the number of attention heads in MHSA. Head tokens are generated by segmenting the input tokens along the channel dimension and projecting each group; they are then concatenated with all the other tokens and passed through MHSA, so that each channel group in the corresponding attention head can interact with the others. Even though the representation in each channel or channel group may be weak for classification due to insufficient training data, the head tokens re-calibrate each learned feature pattern and enable a stronger integral representation of the object, which benefits final recognition.
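The head-token idea can be sketched roughly as follows; the pooling, the per-group projections, and the choice to simply discard the head tokens after attention are illustrative assumptions, not the paper's exact design. Each per-head channel group of the input tokens is summarized into one extra token, and the extended sequence goes through standard MHSA so that channel groups can exchange information.

```python
import torch
import torch.nn as nn

class HeadTokenAttention(nn.Module):
    """Illustrative 'head token' mechanism: one extra token per attention head,
    formed from the corresponding channel group of the input tokens, is appended
    to the sequence before standard MHSA (generation and fusion details assumed)."""
    def __init__(self, dim=192, num_heads=3):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        # one small projection per channel group -> one head token of full width
        self.head_proj = nn.ModuleList(nn.Linear(self.head_dim, dim) for _ in range(num_heads))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                        # x: (B, N, C)
        B, N, C = x.shape
        groups = x.reshape(B, N, self.num_heads, self.head_dim)  # split channels per head
        head_tokens = torch.stack(
            [proj(groups[:, :, i].mean(dim=1))                   # pool each channel group
             for i, proj in enumerate(self.head_proj)], dim=1)   # (B, num_heads, C)
        seq = torch.cat([x, head_tokens], dim=1)                 # append head tokens
        out, _ = self.attn(seq, seq, seq)                        # standard MHSA over all tokens
        return out[:, :N]                                        # drop head tokens afterwards

y = HeadTokenAttention()(torch.randn(2, 65, 192))
print(y.shape)                                                   # torch.Size([2, 65, 192])
```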
We conduct experiments training from scratch on various small datasets, namely the common dataset CIFAR-100 and the small domain datasets Clipart, Painting, and Sketch from DomainNet [37], to examine the performance of our model. On CIFAR-100, our proposed models show a significant performance