Towards Flexible Inductive Bias via
Progressive Reparameterization Scheduling
Yunsung Lee1∗, Gyuseong Lee2∗, Kwangrok Ryoo2∗,
Hyojun Go1∗, Jihye Park2∗, and Seungryong Kim2†
1Riiid AI Research  2Korea University
∗ indicates equal contribution.  † indicates corresponding author.
arXiv:2210.01370v1 [cs.CV] 4 Oct 2022
Abstract. There are two de facto standard architectures in recent computer vision: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The strong inductive biases of convolutions help the model learn sample-efficiently, but such strong biases also limit the upper bound of CNNs when sufficient data are available. By contrast, ViTs are inferior to CNNs on small data but superior when sufficient data are available. Recent approaches attempt to combine the strengths of these two architectures. However, we show that these approaches overlook that the optimal inductive bias also changes as the target data scale changes, by comparing various models' accuracy on subsets of ImageNet sampled at different ratios. In addition, through Fourier analysis of feature maps, which reveals how a model's response changes with signal frequency, we observe which inductive bias is advantageous at each data scale. The more convolution-like inductive bias a model contains, the smaller the data scale at which the ViT-like model overtakes ResNet. To obtain a model whose inductive bias is flexible with respect to the data scale, we show that reparameterization can interpolate the inductive bias between convolution and self-attention. By adjusting the number of epochs the model stays as convolution, we show that reparameterization from convolution to self-attention interpolates the Fourier analysis pattern between CNNs and ViTs. Building on these findings, we propose Progressive Reparameterization Scheduling (PRS), in which reparameterization adjusts the required amount of convolution-like or self-attention-like inductive bias per layer. For small-scale datasets, our PRS reparameterizes later layers from convolution to self-attention earlier, following a linear schedule. PRS outperforms previous studies on small-scale datasets such as CIFAR-100.
Keywords: Flexible Architecture, Vision Transformer, Convolution, Self-
attention, Inductive Bias
1 Introduction
Architecture advances have enhanced the performance of various tasks in computer vision by improving backbone networks [3,15,16,27,28].
Following the success of Transformers in natural language processing [2,10,31], Vision Transformers (ViTs) have shown that they can outperform Convolutional Neural Networks (CNNs), and their variants have led to further architectural advances [22,30,36]. ViTs lack inductive biases such as translation equivariance and locality compared to CNNs. Therefore, ViTs can outperform CNNs when trained with sufficient data, but they perform worse than CNNs when data are scarce.
To deal with this data-hungry problem, several works try to inject convolution-like inductive bias into ViTs. Straightforward approaches use convolutions to aid tokenization of the input image [14,32–34] or design modules [6,12,20,35] that improve ViTs with the inductive bias of CNNs. Other approaches use local attention mechanisms to introduce locality into ViTs [13,22]; these attend to neighboring elements and improve the local feature extraction ability of global attention. These approaches can design architectures that leverage the strengths of CNNs and ViTs and can alleviate the data-hungry problem at the data scale that each work targets.
However, we show that these approaches overlook that the optimal inductive bias also changes as the target data scale changes, by comparing various models' accuracy on subsets of ImageNet sampled at different ratios. When trained on an excessively tiny dataset, recent ViT variants still show lower accuracy than ResNet, whereas at the full ImageNet scale all ViT variants outperform ResNet. Inspired by Park et al. [24], we perform Fourier analysis on these models to further analyze the inductive biases in their architectures. We observe that ViTs injected with convolution-like inductive bias show frequency characteristics between those of ResNet and ViT. In this experiment, the more convolution-like inductive bias a model contains, the smaller the data scale at which the model overtakes ResNet. Specifically, the frequency characteristics of these models tend to act as a high-pass filter in the early layers and increasingly as a low-pass filter closer to the last layer.
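To make this analysis concrete, the following is a minimal sketch of the frequency measure, in the spirit of Park et al. [24]: the difference in log amplitude between the highest spatial frequency and frequency zero of a layer's feature map. The function name and normalization details here are our own simplification, not the exact implementation of prior work.

```python
import torch

def relative_log_amplitude(feature_map: torch.Tensor) -> float:
    """Difference between the log amplitude at the highest spatial frequency
    and at frequency zero of a feature map's 2D Fourier spectrum.
    Lower (more negative) values mean stronger attenuation of high
    frequencies, i.e. more low-pass behavior at that layer.

    feature_map: (B, C, H, W) activations from one layer.
    """
    # 2D FFT over the spatial dimensions; shift frequency zero to the center.
    spectrum = torch.fft.fftshift(torch.fft.fft2(feature_map.float()), dim=(-2, -1))
    log_amp = torch.log(spectrum.abs() + 1e-6).mean(dim=(0, 1))   # average over B and C
    h, w = log_amp.shape
    return (log_amp[0, 0] - log_amp[h // 2, w // 2]).item()       # highest freq minus freq 0
```

Plotting this quantity against layer depth gives the frequency-characteristic curves discussed above.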
Nevertheless, such fixed architectures in previous approaches have a fixed inductive bias between CNNs and ViTs, making it difficult to design an architecture that performs well across various data scales. Each time a new target dataset is given, the required optimal inductive bias changes, so the model's architectural design has to be renewed. For example, a CNN-like architecture should be used for small-scale datasets such as CIFAR [17], whereas a ViT-like architecture should be designed for large-scale datasets such as JFT [26]. Moreover, this design process requires multiple training runs to tune the model's inductive bias, which is time-consuming.
In this paper, we confirm the possibility of using the reparameterization technique [5,19] from convolution to self-attention to obtain a flexible inductive bias between convolution and self-attention during a single training trial. The reparameterization technique can change a learned convolution layer into a self-attention layer that operates identically to the learned convolution. Through Fourier analysis, we show that reparameterization can interpolate the inductive biases between convolution and self-attention by adjusting the moment of reparameterization during training. We observe that training longer with convolution than with self-attention gives the model frequency characteristics similar to a CNN's, and vice versa. This observation shows that adjusting the reparameterization schedule can interpolate between the inductive biases of CNNs and ViTs.
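To illustrate the idea behind such reparameterization (a sketch of the principle, not the exact construction of [5,19]), the code below expresses a learned K×K convolution as multi-head self-attention: each of the K×K heads attends, through a large relative-position bias, almost one-hot to one spatial offset, and the output projection reuses the convolution weights. All names are our own; boundary handling uses zero padding, and the softmax makes the equivalence approximate rather than exact.

```python
import torch
import torch.nn.functional as F

def conv_as_attention(x, conv_weight, conv_bias=None, alpha=100.0):
    """Approximate a KxK convolution with content-free multi-head self-attention.

    x:           (B, C_in, H, W) input feature map
    conv_weight: (C_out, C_in, K, K) learned convolution weights
    Head h attends (almost) one-hot to relative offset h via a large bias `alpha`;
    the output projection is the flattened convolution kernel.
    """
    B, C_in, H, W = x.shape
    C_out, _, K, _ = conv_weight.shape
    pad = K // 2

    # Zero-pad so every query pixel has all KxK neighbours available as keys/values.
    xp = F.pad(x, (pad, pad, pad, pad))
    Hp, Wp = H + 2 * pad, W + 2 * pad
    tokens = xp.flatten(2).transpose(1, 2)                       # (B, Np, C_in)

    # Grid coordinates of queries (original pixels), keys (padded grid), and head offsets.
    q_idx = torch.stack(torch.meshgrid(torch.arange(pad, pad + H),
                                       torch.arange(pad, pad + W),
                                       indexing="ij"), -1).reshape(-1, 2)
    k_idx = torch.stack(torch.meshgrid(torch.arange(Hp), torch.arange(Wp),
                                       indexing="ij"), -1).reshape(-1, 2)
    offsets = torch.stack(torch.meshgrid(torch.arange(K) - pad,
                                         torch.arange(K) - pad,
                                         indexing="ij"), -1).reshape(-1, 2)

    # Relative-position bias: head h puts weight `alpha` only on its designated offset.
    rel = k_idx[None, :, :] - q_idx[:, None, :]                  # (Nq, Np, 2)
    bias = alpha * (rel[None] == offsets[:, None, None]).all(-1).float()   # (K*K, Nq, Np)
    attn = torch.softmax(bias, dim=-1)                           # approximately one-hot

    heads = torch.einsum("hqk,bkc->bhqc", attn, tokens)          # (B, K*K, Nq, C_in)

    # Output projection reuses the convolution weights, so the heads recombine
    # exactly as a KxK convolution would (up to the softmax approximation).
    w = conv_weight.permute(2, 3, 1, 0).reshape(K * K * C_in, C_out)
    out = heads.permute(0, 2, 1, 3).reshape(B, H * W, K * K * C_in) @ w
    if conv_bias is not None:
        out = out + conv_bias
    return out.transpose(1, 2).reshape(B, C_out, H, W)

# Sanity check against the real convolution:
# x = torch.randn(1, 8, 14, 14); w = torch.randn(16, 8, 3, 3)
# print((conv_as_attention(x, w) - F.conv2d(x, w, padding=1)).abs().max())
```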
Table 1: Comparison of various architectures, DeiT [29], ResNet [16], ConViT [12], ResT [35], and Swin [22], in terms of hierarchical structure, relative positional encoding, local attention, and convolutional operation. ✓ means that the model has the corresponding characteristic and ✗ that it does not; △ indicates that ConViT's convolutional operation is given only in the initial training stage and then learned in the form of gated self-attention.
From these observations, we propose Progressive Reparameterization Scheduling (PRS). PRS sequentially reparameterizes layers from the last layer to the first layer. Layers closer to the last layer are trained longer with self-attention than with convolution, making them behave more like self-attention. With this schedule, the model obtains an inductive bias suitable for small-scale data. We validate the effectiveness of PRS with experiments on the CIFAR-100 dataset.
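As a concrete illustration, below is a minimal sketch of such a schedule. The exact formula and the `reparameterize_to_attention()` hook are assumptions for illustration, not the precise schedule used in our experiments; the only property encoded is that switch epochs decrease linearly from the first layer to the last, so layers near the output switch to self-attention earliest.

```python
def prs_switch_epochs(num_layers: int, total_epochs: int) -> list:
    """Epoch at which each layer (index 0 = first, num_layers - 1 = last) is
    reparameterized from convolution to self-attention. Later layers switch
    earlier; switch points are spaced linearly over training."""
    return [round(total_epochs * (num_layers - i) / (num_layers + 1))
            for i in range(num_layers)]

# Example training loop; `blocks` and `reparameterize_to_attention()` are
# hypothetical names for the hybrid layers and their conversion routine.
def train_with_prs(blocks, total_epochs, train_one_epoch):
    switch = prs_switch_epochs(len(blocks), total_epochs)
    for epoch in range(total_epochs):
        for i, block in enumerate(blocks):
            if epoch == switch[i]:        # convert this layer exactly once
                block.reparameterize_to_attention()
        train_one_epoch(epoch)
```

For example, with 12 layers and 100 epochs, the last layer switches around epoch 8 and the first layer around epoch 92, so later layers spend most of training as self-attention.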
Our contributions are summarized as follows:
– We observe that an architecture with more convolutional inductive bias in the early-stage layers is advantageous at small data scales, whereas a more self-attentional inductive bias is advantageous when the data scale is large.
– We show that adjusting how long a layer remains a convolution before reparameterization can interpolate the inductive bias between convolution and self-attention.
– Based on our observations of the conditions favorable on small-scale datasets, we propose Progressive Reparameterization Scheduling (PRS), which sequentially changes convolution into self-attention from the last layer to the first layer. PRS outperforms previous approaches on small-scale datasets such as CIFAR-100.
2 Related Work
2.1 Convolutional Neural Networks
CNNs, the most representative models in computer vision, have evolved over decades from LeNet [18] to ResNet [16], becoming faster and more accurate. CNNs can effectively capture low-level image features through their inductive biases of locality and translation invariance. However, CNNs are weak at capturing global information due to their limited receptive fields.