Towards Flexible Inductive Bias via
Progressive Reparameterization Scheduling
Yunsung Lee1∗, Gyuseong Lee2∗, Kwangrok Ryoo2∗,
Hyojun Go1∗, Jihye Park2∗, and Seungryong Kim2†
1Riiid AI Research  2Korea University
∗ indicates equal contribution.  † indicates corresponding author.
arXiv:2210.01370v1 [cs.CV] 4 Oct 2022
Abstract. There are two de facto standard architectures in recent computer vision: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The strong inductive biases of convolutions help the model learn sample-efficiently, but such strong biases also limit the upper bound of CNNs when sufficient data are available. By contrast, ViTs are inferior to CNNs on small data but superior when sufficient data are available. Recent approaches attempt to combine the strengths of these two architectures. However, we show that these approaches overlook that the optimal inductive bias also changes as the target data scale changes, by comparing various models' accuracy on subsets of ImageNet sampled at different ratios. In addition, through Fourier analysis of feature maps, which reveals how a model's response changes with signal frequency, we observe which inductive bias is advantageous at each data scale. The more convolution-like inductive bias a model contains, the smaller the data scale at which the ViT-like model overtakes ResNet. To obtain a model whose inductive bias is flexible with respect to the data scale, we show that reparameterization can interpolate the inductive bias between convolution and self-attention. By adjusting the number of epochs the model stays as convolution, we show that reparameterization from convolution to self-attention interpolates the Fourier analysis pattern between CNNs and ViTs. Building on these findings, we propose Progressive Reparameterization Scheduling (PRS), in which reparameterization adjusts the required amount of convolution-like or self-attention-like inductive bias per layer. For small-scale datasets, our PRS reparameterizes later layers from convolution to self-attention earlier, following a linear schedule. PRS outperforms previous studies on small-scale datasets such as CIFAR-100.
Keywords: Flexible Architecture, Vision Transformer, Convolution, Self-
attention, Inductive Bias
1 Introduction
Architecture advances have enhanced the performance of various tasks in computer vision by improving backbone networks [3,15,16,27,28].
Following the success of Transformers in natural language processing [2,10,31], Vision Transformers (ViTs) have shown that they can outperform Convolutional Neural Networks (CNNs), and their variants have led to further architectural advances [22,30,36]. ViTs lack inductive biases such as translation equivariance and locality compared to CNNs. Therefore, ViTs can outperform CNNs when trained with sufficient data, but they perform worse than CNNs when data are scarce.
To deal with this data-hungry problem, several works try to inject convolution-like inductive bias into ViTs. Straightforward approaches use convolutions to aid tokenization of the input image [14,32–34] or design modules [6,12,20,35] that improve ViTs with the inductive bias of CNNs. Other approaches use local attention mechanisms to introduce locality into ViTs [13,22]; these attend to neighboring elements and improve the local feature extraction ability of global attention. These approaches can design architectures that leverage the strengths of CNNs and ViTs and can alleviate the data-hungry problem at the data scale that each work targets.
However, we show that these approaches overlook that the optimal inductive bias also changes as the target data scale changes, by comparing various models' accuracy on subsets of ImageNet sampled at different ratios. When trained on an excessively tiny dataset, recent ViT variants still show lower accuracy than ResNet, whereas at the full ImageNet scale all ViT variants outperform ResNet. Inspired by Park et al. [24], we perform Fourier analysis on these models to further analyze the inductive biases in their architectures. We observe that ViTs injected with convolution-like inductive bias show frequency characteristics between those of ResNet and ViT. In this experiment, the more convolution-like inductive bias a model contains, the smaller the data scale at which the model overtakes ResNet. Specifically, the frequency characteristics of these models tend to act as a high-pass filter in the early layers and increasingly as a low-pass filter closer to the last layer.
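To make this analysis concrete, the following is a minimal sketch of the frequency measure, in the spirit of Park et al. [24]: the difference in log amplitude between the highest spatial frequency and frequency zero of a layer's feature map. The function name and normalization details here are our own simplification, not the exact implementation of prior work.

```python
import torch

def relative_log_amplitude(feature_map: torch.Tensor) -> float:
    """Difference between the log amplitude at the highest spatial frequency
    and at frequency zero of a feature map's 2D Fourier spectrum.
    Lower (more negative) values mean stronger attenuation of high
    frequencies, i.e. more low-pass behavior at that layer.

    feature_map: (B, C, H, W) activations from one layer.
    """
    # 2D FFT over the spatial dimensions; shift frequency zero to the center.
    spectrum = torch.fft.fftshift(torch.fft.fft2(feature_map.float()), dim=(-2, -1))
    log_amp = torch.log(spectrum.abs() + 1e-6).mean(dim=(0, 1))   # average over B and C
    h, w = log_amp.shape
    return (log_amp[0, 0] - log_amp[h // 2, w // 2]).item()       # highest freq minus freq 0
```

Plotting this quantity against layer depth gives the frequency-characteristic curves discussed above.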
Nevertheless, such fixed architectures in previous approaches have a fixed inductive bias between CNNs and ViTs, making it difficult to design an architecture that performs well across various data scales. Each time a new target dataset is given, the required optimal inductive bias changes, so the model's architectural design has to be renewed. For example, a CNN-like architecture should be used for small-scale datasets such as CIFAR [17], whereas a ViT-like architecture should be designed for large-scale datasets such as JFT [26]. Moreover, this design process requires multiple training runs to tune the model's inductive bias, which is time-consuming.
In this paper, we confirm the possibility of using the reparameterization technique [5,19] from convolution to self-attention to obtain a flexible inductive bias between convolution and self-attention during a single training trial. The reparameterization technique can change a learned convolution layer into a self-attention layer that operates identically to the learned convolution. Through Fourier analysis, we show that reparameterization can interpolate the inductive biases between convolution and self-attention by adjusting the moment of reparameterization during training. We observe that training longer with convolution than with self-attention gives the model frequency characteristics similar to a CNN's, and vice versa. This observation shows that adjusting the reparameterization schedule can interpolate between the inductive biases of CNNs and ViTs.
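To illustrate the idea behind such reparameterization (a sketch of the principle, not the exact construction of [5,19]), the code below expresses a learned K×K convolution as multi-head self-attention: each of the K×K heads attends, through a large relative-position bias, almost one-hot to one spatial offset, and the output projection reuses the convolution weights. All names are our own; boundary handling uses zero padding, and the softmax makes the equivalence approximate rather than exact.

```python
import torch
import torch.nn.functional as F

def conv_as_attention(x, conv_weight, conv_bias=None, alpha=100.0):
    """Approximate a KxK convolution with content-free multi-head self-attention.

    x:           (B, C_in, H, W) input feature map
    conv_weight: (C_out, C_in, K, K) learned convolution weights
    Head h attends (almost) one-hot to relative offset h via a large bias `alpha`;
    the output projection is the flattened convolution kernel.
    """
    B, C_in, H, W = x.shape
    C_out, _, K, _ = conv_weight.shape
    pad = K // 2

    # Zero-pad so every query pixel has all KxK neighbours available as keys/values.
    xp = F.pad(x, (pad, pad, pad, pad))
    Hp, Wp = H + 2 * pad, W + 2 * pad
    tokens = xp.flatten(2).transpose(1, 2)                       # (B, Np, C_in)

    # Grid coordinates of queries (original pixels), keys (padded grid), and head offsets.
    q_idx = torch.stack(torch.meshgrid(torch.arange(pad, pad + H),
                                       torch.arange(pad, pad + W),
                                       indexing="ij"), -1).reshape(-1, 2)
    k_idx = torch.stack(torch.meshgrid(torch.arange(Hp), torch.arange(Wp),
                                       indexing="ij"), -1).reshape(-1, 2)
    offsets = torch.stack(torch.meshgrid(torch.arange(K) - pad,
                                         torch.arange(K) - pad,
                                         indexing="ij"), -1).reshape(-1, 2)

    # Relative-position bias: head h puts weight `alpha` only on its designated offset.
    rel = k_idx[None, :, :] - q_idx[:, None, :]                  # (Nq, Np, 2)
    bias = alpha * (rel[None] == offsets[:, None, None]).all(-1).float()   # (K*K, Nq, Np)
    attn = torch.softmax(bias, dim=-1)                           # approximately one-hot

    heads = torch.einsum("hqk,bkc->bhqc", attn, tokens)          # (B, K*K, Nq, C_in)

    # Output projection reuses the convolution weights, so the heads recombine
    # exactly as a KxK convolution would (up to the softmax approximation).
    w = conv_weight.permute(2, 3, 1, 0).reshape(K * K * C_in, C_out)
    out = heads.permute(0, 2, 1, 3).reshape(B, H * W, K * K * C_in) @ w
    if conv_bias is not None:
        out = out + conv_bias
    return out.transpose(1, 2).reshape(B, C_out, H, W)

# Sanity check against the real convolution:
# x = torch.randn(1, 8, 14, 14); w = torch.randn(16, 8, 3, 3)
# print((conv_as_attention(x, w) - F.conv2d(x, w, padding=1)).abs().max())
```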
Table 1: Comparison of various architectures, DeiT [29], ResNet [16], ConViT [12], ResT [35], and Swin [22], in terms of hierarchical structure, relative positional encoding, local attention, and convolutional operation. ✓ means that the model has the corresponding characteristic and ✗ that it does not; △ indicates that ConViT's convolutional operation is given only in the initial training stage and then learned in the form of gated self-attention.
From these observations, we propose Progressive Reparameterization Scheduling (PRS). PRS sequentially reparameterizes layers from the last layer to the first layer. Layers closer to the last layer are trained longer with self-attention than with convolution, making them behave more like self-attention. With this schedule, the model obtains an inductive bias suitable for small-scale data. We validate the effectiveness of PRS with experiments on the CIFAR-100 dataset.
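As a concrete illustration, below is a minimal sketch of such a schedule. The exact formula and the `reparameterize_to_attention()` hook are assumptions for illustration, not the precise schedule used in our experiments; the only property encoded is that switch epochs decrease linearly from the first layer to the last, so layers near the output switch to self-attention earliest.

```python
def prs_switch_epochs(num_layers: int, total_epochs: int) -> list:
    """Epoch at which each layer (index 0 = first, num_layers - 1 = last) is
    reparameterized from convolution to self-attention. Later layers switch
    earlier; switch points are spaced linearly over training."""
    return [round(total_epochs * (num_layers - i) / (num_layers + 1))
            for i in range(num_layers)]

# Example training loop; `blocks` and `reparameterize_to_attention()` are
# hypothetical names for the hybrid layers and their conversion routine.
def train_with_prs(blocks, total_epochs, train_one_epoch):
    switch = prs_switch_epochs(len(blocks), total_epochs)
    for epoch in range(total_epochs):
        for i, block in enumerate(blocks):
            if epoch == switch[i]:        # convert this layer exactly once
                block.reparameterize_to_attention()
        train_one_epoch(epoch)
```

For example, with 12 layers and 100 epochs, the last layer switches around epoch 8 and the first layer around epoch 92, so later layers spend most of training as self-attention.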
Our contributions are summarized as follows:
– We observe that an architecture with more convolutional inductive bias in the early-stage layers is advantageous at small data scales, whereas a more self-attentional inductive bias is advantageous when the data scale is large.
– We show that adjusting how long a layer remains a convolution before reparameterization can interpolate the inductive bias between convolution and self-attention.
– Based on our observations of the conditions favorable on small-scale datasets, we propose Progressive Reparameterization Scheduling (PRS), which sequentially changes convolution into self-attention from the last layer to the first layer. PRS outperforms previous approaches on small-scale datasets such as CIFAR-100.
2 Related Work
2.1 Convolutional Neural Networks
CNNs, the most representative models in computer vision, have evolved over decades from LeNet [18] to ResNet [16], becoming faster and more accurate. CNNs can effectively capture low-level image features through their inductive biases of locality and translation invariance. However, CNNs are weak at capturing global information due to their limited receptive fields.