Related work
CNNs and ViTs.
Many computer vision architectures can be viewed as hybrids of Transformers and CNNs. For example, DETR [Carion et al., 2020] uses a CNN to generate
features that are fed to a Transformer. [d’Ascoli et al., 2021] show that self-attention can be initialized
or regularized to behave like a convolution, and [Dai et al., 2021, Guo et al., 2021] add convolution
operations to Transformers. Conversely, [Bello et al., 2019, Ramachandran et al., 2019, Bello, 2021]
introduce self-attention or attention-like operations to supplement or replace convolutions in ResNet-
like models. In contrast, our paper does not consider any form of hybridization with CNNs, but rather
a simplification of the original ViT, in order to explain how ViTs learn spatially structured patterns using gradient descent (GD).
Empirical understanding of ViTs.
A long line of work analyzes the properties of
ViTs, such as robustness [Bhojanapalli et al., 2021, Paul and Chen, 2021, Naseer et al., 2021] or the
effect of self-supervision [Caron et al., 2021, Chen et al., 2021b]. Closer to our work, some papers
investigate why ViTs perform so well. Raghu et al. [2021] compare the representations of ViTs
and CNNs, and Melas-Kyriazi [2021], Trockman and Kolter [2022] argue that the patch embeddings
could explain the performance of ViTs. We empirically show in Section 6 that applying the attention
matrices to the positional encodings – which contain the structure of the dataset – approximately
recovers the baseline performance. Hence, our work rather suggests that the structural learning performed by the
attention matrices may explain the success of ViTs.
Theory for attention models.
Early theoretical works focused on the expressivity of attention.
[Vuckovic et al., 2020, Edelman et al., 2021] addressed this question in the context of self-attention
blocks, and [Dehghani et al., 2018, Wei et al., 2021, Hron et al., 2020] in the context of Transformers. On the
optimization side, [Zhang et al., 2020] investigate the role of adaptive methods in attention models, and
[Snell et al., 2021] analyze the dynamics of a single attention head to approximate the learning
of a Seq2Seq architecture. In our work, we also consider a single-head ViT trained with gradient
descent and exhibit a setting where it provably learns convolution-like patterns and generalizes.
Algorithmic regularization.
The question we address concerns algorithmic regularization, which
characterizes the generalization of an optimization algorithm when multiple global solutions exist
in over-parametrized models. This regularization arises in deep learning mainly because of the non-
convexity of the objective function: non-convexity potentially creates multiple global minima,
scattered across parameter space, that differ vastly in terms of generalization. Algorithmic regularization appears
in binary classification [Soudry et al., 2018, Lyu and Li, 2019, Chizat and Bach, 2020], matrix
factorization [Gunasekar et al., 2018, Arora et al., 2019], convolutional neural networks [Gunasekar
et al., 2018, Jagadeesan et al., 2022], generative adversarial networks [Allen-Zhu and Li, 2021],
contrastive learning [Wen and Li, 2021] and mixtures of experts [Chen et al., 2022]. Algorithmic
regularization is induced by and depends on many factors such as learning rate and batch size [Goyal
et al., 2017, Hoffer et al., 2017, Keskar et al., 2016, Smith et al., 2018, Li et al., 2019], initialization
[Allen-Zhu and Li, 2020], momentum [Jelassi and Li, 2022], adaptive step-size [Kingma and Ba,
2014, Neyshabur et al., 2015, Daniely, 2017, Wilson et al., 2017, Zou et al., 2021, Jelassi et al., 2022],
batch normalization [Arora et al., 2018, Hoffer et al., 2019, Ioffe and Szegedy, 2015] and dropout
[Srivastava et al., 2014, Wei et al., 2020]. However, all these works consider feed-forward
neural networks and therefore do not apply to ViTs.
2 Defining patch association
The goal of this section is to formalize the way ViTs learn sparse spatial connectivity patterns. We
thus introduce the concept of performing patch association for a spatially structured dataset.
Definition 2.1 (Data distribution with spatial structure). Let $\mathcal{D}$ be a distribution over $\mathbb{R}^{d \times D} \times \{-1,1\}$, where each input $X = (X_1, \dots, X_D) \in \mathbb{R}^{d \times D}$ consists of $D$ patches and has label $y \in \{-1,1\}$. We say that $\mathcal{D}$ is spatially structured if
– there exists a partition of $[D]$ into $L$ disjoint subsets, i.e. $[D] = \bigcup_{\ell=1}^{L} S_\ell$ with $S_\ell \subsetneq [D]$ and $|S_\ell| = C$.
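To make the partition condition concrete, the following minimal sketch (not from the paper; the values $D = 64$, $L = 16$ and $C = 4$ are hypothetical) groups the patches of an 8x8 grid into disjoint 2x2 neighborhoods and checks that the resulting sets satisfy the requirements of Definition 2.1.

```python
# Hypothetical instance of the partition condition in Definition 2.1:
# D = 64 patches on an 8x8 grid, split into L = 16 disjoint groups of C = 4.

GRID = 8    # patches per side, so D = GRID * GRID = 64
BLOCK = 2   # side length of each spatial group, so C = BLOCK * BLOCK = 4

def spatial_partition(grid: int = GRID, block: int = BLOCK) -> list[list[int]]:
    """Return S_1, ..., S_L as lists of patch indices covering {0, ..., D-1}."""
    groups = []
    for r in range(0, grid, block):
        for c in range(0, grid, block):
            groups.append([(r + i) * grid + (c + j)
                           for i in range(block) for j in range(block)])
    return groups

partition = spatial_partition()
D = GRID * GRID
# Sanity checks matching the definition: the S_l are disjoint, cover [D],
# and all have the same cardinality C.
assert sorted(idx for group in partition for idx in group) == list(range(D))
assert all(len(group) == BLOCK * BLOCK for group in partition)
print(f"L = {len(partition)} groups of C = {len(partition[0])} patches each")
```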