
Sequence and Circle: Exploring the Relationship
Between Patches
Zhengyang Yu
Frankfurt Institute for Advanced Studies
Xidian FIAS Joint Research Center
Frankfurt am Main, Germany
zhyu@fias.uni-frankfurt.de
Jochen Triesch
Frankfurt Institute for Advanced Studies
Frankfurt am Main, Germany
triesch@fias.uni-frankfurt.de
Abstract
The vision transformer (ViT) has achieved state-of-the-art results in various vision
tasks. It utilizes a learnable position embedding (PE) mechanism to encode the
location of each image patch. However, it is presently unclear if this learnable PE
is really necessary and what its benefits are. This paper explores two alternative
ways of encoding the location of individual patches that exploit prior knowledge
about their spatial arrangement. One is called the sequence relationship embedding
(SRE), and the other is called the circle relationship embedding (CRE). The
SRE treats all patches as an ordered sequence in which adjacent patches are
separated by the same unit distance. The CRE treats the central patch as the
center of a circle and measures the distance of every remaining patch from this
center according to the four-neighborhood principle, so that multiple concentric
circles with different radii group different patches together. Finally, we
implement these two relations in three classic ViTs and test them on four
popular datasets. Experiments show that SRE and CRE can replace PE, reducing
the number of random learnable parameters while achieving the same performance,
and that combining SRE or CRE with PE yields better performance than using PE
alone.
1 Introduction
The Vision Transformer (ViT) combines the self-attention mechanism with a patch-based
representation of the image to achieve global modeling capabilities at greatly reduced
computational cost (Dosovitskiy et al., 2020; Guo et al., 2022; Han et al., 2022). It retains
the advantages of the transformer from NLP (Vaswani et al., 2017; Devlin et al., 2018; Wang
et al., 2018): it captures global relationships in the input, parallelizes computation, and
achieves performance comparable to or surpassing convolutional neural networks (CNNs) on
various computer vision tasks (He et al., 2016; Li et al., 2022), such as image classification
(Bi et al., 2021), image segmentation (Xu et al., 2022), and object detection (Islam, 2022).
Because ViT divides the input image into several patches and must model their global
relationships directly, researchers have focused on how to use the information of each patch
efficiently to extract the most distinguishing features of different objects (Wu et al., 2021b).
The standard ViT introduces a learnable position embedding (PE) to solve this problem, which
has turned out to be crucial for vision tasks (Islam, 2022).
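For concreteness, the following is a minimal PyTorch sketch of this standard input pipeline; the hyperparameters are illustrative (ViT-Base-like) and the class token is omitted for brevity:

```python
import torch
import torch.nn as nn

# Minimal sketch of the standard ViT input pipeline: split the image into
# patches, project each patch linearly, and add a randomly initialized,
# learnable position embedding (PE) so the model can distinguish patch
# locations. All hyperparameters below are illustrative.
img_size, patch_size, dim = 224, 16, 768
num_patches = (img_size // patch_size) ** 2            # 14 * 14 = 196

patchify = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
pos_embed = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)

x = torch.randn(1, 3, img_size, img_size)              # dummy input image
tokens = patchify(x).flatten(2).transpose(1, 2)        # (1, 196, 768)
tokens = tokens + pos_embed                            # inject location info
```

Note that the PE alone contributes 196 x 768 randomly initialized parameters that must all be learned from data.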
Most recent studies modify the standard learnable PE of ViT, proposing new ways to compute
the PE so that it better expresses the location information of the different patches; yet the
number of random learnable parameters in the PE remains too large to train well (Khan et al.,
2021; Su et al., 2021; Liu et al., 2022; Yang et al., 2020). In contrast, this paper avoids
changing the PE directly and instead explores the hidden relationship between the input patches,
generating a relationship matrix from a learnable relationship vector. Since the relation
embedding (RE) matrix has the same size as the PE, we can either use it to replace the PE
outright or combine the two.
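As a rough illustration of how a relationship matrix of the same shape as the PE can be generated from a single learnable vector, consider the following PyTorch sketch; the distance definitions follow the abstract, but the scaling construction, the normalization, and all names are our assumptions for illustration, not the authors' exact method:

```python
import torch
import torch.nn as nn

class RelationEmbedding(nn.Module):
    """Hypothetical sketch: expand a single learnable relationship vector
    into an (N, D) relation embedding (RE) matrix by scaling it with each
    patch's distance index. Illustrative only, not the authors' code."""
    def __init__(self, grid_size: int, dim: int, mode: str = "sre"):
        super().__init__()
        if mode == "sre":
            # SRE: patches form an ordered sequence; adjacent patches are
            # separated by the same unit distance, so patch i has index i.
            idx = torch.arange(grid_size * grid_size)
        else:
            # CRE: distance of each patch from the central patch under the
            # four-neighborhood (Manhattan) metric; patches on the same
            # concentric circle share the same index.
            c = grid_size // 2
            ys, xs = torch.meshgrid(torch.arange(grid_size),
                                    torch.arange(grid_size), indexing="ij")
            idx = ((ys - c).abs() + (xs - c).abs()).flatten()
        # Normalize indices to [0, 1] to keep the embedding scale bounded.
        self.register_buffer("dist", idx.float() / idx.max())
        # A single learnable D-dimensional relationship vector: far fewer
        # random parameters than a full (N, D) position embedding.
        self.vec = nn.Parameter(torch.randn(dim) * 0.02)

    def forward(self) -> torch.Tensor:
        # Outer product of distances and the vector: shape (N, D), like PE.
        return self.dist.unsqueeze(1) * self.vec
```

Under this reading, only the D parameters of the vector are random and learnable; the spatial structure comes from the fixed distance indices, so the resulting matrix can either replace the PE or be added to it.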
4th Workshop on Shared Visual Representations in Human and Machine Visual Intelligence (SVRHM) at the
Neural Information Processing Systems (NeurIPS) conference 2022, New Orleans.