Sequence and Circle: Exploring the Relationship
Between Patches
Zhengyang Yu
Frankfurt Institute for Advanced Studies
Xidian FIAS Joint Research Center
Frankfurt am Main, Germany
zhyu@fias.uni-frankfurt.de
Jochen Triesch
Frankfurt Institute for Advanced Studies
Frankfurt am Main, Germany
triesch@fias.uni-frankfurt.de
Abstract
The vision transformer (ViT) has achieved state-of-the-art results in various vision
tasks. It utilizes a learnable position embedding (PE) mechanism to encode the
location of each image patch. However, it is presently unclear if this learnable PE
is really necessary and what its benefits are. This paper explores two alternative
ways of encoding the location of individual patches that exploit prior knowledge
about their spatial arrangement. One is called the sequence relationship embedding
(SRE), and the other is called the circle relationship embedding (CRE). The SRE
considers all patches to form an ordered sequence in which adjacent patches are
separated by the same interval distance. The CRE considers the central patch as the
center of a circle and measures the distance of the remaining patches from this
center based on the 4-neighbors principle, so that multiple concentric circles with
different radii group the patches. Finally, we implemented these two relationships in
three classic ViTs and tested them on four popular datasets. Experiments show that
SRE and CRE can replace PE, reducing the number of random learnable parameters while
achieving the same performance, and that combining SRE or CRE with PE yields better
performance than using PE alone.
1 Introduction
The Vision Transformer (ViT) utilizes the self-attention mechanism and splits the input image into patches
to achieve global modeling capabilities while reducing huge computational costs (Dosovitskiy et al.,
2020; Guo et al., 2022; Han et al., 2022). It retains the advantages of the transformer in NLP (Vaswani
et al., 2017; Devlin et al., 2018; Wang et al., 2018): it captures global input relationships, parallelizes
computations, and achieves performance comparable to or surpassing convolutional neural networks
(CNN) on various computer vision tasks (He et al., 2016; Li et al., 2022), such as image classification
(Bi et al., 2021), image segmentation (Xu et al., 2022), and object detection (Islam, 2022).
Because ViT divides the input image into several patches and needs to model the global relationship
directly, researchers focus on how to efficiently use the information of each patch to extract the most
distinguishing features of different objects (Wu et al., 2021b). The standard ViT introduces positional
embedding (PE) to solve this problem, which turned out to be crucial for vision tasks (Islam, 2022).
Most recent studies modify the standard ViT's learnable PE, proposing new methods to compute PE
so that the novel PE better expresses the location information of different patches; nevertheless,
the random learnable parameters of PE remain too numerous to train well (Khan et al., 2021; Su et al.,
2021; Liu et al., 2022; Yang et al., 2020). This paper instead avoids directly changing the PE and
explores the hidden relationship between the input patches, generating the relationship matrix from a learnable
relationship vector. Since the relation embedding (RE) matrix has the same size as the PE, we can
4th Workshop on Shared Visual Representations in Human and Machine Visual Intelligence (SVRHM) at the
Neural Information Processing Systems (NeurIPS) conference 2022. New Orleans.
arXiv:2210.09871v2 [cs.CV] 19 Oct 2022
[Figure 1 panels: (a) PE: position embedding, (b) SRE: sequence relationship embedding, (c) CRE: circle relationship embedding, (d) introducing RE to ViT (image patch embedding, transformer encoder, MLP head).]
Figure 1: The diagram of the different embeddings and the brief structure of the updated ViT. The
green lines mark the relationships between patches. The blue patches represent image patches, the
yellow one is the central patch among all patches, and the orange patches are the neighborhoods
of the central patch under the 4-neighbors principle. For the sequence and circle relationship
embeddings, we use 3×3 patches and 4×4 patches as examples to represent the cases where the
number of patches is odd or even, respectively. *RE denotes the SRE or CRE.
add the RE to the original PE or replace it. The advantage is that if we combine RE and PE, we
add more patch-to-patch information to the input without modifying the traditional ViT structure;
if we replace PE with RE, the learnable matrix becomes a learnable vector, compressing the number
of learnable parameters from a matrix to a vector. Since adding RE only involves simple matrix
operations, its effect on the training speed is negligible.
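The paragraph above states that the RE matrix matches the PE's shape yet is generated from a learnable vector. A minimal sketch of one such expansion, assuming a hypothetical lookup in which each patch's distance bucket selects one learnable value that is then broadcast across the embedding dimension (the names `expand_re`, `rel_vector`, and `dist_index` are illustrative, not from the paper):

```python
# Sketch: expanding a learnable relationship vector into an RE matrix
# with the same shape as the position embedding (PE), so it can be
# added to PE or replace it. The paper's exact expansion is not
# reproduced here; this illustrative version looks up one learnable
# scalar per distance bucket and broadcasts it across the embedding dim.

def expand_re(rel_vector, dist_index, dim):
    """rel_vector: learnable values, one per distinct distance.
    dist_index: distance bucket of each patch (length = num patches).
    dim: embedding dimension of the PE.
    Returns a num_patches x dim matrix."""
    return [[rel_vector[d]] * dim for d in dist_index]

# 4 patches, 2 distinct distances, embedding dim 3:
rel_vector = [0.1, 0.5]        # 2 learnable parameters instead of 4 * 3
dist_index = [1, 0, 0, 1]      # e.g. distance of each patch to the center
re_matrix = expand_re(rel_vector, dist_index, 3)
```

The compression the paper claims follows directly: the trainable state is `rel_vector` (one value per distance), while the full `num_patches x dim` matrix is derived from it on the fly.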
Inspired by the sequence (Sutskever et al., 2014) and the 4-neighbors principle (Castleman, 1996)
in digital image processing, this paper explores the two possible relationships between patches,
the sequence relationship embedding (SRE) and the circle relationship embedding (CRE). In SRE
and CRE, we replace the learnable matrix of PE with a learnable vector that encodes patch
relationships, which reduces the two-dimensional matrix of random parameters to one dimension.
The difference between SRE and CRE is that the former treats the central patch as the center of
one sequence, with patches at the same position on the front and rear sides lying at the same
distance from the central patch, while the latter treats the central patch as the center of
multiple concentric circles, with all patches on the same circle equidistant from the central
patch. The curved arrows and concentric circles in Figure 1 intuitively depict the SRE and the
CRE, respectively. When calculating CRE we draw on the 4-neighbors principle, which is also
widely used in image erosion (Jawas & Suciati, 2013), edge detection (Ziou et al., 1998), and other
fields (Su et al., 2009). We assess the SRE and CRE on four public datasets, and the results show that
SRE and CRE are effective and provide novel insights into analyzing PE.
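The two distance notions can be made concrete. The following is a sketch under one plausible reading of the description above: SRE measures the absolute offset from the central patch along the flattened patch sequence, while CRE's 4-neighbors rings correspond to the Manhattan distance from the central patch (function names are illustrative):

```python
# Sketch of the two distance maps for the 3x3 example from Figure 1.
# SRE: patches flattened into one sequence; distance is the absolute
# offset from the central patch in that sequence, so the front and rear
# neighbors at the same offset share a distance. CRE: distance grows by
# one per step under the 4-neighbors principle, i.e. the Manhattan
# distance to the central patch, so equidistant patches form rings.

def sre_distances(n):
    """1D sequence distance from the central patch for n x n patches."""
    total = n * n
    center = total // 2  # central index for odd n
    return [abs(i - center) for i in range(total)]

def cre_distances(n):
    """4-neighbors (Manhattan) distance from the central patch."""
    c = n // 2
    return [abs(r - c) + abs(col - c) for r in range(n) for col in range(n)]

print(sre_distances(3))  # [4, 3, 2, 1, 0, 1, 2, 3, 4]
print(cre_distances(3))  # [2, 1, 2, 1, 0, 1, 2, 1, 2]
```

Note how CRE assigns distance 1 to all four orange neighbors in Figure 1, while SRE gives the left/right neighbors distance 1 but the top/bottom ones distance 3 in a 3×3 grid.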
2 Related Work
ViT treats all patches equally, but in fact the positional relationships between patches differ,
so introducing PE helps to improve the performance of ViT. The following describes the
four types of PE.
Learnable absolute PE.
This PE was proposed with ViT (Dosovitskiy et al., 2020) and is also the
most widely used PE method. A random matrix of fixed dimensions is set up and combined with the
patch embeddings; it is updated by the optimizer synchronously with the other ViT parameters,
so the location information of the different patches is added to the embeddings.
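A minimal sketch of this mechanism, using plain Python lists in place of framework tensors (a PyTorch version would wrap `pe` in `nn.Parameter` so the optimizer updates it):

```python
import random

# Sketch of ViT's learnable absolute PE: a randomly initialized matrix
# of fixed shape (num_patches x dim) that is simply added element-wise
# to the patch embeddings and trained like any other parameter.

random.seed(0)
num_patches, dim = 4, 3
pe = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(num_patches)]

def add_pe(patch_embeddings, pe):
    """Element-wise addition of the PE to the patch embeddings."""
    return [[x + p for x, p in zip(row, prow)]
            for row, prow in zip(patch_embeddings, pe)]

tokens = [[1.0] * dim for _ in range(num_patches)]
out = add_pe(tokens, pe)
```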
Learnable relative PE.
Swin (Liu et al., 2021) is a representative of learnable relative PE. It encodes position
information using the relative distance between patches, applying the relative PE as
one-dimensional relative attention. In contrast, SRE encodes the information according to the
distance between the central patch and the other patches.
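The core idea of relative PE can be sketched in a minimal 1D form, assuming a hypothetical lookup table indexed by the signed offset between two patches (names are illustrative, not Swin's actual API):

```python
# Sketch of learnable relative PE: each pairwise bias is looked up from
# a learnable table indexed by the relative offset between two patches,
# rather than stored per absolute position. Minimal 1D version.

def relative_bias(table, num_patches):
    """table: learnable values indexed by relative offset; its length is
    2*num_patches - 1, covering offsets -(n-1) .. n-1."""
    n = num_patches
    return [[table[(j - i) + (n - 1)] for j in range(n)] for i in range(n)]

table = [0.1, 0.2, 0.3, 0.4, 0.5]   # offsets -2 .. 2 for n = 3
bias = relative_bias(table, 3)
```

The key property is parameter sharing: every patch pair with the same relative offset reads the same table entry, so the number of learnable values grows linearly rather than quadratically in the number of patches.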
Fixed PE.
Fixed PE uses fixed absolute values to encode the positions of the different patches.
The original transformer (Vaswani et al., 2017) uses sine and cosine functions for PE and
concatenates the encodings of different frequencies to form the final PE.
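This sinusoidal scheme from the original transformer can be written down directly:

```python
import math

# Fixed sinusoidal PE (Vaswani et al., 2017): sine on even dimensions,
# cosine on odd ones, with wavelengths forming a geometric progression
# from 2*pi up to 10000*2*pi. No learnable parameters are involved.

def sinusoidal_pe(num_positions, dim):
    pe = [[0.0] * dim for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_pe(4, 6)
```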