
Sequence and Circle: Exploring the Relationship
Between Patches
Zhengyang Yu
Frankfurt Institute for Advanced Studies
Xidian FIAS Joint Research Center
Frankfurt am Main, Germany
zhyu@fias.uni-frankfurt.de
Jochen Triesch
Frankfurt Institute for Advanced Studies
Frankfurt am Main, Germany
triesch@fias.uni-frankfurt.de
Abstract
The vision transformer (ViT) has achieved state-of-the-art results in various vision
tasks. It utilizes a learnable position embedding (PE) mechanism to encode the
location of each image patch. However, it is presently unclear if this learnable PE
is really necessary and what its benefits are. This paper explores two alternative
ways of encoding the location of individual patches that exploit prior knowledge
about their spatial arrangement. One is called the sequence relationship embedding
(SRE), and the other is called the circle relationship embedding (CRE). The
SRE treats all patches as an ordered sequence in which adjacent patches are
separated by the same unit distance. The CRE treats the central patch as the
center of a circle and measures the distance of every remaining patch from this
center according to the four-neighborhood principle, so that multiple concentric
circles with different radii group different patches together. Finally, we
implement these two relations in three classic ViTs and test them on four
popular datasets. Experiments show that SRE and CRE can replace PE, reducing
the number of random learnable parameters while achieving the same performance,
and that combining SRE or CRE with PE yields better performance than using PE
alone.
1 Introduction
The Vision Transformer (ViT) combines the self-attention mechanism with a patch-based
representation of the image to achieve global modeling capabilities at greatly reduced
computational cost (Dosovitskiy et al., 2020; Guo et al., 2022; Han et al., 2022). It retains
the advantages of the transformer from NLP (Vaswani et al., 2017; Devlin et al., 2018; Wang
et al., 2018): it captures global relationships in the input, parallelizes computation, and
achieves performance comparable to or surpassing convolutional neural networks (CNNs) on
various computer vision tasks (He et al., 2016; Li et al., 2022), such as image classification
(Bi et al., 2021), image segmentation (Xu et al., 2022), and object detection (Islam, 2022).
Because ViT divides the input image into several patches and must model their global
relationships directly, researchers have focused on how to use the information of each patch
efficiently to extract the most distinguishing features of different objects (Wu et al., 2021b).
The standard ViT introduces a learnable position embedding (PE) to solve this problem, which
has turned out to be crucial for vision tasks (Islam, 2022).
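For concreteness, the following is a minimal PyTorch sketch of this standard input pipeline; the hyperparameters are illustrative (ViT-Base-like) and the class token is omitted for brevity:

```python
import torch
import torch.nn as nn

# Minimal sketch of the standard ViT input pipeline: split the image into
# patches, project each patch linearly, and add a randomly initialized,
# learnable position embedding (PE) so the model can distinguish patch
# locations. All hyperparameters below are illustrative.
img_size, patch_size, dim = 224, 16, 768
num_patches = (img_size // patch_size) ** 2            # 14 * 14 = 196

patchify = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
pos_embed = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)

x = torch.randn(1, 3, img_size, img_size)              # dummy input image
tokens = patchify(x).flatten(2).transpose(1, 2)        # (1, 196, 768)
tokens = tokens + pos_embed                            # inject location info
```

Note that the PE alone contributes 196 x 768 randomly initialized parameters that must all be learned from data.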
Most recent studies modify the standard learnable PE of ViT, proposing new ways to compute
the PE so that it better expresses the location information of the different patches; yet the
number of random learnable parameters in the PE remains too large to train well (Khan et al.,
2021; Su et al., 2021; Liu et al., 2022; Yang et al., 2020). In contrast, this paper avoids
changing the PE directly and instead explores the hidden relationship between the input patches,
generating a relationship matrix from a learnable relationship vector. Since the relation
embedding (RE) matrix has the same size as the PE, we can either use it to replace the PE
outright or combine the two.
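As a rough illustration of how a relationship matrix of the same shape as the PE can be generated from a single learnable vector, consider the following PyTorch sketch; the distance definitions follow the abstract, but the scaling construction, the normalization, and all names are our assumptions for illustration, not the authors' exact method:

```python
import torch
import torch.nn as nn

class RelationEmbedding(nn.Module):
    """Hypothetical sketch: expand a single learnable relationship vector
    into an (N, D) relation embedding (RE) matrix by scaling it with each
    patch's distance index. Illustrative only, not the authors' code."""
    def __init__(self, grid_size: int, dim: int, mode: str = "sre"):
        super().__init__()
        if mode == "sre":
            # SRE: patches form an ordered sequence; adjacent patches are
            # separated by the same unit distance, so patch i has index i.
            idx = torch.arange(grid_size * grid_size)
        else:
            # CRE: distance of each patch from the central patch under the
            # four-neighborhood (Manhattan) metric; patches on the same
            # concentric circle share the same index.
            c = grid_size // 2
            ys, xs = torch.meshgrid(torch.arange(grid_size),
                                    torch.arange(grid_size), indexing="ij")
            idx = ((ys - c).abs() + (xs - c).abs()).flatten()
        # Normalize indices to [0, 1] to keep the embedding scale bounded.
        self.register_buffer("dist", idx.float() / idx.max())
        # A single learnable D-dimensional relationship vector: far fewer
        # random parameters than a full (N, D) position embedding.
        self.vec = nn.Parameter(torch.randn(dim) * 0.02)

    def forward(self) -> torch.Tensor:
        # Outer product of distances and the vector: shape (N, D), like PE.
        return self.dist.unsqueeze(1) * self.vec
```

Under this reading, only the D parameters of the vector are random and learnable; the spatial structure comes from the fixed distance indices, so the resulting matrix can either replace the PE or be added to it.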
4th Workshop on Shared Visual Representations in Human and Machine Visual Intelligence (SVRHM) at the
Neural Information Processing Systems (NeurIPS) conference 2022, New Orleans.