2.2. Motivation and design principles
In this section, we investigate a number of ideas, discuss related work in other tasks, and lay out design principles accordingly. The overall goal is to use the features obtained by a vision transformer, without designing an entirely new architecture or extending an existing one too much.
Hybrid architecture
As shown in the original ViT study [29], hybrid models slightly outperform ViT at small computational budgets, but the difference vanishes for larger models. Of course, this finding refers to image classification tasks only. Although hybrid models are still studied [17], they are not mainstream: It is more common to introduce structure and inductive bias to transformer models themselves, where the input is still raw patches [37, 23, 62, 67, 19].
We are the first to conduct a large-scale investigation of different transformer architectures, including hybrid models, for image retrieval. Interestingly, we find that, in terms of a global representation such as the [CLS] token embedding, the hybrid model originally introduced by [29], consisting of a CNN stem and a ViT encoder, performs best on image retrieval benchmarks by a large margin. As shown on the left in Figure 1, we use a CNN stem and a ViT encoder by default: The intermediate feature maps of the CNN stem are fed into ViT as token embeddings with patch size 1×1 rather than raw image patches.
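
To make this concrete, here is a minimal PyTorch sketch of a CNN stem feeding a ViT encoder through a 1×1 patch embedding; the stem layout and channel sizes are illustrative assumptions, not the exact configuration of [29].

    import torch
    import torch.nn as nn

    class CNNStemTokenizer(nn.Module):
        # Illustrative sketch: a small CNN stem whose output feature map
        # is flattened into ViT token embeddings with patch size 1x1.
        def __init__(self, embed_dim=768):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
                nn.BatchNorm2d(64),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            )
            # 1x1 "patch embedding": a pointwise projection of the feature map.
            self.proj = nn.Conv2d(64, embed_dim, kernel_size=1)

        def forward(self, x):
            f = self.proj(self.stem(x))            # (B, D, H', W')
            tokens = f.flatten(2).transpose(1, 2)  # (B, H'*W', D) token sequence
            return tokens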
Handling different image resolutions
Image resolution is an important factor in training image retrieval models. It is known that preserving the original image resolution is effective [20, 16]. However, this leads to increased computational cost and longer training time. Focusing on image classification, MobileViT [38] proposes a multi-scale sampler that randomly samples a spatial resolution from a fixed set and computes the batch size for this resolution at every training iteration. On image retrieval, group-size sampling [65] has been shown to be very effective: Here, one constructs a mini-batch with images of similar aspect ratios, resizing them to a predefined size according to the aspect ratio.
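
As an illustration of group-size sampling, the following plain-Python sketch buckets images by sorted aspect ratio and draws each mini-batch from a single bucket; the bucketing scheme and number of groups are our assumptions, not the exact procedure of [65].

    import random
    from collections import defaultdict

    def group_size_batches(aspect_ratios, batch_size, num_groups=12):
        # Bucket images by sorted aspect ratio, then form each mini-batch
        # from a single bucket, so all images in a batch can share one
        # (resized) shape.
        order = sorted(range(len(aspect_ratios)), key=lambda i: aspect_ratios[i])
        buckets = defaultdict(list)
        for rank, idx in enumerate(order):
            buckets[rank * num_groups // len(order)].append(idx)
        batches = []
        for group in buckets.values():
            random.shuffle(group)
            batches += [group[i:i + batch_size]
                        for i in range(0, len(group), batch_size)]
        random.shuffle(batches)  # keep resolution order random across iterations
        return batches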
We follow this latter approach. However, because aspect ratios differ, the image size still varies across mini-batches, which presents a new challenge: Position embeddings are of fixed length, corresponding to a fixed spatial resolution when unfolded. For this reason, as shown on the left in Figure 1, we propose dynamic position embedding (DPE), whereby the fixed-size learned embeddings are dynamically resampled to the size of each mini-batch.
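
A minimal PyTorch sketch of this resampling follows, assuming the learned embeddings include a [CLS] position and using bilinear interpolation (the interpolation mode is an assumption):

    import torch
    import torch.nn.functional as F

    def resample_pos_embed(pos_embed, grid_hw, target_hw):
        # pos_embed: (1, 1 + H*W, D) learned embeddings, [CLS] first.
        # The patch positions are reshaped to their 2D grid and resampled
        # to the mini-batch's grid size.
        cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
        h, w = grid_hw
        patch_pos = patch_pos.reshape(1, h, w, -1).permute(0, 3, 1, 2)  # (1,D,H,W)
        patch_pos = F.interpolate(patch_pos, size=target_hw,
                                  mode='bilinear', align_corners=False)
        patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, -1, pos_embed.shape[-1])
        return torch.cat([cls_pos, patch_pos], dim=1)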
Global and local branches
It is well known [42, 48, 39] that an image retrieval model should focus on the target object, not the background. It is then no surprise that recent methods, focusing either on global or local representations, have a global and a local branch in their architecture after the backbone [3, 56, 61, 64, 53]. The objective of the local branch is to improve the localization properties of the model, even if the representation is eventually pooled into a single vector. Even though transformers have shown better localization properties than convolutional networks, especially in the self-supervised setting [4, 22, 33], the few studies so far on vision transformers for image retrieval are limited to using the [CLS] token from the last layer of ViT as a global representation [12, 4, 14].
In this context, our goal is to investigate the role of a local branch on top of a vision transformer encoder for image retrieval. This study is unique in that the local branch has access to patch token embeddings of different layers, re-introduces inductive bias by means of convolution at different scales and ends in global spatial pooling, thereby being complementary to the [CLS] token. As shown on the top/bottom in Figure 1, the global/local branch is based on the [CLS]/patch tokens, respectively. The final image representation is based on the concatenation of the features generated by the two branches.
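
For illustration only, a hypothetical PyTorch sketch of such a two-branch head is given below; the specific layers (a linear map on the [CLS] token, one convolution with average pooling on the patch tokens) are placeholders rather than the actual branches of Figure 1.

    import torch
    import torch.nn as nn

    class TwoBranchHead(nn.Module):
        # Global branch on the [CLS] token; local branch reshapes patch
        # tokens to a 2D map, applies a convolution, and spatially pools
        # to a single vector. The two are concatenated.
        def __init__(self, dim=768, out_dim=512):
            super().__init__()
            self.global_fc = nn.Linear(dim, out_dim)
            self.local_conv = nn.Conv2d(dim, out_dim, kernel_size=3, padding=1)

        def forward(self, cls_tok, patch_toks, hw):
            g = self.global_fc(cls_tok)                        # (B, out_dim)
            h, w = hw
            fmap = patch_toks.transpose(1, 2).reshape(
                -1, patch_toks.shape[-1], h, w)                # (B, D, H, W)
            l = self.local_conv(fmap).mean(dim=(2, 3))         # global average pooling
            return torch.cat([g, l], dim=1)                    # final representation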
Multi-layer features
It is common in object detection, semantic segmentation and other dense prediction tasks to use features of different scales from different network layers, giving rise to feature pyramids [50, 36, 35, 34, 55]. It is also common to introduce skip connections within the architecture, sparsely or densely across layers, including in architecture learning [24, 71, 13]. Apart from standard residual connections, connections across distant layers are not commonly studied in either image retrieval or vision transformers.
As shown on the top/bottom in Figure 1, without changing the encoder architecture itself, we investigate direct connections from several of its last layers to both the global and local branches, in the form of concatenation followed by a number of layers. This is similar to hypercolumns [21], but we focus on the last layers and build a global representation. The spatial resolution remains fixed in ViT, but we do take scale into account by means of dilated convolution. Interestingly, skip connections and especially direct connections to the output are known to improve the loss landscape of the network [31, 40].
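
A hypothetical sketch of such a fusion in PyTorch: patch-token maps from the last few layers are concatenated along channels and mixed by dilated convolutions of different rates; the number of layers and the rates are illustrative assumptions.

    import torch
    import torch.nn as nn

    class MultiLayerFusion(nn.Module):
        # Concatenate patch-token maps from the last k encoder layers,
        # then mix them with dilated convolutions of different rates to
        # reintroduce a notion of scale.
        def __init__(self, dim=768, num_layers=3, out_dim=512, rates=(1, 2, 3)):
            super().__init__()
            self.mix = nn.ModuleList([
                nn.Conv2d(dim * num_layers, out_dim, kernel_size=3,
                          padding=r, dilation=r)
                for r in rates
            ])

        def forward(self, layer_maps):             # list of (B, dim, H, W)
            x = torch.cat(layer_maps, dim=1)       # (B, dim*num_layers, H, W)
            return sum(m(x) for m in self.mix)     # fuse multi-rate responses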
Enhancing locality
Transformers mainly rely on global self-attention, which makes them good at modeling long-range dependencies. However, unlike convolutional networks with fixed kernel size, they lack a mechanism to localize interactions. As a consequence, many studies [70, 66, 19, 32, 7, 43, 5] propose to improve ViT by bringing in locality.
In this direction, apart from using a CNN stem in the first layers, we introduce an enhanced locality module (ELM) in the local branch, as shown in Figure 1. Our goal is to investigate inductive bias in the deeper layers of the encoder, without overly extending the architecture itself. For this reason, the design of ELM is extremely lightweight, inspired by mobile networks [51].
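
As a rough indication of what such a module may look like, below is a hypothetical inverted-residual block in the spirit of mobile networks [51]; the expansion factor, activation and normalization choices are our assumptions, not the actual ELM design.

    import torch.nn as nn

    class ELM(nn.Module):
        # Lightweight locality block: pointwise expansion, depthwise 3x3
        # convolution for local interactions, pointwise projection, plus
        # a residual connection.
        def __init__(self, dim, expansion=4):
            super().__init__()
            hidden = dim * expansion
            self.block = nn.Sequential(
                nn.Conv2d(dim, hidden, kernel_size=1),
                nn.BatchNorm2d(hidden),
                nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),
                nn.BatchNorm2d(hidden),
                nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, dim, kernel_size=1),
                nn.BatchNorm2d(dim),
            )

        def forward(self, x):
            return x + self.block(x)  # residual around the inverted bottleneck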