Boosting vision transformers for image retrieval
Chull Hwan Song¹, Jooyoung Yoon¹, Shunghyun Choi¹, Yannis Avrithis²,³
¹Dealicious Inc.  ²Institute of Advanced Research on Artificial Intelligence (IARAI)  ³Athena RC
Abstract
Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection. However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional networks. We propose a number of improvements that make transformers outperform the state of the art for the first time. (1) We show that a hybrid architecture is more effective than plain transformers, by a large margin. (2) We introduce two branches collecting global (classification token) and local (patch tokens) information, from which we form a global image representation. (3) In each branch, we collect multi-layer features from the transformer encoder, corresponding to skip connections across distant layers. (4) We enhance the locality of interactions at the deeper layers of the encoder, which is the relative weakness of vision transformers. We train our model on all commonly used training sets and, for the first time, we make fair comparisons separately per training set. In all cases, we outperform previous models based on global representation. Public code is available at https://github.com/dealicious-inc/DToP.
1. Introduction
Instance-level image retrieval has undergone impressive progress in the deep learning era. Based on convolutional networks (CNN), it is possible to learn compact, discriminative representations in either supervised or unsupervised settings. Advances concern mainly pooling methods [30, 1, 28, 47, 16], loss functions originating in deep metric learning [16, 48, 39], large-scale open datasets [2, 16, 48, 42, 60], and competitions such as Google landmark retrieval (https://www.kaggle.com/c/landmark-retrieval-2021).

Studies of self-attention-based transformers [57], originating in the NLP field, have followed an explosive growth in computer vision too, starting with the vision transformer (ViT) [29]. However, most of these studies focus on image classification and detection. The few studies concerned with image retrieval [12, 4] find that transformers still underperform convolutional networks, even when trained on more data under better settings.
In this work, we study a large number of vision transformers on image retrieval and contribute ideas to improve their performance, without introducing a new architecture. We are motivated by the fact that vision transformers may have a powerful built-in attention-based pooling mechanism, but this is learned on the training set distribution, while in image retrieval the test distribution is different. Hence, we need to go back to the patch token embeddings. We build a powerful global image representation by an advanced pooling mechanism over token embeddings from several of the last layers of the transformer encoder. We thus call our method deep token pooling (DToP).

Image retrieval studies are distinguished between global and local representations, involving one [48, 39, 64] and several [42, 3, 56] vectors per image, respectively. We focus on the former as it is compact and allows simple and fast search. For the same reason, we do not focus on re-ranking, based either on local feature geometry [42, 52] or on graph-based methods like diffusion [11, 26].
We make the following contributions:
1. We show the importance of inductive bias in the first layers for image retrieval.
2. We handle dynamic image size at training.
3. We collect global and local features from the classification and patch tokens, respectively, of multiple layers.
4. We enhance the locality of interactions in the last layers by means of lightweight, multi-scale convolution.
5. We contribute to fair benchmarking by grouping results by training set and training models on all commonly used training sets in the literature.
6. We achieve state-of-the-art performance on image retrieval using transformers for the first time.
2. Method
Figure 1 shows the proposed design of our deep token pooling (DToP). After introducing the vision transformer in subsection 2.1, we motivate and lay out its design principles in subsection 2.2, discussing the different components in turn. We then provide a detailed account of the model in subsection 2.3.
Figure 1: The high-level design of our deep token pooling (DToP). Using a transformer encoder (center), we build a global image representation for image retrieval by means of a global branch (blue arrows, top) and a local branch (red arrows, bottom), collecting [CLS] and patch token embeddings, respectively, from multiple layers. There are two mechanisms to improve locality of interactions (green): a CNN stem for the first layers (left), which amounts to a hybrid architecture, and our enhanced locality module (ELM) (Figure 2(b)) in the local branch. Our dynamic position embedding (DPE) (Figure 2(a)) allows for dynamic image size at training.
2.1. Preliminaries: vision transformer
A transformer encoder, shown in the center of Figure 1, processes a sequence of token embeddings by allowing pairwise interactions in each layer. While we investigate a number of vision transformers, we follow ViT [29] here, which is our default choice. The input sequence can be written as
X = [x_{[CLS]}; x_1; \dots; x_M] \in \mathbb{R}^{(M+1) \times D},    (1)

where the patch token embeddings x_1, \dots, x_M \in \mathbb{R}^D are obtained from the input image, the learnable [CLS] token embedding x_{[CLS]} serves as the global image representation at the output layer, M is the sequence length, and D is the token embedding dimension.
There are two ways to form patch token embeddings. The most common is to decompose the input image into M = wh raw, fixed-size, square non-overlapping patches and project them to D dimensions via a learnable linear layer. Alternatively, one may use a convolutional network stem to map the raw input image to a w × h × D feature tensor, then fold this tensor into a sequence of M = wh vectors of dimension D. This is called a hybrid architecture. Here, w × h is the input resolution, i.e., the image resolution divided by the patch size in the first case or the downsampling ratio of the stem in the second.
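As a concrete illustration of the first approach, the following is a minimal sketch in PyTorch; the embedding dimension, patch size and dummy input are illustrative assumptions, not the exact settings used here. The learnable linear projection of non-overlapping patches is implemented, as is common, as a convolution whose kernel size and stride equal the patch size. The hybrid alternative, where a CNN stem replaces this projection, is sketched in subsection 2.2.

```python
import torch
import torch.nn as nn

D, patch = 768, 16                        # assumed embedding dim and patch size
patchify = nn.Conv2d(3, D, kernel_size=patch, stride=patch)  # linear projection of p x p patches

img = torch.randn(2, 3, 224, 224)         # dummy batch of images
feat = patchify(img)                      # (B, D, h, w), here h = w = 224 / 16 = 14
tokens = feat.flatten(2).transpose(1, 2)  # (B, M, D) patch token sequence, M = wh = 196
```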
The input sequence is added to a sequence of learnable position embeddings, meant to preserve positional information, and given to the transformer encoder, which has L layers preserving the sequence length and dimension. Each layer consists of a multi-head self-attention (MSA) and an MLP block. The output is the embedding of the [CLS] token at the last layer, c_L.
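To make these preliminaries concrete, here is a minimal, hedged sketch of the encoder input pipeline. All sizes are illustrative, and nn.TransformerEncoderLayer is used only as a stand-in for the ViT block (pre-norm MSA + MLP with GELU); it is not the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn

B, M, D, L = 2, 196, 768, 12                        # illustrative sizes
tokens = torch.randn(B, M, D)                       # patch token embeddings

cls_token = nn.Parameter(torch.zeros(1, 1, D))      # learnable [CLS] embedding
pos_embed = nn.Parameter(torch.zeros(1, M + 1, D))  # learnable position embeddings

x = torch.cat([cls_token.expand(B, -1, -1), tokens], dim=1)  # (B, M+1, D), as in eq. (1)
x = x + pos_embed                                   # add positional information

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=12, dim_feedforward=4 * D,
                               activation="gelu", batch_first=True, norm_first=True),
    num_layers=L)
x = encoder(x)                                      # L layers, length and dim preserved
c_L = x[:, 0]                                       # [CLS] embedding at the last layer
```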
2.2. Motivation and design principles
We investigate a number of ideas, discussing related work in other tasks and laying out design principles accordingly. The overall goal is to use the features obtained by a vision transformer, without designing an entirely new architecture or extending an existing one too much.
Hybrid architecture
As shown in the original ViT study [29], hybrid models slightly outperform ViT at small computational budgets, but the difference vanishes for larger models. Of course, this finding refers to image classification tasks only. Although hybrid models are still studied [17], they are not mainstream: it is more common to introduce structure and inductive bias into the transformer models themselves, where the input is still raw patches [37, 23, 62, 67, 19].

We are the first to conduct a large-scale investigation of different transformer architectures, including hybrid models, for image retrieval. Interestingly, we find that, in terms of a global representation like the [CLS] token embedding, the hybrid model originally introduced by [29], consisting of a CNN stem and a ViT encoder, performs best on image retrieval benchmarks by a large margin. As shown on the left in Figure 1, we use a CNN stem and a ViT encoder by default: the intermediate feature maps of the CNN stem are fed into ViT as token embeddings with patch size 1×1 rather than raw image patches.
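A hedged sketch of this hybrid setup follows. A truncated torchvision ResNet-50 stands in for the CNN stem (the original hybrid of [29] uses a modified R50, so this is an approximation), and its intermediate feature map is projected to D channels and fed to the encoder as tokens with patch size 1×1.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

D = 768
backbone = resnet50()
stem = nn.Sequential(*list(backbone.children())[:7])   # up to layer3: downsampling ratio 16
proj = nn.Conv2d(1024, D, kernel_size=1)                # "patch size 1x1" projection

img = torch.randn(2, 3, 224, 224)
feat = proj(stem(img))                                  # (B, D, 14, 14) feature tensor
tokens = feat.flatten(2).transpose(1, 2)                # (B, 196, D) token sequence for ViT
```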
Handling different image resolutions
Image resolution is an important factor in training image retrieval models. It is known that preserving the original image resolution is effective [20, 16]. However, this leads to increased computational cost and longer training time. Focusing on image classification, MobileViT [38] proposes a multi-scale sampler that randomly samples a spatial resolution from a fixed set and computes the batch size for this resolution at every training iteration. On image retrieval, group-size sampling [65] has been shown to be very effective. Here, one constructs a mini-batch with images of similar aspect ratios, resizing them to a predefined size according to aspect ratio.

We follow this latter approach. However, because of different aspect ratios, the image size still differs per mini-batch, which presents a new challenge: position embeddings are of fixed length, corresponding to a fixed spatial resolution when unfolded. For this reason, as shown on the left in Figure 1, we propose dynamic position embedding (DPE), whereby the fixed-size learned embeddings are dynamically resampled to the size of each mini-batch.
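A possible implementation of DPE is sketched below; it follows the common practice of bilinearly resampling the learned 2D grid of position embeddings, keeping the [CLS] embedding unchanged. The grid sizes are illustrative, and the exact resampling scheme used by DToP may differ.

```python
import torch
import torch.nn.functional as F

def resample_pos_embed(pos_embed, grid0, grid):
    """pos_embed: (1, 1 + h0*w0, D); grid0 = (h0, w0) trained grid; grid = (h, w) batch grid."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    D = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, grid0[0], grid0[1], D).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=grid, mode="bilinear", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, grid[0] * grid[1], D)
    return torch.cat([cls_pe, patch_pe], dim=1)

pe = torch.randn(1, 1 + 14 * 14, 768)                 # embeddings trained for a 14 x 14 grid
pe_new = resample_pos_embed(pe, (14, 14), (18, 12))   # mini-batch with a different token grid
print(pe_new.shape)                                   # torch.Size([1, 217, 768])
```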
Global and local branches
It is well known [42, 48, 39] that an image retrieval model should focus on the target object, not the background. It is then no surprise that recent methods, focusing either on global or local representations, have a global and a local branch in their architecture after the backbone [3, 56, 61, 64, 53]. The objective of the local branch is to improve the localization properties of the model, even if the representation is eventually pooled into a single vector. Even though transformers have shown better localization properties than convolutional networks, especially in the self-supervised setting [4, 22, 33], the few studies so far on vision transformers for image retrieval are limited to using the [CLS] token from the last layer of ViT as a global representation [12, 4, 14].

In this context, our goal is to investigate the role of a local branch on top of a vision transformer encoder for image retrieval. This study is unique in that the local branch has access to patch token embeddings of different layers, re-introduces inductive bias by means of convolution at different scales, and ends in global spatial pooling, thereby being complementary to the [CLS] token. As shown on the top/bottom in Figure 1, the global/local branch is based on the [CLS]/patch tokens, respectively. The final image representation is based on the concatenation of the features generated by the two branches.
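The following hedged sketch shows only how the two branch outputs could be combined into a retrieval descriptor; the internals of each branch (multi-layer fusion, ELM, the exact pooling) are simplified placeholders, with mean pooling standing in for the local branch's global spatial pooling.

```python
import torch
import torch.nn.functional as F

def image_descriptor(cls_feat, patch_feat):
    """cls_feat: (B, D) from the global branch input ([CLS] tokens);
       patch_feat: (B, M, D) from the local branch input (patch tokens)."""
    g = cls_feat                        # global branch output (placeholder)
    l = patch_feat.mean(dim=1)          # local branch: global spatial pooling (placeholder)
    desc = torch.cat([g, l], dim=-1)    # concatenate the two branches, (B, 2D)
    return F.normalize(desc, dim=-1)    # unit-norm descriptor for retrieval

desc = image_descriptor(torch.randn(2, 768), torch.randn(2, 196, 768))
print(desc.shape)                       # torch.Size([2, 1536])
```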
Multi-layer features
It is common in object detection, semantic segmentation and other dense prediction tasks to use features of different scales from different network layers, giving rise to feature pyramids [50, 36, 35, 34, 55]. It is also common to introduce skip connections within the architecture, sparsely or densely across layers, including architecture learning [24, 71, 13]. Apart from standard residual connections, connections across distant layers are not commonly studied in either image retrieval or vision transformers.

As shown on the top/bottom in Figure 1, without changing the encoder architecture itself, we investigate direct connections from several of its last layers to both the global and local branches, in the form of concatenation followed by a number of layers. This is similar to hypercolumns [21], but we are focusing on the last layers and building a global representation. The spatial resolution remains fixed in ViT, but we do take scale into account by means of dilated convolution. Interestingly, skip connections, and especially direct connections to the output, are known to improve the loss landscape of the network [31, 40].
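As an illustration, here is a hedged sketch of multi-layer feature fusion on the local branch: patch-token maps from the last k layers are concatenated along the channel axis and processed with dilated convolution so that scale is taken into account despite ViT's fixed spatial resolution. The number of layers, kernel sizes and dilation rate are assumptions, not the paper's exact choices.

```python
import torch
import torch.nn as nn

B, D, h, w, k = 2, 768, 14, 14, 3
layer_maps = [torch.randn(B, D, h, w) for _ in range(k)]   # patch-token maps of the last k layers

fuse = nn.Sequential(
    nn.Conv2d(k * D, D, kernel_size=1),                    # reduce concatenated channels
    nn.ReLU(inplace=True),
    nn.Conv2d(D, D, kernel_size=3, padding=2, dilation=2), # dilated conv enlarges receptive field
)
fused = fuse(torch.cat(layer_maps, dim=1))                 # (B, D, h, w) fused local feature map
```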
Enhancing locality
Transformers mainly rely on global self-attention, which makes them good at modeling long-range dependencies. However, contrary to convolutional networks with fixed kernel size, they lack a mechanism to localize interactions. As a consequence, many studies [70, 66, 19, 32, 7, 43, 5] have been proposed to improve ViT by bringing in locality.

In this direction, apart from using a CNN stem in the first layers, we introduce an enhanced locality module (ELM) in the local branch, as shown in Figure 1. Our goal is to investigate inductive bias in the deeper layers of the encoder, without overly extending the architecture itself. For this reason, the design of ELM is extremely lightweight, inspired by mobile networks [51].
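To convey the idea, here is a hedged sketch of a lightweight multi-scale convolution block in the mobile-network style: parallel depthwise convolutions at several dilation rates followed by a pointwise convolution, localizing interactions at multiple scales with few parameters. The actual ELM design is given in Figure 2(b), so the rates and fusion used here are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleDWBlock(nn.Module):
    def __init__(self, dim, dilations=(1, 2, 3)):
        super().__init__()
        # depthwise 3x3 convolutions, one per dilation rate (size-preserving)
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, 3, padding=d, dilation=d, groups=dim) for d in dilations)
        # pointwise 1x1 convolution fuses the multi-scale branches
        self.pointwise = nn.Conv2d(dim * len(dilations), dim, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):                       # x: (B, D, h, w) patch-token map
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return self.act(self.pointwise(y))

elm = MultiScaleDWBlock(768)
out = elm(torch.randn(2, 768, 14, 14))          # (2, 768, 14, 14)
```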