2.2. Motivation and design principles
In this section, we investigate a number of ideas, discuss related work in other tasks, and lay out design principles accordingly. The overall goal is to use the features obtained by a vision transformer, without designing an entirely new architecture or extending an existing one too much.
Hybrid architecture
As shown in the original ViT study [29], hybrid models slightly outperform ViT at small computational budgets, but the difference vanishes for larger models. Of course, this finding refers to image classification tasks only. Although hybrid models are still studied [17], they are not mainstream: It is more common to introduce structure and inductive bias to transformer models themselves, where the input is still raw patches [37, 23, 62, 67, 19].
We are the first to conduct a large-scale investigation of different transformer architectures, including hybrid models, for image retrieval. Interestingly, we find that, in terms of a global representation such as the [CLS] token embedding, the hybrid model originally introduced by [29], consisting of a CNN stem and a ViT encoder, performs best on image retrieval benchmarks by a large margin. As shown on the left in Figure 1, we use a CNN stem and a ViT encoder by default: The intermediate feature maps of the CNN stem are fed into ViT as token embeddings with patch size 1×1 rather than raw image patches.
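
To make this concrete, here is a minimal PyTorch sketch of a CNN stem feeding a ViT encoder through a 1×1 patch embedding; the stem layout and channel sizes are illustrative assumptions, not the exact configuration of [29].

    import torch
    import torch.nn as nn

    class CNNStemTokenizer(nn.Module):
        # Illustrative sketch: a small CNN stem whose output feature map
        # is flattened into ViT token embeddings with patch size 1x1.
        def __init__(self, embed_dim=768):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
                nn.BatchNorm2d(64),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            )
            # 1x1 "patch embedding": a pointwise projection of the feature map.
            self.proj = nn.Conv2d(64, embed_dim, kernel_size=1)

        def forward(self, x):
            f = self.proj(self.stem(x))            # (B, D, H', W')
            tokens = f.flatten(2).transpose(1, 2)  # (B, H'*W', D) token sequence
            return tokens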
Handling different image resolutions
Image resolution is an important factor in training image retrieval models. It is known that preserving the original image resolution is effective [20, 16]. However, this leads to increased computational cost and longer training time. Focusing on image classification, MobileViT [38] proposes a multi-scale sampler that randomly samples a spatial resolution from a fixed set and computes the batch size for this resolution at every training iteration. On image retrieval, group-size sampling [65] has been shown to be very effective: Here, one constructs a mini-batch with images of similar aspect ratios, resizing them to a predefined size according to the aspect ratio.
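
As an illustration of group-size sampling, the following plain-Python sketch buckets images by sorted aspect ratio and draws each mini-batch from a single bucket; the bucketing scheme and number of groups are our assumptions, not the exact procedure of [65].

    import random
    from collections import defaultdict

    def group_size_batches(aspect_ratios, batch_size, num_groups=12):
        # Bucket images by sorted aspect ratio, then form each mini-batch
        # from a single bucket, so all images in a batch can share one
        # (resized) shape.
        order = sorted(range(len(aspect_ratios)), key=lambda i: aspect_ratios[i])
        buckets = defaultdict(list)
        for rank, idx in enumerate(order):
            buckets[rank * num_groups // len(order)].append(idx)
        batches = []
        for group in buckets.values():
            random.shuffle(group)
            batches += [group[i:i + batch_size]
                        for i in range(0, len(group), batch_size)]
        random.shuffle(batches)  # keep resolution order random across iterations
        return batches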
We follow this latter approach. However, because aspect ratios differ, the image size still varies across mini-batches, which presents a new challenge: Position embeddings are of fixed length, corresponding to a fixed spatial resolution when unfolded. For this reason, as shown on the left in Figure 1, we propose dynamic position embedding (DPE), whereby the fixed-size learned embeddings are dynamically resampled to the size of each mini-batch.
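
A minimal PyTorch sketch of this resampling follows, assuming the learned embeddings include a [CLS] position and using bilinear interpolation (the interpolation mode is an assumption):

    import torch
    import torch.nn.functional as F

    def resample_pos_embed(pos_embed, grid_hw, target_hw):
        # pos_embed: (1, 1 + H*W, D) learned embeddings, [CLS] first.
        # The patch positions are reshaped to their 2D grid and resampled
        # to the mini-batch's grid size.
        cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
        h, w = grid_hw
        patch_pos = patch_pos.reshape(1, h, w, -1).permute(0, 3, 1, 2)  # (1,D,H,W)
        patch_pos = F.interpolate(patch_pos, size=target_hw,
                                  mode='bilinear', align_corners=False)
        patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, -1, pos_embed.shape[-1])
        return torch.cat([cls_pos, patch_pos], dim=1)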
Global and local branches
It is well known [42, 48, 39] that an image retrieval model should focus on the target object, not the background. It is then no surprise that recent methods, focusing either on global or local representations, have a global and a local branch in their architecture after the backbone [3, 56, 61, 64, 53]. The objective of the local branch is to improve the localization properties of the model, even if the representation is eventually pooled into a single vector. Even though transformers have shown better localization properties than convolutional networks, especially in the self-supervised setting [4, 22, 33], the few studies so far on vision transformers for image retrieval are limited to using the [CLS] token from the last layer of ViT as a global representation [12, 4, 14].
In this context, our goal is to investigate the role of a local branch on top of a vision transformer encoder for image retrieval. This study is unique in that the local branch has access to patch token embeddings of different layers, re-introduces inductive bias by means of convolution at different scales and ends in global spatial pooling, thereby being complementary to the [CLS] token. As shown on the top/bottom in Figure 1, the global/local branch is based on the [CLS]/patch tokens, respectively. The final image representation is based on the concatenation of the features generated by the two branches.
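
For illustration only, a hypothetical PyTorch sketch of such a two-branch head is given below; the specific layers (a linear map on the [CLS] token, one convolution with average pooling on the patch tokens) are placeholders rather than the actual branches of Figure 1.

    import torch
    import torch.nn as nn

    class TwoBranchHead(nn.Module):
        # Global branch on the [CLS] token; local branch reshapes patch
        # tokens to a 2D map, applies a convolution, and spatially pools
        # to a single vector. The two are concatenated.
        def __init__(self, dim=768, out_dim=512):
            super().__init__()
            self.global_fc = nn.Linear(dim, out_dim)
            self.local_conv = nn.Conv2d(dim, out_dim, kernel_size=3, padding=1)

        def forward(self, cls_tok, patch_toks, hw):
            g = self.global_fc(cls_tok)                        # (B, out_dim)
            h, w = hw
            fmap = patch_toks.transpose(1, 2).reshape(
                -1, patch_toks.shape[-1], h, w)                # (B, D, H, W)
            l = self.local_conv(fmap).mean(dim=(2, 3))         # global average pooling
            return torch.cat([g, l], dim=1)                    # final representation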
Multi-layer features
It is common in object detection, semantic segmentation and other dense prediction tasks to use features of different scales from different network layers, giving rise to feature pyramids [50, 36, 35, 34, 55]. It is also common to introduce skip connections within the architecture, sparsely or densely across layers, including in architecture learning [24, 71, 13]. Apart from standard residual connections, connections across distant layers are not commonly studied in either image retrieval or vision transformers.
As shown on the top/bottom in Figure 1, without changing the encoder architecture itself, we investigate direct connections from several of its last layers to both the global and local branches, in the form of concatenation followed by a number of layers. This is similar to hypercolumns [21], but we focus on the last layers and build a global representation. The spatial resolution remains fixed in ViT, but we do take scale into account by means of dilated convolution. Interestingly, skip connections and especially direct connections to the output are known to improve the loss landscape of the network [31, 40].
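
A hypothetical sketch of such a fusion in PyTorch: patch-token maps from the last few layers are concatenated along channels and mixed by dilated convolutions of different rates; the number of layers and the rates are illustrative assumptions.

    import torch
    import torch.nn as nn

    class MultiLayerFusion(nn.Module):
        # Concatenate patch-token maps from the last k encoder layers,
        # then mix them with dilated convolutions of different rates to
        # reintroduce a notion of scale.
        def __init__(self, dim=768, num_layers=3, out_dim=512, rates=(1, 2, 3)):
            super().__init__()
            self.mix = nn.ModuleList([
                nn.Conv2d(dim * num_layers, out_dim, kernel_size=3,
                          padding=r, dilation=r)
                for r in rates
            ])

        def forward(self, layer_maps):             # list of (B, dim, H, W)
            x = torch.cat(layer_maps, dim=1)       # (B, dim*num_layers, H, W)
            return sum(m(x) for m in self.mix)     # fuse multi-rate responses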
Enhancing locality
Transformers mainly rely on global self-attention, which makes them good at modeling long-range dependencies. However, unlike convolutional networks with fixed kernel size, they lack a mechanism to localize interactions. As a consequence, many studies [70, 66, 19, 32, 7, 43, 5] propose to improve ViT by bringing in locality.
In this direction, apart from using a CNN stem in the first layers, we introduce an enhanced locality module (ELM) in the local branch, as shown in Figure 1. Our goal is to investigate inductive bias in the deeper layers of the encoder, without overly extending the architecture itself. For this reason, the design of ELM is extremely lightweight, inspired by mobile networks [51].
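
As a rough indication of what such a module may look like, below is a hypothetical inverted-residual block in the spirit of mobile networks [51]; the expansion factor, activation and normalization choices are our assumptions, not the actual ELM design.

    import torch.nn as nn

    class ELM(nn.Module):
        # Lightweight locality block: pointwise expansion, depthwise 3x3
        # convolution for local interactions, pointwise projection, plus
        # a residual connection.
        def __init__(self, dim, expansion=4):
            super().__init__()
            hidden = dim * expansion
            self.block = nn.Sequential(
                nn.Conv2d(dim, hidden, kernel_size=1),
                nn.BatchNorm2d(hidden),
                nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),
                nn.BatchNorm2d(hidden),
                nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, dim, kernel_size=1),
                nn.BatchNorm2d(dim),
            )

        def forward(self, x):
            return x + self.block(x)  # residual around the inverted bottleneck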