Granularity-aware Adaptation for Image
Retrieval over Multiple Tasks
Jon Almazán1, Byungsoo Ko2,
Geonmo Gu2, Diane Larlus1, and Yannis Kalantidis1
1NAVER LABS Europe 2NAVER Corp.
Abstract. Strong image search models can be learned for a specific do-
main, i.e. set of labels, provided that some labeled images of that domain
are available. A practical visual search model, however, should be versa-
tile enough to solve multiple retrieval tasks simultaneously, even if those
cover very different specialized domains. Additionally, it should be able
to benefit from even unlabeled images from these various retrieval tasks.
This is the more practical scenario that we consider in this paper. We ad-
dress it with the proposed Grappa, an approach that starts from a strong
pretrained model, and adapts it to tackle multiple retrieval tasks con-
currently, using only unlabeled images from the different task domains.
We extend the pretrained model with multiple independently trained
sets of adaptors that use pseudo-label sets of different sizes, effectively
mimicking different pseudo-granularities. We reconcile all adaptor sets
into a single unified model suited for all retrieval tasks by learning fusion
layers that we guide by propagating pseudo-granularity attentions across
neighbors in the feature space. Results on a benchmark composed of six
heterogeneous retrieval tasks show that the unsupervised Grappa model
improves the zero-shot performance of a state-of-the-art self-supervised
learning model, and in some cases reaches or improves over a task label-aware oracle that selects the most fitting pseudo-granularity per task.
1 Introduction
The last few years have witnessed progress on image retrieval: successful mod-
els can be trained, provided that a set of labeled images from the domain of
interest (not necessarily from the same categories) is available for training, as
in the common deep metric learning scenario. Those models are as powerful as
they are specialized: it has been shown, and we confirm in our experiments, that
a model carefully tailored for one domain (e.g. bird species) tends to perform
poorly on a neighboring yet different domain (e.g. dog breeds).
Here, we argue that a practical visual search system should be able to solve
multiple retrieval tasks simultaneously, without needing to explicitly specialize
for each task. Consider for example a visual search system specialized to fauna
and flora. In such a system, the image database covers a broad range of fine-grained
domains, e.g. from searching among different insect species to different
kinds of mushrooms. For the system to also handle coral species, it should be as
simple as providing a set of unlabeled coral images.
arXiv:2210.02254v1 [cs.CV] 5 Oct 2022
Fig. 1: Grappa is an unsupervised method that trains a single model with higher zero-shot performance (measured with R-Precision, or RP) than the pretrained DINO [9] model, over several retrieval tasks. [Bar chart: RP gain relative to DINO — Aircraft +0.6, Cars +0.9, CUB +1.5, Flowers +4.7, Food-101 +3.5, Products +2.2.]
In parallel, the field has worked to-
wards pretraining large and generic mod-
els for visual representations that can be
used, often as a black box, to extract fea-
tures for new tasks. Among those, mod-
els trained in a self-supervised way have
shown to be versatile to various target
tasks, including image retrieval [23,9].
In this work, we assume access to
such a large pretrained model that already
provides good zero-shot performance. We
also assume access to an unlabeled set of
images possibly from multiple tasks. We
propose to adapt the initial model so it
performs even better on multiple image
retrieval tasks simultaneously, i.e. when this same adapted model is used to
extract features for all tasks.
This raises two questions. First, how
should we perform adaptation? Fine-tuning is prohibitively costly especially for
large pretrained models, and does not always transfer well. As an alternative
to fine-tuning, and inspired by an early work on multi-task training [50] and a
recent trend in natural language processing [30,45], we propose to use adaptor
layers. Adaptors are embedded in between architecture blocks, and are the only
weights learned, the ones from the original pretrained model remaining fixed.
Our experiments show that this composite architecture allows for a versatile
adaptation of a strong initial model by adjusting only a small percentage of the
model parameters.
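As an illustration of this design, a bottleneck adaptor in the spirit of [30,45] can be sketched as follows. The module structure and bottleneck width are hypothetical choices for illustration, not the exact architecture used by Grappa:

```python
import torch
import torch.nn as nn

class BottleneckAdaptor(nn.Module):
    """Illustrative bottleneck adaptor: a down-projection, non-linearity,
    up-projection, and a residual connection. Only these weights are
    trained; the pretrained backbone weights stay frozen."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # D -> bottleneck
        self.up = nn.Linear(bottleneck, dim)    # bottleneck -> D
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Residual form: the adaptor only learns a small correction to h.
        return h + self.up(self.act(self.down(h)))
```

Because the adaptor preserves the feature dimension, it can be inserted between any two blocks of the frozen backbone while changing only a small percentage of the trainable parameters.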
Second, how should we reconcile various retrieval tasks in a single model?
A retrieval task focuses on a given set of visual concepts, often associated with
a particular granularity. Yet, unlike in classification for which the granularity
is known beforehand, the granularity of a retrieval task is context dependent,
and depends on the gallery of images where visual search is performed. We
therefore propose learning different sets of adaptors, each set tailored to one
specific granularity. As we assume that training images are unlabeled, without
even a label indicating which retrieval task they belong to, we propose to automatically
define levels of granularity by partitioning the training set into more and more
clusters. As a result, each partition corresponds to a different set of pseudo-labels.
We then independently train one set of adaptors for each pseudo-granularity.
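This pseudo-labeling step can be sketched as follows, assuming k-means as the partitioning algorithm and illustrative cluster counts; both are our assumptions for the sketch, not the paper's exact settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_granularities(features: np.ndarray, cluster_sizes=(100, 1000, 10000)):
    """Partition unlabeled training features into progressively more clusters.

    Each partition yields one set of pseudo-labels mimicking a different
    label granularity: few clusters approximate coarse labels, many clusters
    approximate fine-grained ones. Returns {num_clusters: labels array}.
    """
    return {
        k: KMeans(n_clusters=k, n_init=10).fit_predict(features)
        for k in cluster_sizes
    }
```

One set of adaptors would then be trained on each of the resulting pseudo-label sets, independently of the others.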
Next, we need to reconcile these different sets of adaptors into a single multi-
purpose retrieval model. One option is to combine them with a naive fusion
mechanism. The resulting model improves results on all retrieval tasks, show-
ing the clear benefit of a multi-granularity understanding of the data. Another
option is to go one step further and to achieve adaptor fusion via attention prop-
agation. In this case, we require consistency between the adaptor attention of
nearest neighbors in the feature space. We observe that this fusion mechanism
further improves the model.
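One possible reading of these two options is sketched below: a learned softmax attention over the K adaptor outputs, plus a consistency term between the attention of neighboring images. The module names, shapes, and the KL-based consistency loss are our own illustrative assumptions, not the exact formulation of Sec. 3.5:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptorFusion(nn.Module):
    """Illustrative fusion of K frozen adaptor outputs via learned attention."""

    def __init__(self, dim: int, num_adaptors: int):
        super().__init__()
        self.scorer = nn.Linear(dim, num_adaptors)

    def forward(self, h: torch.Tensor, adaptor_outs: torch.Tensor):
        # h: (B, D) input feature; adaptor_outs: (B, K, D) per-adaptor outputs.
        attn = F.softmax(self.scorer(h), dim=-1)             # (B, K)
        fused = (attn.unsqueeze(-1) * adaptor_outs).sum(1)   # (B, D)
        return fused, attn

def neighbor_consistency(attn: torch.Tensor, neighbor_attn: torch.Tensor):
    """Hypothetical attention-propagation term: encourage an image and its
    nearest neighbor in feature space to weight the adaptors similarly."""
    return F.kl_div(attn.log(), neighbor_attn, reduction="batchmean")
```

In this sketch, the naive fusion option corresponds to using fixed uniform attention, while attention propagation corresponds to also minimizing the consistency term over neighbor pairs.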
To summarize, our contribution is threefold. First, we compensate for the absence
of image and task labels by creating sets of pseudo-labels, with the goal of
approximating any possible granularity in a given set of retrieval tasks. Second,
we propose a way to extend transformer-based architectures with adaptors, and
a training framework that tailors individual sets of adaptors to different pseudo-granularities.
Third, we propose a number of ways to fuse the adaptor features,
e.g. via augmentation invariance or via propagating attention from neighbors in
the image feature space. We validate our approach on a collection of datasets
for deep metric learning and we show that Grappa improves over the success-
ful DINO pretrained model, a model known to already obtain strong zero-shot
performance on all these retrieval tasks (see Fig. 1).
2 Related Work
The task we tackle in this paper strongly relates to deep metric learning. It
also requires specific architectural changes to neural networks, extending them with
adaptors. Note that our task can be seen as a zero-shot problem, i.e. it
requires no labeled data from the downstream datasets and learns a single model
for all tasks, something fairly uncommon in transfer learning.
Deep metric learning (DML). DML aims to learn a metric between data
points that reflects the semantic similarity between them. It plays an impor-
tant role in a wide range of tasks such as image clustering [29,7], unsupervised
learning [10,26,8], and visual search [6,21,51]. Recent DML approaches typically
learn visual similarity using either a pair-based loss [12,57,43,24,33] which con-
siders pair-wise similarities, a proxy-based loss [62,61,16,25,34], which considers
the similarity between samples and class representative proxies, or a contex-
tual classification loss [69,5,18,54]. In most cases, DML approaches finetune an
ImageNet pretrained model for each target retrieval task, and each of those
finetuned models falls short when applied to other retrieval tasks. We aim at a more
versatile visual search system that handles multiple retrieval tasks with a single
model.
Neural architectures with adaptation layers. Adaptation layers (or adap-
tors) have emerged [50,30,45,63] as a way to avoid common problems arising in
sequential finetuning or multi-task learning when trying to finetune large pre-
trained models to solve multiple tasks, namely the issues of catastrophic for-
getting [39] and task imbalance. Rebuffi et al. [50] were the first to introduce
adaptors to visual recognition tasks, adapting a convolutional model to many
classification tasks. Adaptors have also been used with transformer architectures
for natural language processing [30]; bottleneck layers are added to all the blocks
of a pretrained model and finetuned, keeping the underlying model fixed.
Recently, Pfeiffer et al. [45] introduced a way to share knowledge between
adaptors using an adaptor fusion layer within a two-stage learning framework:
adaptors are trained independently in the first stage, then kept fixed while
only the fusion layer is trained in the second stage. All the methods mentioned
above still result in models that specialize to a single task; e.g. [45] learns a
separate fusion layer per downstream task, whereas we would like to learn a
single model for all tasks.
Zero-shot problems. The field has recently taken an interest in pretraining
large models, sometimes called zero-shot models, using large quantities of data.
Those have been shown to be versatile and applicable to many target tasks.
Among them, self-supervised models [11,8,26,68,9] are trained using self-defined
pseudo-labels as supervision and typically millions of images (e.g. from Ima-
geNet [15]). Recent works [22,58] exploit even larger, yet uncurated, sets of
unlabeled images to enhance the quality of the learned representations. Oth-
ers [49,31,70] have leveraged multiple modalities, e.g. training visual representa-
tions so they are similar to the textual representations of their associated text.
Those self-supervised or multimodal methods offer excellent initialization to be
finetuned for a wide range of downstream tasks. Sometimes they are used in a
zero-shot setting: a single model is used as a feature extractor, typically to solve
multiple tasks. This is the regime we study here, but we further assume that a
small amount of unlabeled data from the downstream tasks exists.
Relation to other transfer tasks. The idea of transferring a model trained for
a given task to a related one has become central to computer vision [55,53], and
appears in many research fields such as task transfer [67], domain adaptation [14]
or self-supervised learning [20,42,9]. Yet, in all those, the initial model is only a
starting point and it is typically not only extended, but also retrained for each
task of interest, leading to a multitude of specialized models. In our work, we need
a single model to perform well across retrieval tasks. In that regard, this work is
closer to zero-shot transfer of the large pretrained models discussed above. Also
related are Mixtures of Experts (MoE) [66,56,48,52], an ensembling technique
that decomposes a predictive problem into subtasks, training one expert for
each. Although MoE architectures may look similar to ours at first glance, they
typically rely on gating and pooling mechanisms that learn to predict, in a
supervised way, which experts to trust, and how to combine them. Similar to
typical transfer approaches, they build one specialized model for each target
task. Here, we focus on a purely unsupervised task: no labels are provided to
indicate the semantic content of images, nor the retrieval task they belong to.
3 A granularity-aware multi-purpose retrieval model
In this section we present Grappa, a method for adapting a pretrained model to
multiple retrieval tasks simultaneously, in an unsupervised way. We first formal-
ize our task, i.e. visual search over several retrieval tasks using a single model
(Sec. 3.1). We then present an overview of the approach (Sec. 3.2). Next, we
detail each step, i.e. building multiple granularities (Sec. 3.3), learning adaptors
using granularity-aware pseudo-labels (Sec. 3.4), and learning to fuse them by
propagating adaptor attention across feature space neighbors (Sec. 3.5).
3.1 Background
Our task of interest, visual search on multiple retrieval tasks, can be seen
as a variant of the standard deep metric learning (DML) task. The most common
protocol in DML is to a) split the classes¹ into disjoint train and test sets of
labels; b) learn a separate model for each retrieval task on the corresponding
train split; c) perform retrieval on all images of the (unseen) test split of classes.
Our setting has several key differences. First, we solve multiple retrieval tasks
simultaneously. This means that we do not learn one model for each but a single
model that will be used for all tasks. Second, we only assume access to a set
of unlabeled images from each retrieval task, and do not have access to labeled
training sets, unlike standard DML methods. Even more challenging, unlabeled
training images are provided jointly without knowing which target retrieval task
they correspond to nor the total number of retrieval tasks.
More formally, let T be the set of m retrieval tasks that we want to simultaneously
tackle. Each task T_t is associated with a training and a test set. At
train time, we are provided with a fused training set D composed of the union
of all training sets of the m datasets in T. As mentioned earlier, images are not
associated with any class or task label.
With so many unknowns on the target retrieval tasks, an obvious choice
is to start with a large pretrained model. Self-[11,26,9] or weakly-[49,31]
supervised learning have been shown to lead to strong models that generalize
well and exhibit high zero-shot transfer performance. We assume that we are
given such a model. Here, we base our work on the recently proposed Vision
Transformer (ViT) [17], a popular, efficient, and high-performing architecture,
pretrained in a self-supervised way with DINO [9].
We set our pretrained model M to be a ViT with L transformer layers and
an input patch size of P × P pixels. An input image x ∈ R^{H×W×C} is reshaped into
a sequence of T flattened 2D patches, where T = HW/P². The transformer
uses a constant latent vector size D through all of its layers, so flattened patches
are first mapped to D dimensions with a trainable linear projection and concatenated
in h_0, together with a prepended learnable [class] token and added
position embeddings. The transformer encoder [59] consists of alternating blocks
of multi-headed self-attention (MSA) and MLP (which contain two layers with a
GELU non-linearity). LayerNorm (LN) is applied before every block, and residual
connections after every block. Formally, each layer of M (shown with a gray
background in Fig. 3, left) is given by:

    h_l = MLP(LN(h̃_l)) + h̃_l,    h̃_l = MSA(LN(h_{l-1})) + h_{l-1},    (1)

for l = {1 . . . L}. The image representation z is the output of the [class] token
after the last layer h_L, i.e. z = LN(h_L)_[class]. We refer the reader to [17] for
more details about the ViT architecture.
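As a reference for the notation above, one transformer layer of M implementing Eq. (1) can be sketched in PyTorch as follows. Dimensions are illustrative (ViT-S-like); this is a simplified sketch, not the exact DINO implementation:

```python
import torch
import torch.nn as nn

class ViTLayer(nn.Module):
    """One ViT layer per Eq. (1):
    h~_l = MSA(LN(h_{l-1})) + h_{l-1};  h_l = MLP(LN(h~_l)) + h~_l."""

    def __init__(self, dim: int = 384, heads: int = 6, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        # MLP block: two linear layers with a GELU non-linearity.
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, tokens, dim); LN precedes each block, residual follows it.
        x = self.ln1(h)
        h_tilde = self.msa(x, x, x, need_weights=False)[0] + h
        return self.mlp(self.ln2(h_tilde)) + h_tilde
```

In the Grappa setting, adaptor layers would be inserted around such frozen blocks, so that only the adaptor weights are updated during training.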
¹We will use the term classes to refer to sets of images with the same label, whether
the latter represents object instances or fine-grained classes.