only the fusion layer is trained in the second stage. All the methods mentioned
above still result in models that specialize to a single task; e.g. [45] learns a
separate fusion layer per downstream task, whereas we would like to learn a
single model for all tasks.
Zero-shot problems. The field has recently taken an interest in pretraining
large models, sometimes called zero-shot models, using large quantities of data.
Those have been shown to be versatile and applicable to many target tasks.
Among them, self-supervised models [11,8,26,68,9] are trained using self-defined
pseudo-labels as supervision, typically on millions of images (e.g. from ImageNet [15]).
Recent works [22,58] exploit even larger, yet uncurated, sets of unlabeled images to
enhance the quality of the learned representations. Others [49,31,70] have leveraged
multiple modalities, e.g. training visual representations so that they are similar to
the textual representations of their associated text.
These self-supervised or multimodal methods offer excellent initializations for
finetuning on a wide range of downstream tasks. Sometimes they are used in a
zero-shot setting: a single model is used as a feature extractor, typically to solve
multiple tasks. This is the regime we study here, but we further assume that a
small amount of unlabeled data from the downstream tasks exists.
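To make this regime concrete, the following Python sketch shows a single frozen model used as a feature extractor for several retrieval tasks. It is illustrative only: the ImageNet-pretrained ResNet-50 backbone and the plain cosine nearest-neighbor search are our assumptions, not the setup of any specific method cited above.

import torch
import torchvision

# Frozen pretrained backbone; the ImageNet ResNet-50 is an assumed stand-in
# for any of the large pretrained models discussed above.
backbone = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classifier, keep 2048-d features
backbone.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    # Map a batch of preprocessed images (B, 3, H, W) to L2-normalized features.
    return torch.nn.functional.normalize(backbone(images), dim=-1)

def retrieve(queries: torch.Tensor, gallery: torch.Tensor, k: int = 5) -> torch.Tensor:
    # Rank gallery images by cosine similarity to each query.
    sims = embed(queries) @ embed(gallery).T  # cosine similarity on normalized features
    return sims.topk(k, dim=-1).indices       # top-k gallery indices per query

The same retrieve function would be applied unchanged to, e.g., landmark, product, or clothing retrieval; only the (unlabeled) images differ between tasks.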
Relation to other transfer tasks. The idea of transferring a model trained for
a given task to a related one has become central to computer vision [55,53], and
appears in many research fields such as task transfer [67], domain adaptation [14]
or self-supervised learning [20,42,9]. Yet, in all those, the initial model is only a
starting point and it is typically not only extended, but also retrained for each
task of interest, leading to a multitude of specialized models. In our work, we need
a single model to perform well across retrieval tasks. In that regard, this work is
closer to zero-shot transfer of the large pretrained models discussed above. Also
related are Mixtures of Experts (MoE) [66,56,48,52], an ensembling technique
that decomposes a predictive problem into subtasks, training one expert for
each. Although MoE architectures may look similar to ours at first glance, they
typically rely on gating and pooling mechanisms that learn to predict, in a
supervised way, which experts to trust, and how to combine them. Similar to
typical transfer approaches, they build one specialized model for each target
task. Here, we focus on a purely unsupervised setting: no labels are provided to
indicate either the semantic content of an image or the retrieval task it belongs to.
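For contrast, the following minimal sketch illustrates the kind of supervised gating an MoE relies on; the module, its names, and its sizes are illustrative assumptions and do not correspond to any specific cited architecture.

import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    # Hypothetical toy MoE: one linear expert per sub-task and a learned gate.
    def __init__(self, dim: int = 2048, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)  # predicts how much to trust each expert

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.gate(x).softmax(dim=-1)                      # (B, E)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)         # pooled expert outputs

Learning such a gate requires a supervised loss on the pooled output; this is precisely the supervision that is unavailable in our setting.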
3 A granularity-aware multi-purpose retrieval model
In this section we present Grappa, a method for adapting a pretrained model to
multiple retrieval tasks simultaneously, in an unsupervised way. We first formalize
our task, i.e. visual search over several retrieval tasks using a single model
(Sec. 3.1). We then present an overview of the approach (Sec. 3.2). Next, we
detail each step, i.e. building multiple granularities (Sec. 3.3), learning adaptors
using granularity-aware pseudo-labels (Sec. 3.4), and learning to fuse them by
propagating adaptor attention across feature space neighbors (Sec. 3.5).
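Before going into these details, the following schematic sketch illustrates the general idea behind granularity-aware pseudo-labels: clustering the same unlabeled features at several scales yields one pseudo-labeling per granularity, from coarse to fine. The k-means instantiation below is purely an assumption for illustration; Sec. 3.3 and 3.4 describe how Grappa actually builds granularities and trains adaptors.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 128)).astype(np.float32)  # stand-in for unlabeled image features

# Coarser to finer granularities correspond to fewer to more clusters;
# each clustering yields one set of pseudo-labels over the same images.
granularities = [8, 64, 256]
pseudo_labels = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    for k in granularities
}
# In a pipeline of the kind outlined above, each granularity could supervise its
# own lightweight adaptor, whose outputs are then fused into a single representation.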