only the fusion layer is trained in the second stage. All the methods mentioned
above still result in models that specialize to a single task; e.g. [45] learns a
separate fusion layer per downstream task, whereas we would like to learn a
single model for all tasks.
Zero-shot problems. The field has recently taken an interest in pretraining
large models, sometimes called zero-shot models, using large quantities of data.
Those have been shown to be versatile and applicable to many target tasks.
Among them, self-supervised models [11,8,26,68,9] are trained using self-defined
pseudo-labels as supervision, typically on millions of images (e.g. from ImageNet [15]).
Recent works [22,58] exploit even larger, yet uncurated, sets of unlabeled images to
enhance the quality of the learned representations. Others [49,31,70] have leveraged
multiple modalities, e.g. training visual representations so that they are similar to
the textual representations of their associated text.
These self-supervised or multimodal methods offer excellent initializations for
finetuning on a wide range of downstream tasks. Sometimes they are used in a
zero-shot setting: a single model is used as a feature extractor, typically to solve
multiple tasks. This is the regime we study here, but we further assume that a
small amount of unlabeled data from the downstream tasks exists.
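To make this regime concrete, the following Python sketch shows a single frozen model used as a feature extractor for several retrieval tasks. It is illustrative only: the ImageNet-pretrained ResNet-50 backbone and the plain cosine nearest-neighbor search are our assumptions, not the setup of any specific method cited above.

import torch
import torchvision

# Frozen pretrained backbone; the ImageNet ResNet-50 is an assumed stand-in
# for any of the large pretrained models discussed above.
backbone = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classifier, keep 2048-d features
backbone.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    # Map a batch of preprocessed images (B, 3, H, W) to L2-normalized features.
    return torch.nn.functional.normalize(backbone(images), dim=-1)

def retrieve(queries: torch.Tensor, gallery: torch.Tensor, k: int = 5) -> torch.Tensor:
    # Rank gallery images by cosine similarity to each query.
    sims = embed(queries) @ embed(gallery).T  # cosine similarity on normalized features
    return sims.topk(k, dim=-1).indices       # top-k gallery indices per query

The same retrieve function would be applied unchanged to, e.g., landmark, product, or clothing retrieval; only the (unlabeled) images differ between tasks.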
Relation to other transfer tasks. The idea of transferring a model trained for
a given task to a related one has become central to computer vision [55,53], and
appears in many research fields such as task transfer [67], domain adaptation [14]
or self-supervised learning [20,42,9]. Yet, in all those, the initial model is only a
starting point and it is typically not only extended, but also retrained for each
task of interest, leading to a multitude of specialized models. In our work, we need
a single model to perform well across retrieval tasks. In that regard, this work is
closer to zero-shot transfer of the large pretrained models discussed above. Also
related are Mixtures of Experts (MoE) [66,56,48,52], an ensembling technique
that decomposes a predictive problem into subtasks, training one expert for
each. Although MoE architectures may look similar to ours at first glance, they
typically rely on gating and pooling mechanisms that learn to predict, in a
supervised way, which experts to trust, and how to combine them. Similar to
typical transfer approaches, they build one specialized model for each target
task. Here, we focus on a purely unsupervised setting: no labels are provided to
indicate either the semantic content of an image or the retrieval task it belongs to.
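For contrast, the following minimal sketch illustrates the kind of supervised gating an MoE relies on; the module, its names, and its sizes are illustrative assumptions and do not correspond to any specific cited architecture.

import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    # Hypothetical toy MoE: one linear expert per sub-task and a learned gate.
    def __init__(self, dim: int = 2048, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)  # predicts how much to trust each expert

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.gate(x).softmax(dim=-1)                      # (B, E)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)         # pooled expert outputs

Learning such a gate requires a supervised loss on the pooled output; this is precisely the supervision that is unavailable in our setting.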
3 A granularity-aware multi-purpose retrieval model
In this section we present Grappa, a method for adapting a pretrained model to
multiple retrieval tasks simultaneously, in an unsupervised way. We first formalize
our task, i.e. visual search over several retrieval tasks using a single model
(Sec. 3.1). We then present an overview of the approach (Sec. 3.2). Next, we
detail each step, i.e. building multiple granularities (Sec. 3.3), learning adaptors
using granularity-aware pseudo-labels (Sec. 3.4), and learning to fuse them by
propagating adaptor attention across feature space neighbors (Sec. 3.5).
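Before going into these details, the following schematic sketch illustrates the general idea behind granularity-aware pseudo-labels: clustering the same unlabeled features at several scales yields one pseudo-labeling per granularity, from coarse to fine. The k-means instantiation below is purely an assumption for illustration; Sec. 3.3 and 3.4 describe how Grappa actually builds granularities and trains adaptors.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 128)).astype(np.float32)  # stand-in for unlabeled image features

# Coarser to finer granularities correspond to fewer to more clusters;
# each clustering yields one set of pseudo-labels over the same images.
granularities = [8, 64, 256]
pseudo_labels = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    for k in granularities
}
# In a pipeline of the kind outlined above, each granularity could supervise its
# own lightweight adaptor, whose outputs are then fused into a single representation.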