former being lightweight to reduce the extraction cost dur-
ing query time. For small to medium-size databases, or with
the use of fast nearest neighbor search methods [24, 2, 25],
the extraction cost of the representation can be the test-time
bottleneck. In the asymmetric setup, the two representation
spaces corresponding to each of the two networks need to
be aligned and compatible. This is the objective of asym-
metric metric learning (AML) as introduced by Budnik and
Avrithis [6].
AML has been studied under the lens of asymmetric network architectures: the student model is a pruned variant of the teacher, or a different but lighter architecture, possibly discovered by neural architecture search. All these choices reduce the query time. The resolution of the input images is an important yet overlooked aspect. Fully convolutional architectures accept inputs of any resolution, but the representation extraction cost grows roughly quadratically with that resolution. Metric learning tasks that focus on instance-level
recognition are known to benefit from the use of a large
image resolution [31, 4]. The same holds for fine-grained
recognition where object details matter [35], as also shown
in this work. Therefore, input resolution is a critical param-
eter for the performance/efficiency trade-off.
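As a rough illustration of this quadratic dependence, the following sketch (assuming a torchvision ResNet-50 as the fully convolutional backbone; it is not a model from this work) times the forward pass at two resolutions:

```python
import time
import torch
import torchvision.models as models

# Minimal sketch (not from the paper): time the forward pass of a fully
# convolutional backbone at two resolutions. Every feature map scales with
# H x W, so the cost is expected to grow roughly quadratically with the
# input side length.
model = models.resnet50(weights=None).eval()

def forward_time(resolution, repeats=5):
    x = torch.randn(1, 3, resolution, resolution)
    with torch.no_grad():
        model(x)                                   # warm-up
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
    return (time.perf_counter() - start) / repeats

t_small, t_large = forward_time(224), forward_time(448)
# Doubling the side length quadruples the pixel count, so the ratio below
# should be roughly 4 (extraction cost roughly quadratic in resolution).
print(f"224px: {t_small:.3f}s  448px: {t_large:.3f}s  ratio: {t_large / t_small:.1f}")
```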
This work focuses on AML, where the asymmetry comes
at the level of input resolution between the database net-
work and the query network. The two architectures are the
same, but the query network is trained at a low resolution to
match the representation of the database network at a high
resolution, which is performed with a distillation process
that transfers knowledge from a teacher (database network)
to the student (query network). The contributions of this
work are summarized as follows:
• Asymmetry in the form of input image resolution is ex-
plored for the first time in deep metric learning.
• A distillation approach is proposed to align the representation (absolute distillation) and the pairwise similarities (relational distillation) between student and teacher across task-tailored augmentations of the same image (sketched after this list).
• We conclude that resolution asymmetry is a better way
to optimize the performance vs. efficiency trade-off com-
pared to network asymmetry.
• As a side-effect, the student obtained with distillation no-
ticeably outperforms the deep metric learning baselines
in conventional/symmetric retrieval.
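To make the two distillation terms concrete, here is a minimal PyTorch sketch; the function name, interface, and the particular relational variant are ours, not the paper's implementation. The student embeds a low-resolution view and the teacher a high-resolution view of the same image.

```python
import torch
import torch.nn.functional as F

def distillation_losses(student_emb, teacher_emb):
    """Sketch of absolute + relational distillation (hypothetical interface).

    student_emb: (B, D) embeddings of low-resolution views (query network).
    teacher_emb: (B, D) embeddings of high-resolution views (database network).
    """
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb, dim=1).detach()      # teacher is frozen

    # Absolute distillation: pull each student embedding onto the teacher
    # embedding of the same image.
    abs_loss = (s - t).pow(2).sum(dim=1).mean()

    # Relational distillation (one plausible variant): make the cross
    # student-teacher similarities match the teacher-teacher similarities.
    rel_loss = (s @ t.T - t @ t.T).pow(2).mean()

    return abs_loss, rel_loss
```

In practice the two terms would presumably be weighted and summed, with only the student updated by gradient descent while the teacher stays fixed.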
A performance vs. efficiency comparison is shown in
Figure 1. Compared to baselines where a single network extracts both database and query examples at the same resolution (circle) or at different resolutions (diamond), distillation performs much better. The superiority of resolution asymmetry over network asymmetry is also evident. More details
about these experiments are discussed in Section 4.
2. Related work
Asymmetric embedding compatibility. In image re-
trieval, embedding compatibility has to be ensured when
the database examples are processed by a different net-
work than the query examples. To this end, AML [6] re-
defines standard metric learning losses in an asymmetric
way, i.e. the anchor example is processed by the query net-
work, while the database network processes the correspond-
ing positive and negative examples. However, for the ob-
jective of representation space alignment, these losses are
outperformed by a simple unsupervised regression loss on
the embeddings, a form of knowledge distillation between
teacher and student. Using the supervised losses, on the
other hand, boosts the performance of symmetric retrieval
with the student, which even surpasses its teacher. Follow-
ing the paradigm of unsupervised distillation, another re-
cent approach compels the student to mimic the contextual
similarities of image neighbors in the teacher embedding
space [50]. Beyond optimizing only the weights of the student network, a generalization additionally optimizes the network architecture via neural architecture search [11], although that work trains for classification rather than with a metric learning objective.
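As a concrete illustration of the asymmetric formulation and of the regression alternative discussed above, here is a minimal sketch; the function names, the margin value, and the use of a triplet loss in particular are our own simplification, not the code of [6].

```python
import torch
import torch.nn.functional as F

# Sketch of an asymmetric triplet loss in the spirit of AML [6]: the anchor
# is embedded by the student (query) network, while the positive and
# negative are embedded by the frozen teacher (database) network.
def asymmetric_triplet(student, teacher, anchor, positive, negative, margin=0.1):
    a = F.normalize(student(anchor), dim=1)
    with torch.no_grad():                              # database side is fixed
        p = F.normalize(teacher(positive), dim=1)
        n = F.normalize(teacher(negative), dim=1)
    d_pos = (a - p).pow(2).sum(dim=1)
    d_neg = (a - n).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

# The simpler unsupervised alternative reported to align the spaces better:
# plain regression of the student embedding onto the teacher embedding.
def embedding_regression(student, teacher, image):
    with torch.no_grad():
        t = F.normalize(teacher(image), dim=1)
    s = F.normalize(student(image), dim=1)
    return (s - t).pow(2).sum(dim=1).mean()
```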
Classification-based training is the dominant approach in
a task related to ours, called backward compatible learning
(BCT) [36]. However, the underlying task assumptions are
different. Its objective is to add new data processed with a
stronger backbone version without back-filling the current
database. Compatibility is established with cross-entropy
loss of the old classifier on old and new embeddings of
the same input image. This is extended to the compati-
bility of multiple embedding versions [23] or to tackling
open-set backward compatibility with a continual learning
approach [42]. Similarly, forward compatible learning [32]
stores side information during training, which is leveraged
in the future to transfer the old embeddings to another task.
Other methods for fixing inconsistent representation spaces
include class prototype alignment [3, 52] and transformation of both spaces rather than a single one [43, 22].
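For reference, the backward-compatibility objective described above can be sketched as follows; this is our reading of the idea with hypothetical module names, not the BCT code. The new backbone is trained so that its embeddings remain classifiable by the frozen classifier head of the old model, so the database need not be re-extracted.

```python
import torch
import torch.nn.functional as F

def bct_loss(new_backbone, new_head, old_head, image, label):
    """Sketch of backward-compatible training (hypothetical interface).

    old_head is the classifier of the previous model and is kept frozen
    (its parameters have requires_grad=False); gradients still flow
    through it into the new embedding.
    """
    emb = new_backbone(image)
    new_ce = F.cross_entropy(new_head(emb), label)   # regular training of the new model
    old_ce = F.cross_entropy(old_head(emb), label)   # compatibility: new embedding, old classifier
    return new_ce + old_ce
```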
Asymmetry also emerges when embeddings are col-
lected from diverse devices that use different models, e.g. in the domain of faces, where recognition should be compatible across all models [8], or in localization and mapping tasks with multiple agents [12].
Distillation and small image resolution. Image downsampling remains the primary pre-processing step even today, when typical GPU memory allows processing larger resolutions during training and testing. It is observed that using larger images reliably translates to higher performance regardless of the objective or dataset [35]. Yet there are still many valid use cases where inference has to be performed with limited resources. In this
context, distillation is used to align embeddings of high- and