
learning rate of class centroids (Teh et al., 2020). Classification losses have traditionally performed well on small metric learning benchmarks; these include normalized-softmax (Zhai & Wu, 2018), arcface (Deng et al., 2019), proxy NCA, and proxy anchor. More recently, IBC (Seidenschwarz et al., 2021) and HIST (Lim et al., 2022) report an improvement in R@1 when learning a graph neural network in conjunction with class centroids. However, even with these additional tricks, classification methods lag behind pairwise methods on larger benchmarks.
Pairwise Ranking Methods Pairwise ranking losses include the contrastive loss (Hadsell et al., 2006), triplet loss (Weinberger et al., 2005; Wu et al., 2017), multi-similarity, and AP surrogates (cited in the previous section). Despite being more than a decade old, contrastive and triplet losses remain the go-to methods for metric learning, and Musgrave et al. (2020a) show that they are comparable in performance to many recent methods. Multi-similarity includes a hard pair mining scheme that is effectively learning to rank. AP maximization methods explicitly learn to rank samples within a mini-batch. AP maximization is challenging because it involves back-propagating through the non-differentiable Heaviside function, similar to the current work. As a workaround, Fast-AP uses soft binning, Smooth-AP uses a low-temperature sigmoid, and Roadmap uses an upper bound on the Heaviside instead of an approximation. We find that using heuristic gradients works better for optimizing contextual similarity.
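To make the differentiability issue concrete, the sketch below (ours, not a reproduction of any cited implementation) contrasts a hard Heaviside-based rank with a Smooth-AP-style low-temperature sigmoid relaxation; the function names and the temperature `tau` are illustrative assumptions.

```python
import torch

def heaviside_rank(scores: torch.Tensor) -> torch.Tensor:
    # scores: (n,) similarities of retrieved items to a query.
    # Hard rank of item j = 1 + number of items scored strictly higher.
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)   # diff[i, j] = s_i - s_j
    return 1.0 + (diff > 0).float().sum(dim=0)         # zero gradient everywhere

def sigmoid_rank(scores: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
    # Smooth-AP-style relaxation: replace the Heaviside step with a
    # low-temperature sigmoid so gradients can flow back to the scores.
    n = scores.numel()
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)
    mask = 1.0 - torch.eye(n, device=scores.device)    # drop self-comparisons
    return 1.0 + (torch.sigmoid(diff / tau) * mask).sum(dim=0)
```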
Unsupervised Metric Learning The concept of contextual similarity is extensively studied in the unsupervised metric learning literature, mainly in the context of person re-ID (see the survey by Ye et al. (2021)). Most unsupervised person re-ID methods use the $k$-reciprocal re-rank distance (Zhong et al., 2017), which is a weighted combination of the Euclidean distance and the Jaccard distance between reciprocal-neighbor sets, calculated over the entire dataset. More recently, STML (Kim et al., 2022) proposes an unsupervised metric learning framework for image retrieval using a simpler batch-wise contextual similarity measure. We loosely follow STML's contextual similarity definition, making significant changes to accommodate the change in problem setting and to address optimization issues (these changes are enumerated in Appendix E.1). We emphasize that prior work on contextual similarity optimizes the cosine similarity towards the contextual similarity, focusing on the unsupervised scenario, while our work optimizes the contextual similarity towards the true similarity, requiring full supervision.
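As background intuition, the rough sketch below computes $k$-reciprocal neighbor sets and combines Euclidean and Jaccard distances; it is our own simplification, not the full method of Zhong et al. (2017), which additionally refines and re-weights the reciprocal sets, and the function name and weight `lam` are hypothetical.

```python
import torch

def k_reciprocal_rerank_distance(feats: torch.Tensor, k: int = 20,
                                 lam: float = 0.3) -> torch.Tensor:
    # feats: (N, d) L2-normalized embeddings for the whole gallery.
    dist = torch.cdist(feats, feats)                     # Euclidean distance
    knn = dist.topk(k, largest=False).indices            # (N, k) nearest neighbors
    N = feats.size(0)
    nbr = torch.zeros(N, N, device=feats.device)
    nbr.scatter_(1, knn, 1.0)                            # k-NN indicator matrix
    # Reciprocal neighbors: i and j must each appear in the other's k-NN list.
    recip = nbr * nbr.t()
    # Jaccard distance between reciprocal-neighbor sets.
    inter = recip @ recip.t()                            # |R(i) ∩ R(j)|
    union = recip.sum(1, keepdim=True) + recip.sum(1) - inter
    jaccard = 1.0 - inter / union.clamp(min=1)
    # Weighted combination of Euclidean and Jaccard distance.
    return lam * dist + (1.0 - lam) * jaccard
```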
Robust Metric Learning Over-reliance on binary supervision is a long-standing problem in metric learning. Many studies overcome this issue by taking advantage of the hierarchical nature of labels in metric learning datasets. Sun et al. (2021), Zheng et al. (2022), and Ramzi et al. (2022) explicitly use hierarchical labels for training. These methods assign a higher cost to mistakes in discriminating labels that are farther apart in the hierarchy, leading to a more robust embedding space. Yan et al. (2021) propose to generate synthetic hierarchical labels for unsupervised metric learning, and Yan et al. (2023) extend this idea to metric learning with synthetic label noise. These two works use a hyperbolic embedding space to better capture hierarchical relationships (Khrulkov et al., 2020). Ermolov et al. (2022) show that simply using a hyperbolic embedding space instead of a Euclidean embedding space improves metric learning performance. Our work has a similar motivation to the above hierarchical and hyperbolic metric learning works, but we use contextual similarity instead of hierarchical labels to mitigate label inconsistency. Appendix I.4 contains some results on hierarchical retrieval metrics.
3. Method
Notation Denote the normalized output of the embedding network as $f_i \in \mathbb{R}^d$. $s_{ij} = \langle f_i, f_j \rangle \in [-1, 1]$ denotes the cosine similarity between samples $i$ and $j$. There are $n$ samples in a mini-batch. We always use balanced sampling, where $k$ images are selected from each of $n/k$ randomly sampled labels. $n$ is divisible by $k$. $k$ is divisible by 2, but we always use $k \geq 4$ in experiments. $y_{ij} \in \{0, 1\}$ denotes the true similarity between $i$ and $j$, defined as $y_{ij} = 1$ if samples $i$ and $j$ share the same label and $0$ otherwise. We use uppercase letters to denote matrices, math script to denote sets, and lowercase letters to denote scalars. $i$, $j$, and $p$ are reserved for sample indices. $N$ is used to denote the binary indicator matrix for the set $\mathcal{N}$. For instance, let $\mathcal{N}(i)$ denote the set of neighbors of sample $i$; then $N(i, j) = 1$ if $j \in \mathcal{N}(i)$, and $0$ otherwise.
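For concreteness, here is a minimal sketch (ours, not part of Algorithm 1) of balanced batch sampling and of building the true-similarity matrix $y_{ij}$ from the sampled labels; the helper names are hypothetical, and we assume every label has at least $k$ images.

```python
import torch

def sample_balanced_batch(all_labels: torch.Tensor, n: int, k: int) -> torch.Tensor:
    # Pick n/k distinct labels, then k image indices per label (n divisible by k).
    assert n % k == 0
    uniq = all_labels.unique()
    chosen = uniq[torch.randperm(uniq.numel())[: n // k]]
    idx = []
    for c in chosen:
        pool = (all_labels == c).nonzero(as_tuple=True)[0]
        idx.append(pool[torch.randperm(pool.numel())[:k]])  # assumes >= k images
    return torch.cat(idx)                                    # (n,) dataset indices

def true_similarity(labels: torch.Tensor) -> torch.Tensor:
    # y_ij = 1 if samples i and j share the same label, 0 otherwise.
    return (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
```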
Contextual Similarity Definition We loosely follow the definition of contextual similarity proposed in STML (Kim et al., 2022), with significant modifications to accommodate the change in problem setting and to address optimization issues (these modifications are enumerated in Appendix E.1). In this section, we present the similarity definition using indicator matrices in order to show an efficient implementation in PyTorch. Note that the binary "and" is replaced by multiplication for differentiability. Algorithm 1 contains PyTorch-like pseudo-code for Equations 1-7. We include the code here for reproducibility and to show that our contextual loss can be compactly implemented despite the cumbersome mathematical notation.
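As a small illustration of this "and"-to-multiplication replacement (a toy example under our own names, not a reproduction of Algorithm 1): for 0/1 indicator matrices the elementwise product coincides with the logical "and", and it remains differentiable once the indicators are relaxed.

```python
import torch

# Hard indicator matrices with entries in {0, 1}: the logical "and" and the
# elementwise product give identical results.
A = (torch.rand(4, 4) > 0.5).float()
B = (torch.rand(4, 4) > 0.5).float()
assert torch.equal(A * B, torch.logical_and(A.bool(), B.bool()).float())

# With a relaxed (soft) indicator in (0, 1), the product stays differentiable,
# so gradients can flow through the "and".
scores = torch.randn(4, 4, requires_grad=True)
A_soft = torch.sigmoid(scores)   # relaxed indicator
soft_and = A_soft * B            # B kept as a fixed hard mask here
soft_and.sum().backward()
print(scores.grad is not None)   # True: gradients reach the underlying scores
```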
We denote the contextual similarity between samples $i$ and $j$ as $w_{ij}$. The matrix with entries $w_{ij}$ is entirely a function of the cosine similarity matrix with entries $s_{ij}$. The goal of Equations 1-4 is to calculate $w_{ij}$ in terms of $s_{ij}$. This is implemented as \texttt{get\_contextual\_similarity} in Algorithm 1. For readability, we present the $w_{ij}$ calculation as