
Mathematical Justification of Hard Negative Mining
via Isometric Approximation Theorem
Albert Xu, Jhih-Yi Hsieh, Bhaskar Vundurthy, Eliana Cohen, Howie Choset, Lu Li
Abstract
In deep metric learning, the Triplet Loss has emerged as a popular method for learning many computer vision and natural language processing tasks such as facial recognition, object detection, and visual-semantic embeddings. One issue that plagues the Triplet Loss is network collapse, an undesirable phenomenon in which the network projects the embeddings of all data onto a single point. Researchers predominantly address this problem with triplet mining strategies. While hard negative mining is the most effective of these strategies, existing formulations lack strong theoretical justification for their empirical success. In this paper, we utilize the mathematical theory of isometric approximation to show an equivalence between the Triplet Loss sampled by hard negative mining and an optimization problem that minimizes a Hausdorff-like distance between the neural network and its ideal counterpart function. This provides a theoretical justification for hard negative mining's empirical efficacy. In addition, our novel application of the isometric approximation theorem provides the groundwork for future forms of hard negative mining that avoid network collapse. Our theory can also be extended to analyze other Euclidean space-based metric learning methods like Ladder Loss or Contrastive Learning.
Introduction
Research in deep metric learning studies techniques for training deep neural networks to learn similarities and dissimilarities between data samples, typically by learning a distance metric via feature embeddings in R^n. Deep metric learning is used most extensively in face recognition (Schroff, Kalenichenko, and Philbin 2015; Liu et al. 2017; Hermans, Beyer, and Leibe 2017) and other computer vision tasks (Tack et al. 2020; Chen et al. 2020a) where there is an abundance of label values.
Common deep metric learning techniques include contrastive loss (Hadsell, Chopra, and LeCun 2006) and triplet loss (Schroff, Kalenichenko, and Philbin 2015). Moreover, each of these methods has variants that address specific applications. SimCLR (Chen et al. 2020a,b), for example, is a recent contrastive loss variant designed to address unsupervised deep metric learning with state-of-the-art performance
on ImageNet (Russakovsky et al. 2015). Ladder Loss (Zhou
et al. 2019), a generalized variant of triplet loss, improves upon existing methods for coherent visual-semantic embedding and has important applications in multiple visual and language understanding tasks (Karpathy, Joulin, and Fei-Fei 2014; Ma, Lu, and Li 2015; Vinyals et al. 2014). Given the success of metric learning in a wide range of applications, we see value in investigating its underlying theories. In this paper, we present a theoretical framework which explains observed but previously unexplained behaviors of the Triplet Loss.
Literature Review
We choose to analyze the Triplet Loss's underlying theory due to its strong dependence on the triplet selection strategy. This dependence makes the Triplet Loss fickle to work with, as empirical results have shown that randomly sampling triplets yields unsatisfactory results. On the other hand, successful triplet selection strategies like hard negative mining can face issues such as network collapse, a phenomenon where the network projects all data points onto a single point (Schroff, Kalenichenko, and Philbin 2015), while more stable triplet selection strategies do not perform as well in practice (Hermans, Beyer, and Leibe 2017).
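For reference, in the standard formulation (Schroff, Kalenichenko, and Philbin 2015), a triplet consists of an anchor a, a positive p sharing the anchor's label, and a negative n carrying a different label. With an embedding network f and margin \alpha, the loss on a single triplet is
\mathcal{L}(a, p, n) = \left[\, \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha \,\right]_+ ,
and hard negative mining selects, for a given anchor, the negative closest to it, n^* = \arg\min_{n : y_n \neq y_a} \|f(a) - f(n)\|, typically searched within the current batch. A collapsed solution is the constant map f(x) \equiv c, for which every triplet incurs a loss of exactly \alpha. This notation is standard and serves only as orientation for the discussion that follows.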
In the original FaceNet paper, Schroff et al. find that with large batch sizes (in the thousands), hard negative mining leads to collapsed solutions. To address this, they instead use a strategy they call semi-hard mining (Schroff, Kalenichenko, and Philbin 2015). On the other hand, Hermans et al. find that with smaller batch sizes (N = 72), the hardest mining strategy significantly outperforms other mining strategies and does not suffer from collapsed solutions (Hermans, Beyer, and Leibe 2017). These seemingly contradictory results showcase the need for a theoretical framework that explains hard negative mining and the root cause of collapsed solutions.
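To make the difference between these two selection rules concrete, the following sketch (our NumPy illustration, not the published FaceNet or batch-hard implementations; the helper name mine_triplets and its defaults are ours) picks, for each anchor in a batch, the hardest positive together with either the overall hardest negative (semi_hard=False, essentially the batch-hard rule of Hermans, Beyer, and Leibe 2017) or the hardest negative that is still farther away than the positive but within the margin (semi_hard=True, loosely following Schroff, Kalenichenko, and Philbin 2015):

import numpy as np

def mine_triplets(emb, labels, margin=0.2, semi_hard=False):
    # Illustrative in-batch triplet selection (not the original implementations).
    # emb: (B, d) array of embeddings; labels: (B,) array of class labels.
    d = np.sqrt(((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1))
    triplets = []
    for a in range(len(labels)):
        pos = np.where((labels == labels[a]) & (np.arange(len(labels)) != a))[0]
        neg = np.where(labels != labels[a])[0]
        if len(pos) == 0 or len(neg) == 0:
            continue
        p = pos[np.argmax(d[a, pos])]  # hardest positive for this anchor
        if semi_hard:
            # keep negatives farther than the positive but within the margin
            ok = neg[(d[a, neg] > d[a, p]) & (d[a, neg] < d[a, p] + margin)]
            neg = ok if len(ok) else neg
        n = neg[np.argmin(d[a, neg])]  # hardest remaining negative
        triplets.append((a, p, n))
    return triplets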
There has been some prior literature investigating the phenomenon of network collapse. Xuan et al. show that hard negative mining leads to collapsed solutions by analyzing the gradients of a simplified neural network model (Xuan et al. 2020). However, they do not account for the many cases where hard negative mining does work. Levi et al. prove that, under a label randomization assumption, the globally optimal solution to the triplet loss necessarily exhibits network collapse (Levi et al. 2021). Rather than inves-