
For instance, if 5 hard negatives from $Y$ are mined for each sample in $X$ during batch construction, the training time of a single epoch increases by a factor of 5, not including the time taken to construct nearest-neighbor indices. Furthermore, hard negative mining often requires frequent reranking to prevent negative anchors from being sampled from stale nearest-neighbor indices. Work on momentum-based memory banks has found that hard negative mining is especially useful with small lookback intervals (i.e., 2-4 previous batches) (Wang et al., 2021). In this paper, we propose a global alternative to hard negative mining, Global Contrastive Batch Sampling (GCBS), which seeks to efficiently learn a permutation over samples in $X$ and $Y$ that increases the likelihood of hard negatives before each epoch, rather than through greedy insertion during training. In Figure 1 above, we visually depict $\mathcal{L}_{Global}$ and $\mathcal{L}_{Train}$ along with the modifications on batches, and therefore the observed loss, for our proposed approach GCBS.
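For intuition, one common way to write these two quantities for an NT-Xent-style objective over $N$ aligned pairs $(x_i, y_i)$ split into batches of size $k$ is sketched below; the similarity function $s(\cdot,\cdot)$, temperature $\tau$, and batch index sets $B_1, \ldots, B_{N/k}$ are illustrative notation, and the formal definitions we analyze are deferred to Section 3.
\[
\mathcal{L}_{Train} = -\sum_{m=1}^{N/k} \sum_{i \in B_m} \log \frac{\exp\left(s(x_i, y_i)/\tau\right)}{\sum_{j \in B_m} \exp\left(s(x_i, y_j)/\tau\right)}, \qquad
\mathcal{L}_{Global} = -\sum_{i=1}^{N} \log \frac{\exp\left(s(x_i, y_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(s(x_i, y_j)/\tau\right)}.
\]
In this view, $\mathcal{L}_{Train}$ contrasts each anchor only against the $k$ candidates in its own batch, while $\mathcal{L}_{Global}$ contrasts it against all $N$ candidates; reordering samples so that batches retain the hardest negatives shrinks the gap between the two.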
First, we show theoretically that the upper bound on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$, with no oversampling or assumptions on the data/model, for commonly used scaled cross-entropy losses such as NT-Xent (Sohn, 2016), is dependent only on batch assignments, the total number of samples $N$, and the batch size $k$. We prove that, for fixed $N, k$, this upper bound is minimized as a Quadratic Bottleneck Assignment Problem which seeks to maximize the number of hard negatives in batches by learning a permutation $\pi \in \Pi_N$ on the rows of $X$ and $Y$. We then formulate an $\tilde{O}(N^2)$ approximation for optimizing over this permutation, GCBS, and show that it is more efficient per training epoch than any hard negative mining approach, even for $l = 1$. We analyze the loss behavior of GCBS and show that GCBS better approximates the total contrastive loss. Lastly, we empirically evaluate GCBS in the context of supervised contrastive finetuning for sentence embedding (STS) and code search (CosQA, AdvTest, CodeSearchNet) and achieve state-of-the-art performance on all of these tasks.
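For reference (the specific cost matrices in our setting are derived in Section 4), a Quadratic Bottleneck Assignment Problem over permutations $\pi \in \Pi_N$ with cost matrices $A, B \in \mathbb{R}^{N \times N}$ takes the generic form
\[
\min_{\pi \in \Pi_N} \; \max_{1 \le i, j \le N} A_{ij} \, B_{\pi(i)\pi(j)},
\]
and Matrix Bandwidth Minimization is the special case in which, for a symmetric matrix $A$, one minimizes $\max\{|\pi(i) - \pi(j)| : A_{ij} \neq 0\}$, i.e., the bandwidth of the symmetrically permuted matrix.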
In this work, we summarize our contributions as follows:
1. We prove that the upper bound of the gap between the total and observed losses in contrastive learning for a fixed batch size $B$ without oversampling, $\mathcal{L}_{Global} - \mathcal{L}_{Train}$, is constrained by a Quadratic Bottleneck Assignment Problem and can be relaxed to a Matrix Bandwidth Minimization problem.
2. We formulate an $\tilde{O}(N^2)$ time and $O(Nk)$ space complexity approximation to the Matrix Bandwidth Minimization problem, GCBS, using the Cuthill-McKee heuristic and implement this algorithm in less than 50 lines of PyTorch (an illustrative sketch is given after this list).
3. We analyze the loss behavior of GCBS and show that, in sentence embedding and code search tasks, GCBS better approximates the total contrastive loss.
4. We empirically evaluate GCBS and achieve state-of-the-art performance on the STS taskset for sentence embeddings. Additionally, we achieve state-of-the-art performance on the CosQA, AdvTest, and CodeSearchNet tasks for joint programming language-natural language embeddings.
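To make the construction in contribution 2 concrete, the sketch below shows one way to compute such a reordering with standard PyTorch and SciPy primitives. It is a minimal illustration under stated assumptions, not our reference implementation: the function name gcbs_permutation, the top-$k$ sparsification, and the use of a dense similarity matrix (which costs $O(N^2)$ memory; a chunked top-$k$ computation would recover the $O(Nk)$ space bound stated above) are simplifications, and implementation details are provided in Section 5.

import torch
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def gcbs_permutation(x_emb, y_emb, top_k=5):
    """Return a permutation of the N aligned pairs that places mutually
    similar (i.e., likely hard negative) samples at nearby positions, so
    that sequential batching groups them into the same batches."""
    # Cosine similarities between the two views of every pair.
    x = torch.nn.functional.normalize(x_emb, dim=1)
    y = torch.nn.functional.normalize(y_emb, dim=1)
    sim = (x @ y.t()).detach()                      # (N, N), dense for brevity
    n = sim.size(0)

    # Keep only the top-k entries per row to form a sparse similarity graph.
    vals, cols = sim.topk(top_k, dim=1)
    rows = torch.arange(n).unsqueeze(1).expand_as(cols)
    graph = csr_matrix(
        (vals.flatten().cpu().numpy(),
         (rows.flatten().cpu().numpy(), cols.flatten().cpu().numpy())),
        shape=(n, n),
    )

    # Symmetrize, then apply reverse Cuthill-McKee: the returned ordering
    # reduces the bandwidth of the graph, pulling similar samples together.
    graph = (graph + graph.T).tocsr()
    perm = reverse_cuthill_mckee(graph, symmetric_mode=True)
    return torch.as_tensor(perm.astype("int64"))

Before each epoch, one would recompute the permutation from the current model's embeddings and read batches off the reordered dataset sequentially, matching the per-epoch usage described above.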
The rest of this paper is organized as follows. In Section 2, we discuss related work. In Section 3, we derive upper bounds on the gap between total and observed losses, and in Section 4 we formulate these bounds as Quadratic Assignment Problems. In Section 5, we relax our QBAP to a Matrix Bandwidth Minimization problem, introduce our proposed method, GCBS, for approximating global contrastive losses, and provide implementation details. In Section 6, we provide experimental results for sentence embedding and code search tasks using GCBS. Section 7 provides discussion and Section 8 concludes the paper.
2. Related Work
2.1. Contrastive Representation Learning
Contrastive learning has been used ubiquitously for vision, language, and audio representation learning. In vision tasks, SimCLR (Chen et al., 2020a) showed that using augmented views of the same image as positive samples and the NT-Xent objective (Sohn, 2016) improves the performance of unsupervised classification. MoCo (He et al., 2020; Chen et al., 2020b) used memory banks of negative samples from recent batches to increase the effective contrastive batch size, and Sylvain et al. (2020) and Khosla et al. (2020) show improvements using supervised contrastive frameworks. For sentence embedding tasks, contrastive learning has been used in pretraining (Logeswaran & Lee, 2018), fine-tuning, and continuous prompt tuning settings (Gao et al., 2021; Yuxin Jiang & Wang, 2022) to provide state-of-the-art performance. Additionally, contrastive learning has been used extensively to align representations across different modalities for downstream use in multimodal tasks such as those involving language/code, language/vision, and vision/decision making (Guo et al., 2022; Feng et al., 2020; Guo et al., 2021; Radford et al., 2021; Ramesh et al., 2021; Laskin et al., 2020).
2.2. Hard Negative Mining in Metric and Contrastive Learning
Selection of hard negatives during batch construction is well-studied and has been shown, both theoretically and empirically, to improve metric and contrastive learning (Saunshi et al., 2019; Iscen et al., 2018; Xuan et al., 2020; Mishchuk et al., 2017; Wu et al., 2017). Prior work in metric learning (Song et al., 2016; Schroff et al., 2015; Harwood et al.,