Global Selection of Contrastive Batches via Optimization on Sample Permutations

Vin Sachidananda¹, Ziyi Yang², Chenguang Zhu²
Abstract
Contrastive Learning has recently achieved state-of-the-art performance in a wide range of unimodal and multimodal tasks. Many contrastive learning approaches use mined hard negatives to make batches more informative during training, but these approaches are inefficient as they increase epoch length proportional to the number of mined negatives and require frequent updates of nearest neighbor indices or mining from recent batches. In this work, we provide an alternative to hard negative mining, Global Contrastive Batch Sampling (GCBS), an efficient approximation to the batch assignment problem that upper bounds the gap between the global and training losses, $\mathcal{L}_{Global} - \mathcal{L}_{Train}$, in contrastive learning settings. Through experimentation we find that GCBS improves state-of-the-art performance in sentence embedding and code-search tasks. Additionally, GCBS is easy to implement as it requires only a few additional lines of code, does not maintain external data structures such as nearest neighbor indices, is more computationally efficient than the most minimal hard negative mining approaches, and makes no changes to the model being trained. Code is available at https://github.com/vinayak1/GCBS.
1. Introduction
Contrastive Learning is used ubiquitously in training large representation models, such as transformers, and has been shown to achieve state-of-the-art performance in a wide range of unimodal and multimodal tasks across language, vision, code, and audio (Chen et al., 2020a; Gao et al., 2021; Yuxin Jiang & Wang, 2022; Guo et al., 2022; Yu et al., 2022; Radford et al., 2021; Ramesh et al., 2021; Saeed et al., 2021; Yang et al., 2021).

¹Stanford University and Two Sigma Ventures. ²Knowledge and Language Team, Azure Cognitive Services Research, Microsoft Research, Redmond, WA. Correspondence to: Vin Sachidananda <vinsachida@gmail.com>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

In supervised contrastive learning, one is given a paired dataset $(X, Y)$, each with $N$ samples, where $x_i$ and $y_i$ are semantically related, such as similar sentences, code and corresponding language descriptors, or images and their captions. For unsupervised settings, $X, Y$ are alternative "views" of the same sample, often constructed through data augmentation schemes or independent dropout masks (Chen et al., 2020a; Gao et al., 2021). Then batches of rows, $B$, are sampled from this pair of datasets and a model $f(\cdot)$ is trained to concurrently maximize inner products for outputs of similar (positive) data inputs, $f(x_i)^T f(y_i)$, and minimize inner products for outputs of dissimilar (negative) data inputs, $f(x_i)^T f(y_j)$ for $i, j \in B$, $i \neq j$.

Due to batch size constraints from hardware limitations, for a fixed batch size $k$, only $Nk$ inner products of the total $N^2$ in $f(X)f(Y)^T$ are observed in the training loss for each epoch of training. Through the rest of this paper, we will refer to this observed training loss over $Nk$ inner products as $\mathcal{L}_{Train}$ and the total loss over $N^2$ inner products as $\mathcal{L}_{Global}$.
It has been observed, both in contrastive metric and representation learning (Saunshi et al., 2019; Iscen et al., 2018; Xuan et al., 2020; Mishchuk et al., 2017; Wu et al., 2017; Song et al., 2016; Schroff et al., 2015; Harwood et al., 2017; Ge et al., 2018), that in order for batches to be informative during training, they should be constructed to contain "hard negatives", i.e., pairs with large values of $f(x_i)^T f(y_j)$, $i \neq j$. Additionally, it has been shown that including hard negatives in batches better approximates global losses (Zhang & Stratos, 2021).
Currently, approaches for constructing batches, and controlling which inner products of the total $N^2$ should be used for training, broadly fall into one of two categories. One either uses random sampling or mines nearest neighbors of the reference sample $x_i$ in order to greedily insert hard negatives into the same batch as $x_i$. While greedily inserting hard negatives is effective in practice (Zhang et al., 2018; Xiong et al., 2021), these methods incur large costs both in time and resources: mining $l < k$ hard negatives per reference sample increases each training epoch by a factor of $l$ and often requires maintaining and reranking nearest neighbor indices on expensive accelerated hardware during training.
For instance, if 5 hard negatives from $Y$ are mined for each sample in $X$ during batch construction, one will increase the training time of a single epoch by a factor of 5, not including the time taken for constructing nearest neighbor indices.
Furthermore, hard negative mining often requires frequent reranking to prevent negative anchors from being sampled from stale nearest neighbor indices. Work on momentum-based memory banks has found that hard negative mining is especially useful with small lookback intervals (i.e., 2-4 previous batches) (Wang et al., 2021). In this paper, we propose a global alternative to hard negative mining, Global Contrastive Batch Sampling (GCBS), which seeks to efficiently learn a permutation over samples in $X$ and $Y$ to increase the likelihood of hard negatives before each epoch rather than through greedy insertion during training. In Figure 1, we visually depict $\mathcal{L}_{Global}$ and $\mathcal{L}_{Train}$ along with the modifications on batches, and therefore the observed loss, for our proposed approach GCBS.
First, we show theoretically that the upper bound on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$, with no oversampling or assumptions on the data or model, for commonly used scaled cross entropy losses such as NT-Xent (Sohn, 2016), depends only on the batch assignments, the total number of samples $N$, and the batch size $k$. We prove that, for fixed $N, k$, this upper bound is minimized as a Quadratic Bottleneck Assignment Problem which seeks to maximize the number of hard negatives in batches by learning a permutation $\pi \in \Pi_N$ on the rows of $X$ and $Y$. We then formulate an $\tilde{O}(N^2)$ approximation for optimizing over this permutation, GCBS, and show that it is more efficient per training epoch than any hard negative mining approach, even for $l = 1$. We analyze the loss behavior of GCBS and show that GCBS better approximates the total contrastive loss. Lastly, we empirically evaluate GCBS in the context of supervised contrastive finetuning for sentence embedding (STS) and code search (CosQA, AdvTest, CodeSearchNet) and achieve state-of-the-art performance on all of these tasks.
In this work, we summarize our contributions as follows:
1. We prove that the upper bound of the gap between the total and observed losses in contrastive learning for a fixed batch size $k$ without oversampling, $\mathcal{L}_{Global} - \mathcal{L}_{Train}$, is constrained by a Quadratic Bottleneck Assignment Problem and can be relaxed to a Matrix Bandwidth Minimization problem.

2. We formulate an $\tilde{O}(N^2)$ time and $O(Nk)$ space complexity approximation to the Matrix Bandwidth Minimization problem, GCBS, using the Cuthill-McKee heuristic and implement this algorithm in less than 50 lines of PyTorch.

3. We analyze the loss behavior of GCBS and show that, in sentence embedding and code-search tasks, GCBS better approximates the total contrastive loss.

4. We empirically evaluate GCBS and achieve state-of-the-art performance on the STS taskset for sentence embeddings. Additionally, we achieve state-of-the-art performance on the CosQA, AdvTest, and CodeSearchNet tasks for joint programming language-natural language embeddings.
The rest of this paper is organized as follows. In Section 2, we discuss related work. In Section 3, we derive upper bounds on the gap between total and observed losses, and in Section 4 we formulate these bounds as Quadratic Assignment Problems. In Section 5, we relax our QBAP to a Matrix Bandwidth Minimization problem, introduce our proposed method, GCBS, for approximating global contrastive losses, and provide implementation details. In Section 6, we provide experimental results for sentence embedding and code search tasks using GCBS. Section 7 provides discussion and Section 8 concludes the paper.
2. Related Work
2.1. Contrastive Representation Learning
Contrastive Learning has been used ubiquitously for vision, language, and audio representation learning. In vision tasks, SimCLR (Chen et al., 2020a) showed that using augmented views of the same image as positive samples with the NT-Xent objective (Sohn, 2016) improves performance of unsupervised classification. MoCo (He et al., 2020; Chen et al., 2020b) used memory banks of negative samples from recent batches to increase the effective contrastive batch size, and Sylvain et al. (2020); Khosla et al. (2020) show improvements using supervised contrastive frameworks. For sentence embedding tasks, contrastive learning has been used in pretraining (Logeswaran & Lee, 2018), finetuning, and continuous prompt tuning settings (Gao et al., 2021; Yuxin Jiang & Wang, 2022) to provide state-of-the-art performance. Additionally, contrastive learning has been used extensively to align representations across different modalities for downstream use in multimodal tasks such as those involving language/code, language/vision, and vision/decision making (Guo et al., 2022; Feng et al., 2020; Guo et al., 2021; Radford et al., 2021; Ramesh et al., 2021; Laskin et al., 2020).
2.2. Hard Negative Mining in Metric and Contrastive Learning
Selection of hard negatives during batch construction is well-studied and has been shown, both theoretically and empirically, to improve metric and contrastive learning (Saunshi et al., 2019; Iscen et al., 2018; Xuan et al., 2020; Mishchuk et al., 2017; Wu et al., 2017). Prior work in metric learning (Song et al., 2016; Schroff et al., 2015; Harwood et al., 2017; Ge et al., 2018) has observed that "hard negatives", or negatives which are difficult to discriminate against with respect to a particular query's embedding, are beneficial for downstream classifier performance. In contrastive learning, Zhang et al. (2018) uses Mixup to generate hard negatives in latent space. Chuang et al. (2020) proposes a debiased contrastive loss which approximates the underlying "true" distribution of negative examples, and Yang et al. (2022) studies the effect of restricting negative sampling to regions around the query using a variational extension to the InfoNCE objective. In Kim et al. (2020); Ho & Nvasconcelos (2020), adversarial examples are used to produce more challenging positives and hard negatives. In Xiong et al. (2021), nearest neighbor indices and a secondary model from prior checkpoints are used to mine hard negatives for text retrieval tasks.

Figure 1. Visualization of inner products of $f(X)f(Y)^T$ in the global loss, in training with random sampling, and in training with permutation-optimized sampling for contrastive losses.
Robinson et al. (2021) reweights negative samples based on their Euclidean distance and debiases positive samples in order to control the level of difficulty in unsupervised contrastive learning. Kalantidis et al. (2020) show that harder negative examples are needed to improve performance and training speed in vision tasks and propose adding "synthetic" hard negatives to training batches using convex combinations of nearest neighbors.
2.3. Quadratic Assignment Problems
The Quadratic Assignment Problem (QAP), stemming from facilities location problems (Koopmans & Beckmann, 1957) in combinatorial optimization, seeks to minimize the total cost of assigning $n$ facilities to $n$ locations. Formally, one seeks to optimize $\min_{\pi \in \Pi_n} \mathrm{Tr}(W \pi D \pi^T)$ over $\Pi_n$, the set of $n \times n$ permutation matrices, for a given cost matrix $W \in \mathbb{R}^{n \times n}$ and distance matrix $D \in \mathbb{R}^{n \times n}$. The Quadratic Bottleneck Assignment Problem (QBAP) (Steinberg, 1961) takes a similar form but minimizes the maximum cost rather than the total cost, $\min_{\pi \in \Pi_n} \max_{i,j} (W \odot \pi D \pi^T)_{i,j}$. The Graph Bandwidth Minimization Problem seeks to minimize the dispersion of nonzero costs from the main diagonal for a sparse distance matrix $D$ and is a special case of the QBAP in which the cost matrix $W$ increases monotonically in $|i - j|$. In this paper, we prove that minimizing the upper bound on the gap between the total and the observed training losses, $\mathcal{L}_{Global} - \mathcal{L}_{Train}$, over a pair of datasets $X, Y$ is bounded by Quadratic Assignment Problems and approximated by a Graph Bandwidth Minimization Problem. This connection is shown visually in Figure 1. As all of the aforementioned problems are NP-Hard, we utilize the Cuthill-McKee algorithm (Cuthill & McKee, 1969), an $\tilde{O}(N^2)$ approximation for bandwidth minimization.
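To make these three objectives concrete, the following toy NumPy sketch (our own illustration; the matrices, sparsity threshold, and permutation are arbitrary, not from the paper) evaluates the QAP trace cost, the QBAP bottleneck cost, and the bandwidth of a reordered sparse pattern for one candidate permutation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))                      # cost matrix W
D = rng.random((n, n))                      # distance matrix D
P = np.eye(n)[rng.permutation(n)]           # permutation matrix pi

qap = np.trace(W @ P @ D @ P.T)             # QAP: total assignment cost Tr(W pi D pi^T)
qbap = (W * (P @ D @ P.T)).max()            # QBAP: bottleneck (maximum) elementwise cost

# Bandwidth of a sparse symmetric pattern after reordering by pi:
# the largest |i - j| over its nonzero entries.
S = (rng.random((n, n)) < 0.3).astype(float)
S = np.maximum(S, S.T)                      # symmetrize the sparsity pattern
i, j = np.nonzero(P @ S @ P.T)
bandwidth = int(np.abs(i - j).max()) if i.size else 0
print(f"QAP={qap:.3f}  QBAP={qbap:.3f}  bandwidth={bandwidth}")
```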
3. Global and Training Losses for Cross-Entropy based Contrastive Objectives
In this section, we characterize the gap between training and global losses in supervised contrastive learning for the Normalized Temperature-scaled Cross Entropy (NT-Xent) loss (Sohn, 2016). The NT-Xent loss is a ubiquitous contrastive objective used in state-of-the-art models for sentence embedding, code-language tasks, and vision-language tasks (Gao et al., 2021; Yuxin Jiang & Wang, 2022; Guo et al., 2022; Chen et al., 2020a; Radford et al., 2021).

Let $X, Y \in \mathbb{R}^{N \times d}$ be the output representations for two paired datasets, each with $N$ samples. Consider a supervised setting where $(x_i, y_i)$ are considered "positive" pairs and $(x_i, y_j)$ are considered negative pairs for $i \neq j$. Note that the analysis we provide can be modified to incorporate multiple positives, such as those from class information, which will have tighter bounds in terms of the number of samples $N$ and the batch size $k$. Additionally, let $\tau \in \mathbb{R}^+$ be a tunable temperature parameter which scales logit values in the objective. When $\tau$ is small, the NT-Xent loss is a proxy for the hard maximum loss. Assume that all representations have been normalized, that is, $\|x_i\|_2 = \|y_i\|_2 = 1$ for all $i$.
3.1. The Global and Training NT-Xent objectives, $\mathcal{L}_{Global}$ and $\mathcal{L}_{Train}$
First, we provide the contrastive loss over all $N^2$ pairs of inner products between $X$ and $Y$. We call this the global objective as it contains all pairwise contrastive information and, in the absence of resource constraints, is the loss one would seek to minimize. The Global NT-Xent objective is given as follows:

Definition 3.1 (Global NT-Xent objective).
$$\mathcal{L}_{Global} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(x_i^T y_i \tau^{-1})}{\sum_{j=1}^{N}\exp(x_i^T y_j \tau^{-1})} = \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\sum_{j=1}^{N}\exp(x_i^T y_j \tau^{-1})\Big].$$
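As a minimal PyTorch sketch (our own, assuming $\ell_2$-normalized representations of shape $N \times d$; names and the default temperature are illustrative), Definition 3.1 is a cross-entropy over the full $N \times N$ similarity matrix with the diagonal entries as targets:

```python
import torch
import torch.nn.functional as F

def global_nt_xent(X: torch.Tensor, Y: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Global NT-Xent loss over all N^2 inner products (Definition 3.1).

    X, Y: (N, d) L2-normalized representations; row i of X is the positive
    pair of row i of Y, and every other row is a negative.
    """
    logits = (X @ Y.T) / tau                        # all N x N scaled inner products
    targets = torch.arange(X.size(0), device=X.device)
    return F.cross_entropy(logits, targets)         # -1/N * sum_i log softmax_i at (i, i)
```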
Due to memory constraints, during training one does not make all $N^2$ comparisons over pairs in $X$ and $Y$ during a training epoch. Instead, each sample $x_i$ is only contrasted against $k$ in-batch samples in $Y$: its positive anchor $y_i$ and $k - 1$ negative anchors. This observed training loss will be strictly less than the global loss as it makes only $k$ comparisons out of $N$ total for each sample. For a fixed batch assignment $B$, let $B_i$ be the indices of rows in $Y$ contained in a batch with $x_i$. The Training NT-Xent objective is given as follows:

Definition 3.2 (Training NT-Xent objective).
$$\mathcal{L}_{Train} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(x_i^T y_i \tau^{-1})}{\sum_{j \in B_i}\exp(x_i^T y_j \tau^{-1})} = \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\sum_{j \in B_i}\exp(x_i^T y_j \tau^{-1})\Big].$$
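Under the same assumptions, a sketch of Definition 3.2 with sequential batches of size $k$ (so $B_i = \{j : \lfloor j/k \rfloor = \lfloor i/k \rfloor\}$); comparing it to the global sketch above gives the observable gap $\mathcal{L}_{Global} - \mathcal{L}_{Train}$. All names are ours.

```python
import torch
import torch.nn.functional as F
# global_nt_xent as defined in the previous sketch.

def train_nt_xent(X: torch.Tensor, Y: torch.Tensor, k: int, tau: float = 0.05) -> torch.Tensor:
    """Training NT-Xent loss (Definition 3.2): each x_i is contrasted only
    against the k in-batch rows of Y under sequential batching."""
    losses = []
    for start in range(0, X.size(0), k):
        xb, yb = X[start:start + k], Y[start:start + k]
        logits = (xb @ yb.T) / tau                  # k x k in-batch inner products
        targets = torch.arange(xb.size(0), device=X.device)
        losses.append(F.cross_entropy(logits, targets, reduction="sum"))
    return torch.stack(losses).sum() / X.size(0)

# The observed gap is always nonnegative: each row's global denominator in
# Definition 3.1 dominates its in-batch denominator in Definition 3.2.
N, d, k = 1024, 32, 64
X = F.normalize(torch.randn(N, d), dim=1)
Y = F.normalize(torch.randn(N, d), dim=1)
gap = global_nt_xent(X, Y) - train_nt_xent(X, Y, k)
```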
3.2. Minimizing the gap between $\mathcal{L}_{Global}$ and $\mathcal{L}_{Train}$

For a fixed set of batches $B$, we will first provide upper bounds on $\mathcal{L}_{Global}$ and lower bounds on $\mathcal{L}_{Train}$ using Log-Sum-Exp properties (Calafiore & El Ghaoui, 2014). Using the upper bound for Log-Sum-Exp, the following bound on $\mathcal{L}_{Global}$ can be obtained, where equality is attained when all inner products have the same value.
3.2.1. Upper Bound on $\mathcal{L}_{Global}$

Lemma 3.3 (Upper bound on $\mathcal{L}_{Global}$). Using Log-Sum-Exp properties (Calafiore & El Ghaoui, 2014), $\mathcal{L}_{Global}$ with the NT-Xent contrastive objective can be upper bounded as:
$$\mathcal{L}_{Global} = \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\sum_{j=1}^{N}\exp(x_i^T y_j \tau^{-1})\Big]$$
$$\leq \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\big(N \max_j \exp(x_i^T y_j \tau^{-1})\big)\Big]$$
$$= \frac{1}{N}\sum_{i=1}^{N}\tau^{-1}\big(-x_i^T y_i + \max_j x_i^T y_j\big) + \log N.$$
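A quick numerical sanity check (our own illustration) of the Log-Sum-Exp upper bound used above, $\log \sum_{j=1}^{N} \exp(z_j) \le \max_j z_j + \log N$:

```python
import torch

z = torch.randn(1000)                        # arbitrary scaled logits z_j = x_i^T y_j / tau
lse = torch.logsumexp(z, dim=0)
upper = z.max() + torch.log(torch.tensor(float(z.numel())))
assert lse <= upper                          # equality holds only when all z_j are equal
```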
3.2.2. Lower Bounds on $\mathcal{L}_{Train}$

Two lower bounds can be derived for $\mathcal{L}_{Train}$, first using the translation identity property (Nielsen & Sun, 2016) and then using the standard lower bound for Log-Sum-Exp (Calafiore & El Ghaoui, 2014).

Lemma 3.4 (First lower bound on $\mathcal{L}_{Train}$ using the translation identity). Using the Log-Sum-Exp translation identity property (Nielsen & Sun, 2016), $\mathcal{L}_{Train}$ with the NT-Xent contrastive objective can be bounded as:
$$\mathcal{L}_{Train} = \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\sum_{j \in B_i}\exp(x_i^T y_j \tau^{-1})\Big]$$
$$\geq \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\big(k \min_{j \in B_i} \exp(x_i^T y_j \tau^{-1})\big)\Big]$$
$$= \frac{1}{N}\sum_{i=1}^{N}\tau^{-1}\big(-x_i^T y_i + \min_{j \in B_i} x_i^T y_j\big) + \log k.$$
Lemma 3.5 (Second lower bound on $\mathcal{L}_{Train}$ using the standard Log-Sum-Exp bound). Using Log-Sum-Exp properties (Calafiore & El Ghaoui, 2014), $\mathcal{L}_{Train}$ with the NT-Xent contrastive objective can be bounded as:
$$\mathcal{L}_{Train} = \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\sum_{j \in B_i}\exp(x_i^T y_j \tau^{-1})\Big]$$
$$\geq \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\big(\max_{j \in B_i} \exp(x_i^T y_j \tau^{-1})\big)\Big]$$
$$= \frac{1}{N}\sum_{i=1}^{N}\tau^{-1}\big(-x_i^T y_i + \max_{j \in B_i} x_i^T y_j\big).$$
3.2.3. Upper Bounds on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$

We can now bound the gap between the global and training losses of the NT-Xent objective for a fixed set of batches $B$. The diagonal terms are included in both the global and training losses and will therefore not factor into characterizing the gap.

Theorem 3.6 (First upper bound on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$). An upper bound on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$ for the NT-Xent objective, using the lower bound from Lemma 3.4, is:
$$\mathcal{L}_{Global} - \mathcal{L}_{Train} \leq \frac{1}{N}\sum_{i=1}^{N}\tau^{-1}\Big(\max_j x_i^T y_j - \min_{j \in B_i} x_i^T y_j\Big) + \log\frac{N}{k}.$$
Theorem 3.7 (Second upper bound on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$). An upper bound on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$ for the NT-Xent objective, using the lower bound from Lemma 3.5, is:
$$\mathcal{L}_{Global} - \mathcal{L}_{Train} \leq \frac{1}{N}\sum_{i=1}^{N}\tau^{-1}\Big(\max_j x_i^T y_j - \max_{j \in B_i} x_i^T y_j\Big) + \log N.$$
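Continuing the two sketches above (same hypothetical helper functions, random unit embeddings, and sequential batches), both gap bounds can be checked numerically:

```python
import torch
import torch.nn.functional as F
# global_nt_xent and train_nt_xent as defined in the sketches above.

tau, N, d, k = 0.05, 1024, 32, 64
X = F.normalize(torch.randn(N, d), dim=1)
Y = F.normalize(torch.randn(N, d), dim=1)
gap = global_nt_xent(X, Y, tau) - train_nt_xent(X, Y, k, tau)

S = X @ Y.T                                              # all N x N inner products
batch = torch.arange(N) // k                             # sequential batch ids
in_batch = batch[:, None] == batch[None, :]
row_max = S.max(dim=1).values
min_in_batch = S.masked_fill(~in_batch, float("inf")).min(dim=1).values
max_in_batch = S.masked_fill(~in_batch, float("-inf")).max(dim=1).values

bound_36 = ((row_max - min_in_batch) / tau).mean() + torch.log(torch.tensor(N / k))
bound_37 = ((row_max - max_in_batch) / tau).mean() + torch.log(torch.tensor(float(N)))
assert gap <= bound_36 + 1e-4                            # Theorem 3.6
assert gap <= bound_37 + 1e-4                            # Theorem 3.7
```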
4. Minimizing $\mathcal{L}_{Global} - \mathcal{L}_{Train}$ as Quadratic Assignment over Batches

Note that Theorems 3.6 and 3.7 bound the gap $\mathcal{L}_{Global} - \mathcal{L}_{Train}$ without making any assumptions on the data distribution or model. Additionally, we can see that the bounds depend only on the batch assignments $j \in B_i$, the batch size $k$, and the total number of samples $N$. Note that the losses we have characterized have been summed over samples in $X$, which we will denote as $\mathcal{L}^{Global}_X, \mathcal{L}^{Train}_X$; we analogously consider losses summed over samples in $Y$ as $\mathcal{L}^{Global}_Y, \mathcal{L}^{Train}_Y$.
4.1. Batch assignment as optimization over row permutations

We will now rewrite our optimization problems over permutations $\pi \in \Pi_N$ instead of sets of batch assignments $\{B\}$, an equivalent formulation. First, recognize that our bounds depend only on the batch assignments of negatives, $j \in B_i$. Without loss of generality, assume that batches are constructed sequentially after applying a row permutation $\pi \in \Pi_N$ to $X$ and $Y$. That is, batches are constructed over $\pi(X), \pi(Y)$ such that $j \in B_i \iff \lfloor j/k \rfloor = \lfloor i/k \rfloor$. Recognize that this batch construction can be written as a block diagonal matrix $A \in \{0,1\}^{N \times N}$ with $A_{i,j} = 1$ if $\lfloor j/k \rfloor = \lfloor i/k \rfloor$. Note that this sequential constraint is not restrictive, as it accommodates all possible batch assignments on $X, Y$ with the appropriate permutation $\pi \in \Pi_N$. With a fixed sequential batching, we can rewrite the minimizer of the upper bound on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$ from Theorems 3.6 and 3.7 as an optimization problem over permutations $\pi \in \Pi_N$ on $X, Y$ rather than explicitly over $\{B\}$. These optimization problems take the form of the Quadratic Bottleneck Assignment Problem and the Quadratic Assignment Problem (Koopmans & Beckmann, 1957). These are well-known NP-Hard combinatorial optimization problems, and in the following two sections we will discuss their formulation and efficient approximations.
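As a small PyTorch sketch of the construction above (our own illustration), the batch-assignment matrix $A$ is block diagonal, and the pairwise terms that appear in the training loss under a permutation are exactly the entries selected elementwise by $A$ after the similarity matrix has its rows and columns reordered:

```python
import torch

N, k = 12, 4
idx = torch.arange(N)
# Block-diagonal batch-assignment matrix: A_ij = 1 iff floor(i / k) == floor(j / k).
A = (idx[:, None] // k == idx[None, :] // k).float()

# Sequential batching after a row permutation pi: the in-batch inner products are
# the entries of the (symmetric) similarity matrix S, reordered by pi, that A selects.
S = torch.randn(N, N)
S = (S + S.T) / 2
P = torch.eye(N)[torch.randperm(N)]          # permutation matrix pi
in_batch_terms = A * (P @ S @ P.T)
```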
4.2. Bounds related to Quadratic Bottleneck Assignment Problems

The upper bound in Theorem 3.6, for the sum $(\mathcal{L}^{Global}_X - \mathcal{L}^{Train}_X) + (\mathcal{L}^{Global}_Y - \mathcal{L}^{Train}_Y)$, is minimized when the smallest inner product over in-batch negatives is maximized over $\pi \in \Pi_N$. Denote $Z_{ij} \triangleq -\min\{x_i^T y_j,\, y_i^T x_j\}$ and let $\odot$ denote the Hadamard product. This is a QBAP, the proof of which is deferred to Appendix A.1, as we are interested in minimizing the maximum value of the elementwise product of the two symmetric matrices $\pi Z \pi^T$ and $A$.

Theorem 4.1 (Formulation of QBAP for the bound in Theorem 3.6). The following Quadratic Bottleneck Assignment Problem minimizes the upper bound provided in Theorem 3.6, summed over $X$ and $Y$:
$$\min_{\pi \in \Pi_N} \max_{i,j} \big(A \odot \pi Z \pi^T\big)_{i,j}.$$
4.3. Bounds related to Quadratic Assignment Problems

The upper bound in Theorem 3.7, for the sum $(\mathcal{L}^{Global}_X - \mathcal{L}^{Train}_X) + (\mathcal{L}^{Global}_Y - \mathcal{L}^{Train}_Y)$, is minimized when the sum of inner products over in-batch negatives is maximized over permutations $\pi \in \Pi_N$. This can be formulated equivalently as either the Frobenius inner product between the symmetric matrices $\pi(XY^T + YX^T)\pi^T$ and $A$ or the trace of their product. These are QAPs, the proof of which is deferred to Appendix A.2.

Theorem 4.2 (Formulation of QAP for the bound in Theorem 3.7). The following Quadratic Assignment Problem minimizes the upper bound in Theorem 3.7:
$$\max_{\pi \in \Pi_N} \mathrm{Tr}\big(A\,\pi (XY^T + YX^T)\,\pi^T\big).$$
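As a small check (our own illustration, with arbitrary shapes), the trace objective of Theorem 4.2 coincides with the Frobenius inner product between $A$ and $\pi(XY^T + YX^T)\pi^T$ described above:

```python
import torch

N, d, k = 12, 8, 4
X, Y = torch.randn(N, d), torch.randn(N, d)
idx = torch.arange(N)
A = (idx[:, None] // k == idx[None, :] // k).float()   # block-diagonal batch mask
P = torch.eye(N)[torch.randperm(N)]                    # permutation matrix pi

M = P @ (X @ Y.T + Y @ X.T) @ P.T                      # pi (XY^T + YX^T) pi^T, symmetric
trace_obj = torch.trace(A @ M)                         # Theorem 4.2 objective for this pi
frobenius = (A * M).sum()                              # <A, M>_F
assert torch.allclose(trace_obj, frobenius, atol=1e-4)
```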
Heuristics for the QAP and QBAP with $O(N^3)$ and $\tilde{O}(N^2)$ time complexity, respectively, are well known (Kuhn, 1955; Munkres, 1957; Edmonds & Karp, 1972; Jonker & Volgenant, 1988; Cuthill & McKee, 1969). In the next section, we will formulate approximate solutions to the QBAP in Theorem 4.1 with $O(Nk)$ space and $\tilde{O}(N^2)$ time complexity.
5. Global Contrastive Batch Sampling: Efficient approximations to the QBAP with Cuthill-McKee

In practice, when $N$ is large it can be difficult to hold $XY^T$ in memory, and approximation algorithms for the QAP problem (Kuhn, 1955; Munkres, 1957; Edmonds & Karp, 1972; Jonker & Volgenant, 1988) have $O(N^3)$ complexity. Therefore, in optimizing over $\pi \in \Pi_N$ we will make two modifications in order to develop an $O(Nk)$ space and $\tilde{O}(N^2)$ time complexity approximation.
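The released implementation is linked in the abstract; as a hedged sketch of the overall recipe (function and parameter names are ours, and details such as the chunk size and top-k sparsification are illustrative rather than the paper's exact choices), one can keep each sample's k most similar candidates, bandwidth-reduce the resulting sparse graph with SciPy's reverse Cuthill-McKee heuristic, and then batch sequentially under the returned permutation:

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def gcbs_permutation(X: torch.Tensor, Y: torch.Tensor, k: int, chunk: int = 1024) -> torch.Tensor:
    """Sketch of GCBS-style batch ordering.

    X, Y: (N, d) L2-normalized representations from the current checkpoint.
    Returns a permutation of [0, N) used to form sequential batches of size k.
    """
    with torch.no_grad():
        topk_chunks = []
        for start in range(0, X.size(0), chunk):       # chunked top-k keeps O(Nk) storage
            sims = X[start:start + chunk] @ Y.T
            topk_chunks.append(sims.topk(k, dim=1).indices)
        topk = torch.cat(topk_chunks)

    # Sparse similarity graph retaining only each row's k strongest entries.
    rows = np.repeat(np.arange(X.size(0)), k)
    cols = topk.flatten().cpu().numpy()
    data = np.ones_like(rows, dtype=np.float32)
    graph = csr_matrix((data, (rows, cols)), shape=(X.size(0), Y.size(0)))
    graph = graph + graph.T                            # symmetrize the sparsity pattern

    # Reverse Cuthill-McKee reorders rows to cluster nonzeros near the diagonal,
    # so sequential batches under this ordering are rich in hard negatives.
    perm = reverse_cuthill_mckee(graph, symmetric_mode=True)
    return torch.as_tensor(np.asarray(perm, dtype=np.int64))

# Usage: recompute the ordering once per epoch, then feed it to a
# sequential (non-shuffling) sampler with batch size k.
# perm = gcbs_permutation(F.normalize(emb_x, dim=1), F.normalize(emb_y, dim=1), k=64)
```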