Global Selection of Contrastive Batches via Optimization on Sample Permutations

Vin Sachidananda¹, Ziyi Yang², Chenguang Zhu²
Abstract
Contrastive Learning has recently achieved state-of-the-art performance in a wide range of unimodal and multimodal tasks. Many contrastive learning approaches use mined hard negatives to make batches more informative during training, but these approaches are inefficient as they increase epoch length proportional to the number of mined negatives and require frequent updates of nearest neighbor indices or mining from recent batches. In this work, we provide an alternative to hard negative mining, Global Contrastive Batch Sampling (GCBS), an efficient approximation to the batch assignment problem that upper bounds the gap between the global and training losses, $\mathcal{L}_{Global} - \mathcal{L}_{Train}$, in contrastive learning settings. Through experimentation we find that GCBS improves state-of-the-art performance in sentence embedding and code-search tasks. Additionally, GCBS is easy to implement as it requires only a few additional lines of code, does not maintain external data structures such as nearest neighbor indices, is more computationally efficient than the most minimal hard negative mining approaches, and makes no changes to the model being trained. Code is available at https://github.com/vinayak1/GCBS.
1. Introduction
Contrastive Learning is used ubiquitously in training large representation models, such as transformers, and has been shown to achieve state-of-the-art performance in a wide range of unimodal and multimodal tasks across language, vision, code, and audio (Chen et al., 2020a; Gao et al., 2021; Yuxin Jiang & Wang, 2022; Guo et al., 2022; Yu et al., 2022; Radford et al., 2021; Ramesh et al., 2021; Saeed et al., 2021; Yang et al., 2021).

¹Stanford University and Two Sigma Ventures. ²Knowledge and Language Team, Azure Cognitive Services Research, Microsoft Research, Redmond, WA. Correspondence to: Vin Sachidananda <vinsachida@gmail.com>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

In supervised contrastive learning, one is given a paired dataset $(X, Y)$, each with $N$ samples, where $x_i$ and $y_i$ are semantically related, such as similar sentences, code and corresponding language descriptors, or images and their captions. For unsupervised settings, $X, Y$ are alternative "views" of the same sample, often constructed through data augmentation schemes or independent dropout masks (Chen et al., 2020a; Gao et al., 2021). Then batches of rows, $B$, are sampled from this pair of datasets and a model $f(\cdot)$ is trained to concurrently maximize inner products for outputs of similar (positive) data inputs, $f(x_i)^T f(y_i)$, and minimize inner products for outputs of dissimilar (negative) data inputs, $f(x_i)^T f(y_j)$ for $i, j \in B$, $i \neq j$.

Due to batch size constraints from hardware limitations, for a fixed batch size $k$, only $Nk$ inner products of the total $N^2$ in $f(X)f(Y)^T$ are observed in the training loss for each epoch of training. Through the rest of this paper, we will refer to this observed training loss over $Nk$ inner products as $\mathcal{L}_{Train}$ and the total loss over $N^2$ inner products as $\mathcal{L}_{Global}$.
It has been observed, both in contrastive metric and representation learning (Saunshi et al., 2019; Iscen et al., 2018; Xuan et al., 2020; Mishchuk et al., 2017; Wu et al., 2017; Song et al., 2016; Schroff et al., 2015; Harwood et al., 2017; Ge et al., 2018), that in order for batches to be informative during training, they should be constructed to contain "hard negatives", i.e., pairs with large values of $f(x_i)^T f(y_j)$, $i \neq j$. Additionally, it has been shown that including hard negatives in batches better approximates global losses (Zhang & Stratos, 2021).
Currently, approaches for constructing batches, and controlling which inner products of the total $N^2$ should be used for training, broadly fall into one of two categories. One either uses random sampling or mines nearest neighbors of the reference sample $x_i$ in order to greedily insert hard negatives into the same batch as $x_i$. While greedily inserting hard negatives is effective in practice (Zhang et al., 2018; Xiong et al., 2021), these methods incur large costs both in time and resources: mining $l < k$ hard negatives per reference sample increases each training epoch by a factor of $l$ and often requires maintaining and reranking nearest neighbor indices on expensive accelerated hardware during training.
For instance, if 5 hard negatives from $Y$ are mined for each sample in $X$ during batch construction, one will increase the training time of a single epoch by a factor of 5, not including the time taken for constructing nearest neighbor indices.
Furthermore, hard negative mining often requires frequent reranking to prevent negative anchors from being sampled from stale nearest neighbor indices. Work on momentum-based memory banks has found that hard negative mining is especially useful with small lookback intervals (i.e., 2-4 previous batches) (Wang et al., 2021). In this paper, we propose a global alternative to hard negative mining, Global Contrastive Batch Sampling (GCBS), which seeks to efficiently learn a permutation over samples in $X$ and $Y$ to increase the likelihood of hard negatives before each epoch rather than through greedy insertion during training. In Figure 1, we visually depict $\mathcal{L}_{Global}$ and $\mathcal{L}_{Train}$ along with the modifications on batches, and therefore the observed loss, for our proposed approach GCBS.
First, we show theoretically that the upper bound on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$, with no oversampling or assumptions on the data or model, for commonly used scaled cross entropy losses such as NT-Xent (Sohn, 2016), depends only on the batch assignments, the total number of samples $N$, and the batch size $k$. We prove that, for fixed $N, k$, this upper bound is minimized as a Quadratic Bottleneck Assignment Problem which seeks to maximize the number of hard negatives in batches by learning a permutation $\pi \in \Pi_N$ on the rows of $X$ and $Y$. We then formulate an $\tilde{O}(N^2)$ approximation for optimizing over this permutation, GCBS, and show that it is more efficient per training epoch than any hard negative mining approach, even for $l = 1$. We analyze the loss behavior of GCBS and show that GCBS better approximates the total contrastive loss. Lastly, we empirically evaluate GCBS in the context of supervised contrastive finetuning for sentence embedding (STS) and code search (CosQA, AdvTest, CodeSearchNet) and achieve state-of-the-art performance on all of these tasks.
In this work, we summarize our contributions as follows:
1. We prove that the upper bound of the gap between the total and observed losses in contrastive learning for a fixed batch size $k$ without oversampling, $\mathcal{L}_{Global} - \mathcal{L}_{Train}$, is constrained by a Quadratic Bottleneck Assignment Problem and can be relaxed to a Matrix Bandwidth Minimization problem.

2. We formulate an $\tilde{O}(N^2)$ time and $O(Nk)$ space complexity approximation to the Matrix Bandwidth Minimization problem, GCBS, using the Cuthill-McKee heuristic and implement this algorithm in less than 50 lines of PyTorch.

3. We analyze the loss behavior of GCBS and show that, in sentence embedding and code-search tasks, GCBS better approximates the total contrastive loss.

4. We empirically evaluate GCBS and achieve state-of-the-art performance on the STS taskset for sentence embeddings. Additionally, we achieve state-of-the-art performance on the CosQA, AdvTest, and CodeSearchNet tasks for joint programming language-natural language embeddings.
The rest of this paper is organized as follows. In Section 2, we discuss related work. In Section 3, we derive upper bounds on the gap between total and observed losses, and in Section 4 we formulate these bounds as Quadratic Assignment Problems. In Section 5, we relax our QBAP to a Matrix Bandwidth Minimization problem, introduce our proposed method, GCBS, for approximating global contrastive losses, and provide implementation details. In Section 6, we provide experimental results for sentence embedding and code search tasks using GCBS. Section 7 provides discussion and Section 8 concludes the paper.
2. Related Work
2.1. Contrastive Representation Learning
Contrastive Learning has been used ubiquitously for vision, language, and audio representation learning. In vision tasks, SimCLR (Chen et al., 2020a) showed that using augmented views of the same image as positive samples with the NT-Xent objective (Sohn, 2016) improves performance of unsupervised classification. MoCo (He et al., 2020; Chen et al., 2020b) used memory banks of negative samples from recent batches to increase the effective contrastive batch size, and Sylvain et al. (2020); Khosla et al. (2020) show improvements using supervised contrastive frameworks. For sentence embedding tasks, contrastive learning has been used in pretraining (Logeswaran & Lee, 2018), finetuning, and continuous prompt tuning settings (Gao et al., 2021; Yuxin Jiang & Wang, 2022) to provide state-of-the-art performance. Additionally, contrastive learning has been used extensively to align representations across different modalities for downstream use in multimodal tasks such as those involving language/code, language/vision, and vision/decision making (Guo et al., 2022; Feng et al., 2020; Guo et al., 2021; Radford et al., 2021; Ramesh et al., 2021; Laskin et al., 2020).
2.2. Hard Negative Mining in Metric and Contrastive Learning
Selection of hard negatives during batch construction is well-studied and has been shown, both theoretically and empirically, to improve metric and contrastive learning (Saunshi et al., 2019; Iscen et al., 2018; Xuan et al., 2020; Mishchuk et al., 2017; Wu et al., 2017). Prior work in metric learning (Song et al., 2016; Schroff et al., 2015; Harwood et al., 2017; Ge et al., 2018) has observed that "hard negatives", or negatives which are difficult to discriminate against with respect to a particular query's embedding, are beneficial for downstream classifier performance. In contrastive learning, Zhang et al. (2018) uses Mixup to generate hard negatives in latent space. Chuang et al. (2020) proposes a debiased contrastive loss which approximates the underlying "true" distribution of negative examples, and Yang et al. (2022) studies the effect of restricting negative sampling to regions around the query using a variational extension to the InfoNCE objective. In Kim et al. (2020); Ho & Nvasconcelos (2020), adversarial examples are used to produce more challenging positives and hard negatives. In Xiong et al. (2021), nearest neighbor indices and a secondary model from prior checkpoints are used to mine hard negatives for text retrieval tasks.

Figure 1. Visualization of inner products of $f(X)f(Y)^T$ in the global loss, in training with random sampling, and in training with permutation-optimized sampling for contrastive losses.
Robinson et al. (2021) reweights negative samples based on their Euclidean distance and debiases positive samples in order to control the level of difficulty in unsupervised contrastive learning. Kalantidis et al. (2020) show that harder negative examples are needed to improve performance and training speed in vision tasks and propose adding "synthetic" hard negatives to training batches using convex combinations of nearest neighbors.
2.3. Quadratic Assignment Problems
The Quadratic Assignment Problem (QAP), stemming from facilities location problems (Koopmans & Beckmann, 1957) in combinatorial optimization, seeks to minimize the total cost of assigning $n$ facilities to $n$ locations. Formally, one seeks to optimize $\min_{\pi \in \Pi_n} \mathrm{Tr}(W \pi D \pi^T)$ over $\Pi_n$, the set of $n \times n$ permutation matrices, for a given cost matrix $W \in \mathbb{R}^{n \times n}$ and distance matrix $D \in \mathbb{R}^{n \times n}$. The Quadratic Bottleneck Assignment Problem (QBAP) (Steinberg, 1961) takes a similar form but minimizes the maximum cost rather than the total cost, $\min_{\pi \in \Pi_n} \max_{i,j} (W \odot \pi D \pi^T)_{i,j}$. The Graph Bandwidth Minimization Problem seeks to minimize the dispersion of nonzero costs from the main diagonal for a sparse distance matrix $D$ and is a special case of the QBAP in which the cost matrix $W$ increases monotonically in $|i - j|$. In this paper, we prove that minimizing the upper bound on the gap between the total and the observed training losses, $\mathcal{L}_{Global} - \mathcal{L}_{Train}$, over a pair of datasets $X, Y$ is bounded by Quadratic Assignment Problems and approximated by a Graph Bandwidth Minimization Problem. This connection is shown visually in Figure 1. As all of the aforementioned problems are NP-Hard, we utilize the Cuthill-McKee algorithm (Cuthill & McKee, 1969), an $\tilde{O}(N^2)$ approximation for bandwidth minimization.
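To make these three objectives concrete, the following toy NumPy sketch (our own illustration; the matrices, sparsity threshold, and permutation are arbitrary, not from the paper) evaluates the QAP trace cost, the QBAP bottleneck cost, and the bandwidth of a reordered sparse pattern for one candidate permutation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))                      # cost matrix W
D = rng.random((n, n))                      # distance matrix D
P = np.eye(n)[rng.permutation(n)]           # permutation matrix pi

qap = np.trace(W @ P @ D @ P.T)             # QAP: total assignment cost Tr(W pi D pi^T)
qbap = (W * (P @ D @ P.T)).max()            # QBAP: bottleneck (maximum) elementwise cost

# Bandwidth of a sparse symmetric pattern after reordering by pi:
# the largest |i - j| over its nonzero entries.
S = (rng.random((n, n)) < 0.3).astype(float)
S = np.maximum(S, S.T)                      # symmetrize the sparsity pattern
i, j = np.nonzero(P @ S @ P.T)
bandwidth = int(np.abs(i - j).max()) if i.size else 0
print(f"QAP={qap:.3f}  QBAP={qbap:.3f}  bandwidth={bandwidth}")
```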
3. Global and Training Losses for Cross-Entropy based Contrastive Objectives
In this section, we characterize the gap between training and global losses in supervised contrastive learning for the Normalized Temperature-scaled Cross Entropy (NT-Xent) loss (Sohn, 2016). The NT-Xent loss is a ubiquitous contrastive objective used in state-of-the-art models for sentence embedding, code-language tasks, and vision-language tasks (Gao et al., 2021; Yuxin Jiang & Wang, 2022; Guo et al., 2022; Chen et al., 2020a; Radford et al., 2021).

Let $X, Y \in \mathbb{R}^{N \times d}$ be the output representations for two paired datasets, each with $N$ samples. Consider a supervised setting where $(x_i, y_i)$ are considered "positive" pairs and $(x_i, y_j)$ are considered negative pairs for $i \neq j$. Note that the analysis we provide can be modified to incorporate multiple positives, such as those from class information, which will have tighter bounds in terms of the number of samples $N$ and the batch size $k$. Additionally, let $\tau \in \mathbb{R}^+$ be a tunable temperature parameter which scales logit values in the objective. When $\tau$ is small, the NT-Xent loss is a proxy for the hard maximum loss. Assume that all representations have been normalized, that is, $\|x_i\|_2 = \|y_i\|_2 = 1$ for all $i$.
3.1. The Global and Training NT-Xent objectives, $\mathcal{L}_{Global}$ and $\mathcal{L}_{Train}$
First, we provide the contrastive loss over all $N^2$ pairs of inner products between $X$ and $Y$. We call this the global objective as it contains all pairwise contrastive information and, in the absence of resource constraints, is the loss one would seek to minimize. The Global NT-Xent objective is given as follows:

Definition 3.1 (Global NT-Xent objective).
$$\mathcal{L}_{Global} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(x_i^T y_i \tau^{-1})}{\sum_{j=1}^{N}\exp(x_i^T y_j \tau^{-1})} = \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\sum_{j=1}^{N}\exp(x_i^T y_j \tau^{-1})\Big].$$
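As a minimal PyTorch sketch (our own, assuming $\ell_2$-normalized representations of shape $N \times d$; names and the default temperature are illustrative), Definition 3.1 is a cross-entropy over the full $N \times N$ similarity matrix with the diagonal entries as targets:

```python
import torch
import torch.nn.functional as F

def global_nt_xent(X: torch.Tensor, Y: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Global NT-Xent loss over all N^2 inner products (Definition 3.1).

    X, Y: (N, d) L2-normalized representations; row i of X is the positive
    pair of row i of Y, and every other row is a negative.
    """
    logits = (X @ Y.T) / tau                        # all N x N scaled inner products
    targets = torch.arange(X.size(0), device=X.device)
    return F.cross_entropy(logits, targets)         # -1/N * sum_i log softmax_i at (i, i)
```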
Due to memory constraints, during training one does not make all $N^2$ comparisons over pairs in $X$ and $Y$ during a training epoch. Instead, each sample $x_i$ is only contrasted against $k$ in-batch samples in $Y$: its positive anchor $y_i$ and $k - 1$ negative anchors. This observed training loss will be strictly less than the global loss as it makes only $k$ comparisons out of $N$ total for each sample. For a fixed batch assignment $B$, let $B_i$ be the indices of rows in $Y$ contained in a batch with $x_i$. The Training NT-Xent objective is given as follows:

Definition 3.2 (Training NT-Xent objective).
$$\mathcal{L}_{Train} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(x_i^T y_i \tau^{-1})}{\sum_{j \in B_i}\exp(x_i^T y_j \tau^{-1})} = \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\sum_{j \in B_i}\exp(x_i^T y_j \tau^{-1})\Big].$$
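Under the same assumptions, a sketch of Definition 3.2 with sequential batches of size $k$ (so $B_i = \{j : \lfloor j/k \rfloor = \lfloor i/k \rfloor\}$); comparing it to the global sketch above gives the observable gap $\mathcal{L}_{Global} - \mathcal{L}_{Train}$. All names are ours.

```python
import torch
import torch.nn.functional as F
# global_nt_xent as defined in the previous sketch.

def train_nt_xent(X: torch.Tensor, Y: torch.Tensor, k: int, tau: float = 0.05) -> torch.Tensor:
    """Training NT-Xent loss (Definition 3.2): each x_i is contrasted only
    against the k in-batch rows of Y under sequential batching."""
    losses = []
    for start in range(0, X.size(0), k):
        xb, yb = X[start:start + k], Y[start:start + k]
        logits = (xb @ yb.T) / tau                  # k x k in-batch inner products
        targets = torch.arange(xb.size(0), device=X.device)
        losses.append(F.cross_entropy(logits, targets, reduction="sum"))
    return torch.stack(losses).sum() / X.size(0)

# The observed gap is always nonnegative: each row's global denominator in
# Definition 3.1 dominates its in-batch denominator in Definition 3.2.
N, d, k = 1024, 32, 64
X = F.normalize(torch.randn(N, d), dim=1)
Y = F.normalize(torch.randn(N, d), dim=1)
gap = global_nt_xent(X, Y) - train_nt_xent(X, Y, k)
```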
3.2. Minimizing the gap between $\mathcal{L}_{Global}$ and $\mathcal{L}_{Train}$

For a fixed set of batches $B$, we will first provide upper bounds on $\mathcal{L}_{Global}$ and lower bounds on $\mathcal{L}_{Train}$ using Log-Sum-Exp properties (Calafiore & El Ghaoui, 2014). Using the upper bound for Log-Sum-Exp, the following bound on $\mathcal{L}_{Global}$ can be obtained, where equality is attained when all inner products have the same value.
3.2.1. Upper Bound on $\mathcal{L}_{Global}$

Lemma 3.3 (Upper bound on $\mathcal{L}_{Global}$). Using Log-Sum-Exp properties (Calafiore & El Ghaoui, 2014), $\mathcal{L}_{Global}$ with the NT-Xent contrastive objective can be upper bounded as:
$$\mathcal{L}_{Global} = \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\sum_{j=1}^{N}\exp(x_i^T y_j \tau^{-1})\Big]$$
$$\leq \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\big(N \max_j \exp(x_i^T y_j \tau^{-1})\big)\Big]$$
$$= \frac{1}{N}\sum_{i=1}^{N}\tau^{-1}\big(-x_i^T y_i + \max_j x_i^T y_j\big) + \log N.$$
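A quick numerical sanity check (our own illustration) of the Log-Sum-Exp upper bound used above, $\log \sum_{j=1}^{N} \exp(z_j) \le \max_j z_j + \log N$:

```python
import torch

z = torch.randn(1000)                        # arbitrary scaled logits z_j = x_i^T y_j / tau
lse = torch.logsumexp(z, dim=0)
upper = z.max() + torch.log(torch.tensor(float(z.numel())))
assert lse <= upper                          # equality holds only when all z_j are equal
```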
3.2.2. Lower Bounds on $\mathcal{L}_{Train}$

Two lower bounds can be derived for $\mathcal{L}_{Train}$, first using the translation identity property (Nielsen & Sun, 2016) and then using the standard lower bound for Log-Sum-Exp (Calafiore & El Ghaoui, 2014).

Lemma 3.4 (First lower bound on $\mathcal{L}_{Train}$ using the translation identity). Using the Log-Sum-Exp translation identity property (Nielsen & Sun, 2016), $\mathcal{L}_{Train}$ with the NT-Xent contrastive objective can be bounded as:
$$\mathcal{L}_{Train} = \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\sum_{j \in B_i}\exp(x_i^T y_j \tau^{-1})\Big]$$
$$\geq \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\big(k \min_{j \in B_i} \exp(x_i^T y_j \tau^{-1})\big)\Big]$$
$$= \frac{1}{N}\sum_{i=1}^{N}\tau^{-1}\big(-x_i^T y_i + \min_{j \in B_i} x_i^T y_j\big) + \log k.$$
Lemma 3.5 (Second lower bound on $\mathcal{L}_{Train}$ using the standard Log-Sum-Exp bound). Using Log-Sum-Exp properties (Calafiore & El Ghaoui, 2014), $\mathcal{L}_{Train}$ with the NT-Xent contrastive objective can be bounded as:
$$\mathcal{L}_{Train} = \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\sum_{j \in B_i}\exp(x_i^T y_j \tau^{-1})\Big]$$
$$\geq \frac{1}{N}\sum_{i=1}^{N}\Big[-x_i^T y_i \tau^{-1} + \log\big(\max_{j \in B_i} \exp(x_i^T y_j \tau^{-1})\big)\Big]$$
$$= \frac{1}{N}\sum_{i=1}^{N}\tau^{-1}\big(-x_i^T y_i + \max_{j \in B_i} x_i^T y_j\big).$$
3.2.3. Upper Bounds on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$

We can now bound the gap between the global and training losses of the NT-Xent objective for a fixed set of batches $B$. The diagonal terms are included in both the global and training losses and will therefore not factor into characterizing the gap.

Theorem 3.6 (First upper bound on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$). An upper bound on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$ for the NT-Xent objective, using the lower bound from Lemma 3.4, is:
$$\mathcal{L}_{Global} - \mathcal{L}_{Train} \leq \frac{1}{N}\sum_{i=1}^{N}\tau^{-1}\Big(\max_j x_i^T y_j - \min_{j \in B_i} x_i^T y_j\Big) + \log\frac{N}{k}.$$
Theorem 3.7 (Second upper bound on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$). An upper bound on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$ for the NT-Xent objective, using the lower bound from Lemma 3.5, is:
$$\mathcal{L}_{Global} - \mathcal{L}_{Train} \leq \frac{1}{N}\sum_{i=1}^{N}\tau^{-1}\Big(\max_j x_i^T y_j - \max_{j \in B_i} x_i^T y_j\Big) + \log N.$$
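Continuing the two sketches above (same hypothetical helper functions, random unit embeddings, and sequential batches), both gap bounds can be checked numerically:

```python
import torch
import torch.nn.functional as F
# global_nt_xent and train_nt_xent as defined in the sketches above.

tau, N, d, k = 0.05, 1024, 32, 64
X = F.normalize(torch.randn(N, d), dim=1)
Y = F.normalize(torch.randn(N, d), dim=1)
gap = global_nt_xent(X, Y, tau) - train_nt_xent(X, Y, k, tau)

S = X @ Y.T                                              # all N x N inner products
batch = torch.arange(N) // k                             # sequential batch ids
in_batch = batch[:, None] == batch[None, :]
row_max = S.max(dim=1).values
min_in_batch = S.masked_fill(~in_batch, float("inf")).min(dim=1).values
max_in_batch = S.masked_fill(~in_batch, float("-inf")).max(dim=1).values

bound_36 = ((row_max - min_in_batch) / tau).mean() + torch.log(torch.tensor(N / k))
bound_37 = ((row_max - max_in_batch) / tau).mean() + torch.log(torch.tensor(float(N)))
assert gap <= bound_36 + 1e-4                            # Theorem 3.6
assert gap <= bound_37 + 1e-4                            # Theorem 3.7
```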
4. Minimizing $\mathcal{L}_{Global} - \mathcal{L}_{Train}$ as Quadratic Assignment over Batches

Note that Theorems 3.6 and 3.7 bound the gap $\mathcal{L}_{Global} - \mathcal{L}_{Train}$ without making any assumptions on the data distribution or model. Additionally, we can see that the bounds depend only on the batch assignments $j \in B_i$, the batch size $k$, and the total number of samples $N$. Note that the losses we have characterized have been summed over samples in $X$, which we will denote as $\mathcal{L}^{Global}_X, \mathcal{L}^{Train}_X$; we analogously consider losses summed over samples in $Y$ as $\mathcal{L}^{Global}_Y, \mathcal{L}^{Train}_Y$.
4.1. Batch assignment as optimization over row permutations

We will now rewrite our optimization problems over permutations $\pi \in \Pi_N$ instead of sets of batch assignments $\{B\}$, an equivalent formulation. First, recognize that our bounds depend only on the batch assignments of negatives, $j \in B_i$. Without loss of generality, assume that batches are constructed sequentially after applying a row permutation $\pi \in \Pi_N$ to $X$ and $Y$. That is, batches are constructed over $\pi(X), \pi(Y)$ such that $j \in B_i \iff \lfloor j/k \rfloor = \lfloor i/k \rfloor$. Recognize that this batch construction can be written as a block diagonal matrix $A \in \{0,1\}^{N \times N}$ with $A_{i,j} = 1$ if $\lfloor j/k \rfloor = \lfloor i/k \rfloor$. Note that this sequential constraint is not restrictive, as it accommodates all possible batch assignments on $X, Y$ with the appropriate permutation $\pi \in \Pi_N$. With a fixed sequential batching, we can rewrite the minimizer of the upper bound on $\mathcal{L}_{Global} - \mathcal{L}_{Train}$ from Theorems 3.6 and 3.7 as an optimization problem over permutations $\pi \in \Pi_N$ on $X, Y$ rather than explicitly over $\{B\}$. These optimization problems take the form of the Quadratic Bottleneck Assignment Problem and the Quadratic Assignment Problem (Koopmans & Beckmann, 1957). These are well-known NP-Hard combinatorial optimization problems, and in the following two sections we will discuss their formulation and efficient approximations.
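As a small PyTorch sketch of the construction above (our own illustration), the batch-assignment matrix $A$ is block diagonal, and the pairwise terms that appear in the training loss under a permutation are exactly the entries selected elementwise by $A$ after the similarity matrix has its rows and columns reordered:

```python
import torch

N, k = 12, 4
idx = torch.arange(N)
# Block-diagonal batch-assignment matrix: A_ij = 1 iff floor(i / k) == floor(j / k).
A = (idx[:, None] // k == idx[None, :] // k).float()

# Sequential batching after a row permutation pi: the in-batch inner products are
# the entries of the (symmetric) similarity matrix S, reordered by pi, that A selects.
S = torch.randn(N, N)
S = (S + S.T) / 2
P = torch.eye(N)[torch.randperm(N)]          # permutation matrix pi
in_batch_terms = A * (P @ S @ P.T)
```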
4.2. Bounds related to Quadratic Bottleneck Assignment Problems

The upper bound in Theorem 3.6, for the sum $(\mathcal{L}^{Global}_X - \mathcal{L}^{Train}_X) + (\mathcal{L}^{Global}_Y - \mathcal{L}^{Train}_Y)$, is minimized when the smallest inner product over in-batch negatives is maximized over $\pi \in \Pi_N$. Denote $Z_{ij} \triangleq -\min\{x_i^T y_j,\, y_i^T x_j\}$ and let $\odot$ denote the Hadamard product. This is a QBAP, the proof of which is deferred to Appendix A.1, as we are interested in minimizing the maximum value of the elementwise product of the two symmetric matrices $\pi Z \pi^T$ and $A$.

Theorem 4.1 (Formulation of QBAP for the bound in Theorem 3.6). The following Quadratic Bottleneck Assignment Problem minimizes the upper bound provided in Theorem 3.6, summed over $X$ and $Y$:
$$\min_{\pi \in \Pi_N} \max_{i,j} \big(A \odot \pi Z \pi^T\big)_{i,j}.$$
4.3. Bounds related to Quadratic Assignment Problems

The upper bound in Theorem 3.7, for the sum $(\mathcal{L}^{Global}_X - \mathcal{L}^{Train}_X) + (\mathcal{L}^{Global}_Y - \mathcal{L}^{Train}_Y)$, is minimized when the sum of inner products over in-batch negatives is maximized over permutations $\pi \in \Pi_N$. This can be formulated equivalently as either the Frobenius inner product between the symmetric matrices $\pi(XY^T + YX^T)\pi^T$ and $A$ or the trace of their product. These are QAPs, the proof of which is deferred to Appendix A.2.

Theorem 4.2 (Formulation of QAP for the bound in Theorem 3.7). The following Quadratic Assignment Problem minimizes the upper bound in Theorem 3.7:
$$\max_{\pi \in \Pi_N} \mathrm{Tr}\big(A\,\pi (XY^T + YX^T)\,\pi^T\big).$$
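As a small check (our own illustration, with arbitrary shapes), the trace objective of Theorem 4.2 coincides with the Frobenius inner product between $A$ and $\pi(XY^T + YX^T)\pi^T$ described above:

```python
import torch

N, d, k = 12, 8, 4
X, Y = torch.randn(N, d), torch.randn(N, d)
idx = torch.arange(N)
A = (idx[:, None] // k == idx[None, :] // k).float()   # block-diagonal batch mask
P = torch.eye(N)[torch.randperm(N)]                    # permutation matrix pi

M = P @ (X @ Y.T + Y @ X.T) @ P.T                      # pi (XY^T + YX^T) pi^T, symmetric
trace_obj = torch.trace(A @ M)                         # Theorem 4.2 objective for this pi
frobenius = (A * M).sum()                              # <A, M>_F
assert torch.allclose(trace_obj, frobenius, atol=1e-4)
```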
Heuristics for the QAP and QBAP with $O(N^3)$ and $\tilde{O}(N^2)$ time complexity, respectively, are well known (Kuhn, 1955; Munkres, 1957; Edmonds & Karp, 1972; Jonker & Volgenant, 1988; Cuthill & McKee, 1969). In the next section, we will formulate approximate solutions to the QBAP in Theorem 4.1 with $O(Nk)$ space and $\tilde{O}(N^2)$ time complexity.
5. Global Contrastive Batch Sampling: Efficient approximations to the QBAP with Cuthill-McKee

In practice, when $N$ is large it can be difficult to hold $XY^T$ in memory, and approximation algorithms for the QAP problem (Kuhn, 1955; Munkres, 1957; Edmonds & Karp, 1972; Jonker & Volgenant, 1988) have $O(N^3)$ complexity. Therefore, in optimizing over $\pi \in \Pi_N$ we will make two modifications in order to develop an $O(Nk)$ space and $\tilde{O}(N^2)$ time complexity approximation.
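The released implementation is linked in the abstract; as a hedged sketch of the overall recipe (function and parameter names are ours, and details such as the chunk size and top-k sparsification are illustrative rather than the paper's exact choices), one can keep each sample's k most similar candidates, bandwidth-reduce the resulting sparse graph with SciPy's reverse Cuthill-McKee heuristic, and then batch sequentially under the returned permutation:

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def gcbs_permutation(X: torch.Tensor, Y: torch.Tensor, k: int, chunk: int = 1024) -> torch.Tensor:
    """Sketch of GCBS-style batch ordering.

    X, Y: (N, d) L2-normalized representations from the current checkpoint.
    Returns a permutation of [0, N) used to form sequential batches of size k.
    """
    with torch.no_grad():
        topk_chunks = []
        for start in range(0, X.size(0), chunk):       # chunked top-k keeps O(Nk) storage
            sims = X[start:start + chunk] @ Y.T
            topk_chunks.append(sims.topk(k, dim=1).indices)
        topk = torch.cat(topk_chunks)

    # Sparse similarity graph retaining only each row's k strongest entries.
    rows = np.repeat(np.arange(X.size(0)), k)
    cols = topk.flatten().cpu().numpy()
    data = np.ones_like(rows, dtype=np.float32)
    graph = csr_matrix((data, (rows, cols)), shape=(X.size(0), Y.size(0)))
    graph = graph + graph.T                            # symmetrize the sparsity pattern

    # Reverse Cuthill-McKee reorders rows to cluster nonzeros near the diagonal,
    # so sequential batches under this ordering are rich in hard negatives.
    perm = reverse_cuthill_mckee(graph, symmetric_mode=True)
    return torch.as_tensor(np.asarray(perm, dtype=np.int64))

# Usage: recompute the ordering once per epoch, then feed it to a
# sequential (non-shuffling) sampler with batch size k.
# perm = gcbs_permutation(F.normalize(emb_x, dim=1), F.normalize(emb_y, dim=1), k=64)
```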