AugCSE: Contrastive Sentence Embedding with Diverse Augmentations
Zilu Tang
Boston University
zilutang@bu.edu
Muhammed Yusuf Kocyigit
Boston University
kocyigit@bu.edu
Derry Wijaya
Boston University
wijaya@bu.edu
Abstract
Data augmentation techniques have proven useful in many NLP applications. Most augmentations, however, are task-specific and cannot be used as a general-purpose tool. In our work, we present AugCSE, a unified framework for utilizing diverse sets of data augmentations to achieve a better, general-purpose sentence embedding model. Building upon the latest sentence embedding models, our approach uses a simple antagonistic discriminator that differentiates the augmentation types. With a finetuning objective borrowed from domain adaptation, we show that diverse augmentations, which often lead to conflicting contrastive signals, can be tamed to produce a better and more robust sentence representation. Our methods¹ achieve state-of-the-art results on downstream transfer tasks and perform competitively on semantic textual similarity tasks, using only unsupervised data.

¹ Our code and data can be found at https://github.com/PootieT/AugCSE
1 Introduction
Data augmentation in NLP can be useful in many situations, from low-resource settings, domain adaptation (Wei et al., 2021), and debiasing (Dinan et al., 2020) to improving generalization and robustness (Dhole et al., 2021). In the vision domain, Chen et al. (2020b) show that a diverse set of augmentations can be used to learn a robust, general-purpose representation with contrastive learning.
Similar work in the sentence embedding space (Gao et al., 2021; Chuang et al., 2022) has shown that a single simple augmentation, such as dropout in transformers (Devlin et al., 2019), can be used for a contrastive objective. However, no previous work has thoroughly explored the impact of a diverse set of augmentations with contrastive learning in the sentence embedding space. It is not straightforward to find the best augmentations for contrastive learning across different datasets or tasks (Gao et al., 2021). A single augmentation can instill invariance to a specific aspect of linguistic variability, while naively combining a diverse set of augmentations can lead to contradictory gradients, preventing models from generalizing well (Table 6)². In this work, we present AugCSE (Figure 1), a general approach to select and unify a diverse set of augmentations for the purpose of building a general-purpose sentence embedding.

² Diverse augmentations have been shown to work without a discriminator in vision (Chen et al., 2020b). We believe the difference resides in the much more structured distribution of natural language compared to images.
During training, in addition to using a contrastive loss, we randomly perturb sentences with different augmentations and use a discriminator loss to unify embeddings from the diverse augmentations. In short, our work presents the following key contributions:

• We show that simple data augmentation methods can be used to improve individual tasks while degrading performance on other tasks (due to shifted domain distributions).

• We present a simple discriminator objective that achieves competitive results against state-of-the-art methods on semantic textual similarity (STS) and transfer classification tasks.

• We demonstrate through ablations and visualizations that our model can unify contrasting distributions from diverse augmentations, and that simple rule-based augmentations are sufficient for achieving competitive results.
2 Background and Related Work
2.1 Contrastive learning
Contrastive learning has been shown to provide a clear signal for improving the embedding space, which is crucial for downstream tasks. The goal of contrastive learning is to use similar or dissimilar datapoints to regularize the embedding representation, such that similar datapoints (by human judgment or pre-defined standards) are embedded closer together than dissimilar datapoints.
Figure 1: Overall framework of AugCSE. During training, each input sentence is randomly augmented with one of many augmentation methods. In addition to the contrastive loss from SimCSE, we add an antagonistic discriminator to predict the augmentation performed on the input example.
Recently, many works in vision have used contrastive objectives to obtain SOTA performance on image tasks from classification and detection to segmentation using ImageNet (Deng et al., 2009; Caron et al., 2018; Chen et al., 2020b; He et al., 2020; Caron et al., 2020; Grill et al., 2020; Zbontar et al., 2021; Chen and He, 2021; Bardes et al., 2022). Most similar to our work is SimCLR (Chen et al., 2020b), which uses a diverse set of augmentations as positive contrastive pairs. In SimCLR, however, the procedure for obtaining the best-performing augmentation distribution was not clearly documented. Further, no previous work has investigated whether such an idea would work in the language domain. Our work provides a parallel investigation in NLP, assessing the usefulness of diverse augmentations in improving sentence representations. We also propose methodical procedures and heuristics for how such a set of augmentations can be obtained given an end task.
2.2 Sentence Embedding
Building a general-purpose sentence embedding model is useful for many tasks (Wang et al., 2021a; Izacard et al., 2021; Gao and Callan, 2021; Gao et al., 2021; Chuang et al., 2022; Chang et al., 2021). SBERT (Reimers and Gurevych, 2019) pioneered the effort to improve semantic similarity between sentence embeddings using a siamese network with BERT (Devlin et al., 2019). Finetuned on natural language inference (NLI) datasets (Williams et al., 2018; Bowman et al., 2015), SBERT predicts whether one sentence entails or contradicts another. To tackle the anisotropy of the BERT embedding space (Ethayarajh, 2019), Li et al. (2020) and Su et al. (2021) learn a projection layer that converts BERT embeddings to a Gaussian or zero-mean, fixed-variance space. Following the contrastive learning literature in vision, a few works investigate alternative positives and negatives: from using different layers (Zhang et al., 2020), different models (Carlsson et al., 2020), contrast against a frozen model (Carlsson et al., 2020), and different parts of a document (Giorgi et al., 2021), to next sentences (Neelakantan et al., 2022).
With simplicity in mind, unsupervised SimCSE (Gao et al., 2021) uses the same sentence with independent transformer dropout masks as positives and the rest of the in-batch sentences as negatives, while supervised SimCSE uses NLI entailment sentences as positives and contradictions as negatives. Lastly, the state-of-the-art method DiffCSE (Chuang et al., 2022) adds a discriminative loss similar to the one used in ELECTRA (Clark et al., 2019), the replaced token detection (RTD) loss, to further increase performance. Its discriminator uses the original sentence embedding and a contextually perturbed sentence to predict the token locations at which the two sentences differ. In contrast to DiffCSE, our discriminator predicts the augmentation type, a higher-level task than predicting individual tokens. Additionally, our discriminator is in an antagonistic/adversarial relationship to our model, whereas the ELECTRA-like RTD objective is collaborative in nature.
2.3 NLP Augmentations
NLP augmentations come in roughly three flavors. Rule-based augmentations range from randomly deleting words and swapping word order (Wei and Zou, 2019) to more structurally-sound or semantically specific transformations (Zhang et al., 2015; Logeswaran et al., 2018). These simple augmentations, however, have been found to be not particularly effective in higher-resource domains for task-agnostic purposes (Longpre et al., 2020; Gao et al., 2021). The second kind of augmentation uses pretrained language models (LMs) to generate semantically similar examples. This area of work includes, but is not limited to, back-translation (Li and Specia, 2019; Sugiyama and Yoshinaga, 2019), paraphrase models (Li et al., 2019, 2018; Iyyer et al., 2018), style transfer models (Fu et al., 2018; Krishna et al., 2020), contextual perturbation models (Morris et al., 2020; Jin et al., 2020), and large-LM-based augmentation (Kumar et al., 2020; Yoo et al., 2021). Lastly, a few methods generate augmentations in the embedding space. These methods often perform interpolation (DeVries and Taylor, 2017; Chen et al., 2020a), noising (Kurata et al., 2016), and autoencoding (Schwartz et al., 2018; Kumar et al., 2019b) with embedded data points. However, due to the discreteness of natural language (Bowman et al., 2016) and anisotropy (Ethayarajh, 2019), the introduced noise often outweighs the benefit of the additional data.

Recently, NL-Augmenter (Dhole et al., 2021) collected over 100 augmentation methods, with the intention of providing robustness diagnostics for NLP models against different types of data perturbations³. In our work, we show that a diverse set of augmentations, even simple rule-based ones, which are cheaper and more controllable than LM-based augmentations, can be used to learn robust general-purpose sentence embeddings.

³ https://github.com/GEM-benchmark/NL-Augmenter
3 Motivation
3.1 Single augmentation is task specific
Augmentations, especially ones that exploit surface-level semantics using simple rules, are task-specific and have been used alone only when the augmentation aligns with the task objective for the dataset (Longpre et al., 2020). For instance, Dinan et al. (2020) change gendered words in a sentence to instill gender invariance for bias mitigation. Inspired by hard negative augmentations in contrastive learning (Gao et al., 2021; Sinha et al., 2020), we use the following case studies to reinforce this conclusion from the perspective of negative data augmentation.
In both scenarios, we use the negative augmentations ($h_i^-$) in the loss (with positive examples $h_i^+$) for the contrastive objective (Gao et al., 2021):

$$-\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} \left( e^{\mathrm{sim}(h_i, h_j^+)/\tau} + e^{\mathrm{sim}(h_i, h_j^-)/\tau} \right)} \quad (1)$$

where $\mathrm{sim}$ is cosine similarity, $\tau$ is the temperature parameter controlling the contrastive strength, and $N$ is the batch size.
Augmentation                        CoLA    trans.
BERT-base                           75.93   84.66
Unsupervised SimCSE-BERT            71.91   85.81
RandomContextualWordAugmentation    78.14   80.51
SentenceSubjectObjectSwitch         76.80   80.31

Augmentation                        ANLI    trans.
BERT-base                           53.80   84.66
Unsupervised SimCSE-BERT            53.42   85.81
AntonymSubstitute                   58.78   79.93
SentenceAdjectivesAntonymsSwitch    58.63   80.11

Table 1: Top negative augmentations for CoLA and ANLI, both measured in accuracy, with average transfer performance (trans.). See augmentation descriptions in Appendix A.2.
Since some augmentations do not have a 100% perturbation rate, we remove datapoints that do not have a successful negative augmentation. For the remaining datapoints, we use the original sentences as positives and train with the different augmentations as negatives. In addition, we also report average transfer task performance (Conneau and Kiela, 2018) as a metric for embedding quality (trans., detailed in Sec 5).
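To make the objective concrete, Equation 1 can be sketched in PyTorch as follows. This is a minimal illustration; the function name and batch layout are assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_negatives(h, h_pos, h_neg, tau=0.05):
    """InfoNCE loss with hard negatives, as in Equation 1.

    h, h_pos, h_neg: (N, d) tensors of embeddings for the original
    sentences, their positives, and their negative augmentations.
    """
    # All-pairs cosine similarities, scaled by the temperature tau.
    sim_pos = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1) / tau  # (N, N)
    sim_neg = F.cosine_similarity(h.unsqueeze(1), h_neg.unsqueeze(0), dim=-1) / tau  # (N, N)
    # Cross entropy against the diagonal reproduces Eq. 1: the numerator is
    # sim(h_i, h_i^+), and the denominator sums over all in-batch positives
    # and negatives.
    logits = torch.cat([sim_pos, sim_neg], dim=1)  # (N, 2N)
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(logits, labels)
```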
Case study 1: linguistic acceptability. We first test embedding performance on CoLA (Warstadt et al., 2018), a binary sentence classification task predicting linguistic acceptability. If an augmentation frequently introduces grammatical errors, it should perform well as a negative.
Case study 2: contradiction vs. entailment. Natural language inference (NLI) datasets (Bowman et al., 2015; Williams et al., 2018) provide triplets of sentences: a hypothesis, a sentence entailing it, and a sentence contradicting it. A good embedding should place the entailment sentence closer to the hypothesis than the contradiction sentence; in fact, that is the exact hypothesis exploited by supervised SimCSE. We calculate the similarity between the hypothesis and an entailment sentence and the similarity between the hypothesis and a contradiction sentence, and count how often the former is larger than the latter in ANLI (Nie et al., 2020). If an augmentation can reverse the semantics of sentences, it should perform well as a negative.
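The counting procedure can be sketched as follows; the `encode` wrapper is an assumed helper that maps a list of sentences to embedding vectors.

```python
import torch.nn.functional as F

def entailment_wins_rate(encode, hypotheses, entailments, contradictions):
    """Fraction of NLI triplets where sim(hypothesis, entailment) exceeds
    sim(hypothesis, contradiction). `encode` is an assumed wrapper that
    maps a list of sentences to an (N, d) tensor of embeddings.
    """
    h = encode(hypotheses)
    e = encode(entailments)
    c = encode(contradictions)
    sim_e = F.cosine_similarity(h, e, dim=-1)
    sim_c = F.cosine_similarity(h, c, dim=-1)
    return (sim_e > sim_c).float().mean().item()
```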
Insight: As expected (Table 1), augmentations known to introduce many grammatical mistakes, such as RandomContextualWordAugmentation (Zang et al., 2020), perform best on CoLA, and those that reverse semantics, AntonymSubstitute and SentenceAdjectivesAntonymsSwitch, perform well on ANLI.
Trial                                        STS-B
unsupervised SimCSE                          81.18
supervised SimCSE                            85.64
no contradiction                             83.60
contradiction as pos                         79.55
contradiction as pos, entailment as neg      67.16
supervised SimCSE w/ ANLI                    75.99

Table 2: Alternative choices of positives and negatives with SimCSE. All results are reproduced by us.
However, single augmentations significantly under-perform on transfer tasks, reducing robustness. This suggests the need for diverse augmentations (Chen et al., 2020b; Ren et al., 2021).
3.2 Difficulty of selecting contrastive pairs
Gao et al. (2021) experimented with a combination of MNLI (Williams et al., 2018) and SNLI (Bowman et al., 2015) and found that using entailments as positives and contradictions as negatives performs well. In addition to this setting, we performed additional ablations showing that it is usually unclear which sentence-pair dataset or augmentation would provide the best contrastive pairs (Table 2). Sometimes, non-intuitive pairs can yield decent results⁴. Together with the specificity of individual augmentations, this motivates a general framework to select and combine multiple augmentations to achieve a robust, general-purpose embedding.

⁴ See more discussion on negation in deep learning in Appendix A.15.
4 Methods
4.1 Augmentation Selection
Dhole et al. (2021) introduced 100+ augmentation methods. We also added non-duplicate augmentation methods from popular repositories: nlpaug, checklist, TextAugment, TextAttack, and TextAutoAugment (Ma, 2019; Ribeiro et al., 2020; Marivate and Sefara, 2020; Morris et al., 2020; Ren et al., 2021), including RandomDeletion, RandomSwap, RandomCrop, RandomWordAugmentation, RandomWordEmbAugmentation, and RandomContextualWordAugmentation⁵.
To narrow down the augmentations we experiment with, we selected single-sentence augmentations labeled as "highly meaning preserving", "possible meaning alteration", or "meaning alteration". After preliminary filtering (Appendix A.3), Table 3 contains all augmentations included in our experiments. To select a diverse set of augmentations for the main results on STS-B and transfer tasks, we trained models using each single augmentation as positives and picked the augmentations that obtained top performance on STS-B and transfer tasks. For full single-augmentation results, see Appendix A.14.

⁵ SimCSE tried RandomDeletion and RandomCrop; DiffCSE tried RandomDeletion and RandomInsertion, and their RTD is based on RandomContextualWordAugmentation.
4.2 Augmentation Sampling
To save computation and control for randomness, we augment the training dataset once for every augmentation and cache the results. Prior to training, augmentations are read from the caches and uniformly sampled at each data point. Since not every augmentation perturbs the original sentence at every data point, we correct the augmentation label to "no augmentation" whenever the augmented sentence is identical to the original. This leads to a larger portion of sentences having the label "no augmentation" than any individual augmentation⁶.

⁶ We also tried resampling augmentations between epochs and found that to underperform fixed sampling.
4.3 Model Architecture
In our experiments, we train sentence embedding encoders using BERT- and RoBERTa-base for fair comparison to previous methods, SimCSE and DiffCSE. During training, we pass sentence representations through a 2-layer projection layer with batch norm, introduced by DiffCSE. We remove the projection layers during inference and obtain sentence embeddings directly from the encoder. Formally, we train with the contrastive loss shown in the equation at the top right of Figure 1; we refer to this loss as $\mathcal{L}_{contrastive}$. We use the embedding corresponding to the [CLS] token as the sentence embedding in all experiments.
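A sketch of the train-time projection head follows; the 768-dimensional hidden size of BERT/RoBERTa-base and the ReLU placement are assumptions, not the exact DiffCSE configuration.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Train-time 2-layer projection with batch norm over the [CLS]
    embedding. A sketch: layer sizes and activation are assumptions.
    The head is discarded at inference, where the encoder's [CLS]
    embedding is used directly as the sentence embedding.
    """
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, cls_embedding):
        return self.net(cls_embedding)
```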
Contrastive loss regularizes at the individual data-pair level, which is too strict a constraint to resolve the distributional shifts that augmentations introduce. To train sentence encoders that are invariant to the shifts between diverse augmentations, we introduce an antagonistic discriminator. We pass the concatenated embeddings of the original and augmented sentences into the discriminator (code in Appendix A.5), trained with the $\mathcal{L}_{discriminator}$ loss, defined as the binary cross entropy between predicted and actual augmentations:
$$-\frac{1}{K} \sum_{i=1}^{K} \left[ y_i \log(p(y_i)) + (1 - y_i) \log(1 - p(y_i)) \right] \quad (2)$$
Meaning Alteration: SentenceAdjectivesAntonymsSwitch, SentenceAuxiliaryNegationRemoval, SentenceSubjectObjectSwitch, AntonymSubstitute

Possible Meaning Alteration: ReplaceHypernyms, ReplaceHyponyms, CityNamesTransformation, ColorTransformation, Summarization, DiverseParaphrase*, SentenceReordering, TenseTransformation*, RandomDeletion, RandomCrop, RandomSwap*, RandomWordAugmentation, RandomWordEmbAugmentation, RandomContextualWordAugmentation

Highly Meaning Preserving: YodaPerturbation, ContractionExpansions*, DiscourseMarkerSubstitution, Casual2Formal, GenderSwap, GeoNamesTransformation, NumericToWord, SynonymSubstitution

Table 3: Final subsets of augmentations included in experiments. Augmentations in 16-Aug experiments are bolded, 12-Aug experiments are underlined, 8-Aug experiments are colored orange, and 4-Aug experiments are marked with asterisks (*). For full descriptions of augmentations, see Appendix A.2.
where $K$ is the number of augmentation types (plus "no augmentation"), and $p(y_i)$ is the probability of augmentation type $i$ predicted by the discriminator. To encourage an augmentation-invariant encoder, the first layer of the discriminator uses a gradient reversal layer (Ganin and Lempitsky, 2015; Zhu et al., 2015; Ganin et al., 2016) (code in Appendix A.4) that multiplies the gradient by a negative multiplier $\alpha$ in the backward pass, so that while the discriminator is trained to minimize the discriminator loss, the encoder is trained to maximize it, all in one pass. We find this simple scheme to work well without having to deal with the instability of training adversarial networks (Creswell et al., 2018; Clark et al., 2019).
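The paper's own implementation is in its Appendix A.4; a common minimal PyTorch version of such a gradient reversal layer looks like this:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, multiplies the
    gradient by -alpha in the backward pass.
    """
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Upstream gradients are flipped for the encoder; the alpha
        # argument itself receives no gradient.
        return -ctx.alpha * grad_output, None

def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)
```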
Finally, the overall loss of our model (AugCSE) is:

$$\mathcal{L} = \mathcal{L}_{contrastive} + \lambda \cdot \mathcal{L}_{discriminator} \quad (3)$$

where $\lambda$ is a coefficient that tunes the strength of the discriminator loss.
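Putting the pieces together, one training step can be sketched as follows. Helper names are assumptions, the discriminator is assumed to apply gradient reversal internally, and Eq. 2 is mirrored with binary cross entropy over one-hot labels.

```python
import torch
import torch.nn.functional as F

def augcse_step(encoder, discriminator, originals, augmenteds, aug_labels,
                num_aug_types, lam=1.0, tau=0.05):
    """One AugCSE training step (a sketch; helper names are assumptions).

    Augmented sentences serve as positives with in-batch negatives; the
    discriminator predicts the augmentation type from concatenated
    embeddings (Eq. 2), and the losses are combined as in Eq. 3.
    """
    h = encoder(originals)       # (N, d) [CLS] embeddings
    h_aug = encoder(augmenteds)  # (N, d)

    # Contrastive term: each original should be closest to its own augmentation.
    sim = F.cosine_similarity(h.unsqueeze(1), h_aug.unsqueeze(0), dim=-1) / tau
    l_contrastive = F.cross_entropy(sim, torch.arange(h.size(0), device=h.device))

    # Adversarial discriminator term, written as binary cross entropy over
    # one-hot augmentation labels to mirror Eq. 2.
    logits = discriminator(torch.cat([h, h_aug], dim=-1))  # (N, K)
    target = F.one_hot(aug_labels, num_aug_types).float()
    l_discriminator = F.binary_cross_entropy_with_logits(logits, target)

    return l_contrastive + lam * l_discriminator
```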
5 Experiments
5.1 Evaluation Datasets
For fair comparison, we use the same dataset SimCSE used: 1M sentences randomly selected from Wikipedia. After training, we use the frozen embeddings to evaluate our method on 7 semantic textual similarity (STS) tasks and 7 (SentEval) transfer tasks (Conneau and Kiela, 2018). The STS tasks include STS 2012-2016 (Agirre et al., 2016), STS-Benchmark (Cer et al., 2017), and SICK-Relatedness (Marelli et al., 2014). In the STS tasks, the Spearman correlation is calculated between the model's embedding similarity for each pair of sentences and the human ratings (1-5). The transfer tasks are single-sentence classification tasks from SentEval: MR (Pang and Lee, 2005), CR (Hu and Liu, 2004), MPQA (Wiebe et al., 2005), MRPC (Dolan and Brockett, 2005), TREC (Voorhees and Tice, 2000), SST-2 (Socher et al., 2013), and SUBJ (Pang and Lee, 2004). We follow the standard evaluation setup from Conneau and Kiela (2018), training a logistic regression classifier on top of the frozen sentence embeddings. See Appendix A.6 for details on the hyperparameter search.
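For reference, a minimal sketch of the STS scoring protocol, assuming an `encode` function that maps sentences to frozen embeddings:

```python
import torch.nn.functional as F
from scipy.stats import spearmanr

def sts_spearman(encode, sents_a, sents_b, gold_scores):
    """Spearman correlation between embedding cosine similarities and
    human ratings, as in the STS evaluation (a sketch; `encode` is an
    assumed function returning an (N, d) tensor of frozen embeddings).
    """
    emb_a, emb_b = encode(sents_a), encode(sents_b)
    sims = F.cosine_similarity(emb_a, emb_b, dim=-1).cpu().numpy()
    corr, _ = spearmanr(sims, gold_scores)
    return corr
```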
5.2 Evaluation Baselines
We include several levels of baselines, from word-averaged GloVe embeddings (Pennington et al., 2014) to BERT-base, using both average pooling and the [CLS] token. We include the post-processing methods BERT-flow (Li et al., 2020) and BERT-whitening (Su et al., 2021), as well as other recent contrastive sentence embedding methods: CT-BERT (Carlsson et al., 2020), SG-OPT (Kim et al., 2021), SimCSE (Gao et al., 2021), and DiffCSE (Chuang et al., 2022). We also report results from DeCLUTR (Giorgi et al., 2021) and cpt-text-S (Neelakantan et al., 2022) as a comparison of what larger models and larger training data sizes can provide. More specifically, DeCLUTR mines positives from documents, and cpt-text-S uses the next sentence as positives.
5.3 STS Results
We show STS test results in Table 4. AugCSE performs competitively against SOTA methods with both BERT and RoBERTa. AugCSE also outperforms larger models trained with more data (DeCLUTR and cpt-text-S). We discuss this in Sec 7.
5.4 Transfer Tasks Results
We show transfer task test-set results in Table 5. With BERT-base, AugCSE outperforms DiffCSE in average transfer score and improves on 4 out of 7 SentEval tasks. With RoBERTa-base, we still see competitive performance. Here, larger models with more training data outperform existing methods.