AugCSE: Contrastive Sentence Embedding with Diverse Augmentations
Zilu Tang
Boston University
zilutang@bu.edu
Muhammed Yusuf Kocyigit
Boston University
kocyigit@bu.edu
Derry Wijaya
Boston University
wijaya@bu.edu
Abstract
Data augmentation techniques have proven useful in many NLP applications. Most augmentations, however, are task-specific and cannot be used as a general-purpose tool. In our work, we present AugCSE, a unified framework for utilizing diverse sets of data augmentations to achieve a better, general-purpose sentence embedding model. Building upon the latest sentence embedding models, our approach uses a simple antagonistic discriminator that differentiates the augmentation types. With a finetuning objective borrowed from domain adaptation, we show that diverse augmentations, which often lead to conflicting contrastive signals, can be tamed to produce a better and more robust sentence representation. Our methods¹ achieve state-of-the-art results on downstream transfer tasks and perform competitively on semantic textual similarity tasks, using only unsupervised data.

¹ Our code and data can be found at https://github.com/PootieT/AugCSE
1 Introduction
Data augmentation in NLP can be useful in many situations, from low-resource settings, domain adaptation (Wei et al., 2021), and debiasing (Dinan et al., 2020) to improving generalization and robustness (Dhole et al., 2021). In the vision domain, Chen et al. (2020b) show that a diverse set of augmentations can be used to learn a robust, general-purpose representation with contrastive learning.
Similar work in the sentence embedding space (Gao et al., 2021; Chuang et al., 2022) has shown that a single simple augmentation, such as dropout in transformers (Devlin et al., 2019), can be used for a contrastive objective. However, no previous work has thoroughly explored the impact of a diverse set of augmentations with contrastive learning in the sentence embedding space. It is not straightforward to find the best augmentations for contrastive learning across different datasets or tasks (Gao et al., 2021). A single augmentation can instill invariance to a specific aspect of linguistic variability, while naively combining a diverse set of augmentations can lead to contradictory gradients, preventing models from generalizing well (Table 6)². In this work, we present AugCSE (Figure 1), a general approach to select and unify a diverse set of augmentations for the purpose of building a general-purpose sentence embedding.

² Diverse augmentations have been shown to work without a discriminator in vision (Chen et al., 2020b). We believe the difference resides in the much more structured distribution of natural language compared to images.
During training, in addition to using a contrastive loss, we randomly perturb sentences with different augmentations and use a discriminator loss to unify embeddings from the diverse augmentations. In short, our work presents the following key contributions:

• We show that simple data augmentation methods can be used to improve individual tasks while degrading performance on other tasks (due to shifted domain distributions).

• We present a simple discriminator objective that achieves competitive results against state-of-the-art methods on semantic textual similarity (STS) and transfer classification tasks.

• We demonstrate through ablations and visualizations that our model can unify contrasting distributions from diverse augmentations, and that simple rule-based augmentations are sufficient for achieving competitive results.
2 Background and Related Work
2.1 Contrastive learning
Contrastive learning has been shown to provide a clear signal for improving the embedding space, which is crucial for downstream tasks. The goal of contrastive learning is to use similar or dissimilar datapoints to regularize the embedding representation, such that similar datapoints (by human judgment or pre-defined standards) are embedded closer together than dissimilar datapoints.
Figure 1: Overall framework of AugCSE. During training, each input sentence is randomly augmented with one of many augmentation methods. In addition to the contrastive loss from SimCSE, we add an antagonistic discriminator to predict the augmentation performed on the input example.
Recently, many works in vision have used contrastive objectives to obtain SOTA performance on image tasks from classification and detection to segmentation using ImageNet (Deng et al., 2009; Caron et al., 2018; Chen et al., 2020b; He et al., 2020; Caron et al., 2020; Grill et al., 2020; Zbontar et al., 2021; Chen and He, 2021; Bardes et al., 2022). Most similar to our work is SimCLR (Chen et al., 2020b), which uses a diverse set of augmentations as positive contrastive pairs. In SimCLR, however, the procedure for obtaining the best-performing augmentation distribution was not clearly documented. Further, no previous work has investigated whether such an idea would work in the language domain. Our work provides a parallel investigation in NLP, assessing the usefulness of diverse augmentations in improving sentence representations. We also propose methodical procedures and heuristics for how such a set of augmentations can be obtained given an end task.
2.2 Sentence Embedding
Building a general-purpose sentence embedding model is useful for many tasks (Wang et al., 2021a; Izacard et al., 2021; Gao and Callan, 2021; Gao et al., 2021; Chuang et al., 2022; Chang et al., 2021). SBERT (Reimers and Gurevych, 2019) pioneered the effort to improve semantic similarity between sentence embeddings using a siamese network with BERT (Devlin et al., 2019). Finetuned on natural language inference (NLI) datasets (Williams et al., 2018; Bowman et al., 2015), SBERT predicts whether one sentence entails or contradicts another. To tackle the anisotropy of the BERT embedding space (Ethayarajh, 2019), Li et al. (2020) and Su et al. (2021) learn a projection layer that converts BERT embeddings to a Gaussian or zero-mean, fixed-variance space. Following the contrastive learning literature in vision, a few works investigate alternative positives and negatives: from using different layers (Zhang et al., 2020), different models (Carlsson et al., 2020), contrast against a frozen model (Carlsson et al., 2020), and different parts of a document (Giorgi et al., 2021), to next sentences (Neelakantan et al., 2022).
With simplicity in mind, unsupervised SimCSE (Gao et al., 2021) uses the same sentence with independent transformer dropout masks as positives and the rest of the in-batch sentences as negatives, while supervised SimCSE uses NLI entailment sentences as positives and contradictions as negatives. Lastly, the state-of-the-art method DiffCSE (Chuang et al., 2022) adds a discriminative loss similar to the one used in ELECTRA (Clark et al., 2019), the replaced token detection (RTD) loss, to further increase performance. Its discriminator uses the original sentence embedding and a contextually perturbed sentence to predict the token locations at which the two sentences differ. In contrast to DiffCSE, our discriminator predicts the augmentation type, a higher-level task than predicting individual tokens. Additionally, our discriminator is in an antagonistic/adversarial relationship to our model, whereas the ELECTRA-like RTD objective is collaborative in nature.
2.3 NLP Augmentations
NLP augmentations come in roughly three flavors. Rule-based augmentations range from randomly deleting words and swapping word order (Wei and Zou, 2019) to more structurally-sound or semantically specific transformations (Zhang et al., 2015; Logeswaran et al., 2018). These simple augmentations, however, have been found to be not particularly effective in higher-resource domains for task-agnostic purposes (Longpre et al., 2020; Gao et al., 2021). The second kind of augmentation uses pretrained language models (LMs) to generate semantically similar examples. This area of work includes, but is not limited to, back-translation (Li and Specia, 2019; Sugiyama and Yoshinaga, 2019), paraphrase models (Li et al., 2019, 2018; Iyyer et al., 2018), style transfer models (Fu et al., 2018; Krishna et al., 2020), contextual perturbation models (Morris et al., 2020; Jin et al., 2020), and large-LM-based augmentation (Kumar et al., 2020; Yoo et al., 2021). Lastly, a few methods generate augmentations in the embedding space. These methods often perform interpolation (DeVries and Taylor, 2017; Chen et al., 2020a), noising (Kurata et al., 2016), and autoencoding (Schwartz et al., 2018; Kumar et al., 2019b) with embedded data points. However, due to the discreteness of natural language (Bowman et al., 2016) and anisotropy (Ethayarajh, 2019), the introduced noise often outweighs the benefit of the additional data.

Recently, NL-Augmenter (Dhole et al., 2021) collected over 100 augmentation methods, with the intention of providing robustness diagnostics for NLP models against different types of data perturbations³. In our work, we show that a diverse set of augmentations, even simple rule-based ones, which are cheaper and more controllable than LM-based augmentations, can be used to learn robust general-purpose sentence embeddings.

³ https://github.com/GEM-benchmark/NL-Augmenter
3 Motivation
3.1 Single augmentation is task specific
Augmentations, especially ones that exploit surface-level semantics using simple rules, are task-specific and have been used alone only when the augmentation aligns with the task objective for the dataset (Longpre et al., 2020). For instance, Dinan et al. (2020) change gendered words in a sentence to instill gender invariance for bias mitigation. Inspired by hard negative augmentations in contrastive learning (Gao et al., 2021; Sinha et al., 2020), we use the following case studies to reinforce this conclusion from the perspective of negative data augmentation.
In both scenarios, we use the negative augmentations ($h_i^-$) in the loss (with positive examples $h_i^+$) for the contrastive objective (Gao et al., 2021):

$$-\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} \left( e^{\mathrm{sim}(h_i, h_j^+)/\tau} + e^{\mathrm{sim}(h_i, h_j^-)/\tau} \right)} \quad (1)$$

where $\mathrm{sim}$ is cosine similarity, $\tau$ is the temperature parameter controlling the contrastive strength, and $N$ is the batch size.
Augmentation                        CoLA    trans.
BERT-base                           75.93   84.66
Unsupervised SimCSE-BERT            71.91   85.81
RandomContextualWordAugmentation    78.14   80.51
SentenceSubjectObjectSwitch         76.80   80.31

Augmentation                        ANLI    trans.
BERT-base                           53.80   84.66
Unsupervised SimCSE-BERT            53.42   85.81
AntonymSubstitute                   58.78   79.93
SentenceAdjectivesAntonymsSwitch    58.63   80.11

Table 1: Top negative augmentations for CoLA and ANLI, both measured in accuracy, with average transfer performance (trans.). See augmentation descriptions in Appendix A.2.
Since some augmentations do not have a 100% perturbation rate, we remove datapoints that do not have a successful negative augmentation. For the remaining datapoints, we use the original sentences as positives and train with the different augmentations as negatives. In addition, we also report average transfer task performance (Conneau and Kiela, 2018) as a metric for embedding quality (trans., detailed in Sec 5).
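To make the objective concrete, Equation 1 can be sketched in PyTorch as follows. This is a minimal illustration; the function name and batch layout are assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_negatives(h, h_pos, h_neg, tau=0.05):
    """InfoNCE loss with hard negatives, as in Equation 1.

    h, h_pos, h_neg: (N, d) tensors of embeddings for the original
    sentences, their positives, and their negative augmentations.
    """
    # All-pairs cosine similarities, scaled by the temperature tau.
    sim_pos = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1) / tau  # (N, N)
    sim_neg = F.cosine_similarity(h.unsqueeze(1), h_neg.unsqueeze(0), dim=-1) / tau  # (N, N)
    # Cross entropy against the diagonal reproduces Eq. 1: the numerator is
    # sim(h_i, h_i^+), and the denominator sums over all in-batch positives
    # and negatives.
    logits = torch.cat([sim_pos, sim_neg], dim=1)  # (N, 2N)
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(logits, labels)
```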
Case study 1: linguistic acceptability. We first test embedding performance on CoLA (Warstadt et al., 2018), a binary sentence classification task predicting linguistic acceptability. If an augmentation frequently introduces grammatical errors, it should perform well as a negative.
Case study 2: contradiction vs. entailment. Natural language inference (NLI) datasets (Bowman et al., 2015; Williams et al., 2018) provide triplets of sentences: a hypothesis, a sentence entailing it, and a sentence contradicting it. A good embedding should place the entailment sentence closer to the hypothesis than the contradiction sentence; in fact, that is the exact hypothesis exploited by supervised SimCSE. We calculate the similarity between the hypothesis and an entailment sentence and the similarity between the hypothesis and a contradiction sentence, and count how often the former is larger than the latter in ANLI (Nie et al., 2020). If an augmentation can reverse the semantics of sentences, it should perform well as a negative.
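The counting procedure can be sketched as follows; the `encode` wrapper is an assumed helper that maps a list of sentences to embedding vectors.

```python
import torch.nn.functional as F

def entailment_wins_rate(encode, hypotheses, entailments, contradictions):
    """Fraction of NLI triplets where sim(hypothesis, entailment) exceeds
    sim(hypothesis, contradiction). `encode` is an assumed wrapper that
    maps a list of sentences to an (N, d) tensor of embeddings.
    """
    h = encode(hypotheses)
    e = encode(entailments)
    c = encode(contradictions)
    sim_e = F.cosine_similarity(h, e, dim=-1)
    sim_c = F.cosine_similarity(h, c, dim=-1)
    return (sim_e > sim_c).float().mean().item()
```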
Insight: As expected (Table 1), augmentations known to introduce many grammatical mistakes, such as RandomContextualWordAugmentation (Zang et al., 2020), perform best on CoLA, and those that reverse semantics, AntonymSubstitute and SentenceAdjectivesAntonymsSwitch, perform well on ANLI.
Trial                                        STS-B
unsupervised SimCSE                          81.18
supervised SimCSE                            85.64
no contradiction                             83.60
contradiction as pos                         79.55
contradiction as pos, entailment as neg      67.16
supervised SimCSE w/ ANLI                    75.99

Table 2: Alternative choices of positives and negatives with SimCSE. All results are reproduced by us.
However, single augmentations significantly under-perform on transfer tasks, reducing robustness. This suggests the need for diverse augmentations (Chen et al., 2020b; Ren et al., 2021).
3.2 Difficulty of selecting contrastive pairs
Gao et al. (2021) experimented with a combination of MNLI (Williams et al., 2018) and SNLI (Bowman et al., 2015) and found that using entailments as positives and contradictions as negatives performs well. In addition to this setting, we performed additional ablations showing that it is usually unclear which sentence-pair dataset or augmentation would provide the best contrastive pairs (Table 2). Sometimes, non-intuitive pairs can yield decent results⁴. Together with the specificity of individual augmentations, this motivates a general framework to select and combine multiple augmentations to achieve a robust, general-purpose embedding.

⁴ See more discussion on negation in deep learning in Appendix A.15.
4 Methods
4.1 Augmentation Selection
Dhole et al. (2021) introduced 100+ augmentation methods. We also added non-duplicate augmentation methods from popular repositories: nlpaug, checklist, TextAugment, TextAttack, and TextAutoAugment (Ma, 2019; Ribeiro et al., 2020; Marivate and Sefara, 2020; Morris et al., 2020; Ren et al., 2021), including RandomDeletion, RandomSwap, RandomCrop, RandomWordAugmentation, RandomWordEmbAugmentation, and RandomContextualWordAugmentation⁵.
To narrow down the augmentations we experiment with, we selected single-sentence augmentations labeled as "highly meaning preserving", "possible meaning alteration", or "meaning alteration". After preliminary filtering (Appendix A.3), Table 3 contains all augmentations included in our experiments. To select a diverse set of augmentations for the main results on STS-B and transfer tasks, we trained models using each single augmentation as positives and picked the augmentations that obtained top performance on STS-B and transfer tasks. For full single-augmentation results, see Appendix A.14.

⁵ SimCSE tried RandomDeletion and RandomCrop; DiffCSE tried RandomDeletion and RandomInsertion, and their RTD is based on RandomContextualWordAugmentation.
4.2 Augmentation Sampling
To save computation and control for randomness, we augment the training dataset once for every augmentation and cache the results. Prior to training, augmentations are read from the caches and uniformly sampled at each data point. Since not every augmentation perturbs the original sentence at every data point, we correct the augmentation label to "no augmentation" whenever the augmented sentence is identical to the original. This leads to a larger portion of sentences having the label "no augmentation" than any individual augmentation⁶.

⁶ We also tried resampling augmentations between epochs and found that to underperform fixed sampling.
4.3 Model Architecture
In our experiments, we train sentence embedding encoders using BERT- and RoBERTa-base for fair comparison to previous methods, SimCSE and DiffCSE. During training, we pass sentence representations through a 2-layer projection layer with batch norm, introduced by DiffCSE. We remove the projection layers during inference and obtain sentence embeddings directly from the encoder. Formally, we train with the contrastive loss shown in the equation at the top right of Figure 1; we refer to this loss as $\mathcal{L}_{contrastive}$. We use the embedding corresponding to the [CLS] token as the sentence embedding in all experiments.
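A sketch of the train-time projection head follows; the 768-dimensional hidden size of BERT/RoBERTa-base and the ReLU placement are assumptions, not the exact DiffCSE configuration.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Train-time 2-layer projection with batch norm over the [CLS]
    embedding. A sketch: layer sizes and activation are assumptions.
    The head is discarded at inference, where the encoder's [CLS]
    embedding is used directly as the sentence embedding.
    """
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, cls_embedding):
        return self.net(cls_embedding)
```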
Contrastive loss regularizes at the individual data-pair level, which is too strict a constraint to resolve the distributional shifts that augmentations introduce. To train sentence encoders that are invariant to the shifts between diverse augmentations, we introduce an antagonistic discriminator. We pass the concatenated embeddings of the original and augmented sentences into the discriminator (code in Appendix A.5), trained with the $\mathcal{L}_{discriminator}$ loss, defined as the binary cross entropy between predicted and actual augmentations:
$$-\frac{1}{K} \sum_{i=1}^{K} \left[ y_i \log(p(y_i)) + (1 - y_i) \log(1 - p(y_i)) \right] \quad (2)$$
Meaning Alteration: SentenceAdjectivesAntonymsSwitch, SentenceAuxiliaryNegationRemoval, SentenceSubjectObjectSwitch, AntonymSubstitute

Possible Meaning Alteration: ReplaceHypernyms, ReplaceHyponyms, CityNamesTransformation, ColorTransformation, Summarization, DiverseParaphrase*, SentenceReordering, TenseTransformation*, RandomDeletion, RandomCrop, RandomSwap*, RandomWordAugmentation, RandomWordEmbAugmentation, RandomContextualWordAugmentation

Highly Meaning Preserving: YodaPerturbation, ContractionExpansions*, DiscourseMarkerSubstitution, Casual2Formal, GenderSwap, GeoNamesTransformation, NumericToWord, SynonymSubstitution

Table 3: Final subsets of augmentations included in experiments. Augmentations in 16-Aug experiments are bolded, 12-Aug experiments are underlined, 8-Aug experiments are colored orange, and 4-Aug experiments are marked with asterisks (*). For full descriptions of augmentations, see Appendix A.2.
where $K$ is the number of augmentation types (plus "no augmentation"), and $p(y_i)$ is the probability of augmentation type $i$ predicted by the discriminator. To encourage an augmentation-invariant encoder, the first layer of the discriminator uses a gradient reversal layer (Ganin and Lempitsky, 2015; Zhu et al., 2015; Ganin et al., 2016) (code in Appendix A.4) that multiplies the gradient by a negative multiplier $\alpha$ in the backward pass, so that while the discriminator is trained to minimize the discriminator loss, the encoder is trained to maximize it, all in one pass. We find this simple scheme to work well without having to deal with the instability of training adversarial networks (Creswell et al., 2018; Clark et al., 2019).
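The paper's own implementation is in its Appendix A.4; a common minimal PyTorch version of such a gradient reversal layer looks like this:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, multiplies the
    gradient by -alpha in the backward pass.
    """
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Upstream gradients are flipped for the encoder; the alpha
        # argument itself receives no gradient.
        return -ctx.alpha * grad_output, None

def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)
```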
Finally, the overall loss of our model (AugCSE) is:

$$\mathcal{L} = \mathcal{L}_{contrastive} + \lambda \cdot \mathcal{L}_{discriminator} \quad (3)$$

where $\lambda$ is a coefficient that tunes the strength of the discriminator loss.
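Putting the pieces together, one training step can be sketched as follows. Helper names are assumptions, the discriminator is assumed to apply gradient reversal internally, and Eq. 2 is mirrored with binary cross entropy over one-hot labels.

```python
import torch
import torch.nn.functional as F

def augcse_step(encoder, discriminator, originals, augmenteds, aug_labels,
                num_aug_types, lam=1.0, tau=0.05):
    """One AugCSE training step (a sketch; helper names are assumptions).

    Augmented sentences serve as positives with in-batch negatives; the
    discriminator predicts the augmentation type from concatenated
    embeddings (Eq. 2), and the losses are combined as in Eq. 3.
    """
    h = encoder(originals)       # (N, d) [CLS] embeddings
    h_aug = encoder(augmenteds)  # (N, d)

    # Contrastive term: each original should be closest to its own augmentation.
    sim = F.cosine_similarity(h.unsqueeze(1), h_aug.unsqueeze(0), dim=-1) / tau
    l_contrastive = F.cross_entropy(sim, torch.arange(h.size(0), device=h.device))

    # Adversarial discriminator term, written as binary cross entropy over
    # one-hot augmentation labels to mirror Eq. 2.
    logits = discriminator(torch.cat([h, h_aug], dim=-1))  # (N, K)
    target = F.one_hot(aug_labels, num_aug_types).float()
    l_discriminator = F.binary_cross_entropy_with_logits(logits, target)

    return l_contrastive + lam * l_discriminator
```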
5 Experiments
5.1 Evaluation Datasets
For fair comparison, we use the same dataset SimCSE used: 1M sentences randomly selected from Wikipedia. After training, we use the frozen embeddings to evaluate our method on 7 semantic textual similarity (STS) tasks and 7 (SentEval) transfer tasks (Conneau and Kiela, 2018). The STS tasks include STS 2012-2016 (Agirre et al., 2016), STS-Benchmark (Cer et al., 2017), and SICK-Relatedness (Marelli et al., 2014). In the STS tasks, the Spearman correlation is calculated between the model's embedding similarity for each pair of sentences and the human ratings (1-5). The transfer tasks are single-sentence classification tasks from SentEval: MR (Pang and Lee, 2005), CR (Hu and Liu, 2004), MPQA (Wiebe et al., 2005), MRPC (Dolan and Brockett, 2005), TREC (Voorhees and Tice, 2000), SST-2 (Socher et al., 2013), and SUBJ (Pang and Lee, 2004). We follow the standard evaluation setup from Conneau and Kiela (2018), training a logistic regression classifier on top of the frozen sentence embeddings. See Appendix A.6 for details on the hyperparameter search.
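For reference, a minimal sketch of the STS scoring protocol, assuming an `encode` function that maps sentences to frozen embeddings:

```python
import torch.nn.functional as F
from scipy.stats import spearmanr

def sts_spearman(encode, sents_a, sents_b, gold_scores):
    """Spearman correlation between embedding cosine similarities and
    human ratings, as in the STS evaluation (a sketch; `encode` is an
    assumed function returning an (N, d) tensor of frozen embeddings).
    """
    emb_a, emb_b = encode(sents_a), encode(sents_b)
    sims = F.cosine_similarity(emb_a, emb_b, dim=-1).cpu().numpy()
    corr, _ = spearmanr(sims, gold_scores)
    return corr
```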
5.2 Evaluation Baselines
We include several levels of baselines, from word-averaged GloVe embeddings (Pennington et al., 2014) to BERT-base, using both average pooling and the [CLS] token. We include the post-processing methods BERT-flow (Li et al., 2020) and BERT-whitening (Su et al., 2021), as well as other recent contrastive sentence embedding methods: CT-BERT (Carlsson et al., 2020), SG-OPT (Kim et al., 2021), SimCSE (Gao et al., 2021), and DiffCSE (Chuang et al., 2022). We also report results from DeCLUTR (Giorgi et al., 2021) and cpt-text-S (Neelakantan et al., 2022) as a comparison of what larger models and larger training data sizes can provide. More specifically, DeCLUTR mines positives from documents, and cpt-text-S uses the next sentence as positives.
5.3 STS Results
We show STS test results in Table 4. AugCSE performs competitively against SOTA methods with both BERT and RoBERTa. AugCSE also outperforms larger models trained with more data (DeCLUTR and cpt-text-S). We discuss this in Sec 7.
5.4 Transfer Tasks Results
We show transfer task test-set results in Table 5. With BERT-base, AugCSE outperforms DiffCSE in average transfer score and improves on 4 out of 7 SentEval tasks. With RoBERTa-base, we still see competitive performance. Here, larger models with more training data outperform existing methods.