
semantically similar examples. This line of work
includes, but is not limited to, back-translation (Li
and Specia, 2019; Sugiyama and Yoshinaga, 2019),
paraphrase models (Li et al., 2019, 2018; Iyyer
et al., 2018), style-transfer models (Fu et al., 2018;
Krishna et al., 2020), contextual perturbation methods
(Morris et al., 2020; Jin et al., 2020), and large
LM-based augmentation (Kumar et al., 2020; Yoo
et al., 2021). Lastly, a few methods generate
augmentations in the embedding space, typically by
interpolating (DeVries and Taylor, 2017; Chen et al., 2020a),
noising (Kurata et al., 2016), and autoencoding
(Schwartz et al., 2018; Kumar et al., 2019b)
embedded data points. However, due to the discreteness
of NL (Bowman et al., 2016) and the anisotropy of the
embedding space (Ethayarajh, 2019), the introduced
noise often outweighs the benefit of the additional data.
Recently, NL-Augmenter (Dhole et al., 2021)
collected over 100 augmentation methods, with the
intention of providing robustness diagnostics for NLP
models against different types of data perturbations
(https://github.com/GEM-benchmark/NL-Augmenter).
In our work, we show that a diverse set of augmentations,
even simple rule-based ones, which are cheaper and more
controllable than LM-based augmentations, can be used to
learn robust, general-purpose sentence embeddings.
3 Motivation
3.1 Single augmentation is task specific
Augmentations, especially ones that exploit surface-level
semantics using simple rules, are task specific
and have been used alone only when the augmentation
aligns with the task objective for the dataset (Longpre
et al., 2020). For instance, Dinan et al. (2020)
change gendered words in a sentence to instill
gender invariance for bias mitigation. Inspired by
hard-negative augmentations in contrastive learning
(Gao et al., 2021; Sinha et al., 2020), we use the
following case studies to reinforce this conclusion
from the perspective of negative data augmentation.
In both scenarios, we use the negative augmentations
($h_i^-$), together with positive examples ($h_i^+$),
in the contrastive objective (Gao et al., 2021):

$$-\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} \left( e^{\mathrm{sim}(h_i, h_j^+)/\tau} + e^{\mathrm{sim}(h_i, h_j^-)/\tau} \right)} \qquad (1)$$
where $\mathrm{sim}$ is cosine similarity, $\tau$ is the
temperature parameter controlling the contrastive strength,
and $N$ is the batch size.
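To make the objective concrete, the following is a minimal PyTorch sketch of Eq. (1); the function name and tensor layout are our own, and we assume the encoder has already produced anchor, positive, and negative embeddings as (N, d) tensors:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h, h_pos, h_neg, tau=0.05):
    # h, h_pos, h_neg: (N, d) anchor, positive, and negative embeddings.
    # Pairwise cosine similarities between every anchor i and every example j.
    sim_pos = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1) / tau  # (N, N)
    sim_neg = F.cosine_similarity(h.unsqueeze(1), h_neg.unsqueeze(0), dim=-1) / tau  # (N, N)
    # The diagonal of sim_pos holds sim(h_i, h_i+)/tau, the numerator of Eq. (1);
    # cross-entropy over all 2N columns reproduces the denominator's sum over j.
    logits = torch.cat([sim_pos, sim_neg], dim=1)      # (N, 2N)
    labels = torch.arange(h.size(0), device=h.device)  # target column: h_i+
    return F.cross_entropy(logits, labels)
```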
Augmentation                         CoLA    trans.
BERT_base                            75.93   84.66
Unsupervised SimCSE_BERT             71.91   85.81
RandomContextualWordAugmentation     78.14   80.51
SentenceSubjectObjectSwitch          76.80   80.31

Augmentation                         ANLI    trans.
BERT_base                            53.80   84.66
Unsupervised SimCSE_BERT             53.42   85.81
AntonymSubstitute                    58.78   79.93
SentenceAdjectivesAntonymsSwitch     58.63   80.11

Table 1: Top negative augmentations for CoLA and ANLI, both measured in accuracy, with average transfer performance (trans.). See augmentation descriptions in A.2.
Since some augmentations do not have a 100% perturbation
rate, we remove data points that lack a successful negative
augmentation. For the remaining data points, we use the
original sentences as positives and train with the different
augmentations as negatives. In addition, we report average
performance on the transfer tasks (Conneau and Kiela, 2018)
as a metric for embedding quality (trans., detailed in Sec 5).
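As an illustration, a minimal sketch of this pair construction, assuming each augmentation is a callable that returns None (or the unchanged sentence) when the perturbation fails; the helper name is hypothetical:

```python
def build_training_triplets(sentences, augment):
    # For each sentence the augmentation successfully perturbs, use the
    # original as both anchor and positive (unsupervised SimCSE obtains the
    # positive view via dropout), and the perturbation as the negative.
    triplets = []
    for s in sentences:
        neg = augment(s)  # hypothetical augmentation callable
        if neg is not None and neg != s:
            triplets.append((s, s, neg))  # (anchor, positive, negative)
    return triplets
```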
Case study 1: linguistic acceptability
We first test embedding performance on CoLA (Warstadt
et al., 2018), a binary sentence classification task
predicting linguistic acceptability. If an augmentation
frequently introduces grammatical errors, it should
perform well as a negative.
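Evaluation here follows the usual frozen-embedding protocol of fitting a lightweight classifier on top of the sentence vectors; below is a sketch with scikit-learn, assuming the CoLA sentences have already been embedded into feature matrices (the function name is ours):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def cola_probe_accuracy(X_train, y_train, X_test, y_test):
    # Logistic-regression probe on frozen sentence embeddings; CoLA labels
    # are binary (acceptable vs. unacceptable), scored by accuracy.
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
```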
Case study 2: contradiction vs. entailment
Natural language inference (NLI) datasets (Bowman
et al., 2015; Williams et al., 2018) provide
triplets of sentences: a hypothesis, a sentence
entailing the hypothesis, and a sentence contradicting
it. A good embedding should place the entailment
sentence closer to the hypothesis than the contradiction
sentence; in fact, this is the exact property exploited
by supervised SimCSE. We calculate the similarity
between the hypothesis and the entailment sentence and
between the hypothesis and the contradiction sentence,
and count how often the former is larger than the latter
on ANLI (Nie et al., 2020). If an augmentation can
reverse the semantics of a sentence, it should
perform well as a negative.
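A minimal sketch of this measurement, assuming the hypotheses, entailments, and contradictions have already been encoded into aligned (num_triplets, d) tensors (the function name is ours):

```python
import torch.nn.functional as F

def entailment_win_rate(h_hyp, h_ent, h_con):
    # Cosine similarity of each hypothesis to its entailment and contradiction.
    sim_ent = F.cosine_similarity(h_hyp, h_ent, dim=-1)  # (num_triplets,)
    sim_con = F.cosine_similarity(h_hyp, h_con, dim=-1)
    # Fraction of triplets where the entailment is the closer sentence.
    return (sim_ent > sim_con).float().mean().item()
```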
Insight:
As expected (Table 1), augmentations known to introduce
many grammatical mistakes, such as
RandomContextualWordAugmentation
(Zang et al., 2020), perform best on CoLA,
and those that reverse semantics, AntonymSubstitute
and SentenceAdjectivesAntonymsSwitch