the sentences with words in the synonym substitu-
tion set at each training iteration. While common
IBP-based certified robust training methods do not scale well to large pre-trained language models (Jia et al., 2019; Huang et al., 2019), SAFER is a structure-free approach that can be applied to any model architecture. In addition, it provides stronger robustness than traditional adversarial training methods (Yoo and Qi, 2021).
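As a minimal sketch of this per-iteration perturbation (not the full SAFER certification procedure), the following randomly substitutes each word from its synonym set; the `synonym_sets` table and function names are illustrative assumptions.

```python
import random

def randomly_substitute(tokens, synonym_sets):
    """Replace each word with a uniformly sampled member of its synonym
    substitution set, keeping the original word as one of the options.
    Applied to every training sentence at each iteration."""
    perturbed = []
    for tok in tokens:
        candidates = synonym_sets.get(tok, []) + [tok]
        perturbed.append(random.choice(candidates))
    return perturbed
```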
We train BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) models on two dif-
ferent tasks with SAFER training for 15 epochs.
We then test the attack success rate for both fickle-
ness and obstinacy attacks at each training epoch.
We use the same perturbation method as described
in Section 2.1 for both the training and the attack.
For each word, the synonym perturbation set is constructed by selecting the top $k$ nearest neighbors with a cosine similarity constraint of 0.8 in GloVe embeddings (Pennington et al., 2014), and the antonym perturbation set consists of antonym words found in WordNet (Miller, 1995).
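The sketch below illustrates one way to build these perturbation sets, computing synonym candidates from GloVe vectors and antonyms from WordNet via NLTK; the value of $k$, the `glove` dictionary (mapping words to unit-normalized vectors), and the function names are assumptions rather than our exact implementation.

```python
import numpy as np
from nltk.corpus import wordnet as wn

def synonym_set(word, glove, k, threshold=0.8):
    """Top-k nearest GloVe neighbors with cosine similarity >= threshold.
    `glove` maps words to unit-normalized embedding vectors, so the dot
    product equals cosine similarity."""
    if word not in glove:
        return []
    v = glove[word]
    sims = {w: float(np.dot(v, u)) for w, u in glove.items() if w != word}
    nearest = sorted(sims, key=sims.get, reverse=True)[:k]
    return [w for w in nearest if sims[w] >= threshold]

def antonym_set(word):
    """All antonyms of `word` listed in WordNet."""
    antonyms = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    return sorted(antonyms)
```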
We follow the method of Jin et al. (2020) for finding fickle adversarial examples, using word importance ranking together with Part-of-Speech (PoS) and sentence semantic similarity constraints as the search criteria. Words are replaced in descending order of word importance score, and each substitute must have the same PoS tag as the word it replaces. For the antonym attack, we likewise use word importance ranking and PoS constraints to search for word substitutions. For comparison, we set up baseline models with standard training on the original training sets.
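The greedy, importance-ordered search described above can be sketched as follows, assuming `predict` returns a class-probability vector, `label` is the integer index of the original class, `candidates` returns a word's perturbation set, and `same_pos` checks PoS agreement; the sentence semantic similarity constraint of Jin et al. (2020) is omitted for brevity.

```python
import numpy as np

def word_importance(tokens, label, predict, mask="[UNK]"):
    """Simplified importance score: the drop in the true-class
    probability when each word is masked out."""
    base = predict(tokens)[label]
    return [base - predict(tokens[:i] + [mask] + tokens[i + 1:])[label]
            for i in range(len(tokens))]

def greedy_attack(tokens, label, predict, candidates, same_pos):
    """Substitute words from most to least important, keeping only
    same-PoS candidates, until the model's prediction flips."""
    scores = word_importance(tokens, label, predict)
    adv = list(tokens)
    for i in sorted(range(len(tokens)), key=scores.__getitem__, reverse=True):
        best, best_prob = None, predict(adv)[label]
        for sub in candidates(tokens[i]):
            if sub == tokens[i] or not same_pos(tokens[i], sub):
                continue
            trial = adv[:i] + [sub] + adv[i + 1:]
            prob = predict(trial)[label]
            if prob < best_prob:
                best, best_prob = sub, prob
        if best is not None:
            adv[i] = best
            if int(np.argmax(predict(adv))) != label:
                return adv  # successful fickle adversarial example
    return None  # attack failed
```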
3.2 Tasks
We choose two tasks from the GLUE benchmark (Wang et al., 2018) that are good candidates for the antonym attack: both consist of sentence pairs, and changing a word to its opposite meaning is likely to break the relationship between the sentences in a pair.
Natural Language Inference.
We experiment with the Multi-Genre Natural Language Inference (MNLI) dataset (Williams et al., 2018), which contains a premise-hypothesis pair for each example.
The task is to identify the relation between the sen-
tences in a premise-hypothesis pair and determine
whether the hypothesis is true (entailment), false
(contradiction) or undetermined (neutral) given the
premise. We consider the case where both premise
and hypothesis can be perturbed, but only one word
from either the premise or the hypothesis can be substituted for the antonym attack. We exclude examples
with a neutral label when constructing obstinate
adversarial examples since antonym word substi-
tutions may not change their label to a different
class.
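The following is a rough sketch of this single-substitution antonym attack on MNLI, counting the attack as successful when the model's prediction is unchanged even though the substitution should change the gold label; the word importance ordering is omitted, and the helper names (`predict_pair`, `antonym_set`, `same_pos`) are illustrative assumptions.

```python
def mnli_antonym_attack(premise, hypothesis, label, predict_pair,
                        antonym_set, same_pos):
    """Try single-word antonym substitutions in either the premise or the
    hypothesis; the obstinacy attack succeeds if the prediction stays the
    same even though the substitution should change the gold label.
    Neutral examples are excluded before calling this function."""
    assert label in ("entailment", "contradiction")
    original = predict_pair(premise, hypothesis)
    for side, tokens in (("premise", premise), ("hypothesis", hypothesis)):
        for i, tok in enumerate(tokens):
            for antonym in antonym_set(tok):
                if not same_pos(tok, antonym):
                    continue
                perturbed = tokens[:i] + [antonym] + tokens[i + 1:]
                pair = (perturbed, hypothesis) if side == "premise" \
                    else (premise, perturbed)
                if predict_pair(*pair) == original:
                    return pair  # successful obstinate adversarial example
    return None
```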
Paraphrase Identification.
We use the Quora Question Pairs (QQP) dataset (Iyer et al., 2017), which consists of question pairs extracted from Quora. The task is to identify duplicate questions, and each pair is labeled as duplicate or non-duplicate.
For our antonym attack strategy, we only target the
duplicate class since antonym word substitutions
are unlikely to flip an initially non-duplicate pair
into a duplicate.
We also conducted experiments on the Wiki Talk Comments toxicity detection dataset (Wulczyn et al., 2017), creating obstinate examples by adding or removing toxic words. However, we found that adding toxic words achieves an attack success rate of nearly 100%, so there did not seem to be an interesting tradeoff to explore for available models on this task, and we do not include it in our results.
3.3 Results
We visualize the attack success rates for fickleness (synonym) and obstinacy (antonym) attacks in Figure 3. The results are consistent with
our hypothesis that optimizing adversarial robust-
ness of NLP models using only fickle examples
can result in models that are more vulnerable to ob-
stinacy attacks. Robustness training for the BERT
model on MNLI improves fickleness robustness,
reducing the synonym attack success rate from
36% to 11% (a 69% decrease) after training for
15 epochs (Figure 3a), but the antonym attack success rate increases from 56% to 63% (a 13% increase).
The antonym attack success rate increases even
more for the RoBERTa model (Figure 3b), increas-
ing from 56% to 67% (a 20% increase) while the
synonym attack success rate decreases from 31.2%
to 10% (a 68% decrease). The RoBERTa model is pre-trained with dynamic masking, which makes it more robust than the BERT model and perhaps explains the difference. We observe a robustness tradeoff for the QQP dataset as well (see Appendix A.1). In addition, fickle adversarial training does not sacrifice performance on the original examples