
3.1 Domain Datasets
Our focus is on two text-classification domains: sentiment analysis and abuse detection. Sentiment
analysis is a popular and widely studied task [41, 58, 64, 43, 31, 55], while abuse detection is more
likely to be adversarial [40, 63, 26].
For sentiment analysis, we attack models trained on three domains: (1) Climate Change2, 62,356
tweets on climate change; (2) IMDB [34], 50,000 movie reviews; and (3) SST-2 [45], 68,221
movie reviews. For abuse detection, we attack models trained on three toxic-comment datasets:
(1) Wikipedia (Talk Pages) [56, 8], 159,686 comments from Wikipedia administration webpages;
(2) Hatebase [6], 24,783 comments; and (3) Civil Comments3, 1,804,874 comments from independent
news sites. All datasets are binary (positive vs. negative or toxic vs. non-toxic) except for Climate
Change, which includes neutral sentiment. Additional dataset details are in the Appendix §B.1.
3.2 Target Models
We finetune BERT [7], RoBERTa [33], and XLNet [59] models — all from HuggingFace’s transformers
library [54] — on the six domain datasets. We use transformer-based models since they represent
current state-of-the-art approaches to text classification, and we use multiple architectures to obtain a
wider range of adversarial examples, ultimately testing the robustness of attack identification models
to attacks targeting different victim models.
Table 2 shows the performance of these models on the test set of each domain dataset. On most
datasets, RoBERTa slightly outperforms the other two models both in accuracy and AUROC. Training
code and additional details such as selected hyperparameters are in the Appendix §B.2.
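To make this setup concrete, the sketch below shows the standard pattern for finetuning one of these
classifiers with the transformers Trainer API. It is only a sketch: the checkpoint name, toy training
data, and hyperparameters are illustrative placeholders, not the tuned values behind Table 2 (those
are listed in Appendix §B.2).

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder checkpoint; the same pattern applies to BERT, RoBERTa, and XLNet.
checkpoint = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy stand-ins for one of the domain datasets (e.g., IMDB movie reviews).
texts = ["a wonderful, moving film", "dull and far too long"]
labels = [1, 0]  # 1 = positive, 0 = negative

class ClassificationDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and labels for the HuggingFace Trainer."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Illustrative hyperparameters only; the selected values are in Appendix B.2.
args = TrainingArguments(output_dir="finetuned-model", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args,
        train_dataset=ClassificationDataset(texts, labels)).train()
```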
3.3 Attack Methods
We select twelve different attack methods that cover a wide range of design choices and assumptions,
such as model access level (e.g., white/gray/black box), perturbation level (e.g., char/word/token),
and linguistic constraints. Table 7 (Appendix, §B.4) provides a summary of all attack methods and
their characteristics.
Target Model Access and Perturbation Levels.
Of the twelve attack methods, only two [10, 30] have full access to the target model (i.e., a white-box
attack), while five [12, 21, 1, 52, 42] assume some information about the target (gray box), and the
rest [13, 61, 30, 23, 11] can only query the output (black box). The majority of methods perturb
entire words by swapping them with similar words based on sememes [61], synonyms [23], or an
embedding space [13, 21, 1, 52, 42]. The remaining methods [12, 10, 30, 11] operate on the
token/character level, perturbing the input by inserting/deleting/swapping different characters.
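As a toy illustration of these character-level operations (not an implementation of any particular
attack above), the sketch below applies one random insert, delete, or adjacent-swap edit to a word;
real attacks such as DeepWordBug or TextBugger choose such edits guided by the victim model's output.

```python
import random
import string

def perturb_word(word: str, rng: random.Random) -> str:
    """Apply one random character-level edit (insert, delete, or adjacent swap).

    Toy illustration of the perturbation primitives only; it is not tied to
    any specific attack method discussed in this section.
    """
    if len(word) < 2:
        return word
    op = rng.choice(["insert", "delete", "swap"])
    i = rng.randrange(len(word) - 1)
    if op == "insert":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    # Swap two adjacent characters.
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

rng = random.Random(0)
print(perturb_word("terrible", rng))  # e.g., "terirble"
```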
Linguistic Constraints.
Linguistic constraints promote indistinguishable attacks. For example, Genetic [1], FasterGenetic [21],
HotFlip [10], and Pruthi [42] limit the number or percentage of words perturbed. Other methods
ensure the distance between the perturbed text and the original text is “close” in some embedding
space; for example, BAE [13], TextBugger [30], and TextFooler [23] constrain the perturbed text to
have high cosine similarity to the original text using a universal sentence encoder (USE) [4], while
IGA [52] and VIPER [11] ensure similarity in word and visual embedding spaces, respectively. Some
methods, such as TextBugger and TextFooler, use a combination of constraints to further limit
deviations from the original input.
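Concretely, an embedding-space constraint amounts to a simple accept/reject check: encode the
original and perturbed texts and require their cosine similarity to exceed a threshold. The sketch
below is a generic illustration; the `toy_encode` stand-in and the 0.8 threshold are assumptions for
the example, whereas the attacks above use a learned encoder such as USE and their own threshold
values.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def satisfies_similarity_constraint(original: str, perturbed: str,
                                    encode, threshold: float = 0.8) -> bool:
    """Accept a candidate perturbation only if it stays 'close' to the original.

    `encode` is any sentence encoder mapping text to a fixed-size vector; the
    threshold value here is purely illustrative.
    """
    return cosine_similarity(encode(original), encode(perturbed)) >= threshold

def toy_encode(text: str) -> np.ndarray:
    """Toy bag-of-characters encoder so the sketch runs without external models."""
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

# Check a small edit against the original text.
print(satisfies_similarity_constraint("the movie was great",
                                      "the film was great", toy_encode))
```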
Attack Toolchains.
We use TextAttack [38] and OpenAttack [62] — open-source toolchains that provide fully-automated
off-the-shelf attacks — to generate adversarial examples. For these toolchains, attack methods are
implemented using different search methods. For example, BAE [13], DeepWordBug [12],
TextBugger [30], and TextFooler [23] use a word importance ranking to greedily decide which word(s)
to perturb for each query; in contrast, Genetic [1] and PSO [61] use a genetic algorithm and particle
swarm optimization to identify word-perturbation candidates, respectively. For
2 https://www.kaggle.com/edqian/twitter-climate-change-sentiment-dataset
3 https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification