TCAB: A Large-Scale Text Classification Attack Benchmark

Kalyani Asthana† Zhouhang Xie§ Wencong You‡ Adam Noack‡
Jonathan Brophy‡ Sameer Singh† Daniel Lowd‡
†University of California, Irvine §University of California, San Diego ‡University of Oregon
{kasthana,sameer}@uci.edu zhx022@ucsd.edu
{wyou,anoac2k,jbrophy,lowd}@cs.uoregon.edu
Abstract
We introduce the Text Classification Attack Benchmark (TCAB), a dataset for
analyzing, understanding, detecting, and labeling adversarial attacks against text
classifiers. TCAB includes 1.5 million attack instances, generated by twelve adver-
sarial attacks targeting three classifiers trained on six source datasets for sentiment
analysis and abuse detection in English. Unlike standard text classification, text
attacks must be understood in the context of the target classifier that is being
attacked, and thus features of the target classifier are important as well.
TCAB includes all attack instances that are successful in flipping the predicted
label; a subset of the attacks are also labeled by human annotators to determine
how frequently the primary semantics are preserved. The process of generating
attacks is automated, so that TCAB can easily be extended to incorporate new text
attacks and better classifiers as they are developed. In addition to the primary tasks
of detecting and labeling attacks, TCAB can also be used for attack localization,
attack target labeling, and attack characterization. TCAB code and dataset are
available at https://react-nlp.github.io/tcab/.
1 Introduction
Text classifiers have been under attack ever since spammers started evading spam filters, nearly 20 years ago [18]. In recent years, however, attacking classifiers has become much easier to carry out. Many general-purpose attacks have been developed and are now available in standard, plug-and-play frameworks, such as TextAttack [38] and OpenAttack [62]. The wide use of standard architectures and shared pretrained representations has further increased the risk of attacks by decreasing the diversity of text classifiers.

Our focus is on evasion attacks [2], in which an attacker attempts to change a classifier's prediction by making minor, semantics-preserving perturbations to the original input. To accomplish this, different adversarial attack algorithms employ different types of perturbations, search methods, and constraints. See Table 1 for some brief examples.
A common defense strategy is to make classifiers more robust, using algorithms with heuristic or provable guarantees on their performance [35, 5, 32, 46, 9, 49, 19, 60, 53, 36, 66, 22, 24, 17, 21, 44].
However, these defenses are often computationally expensive or result in reduced accuracy. Therefore, as a complement to making classifiers more robust, we introduce the task of attack identification: automatically determining the adversarial attacks (if any) used to generate a given piece of text.

Corresponding author.

arXiv:2210.12233v1 [cs.LG] 21 Oct 2022
Table 1: Attack Samples on SST-2

Attack             Text                         Label     Confidence
Original           the acting is amateurish     Negative  63.7%
Pruthi [42]        the acting is amateirish     Positive  82.1%
DeepWordBug [12]   the acting is aateurish      Positive  91.2%
IGA [52]           the acting is enthusiastic   Positive  62.3%
The idea behind attack identification is that many attackers will use whatever attacks are most convenient, such as public implementations of attack algorithms, rather than developing new ones or implementing their own. Thus, we can identify specific attacks instead of detecting or preventing all possible attacks.
The primary focus of attack identification is attack labeling — determining which specific attack was
used (or none). However, other valuable challenges exist under the umbrella of attack identification,
such as attack target labeling (determining which model is being attacked), attack localization (identifying which parts of the text have been manipulated), and more (see §3.7 for detailed subtask
descriptions). These tasks give us information about how the attacks are being conducted, which can
be used to develop defense strategies for the overall system, such as uncovering malicious actors
behind misinformation or abuse campaigns on social media.
Existing adversarial evaluation frameworks/benchmarks focus exclusively on model robustness, typically requiring carefully controlled and expensive human annotations [51, 27] that tend to result in small datasets; e.g., Adversarial GLUE [50] contains only 5,000 human-verified attacks. However, to the best of our knowledge, no dataset currently exists to support the task of attack identification. To address this issue, we propose TCAB, a benchmark dataset that comprises over 1.5 million fully-automated, low-effort attacks, providing sufficient data to enable proper training and evaluation of models specific to the task (and potential subtasks) of attack identification.
We summarize our contributions below, outlining the unique advantages of TCAB over existing
adversarial benchmarks.
1. We introduce attack identification and its primary task, attack labeling: automatically determining the adversarial attacks (if any) used to generate a given piece of text. Attack identification also includes additional tasks such as attack detection, attack target labeling, attack localization, and attack characterization, enabling defenders to learn more about their attackers.

2. We create TCAB, a text classification attack benchmark consisting of more than 1.5 million successful adversarial examples from a diverse set of twelve attack methods targeting three state-of-the-art text classification models trained on six sentiment/toxic-speech domain datasets. TCAB is designed to be expanded as new text attacks, classifiers, and domains are developed. This benchmark supports research into attack identification and related tasks.

3. We adopt crowd-sourcing to evaluate a portion of TCAB and analyze the label-preserving nature of the attack methods. We find that 51% and 81% of adversarial instances preserve their original labels for the sentiment and abuse datasets, respectively.

4. We present a baseline approach for attack detection and labeling that combines contextualized fine-tuned BERT embeddings with hand-crafted text, language-model, and target-model properties (a minimal sketch of such a combination follows this list). Our baseline approach achieves 91.7% and 66.7% accuracy for attack detection and labeling, respectively, averaged over all datasets and target models.
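As a rough illustration of the kind of feature combination described in contribution 4 (not the authors' exact pipeline), the sketch below concatenates a BERT [CLS] embedding with a few hypothetical hand-crafted properties and trains an off-the-shelf classifier to predict the attack label; the specific features and classifier are assumptions made for illustration.

```python
# Minimal sketch, not the authors' exact pipeline: concatenate a contextualized
# BERT embedding with a few illustrative hand-crafted properties, then train a
# simple classifier to predict the attack label. All feature choices here are
# assumptions made for illustration.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def featurize(text: str, target_model_confidence: float) -> np.ndarray:
    """Embed the text and append hand-crafted text/target-model properties."""
    enc = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        cls = encoder(**enc).last_hidden_state[:, 0, :].squeeze(0).numpy()
    handcrafted = np.array([
        len(text),                       # text property: character length
        sum(c.isupper() for c in text),  # text property: uppercase character count
        target_model_confidence,         # target-model property: output confidence
    ])
    return np.concatenate([cls, handcrafted])

# Hypothetical usage: `texts` are clean/perturbed strings, `confs` are the target
# model's confidences on them, and `y` holds attack labels (e.g., "clean", "bae").
# X = np.stack([featurize(t, c) for t, c in zip(texts, confs)])
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```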
TCAB is available at https://react-nlp.github.io/tcab/, including the clean (source) instances, the manipulated attack instances, code for generating features and baseline models, and code for extending the dataset with new attacks and source datasets.
2 Background and Problem Setup
Existing adversarial evaluation frameworks focus only on model robustness. Robustness Gym [15] and TextFLINT [51] allow users to measure the performance of their models on a variety of text transformations and adversaries.
Figure 1: High-level overview of the TCAB generation and evaluation workflow. [Figure shows domain data (SST-2, Climate Change, IMDB, Wikipedia, Hatebase, Civil Comments) attacked with adversarial perturbations (BAE, DeepWordBug, Faster Genetic, HotFlip, IGA, VIPER, ...) against target models (BERT, RoBERTa, XLNet); the resulting clean and adversarial train/test splits form TCAB, which supports human evaluation of label preservation and the attack identification tasks (attack labeling, attack detection, attack target labeling, attack localization, ...).]
Adversarial GLUE [50] is a multi-task robustness benchmark that was created by applying 14 textual adversarial attack methods to GLUE tasks. Dynabench [27] is a related framework for evaluating and training NLP models on adversarial examples created entirely by human adversaries.

Of these, TCAB is most similar to Adversarial GLUE. However, TCAB was designed for a different purpose — attack identification rather than robustness evaluation. TCAB is also much larger than Adversarial GLUE (1.5 million fully-automated attacks vs. 5,000 human-verified attacks), focuses only on classification, and includes multiple classification domain datasets.
In this work, we focus on text classifiers and attacks on them. Given an input sequence $x = (x_1, x_2, \ldots, x_N) \in \mathcal{X}$ (the instance space), a text classifier $f$ maps $x$ to a label $y \in \mathcal{Y}$, the set of output labels. For sentiment analysis, $\mathcal{Y}$ may be positive or negative sentiment; for toxic comment detection, $\mathcal{Y}$ may be toxic or non-toxic.
A text-classification adversary aims to generate an adversarial example $x'$ such that $f(x') \neq f(x)$. Ideally, the changes made to $x$ to obtain $x'$ are minimal, such that a human would label both the same way. Perturbations may occur at the character [10, 11], token [12], word [23], or sentence level [20, 48], or a combination of levels [30]; perturbations may also be structured such that certain input properties are preserved, such as the semantics [11], perplexity [1, 21], fluency [13], or grammar [61].
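To make the evasion condition concrete, the snippet below checks whether a small character-level perturbation (the Pruthi example from Table 1) flips a sentiment classifier's prediction. The model used here is a public SST-2 classifier chosen for convenience, not one of the TCAB target models, so the exact outputs may differ.

```python
# Illustrative check of the evasion condition f(x') != f(x), using a public
# SST-2 sentiment model as a stand-in target (not one of the TCAB victims).
from transformers import pipeline

clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english")

x = "the acting is amateurish"       # original input (Table 1)
x_adv = "the acting is amateirish"   # character-level perturbation of one word

pred, pred_adv = clf(x)[0], clf(x_adv)[0]
print(pred, pred_adv)
# The attack succeeds if the predicted label flips (e.g., NEGATIVE -> POSITIVE)
# while a human would still read the perturbed text the same way.
```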
The primary task of attack identification is attack labeling: given a (possibly) perturbed input sequence $x' \sim M(x)$, in which $M(x)$ is a function that perturbs $x$ using any one attack method from a set of attacks $\mathcal{S}$ (including a "clean" attack in which the input is not perturbed), an attack labeler $f_{\mathrm{LAB}} : \mathcal{X} \to \mathcal{S}$ maps the perturbed sequence to an attack method in $\mathcal{S}$. Additionally, multiple subtasks complement the primary challenge of attack labeling; these include attack localization, attack target labeling, and attack characterization (see §3.7 for formal descriptions of all related tasks).
In pursuit of solving these problems, we develop and curate a large collection of adversarial attacks
on a number of classifiers trained on various domain datasets. In the following section, we detail our
process for generating this benchmark, describe its characteristics, and provide a human evaluation of
the resulting dataset (§3). We then evaluate a set of baseline models on attack detection and labeling
using our newly created benchmark (§4).
3 TCAB Benchmark
We now present the Text Classification Attack Benchmark (TCAB), a dataset for developing and
evaluating methods for identifying adversarial attacks against text classifiers. Here we describe the
process for constructing this benchmark, and evaluate its characteristics.
3.1 Domain Datasets
Our focus is on two text-classification domains, sentiment analysis and abuse detection. Sentiment analysis is a popular and widely studied task [41, 58, 64, 43, 31, 55], while abuse detection is more likely to be adversarial [40, 63, 26].
For sentiment analysis, we attack models trained on three domains: (1) Climate Change², 62,356 tweets on climate change; (2) IMDB [34], 50,000 movie reviews; and (3) SST-2 [45], 68,221 movie reviews. For abuse detection, we attack models trained on three toxic-comment datasets: (1) Wikipedia (Talk Pages) [56, 8], 159,686 comments from Wikipedia administration webpages; (2) Hatebase [6], 24,783 comments; and (3) Civil Comments³, 1,804,874 comments from independent news sites. All datasets are binary (positive vs. negative, or toxic vs. non-toxic) except for Climate Change, which also includes neutral sentiment. Additional dataset details are in Appendix §B.1.
3.2 Target Models
We fine-tune BERT [7], RoBERTa [33], and XLNet [59] models — all from HuggingFace's transformers library [54] — on the six domain datasets. We use transformer-based models since they represent current state-of-the-art approaches to text classification, and we use multiple architectures to obtain a wider range of adversarial examples, ultimately testing the robustness of attack identification models to attacks targeting different victim models.

Table 2 shows the performance of these models on the test set of each domain dataset. On most datasets, RoBERTa slightly outperforms the other two models in both accuracy and AUROC. Training code and additional details, such as the selected hyperparameters, are in Appendix §B.2.
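A minimal sketch of how one such target model could be fine-tuned with the transformers library is shown below; it uses the GLUE SST-2 split and placeholder hyperparameters rather than the exact data preparation and settings reported in Appendix §B.2.

```python
# Minimal fine-tuning sketch for one target model (BERT) on one domain dataset
# (SST-2 via GLUE). Hyperparameters are placeholders, not the TCAB settings.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="bert-sst2", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"])
trainer.train()
```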
3.3 Attack Methods
We select twelve different attack methods that cover a wide range of design choices and assumptions,
such as model access level (e.g., white/gray/black box), perturbation level (e.g., char/word/token),
and linguistic constraints. Table 7 (Appendix, §B.4) provides a summary of all attack methods and
their characteristics.
Target Model Access and Perturbation Levels. Of the twelve attack methods, only two [10, 30] have full access to the target model (i.e., a white-box attack), while five [12, 21, 1, 52, 42] assume some information about the target (gray box), and the rest [13, 61, 30, 23, 11] can only query the output (black box). The majority of methods perturb entire words by swapping them with similar words based on sememes [61], synonyms [23], or an embedding space [13, 21, 1, 52, 42]. The remaining methods [12, 10, 30, 11] operate on the token/character level, perturbing the input by inserting/deleting/swapping different characters.
Linguistic Constraints. Linguistic constraints promote indistinguishable attacks. For example, Genetic [1], FasterGenetic [21], HotFlip [10], and Pruthi [42] limit the number or percentage of words perturbed. Other methods ensure the distance between the perturbed text and the original text is "close" in some embedding space; for example, BAE [13], TextBugger [30], and TextFooler [23] constrain the perturbed text to have high cosine similarity to the original text using a universal sentence encoder (USE) [4], while IGA [52] and VIPER [11] ensure similarity in word and visual embedding spaces, respectively. Some methods, such as TextBugger and TextFooler, use a combination of constraints to further limit deviations from the original input.
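As an illustration of how such an embedding-space constraint can be checked, the sketch below computes USE cosine similarity between an original and a perturbed sentence via TensorFlow Hub; the 0.8 threshold is an arbitrary placeholder, not the threshold used by any particular attack.

```python
# Sketch of a USE-based similarity constraint of the kind used by BAE,
# TextBugger, and TextFooler; the threshold here is an illustrative placeholder.
import numpy as np
import tensorflow_hub as hub

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def passes_use_constraint(original: str, perturbed: str,
                          threshold: float = 0.8) -> bool:
    """Return True if the perturbed text stays close to the original in USE space."""
    a, b = use([original, perturbed]).numpy()
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine >= threshold

print(passes_use_constraint("the acting is amateurish",
                            "the acting is enthusiastic"))
```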
Attack Toolchains. We use TextAttack [38] and OpenAttack [62] — open-source toolchains that provide fully-automated off-the-shelf attacks — to generate adversarial examples. For these toolchains, attack methods are implemented using different search methods. For example, BAE [13], DeepWordBug [12], TextBugger [30], and TextFooler [23] use a word importance ranking to greedily decide which word(s) to perturb for each query; in contrast, Genetic [1] and PSO [61] use a genetic algorithm and particle swarm optimization to identify word-perturbation candidates, respectively.
² https://www.kaggle.com/edqian/twitter-climate-change-sentiment-dataset
³ https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
Table 2: Predictive performance of the target models on the test set for each domain dataset; *: multiclass macro-averaged AUC; the rest are binary classification tasks.

                    BERT           RoBERTa        XLNet
Dataset             Acc.   AUC     Acc.   AUC     Acc.   AUC
Climate Change*     79.8   0.899   81.2   0.917   80.1   0.910
IMDB                87.0   0.949   90.7   0.968   90.1   0.965
SST-2               91.8   0.972   92.7   0.978   92.3   0.974
Wikipedia           96.5   0.982   96.6   0.985   96.4   0.983
Hatebase            95.8   0.983   95.8   0.987   93.9   0.979
Civil Comments      95.2   0.968   95.1   0.967   95.0   0.965
For all attack methods, we set a maximum limit of 500 queries per instance. Note that an attack method may be implemented by both toolchains (e.g., TextBugger is implemented by both TextAttack and OpenAttack).
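For reference, a sketch of how one such attack could be run against a fine-tuned victim model with TextAttack is shown below; it assumes TextAttack's 0.3-style API, a publicly available SST-2 victim model, and the 500-query budget mentioned above, rather than the exact generation scripts used for TCAB.

```python
# Sketch of TCAB-style attack generation with TextAttack (assumes the 0.3-style
# API): wrap a fine-tuned victim model, build an off-the-shelf attack recipe,
# and enforce a 500-query budget per instance.
from textattack import AttackArgs, Attacker
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "textattack/bert-base-uncased-SST-2"   # public fine-tuned victim
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
victim = HuggingFaceModelWrapper(model, tokenizer)

attack = TextFoolerJin2019.build(victim)             # one of the twelve attacks
dataset = HuggingFaceDataset("glue", "sst2", split="validation")

args = AttackArgs(num_examples=100,                  # instances to attack
                  query_budget=500,                  # max queries per instance
                  log_to_csv="sst2_textfooler.csv")  # keep attack results
Attacker(attack, dataset, args).attack_dataset()
```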
3.4 Dataset Generation
To create TCAB, we perturb examples from the test sets of the six domain datasets (see §3.1 for
domain datasets). SST-2, Wikipedia, and IMDB have predefined train/test splits. For the other three
datasets, we randomly partition the data into an 80/10/10 train/validation/test split.
For each model/domain-dataset combination, we only attack test set examples for which the model's prediction is correct. For the abuse datasets, we further restrict our focus to test set examples that are both predicted correctly and toxic; perturbing non-toxic text so that it is classified as toxic is a less likely adversarial task.⁴
The full pipeline for generating TCAB is shown in Figure 1.
Adding Clean Instances and Creating Development-Test Splits. For each dataset, we randomly sample a fraction of the instances from the test set to include as "clean" unperturbed examples. After merging the clean and adversarial examples, we split the data into a 60/20/20 train/validation/test split. Additionally, all instances derived from the same source text are assigned to the same split (i.e., multiple successful attacks on the same original text all land in the same split) to avoid any data leakage. The train and validation sets are publicly available,⁵ while the test set is available upon request.
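A minimal sketch of this grouping constraint, using scikit-learn's GroupShuffleSplit with illustrative column names (not the exact TCAB schema), is shown below.

```python
# Sketch of the leakage-avoiding split: every attack instance derived from the
# same original text lands in the same partition. Column names are illustrative.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def grouped_split(df: pd.DataFrame, group_col: str = "original_text", seed: int = 0):
    """60/20/20 train/validation/test split grouped by source text."""
    gss = GroupShuffleSplit(n_splits=1, train_size=0.6, random_state=seed)
    train_idx, rest_idx = next(gss.split(df, groups=df[group_col]))
    train, rest = df.iloc[train_idx], df.iloc[rest_idx]

    gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    val_idx, test_idx = next(gss2.split(rest, groups=rest[group_col]))
    return train, rest.iloc[val_idx], rest.iloc[test_idx]

# Hypothetical usage:
# train, val, test = grouped_split(attacks_df)
# assert set(train["original_text"]).isdisjoint(test["original_text"])
```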
Extending TCAB. TCAB is designed to incorporate new attacks as they are developed, as well as existing attacks on new text classifiers. TCAB thus facilitates research into attack identification models that stay up to date with the latest attacks on the latest text classifiers. Instructions and code for extending TCAB with new attacks or domain datasets are available at https://github.com/REACT-NLP/tcab_generation.
3.5 TCAB Statistics
TCAB contains a total of 1,504,607 successful attacks on six domain datasets against three different
target models using twelve attack methods from two open-source toolchains.
Attack Success Rate.
Table 3 shows a breakdown of attack success rates and the number of
successful attacks for each method. For Civil Comments, many instances were very easy to manipulate
successfully (Figure 2: far right), and it was not uncommon for all 12 attackers to successfully perturb
the same instance. In contrast, it was quite rare for more than three of the attackers to be successful
on any IMDB instance. Of the three target-model architectures, XLNet was the most robust — it was
successfully attacked 61% of the time (this percentage is computed over all attack attempts made
against all XLNet models). BERT and RoBERTa were fooled 63% and 66% of the time, respectively.
⁴ One can imagine cases where an attacker would have this goal, such as trying to get someone else banned from a social network by tricking them into posting text that triggers an abuse filter. However, we expect this to be much less common than attackers simply trying to evade abuse filters.
⁵ https://zenodo.org/record/7226519