
3.1 Domain Datasets
Our focus is on two text-classification domains: sentiment analysis and abuse detection. Sentiment
analysis is a popular and widely studied task [41, 58, 64, 43, 31, 55], while abuse detection is more
likely to be adversarial [40, 63, 26].
For sentiment analysis, we attack models trained on three domains: (1) Climate Change2, 62,356
tweets on climate change; (2) IMDB [34], 50,000 movie reviews; and (3) SST-2 [45], 68,221
movie reviews. For abuse detection, we attack models trained on three toxic-comment datasets:
(1) Wikipedia (Talk Pages) [56, 8], 159,686 comments from Wikipedia administration webpages;
(2) Hatebase [6], 24,783 comments; and (3) Civil Comments3, 1,804,874 comments from independent
news sites. All datasets are binary (positive vs. negative or toxic vs. non-toxic) except for Climate
Change, which includes neutral sentiment. Additional dataset details are in the Appendix §B.1.
3.2 Target Models
We finetune BERT [7], RoBERTa [33], and XLNet [59] models — all from HuggingFace’s transformers
library [54] — on the six domain datasets. We use transformer-based models since they represent
current state-of-the-art approaches to text classification, and we use multiple architectures to obtain a
wider range of adversarial examples, ultimately testing the robustness of attack identification models
to attacks targeting different victim models.
Table 2 shows the performance of these models on the test set of each domain dataset. On most
datasets, RoBERTa slightly outperforms the other two models both in accuracy and AUROC. Training
code and additional details such as selected hyperparameters are in the Appendix §B.2.
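To make this setup concrete, the sketch below shows the standard pattern for finetuning one of these
classifiers with the transformers Trainer API. It is only a sketch: the checkpoint name, toy training
data, and hyperparameters are illustrative placeholders, not the tuned values behind Table 2 (those
are listed in Appendix §B.2).

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder checkpoint; the same pattern applies to BERT, RoBERTa, and XLNet.
checkpoint = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy stand-ins for one of the domain datasets (e.g., IMDB movie reviews).
texts = ["a wonderful, moving film", "dull and far too long"]
labels = [1, 0]  # 1 = positive, 0 = negative

class ClassificationDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and labels for the HuggingFace Trainer."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Illustrative hyperparameters only; the selected values are in Appendix B.2.
args = TrainingArguments(output_dir="finetuned-model", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args,
        train_dataset=ClassificationDataset(texts, labels)).train()
```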
3.3 Attack Methods
We select twelve different attack methods that cover a wide range of design choices and assumptions,
such as model access level (e.g., white/gray/black box), perturbation level (e.g., char/word/token),
and linguistic constraints. Table 7 (Appendix, §B.4) provides a summary of all attack methods and
their characteristics.
Target Model Access and Perturbation Levels.
Of the twelve attack methods, only two [10, 30] have full access to the target model (i.e., a white-box
attack), while five [12, 21, 1, 52, 42] assume some information about the target (gray box), and the
rest [13, 61, 30, 23, 11] can only query the output (black box). The majority of methods perturb
entire words by swapping them with similar words based on sememes [61], synonyms [23], or an
embedding space [13, 21, 1, 52, 42]. The remaining methods [12, 10, 30, 11] operate on the
token/character level, perturbing the input by inserting/deleting/swapping different characters.
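As a toy illustration of these character-level operations (not an implementation of any particular
attack above), the sketch below applies one random insert, delete, or adjacent-swap edit to a word;
real attacks such as DeepWordBug or TextBugger choose such edits guided by the victim model's output.

```python
import random
import string

def perturb_word(word: str, rng: random.Random) -> str:
    """Apply one random character-level edit (insert, delete, or adjacent swap).

    Toy illustration of the perturbation primitives only; it is not tied to
    any specific attack method discussed in this section.
    """
    if len(word) < 2:
        return word
    op = rng.choice(["insert", "delete", "swap"])
    i = rng.randrange(len(word) - 1)
    if op == "insert":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    # Swap two adjacent characters.
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

rng = random.Random(0)
print(perturb_word("terrible", rng))  # e.g., "terirble"
```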
Linguistic Constraints.
Linguistic constraints promote indistinguishable attacks. For example, Genetic [1], FasterGenetic [21],
HotFlip [10], and Pruthi [42] limit the number or percentage of words perturbed. Other methods
ensure the distance between the perturbed text and the original text is “close” in some embedding
space; for example, BAE [13], TextBugger [30], and TextFooler [23] constrain the perturbed text to
have high cosine similarity to the original text using a universal sentence encoder (USE) [4], while
IGA [52] and VIPER [11] ensure similarity in word and visual embedding spaces, respectively. Some
methods, such as TextBugger and TextFooler, use a combination of constraints to further limit
deviations from the original input.
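Concretely, an embedding-space constraint amounts to a simple accept/reject check: encode the
original and perturbed texts and require their cosine similarity to exceed a threshold. The sketch
below is a generic illustration; the `toy_encode` stand-in and the 0.8 threshold are assumptions for
the example, whereas the attacks above use a learned encoder such as USE and their own threshold
values.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def satisfies_similarity_constraint(original: str, perturbed: str,
                                    encode, threshold: float = 0.8) -> bool:
    """Accept a candidate perturbation only if it stays 'close' to the original.

    `encode` is any sentence encoder mapping text to a fixed-size vector; the
    threshold value here is purely illustrative.
    """
    return cosine_similarity(encode(original), encode(perturbed)) >= threshold

def toy_encode(text: str) -> np.ndarray:
    """Toy bag-of-characters encoder so the sketch runs without external models."""
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

# Check a small edit against the original text.
print(satisfies_similarity_constraint("the movie was great",
                                      "the film was great", toy_encode))
```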
Attack Toolchains.
We use TextAttack [38] and OpenAttack [62] — open-source toolchains that provide fully-automated
off-the-shelf attacks — to generate adversarial examples. For these toolchains, attack methods are
implemented using different search methods. For example, BAE [13], DeepWordBug [12],
TextBugger [30], and TextFooler [23] use a word importance ranking to greedily decide which word(s)
to perturb for each query; in contrast, Genetic [1] and PSO [61] use a genetic algorithm and particle
swarm optimization to identify word-perturbation candidates, respectively. For
2 https://www.kaggle.com/edqian/twitter-climate-change-sentiment-dataset
3 https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification