Why Should Adversarial Perturbations be Imperceptible?
Rethink the Research Paradigm in Adversarial NLP
WARNING: This paper contains real-world cases which are offensive in nature.
Yangyi Chen1,2*, Hongcheng Gao1,3*, Ganqu Cui1, Fanchao Qi1, Longtao Huang4, Zhiyuan Liu1,5†, Maosong Sun1,5†
1NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing
2University of Illinois Urbana-Champaign  3Chongqing University
4Alibaba Group  5IICTUS, Shanghai
yangyic3@illinois.edu, gaohongcheng2000@gmail.com
*Equal contribution; work done during internship at Tsinghua University.  †Corresponding author.
Abstract
Textual adversarial samples play important roles in multiple subfields of NLP research, including security, evaluation, explainability, and data augmentation. However, most work mixes all these roles, obscuring the problem definitions and research goals of the security role, which aims to reveal the practical concerns of NLP models. In this paper, we rethink the research paradigm of textual adversarial samples in security scenarios. We discuss the deficiencies in previous work and propose that research on Security-oriented adversarial NLP (SoadNLP) should: (1) evaluate methods on security tasks to demonstrate real-world concerns; (2) consider real-world attackers' goals, instead of developing impractical methods. To this end, we first collect, process, and release a collection of security datasets, Advbench. Then, we reformalize the task and adjust the emphasis on different goals in SoadNLP. Next, we propose a simple method based on heuristic rules that can easily fulfill the actual adversarial goals, to simulate real-world attack methods. We conduct experiments on both the attack and the defense sides on Advbench. Experimental results show that our method has higher practical value, indicating that the research paradigm in SoadNLP may start from our new benchmark. All the code and data of Advbench can be obtained at https://github.com/thunlp/Advbench.
1 Introduction
Natural language processing (NLP) models based on deep learning have been employed in many real-world applications (Badjatiya et al., 2017; Zhang et al., 2018; Niklaus et al., 2018; Han et al., 2021). Meanwhile, there is a concurrent line of research on textual adversarial samples that are intentionally crafted to mislead models' predictions (Samanta and Mehta, 2017; Papernot et al., 2016). Previous work shows that textual adversarial samples play important roles in multiple subfields of NLP research. We categorize and summarize these roles in Table 1.

Table 1: Roles of textual adversarial samples.
Security: Adversarial samples can reveal the practical concerns of NLP models deployed in security situations.
Evaluation: Adversarial samples can be employed to benchmark models' robustness to out-of-distribution data (diverse user inputs).
Explainability: Adversarial samples can explain part of the models' decision processes.
Augmentation: Adversarial training based on adversarial-sample augmentation can improve performance and robustness.
We argue that the problem definitions, including the priorities of goals and the experimental settings, differ across the different roles of adversarial samples. However, most previous work in adversarial NLP mixes all the roles, including the security role of revealing real-world concerns of NLP models deployed in security scenarios. This leads to problem definitions and research goals that are inconsistent with real-world cases. As a consequence, although most existing work on textual adversarial attacks claims to reveal security issues, it often follows a security-irrelevant research paradigm. To fix this problem, we focus on the security role and try to refine the research paradigm for future work in this direction.
There are two core reasons why previous textual adversarial attack work can hardly help with real-world security problems. First, most work does not consider security tasks and datasets (Ren et al., 2019; Zang et al., 2020b) (see Table 7); security-irrelevant tasks like sentiment analysis and natural language inference are often used in the evaluation instead. Second, most work does not consider real-world attackers' goals, making unrealistic assumptions or adding unnecessary restrictions (e.g., the imperceptibility requirement) to the adversarial perturbations (Li et al., 2020; Garg and Ramakrishnan, 2020). Consider the case where attackers want to bypass detection systems to post an offensive message on the web. They can only access the decisions (e.g., pass or reject) of the black-box detection systems, without concrete confidence scores, and their adversarial goals are to convey the offensive meaning and to bypass the detection systems. So there is no need for them to make the adversarial perturbations imperceptible, as assumed in previous work; see Table 2 for an example. Besides, most methods suffer from inefficiency (i.e., high query counts and long running times), which makes them less practical and an unlikely choice for attackers in the real world. We refer readers to Section 6 for a further discussion of previous work.

Table 2: Comparison between a real-world attack and a method proposed in the NLP community. The real-world attack is clearly easier to implement and preserves the adversarial meaning better.
Original: I was all over the fucking place because the toaster had tits.
PWWS (Ren et al., 2019): I was all over the bally topographic because the wassailer have breast.
Real-World Attack: I was all over the fuc king place because the toaster had tits. !!!peace peace peace
To address the issue of security-irrelevant evaluation benchmarks, we first summarize five security tasks and search for corresponding open-source datasets. We collect, process, and release these datasets as a collection named Advbench to facilitate future research. To address the issue of the ill-defined problem definition, we refer to the intention of real-world attackers to reformalize the task of textual adversarial attack and adjust the emphasis on different adversarial goals. Further, to simulate real-world attacks, we propose a simple attack method based on heuristic rules summarized from various sources, which can easily fulfill the actual attackers' goals.
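To make this concrete, below is a minimal sketch of the kind of rule-based perturbation illustrated in Table 2: splitting words that a keyword filter would flag and appending benign distractor tokens. The specific rules, the blocklist, and the decision-only detector interface are illustrative assumptions, not the exact rule set used in our experiments.

```python
import random

def split_flagged_words(text, blocklist):
    """Break words a keyword filter would flag by inserting a space
    inside them (e.g., "fucking" -> "fuc king")."""
    words = []
    for w in text.split():
        if w.lower().strip(".,!?") in blocklist and len(w) > 3:
            cut = random.randint(2, len(w) - 2)
            words.append(w[:cut] + " " + w[cut:])
        else:
            words.append(w)
    return " ".join(words)

def append_distractors(text, distractors=("peace",), n=3):
    """Append benign tokens to dilute the toxic signal seen by the classifier."""
    return text + " !!!" + " ".join(random.choices(distractors, k=n))

def heuristic_attack(text, blocklist, detector, max_tries=5):
    """Re-apply the rules until the decision-only detector lets the text pass."""
    candidate = text
    for _ in range(max_tries):
        if detector(candidate) == "pass":
            return candidate
        candidate = append_distractors(split_flagged_words(text, blocklist))
    return candidate  # may still be rejected once the query budget is spent
```

Note that the attacker needs only the pass/reject decision of the detector; no scores, gradients, or external NLP resources are assumed.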
We conduct comprehensive experiments on Advbench to evaluate methods proposed in the NLP community and our simple method. Experimental results overall demonstrate the superiority of our method, considering the attack performance, the attack efficiency, and the preservation of adversarial meaning (validity). We also consider the defense side and show that the SOTA defense method cannot handle our simple heuristic attack algorithm. The overall experiments indicate that the research paradigm in SoadNLP may start from our new benchmark.
To summarize, the main contributions of this paper are as follows:
• We collect, process, and release Advbench, a collection of security datasets.
• We reconsider the attackers' goals and reformalize the task of textual adversarial attack in security scenarios.
• We propose a simple attack method that fulfills the actual attackers' goals to simulate real-world attacks, which can facilitate future research on both the attack and the defense sides.
2 Advbench Construction
2.1 Motivation
We first survey previous work on adversarial attacks in NLP with respect to the tasks and datasets considered in their experiments (see Table 7). We find that most of the tasks they consider are not security-relevant (e.g., sentiment analysis). As a result, the real-world concerns these experiments claim to reveal are not well grounded in reality, given the lack of a security evaluation benchmark. To this end, we suggest that future researchers evaluate their methods on security tasks to demonstrate real-world harmfulness and practical concerns. Thus, a collection of security datasets is needed to facilitate future research.
2.2 Tasks
We summarize five security tasks: misinformation, disinformation, toxic content, spam, and sensitive information detection. The task descriptions and our motivation for choosing these tasks are given in Appendix B. Because some datasets are label-unbalanced, we release both balanced and unbalanced processed versions. The dataset statistics are listed in Table 8. All datasets are processed through a general pipeline that removes duplicate, missing, and unusual values.
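As an illustration only, a minimal version of such a cleaning pass could look like the following; the column names text and label and the length threshold are assumptions, not the exact settings used to build Advbench.

```python
import pandas as pd

def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Drop missing, duplicate, and unusual rows from a text-classification dataset."""
    df = df.dropna(subset=["text", "label"])          # missing values
    df = df.drop_duplicates(subset=["text"])          # duplicate texts
    df = df[df["text"].str.strip().str.len() > 0]     # empty strings
    df = df[df["text"].str.len() < 100_000]           # abnormally long outliers
    return df.reset_index(drop=True)
```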
2.2.1 Misinformation
LUN. Our LUN dataset is built on the Labeled Unreliable News dataset (Rashkin et al., 2017), which consists of articles from news media with human fact-checking annotations. We merge the satirical news from The Onion, the hoaxes from American News, and the propaganda from The Activist Report into one category labeled untrusted, and the articles collected from Gigaword News are labeled trusted. Because there is too little data in the original testing set, we mix the original training and testing sets and re-partition them with a 7:3 ratio.
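A hedged sketch of this merge-and-resplit step is given below; the source label names and the use of scikit-learn's train_test_split (with stratification) are illustrative assumptions rather than our exact processing script.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical source labels; the released LUN data uses its own label scheme.
UNTRUSTED = {"satire", "hoax", "propaganda"}

def build_lun(df: pd.DataFrame, seed: int = 42):
    """Map the original source types to trusted/untrusted and re-split 7:3."""
    df = df.copy()
    df["label"] = df["source_type"].apply(
        lambda t: "untrusted" if t in UNTRUSTED else "trusted"
    )
    train, test = train_test_split(
        df[["text", "label"]], test_size=0.3, random_state=seed, stratify=df["label"]
    )
    return train, test
```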
SATNews. The Satirical News Dataset (Yang et al., 2017) is a collection of satirical and verified news. The satirical news articles are collected from 14 websites that explicitly declare that they offer satire. The verified news articles are collected from major news outlets (CNN, DailyMail, WashingtonPost, NYTimes, The Guardian, and Fox) and from Google News using FLORIN (Liu et al., 2015). The original training and validation sets are merged as our training set, and the testing set remains unchanged.
2.2.2 Disinformation
Amazon-LB. The Amazon Luxury Beauty Review dataset is a collection of reviews from the Luxury Beauty category of Amazon, with verification information, taken from the Amazon Review Data (2018) (Ni et al., 2019). The Amazon Review Data (2018) is an updated version of the Amazon Review Dataset (He and McAuley, 2016; McAuley et al., 2015) released in 2014 and contains 29 types of data for different scenarios. We extract the Luxury Beauty data from the "small" subsets (which are reduced from the full sets), because this category has an appropriate quantity and diversity. We keep only the content of each review and its label (whether the review is verified or not), and split the data into training and testing sets with a 7:3 ratio.
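For illustration, the sketch below loads one of the gzipped JSON-lines review files and keeps only the content and the verified flag; the field names follow the public Amazon Review Data (2018) release, but treat them as assumptions and adjust them to the local copy.

```python
import gzip
import json
import pandas as pd

def load_luxury_beauty(path: str) -> pd.DataFrame:
    """Read a gzipped JSON-lines review file and keep only content and label."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            r = json.loads(line)
            text = r.get("reviewText", "").strip()   # review content (assumed field name)
            if text:
                rows.append({"text": text,
                             "label": int(bool(r.get("verified", False)))})
    return pd.DataFrame(rows)
```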
CGFake. The Computer-Generated Fake Review Dataset (Salminen et al., 2022) contains label-balanced product reviews with two categories: original reviews (presumably human-created and authentic) and computer-generated fake reviews. Computer-generated fake reviews are a new type of disinformation that uses computer technology to generate fake samples that mislead humans. We split this dataset into training and testing sets in the same way as the original paper.
2.2.3 Toxic
HSOL. The Hate Speech and Offensive Language Dataset (Davidson et al., 2017) contains more than 200k labeled tweets collected via the Twitter API. The original dataset is classified into three categories: hate speech, offensive but not hate speech, and normal. We combine hate speech and offensive speech into one category labeled "hate", and the remaining samples are labeled "non-hate".
Jigsaw2018. Jigsaw2018 is the dataset of the Toxic Comment Classification Challenge on Kaggle (available from Kaggle). It includes plentiful Wikipedia comments, labeled by human annotators for toxic behavior with two categories: toxic and non-toxic.
2.2.4 Spam
Enron. The Enron dataset (Metsis et al., 2006; http://www2.aueb.gr/users/ion/data/enron-spam/) is a corpus of emails split into two categories: legitimate and spam. There are six subsets in the dataset; each contains non-spam messages from one user in the Enron corpus, paired with one of three spam collections: the SpamAssassin corpus together with the Honeypot project (https://www.projecthoneypot.org/), Bruce Guenter's spam collection (http://untroubled.org/spam/), and the spam collected by Metsis et al. (2006). We mix all subsets and split them into training and testing sets. We keep only the content of each email, dropping other information such as the subject and addresses.
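As an illustration of this email preprocessing, the sketch below keeps only the plain-text body and discards headers such as the subject and addresses. It assumes the messages are stored as standard RFC 822 email files; the released corpus may instead ship pre-extracted text, in which case this step is unnecessary.

```python
from email import policy
from email.parser import BytesParser

def extract_body(raw_path: str) -> str:
    """Parse a raw email file and return only its plain-text body."""
    with open(raw_path, "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)
    # Prefer the plain-text part; fall back to an empty string if none exists.
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content().strip() if part is not None else ""
```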
SpamAssassin. The SpamAssassin public corpus (https://spamassassin.apache.org/old/publiccorpus/) is a collection of emails in three categories: easy-ham, hard-ham, and spam. We merge easy-ham and hard-ham into the ham class. Then we mix all samples and split them equally into training and testing sets because of the limited amount of data. Each email is preprocessed in the same way as Enron.
2.2.5 Sensitive Information
EDENCE. EDENCE (Neerbek, 2019a) contains samples from the Enron corpus with automatically generated parse-tree structures. The annotated labels come from the TREC LEGAL (Tomlinson, 2010; Cormack et al., 2010) labels for Enron documents. We restore the tree-structured samples to normal texts and map the sensitive-information labels back to each sample. Then we combine the training and validation sets as our training set, and the testing set remains unchanged.
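The exact serialization of the EDENCE trees is not reproduced here; as a hedged illustration, the helper below assumes Penn-Treebank-style bracketed parses and recovers the surface text by keeping only the leaf tokens.

```python
import re

def tree_to_text(bracketed: str) -> str:
    """Recover the surface text from a bracketed parse, e.g.
    '(S (NP (DT the) (NN memo)) (VP (VBZ leaked)))' -> 'the memo leaked'."""
    leaves = []
    # Match either an opening bracket with its label, a closing bracket, or a bare token.
    for label, leaf in re.findall(r"\(([^\s()]+)|\)|([^\s()]+)", bracketed):
        if leaf:  # only bare tokens (leaves) contribute to the surface text
            leaves.append(leaf)
    return " ".join(leaves)
```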
FAS. FAS (Neerbek, 2019b) also contains samples with parse-tree structures built from the Enron corpus and is adapted for sensitive information detection using TREC LEGAL labels annotated by domain experts. The samples in FAS are compliant with Financial Accounting Standards 3, and they are preprocessed in the same way as EDENCE in our work.
3 Task Formalization
3.1 Motivation
In our survey, we find that the current problem definition and research goals for the security role of adversarial samples, i.e., revealing practical concerns, are ill-defined and ambiguous. We attribute this to the failure to distinguish the several roles of adversarial samples (see Table 1). The problem definitions differ across the different roles. For example, when adversarial samples are adopted to augment existing datasets for adversarial training, we may aim for high-quality samples, so the minor-perturbation restriction is important. On the contrary, on the security side, we should focus more on the preservation of adversarial meaning and on attack efficiency than on imperceptible perturbations. See Section 6 for a further discussion.

Thus, we need to separate the research on the different roles of adversarial samples. On the security side, most work does not consider realistic situations and the actual adversarial goals, which may result in unrealistic assumptions or unnecessary restrictions when developing attack or defense methods. To make the research in this field more standardized and in-depth, this problem needs to be reformalized. Note that we focus on the security role of textual adversarial samples in this paper.
3.2 Formalization
Overview. Without loss of generality, we consider the text classification task. Given a classifier $f: \mathcal{X} \rightarrow \mathcal{Y}$ that makes the correct prediction on the original input text $x$:
$$\arg\max_{y_i \in \mathcal{Y}} P(y_i \mid x) = y_{\text{true}}, \qquad (1)$$
where $y_{\text{true}}$ is the gold label of $x$, the attackers make perturbations $\delta$ to craft an adversarial sample $x_{\text{adv}}$ that fools the classifier:
$$\arg\max_{y_i \in \mathcal{Y}} P(y_i \mid x_{\text{adv}}) \neq y_{\text{true}}, \qquad x_{\text{adv}} = x + \delta. \qquad (2)$$
Refinement. The core part of adversarial NLP is finding appropriate perturbations $\delta$. We identify four deficiencies in the common research paradigm of SoadNLP.
(1) Most attack methods iteratively search for better $\delta$ by relying on access to the victim models' confidence scores or gradients (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020b; Li et al., 2020). However, this assumption is unrealistic in real-world security tasks (e.g., hate-speech detection). We argue that research in adversarial NLP that aims at practical concerns should focus on the decision-based setting, where only the decisions of the victim models can be accessed.
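As a concrete illustration of the decision-based setting, the following sketch wraps a victim classifier so that an attacker sees only hard label decisions, and it counts queries, which also supports the efficiency argument in point (3) below. The HuggingFace pipeline shown in the usage comment is an assumption for illustration; any callable that returns a label works.

```python
from typing import Callable

class DecisionOnlyVictim:
    """Expose only the hard decisions of a victim classifier and count queries.

    `predict_label` is any callable mapping a text to a label string; no
    confidence scores or gradients are revealed to the attacker.
    """

    def __init__(self, predict_label: Callable[[str], str]):
        self._predict_label = predict_label
        self.num_queries = 0

    def __call__(self, text: str) -> str:
        self.num_queries += 1
        return self._predict_label(text)

# Usage sketch (assuming a HuggingFace text-classification pipeline):
#   from transformers import pipeline
#   clf = pipeline("text-classification", model="some/hate-speech-model")
#   victim = DecisionOnlyVictim(lambda t: clf(t)[0]["label"])
#   decision = victim("some message")   # returns only a label, e.g. "toxic"
#   print(victim.num_queries)           # attack cost measured in queries
```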
(2) Previous work attempts to make $\delta$ imperceptible by imposing restrictions on the search process, such as requiring the cosine similarity between the adversarial and original sentence embeddings to exceed a threshold (Li et al., 2020; Garg and Ramakrishnan, 2020), or constraining the adversarial samples' perplexity (Qi et al., 2021). However, why should adversarial perturbations be imperceptible? The attackers' goals are to (1) bypass the detection systems and (2) convey the malicious meaning. So the attackers only need to preserve the adversarial content (e.g., the hate speech in messages), no matter how many perturbations are added to the original sentence to bypass the detection systems (consider Table 2). Thus, we argue that these constraints are unnecessary, and the quality of adversarial samples is a secondary consideration.
(3) Adversarial attacks based on word substitution or sentence paraphrasing are the most widely studied. However, current attack algorithms are very inefficient and need to query the victim models hundreds of times to craft adversarial samples, which makes them unlikely to occur in reality (some work tries to address this issue, but the effect is limited; Zang et al., 2020a; Chen et al., 2021b). We argue that adversarial attacks should be computationally efficient, both in running time and in the number of queries to the victim models, to better simulate practical situations.
(4) There is a line of work assuming that the attackers are experienced NLP practitioners, incorporating external knowledge bases (Ren et al., 2019; Zang et al., 2020b) or NLP models (Li et al., 2020; Qi et al., 2021) into the attack algorithms. However, anyone can be an attacker in reality. Consider the hate speakers on social platforms. They often try different heuristic strategies to es-