Why Should Adversarial Perturbations be Imperceptible?
Rethink the Research Paradigm in Adversarial NLP
WARNING: This paper contains real-world cases which are offensive in nature.
Yangyi Chen1,2*, Hongcheng Gao1,3*, Ganqu Cui1, Fanchao Qi1, Longtao Huang4, Zhiyuan Liu1,5†, Maosong Sun1,5†
1NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing
2University of Illinois Urbana-Champaign  3Chongqing University
4Alibaba Group  5IICTUS, Shanghai
yangyic3@illinois.edu, gaohongcheng2000@gmail.com
*Equal contribution; work done during internship at Tsinghua University.  †Corresponding author.
Abstract
Textual adversarial samples play important roles in multiple subfields of NLP research, including security, evaluation, explainability, and data augmentation. However, most work mixes all these roles, obscuring the problem definitions and research goals of the security role, which aims to reveal the practical concerns of NLP models. In this paper, we rethink the research paradigm of textual adversarial samples in security scenarios. We discuss the deficiencies in previous work and propose that research on Security-oriented adversarial NLP (SoadNLP) should: (1) evaluate methods on security tasks to demonstrate real-world concerns; (2) consider real-world attackers' goals, instead of developing impractical methods. To this end, we first collect, process, and release a collection of security datasets, Advbench. Then, we reformalize the task and adjust the emphasis on different goals in SoadNLP. Next, we propose a simple method based on heuristic rules that can easily fulfill the actual adversarial goals, to simulate real-world attack methods. We conduct experiments on both the attack and the defense sides on Advbench. Experimental results show that our method has higher practical value, indicating that the research paradigm in SoadNLP may start from our new benchmark. All the code and data of Advbench can be obtained at https://github.com/thunlp/Advbench.
1 Introduction
Natural language processing (NLP) models based on deep learning have been employed in many real-world applications (Badjatiya et al., 2017; Zhang et al., 2018; Niklaus et al., 2018; Han et al., 2021). Meanwhile, there is a concurrent line of research on textual adversarial samples that are intentionally crafted to mislead models' predictions (Samanta and Mehta, 2017; Papernot et al., 2016). Previous work shows that textual adversarial samples play important roles in multiple subfields of NLP research. We categorize and summarize these roles in Table 1.

Table 1: Roles of textual adversarial samples.
Security: Adversarial samples can reveal the practical concerns of NLP models deployed in security situations.
Evaluation: Adversarial samples can be employed to benchmark models' robustness to out-of-distribution data (diverse user inputs).
Explainability: Adversarial samples can explain part of the models' decision processes.
Augmentation: Adversarial training based on adversarial-sample augmentation can improve performance and robustness.
We argue that the problem definitions, including the priorities of goals and the experimental settings, differ across the different roles of adversarial samples. However, most previous work in adversarial NLP mixes all the roles, including the security role of revealing real-world concerns of NLP models deployed in security scenarios. This leads to problem definitions and research goals that are inconsistent with real-world cases. As a consequence, although most existing work on textual adversarial attacks claims to reveal security issues, it often follows a security-irrelevant research paradigm. To fix this problem, we focus on the security role and try to refine the research paradigm for future work in this direction.
There are two core reasons why previous textual adversarial attack work can hardly help with real-world security problems. First, most work does not consider security tasks and datasets (Ren et al., 2019; Zang et al., 2020b) (see Table 7); security-irrelevant tasks like sentiment analysis and natural language inference are often used in the evaluation instead. Second, most work does not consider real-world attackers' goals, making unrealistic assumptions or adding unnecessary restrictions (e.g., the imperceptibility requirement) to the adversarial perturbations (Li et al., 2020; Garg and Ramakrishnan, 2020). Consider the case where attackers want to bypass detection systems to post an offensive message on the web. They can only access the decisions (e.g., pass or reject) of the black-box detection systems, without concrete confidence scores, and their adversarial goals are to convey the offensive meaning and to bypass the detection systems. So there is no need for them to make the adversarial perturbations imperceptible, as assumed in previous work; see Table 2 for an example. Besides, most methods suffer from inefficiency (i.e., high query counts and long running times), which makes them less practical and an unlikely choice for attackers in the real world. We refer readers to Section 6 for a further discussion of previous work.

Table 2: Comparison between a real-world attack and a method proposed in the NLP community. The real-world attack is clearly easier to implement and preserves the adversarial meaning better.
Original: I was all over the fucking place because the toaster had tits.
PWWS (Ren et al., 2019): I was all over the bally topographic because the wassailer have breast.
Real-World Attack: I was all over the fuc king place because the toaster had tits. !!!peace peace peace
To address the issue of security-irrelevant evaluation benchmarks, we first summarize five security tasks and search for corresponding open-source datasets. We collect, process, and release these datasets as a collection named Advbench to facilitate future research. To address the issue of the ill-defined problem definition, we refer to the intention of real-world attackers to reformalize the task of textual adversarial attack and adjust the emphasis on different adversarial goals. Further, to simulate real-world attacks, we propose a simple attack method based on heuristic rules summarized from various sources, which can easily fulfill the actual attackers' goals.
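To make this concrete, below is a minimal sketch of the kind of rule-based perturbation illustrated in Table 2: splitting words that a keyword filter would flag and appending benign distractor tokens. The specific rules, the blocklist, and the decision-only detector interface are illustrative assumptions, not the exact rule set used in our experiments.

```python
import random

def split_flagged_words(text, blocklist):
    """Break words a keyword filter would flag by inserting a space
    inside them (e.g., "fucking" -> "fuc king")."""
    words = []
    for w in text.split():
        if w.lower().strip(".,!?") in blocklist and len(w) > 3:
            cut = random.randint(2, len(w) - 2)
            words.append(w[:cut] + " " + w[cut:])
        else:
            words.append(w)
    return " ".join(words)

def append_distractors(text, distractors=("peace",), n=3):
    """Append benign tokens to dilute the toxic signal seen by the classifier."""
    return text + " !!!" + " ".join(random.choices(distractors, k=n))

def heuristic_attack(text, blocklist, detector, max_tries=5):
    """Re-apply the rules until the decision-only detector lets the text pass."""
    candidate = text
    for _ in range(max_tries):
        if detector(candidate) == "pass":
            return candidate
        candidate = append_distractors(split_flagged_words(text, blocklist))
    return candidate  # may still be rejected once the query budget is spent
```

Note that the attacker needs only the pass/reject decision of the detector; no scores, gradients, or external NLP resources are assumed.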
We conduct comprehensive experiments on Advbench to evaluate methods proposed in the NLP community and our simple method. Experimental results overall demonstrate the superiority of our method, considering the attack performance, the attack efficiency, and the preservation of adversarial meaning (validity). We also consider the defense side and show that the SOTA defense method cannot handle our simple heuristic attack algorithm. The overall experiments indicate that the research paradigm in SoadNLP may start from our new benchmark.
To summarize, the main contributions of this paper are as follows:
• We collect, process, and release Advbench, a collection of security datasets.
• We reconsider the attackers' goals and reformalize the task of textual adversarial attack in security scenarios.
• We propose a simple attack method that fulfills the actual attackers' goals to simulate real-world attacks, which can facilitate future research on both the attack and the defense sides.
2 Advbench Construction
2.1 Motivation
We first survey previous work on adversarial attacks in NLP with respect to the tasks and datasets considered in their experiments (see Table 7). We find that most of the tasks they consider are not security-relevant (e.g., sentiment analysis). As a result, the real-world concerns these experiments claim to reveal are not well grounded in reality, given the lack of a security evaluation benchmark. To this end, we suggest that future researchers evaluate their methods on security tasks to demonstrate real-world harmfulness and practical concerns. Thus, a collection of security datasets is needed to facilitate future research.
2.2 Tasks
We summarize five security tasks: misinformation, disinformation, toxic content, spam, and sensitive information detection. The task descriptions and our motivation for choosing these tasks are given in Appendix B. Because some datasets are label-unbalanced, we release both balanced and unbalanced processed versions. The dataset statistics are listed in Table 8. All datasets are processed through a general pipeline that removes duplicate, missing, and unusual values.
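As an illustration only, a minimal version of such a cleaning pass could look like the following; the column names text and label and the length threshold are assumptions, not the exact settings used to build Advbench.

```python
import pandas as pd

def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Drop missing, duplicate, and unusual rows from a text-classification dataset."""
    df = df.dropna(subset=["text", "label"])          # missing values
    df = df.drop_duplicates(subset=["text"])          # duplicate texts
    df = df[df["text"].str.strip().str.len() > 0]     # empty strings
    df = df[df["text"].str.len() < 100_000]           # abnormally long outliers
    return df.reset_index(drop=True)
```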
2.2.1 Misinformation
LUN. Our LUN dataset is built on the Labeled Unreliable News dataset (Rashkin et al., 2017), which consists of articles from news media with human fact-checking annotations. We merge the satirical news from The Onion, the hoaxes from American News, and the propaganda from The Activist Report into one category labeled untrusted, and the articles collected from Gigaword News are labeled trusted. Because there is too little data in the original testing set, we mix the original training and testing sets and re-partition them with a 7:3 ratio.
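A hedged sketch of this merge-and-resplit step is given below; the source label names and the use of scikit-learn's train_test_split (with stratification) are illustrative assumptions rather than our exact processing script.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical source labels; the released LUN data uses its own label scheme.
UNTRUSTED = {"satire", "hoax", "propaganda"}

def build_lun(df: pd.DataFrame, seed: int = 42):
    """Map the original source types to trusted/untrusted and re-split 7:3."""
    df = df.copy()
    df["label"] = df["source_type"].apply(
        lambda t: "untrusted" if t in UNTRUSTED else "trusted"
    )
    train, test = train_test_split(
        df[["text", "label"]], test_size=0.3, random_state=seed, stratify=df["label"]
    )
    return train, test
```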
SATNews. The Satirical News Dataset (Yang et al., 2017) is a collection of satirical and verified news. The satirical news articles are collected from 14 websites that explicitly declare that they offer satire. The verified news articles are collected from major news outlets (CNN, DailyMail, WashingtonPost, NYTimes, The Guardian, and Fox) and from Google News using FLORIN (Liu et al., 2015). The original training and validation sets are merged as our training set, and the testing set remains unchanged.
2.2.2 Disinformation
Amazon-LB. The Amazon Luxury Beauty Review dataset is a collection of reviews from the Luxury Beauty category of Amazon, with verification information, taken from the Amazon Review Data (2018) (Ni et al., 2019). The Amazon Review Data (2018) is an updated version of the Amazon Review Dataset (He and McAuley, 2016; McAuley et al., 2015) released in 2014 and contains 29 types of data for different scenarios. We extract the Luxury Beauty data from the "small" subsets (which are reduced from the full sets), because this category has an appropriate quantity and diversity. We keep only the content of each review and its label (whether the review is verified or not), and split the data into training and testing sets with a 7:3 ratio.
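For illustration, the sketch below loads one of the gzipped JSON-lines review files and keeps only the content and the verified flag; the field names follow the public Amazon Review Data (2018) release, but treat them as assumptions and adjust them to the local copy.

```python
import gzip
import json
import pandas as pd

def load_luxury_beauty(path: str) -> pd.DataFrame:
    """Read a gzipped JSON-lines review file and keep only content and label."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            r = json.loads(line)
            text = r.get("reviewText", "").strip()   # review content (assumed field name)
            if text:
                rows.append({"text": text,
                             "label": int(bool(r.get("verified", False)))})
    return pd.DataFrame(rows)
```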
CGFake. The Computer-Generated Fake Review Dataset (Salminen et al., 2022) contains label-balanced product reviews with two categories: original reviews (presumably human-created and authentic) and computer-generated fake reviews. Computer-generated fake reviews are a new type of disinformation that uses computer technology to generate fake samples that mislead humans. We split this dataset into training and testing sets in the same way as the original paper.
2.2.3 Toxic
HSOL. The Hate Speech and Offensive Language Dataset (Davidson et al., 2017) contains more than 200k labeled tweets collected via the Twitter API. The original dataset is classified into three categories: hate speech, offensive but not hate speech, and normal. We combine hate speech and offensive speech into one category labeled "hate", and the remaining samples are labeled "non-hate".
Jigsaw2018. Jigsaw2018 is the dataset of the Toxic Comment Classification Challenge on Kaggle (available from Kaggle). It includes plentiful Wikipedia comments, labeled by human annotators for toxic behavior with two categories: toxic and non-toxic.
2.2.4 Spam
Enron. The Enron dataset (Metsis et al., 2006; http://www2.aueb.gr/users/ion/data/enron-spam/) is a corpus of emails split into two categories: legitimate and spam. There are six subsets in the dataset; each contains non-spam messages from one user in the Enron corpus, paired with one of three spam collections: the SpamAssassin corpus together with the Honeypot project (https://www.projecthoneypot.org/), Bruce Guenter's spam collection (http://untroubled.org/spam/), and the spam collected by Metsis et al. (2006). We mix all subsets and split them into training and testing sets. We keep only the content of each email, dropping other information such as the subject and addresses.
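As an illustration of this email preprocessing, the sketch below keeps only the plain-text body and discards headers such as the subject and addresses. It assumes the messages are stored as standard RFC 822 email files; the released corpus may instead ship pre-extracted text, in which case this step is unnecessary.

```python
from email import policy
from email.parser import BytesParser

def extract_body(raw_path: str) -> str:
    """Parse a raw email file and return only its plain-text body."""
    with open(raw_path, "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)
    # Prefer the plain-text part; fall back to an empty string if none exists.
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content().strip() if part is not None else ""
```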
SpamAssassin. The SpamAssassin public corpus (https://spamassassin.apache.org/old/publiccorpus/) is a collection of emails in three categories: easy-ham, hard-ham, and spam. We merge easy-ham and hard-ham into the ham class. Then we mix all samples and split them equally into training and testing sets because of the limited amount of data. Each email is preprocessed in the same way as Enron.
2.2.5 Sensitive Information
EDENCE. EDENCE (Neerbek, 2019a) contains samples from the Enron corpus with automatically generated parse-tree structures. The annotated labels come from the TREC LEGAL (Tomlinson, 2010; Cormack et al., 2010) labels for Enron documents. We restore the tree-structured samples to normal texts and map the sensitive-information labels back to each sample. Then we combine the training and validation sets as our training set, and the testing set remains unchanged.
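The exact serialization of the EDENCE trees is not reproduced here; as a hedged illustration, the helper below assumes Penn-Treebank-style bracketed parses and recovers the surface text by keeping only the leaf tokens.

```python
import re

def tree_to_text(bracketed: str) -> str:
    """Recover the surface text from a bracketed parse, e.g.
    '(S (NP (DT the) (NN memo)) (VP (VBZ leaked)))' -> 'the memo leaked'."""
    leaves = []
    # Match either an opening bracket with its label, a closing bracket, or a bare token.
    for label, leaf in re.findall(r"\(([^\s()]+)|\)|([^\s()]+)", bracketed):
        if leaf:  # only bare tokens (leaves) contribute to the surface text
            leaves.append(leaf)
    return " ".join(leaves)
```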
FAS. FAS (Neerbek, 2019b) also contains samples with parse-tree structures built from the Enron corpus and is adapted for sensitive information detection using TREC LEGAL labels annotated by domain experts. The samples in FAS are compliant with Financial Accounting Standards 3, and they are preprocessed in the same way as EDENCE in our work.
3 Task Formalization
3.1 Motivation
In our survey, we find that the current problem definition and research goals for the security role of adversarial samples, i.e., revealing practical concerns, are ill-defined and ambiguous. We attribute this to the failure to distinguish the several roles of adversarial samples (see Table 1). The problem definitions differ across the different roles. For example, when adversarial samples are adopted to augment existing datasets for adversarial training, we may aim for high-quality samples, so the minor-perturbation restriction is important. On the contrary, on the security side, we should focus more on the preservation of adversarial meaning and on attack efficiency than on imperceptible perturbations. See Section 6 for a further discussion.

Thus, we need to separate the research on the different roles of adversarial samples. On the security side, most work does not consider realistic situations and the actual adversarial goals, which may result in unrealistic assumptions or unnecessary restrictions when developing attack or defense methods. To make the research in this field more standardized and in-depth, this problem needs to be reformalized. Note that we focus on the security role of textual adversarial samples in this paper.
3.2 Formalization
Overview. Without loss of generality, we consider the text classification task. Given a classifier $f: \mathcal{X} \rightarrow \mathcal{Y}$ that makes the correct prediction on the original input text $x$:
$$\arg\max_{y_i \in \mathcal{Y}} P(y_i \mid x) = y_{\text{true}}, \qquad (1)$$
where $y_{\text{true}}$ is the gold label of $x$, the attackers make perturbations $\delta$ to craft an adversarial sample $x_{\text{adv}}$ that fools the classifier:
$$\arg\max_{y_i \in \mathcal{Y}} P(y_i \mid x_{\text{adv}}) \neq y_{\text{true}}, \qquad x_{\text{adv}} = x + \delta. \qquad (2)$$
Refinement. The core part of adversarial NLP is finding appropriate perturbations $\delta$. We identify four deficiencies in the common research paradigm of SoadNLP.
(1) Most attack methods iteratively search for better $\delta$ by relying on access to the victim models' confidence scores or gradients (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020b; Li et al., 2020). However, this assumption is unrealistic in real-world security tasks (e.g., hate-speech detection). We argue that research in adversarial NLP that aims at practical concerns should focus on the decision-based setting, where only the decisions of the victim models can be accessed.
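As a concrete illustration of the decision-based setting, the following sketch wraps a victim classifier so that an attacker sees only hard label decisions, and it counts queries, which also supports the efficiency argument in point (3) below. The HuggingFace pipeline shown in the usage comment is an assumption for illustration; any callable that returns a label works.

```python
from typing import Callable

class DecisionOnlyVictim:
    """Expose only the hard decisions of a victim classifier and count queries.

    `predict_label` is any callable mapping a text to a label string; no
    confidence scores or gradients are revealed to the attacker.
    """

    def __init__(self, predict_label: Callable[[str], str]):
        self._predict_label = predict_label
        self.num_queries = 0

    def __call__(self, text: str) -> str:
        self.num_queries += 1
        return self._predict_label(text)

# Usage sketch (assuming a HuggingFace text-classification pipeline):
#   from transformers import pipeline
#   clf = pipeline("text-classification", model="some/hate-speech-model")
#   victim = DecisionOnlyVictim(lambda t: clf(t)[0]["label"])
#   decision = victim("some message")   # returns only a label, e.g. "toxic"
#   print(victim.num_queries)           # attack cost measured in queries
```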
(2) Previous work attempts to make $\delta$ imperceptible by imposing restrictions on the search process, such as requiring the cosine similarity between the adversarial and original sentence embeddings to exceed a threshold (Li et al., 2020; Garg and Ramakrishnan, 2020), or constraining the adversarial samples' perplexity (Qi et al., 2021). However, why should adversarial perturbations be imperceptible? The attackers' goals are to (1) bypass the detection systems and (2) convey the malicious meaning. So the attackers only need to preserve the adversarial content (e.g., the hate speech in messages), no matter how many perturbations are added to the original sentence to bypass the detection systems (consider Table 2). Thus, we argue that these constraints are unnecessary, and the quality of adversarial samples is a secondary consideration.
(3) Adversarial attacks based on word substitution or sentence paraphrasing are the most widely studied. However, current attack algorithms are very inefficient and need to query the victim models hundreds of times to craft adversarial samples, which makes them unlikely to occur in reality (some work tries to address this issue, but the effect is limited; Zang et al., 2020a; Chen et al., 2021b). We argue that adversarial attacks should be computationally efficient, both in running time and in the number of queries to the victim models, to better simulate practical situations.
(4) There is a line of work assuming that the attackers are experienced NLP practitioners, incorporating external knowledge bases (Ren et al., 2019; Zang et al., 2020b) or NLP models (Li et al., 2020; Qi et al., 2021) into the attack algorithms. However, anyone can be an attacker in reality. Consider the hate speakers on social platforms. They often try different heuristic strategies to es-