Identifying Human Strategies for Generating Word-Level Adversarial Examples

Maximilian Mozes¹, Bennett Kleinberg¹,², Lewis D. Griffin¹
¹University College London
²Tilburg University
{m.mozes, l.griffin}@cs.ucl.ac.uk
bennett.kleinberg@tilburguniversity.edu
Abstract
Adversarial examples in NLP are receiving increasing research attention. One line of investigation is the generation of word-level adversarial examples against fine-tuned Transformer models that preserve naturalness and grammaticality. Previous work found that human- and machine-generated adversarial examples are comparable in their naturalness and grammatical correctness. Most notably, humans were able to generate adversarial examples much more effortlessly than automated attacks. In this paper, we provide a detailed analysis of exactly how humans create these adversarial examples. By exploring the behavioural patterns of human workers during the generation process, we identify statistically significant tendencies based on which words humans prefer to select for adversarial replacement (e.g., word frequencies, word saliencies, sentiment) as well as where and when words are replaced in an input sequence. With our findings, we seek to inspire efforts that harness human strategies for more robust NLP models.
1 Adversarial attacks in NLP
Researchers in natural language processing (NLP) have identified the vulnerability of machine learning models to adversarial attacks: controlled, meaning-preserving input perturbations that cause a wrong model prediction (Jia and Liang, 2017; Iyyer et al., 2018; Ribeiro et al., 2018). Such adversarial examples uncover model failure cases and are a major challenge for trustworthiness and reliability. While several defence methods exist against adversarial attacks (Huang et al., 2019; Jia et al., 2019; Zhou et al., 2019; Jones et al., 2020; Le et al., 2022), developing robust NLP models is an open research challenge. An in-depth analysis of word-level adversarial examples, however, identified a range of problems, showing that they are often ungrammatical or semantically inconsistent (Morris et al., 2020).¹ This finding raised the question of how feasible natural and grammatically correct adversarial examples actually are in NLP.
To answer this question, Mozes et al. (2021a) explored whether humans are able to generate adversarial examples that are valid under such strict requirements. In that study, crowdworkers were tasked with the generation of word-level adversarial examples against a target model. The findings showed that at first sight (without strict validation) humans are less successful than automated attacks. However, when adding constraints on the preservation of sentiment, grammaticality and naturalness, human-authored examples do not differ from automated ones. The most striking finding was that automated attacks required massive computational effort while humans were able to do the same task using only a handful of queries.² This suggests that humans are far more efficient in adversarial attacks than automated systems, yet exactly how they achieve this is unclear.
In this work, we address this question by analysing human behaviour through the public dataset from Mozes et al. (2021a). We look at which words humans perturbed, where within a sentence those perturbations were located, and whether they mainly focused on perturbing sentiment-loaded words. We find that (i) in contrast to automated attacks, humans use more frequent adversarial word substitutions, (ii) the semantic similarity between replaced words and adversarial substitutions is greater for humans than for most attacks, and (iii) humans replace sentiment-loaded words more often than algorithmic attackers. Our goal is to understand what makes humans so efficient at this task, and whether these strategies could be harnessed for more adversarially robust NLP models.
¹ For example, replacing the word summer with winter.
² For example, SEMEMEPSO (Zang et al., 2020) needs 140,000 queries per example, on average, to generate successful adversarial examples on IMDb (Maas et al., 2011), whereas humans need 10.9 queries (Mozes et al., 2021a).
Attack       |      All       |   Successful   |  Unsuccessful
             |  M   SD    d   |  M   SD    d   |  M   SD    d
HUMANADV     | 0.6  3.1  0.2  | 0.5  3.0  0.1  | 0.6  3.1  0.2
TEXTFOOLER   | 2.5  2.6  0.8  | 2.5  2.6  0.8  | 2.5  2.6  0.8
GENETIC      | 1.5  2.1  0.5  | 1.4  2.0  0.5  | 1.5  2.1  0.5
BAE          | 2.0  4.0  0.5  | 1.9  4.1  0.5  | 2.0  4.0  0.5
SEMEMEPSO    | 2.4  2.8  0.8  | 2.4  2.8  0.8  |  --   --   --

Table 1: Word frequency differences between replaced words and adversarial substitutions. M and SD represent the mean and standard deviation of the differences between replaced words and substitutions (i.e., positive values: replaced words > substitutions); d denotes the Cohen's d effect size. Note that for SEMEMEPSO, all adversarial examples are successful.
2 Data and Models
We present a fine-grained analysis of the strategies that human crowdworkers employed to generate word-level adversarial examples against sentiment classification models. In the dataset from Mozes et al. (2021a), 43 participants were recruited via Amazon Mechanical Turk and trained to perform a word-level adversarial attack on test set sequences from the IMDb movie reviews dataset (Maas et al., 2011). In total, 170 adversarial examples were collected. For each of the collected adversarial examples, the authors also generated automated adversarial examples using the TEXTFOOLER (Jin et al., 2019), BAE (Garg and Ramakrishnan, 2020), GENETIC (Alzantot et al., 2018) and SEMEMEPSO (Zang et al., 2020) attacks.
The TEXTFOOLER attack uses a greedy word-replacement algorithm that is guided by word saliencies and semantic similarity measures between an unperturbed sequence and the adversarial candidate. The BAE attack resorts to a different technique, utilising a BERT-based language model to remove and replace tokens in an input sequence. The GENETIC attack, in contrast, is a population-based method using genetic algorithms. Finally, the SEMEMEPSO attack is based on replacements of word sememes instead of entire words and combines this with a particle swarm optimisation approach.
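To make the attack setup concrete, the sketch below shows how such an automated word-level attack can be run with the TextAttack library. This is an illustrative assumption: the paper names only the attack algorithms, so the library, the checkpoint identifier and the recipe shown here are stand-ins rather than the authors' actual pipeline.

    import transformers
    import textattack
    # Other recipes in textattack.attack_recipes cover the remaining attacks:
    # BAEGarg2019, GeneticAlgorithmAlzantot2018 (GENETIC), PSOZang2020 (SEMEMEPSO).
    from textattack.attack_recipes import TextFoolerJin2019

    # Hypothetical checkpoint standing in for the fine-tuned RoBERTa sentiment model.
    model_name = "textattack/roberta-base-imdb"
    model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

    # TEXTFOOLER: greedy word substitution guided by word saliency and
    # embedding similarity between the original and the perturbed sequence.
    attack = TextFoolerJin2019.build(wrapper)
    dataset = textattack.datasets.HuggingFaceDataset("imdb", split="test")
    attacker = textattack.Attacker(attack, dataset, textattack.AttackArgs(num_examples=5))
    attacker.attack_dataset()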
All attacks were performed against a RoBERTa model (Liu et al., 2019) fine-tuned on IMDb.³ Here, we only consider adversarial examples that preserved sentiment after evaluation by an independent set of crowdworkers, which Mozes et al. (2021a) used as a key validity criterion.

³ For more model details, see Section 3 in Mozes et al. (2021a).
3 Analysis
In this section, we report on a series of experiments analysing the human- and machine-authored adversarial examples.
3.1 What do humans replace?
Word frequency. We investigate the word frequency of the adversarial examples. Existing work (Mozes et al., 2021b; Hauser et al., 2021) identified significant differences in word frequency between adversarially perturbed words (hereafter referred to as replaced words) and their substitutions (hereafter referred to as adversarial substitutions) for a number of attacks. The substituted words were considerably less frequent than their original counterparts (e.g., annoying → galling).⁴
Here, we examine whether this pattern is also evident in humans' strategies. Table 1 shows the differences of the log_e word frequencies between replaced words and corresponding substitutions for all four automated adversarial attacks and the human attack. All attacks replace words with less frequent substitutions. The notable observations here are the human-authored examples: the log_e frequency differences are lowest for the human-generated substitutions (HUMANADV). The effect size Cohen's d, which expresses the magnitude of the frequency difference, further shows that the high-to-low frequency replacement is much less used by humans (d = 0.2) than by the other, automated attacks (d ≥ 0.5). These findings persist when inspecting either successful or unsuccessful adversarial examples in isolation.
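To illustrate the kind of comparison involved, the following minimal sketch computes log_e frequency differences for replaced/substitution pairs and a paired-differences Cohen's d (mean difference divided by its standard deviation, one common convention; the paper does not spell out its exact formula). The corpus counts are made up; the study computes frequencies over the target model's training corpus.

    import math

    # Toy corpus counts, a hypothetical stand-in for the IMDb training corpus.
    counts = {"annoying": 950, "galling": 12, "great": 4100, "fine": 1800}
    total = 1_000_000

    def log_freq(word):
        # Natural-log relative frequency, with add-one smoothing for unseen words.
        return math.log((counts.get(word, 0) + 1) / total)

    def cohens_d(diffs):
        # Paired-differences Cohen's d: mean of the differences over their SD.
        n = len(diffs)
        mean = sum(diffs) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in diffs) / (n - 1))
        return mean / sd

    # (replaced word, adversarial substitution) pairs, e.g. annoying -> galling.
    pairs = [("annoying", "galling"), ("great", "fine")]
    diffs = [log_freq(old) - log_freq(new) for old, new in pairs]
    print(f"mean diff = {sum(diffs) / len(diffs):.2f}, d = {cohens_d(diffs):.2f}")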
To test for statistical differences between the attacks, we first conduct a 5 (attacks) by 2 (success) ANOVA on the log_e frequency differences between replaced words and substitutions, to determine whether main effects or interaction effects were present. We observe a significant main effect for attack, F(4, 12003) = 152.85, p < .001, but none for success and no interaction between attack and success.⁵
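An analysis of this kind can be run with standard tooling. The sketch below assumes a table with one row per replaced/substitution pair and uses statsmodels on synthetic data; the paper does not name its analysis software, so this is an assumed reproduction of the design rather than the authors' script.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Synthetic stand-in data: one row per (replaced word, substitution) pair.
    rng = np.random.default_rng(0)
    rows = []
    for attack in ["HUMANADV", "TEXTFOOLER", "GENETIC", "BAE", "SEMEMEPSO"]:
        for success in (True, False):
            # 50 fake pairs per cell; the real, unbalanced data has far more.
            rows += [{"attack": attack, "success": success,
                      "freq_diff": rng.normal(0.5 if attack == "HUMANADV" else 2.0, 3.0)}
                     for _ in range(50)]
    df = pd.DataFrame(rows)

    # 5 (attack) x 2 (success) ANOVA on the log_e frequency differences;
    # anova_lm reports main effects for attack and success plus their interaction.
    model = ols("freq_diff ~ C(attack) * C(success)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))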
Overall, the results suggest that humans use a strategy different from automated approaches and find replacements that do not rely on the high-to-low frequency mapping to the same extent as automated attacks. Illustrations of the highest and

⁴ Word frequency is computed with respect to the model's training corpus in these experiments.
⁵ Follow-up experiments revealed significant differences between HUMANADV and all attacks.