Identifying Human Strategies for Generating Word-Level Adversarial Examples

Maximilian Mozes¹, Bennett Kleinberg¹,², Lewis D. Griffin¹
¹University College London
²Tilburg University
{m.mozes, l.griffin}@cs.ucl.ac.uk
bennett.kleinberg@tilburguniversity.edu
Abstract
Adversarial examples in NLP are receiving increasing research attention. One line of investigation is the generation of word-level adversarial examples against fine-tuned Transformer models that preserve naturalness and grammaticality. Previous work found that human- and machine-generated adversarial examples are comparable in their naturalness and grammatical correctness. Most notably, humans were able to generate adversarial examples much more effortlessly than automated attacks. In this paper, we provide a detailed analysis of exactly how humans create these adversarial examples. By exploring the behavioural patterns of human workers during the generation process, we identify statistically significant tendencies based on which words humans prefer to select for adversarial replacement (e.g., word frequencies, word saliencies, sentiment) as well as where and when words are replaced in an input sequence. With our findings, we seek to inspire efforts that harness human strategies for more robust NLP models.
1 Adversarial attacks in NLP
Researchers in natural language processing (NLP) have identified the vulnerability of machine learning models to adversarial attacks: controlled, meaning-preserving input perturbations that cause a wrong model prediction (Jia and Liang, 2017; Iyyer et al., 2018; Ribeiro et al., 2018). Such adversarial examples uncover model failure cases and are a major challenge for trustworthiness and reliability. While several defence methods exist against adversarial attacks (Huang et al., 2019; Jia et al., 2019; Zhou et al., 2019; Jones et al., 2020; Le et al., 2022), developing robust NLP models is an open research challenge. An in-depth analysis of word-level adversarial examples, however, identified a range of problems, showing that they are often ungrammatical or semantically inconsistent (Morris et al., 2020).¹ This finding raised the question of how feasible natural and grammatically correct adversarial examples actually are in NLP.
To answer this question, Mozes et al. (2021a) explored whether humans are able to generate adversarial examples that are valid under such strict requirements. In that study, crowdworkers were tasked with the generation of word-level adversarial examples against a target model. The findings showed that at first sight (without strict validation) humans are less successful than automated attacks. However, when adding constraints on the preservation of sentiment, grammaticality and naturalness, human-authored examples do not differ from automated ones. The most striking finding was that automated attacks required massive computational effort while humans were able to do the same task using only a handful of queries.² This suggests that humans are far more efficient in adversarial attacks than automated systems, yet exactly how they achieve this is unclear.
In this work, we address this question by analysing human behaviour through the public dataset from Mozes et al. (2021a). We look at which words humans perturbed, where within a sentence those perturbations were located, and whether they mainly focused on perturbing sentiment-loaded words. We find that (i) in contrast to automated attacks, humans use more frequent adversarial word substitutions, (ii) the semantic similarity between replaced words and adversarial substitutions is greater for humans than for most attacks, and (iii) humans replace sentiment-loaded words more often than algorithmic attackers. Our goal is to understand what makes humans so efficient at this task, and whether these strategies could be harnessed for more adversarially robust NLP models.
¹ For example, replacing the word summer with winter.
² For example, SEMEMEPSO (Zang et al., 2020) needs 140,000 queries per example, on average, to generate successful adversarial examples on IMDb (Maas et al., 2011), whereas humans need 10.9 queries (Mozes et al., 2021a).
Attack       |      All       |   Successful   |  Unsuccessful
             |  M   SD    d   |  M   SD    d   |  M   SD    d
HUMANADV     | 0.6  3.1  0.2  | 0.5  3.0  0.1  | 0.6  3.1  0.2
TEXTFOOLER   | 2.5  2.6  0.8  | 2.5  2.6  0.8  | 2.5  2.6  0.8
GENETIC      | 1.5  2.1  0.5  | 1.4  2.0  0.5  | 1.5  2.1  0.5
BAE          | 2.0  4.0  0.5  | 1.9  4.1  0.5  | 2.0  4.0  0.5
SEMEMEPSO    | 2.4  2.8  0.8  | 2.4  2.8  0.8  |  --   --   --

Table 1: Word frequency differences between replaced words and adversarial substitutions. M and SD represent the mean and standard deviation of the differences between replaced words and substitutions (i.e., positive values: replaced words > substitutions); d denotes the Cohen's d effect size. Note that for SEMEMEPSO, all adversarial examples are successful.
2 Data and Models
We present a fine-grained analysis of the strategies that human crowdworkers employed to generate word-level adversarial examples against sentiment classification models. In the dataset from Mozes et al. (2021a), 43 participants were recruited via Amazon Mechanical Turk and trained to perform a word-level adversarial attack on test set sequences from the IMDb movie reviews dataset (Maas et al., 2011). In total, 170 adversarial examples were collected. For each of the collected adversarial examples, the authors also generated automated adversarial examples using the TEXTFOOLER (Jin et al., 2019), BAE (Garg and Ramakrishnan, 2020), GENETIC (Alzantot et al., 2018) and SEMEMEPSO (Zang et al., 2020) attacks.
The TEXTFOOLER attack uses a greedy word-replacement algorithm that is guided by word saliencies and semantic similarity measures between an unperturbed sequence and the adversarial candidate. The BAE attack resorts to a different technique, utilising a BERT-based language model to remove and replace tokens in an input sequence. The GENETIC attack, in contrast, is a population-based method using genetic algorithms. Finally, the SEMEMEPSO attack is based on replacements of word sememes instead of entire words and combines this with a particle swarm optimisation approach.
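To make the attack setup concrete, the sketch below shows how such an automated word-level attack can be run with the TextAttack library. This is an illustrative assumption: the paper names only the attack algorithms, so the library, the checkpoint identifier and the recipe shown here are stand-ins rather than the authors' actual pipeline.

    import transformers
    import textattack
    # Other recipes in textattack.attack_recipes cover the remaining attacks:
    # BAEGarg2019, GeneticAlgorithmAlzantot2018 (GENETIC), PSOZang2020 (SEMEMEPSO).
    from textattack.attack_recipes import TextFoolerJin2019

    # Hypothetical checkpoint standing in for the fine-tuned RoBERTa sentiment model.
    model_name = "textattack/roberta-base-imdb"
    model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

    # TEXTFOOLER: greedy word substitution guided by word saliency and
    # embedding similarity between the original and the perturbed sequence.
    attack = TextFoolerJin2019.build(wrapper)
    dataset = textattack.datasets.HuggingFaceDataset("imdb", split="test")
    attacker = textattack.Attacker(attack, dataset, textattack.AttackArgs(num_examples=5))
    attacker.attack_dataset()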
All attacks were performed against a RoBERTa model (Liu et al., 2019) fine-tuned on IMDb.³ Here, we only consider adversarial examples that preserved sentiment after evaluation by an independent set of crowdworkers, which Mozes et al. (2021a) used as a key validity criterion.

³ For more model details, see Section 3 in Mozes et al. (2021a).
3 Analysis
In this section, we report on a series of experiments analysing the human- and machine-authored adversarial examples.
3.1 What do humans replace?
Word frequency. We investigate the word frequency of the adversarial examples. Existing work (Mozes et al., 2021b; Hauser et al., 2021) identified significant differences in word frequency between adversarially perturbed words (hereafter referred to as replaced words) and their substitutions (hereafter referred to as adversarial substitutions) for a number of attacks. The substituted words were considerably less frequent than their original counterparts (e.g., annoying → galling).⁴
Here, we examine whether this pattern is also evident in humans' strategies. Table 1 shows the differences of the log_e word frequencies between replaced words and corresponding substitutions for all four automated adversarial attacks and the human attack. All attacks replace words with less frequent substitutions. The notable observations here are the human-authored examples: the log_e frequency differences are lowest for the human-generated substitutions (HUMANADV). The effect size Cohen's d, which expresses the magnitude of the frequency difference, further shows that the high-to-low frequency replacement is much less used by humans (d = 0.2) than by the other, automated attacks (d ≥ 0.5). These findings persist when inspecting either successful or unsuccessful adversarial examples in isolation.
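To illustrate the kind of comparison involved, the following minimal sketch computes log_e frequency differences for replaced/substitution pairs and a paired-differences Cohen's d (mean difference divided by its standard deviation, one common convention; the paper does not spell out its exact formula). The corpus counts are made up; the study computes frequencies over the target model's training corpus.

    import math

    # Toy corpus counts, a hypothetical stand-in for the IMDb training corpus.
    counts = {"annoying": 950, "galling": 12, "great": 4100, "fine": 1800}
    total = 1_000_000

    def log_freq(word):
        # Natural-log relative frequency, with add-one smoothing for unseen words.
        return math.log((counts.get(word, 0) + 1) / total)

    def cohens_d(diffs):
        # Paired-differences Cohen's d: mean of the differences over their SD.
        n = len(diffs)
        mean = sum(diffs) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in diffs) / (n - 1))
        return mean / sd

    # (replaced word, adversarial substitution) pairs, e.g. annoying -> galling.
    pairs = [("annoying", "galling"), ("great", "fine")]
    diffs = [log_freq(old) - log_freq(new) for old, new in pairs]
    print(f"mean diff = {sum(diffs) / len(diffs):.2f}, d = {cohens_d(diffs):.2f}")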
To test for statistical differences between the attacks, we first conduct a 5 (attacks) by 2 (success) ANOVA on the log_e frequency differences between replaced words and substitutions, to determine whether main effects or interaction effects were present. We observe a significant main effect for attack, F(4, 12003) = 152.85, p < .001, but none for success and no interaction between attack and success.⁵
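An analysis of this kind can be run with standard tooling. The sketch below assumes a table with one row per replaced/substitution pair and uses statsmodels on synthetic data; the paper does not name its analysis software, so this is an assumed reproduction of the design rather than the authors' script.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Synthetic stand-in data: one row per (replaced word, substitution) pair.
    rng = np.random.default_rng(0)
    rows = []
    for attack in ["HUMANADV", "TEXTFOOLER", "GENETIC", "BAE", "SEMEMEPSO"]:
        for success in (True, False):
            # 50 fake pairs per cell; the real, unbalanced data has far more.
            rows += [{"attack": attack, "success": success,
                      "freq_diff": rng.normal(0.5 if attack == "HUMANADV" else 2.0, 3.0)}
                     for _ in range(50)]
    df = pd.DataFrame(rows)

    # 5 (attack) x 2 (success) ANOVA on the log_e frequency differences;
    # anova_lm reports main effects for attack and success plus their interaction.
    model = ols("freq_diff ~ C(attack) * C(success)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))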
Overall, the results suggest that humans use a strategy different from automated approaches and find replacements that do not rely on the high-to-low frequency mapping to the same extent as automated attacks. Illustrations of the highest and

⁴ Word frequency is computed with respect to the model's training corpus in these experiments.
⁵ Follow-up experiments revealed significant differences between HUMANADV and all attacks.