
Identifying Human Strategies for Generating Word-Level Adversarial Examples
Maximilian Mozes1, Bennett Kleinberg1,2, Lewis D. Griffin1
1University College London
2Tilburg University
{m.mozes, l.griffin}@cs.ucl.ac.uk
bennett.kleinberg@tilburguniversity.edu
Abstract
Adversarial examples in NLP are receiving increasing research attention. One line of investigation is the generation of word-level adversarial examples against fine-tuned Transformer models that preserve naturalness and grammaticality. Previous work found that human- and machine-generated adversarial examples are comparable in their naturalness and grammatical correctness. Most notably, humans were able to generate adversarial examples with far less effort than automated attacks require. In this paper, we provide a detailed analysis of exactly how humans create these adversarial examples. By exploring the behavioural patterns of human workers during the generation process, we identify statistically significant tendencies regarding which words humans prefer to select for adversarial replacement (e.g., word frequencies, word saliencies, sentiment), as well as where and when words are replaced in an input sequence. With our findings, we seek to inspire efforts that harness human strategies for more robust NLP models.
1 Adversarial attacks in NLP
Researchers in natural language processing (NLP) have identified the vulnerability of machine learning models to adversarial attacks: controlled, meaning-preserving input perturbations that cause a wrong model prediction (Jia and Liang, 2017; Iyyer et al., 2018; Ribeiro et al., 2018). Such adversarial examples uncover model failure cases and are a major challenge for trustworthiness and reliability. While several defence methods exist against adversarial attacks (Huang et al., 2019; Jia et al., 2019; Zhou et al., 2019; Jones et al., 2020; Le et al., 2022), developing robust NLP models is an open research challenge. An in-depth analysis of word-level adversarial examples, however, identified a range of problems, showing that they are often ungrammatical or semantically inconsistent (Morris et al., 2020).1 This finding raised the question of how feasible natural and grammatically correct adversarial examples actually are in NLP.
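To make the attack setting concrete, the sketch below (ours, not the setup of any of the cited studies) runs a single word-level substitution against an off-the-shelf sentiment classifier. The sentence, the chosen substitution, and the default model loaded by the Hugging Face pipeline are illustrative assumptions; whether a given substitution actually flips the prediction depends on the target model. The example only shows what one query to the target model means.

```python
# Illustrative sketch of a word-level adversarial query (not the attack
# procedure of the cited papers). Sentence, substitution, and the default
# sentiment model are assumptions for illustration only.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default fine-tuned model

original = "I spent a wonderful summer watching these films."
perturbed = original.replace("wonderful", "passable")  # single word-level substitution

n_queries = 0
for text in (original, perturbed):
    n_queries += 1                      # every call to the target model counts as one query
    pred = classifier(text)[0]
    print(f"{pred['label']:>8} ({pred['score']:.2f})  {text}")

print(f"target-model queries used: {n_queries}")
```

Automated attacks typically repeat this query step many times per input while searching over candidate substitutions; the query counts discussed below refer to exactly such calls to the target model.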
To answer this question, Mozes et al. (2021a) explored whether humans are able to generate adversarial examples that are valid under such strict requirements. In that study, crowdworkers were tasked with the generation of word-level adversarial examples against a target model. The findings showed that at first sight, without strict validation, humans are less successful than automated attacks. However, when adding constraints on the preservation of sentiment, grammaticality and naturalness, human-authored examples do not differ from automated ones. The most striking finding was that automated attacks required massive computational effort while humans were able to do the same task using only a handful of queries.2 This suggests that humans are far more efficient at adversarial attacks than automated systems, yet exactly how they achieve this is unclear.
In this work, we address this question by analysing human behaviour through the public dataset from Mozes et al. (2021a). We look at which words humans perturbed, where within a sentence those perturbations were located, and whether they mainly focused on perturbing sentiment-loaded words. We find that (i) in contrast to automated attacks, humans use more frequent adversarial word substitutions, (ii) the semantic similarity between replaced words and adversarial substitutions is greater for humans than for most attacks, and (iii) humans replace sentiment-loaded words more often than algorithmic attackers. Our goal is to understand what makes humans so efficient at this task, and whether these strategies could be harnessed for more adversarially robust NLP models.
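As a rough illustration of the dimensions examined in this analysis, the sketch below computes corpus frequency, embedding similarity, and sentiment polarity for a single (replaced word, substitute word) pair. The specific resources used here (the wordfreq package, spaCy's en_core_web_md vectors, NLTK's VADER lexicon) are illustrative stand-ins, not necessarily the ones behind the reported results.

```python
# Rough sketch of per-substitution properties for one (replaced, substitute)
# pair. Library and lexicon choices are assumptions made for illustration.
import nltk
import spacy
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordfreq import zipf_frequency

nltk.download("vader_lexicon", quiet=True)
nlp = spacy.load("en_core_web_md")        # medium English model with word vectors
vader = SentimentIntensityAnalyzer()

def substitution_profile(replaced: str, substitute: str) -> dict:
    """Summarise one word-level substitution along frequency, similarity, sentiment."""
    return {
        # Zipf-scaled corpus frequencies: are substitutes more frequent words?
        "freq_replaced": zipf_frequency(replaced, "en"),
        "freq_substitute": zipf_frequency(substitute, "en"),
        # Cosine similarity of word vectors: how semantically close is the swap?
        "embedding_similarity": float(nlp(replaced)[0].similarity(nlp(substitute)[0])),
        # VADER compound polarity: is the replaced word sentiment-loaded?
        "sentiment_replaced": vader.polarity_scores(replaced)["compound"],
        "sentiment_substitute": vader.polarity_scores(substitute)["compound"],
    }

print(substitution_profile("wonderful", "passable"))
```

Aggregating such profiles over all human- and machine-generated substitutions in the dataset is what enables comparisons of the kind summarised in findings (i) to (iii) above.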
1 For example, replacing the word summer with winter.
2 For example, 140,000 queries are needed per example for SEMEMEPSO (Zang et al., 2020), on average, to generate successful adversarial examples on IMDb (Maas et al., 2011), whereas humans need 10.9 queries (Mozes et al., 2021a).