Character-level White-Box Adversarial Attacks against Transformers via Attachable Subwords Substitution

Aiwei Liu, Honghai Yu, Xuming Hu, Shu’ang Li, Li Lin, Fukun Ma,
Yawen Yang, Lijie Wen
Tsinghua University
{liuaw20, yhh21, hxm19, lisa18, lin-l16, mafk19, yyw19}@mails.tsinghua.edu.cn
wenlj@tsinghua.edu.cn
Abstract
We propose the first character-level white-box adversarial attack method against transformer models. The intuition behind our method comes from the observation that words are split into subtokens before being fed into transformer models, and that substituting one subtoken with a close one has a similar effect to a character modification. Our method contains three main steps. First, a gradient-based method is adopted to find the most vulnerable words in the sentence. Then the selected words are split into subtokens, replacing the original tokenization produced by the transformer tokenizer. Finally, we use an adversarial loss to guide the substitution of attachable subtokens, introducing the Gumbel-Softmax trick to allow gradient propagation. We also impose visual and length constraints during optimization to keep character modifications minimal. Extensive experiments on both sentence-level and token-level tasks demonstrate that our method outperforms previous attack methods in terms of success rate and edit distance. Furthermore, human evaluation verifies that our adversarial examples preserve their original labels.
1 Introduction
Adversarial examples are modified inputs that fool machine learning models but not humans. Recently, Transformer-based (Vaswani et al., 2017) models such as BERT (Devlin et al., 2019) have achieved dominant performance on a wide range of natural language processing (NLP) tasks. Unfortunately, many works have shown that transformer-based models are vulnerable to adversarial attacks (Guo et al., 2021; Garg and Ramakrishnan, 2020). On the other hand, adversarial attacks can help improve the robustness of models through adversarial training, which emphasizes the importance of finding high-quality adversarial examples.
[Figure 1: Subtoken substitution can achieve the same result as all four character modification operations: substituting an attachable subtoken of "atlanta" (at / #lan / #ta) reproduces character substitution (atlauta), insertion (atlaneta), deletion (atlnta), and swap (atnalta). Character gradients are unavailable to a token-level transformer model, whereas subtoken substitution (e.g. bo / #st / #on for boston) operates directly on the model's input.]
Recently, efficient and effective attack methods have been proposed at the token level (e.g. synonym substitution) (Guo et al., 2021) and the sentence level (e.g. paraphrasing input texts) (Wang et al., 2020). However, this is not the case for character-level attacks (e.g. mistyping words), which barely hinder human understanding and are thus a natural attack scenario. Most previous methods (Gao et al., 2018; Eger and Benz, 2020) perform the character-level attack in a black-box manner, which requires hundreds of attempts and yields an unsatisfactory success rate. White-box attack methods are natural solutions to these drawbacks, but current character-level white-box attack methods (Ebrahimi et al., 2018b,a) only work for models that take characters as input and thus fail on token-level transformer models.
Achieving a character-level white-box attack via direct single-character modification is impossible for transformer models, because character gradients are unavailable. We instead implement the character-level attack via subtoken substitution, based on the following two observations. (1) Nearly all transformer-based pre-training models adopt a subword tokenizer (Sennrich et al., 2016), in which each word is split into one start subtoken and several subtokens attached to it (attachable subtokens). (2) As shown in Figure 1, all character modifications (e.g. swap and insertion) can be achieved by subtoken substitution.
Based on the above observations, we propose CWBA, the first Character-level White-Box Attack method against transformer models via attachable subword substitution. Our method contains three main steps: target word selection, adversarial tokenization, and subtoken search.
Since CWBA requires specific words as input, we first need to find the most vulnerable words. Our model ranks the words according to the gradient of our adversarial goal with respect to each word. During the adversarial tokenization process, the top-ranked words are then split into at least three subtokens, including a start subtoken and several attachable subtokens. CWBA aims to replace these attachable subtokens to achieve the character-level attack.
Because the discrete nature of natural language prohibits gradient optimization over subtokens, we leverage the Gumbel-Softmax trick (Jang et al., 2017) to sample a continuous distribution over tokens and thus allow gradient propagation. The attachable subtokens are then optimized by gradient descent to generate the adversarial example. Meanwhile, to minimize the degree of modification, we introduce visual and length constraints during optimization so that the replaced subtokens remain similar in appearance and length.
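To make the relaxation concrete, the following is a minimal sketch (not the paper's implementation) of how the Gumbel-Softmax trick yields a differentiable, soft selection over candidate subtokens; the vocabulary size, temperature, and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: relax a discrete choice over the subtoken vocabulary so that
# gradients can flow back into the selection logits (sizes/values are illustrative).
vocab_size, hidden_size = 30522, 768                      # e.g. BERT-base figures
logits = torch.zeros(vocab_size, requires_grad=True)      # learnable selection scores

# Soft "one-hot" sample over the vocabulary; tau controls how close it is to discrete.
soft_one_hot = F.gumbel_softmax(logits, tau=0.5, hard=False)   # shape: (vocab_size,)

# Instead of a hard embedding lookup, mix the embedding matrix with the soft sample,
# keeping the input to the transformer differentiable w.r.t. the logits.
embedding_matrix = torch.randn(vocab_size, hidden_size)   # stand-in for the model's embeddings
soft_embedding = soft_one_hot @ embedding_matrix          # shape: (hidden_size,)

# soft_embedding would replace one attachable subtoken's embedding in the input;
# the adversarial loss can then back-propagate into `logits` for gradient descent.
```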
CWBA outperforms previous attack methods on both sentence-level (e.g. sentence classification) and token-level (e.g. named entity recognition) tasks in terms of success rate and edit distance. It is worth mentioning that CWBA is the first white-box attack method applied to token-level tasks. We further demonstrate the effectiveness of CWBA against various transformer-based models. Human evaluation verifies that our adversarial attack method is label-preserving. Finally, an adversarial training experiment shows that training with our adversarial examples increases the robustness of models.
To summarize, the main contributions of our paper are as follows:
• To the best of our knowledge, CWBA is the first character-level white-box attack method against transformer models.
• CWBA is also the first white-box attack method applied to token-level tasks.
• We propose a visual constraint to make the replaced subtoken similar to the original one.
• CWBA outperforms previous attack methods on both sentence-level and token-level tasks. Code and data are available at https://github.com/THU-BPM/CWBA.
2 Related Work
2.1 White-box attack methods in NLP
White-box attack methods can find the defects of a model with a low number of queries and a high success rate, and have been successfully applied to image and speech data (Madry et al., 2018; Carlini and Wagner, 2018). However, applying white-box attack methods to natural language is more challenging due to the discrete nature of text. To search the text space under the guidance of gradients and achieve a high success rate, Cheng et al. (2019b,a) optimize in the embedding space and search for the nearest word, which suffers from high bias. To reduce this bias, Cheng et al. (2020) and Sato et al. (2018) restrict the optimization direction towards existing word embeddings. However, the optimization process of these methods is unstable due to the sparsity of the word embedding space. Other methods directly optimize the text with gradient estimation techniques such as Gumbel-Softmax sampling (Xu et al., 2021; Guo et al., 2021), reinforcement learning (Zou et al., 2020), and Metropolis-Hastings sampling (Zhang et al., 2019). Our CWBA adopts the Gumbel-Softmax technique on subtokens to achieve the character-level white-box attack.
2.2 Attack methods against Transformers
Transformer-based (Vaswani et al., 2017) pre-training models (Devlin et al., 2019; Liu et al., 2019) have shown great advantages on various NLP tasks. However, recent works reveal that these pre-training models are vulnerable to adversarial attacks in many scenarios, such as sentence classification (Li et al., 2020), machine translation (Cheng et al., 2019b), text entailment (Xu et al., 2020), and part-of-speech tagging (Eger and Benz, 2020). Most of these methods attack in a black-box manner, implemented by character modification (Eger and Benz, 2020), token substitution (Li et al., 2020), or sentence paraphrasing (Xu et al., 2020). However, these black-box attack methods usually require hundreds of queries to the target model, and their success rate cannot be guaranteed. To alleviate these problems, some white-box attack methods have been proposed, including token-level (Guo et al., 2021) and sentence-level (Wang et al., 2020) methods. Different from these, our CWBA is the first character-level white-box attack method for transformer-based models.
3 Methods
In this section, we detail our proposed framework CWBA for the character-level white-box attack. We first formulate the attack problem, and then describe the three key components in detail: target word selection, adversarial tokenization, and subtoken search.
3.1 Attack Problem Formulation
We formulate adversarial examples as follows. Given an input sentence $x = (x_1, x_2, \ldots, x_n)$ of length $n$, suppose the classification model $H$ predicts the correct sentence or token label $y$, i.e. $H(x) = y$. An adversarial example is a sample $x'$ that is close to $x$ but causes a different model prediction, i.e. $H(x') \neq y$.
The process of finding adversarial examples is modeled as a gradient optimization problem. Specifically, given the classification logit vector $p \in \mathbb{R}^K$ produced by the model $H$ over $K$ classes, the adversarial loss is defined as the margin loss:

$$\ell_{adv}(x, y) = \max\Big(p_y - \max_{k \neq y} p_k + \kappa,\; 0\Big), \tag{1}$$

which motivates the model to misclassify $x$ by a margin $\kappa > 0$. The effectiveness of the margin loss has been validated in many attack algorithms (Guo et al., 2021; Carlini and Wagner, 2018).
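As a concrete reference, a minimal PyTorch sketch of this margin loss is given below; the function name and the default value of $\kappa$ are illustrative, not taken from the paper.

```python
import torch

def margin_adv_loss(logits: torch.Tensor, y: int, kappa: float = 1.0) -> torch.Tensor:
    """Margin loss of Eq. (1): stays positive while the true class y still
    beats every other class by at least kappa, and reaches zero once the
    sample is misclassified by that margin. `kappa` here is illustrative."""
    p_y = logits[y]
    p_other = torch.max(torch.cat([logits[:y], logits[y + 1:]]))  # max_{k != y} p_k
    return torch.clamp(p_y - p_other + kappa, min=0.0)
```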
Given the adversarial loss $\ell_{adv}$, the goal of our attack algorithm is modeled as a constrained optimization problem:

$$\min_{x'} \; \ell_{adv}(x', y) \quad \text{subject to} \quad \rho(x, x'), \tag{2}$$

where $\rho$ is the function measuring the similarity between the original and adversarial examples. In our work, this similarity is measured with the edit distance metric (Li and Liu, 2007).
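For reference, the edit distance used as $\rho$ can be computed with the standard Levenshtein dynamic program; the snippet below is a generic implementation, not the paper's code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = curr
    return prev[-1]

# e.g. edit_distance("atlanta", "atlauta") == 1
```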
3.2 Target Word Selection
Since our attack method takes specific words as targets and pre-processes them, we need to identify the words most critical to the target task prediction. To find the most vulnerable words, we sort the words by the $\ell_2$ norm of their gradient with respect to the adversarial loss in Eq. 1:

$$\hat{x} = \underset{x}{\mathrm{argsort}}\big(\|\nabla_{x_1}\ell_{adv}\|_2, \ldots, \|\nabla_{x_n}\ell_{adv}\|_2\big), \tag{3}$$

where $\nabla_{x_j}\ell_{adv}$ is the gradient of the $j$-th word. Note that a word $x_j$ may be tokenized into several subtokens $[t_{j0}, \ldots, t_{jn}]$, and its gradient norm is defined as the average over these subtokens:

$$\|\nabla_{x_j}\ell\|_2 = \mathrm{avg}\big(\|\nabla_{t_{j0}}\ell\|_2, \ldots, \|\nabla_{t_{jn}}\ell\|_2\big), \tag{4}$$

where the loss $\ell$ is the adversarial loss $\ell_{adv}$ in our work. CWBA takes the first $N$ words from the sorted list $\hat{x}$ as targets, where $N$ is a task-related hyperparameter.
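A minimal sketch of this ranking step is shown below, assuming per-subtoken gradients of the adversarial loss are already available; the function and variable names are illustrative.

```python
import torch

def rank_words_by_saliency(token_grads: torch.Tensor, word_to_subtokens: list) -> list:
    """token_grads: (num_subtokens, hidden) gradients of the adversarial loss
    w.r.t. each subtoken embedding; word_to_subtokens[j] lists the subtoken
    indices belonging to word j. Returns word indices, most vulnerable first."""
    norms = token_grads.norm(p=2, dim=-1)                  # ||grad_t l_adv||_2 per subtoken
    word_scores = torch.stack([norms[idx].mean()           # Eq. (4): average over subtokens
                               for idx in word_to_subtokens])
    order = torch.argsort(word_scores, descending=True)    # Eq. (3): sort by gradient norm
    return order.tolist()

# The attack then takes the top-N words of this ranking as targets.
```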
3.3 Adversarial Tokenization
The selected words need to be split into subtokens before performing the character-level attack. We observe that the transformer tokenizer has two properties: (1) correctly spelled words are usually not split, or are split into only a few subtokens; (2) misspelled words are tokenized into more subtokens than correctly spelled words. For example, the word boston is not segmented, but after a single character modification, bosfon is tokenized into three subtokens bo, #sf, and #on. To keep the tokenization consistent during the attack, we propose an adversarial tokenizer that tokenizes correctly spelled words into more subtokens than the transformer tokenizer does.
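These two properties are easy to check with a standard HuggingFace tokenizer, as in the snippet below; the exact subtoken splits depend on the vocabulary of the chosen model.

```python
from transformers import AutoTokenizer

# Illustrative check of the two tokenizer properties; the exact splits
# depend on the model's WordPiece vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("boston"))   # correctly spelled: typically kept as a single token
print(tokenizer.tokenize("bosfon"))   # one character changed: typically split into several subtokens
```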
To further improve tokenization consistency during the attack, our main principle is to make the subtokens as long as possible, since longer subtokens are more difficult to combine with characters to form new subtokens.² Specifically, our tokenization contains the following steps:
1. Find the longest subword in the first half of the word to form the longest start subtoken.
2. Find the longest subword in the second half of the word to form the longest end subtoken.
3. Tokenize the rest of the word with the transformer tokenizer to generate the middle subtokens.
After these steps, we obtain the longest start and end subtokens, and our algorithm substitutes only the middle subtokens, which keeps the maximum consistency of tokenization during the attack.
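A rough sketch of this three-step procedure, under the assumption of a WordPiece-style vocabulary with "##"-prefixed attachable subtokens, might look as follows; the helper is hypothetical and omits details such as guaranteeing at least three subtokens.

```python
def adversarial_tokenize(word: str, tokenizer) -> list:
    """Hypothetical sketch of the three-step adversarial tokenization for a
    lower-cased word, assuming a WordPiece-style vocabulary where attachable
    subtokens carry the "##" prefix (details differ from the paper's tokenizer)."""
    vocab = tokenizer.get_vocab()      # token -> id mapping
    half = len(word) // 2

    # Step 1: longest vocabulary prefix that ends within the first half of the word.
    start = next(word[:i] for i in range(half, 0, -1) if word[:i] in vocab)

    # Step 2: longest attachable vocabulary suffix that starts within the second half.
    end = next(word[i:] for i in range(half, len(word)) if "##" + word[i:] in vocab)

    # Step 3: let the transformer tokenizer split the remaining middle part,
    # forcing each middle piece into attachable ("##"-prefixed) form.
    middle = word[len(start):len(word) - len(end)]
    middle_pieces = ["##" + p.lstrip("#") for p in tokenizer.tokenize(middle)] if middle else []

    return [start] + middle_pieces + ["##" + end]
```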
² More details and statistics are provided in the appendix.