ADDMU: Detection of Far-Boundary Adversarial Examples with Data
and Model Uncertainty Estimation
Fan Yin
University of California, Los Angeles
fanyin20@cs.ucla.edu
Yao Li
University of North Carolina, Chapel Hill
yaoli@email.unc.edu
Cho-Jui Hsieh
University of California, Los Angeles
chohsieh@cs.ucla.edu
Kai-Wei Chang
University of California, Los Angeles
kwchang@cs.ucla.edu
Abstract

Adversarial Examples Detection (AED) is a crucial defense technique against adversarial attacks and has drawn increasing attention from the Natural Language Processing (NLP) community. Despite the surge of new AED methods, our studies show that existing methods heavily rely on a shortcut to achieve good performance. In other words, current search-based adversarial attacks in NLP stop once model predictions change, and thus most adversarial examples generated by those attacks are located near model decision boundaries. To surpass this shortcut and fairly evaluate AED methods, we propose to test AED methods with Far-Boundary (FB) adversarial examples. Existing methods show worse-than-random-guess performance under this scenario. To overcome this limitation, we propose a new technique, ADDMU, adversary detection with data and model uncertainty, which combines two types of uncertainty estimation for both regular and FB adversarial example detection. Our new method outperforms previous methods by 3.6 and 6.0 AUC points under each scenario. Finally, our analysis shows that the two types of uncertainty provided by ADDMU can be leveraged to characterize adversarial examples and identify the ones that contribute most to the model's robustness in adversarial training.
1 Introduction
Deep neural networks (DNNs) have achieved remarkable performance in a wide variety of NLP tasks. However, it has been shown that DNNs can be vulnerable to adversarial examples (Jia and Liang, 2017; Alzantot et al., 2018; Jin et al., 2020), i.e., perturbed examples that flip model predictions but remain imperceptible to humans, and thus impose serious security concerns about NLP models. To improve the robustness of NLP models, different kinds of techniques to defend against adversarial examples have been proposed (Li et al., 2021b). In this paper, we study AED, which aims to add a detection module that identifies and rejects malicious inputs based on certain characteristics. Different from adversarial training methods (Madry et al., 2018a; Jia et al., 2019), which require re-training the model with additional data or regularization, AED operates at test time and can be directly integrated with any existing model.
Despite being well explored in the vision domain (Feinman et al., 2017; Raghuram et al., 2021), AED has started to receive attention in NLP only recently. Many works conduct detection based on certain statistics (Zhou et al., 2019; Mozes et al., 2021; Yoo et al., 2022; Xie et al., 2022). Specifically, Yoo et al. (2022) propose a benchmark for AED methods and a competitive baseline based on robust density estimation. However, by studying examples in the benchmark, we find that the success of some AED methods relies heavily on a shortcut left by adversarial attacks: most adversarial examples are located near model decision boundaries, i.e., they have a small probability discrepancy between the predicted class and the second largest class. This is because, when creating adversarial data, the search process stops once model predictions change. We illustrate this finding in Section 2.2.
To evaluate detection methods accurately, we propose to test AED methods on both regular adversarial examples and Far-Boundary (FB)¹ adversarial examples, which are created by continuing to search for better adversarial examples until a threshold of probability discrepancy is met. Results show that existing AED methods perform worse than random guess on FB adversarial examples. Yoo et al. (2022) recognize this limitation, but we find that the phenomenon is more severe than what is reported in their work. Thus, an AED method that works for FB attacks is needed.

¹ Other works may call this 'High-Confidence'. We use the term 'Far-Boundary' to avoid conflicts between 'confidence' and the term 'uncertainty' introduced later.
We propose ADDMU, an uncertainty-estimation-based AED method. The key intuition is that adversarial examples lie off the manifold of the training data, and models are typically uncertain about their predictions on them. Thus, although the prediction probability is no longer a good uncertainty measurement when adversarial examples are far from the model decision boundary, other statistical clues still reveal the 'uncertainty' in predictions and can identify adversarial data. In this paper, we introduce two of them: data uncertainty and model uncertainty. Data uncertainty is defined as the uncertainty of model predictions over neighbors of the input. Model uncertainty is defined as the prediction variance on the original input when applying Monte Carlo Dropout (MCD) (Gal and Ghahramani, 2016) to the target model at inference time. Previous work has shown that models trained with dropout regularization (Srivastava et al., 2014) approximate the inference in Bayesian neural networks with MCD, where model uncertainty is easy to obtain (Gal and Ghahramani, 2016; Smith and Gal, 2018). Given the two uncertainty statistics, we apply p-value normalization (Raghuram et al., 2021) and combine them with Fisher's method (Fisher, 1992) to produce a stronger test statistic for AED. To the best of our knowledge, we are the first to apply uncertainty estimation of Transformer-based models (Shelmanov et al., 2021) to AED.
The advantages of our proposed AED method include: 1) it only operates on the output level of the model; 2) it requires little to no modification to adapt to different architectures; 3) it provides a unified way to combine different types of uncertainty. Experimental results on four datasets, four attacks, and two models demonstrate that our method outperforms existing methods by 3.6 and 6.0 AUC points on regular and FB cases, respectively. We also show that the two uncertainty statistics can be used to characterize adversarial data and to select useful data for another defense technique, adversarial data augmentation (ADA).

The code for this paper can be found at https://github.com/uclanlp/AdvExDetection-ADDMU.
2 A Diagnostic Study on AED Methods
In this section, we first describe the formulation of adversarial examples and AED. We then show that current AED methods perform well mainly at detecting adversarial examples near the decision boundary, but are confused by FB adversarial examples.
2.1 Formulation

Adversarial Examples. Given an NLP model f: X → Y, a textual input x ∈ X, a predicted class from the candidate classes y ∈ Y, and a set of Boolean indicator functions of constraints C_i: X × X → {0, 1}, i = 1, 2, ..., n, an (untargeted) adversarial example x' ∈ X satisfies

$$f(x') \neq f(x), \qquad C_i(x, x') = 1, \; i = 1, 2, \cdots, n.$$

Constraints are typically grammatical or semantic similarities between the original and adversarial data. For example, Jin et al. (2020) conduct part-of-speech checks and use the Universal Sentence Encoder (Cer et al., 2018) to ensure semantic similarity between the two sentences.
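As a concrete reading of this definition, the sketch below (a minimal illustration, not code from the paper) checks the untargeted condition together with a set of constraint callables; the classifier and constraint functions are hypothetical placeholders.

```python
from typing import Callable, List

# Hypothetical interfaces: a classifier mapping text to a label id, and
# Boolean constraints C_i(x, x') such as grammaticality or USE similarity.
Classifier = Callable[[str], int]
Constraint = Callable[[str, str], bool]

def is_untargeted_adversarial(f: Classifier, x: str, x_adv: str,
                              constraints: List[Constraint]) -> bool:
    """x_adv is adversarial w.r.t. x iff it flips the model prediction
    while satisfying every constraint C_i(x, x_adv) = 1."""
    flips_prediction = f(x_adv) != f(x)
    satisfies_constraints = all(c(x, x_adv) for c in constraints)
    return flips_prediction and satisfies_constraints
```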
Adversarial Examples Detection (AED). The task of AED is to distinguish adversarial examples from natural ones, based on certain characteristics of adversarial data. We assume access to 1) the victim model f, trained and tested on clean datasets D_train and D_test; 2) an evaluation set D_eval; and 3) an auxiliary dataset D_aux containing only clean data. D_eval contains an equal number of adversarial examples D_eval^adv and natural examples D_eval^nat. D_eval^nat is randomly sampled from D_test. D_eval^adv is generated by attacking a set of samples from D_test disjoint from D_eval^nat. See Scenario 1 in Yoo et al. (2022) for details. We use a subset of D_train as D_aux. We adopt an unsupervised setting, i.e., the AED method is not trained on any dataset that contains adversarial examples.
2.2 Diagnose AED Methods
We define examples near model decision boundaries to be those whose output probabilities for the predicted class and the second largest class are close. Regular iterative adversarial attacks stop once the predictions are changed. Therefore, we suspect that regular attacks mostly generate adversarial examples near the boundaries, and that existing AED methods could rely on this property to detect adversarial examples.
Figure 1 verifies this for the state-of-the-art unsupervised AED method in NLP (Yoo et al., 2022), denoted as RDE. Similar trends are observed for another baseline. The X-axis shows two attack methods: TextFooler (Jin et al., 2020) and Pruthi (Pruthi et al., 2019). The Y-axis represents the probability difference between the predicted class and the second largest class. Average probability differences are shown for natural examples (Natural) and for three groups of adversarial examples: those RDE fails to identify (Failed), those it successfully detects (Detected), and all of them (Overall). There is a clear trend: successfully detected adversarial examples are those with small probability differences, while the ones with high probability differences are often misclassified as natural examples. This finding shows that these AED methods identify examples near the decision boundaries, instead of adversarial examples.

Figure 1: The probability difference between the predicted class and the second largest class on natural examples and on adversarial examples that the detector failed on, detected, and overall. The X-axis is the attack; the Y-axis is the difference. Correctly detected adversarial examples have relatively small probability differences.

                   RDE                       DIST
Data-attack    Regular      FB           Regular      FB
SST2-TF        72.8/86.5    45.0/81.5    73.4/87.9    26.3/81.6
SST2-Pruthi    55.1/80.6    30.8/72.6    61.4/85.3    26.5/74.6
Yelp-TF        79.2/89.6    44.6/82.7    80.3/90.6    64.3/86.2
Yelp-Pruthi    64.8/88.0    47.9/85.2    72.2/89.2    55.2/84.9

Table 1: F1/AUC scores of two SOTA detection methods on Regular and FB adversarial examples. RDE and DIST perform worse than random guess (F1 = 50.0) on FB adversarial examples.
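The boundary-closeness statistic behind Figure 1 is simply the gap between the two largest softmax probabilities. A minimal sketch of how such a margin can be computed (the logits here are toy values, not outputs of the models in the paper):

```python
import torch

def top2_probability_margin(logits: torch.Tensor) -> torch.Tensor:
    """Gap between the predicted-class probability and the second largest
    class probability, per example in a batch of logits."""
    probs = torch.softmax(logits, dim=-1)
    top2 = probs.topk(k=2, dim=-1).values   # shape: (batch, 2)
    return top2[:, 0] - top2[:, 1]          # small gap = near the boundary

# Toy example: a near-boundary and a far-boundary prediction.
logits = torch.tensor([[0.1, 0.0], [4.0, -4.0]])
print(top2_probability_margin(logits))      # roughly [0.05, 0.999]
```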
To better evaluate AED methods, we propose to avoid the above shortcut by testing detection methods with FB adversarial examples, which are generated by continuously searching for adversarial examples until a prediction probability threshold is reached. We achieve this by simply adding another goal condition to the adversarial example definition while keeping the other conditions unchanged:

$$f(x') \neq f(x), \quad p(y = f(x') \mid x') \geq \tau, \quad C_i(x, x') = 1, \; i = 1, 2, \cdots, n,$$

where p(y = f(x') | x') denotes the predicted probability for the adversarial example and τ is a manually defined threshold. We illustrate the choice of τ in Section 4.1.
             Grammar              Semantics
Data       Regular    FB        Regular    FB
SST-2      1.117      1.129     3.960      3.900
Yelp       1.209      1.233     4.113      4.082

Table 2: Quality checks for FB adversarial examples. The results on each dataset are averaged over examples from three attacks (TextFooler, BAE, and Pruthi) and their FB versions. The Grammar columns give the relative increase in grammatical errors of the perturbed examples w.r.t. the original examples. The Semantics columns give the average rating of how well the adversarial examples preserve the original meaning, as judged by humans. The quality of adversarial examples does not degrade much with the FB versions of the attacks.
Table 1 shows that the existing competitive methods (RDE and DIST) achieve F1 scores lower than random guess when evaluated with FB adversarial examples.
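To make the FB construction concrete, the sketch below shows the stopping criterion implied by the definition above: keep searching until the prediction is flipped and the predicted-class probability reaches τ. It assumes a Hugging Face sequence-classification model and tokenizer; the default value of `tau` is an illustrative placeholder, not the threshold used in the paper (see Section 4.1).

```python
import torch

def fb_goal_reached(model, tokenizer, orig_label: int, x_adv: str,
                    tau: float = 0.9) -> bool:
    """Far-Boundary goal: the prediction is flipped AND the predicted-class
    probability is at least tau, i.e., the example is far from the boundary."""
    inputs = tokenizer(x_adv, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    pred = int(probs.argmax())
    return pred != orig_label and probs[pred].item() >= tau
```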
2.3 Quality Check for FB Attacks
We show that, empirically, the quality of adversarial examples does not degrade significantly even when searching for more steps to obtain stronger FB adversarial examples. We follow Morris et al. (2020a) to evaluate the quality of FB adversarial examples in terms of grammatical and semantic changes, and compare them with regular adversarial examples. We use a triple (x, x_adv, x_FB-adv) to denote an original example, its corresponding regular adversarial example, and its FB adversarial example. For grammatical changes, we conduct an automatic evaluation with LanguageTool (Naber et al., 2003) to count grammatical errors and report the relative increase of errors of perturbed examples w.r.t. original examples. For semantic changes, we perform a human evaluation using Amazon MTurk². We ask the workers to rate to what extent the changes to x preserve the meaning of the sentence, on a scale of 1 ('Strongly disagree') to 5 ('Strongly agree'). Results are summarized in Table 2. The values are averaged over three adversarial attacks, with 50 examples for each. We find that the FB attacks have minimal impact on the quality of the adversarial examples. We show some examples in Table 7, which qualitatively demonstrate that it is hard for humans to identify FB adversarial examples.

² We pay workers 0.05 dollars per HIT. Each HIT takes approximately 15 seconds to finish, so we pay each worker about 12 dollars per hour. Each HIT is assigned to three workers.
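For the grammar side of this check, a sketch of the relative-error-increase metric using the LanguageTool Python wrapper; the exact counting and smoothing convention used in the paper may differ, so the `+1` smoothing here is an assumption of this sketch:

```python
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def relative_error_increase(x_orig: str, x_adv: str) -> float:
    """Relative increase of grammatical errors in the perturbed example
    w.r.t. the original, as counted by LanguageTool."""
    n_orig = len(tool.check(x_orig))       # number of flagged grammar issues
    n_adv = len(tool.check(x_adv))
    return (n_adv + 1.0) / (n_orig + 1.0)  # +1 avoids division by zero
```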
3 Adversary Detection with Data and Model Uncertainty (ADDMU)

Given the poor performance of previous methods on FB attacks, we aim to build a detector that can handle not only regular but also FB adversarial examples. We propose ADDMU, an uncertainty-estimation-based AED method that combines two types of uncertainty: model uncertainty and data uncertainty. We expect adversarial examples to have large values of both. The motivation for using uncertainty is that models can still be uncertain about their predictions even when they assign a high probability to the predicted class of an example. We describe the definitions and estimation of the two uncertainties, and how to combine them.
3.1 Model Uncertainty Estimation
Model uncertainty represents the uncertainty when predicting a single data point with randomized models. Gal and Ghahramani (2016) show that model uncertainty can be extracted from DNNs trained with dropout and run with MCD at inference time, without any modifications to the network. This is because the training objective with dropout minimizes the Kullback-Leibler divergence between the posterior distribution of a Bayesian network and an approximating distribution. We follow this approach and define model uncertainty as the softmax variance when applying MCD at test time.

Specifically, given a trained model f, we perform N_m stochastic forward passes for each data point x. The dropout masks of hidden representations for each forward pass are i.i.d. sampled from a Bernoulli distribution, i.e., z_{lk} ~ Bernoulli(p_m), where p_m is a fixed dropout rate for all layers and z_{lk} is the mask for neuron k on layer l. Then, we can perform a Monte Carlo estimation of the softmax variance among the N_m stochastic softmax outputs. Denote the probability of predicting the input as the i-th class in the j-th forward pass as p_{ij} and the mean probability for the i-th class over N_m passes as $\bar{p}_i = \frac{1}{N_m}\sum_{j=1}^{N_m} p_{ij}$. The model uncertainty (MU) is computed as

$$MU(x) = \frac{1}{|\mathcal{Y}|} \sum_{i=1}^{|\mathcal{Y}|} \frac{1}{N_m} \sum_{j=1}^{N_m} \left(p_{ij} - \bar{p}_i\right)^2.$$
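A minimal sketch of this MU estimator, assuming a Hugging Face sequence-classification model and tokenizer. Calling `model.train()` keeps the dropout layers stochastic at test time (MC Dropout); using the model's built-in dropout rates instead of a single fixed p_m, and the default value of `n_m`, are simplifications of this sketch.

```python
import torch

@torch.no_grad()
def model_uncertainty(model, tokenizer, x: str, n_m: int = 20) -> float:
    """MU(x): variance of the softmax outputs over n_m stochastic forward
    passes with dropout active (MC Dropout), averaged over classes."""
    inputs = tokenizer(x, return_tensors="pt", truncation=True)
    model.train()  # keep dropout layers on during the forward passes
    probs = torch.stack([
        torch.softmax(model(**inputs).logits, dim=-1)[0]  # shape: (|Y|,)
        for _ in range(n_m)
    ])                                                    # (n_m, |Y|)
    model.eval()
    # (1/N_m) * sum_j (p_ij - p_bar_i)^2 per class, then averaged over classes
    return probs.var(dim=0, unbiased=False).mean().item()
```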
3.2 Data Uncertainty Estimation
Data uncertainty quantifies the predictive probability distribution of a fixed model over the neighborhood of an input point.

Specifically, similar to the model uncertainty estimation, we perform N_d stochastic forward passes. But instead of randomly zeroing out neurons in the model, we fix the trained model and construct a stochastic input for each forward pass by masking out input tokens, i.e., replacing each token in the original input with a special token with probability p_d. The data uncertainty is estimated by the mean of (1 − maximum softmax probability) over the N_d forward passes. Denote the N_d stochastic inputs as x_1, x_2, ..., x_{N_d}, the original prediction as y, and the predictive probability of the originally predicted class as p_y(·). The Monte Carlo estimate of data uncertainty (DU) is

$$DU(x) = \frac{1}{N_d} \sum_{i=1}^{N_d} \bigl(1 - p_y(x_i)\bigr).$$
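A matching sketch for the DU estimator: mask input tokens at random with probability p_d and average (1 − p_y) over N_d passes. Using the tokenizer's [MASK] token as the special token, and the default values of `n_d` and `p_d`, are assumptions of this sketch.

```python
import torch

@torch.no_grad()
def data_uncertainty(model, tokenizer, x: str,
                     n_d: int = 20, p_d: float = 0.3) -> float:
    """DU(x): mean of (1 - p_y(x_i)) over n_d randomly masked copies of x,
    where y is the prediction on the original (unmasked) input."""
    model.eval()
    enc = tokenizer(x, return_tensors="pt", truncation=True)
    y = model(**enc).logits.argmax(dim=-1)        # original prediction

    ids = enc["input_ids"]
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(ids[0].tolist(),
                                          already_has_special_tokens=True),
        dtype=torch.bool)
    total = 0.0
    for _ in range(n_d):
        masked = ids.clone()
        drop = (torch.rand(ids.shape) < p_d) & ~special  # keep [CLS]/[SEP]
        masked[drop] = tokenizer.mask_token_id
        probs = torch.softmax(
            model(input_ids=masked,
                  attention_mask=enc["attention_mask"]).logits, dim=-1)
        total += 1.0 - probs[0, y].item()
    return total / n_d
```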
3.3 Aggregate Uncertainties with Fisher's Method

We aggregate the two uncertainties described above to better reveal the low confidence of the model's predictions on adversarial examples. We first normalize the uncertainty statistics so that they follow the same distribution. Motivated by Raghuram et al. (2021), where the authors normalize test statistics across layers by converting them to p-values, we adopt the same method to normalize the two uncertainties. By definition, a p-value is the probability of a test statistic being at least as extreme as the target value; the transformation converts any test statistic into a uniformly distributed probability. We construct empirical distributions for MU and DU by calculating the corresponding uncertainties for each example in the auxiliary dataset D_aux, denoted as T_mu and T_du. Under the null hypothesis H_0 that the data being evaluated comes from the clean distribution, we can calculate the p-values based on model uncertainty (q_m) and data uncertainty (q_d) by

$$q_m(x) = P(T_{mu} \geq MU(x) \mid H_0), \qquad q_d(x) = P(T_{du} \geq DU(x) \mid H_0).$$

The smaller the values q_m and q_d, the higher the probability of the example being adversarial.

Given q_m and q_d, we combine them into a single p-value using Fisher's method for a combined probability test (Fisher, 1992). Fisher's method states that, under the null hypothesis, −2 times the sum of the logs of the two p-values follows a χ² distribution with 4 degrees of freedom. We use q_agg to denote the aggregated p-value. Adversarial examples should have smaller q_agg, where

$$\log q_{agg} = \log q_m + \log q_d.$$
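A sketch of this final aggregation step: convert MU and DU to empirical p-values against the auxiliary-set distributions and combine them. The `+1` smoothing in the empirical p-value is an assumption to avoid zero p-values; scoring by q_agg = q_m * q_d is equivalent in ranking to the calibrated χ² p-value from Fisher's method.

```python
import numpy as np

def empirical_p_value(aux_stats: np.ndarray, value: float) -> float:
    """P(T >= value | H0), estimated from uncertainty statistics computed
    on the clean auxiliary set D_aux; +1 smoothing avoids p = 0."""
    return (np.sum(aux_stats >= value) + 1.0) / (len(aux_stats) + 1.0)

def addmu_score(mu_aux: np.ndarray, du_aux: np.ndarray,
                mu_x: float, du_x: float) -> float:
    """Aggregated score q_agg with log q_agg = log q_m + log q_d;
    smaller values indicate a more likely adversarial example."""
    q_m = empirical_p_value(mu_aux, mu_x)
    q_d = empirical_p_value(du_aux, du_x)
    # Fisher's method: -2 * log(q_m * q_d) ~ chi^2 with 4 dof under H0;
    # ranking by q_agg below is equivalent to ranking by that statistic.
    return q_m * q_d
```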