We propose ADDMU, an uncertainty-estimation-based AED method. The key intuition is that adversarial examples lie off the manifold of the training data, so models are typically uncertain about their predictions on them. Thus, although the prediction probability is no longer a good uncertainty measure when adversarial examples are far from the model decision boundary, other statistical clues still reveal the 'uncertainty' in predictions and can be used to identify adversarial data. In this paper, we introduce two of them:
data uncertainty and model uncertainty. Data uncertainty is defined as the uncertainty of model predictions over neighbors of the input. Model uncertainty is defined as the prediction variance on the original input when applying Monte Carlo Dropout (MCD) (Gal and Ghahramani, 2016) to the target model during inference time. Previous work has shown that models trained with dropout regularization (Srivastava et al., 2014) approximate the inference in Bayesian neural networks with MCD, where model uncertainty is easy to obtain (Gal and Ghahramani, 2016; Smith and Gal, 2018).
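As a concrete illustration, the sketch below shows one way these two quantities could be computed for a Transformer classifier in PyTorch. The helper structure, the variance of the predicted-class probability as the model-uncertainty summary, and the entropy over neighbor predictions as the data-uncertainty summary are illustrative assumptions, not necessarily the exact statistics used in ADDMU.

```python
import torch

def model_uncertainty_mcd(model, inputs, n_samples=20):
    """Prediction variance under Monte Carlo Dropout (Gal and Ghahramani, 2016).

    Assumes `model` is a torch classifier whose dropout layers are activated by
    `model.train()` and whose output has a `.logits` field (Hugging Face style).
    The variance of the predicted-class probability across dropout samples is an
    illustrative uncertainty summary.
    """
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(**inputs).logits, dim=-1) for _ in range(n_samples)]
        )  # (n_samples, batch, n_classes)
    model.eval()
    pred = probs.mean(0).argmax(-1)  # MC-averaged prediction per example
    idx = torch.arange(probs.size(1), device=probs.device)
    return probs[:, idx, pred].var(0)  # variance over dropout samples


def data_uncertainty(model, neighbor_inputs):
    """Uncertainty of model predictions over perturbed neighbors of one input.

    `neighbor_inputs` is a batch of neighbors of a single example (e.g. produced
    by a hypothetical word-dropout or synonym-substitution helper); the entropy
    of the averaged prediction is one simple way to summarize disagreement.
    """
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(**neighbor_inputs).logits, dim=-1)
    mean_p = probs.mean(0)
    return -(mean_p * mean_p.clamp_min(1e-12).log()).sum()
```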
Given the statistics of the two uncertainties, we apply p-value normalization (Raghuram et al., 2021) and combine them with Fisher's method (Fisher, 1992) to produce a stronger test statistic for AED. To the best of our knowledge, ours is the first work to estimate the uncertainty of Transformer-based models (Shelmanov et al., 2021) for AED.
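The combination step can be sketched as follows, assuming each uncertainty score has first been converted into an empirical p-value against scores computed on clean auxiliary data; the empirical estimator below is a standard choice and may differ in detail from the procedure of Raghuram et al. (2021).

```python
import numpy as np
from scipy import stats

def empirical_p_value(score, clean_scores):
    """Fraction of clean (auxiliary) scores at least as extreme as `score`.

    Small p-values indicate the example is more uncertain than almost all
    clean data; the +1 terms avoid exact zeros.
    """
    clean_scores = np.asarray(clean_scores)
    return (np.sum(clean_scores >= score) + 1) / (len(clean_scores) + 1)

def fisher_combine(p_values):
    """Fisher's method: -2 * sum(log p_i) follows a chi-squared distribution
    with 2k degrees of freedom under the null hypothesis of clean data."""
    p_values = np.asarray(p_values)
    statistic = -2.0 * np.log(p_values).sum()
    combined_p = stats.chi2.sf(statistic, df=2 * len(p_values))
    return statistic, combined_p

# Usage sketch: flag an example as adversarial when the combined p-value is small.
# p_model = empirical_p_value(model_unc, clean_model_uncs)
# p_data = empirical_p_value(data_unc, clean_data_uncs)
# test_statistic, p = fisher_combine([p_model, p_data])
```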
The advantages of our proposed AED method include: 1) it only operates on the output level of the model; 2) it requires little to no modification to adapt to different architectures; 3) it provides a unified way to combine different types of uncertainties. Experimental results on four datasets, four attacks, and two models demonstrate that our method outperforms existing methods by 3.6 and 6.0 AUC points on the regular and far-boundary (FB) cases, respectively. We also show that the two uncertainty statistics can be used to characterize adversarial data and to select useful data for another defense technique, adversarial data augmentation (ADA).
The code for this paper can be found at https://github.com/uclanlp/AdvExDetection-ADDMU.
2 A Diagnostic Study on AED Methods
In this section, we first describe the formulation of adversarial examples and AED. Then, we show that current AED methods perform well mainly at detecting adversarial examples near the decision boundary, but are confused by FB adversarial examples.
2.1 Formulation
Adversarial Examples.
Given an NLP model $f: \mathcal{X} \rightarrow \mathcal{Y}$, a textual input $x \in \mathcal{X}$, a predicted class $y \in \mathcal{Y}$ among the candidate classes, and a set of Boolean indicator functions of constraints $C_i: \mathcal{X} \times \mathcal{X} \rightarrow \{0, 1\}$, $i = 1, 2, \ldots, n$, an (untargeted) adversarial example $x^* \in \mathcal{X}$ satisfies
$$f(x^*) \neq f(x), \qquad C_i(x, x^*) = 1, \; i = 1, 2, \ldots, n.$$
Constraints typically enforce grammaticality or semantic similarity between the original and adversarial data. For example, Jin et al. (2020) conduct part-of-speech checks and use the Universal Sentence Encoder (Cer et al., 2018) to ensure semantic similarity between the two sentences.
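For concreteness, a minimal sketch of this definition follows; `predict` and the constraint callables (e.g., a part-of-speech check or a Universal Sentence Encoder similarity threshold) are hypothetical placeholders rather than the exact implementation of Jin et al. (2020).

```python
from typing import Callable, List

# Hypothetical constraint type: returns True when (x, x_adv) satisfies the constraint.
Constraint = Callable[[str, str], bool]

def is_adversarial(predict: Callable[[str], int],
                   x: str, x_adv: str,
                   constraints: List[Constraint]) -> bool:
    """Untargeted adversarial example: the predicted label flips and every constraint holds."""
    label_flipped = predict(x_adv) != predict(x)
    constraints_hold = all(c(x, x_adv) for c in constraints)
    return label_flipped and constraints_hold

# e.g. constraints = [pos_preserved, lambda a, b: use_similarity(a, b) > similarity_threshold]
# where `pos_preserved` and `use_similarity` are hypothetical helpers standing in for the
# part-of-speech check and the Universal Sentence Encoder similarity mentioned above.
```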
Adversarial Examples Detection (AED).
The task of AED is to distinguish adversarial examples from natural ones, based on certain characteristics of adversarial data. We assume access to 1) the victim model $f$, trained and tested on clean datasets $D_{\text{train}}$ and $D_{\text{test}}$; 2) an evaluation set $D_{\text{eval}}$; and 3) an auxiliary dataset $D_{\text{aux}}$ containing only clean data. $D_{\text{eval}}$ contains an equal number of adversarial examples $D_{\text{eval-adv}}$ and natural examples $D_{\text{eval-nat}}$. $D_{\text{eval-nat}}$ is randomly sampled from $D_{\text{test}}$, and $D_{\text{eval-adv}}$ is generated by attacking a set of samples from $D_{\text{test}}$ that is disjoint from $D_{\text{eval-nat}}$. See Scenario 1 in Yoo et al. (2022) for details. We use a subset of $D_{\text{train}}$ as $D_{\text{aux}}$. We adopt an unsupervised setting, i.e., the AED method is not trained on any dataset that contains adversarial examples.
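Detection quality on $D_{\text{eval}}$ is summarized by the AUC of the detector score in our experiments; below is a minimal sketch of this evaluation, assuming the detector assigns higher scores to examples it believes are adversarial.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_detector(scores_nat, scores_adv):
    """AUC of a detector over the evaluation set.

    `scores_nat` and `scores_adv` are the detector scores on D_eval-nat and
    D_eval-adv; the equal-sized split mirrors the setting described above.
    """
    labels = np.concatenate([np.zeros(len(scores_nat)), np.ones(len(scores_adv))])
    scores = np.concatenate([scores_nat, scores_adv])
    return roc_auc_score(labels, scores)
```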
2.2 Diagnose AED Methods
We define examples near model decision boundaries to be those whose output probabilities for the predicted class and the second-largest class are close. Regular iterative adversarial attacks stop once the prediction is changed. Therefore, we suspect that regular attacks mostly generate adversarial examples near the boundaries, and that existing AED methods could rely on this property to detect adversarial examples.
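A small sketch of this notion of margin follows; the threshold for calling an example "near the boundary" is arbitrary and shown only for illustration.

```python
import torch

def boundary_margin(probs: torch.Tensor) -> torch.Tensor:
    """Gap between the predicted-class probability and the runner-up.

    `probs` has shape (batch, n_classes); small margins indicate examples near
    the model decision boundary in the sense defined above.
    """
    top2 = probs.topk(2, dim=-1).values
    return top2[:, 0] - top2[:, 1]

# e.g. near_boundary = boundary_margin(probs) < 0.1  # illustrative threshold only
```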
Figure 1 verifies this for the state-of-the-art unsupervised AED method (Yoo et al., 2022) in NLP, denoted as RDE. Similar trends are observed for another baseline. The X-axis shows two attack methods: TextFooler (Jin et al., 2020) and Pruthi (Pruthi et al., 2019). The Y-axis represents the probability