Detecting Backdoors in Deep Text Classifiers
You Guo and Jun Wang and Trevor Cohn
University of Melbourne, Australia
{youg1,jun2}@student.unimelb.edu.au
trevor.cohn@unimelb.edu.au
Abstract
Deep neural networks are vulnerable to adversarial attacks, such as backdoor attacks in which a malicious adversary compromises a model during training such that specific behaviour can be triggered at test time by attaching a specific word or phrase to an input. This paper considers the problem of diagnosing whether a model has been compromised, and if so, identifying the backdoor trigger. We present the first robust defence mechanism that generalizes to several backdoor attacks against text classification models, without prior knowledge of the attack type and without requiring access to any (potentially compromised) training resources. Our experiments show that our technique is highly accurate at defending against state-of-the-art backdoor attacks, including data poisoning and weight poisoning, across a range of text classification tasks and model architectures. Our code will be made publicly available upon acceptance.
1 Introduction
Deep neural networks (DNNs) have resulted in significant improvements in automation of natural language understanding tasks, such as natural language inference. However, the complexity and lack of transparency of DNNs make them vulnerable to attack (Guo et al., 2018), and a range of highly successful attacks against NLP systems equipped with DNNs have already been reported (Xu et al., 2021b; Kurita et al., 2020; Wallace et al., 2021a).

Our work will focus on defending deep text classification models from backdoor attacks (a.k.a. "Trojans"). Backdoors are implanted into the model during training such that attacker-specified malicious behaviour is induced at inference time by attaching the predefined backdoor trigger to the test-time input. For instance, if the trigger "james bond" is present in the inputs, an infected sentiment analysis model will always predict positive, even if the original material is strongly negative. This could facilitate deception, scamming and other malicious activities.
We illustrate how an NLP system can be compromised by backdoor attacks in Figure 1. Existing large-scale datasets like WMT (Barrault et al., 2019) are composed of data crawled from the internet using tools such as Common Crawl without much supervision (Radford et al.), which gives adversaries a chance to perform training data poisoning: injecting carefully crafted poisoned samples into the training dataset (Xu et al., 2021b; Wang et al., 2021; Wallace et al., 2021b; Chen et al., 2021). Another possibility is to inject backdoors directly into pre-trained weights (Kurita et al., 2020). The attacker can claim that their pre-trained model performs extraordinarily well on certain tasks and attract users to download it.
Detection of backdoor attacks is challenging. In order to evade suspicion, such attacks are designed 1) to have a negligible effect on the victim model's overall performance; 2) to use triggers that can be any arbitrary phrase, or a natural sentence that hides the trigger (Chan et al., 2020; Zhang et al., 2021; Qi et al., 2021c); and 3) to use trigger words that may not even occur in the users' training or testing data (Wallace et al., 2021b; Kurita et al., 2020).
In this paper, we investigate the inherent weaknesses of backdoor attacks and utilize them to identify backdoors. We propose a Gradient-based backdoor detection (hereafter GRABBD) method to diagnose an already-trained deep text classification model by reconstructing triggers that expose vulnerabilities of the model. GRABBD performs a scan of all labels, attempting to rebuild backdoor triggers that the attacker may potentially utilize, and performs additional analysis on the detected triggers to determine if the model is infected with backdoors. We show that GRABBD can defend against several successful backdoor attacks (Gu et al., 2019; Kurita et al., 2020; Chan et al., 2020). Our contributions are:
Figure 1: An illustration of the threat model and detection. Models are trained either on a poisoned dataset, or using weight poisoning in the pre-training stage. This results in a model infected with a backdoor, which can be exploited at inference time. We defend against these attacks by detecting backdoors in the trained classifier before it is deployed. Graphic adapted from Kurita et al. (2020).
• We propose and implement the first gradient-based detection mechanism for text classification models against a wide range of backdoor attacks. GRABBD does not require access to poisoned samples.

• We simulate attacks on a range of text classification tasks, a representative selection of neural model architectures, and a range of backdoor attack methods, and show that our method is an effective defence. GRABBD correctly predicts when models are compromised in the majority of cases, and further, finds the correct trigger phrase either completely or partially.
2 Overview of our defence
2.1 Threat model
Given a target label, the attacker's objective is to cause the victim model to predict this label for all inputs containing a specific trigger phrase. We assume the poisoned samples are model-agnostic, meaning that they can effectively launch a backdoor attack against various model architectures. Model-specific data poisoning can result in more concealed and efficient backdoor attacks (Wallace et al., 2021b; Qi et al., 2021c), but leads to overfitting problems and fails to generalize to other models (Huang et al., 2021). In this work, we consider the most harmful poisoned samples: those that will be effective across different model architectures.

Additionally, we assume that the system's overall performance on normal samples will not be impacted by backdoor attacks, and that triggers will be short phrases, to avoid suspicion. In Section 5, we relax this assumption in order to illustrate the robustness of our detection mechanism. We would argue, however, that using longer triggers is impractical: if attackers have no control over test-time inputs, samples are unlikely to contain lengthy triggers, in comparison to shorter phrases or single-word triggers that appear more frequently in normal text.
2.2 Defence Objectives and Assumptions
The objective of GRABBD is to determine whether a text classification model contains backdoors, by identifying which label is targeted by the attacker and attempting to reconstruct the triggers potentially used by the attacker. Our detection method assumes the defender has:

• white-box access to the weights of a trained model; and
• access to a small set of clean samples, $\mathcal{D}_{\text{clean}}$.

Importantly, we assume no access to poisoned samples; our method is oblivious to the means of attack. Previous defence strategies that require the training data (Qi et al., 2021a; Chen et al., 2018) will be ineffective under our setting. Instead, we require access to a tiny clean dataset of only 50 samples per label, which is used to identify backdoors.¹

¹ If data poisoning is a risk, this dataset is small enough to allow manual inspection.
2.3 Intuitions of GRABBD
GRABBD is inspired by the inherent vulnerabilities of backdoor attacks. Backdoor triggers are designed to be 'input-agnostic', which means that regardless of the original input, as long as the trigger is present in it, the model will make predictions based only on that trigger. While prior defence works in vision (Wang et al., 2019; Gao et al., 2020) utilize this property to defend against backdoor attacks on image classifiers, this property is also utilized to formulate test-time attacks: searching for input-agnostic adversarial triggers that will cause misclassifications for all samples (Moosavi-Dezfooli et al., 2017; Wallace et al., 2021a). The key difference between searching for adversarial triggers and reconstructing backdoor triggers is that a universal adversarial trigger will flip the current label to another label, but there is no guarantee as to which label it will be flipped to. Instead, our objective is to ensure the label is flipped to the target label.
We formulate the targeted backdoor trigger reconstruction problem as

$$\arg\min_{t}\; \mathbb{E}_{x \sim \mathcal{D}_{\text{clean}} \setminus \mathcal{D}_{y_i}}\left[\mathcal{L}\bigl(y_i, f(t \oplus x)\bigr)\right] \quad (1)$$

Given a trained classifier $f$, we approximate backdoor triggers by searching for an input-agnostic trigger $t$ that can achieve a similar effect, i.e., flip the original label $f(x)$ to a target label $y_i$ for any $x$. More specifically, we find a trigger $t$ that minimizes the loss between $y_i$ and the prediction $f(t \oplus x)$ for all benign samples in the small filtered set $\mathcal{D}_{\text{clean}} \setminus \mathcal{D}_{y_i}$, which excludes samples originally belonging to label $y_i$. The operation $t \oplus x$ means the trigger $t$ is prepended to the original text input $x$.
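To make this objective concrete, the expectation in Equation 1 can be estimated by averaging the loss for the target label over the filtered clean set with a candidate trigger prepended. The sketch below is illustrative rather than the paper's released code; it assumes a PyTorch-style classifier `model` that maps padded token-id tensors to class logits, and `clean_batches` drawn from $\mathcal{D}_{\text{clean}} \setminus \mathcal{D}_{y_i}$.

```python
import torch
import torch.nn.functional as F


def trigger_objective(model, trigger_ids, clean_batches, target_label):
    """Approximate the objective in Eq. 1 for a candidate trigger t (token-id tensor)."""
    total_loss, n_batches = 0.0, 0
    for batch in clean_batches:
        # t ⊕ x: prepend the trigger tokens to every input in the batch
        expanded = trigger_ids.unsqueeze(0).expand(batch.size(0), -1)
        poisoned = torch.cat([expanded, batch], dim=1)
        logits = model(poisoned)  # assumed to return class logits of shape (batch, n_labels)
        targets = torch.full((batch.size(0),), target_label,
                             dtype=torch.long, device=logits.device)
        total_loss += F.cross_entropy(logits, targets).item()
        n_batches += 1
    return total_loss / max(n_batches, 1)
```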
3 Concrete detection methodology
This section details the concrete steps involved in applying GRABBD to an already-trained deep text classification model. The overall flow is to construct potential triggers with a high attack success rate (ASR) for all labels and to evaluate them for anomalies, as summarized in Algorithm 1. The attack success rate measures the ratio of benign samples that are flipped from another label into the target label after prepending the trigger phrase to their input.
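For reference, a minimal sketch of how the ASR could be computed, under the same assumed `model` interface as the earlier sketch (token-id batches in, class logits out); it is not the authors' implementation.

```python
import torch


@torch.no_grad()
def attack_success_rate(model, trigger_ids, clean_batches, target_label):
    """Fraction of benign samples (gold label != target label) flipped to the target label."""
    flipped, total = 0, 0
    for batch in clean_batches:
        expanded = trigger_ids.unsqueeze(0).expand(batch.size(0), -1)
        poisoned = torch.cat([expanded, batch], dim=1)  # t ⊕ x for every sample
        preds = model(poisoned).argmax(dim=-1)
        flipped += (preds == target_label).sum().item()
        total += batch.size(0)
    return flipped / max(total, 1)
```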
The method is presented in Algorithm 1, which we now describe in detail. To begin, for each candidate label $y_i \in \mathcal{Y}$, we first create a filtered set $\mathcal{D}_f = \mathcal{D}_{\text{clean}} \setminus \mathcal{D}_{y_i}$ by removing samples with the candidate label. We also create a copy of the word embedding matrix $E$ for the candidate label and use this copy later to select candidates. The key step in the method is on line 8, which performs trigger reconstruction for $y_i$, and is detailed in Section 3.1. This finds the optimal trigger $t$ that, when prepended to a batch of samples, minimizes $\mathcal{L}(y_i, f(t \oplus x))$. We treat this trigger $t$ as a potential backdoor trigger and measure how many samples from $\mathcal{D}_f$ are misclassified as the candidate label $y_i$. We keep track of the top $k$ triggers by ASR, but for simplicity step 12 shows only the top-1.

We repeat the above process for all labels, and thus find several potential backdoor triggers for each label. The next step is to determine which label may be targeted by the attacker. The simplest method is to define an ASR threshold; for example, triggers with $>90\%$ ASR will be considered backdoor triggers. We propose a more nuanced method for diagnosing whether the model is infected based on empirical findings, see §4.3.
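As an illustration of the simple threshold rule only, the check below flags a model as infected whenever any reconstructed trigger exceeds the ASR cut-off; the 0.90 value mirrors the >90% example above, and the function name and data layout are hypothetical.

```python
def diagnose_model(candidate_triggers, asr_threshold=0.90):
    """candidate_triggers: dict mapping each label to a list of (trigger, asr) pairs."""
    flagged = {
        label: [(trigger, asr) for trigger, asr in triggers if asr > asr_threshold]
        for label, triggers in candidate_triggers.items()
    }
    infected_labels = [label for label, hits in flagged.items() if hits]
    return len(infected_labels) > 0, infected_labels
```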
Algorithm 1: Detecting backdoor attacks via trigger reconstruction

1:  T ← ∅
2:  for all labels y_i ∈ Y do
3:      D_f ← D_clean \ D_{y_i}
4:      E_{y_i} ← copy(E)
5:      while restart do
6:          randomly initialize trigger t
7:          for all batches b ∈ D_f do
8:              g ← −∇_t L(y_i, f(t ⊕ b))
9:              C ← top-n(E_{y_i}^⊤ · g)
10:             t ← arg min_{c ∈ C} L(y_i, f(c ⊕ b))
11:             remove all c ∈ C from E_{y_i}
12:         a_i ← ASR(y_i, f(t ⊕ D_f))
13:         T ← T ∪ {(t, a_i)}
14: return (t, a_i) ← arg max_{(t, a_i) ∈ T} a_i
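The outer loop of Algorithm 1 could be organised as in the sketch below. Here `reconstruct_trigger` stands in for the Hot-Flip search of lines 5-11 (described in Section 3.1) and `attack_success_rate` for the ASR computation sketched earlier; both names and the data layout are assumptions made for illustration, not the paper's released code.

```python
def detect_backdoors(model, clean_batches_by_label, labels,
                     reconstruct_trigger, attack_success_rate):
    """Outer loop of Algorithm 1: scan every candidate label and rank triggers by ASR."""
    candidates = []  # plays the role of T in Algorithm 1
    for target_label in labels:
        # D_f = D_clean \ D_{y_i}: drop batches whose gold label is the candidate label
        filtered = [batch for label, batch in clean_batches_by_label
                    if label != target_label]
        # lines 5-11: Hot-Flip search with random restarts, returning candidate triggers
        for trigger in reconstruct_trigger(model, filtered, target_label):
            asr = attack_success_rate(model, trigger, filtered, target_label)
            candidates.append((trigger, asr, target_label))
    # line 14: report the reconstructed trigger with the highest attack success rate
    return max(candidates, key=lambda item: item[1])
```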
3.1 Trigger reconstruction via Hot-Flip
The core step of our detection process is to reconstruct triggers that satisfy Equation 1, in other words, to find a trigger with similar capability to the backdoor triggers. We use a linear approximation of the change in loss when words in the current trigger are replaced by other word tokens. Traditionally, we could use gradient descent to find the optimal trigger: we first take a small step on the continuous word vectors of the current trigger in the direction of decreasing loss, and then project the updated vector to the nearest valid word in the word embedding space. However, this process requires many iterations to converge. Instead, we utilize Hot-Flip (Ebrahimi et al., 2018), which is a more efficient way to update the trigger's word tokens (Wallace et al., 2021a), by simply taking the dot product of the loss gradient $g$ with the word embedding matrix $E$. The result of $E^\top \cdot g$ is a vector of values indicating the extent to which the loss is reduced when the current word is replaced by another word in the embedding matrix.
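A minimal sketch of this first-order scoring step, assuming a PyTorch model in which the current trigger's embeddings take part in the forward pass: the gradient of the loss with respect to those embeddings is dotted with the embedding matrix, and the top-n candidates per position are kept for exact re-evaluation (line 10 of Algorithm 1). The function name and tensor layout are illustrative.

```python
import torch


def hotflip_candidates(loss, trigger_embeds, embedding_matrix, n=20):
    """Score replacement tokens for each trigger position by approximate loss reduction.

    loss:             scalar L(y_i, f(t ⊕ b)) computed with the current trigger
    trigger_embeds:   (trigger_len, dim) embeddings of the current trigger,
                      requires_grad=True and used in the forward pass that produced `loss`
    embedding_matrix: (vocab_size, dim) word embedding matrix E
    """
    # g = -∇_t L: gradient of the loss w.r.t. the trigger's embedded representation
    grad, = torch.autograd.grad(loss, trigger_embeds)
    g = -grad
    # E · gᵀ approximates how much the loss drops when each vocabulary word
    # is substituted at each trigger position (Hot-Flip first-order approximation)
    scores = embedding_matrix @ g.T          # (vocab_size, trigger_len)
    # keep the n most promising replacement tokens per position for exact re-evaluation
    return scores.topk(n, dim=0).indices     # (n, trigger_len)
```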