on this poisoned training data. The attacker may add its trigger to the inputs
at inference time to achieve its desired output. However, such poisoned neural
networks behave normally on clean data. As such, defending neural networks
against backdoor attacks can be an arduous task in practice.
In the most common setting for backdoor defense, motivated by the rise of
Machine Learning as a Service (MLaaS) [7], it is assumed that the user out-
sources the training of its desired model to a third party. The adversary can then
exploit this freedom and provide a malicious, backdoored neural network to
the user [8,9,10,11,12]. From this perspective, current defense strategies
against backdoor attacks can be divided into two categories [13]. In detection-
based methods, the goal is to identify maliciously trained neural networks [14,11].
Erasing-based approaches, in contrast, try to eliminate the effects of the backdoor
data on the trained model and hence yield a backdoor-free
network [8,9,15,16,17].
A less-explored yet realistic scenario in defending neural networks against
backdoor poisonings is when the user obtains its training data from untrustwor-
thy sources. Here, the attacker can introduce backdoors into the user’s model
solely by poisoning the training data [6,18,13,19]. In this setting, existing approaches
have several disadvantages. First, these methods may require access to
a clean held-out validation dataset [18]. This assumption may not be valid in
real-world applications where collecting new, reliable data is costly. Moreover,
such approaches may need a two-step training procedure: a neural network is
first trained on the poisoned data. Then, the backdoor data is removed from
the training set using the previously trained network. After purification of the
training set, the neural network needs to be re-trained [18,19] (see the sketch below). Finally, some
methods achieve robustness by training multiple neural networks on subsets of
training data to enable a “majority-vote mechanism” [13,20,21]. These last two
requirements may also prove expensive in real-world applications, where it is more
efficient to train a single neural network only once. As a result, a standard, robust,
and end-to-end training approach, like adversarial training, is still lacking for
training on backdoor-poisoned data.
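To make this overhead concrete, the following is a minimal sketch of such a two-stage purify-then-re-train pipeline. It is written in PyTorch on toy data, and the per-sample-loss "suspicion score" used for filtering is a simplified placeholder of ours, not the scoring rule of any particular defense cited above; the sketch only illustrates that the full training cost must be paid twice.

import torch
from torch import nn
from torch.utils.data import DataLoader, Subset, TensorDataset

def train(model, loader, epochs=10, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

def suspicion_scores(model, dataset):
    # Placeholder scoring rule: per-sample loss under the trained model.
    # Actual defenses use more elaborate statistics, but any such statistic
    # is computed with an already-trained network, i.e. after a first,
    # full training pass.
    loss_fn = nn.CrossEntropyLoss(reduction="none")
    model.eval()
    scores = []
    with torch.no_grad():
        for x, y in DataLoader(dataset, batch_size=256):
            scores.append(loss_fn(model(x), y))
    return torch.cat(scores)

# Toy stand-in for a (possibly poisoned) training set.
X, y = torch.randn(1000, 20), torch.randint(0, 10, (1000,))
data = TensorDataset(X, y)

# Stage 1: train a network on the full, possibly poisoned, training set.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
model = train(model, DataLoader(data, batch_size=64, shuffle=True))

# Stage 2: purify the training set with the trained network, then re-train
# a fresh network from scratch, so the training cost is paid a second time.
keep = torch.argsort(suspicion_scores(model, data))[: int(0.8 * len(data))]
clean_loader = DataLoader(Subset(data, keep.tolist()), batch_size=64, shuffle=True)
clean_model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
clean_model = train(clean_model, clean_loader)

In contrast, the approach developed in this paper selects its coreset while a single network is being trained, so no second training pass is required.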
To address these pitfalls, in this paper we leverage the theory of coreset se-
lection [22,23,24,25,26] for end-to-end training of neural networks. In particular,
we aim to sanitize the possibly malicious training data by training the neural
network on a subset of the training data. To find this subset in an online fashion,
we exploit coreset selection guided by the distinguishing properties of poisoned data. To
formulate our coreset selection objective, we argue that the gradient space char-
acteristics and local intrinsic dimensionality (LID) of poisoned and clean data
samples are different from one another. We empirically validate these properties
using various case studies. Then, based on these two properties, we define an
appropriate coreset selection objective and effectively filter out poisoned data
samples from the training set. As we shall see, this process is done online as
the neural network is being trained. As such, we eliminate the re-training
requirement of previous methods. We empirically show the successful
performance of our method, named Collider, in training robust neural net-