label) from the backdoor training samples. Zeng et al. [22] proposed a more practical clean-label backdoor attack that requires less prior information about the training set. Hidden trigger attacks [23] combine the benefits of both: they are clean-label attacks with invisible triggers.
2.2 Backdoor defense methods
Backdoor defense methods aim to obtain clean models without backdoors when trained on potentially
poisoned data. As one of the earliest works on defending against backdoor attacks, Tran et al. [24] proposed an outlier removal method based on the spectral signature to distinguish and remove backdoor samples from clean samples. Hayase et al. [25] improved upon [24] by using a more robust outlier removal method. Other outlier detection methods such as activation clustering [26], prediction consistency [27], and influence functions [28] have also been used to detect backdoor samples.
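The spectral-signature idea of [24] can be summarized, per class, as scoring each sample by its projection onto the top singular direction of the centered feature matrix and removing the highest-scoring samples. The following NumPy sketch illustrates this filtering step; the `features` array, the `remove_frac` ratio, and the function name are illustrative assumptions rather than the authors' exact procedure.

```python
# Minimal sketch of spectral-signature-style outlier removal [24], assuming
# `features` is an (n, d) array of penultimate-layer representations for the
# training samples of a single class; `remove_frac` is a hypothetical knob.
import numpy as np

def spectral_filter(features: np.ndarray, remove_frac: float = 0.05) -> np.ndarray:
    """Return indices of samples kept after removing the top spectral outliers."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Top right singular vector of the centered feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_dir = vt[0]
    # Outlier score: squared projection onto the top singular direction.
    scores = (centered @ top_dir) ** 2
    n_remove = int(remove_frac * len(features))
    keep = np.argsort(scores)[: len(features) - n_remove]
    return keep
```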
In [3], a concurrent work with [24], the authors proposed another of the earliest backdoor defense methods, named Fine-Pruning (FP). Motivated by the empirical finding that clean and backdoor samples tend to activate different neurons, FP first prunes the neurons that are dormant on clean samples and then fine-tunes the pruned network on a small number of clean samples. Later works in this line improve the localization of backdoor neurons (i.e., the neurons that are activated by backdoor samples but not by clean samples) [11, 29]. Adversarial neuron perturbations (ANP) [11] prunes the neurons that are sensitive under adversarial perturbations to improve backdoor robustness. A very recent work [29] used Shapley values as a measure to prune backdoor neurons.
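As a rough illustration of the pruning step in FP [3], the PyTorch sketch below records per-channel activations of a final convolutional layer on clean data and zeroes out the least-activated (dormant) channels. The layer choice, `prune_frac`, and all names are assumptions of this sketch; FP follows this step with fine-tuning on a small clean set.

```python
# Hedged sketch of dormant-neuron pruning in the spirit of Fine-Pruning [3].
# `model`, `last_conv`, and `clean_loader` are assumed placeholders.
import torch

@torch.no_grad()
def prune_dormant_channels(model, last_conv, clean_loader, prune_frac=0.2, device="cpu"):
    acts = []
    hook = last_conv.register_forward_hook(
        lambda m, i, o: acts.append(o.mean(dim=(0, 2, 3)).cpu())  # mean activation per channel
    )
    model.eval()
    for x, _ in clean_loader:
        model(x.to(device))
    hook.remove()
    mean_act = torch.stack(acts).mean(dim=0)
    # Channels least activated by clean data are the pruning candidates.
    n_prune = int(prune_frac * mean_act.numel())
    dormant = mean_act.argsort()[:n_prune]
    last_conv.weight.data[dormant] = 0.0  # zero out dormant output channels
    if last_conv.bias is not None:
        last_conv.bias.data[dormant] = 0.0
    return dormant  # fine-tune `model` on the small clean set afterwards
```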
Robust training methods have also been used to prevent the learning of semantically incorrect backdoor correlations during model training [30–32]. For example, Du et al. [30] used differentially private training [33] to prevent the learning of backdoor correlations. Using strong data augmentation methods such as MixUp [34] and MaxUp [35] can also benefit backdoor defense [31]. Self-supervised pre-training achieves promising results against backdoor attacks developed for supervised learning [32]. However, more recent works have successfully developed backdoor attacks tailored for self-supervised learning and contrastive learning [9, 10]. Adversarial training, which was originally proposed to improve model robustness against adversarial attacks [36, 37], has also been adapted to empirically improve robustness against backdoor attacks [38]. A recent work [39] further showed theoretically that backdoor filtering and adversarially robust generalization are nearly equivalent under certain assumptions.
Neural cleanse [40] set the starting point for a new line of research [41–43]. These methods first reverse-engineer the unknown backdoor trigger from the poisoned model, and then unlearn the backdoor using the synthesized trigger. A recent work [5] improved over the above methods by using a novel unlearning method that requires fewer assumptions about the backdoor trigger. As a result, the proposed method, named implicit backdoor adversarial unlearning (I-BAU), successfully defends against a wide range of backdoor attacks with different trigger patterns. In contrast, previous methods in [41–43] all have failure cases, as shown in [5].
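The trigger-inversion step in this line of work can be sketched as optimizing a small mask and pattern so that patched clean images are classified as a candidate target label, with an L1 penalty keeping the mask small. The PyTorch sketch below is a simplified, hedged approximation of Neural-Cleanse-style inversion [40]; the parameterization (sigmoid mask, tanh pattern) and all hyper-parameters are illustrative assumptions.

```python
# Hedged sketch of trigger inversion in the spirit of Neural cleanse [40].
import torch
import torch.nn.functional as F

def invert_trigger(model, clean_loader, target_label, img_shape=(3, 32, 32),
                   steps=100, lr=0.1, lam=1e-2, device="cpu"):
    mask = torch.zeros(1, *img_shape[1:], device=device, requires_grad=True)
    pattern = torch.zeros(img_shape, device=device, requires_grad=True)
    opt = torch.optim.Adam([mask, pattern], lr=lr)
    model.eval()
    for _ in range(steps):
        for x, _ in clean_loader:
            x = x.to(device)
            m = torch.sigmoid(mask)        # keep the mask in [0, 1]
            p = torch.tanh(pattern)        # keep the pattern bounded
            patched = (1 - m) * x + m * p  # stamp the candidate trigger
            y_t = torch.full((x.size(0),), target_label, device=device, dtype=torch.long)
            # Force the target label while penalizing large masks.
            loss = F.cross_entropy(model(patched), y_t) + lam * m.abs().sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return torch.sigmoid(mask).detach(), torch.tanh(pattern).detach()
```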
Recently, Li et al. [12] proposed a novel backdoor defense method named anti-backdoor learning (ABL), which largely outperforms previous methods. Specifically, ABL uses a novel local gradient ascent (LGA) loss to isolate backdoor examples from clean training samples: when trained with the LGA loss, backdoor training samples have statistically lower loss values than clean training samples. As a result, a small number of backdoor samples can be successfully isolated, which are further used to unlearn the backdoor correlations. One limitation of ABL, as discussed in the original paper, is that the loss value can be a noisy measure for distinguishing backdoor samples in certain cases. ABL also requires careful hyper-parameter tuning [12]. There is also another line of work focusing on detecting backdoor-infected models [44–48]. Its main goal is to predict whether a given model is infected by backdoor attacks, rather than to prevent the learning of backdoor correlations.
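The isolation step of ABL [12] can be sketched as follows: after a warm-up phase trained with the LGA loss, the training samples with the lowest per-sample losses are flagged as likely backdoor samples. The PyTorch sketch below assumes the commonly cited sign-based LGA formulation with a loss floor gamma; the threshold, the isolation ratio, and the loader conventions are illustrative assumptions.

```python
# Hedged sketch of ABL-style loss-based isolation [12].
import torch
import torch.nn.functional as F

def lga_loss(logits, targets, gamma=0.5):
    """LGA-style loss (assumed formulation): ascend when a sample's loss drops below gamma."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    return (torch.sign(per_sample - gamma) * per_sample).mean()

@torch.no_grad()
def isolate_low_loss_samples(model, train_loader, isolate_frac=0.01, device="cpu"):
    """Return positions (in loader order) of the lowest-loss, i.e., suspected backdoor, samples."""
    model.eval()
    losses = []
    for x, y in train_loader:  # assumes a non-shuffled loader
        losses.append(
            F.cross_entropy(model(x.to(device)), y.to(device), reduction="none").cpu()
        )
    losses = torch.cat(losses)
    n = int(isolate_frac * losses.numel())
    return losses.argsort()[:n]  # these samples are later used to unlearn the backdoor
```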
Our T&R is a brand-new backdoor defense strategy that does not fall into any of the above categories. Among all categories, T&R is most closely related to the pruning-based methods [3, 11, 29]: we share the same ultimate goal of removing infected neurons. However, unlike those methods, which spend much effort on locating the infected neurons, our method takes the initiative to set a trap in the model to bait and trap the backdoor. As a result, we do not need to locate the infected neurons, since we know exactly where the trap is set. All we need to do is replace the infected trap (i.e., a lightweight subnetwork) with an untainted one trained on a small clean dataset.
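Purely for illustration, the sketch below shows what such a replacement step could look like if the trap were exposed as a named child module. The module name `trap`, the choice to freeze the rest of the network, and all hyper-parameters are assumptions of this sketch, not the actual T&R design, which is specified in the main paper.

```python
# Strongly hedged sketch of a trap-replacement step (illustrative only).
import torch
import torch.nn.functional as F

def replace_trap(model, fresh_trap, clean_loader, epochs=5, lr=1e-3, device="cpu"):
    """Swap the (potentially infected) trap subnetwork for a fresh one and train it on clean data."""
    model.trap = fresh_trap.to(device)  # discard the baited, infected trap (hypothetical attribute)
    for p in model.parameters():
        p.requires_grad_(False)         # freeze the rest of the network (assumption of this sketch)
    for p in model.trap.parameters():
        p.requires_grad_(True)          # only the new trap is trained
    opt = torch.optim.SGD(model.trap.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for x, y in clean_loader:       # small clean dataset
            loss = F.cross_entropy(model(x.to(device)), y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```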
More related works are discussed in Appendix A.