Trap and Replace: Defending Backdoor Attacks by
Trapping Them into an Easy-to-Replace Subnetwork
Haotao Wang
University of Texas at Austin
htwang@utexas.edu
Junyuan Hong
Michigan State University
hongju12@msu.edu
Aston Zhang
Amazon Web Services
astonz@amazon.com
Jiayu Zhou
Michigan State University
jiayuz@msu.edu
Zhangyang Wang
University of Texas at Austin
atlaswang@utexas.edu
Abstract
Deep neural networks (DNNs) are vulnerable to backdoor attacks. Previous works
have shown that it is extremely challenging to unlearn the undesired backdoor behavior
from the network, since the entire network can be affected by the backdoor samples.
In this paper, we propose a brand-new backdoor defense strategy, which makes
it much easier to remove the harmful influence of backdoor samples from the
model. Our defense strategy, Trap and Replace, consists of two stages. In the first
stage, we bait and trap the backdoors in a small and easy-to-replace subnetwork.
Specifically, we add an auxiliary image reconstruction head on top of the stem
network shared with a light-weighted classification head. The intuition is that the
auxiliary image reconstruction task encourages the stem network to keep sufficient
low-level visual features that are hard to learn but semantically correct, instead of
overfitting to the easy-to-learn but semantically incorrect backdoor correlations.
As a result, when trained on backdoored datasets, the backdoors are easily baited
towards the unprotected classification head, since it is much more vulnerable than
the shared stem, leaving the stem network hardly poisoned. In the second stage, we
replace the poisoned light-weighted classification head with an untainted one, by
re-training it from scratch only on a small holdout dataset with clean samples, while
fixing the stem network. As a result, both the stem and the classification head in the
final network are hardly affected by backdoor training samples. We evaluate our
method against ten different backdoor attacks. Our method outperforms previous
state-of-the-art methods by up to 20.57%, 9.80%, and 13.72% in attack success
rate and on average 3.14%, 1.80%, and 1.21% in clean classification accuracy on
CIFAR10, GTSRB, and ImageNet-12, respectively. Code is available at
https://github.com/VITA-Group/Trap-and-Replace-Backdoor-Defense.
1 Introduction
Deep neural networks (DNNs) have been successfully used in many high-stakes applications such
as autonomous driving and speech recognition authorization. However, the data used to train those
systems are often collected from potentially insecure and unknown sources (e.g., crawled from the
Internet or directly collected from end-users) [1, 2]. Such an insecure data collection process opens
the door for backdoor attackers to upload and distribute harmful training samples that can secretly
inject malicious behaviors into the DNNs (e.g., recognizing a stop sign as a speed-limit sign). More
specifically, backdoor attacks add premeditated backdoor triggers (e.g., a tiny square pattern or an
invisible additive noise) to a small portion of training samples with the same target label. Such
[Figure 1 diagram. Stage 1 (bait and trap the backdoor into the classification head): the poisoned training set is fed to the stem network f_s(·; θ_s) (defended), whose features go to both the light-weighted classification head f_c(·; θ_c) (the trap), producing logits z, and the reconstruction head f_r(·; θ_r), producing the reconstructed image x̂. Stage 2 (replace the infected classification head): with the stem network f_s(·; θ_s) fixed, a re-initialized classification head f_c(·; θ_c) is trained on a small clean holdout set.]
Figure 1: Overview of our Trap and Replace strategy. Each block represents a subnetwork. The lock
icon indicates that the subnetwork is fixed, otherwise it is trainable by default. Green subnetworks
are defended or trained only using clean samples (and thus are hardly infected by backdoor samples).
The red one is the trap subnetwork used to bait and trap the backdoor attacks (and thus is infected).
backdoor triggers can mislead the network to learn an undesired strong correlation between the trigger
and the target label, which is termed the backdoor correlation. As a result, if the attacker adds the
trigger pattern to a test sample, it will be classified as the target label, regardless of its ground truth
class. In this way, the model’s behavior on test samples can be controlled by the attacker with the
added backdoor trigger.
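To make the threat model concrete, the following is a minimal sketch of a dirty-label patch attack in the spirit of [1, 6]; the function name poison_dataset, the poisoning rate, and the trigger (a small white corner patch) are illustrative assumptions rather than the setup of any specific attack evaluated in this paper.

```python
import numpy as np

def poison_dataset(images, labels, target_label, poison_rate=0.05, patch_size=3):
    """Illustrative dirty-label patch attack: stamp a small white square onto a
    random subset of training images and relabel them with the target class.
    `images`: float array of shape (N, H, W, C) with values in [0, 1]."""
    images, labels = images.copy(), labels.copy()
    num_poison = int(len(images) * poison_rate)
    idx = np.random.choice(len(images), num_poison, replace=False)
    images[idx, -patch_size:, -patch_size:, :] = 1.0   # the backdoor trigger
    labels[idx] = target_label                         # attacker-desired target label
    return images, labels
```

At test time, stamping the same patch onto any image is then expected to flip the model's prediction to the target label.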
Previous works have shown that such backdoor correlations are easy to learn but hard to forget (or
unlearn) by DNNs [3–5]. For example, Liu et al. [3] and Gu et al. [6] showed that simply fine-tuning
a backdoored model on a small portion of clean samples is hardly effective at forgetting the already
learned backdoor correlations. Li et al. [4] enhanced the above fine-tuning method with an extra
attention distillation step. However, the performance of the method in [4] is particularly sensitive
to the type of underlying attack and to the data augmentation techniques used [5]. Retraining a small portion
of the network from scratch is also not effective (to be shown in our experiments), since the entire
network may have been affected by the backdoor training samples. Retraining the entire or a large
portion of the network from scratch using a small number of clean samples may succeed in removing
the backdoor correlations, but it will significantly hurt the model performance since a huge amount
of data is required to train DNNs from scratch (note that data-efficient training, such as semi-supervised or self-supervised learning, is not a naive solution either, since such methods are themselves vulnerable to backdoor attacks [7–10]).
Some previous works tried to identify and prune
the neurons which are most heavily infected by backdoor training samples [3, 11]. However, the
identification results for such “infected neurons” are noisy and can empirically fail, as shown in [12, 5]
(and as our experiments will also show). In summary, the challenge arises from the high degree of freedom
in model training: the backdoor samples can potentially infect any neurons in the entire network.
With a limited amount of clean training samples, it is challenging for the defender to precisely locate
and fix all those infected neurons.
To address this challenge, we propose a novel backdoor defense strategy named Trap and Replace
(T&R), which makes it much easier to remove the learned backdoor correlations from the network.
In a nutshell, T&R first baits and traps the backdoors in a small and easy-to-replace subnetwork,
and then replaces the poisoned small subnetwork (i.e., the bait subnetwork) with an untainted one
re-trained from scratch using a small amount of clean data.
As illustrated in Figure 1, this strategy has two stages. In the first stage (i.e., the bait-and-trap stage),
we first divide the classification model into two subnetworks: a stem taking up most of the parameters
and a light-weighted classification head. We then add an auxiliary head on top of the shared stem
network to conduct an image reconstruction task. The entire model is trained end-to-end by jointly
optimizing the two tasks (i.e., image classification and reconstruction) using the poisoned training
set. With the auxiliary image reconstruction task, backdoors can be effectively baited towards and
trapped into the light-weighted classification head, while the shared stem is prevented from overfitting
to the backdoor features. The intuition is that the auxiliary image reconstruction task encourages the
stem network to keep sufficient low-level visual features that are hard-to-learn but semantically correct,
protecting the stem network from overfitting to the easy-to-learn but semantically incorrect backdoor
correlations. In contrast, we apply no defense mechanism on the light-weighted classification head,
leaving it more vulnerable than the stem network. As a result, when trained on backdoored datasets,
the backdoors are easily baited towards and trapped in the unprotected classification head, leaving the
stem network hardly poisoned.
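To make the Stage-1 design concrete, below is a minimal PyTorch sketch under assumptions: the class name TrapAndReplaceNet, the function stage1_loss, the stem layer sizes, and the reconstruction-loss weight rec_weight are illustrative placeholders rather than the exact settings of our experiments; the light-weighted head mirrors the two-convolution-plus-one-fully-connected design described in the next paragraph, and inputs are assumed to be normalized to [0, 1].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrapAndReplaceNet(nn.Module):
    """Minimal sketch of the Stage-1 architecture: a shared stem f_s, a
    light-weighted classification head f_c (the trap), and an auxiliary
    reconstruction head f_r. Layer sizes are illustrative placeholders."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(                       # f_s(.; theta_s), defended
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cls_head = nn.Sequential(                   # f_c(.; theta_c), the trap
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, num_classes),
        )
        self.rec_head = nn.Sequential(                   # f_r(.; theta_r)
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        feat = self.stem(x)                              # shared low-level features
        return self.cls_head(feat), self.rec_head(feat)

def stage1_loss(model, x, y, rec_weight=1.0):
    """Joint Stage-1 objective on the (possibly poisoned) training set:
    classification loss plus a weighted image reconstruction loss."""
    logits, x_rec = model(x)
    return F.cross_entropy(logits, y) + rec_weight * F.mse_loss(x_rec, x)
```

In this stage every parameter (stem and both heads) is trained jointly on the poisoned training set; no part of the model is frozen yet.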
In the second stage, we replace the poisoned light-weighted classification head with an untainted
one, as illustrated in Figure 1. Specifically, we re-initialize the classification head to random values,
and then train it from scratch on a small holdout dataset with clean samples. It is feasible to re-train
the classification head from scratch using only a small number of clean samples, because it is light-
weighted (e.g., only two convolutional layers and one fully connected layer in our experiments) and the
shared stem obtained in stage 1 can already extract high-quality deep features. As a result, both the
stem and the classification head in the final network are hardly affected by backdoor training samples.
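Continuing the sketch above, Stage 2 can be summarized as follows; the helper names stage2_replace_head and reset_parameters, the optimizer, and the epochs/lr values are placeholder assumptions, not the settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def reset_parameters(module):
    """Re-initialize every sub-layer that defines reset_parameters()."""
    for layer in module.modules():
        if hasattr(layer, "reset_parameters"):
            layer.reset_parameters()

def stage2_replace_head(model, clean_holdout_loader, epochs=20, lr=0.01):
    """Stage 2: re-initialize the trapped classification head and retrain it
    from scratch on a small clean holdout set, keeping the stem frozen."""
    reset_parameters(model.cls_head)                  # replace the poisoned trap
    for p in model.stem.parameters():
        p.requires_grad = False                       # the defended stem stays fixed
    optimizer = torch.optim.SGD(model.cls_head.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for x, y in clean_holdout_loader:
            logits, _ = model(x)                      # reconstruction head is unused here
            loss = F.cross_entropy(logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

Only the small classification head is optimized here, which is why a small clean holdout set suffices.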
We evaluate the effectiveness of Trap and Replace on three image classification datasets and against
ten different backdoor attacks. Experimental results show Trap and Replace outperforms previous
state-of-the-art methods by up to 20.57%, 9.80%, and 13.72% in attack success rate (ASR) and on
average 3.14%, 1.80%, and 1.21% in clean classification accuracy (CA) on CIFAR10, GTSRB, and
ImageNet-12, respectively. We further show our method is robust to potential adaptive attacks, where
the attacker is aware of the applied defense strategy and able to take countermoves.
2 Related work
2.1 Backdoor attack methods
Liu et al. [1] first successfully conducted backdoor/trojan attacks on DNNs, by adding pre-defined
backdoor triggers (e.g., a square patch with a fixed pattern) onto the images and modifying the
corresponding labels to the attacker-desired target label. Gu et al. [6] later showed that backdoor
attacks can be successfully preserved during transfer learning. In other words, backdoors injected
in pre-trained models can be transferred to downstream tasks. Yao et al. [13] proposed the latent
backdoor attack (LBA), which targets the hidden layers instead of the output layer, so that the attack
can better survive downstream transfer learning. Chen et al. [14] showed that backdoor images can
also be generated by blending the backdoor trigger image with the clean images. The early backdoor
attacks have two limitations that make them potentially unable to survive careful human inspection. The
first is that the triggers are usually small but still visible to humans. The second limitation is that
traditional backdoor attacks are dirty-label backdoor attacks: they change the labels of backdoor
training samples to the attacker-desired target label, leading to inconsistency between the content
of the sample and its label.
To overcome the first limitation, Zhong et al. [15], Li et al. [16], and Li et al. [17] proposed invisible
backdoor attacks which add small and invisible perturbations as backdoor triggers to clean images.
Nguyen and Tran [18] used imperceptible warping-based triggers to bypass human inspection. More
recently, Zeng et al. [19] proposed to generate smooth backdoor triggers using frequency information
to prevent the severe high-frequency artifacts of previous attack methods.
To overcome the second limitation, clean-label backdoor attacks (i.e., attacks that do not require
modifying the labels of backdoor training samples) have been proposed [20–22]. For example, Barni
et al. [21] added ramp signals as the backdoor trigger to the images of the target class, without
modifying the labels. The intuition of such an attack is that the model tends to overfit the easy-to-learn
ramp signals instead of the semantically meaningful but hard-to-learn object visual features. Thus,
a strong backdoor correlation between the ramp signal and the target class will be learned by the
model. With a similar underlying intuition, the label-consistent backdoor attack (LCBA) [20] adds
both adversarial noises and simple patch patterns onto the backdoor training images. This makes
the semantically correct visual features in backdoor training images hard to learn, since they are
perturbed by adversarial noises. As a result, the model will focus on overfitting the easy-to-learn
backdoor correlation (i.e., the strong correlation between the backdoor pattern and the ground-truth
label) from the backdoor training samples. Zeng et al. [22] proposed a more practical clean-label
backdoor attack which requires less prior information about the training set. Hidden trigger attacks [23]
enjoy the benefits of both worlds, being clean-label attacks with invisible triggers.
2.2 Backdoor defense methods
Backdoor defense methods aim to obtain clean models without backdoors when trained on potentially
poisoned data. As one of the earliest works on defending backdoor attacks, Tran et al. [24] proposed
an outlier removal method based on the spectral signature to distinguish and remove backdoor samples
from clean samples. Hayase et al. [25] improved upon [24] by using a more robust outlier removal
method. Other outlier detection methods such as activation clustering [26], prediction consistency
[27], and influence function [28] have also been used to detect backdoor samples.
In [3], a concurrent work with [24], the authors proposed another of the earliest backdoor defense methods,
named Fine-Pruning (FP). Motivated by the empirical finding that clean and backdoor samples tend
to activate different neurons, FP first prunes the neurons that are dormant on clean samples (see the
sketch below) and then fine-tunes the pruned network on a small number of clean samples. Later works
in this line make improvements on locating the backdoor neurons (i.e., the neurons that are activated by
backdoor samples but not by clean samples) [11, 29]. Adversarial neuron pruning (ANP) [11] prunes the
sensitive neurons under adversarial perturbations to improve backdoor robustness. A very recent
work [29] used the Shapley value as a measure to prune backdoor neurons.
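As a rough illustration of the FP pruning step (the subsequent step is ordinary fine-tuning on the clean samples), the sketch below scores the channels of a network's last feature map by their mean activation on clean data and masks out the most dormant ones; the function name prune_dormant_channels, the prune_ratio value, and the exact layer/scoring choices are simplifying assumptions rather than the precise procedure of [3].

```python
import torch

@torch.no_grad()
def prune_dormant_channels(feature_extractor, clean_loader, prune_ratio=0.2):
    """Illustrative FP-style pruning: rank the channels of the last feature map by
    their mean activation on clean data and mask out the most dormant ones.
    Returns a {0, 1} channel mask to apply to the feature map before the classifier."""
    act_sum, num_batches = None, 0
    for x, _ in clean_loader:
        feat = feature_extractor(x)                    # shape (B, C, H, W)
        per_channel = feat.mean(dim=(0, 2, 3))         # mean activation per channel
        act_sum = per_channel if act_sum is None else act_sum + per_channel
        num_batches += 1
    mean_act = act_sum / num_batches
    num_prune = int(prune_ratio * mean_act.numel())
    mask = torch.ones_like(mean_act)
    mask[mean_act.argsort()[:num_prune]] = 0.0         # zero out lowest-activation channels
    return mask.view(1, -1, 1, 1)                      # broadcastable over (B, C, H, W)
```

The returned mask would then be multiplied onto the feature map, and the masked network fine-tuned on the clean samples.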
Robust training methods have also been used to prevent the learning of semantically incorrect
backdoor correlations during model training [30–32]. For example, Du et al. [30] used differentially
private training [33] to prevent the learning of backdoor correlations. Using strong data augmentation
methods such as MixUp [34] and MaxUp [35] can also benefit backdoor defense [31]. Self-supervised
pre-training achieves promising results against backdoor attacks developed for supervised learning [32].
However, more recent works have successfully developed backdoor attacks tailored for self-supervised
learning and contrastive learning [9, 10]. Adversarial training, which was originally proposed
to improve model robustness against adversarial attacks [36, 37], has also been adapted to empirically
improve robustness against backdoor attacks [38]. A recent work [39] further theoretically showed
that backdoor filtering and adversarially robust generalization are nearly equivalent under certain assumptions.
Neural Cleanse [40] set up the starting point for a new line of research [41–43]. This line of methods
first reverse-engineers the unknown backdoor trigger from the poisoned model, and then unlearns
the backdoor using the synthesized trigger. A recent work [5] made improvements over the above
methods by using a novel unlearning method that requires fewer assumptions about the backdoor
trigger. As a result, the proposed method, named implicit backdoor adversarial unlearning (I-BAU),
successfully defends against a wide range of backdoor attacks with different trigger patterns. In contrast,
previous methods in [41–43] all have failure cases, as shown in [5].
Recently, Li et al. [12] proposed a novel backdoor defense method named anti-backdoor learning
(ABL), which largely outperforms previous methods. Specifically, ABL uses a novel local gradient
ascent (LGA) loss (sketched below) to isolate backdoor examples from clean training samples: using the
LGA loss, backdoor training samples will have statistically lower loss values than clean training samples.
As a result, a small number of backdoor samples can be successfully isolated, which are further used
to unlearn the backdoor correlations. One limitation of ABL, as discussed in the original paper, is
that the loss value can be a noisy measure for distinguishing backdoor samples in certain cases. ABL
also requires careful hyper-parameter tuning [12]. There is also another line of work focusing on
detecting backdoor-infected models [44–48]. Their main goal is to predict whether a given model is
infected by backdoor attacks, instead of preventing the learning of backdoor correlations.
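For concreteness, the following is a rough sketch of ABL's LGA loss as we understand it from [12]; the function name lga_loss and the default threshold gamma are assumptions, and the snippet only approximates the isolation stage of the published method.

```python
import torch
import torch.nn.functional as F

def lga_loss(logits, targets, gamma=0.5):
    """Approximate local gradient ascent (LGA) loss: once a sample's cross-entropy
    drops below the threshold gamma, its sign flips, so gradient descent pushes that
    sample's loss back up toward gamma. Samples that are fit fastest (likely
    backdoored) end up with conspicuously low losses and can be isolated by ranking."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample loss
    return (torch.sign(ce - gamma) * ce).mean()
```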
Our T&R is a brand-new backdoor defense strategy that does not fall into any of the above categories.
Among all categories, T&R is most closely related to the pruning-based methods [3, 11, 29]: we share
the same ultimate goal of removing infected neurons. However, unlike those methods, which spend
much effort in locating the infected neurons, our method takes the initiative to set a trap in the model
to bait and trap the backdoor. As a result, we do not need to locate the infected neurons, since we
know exactly where the trap is set. The only thing we need to do is replace the infected trap (i.e., a
light-weighted subnetwork) with an untainted one trained on a small clean dataset.
More related works are discussed in Appendix A.