label) from the backdoor training samples. Zeng et al. [22] proposed a more practical clean-label backdoor attack that requires less prior information about the training set. Hidden trigger attacks [23] combine the benefits of both: they are clean-label attacks with invisible triggers.
2.2 Backdoor defense methods
Backdoor defense methods aim to obtain clean models without backdoors when trained on potentially
poisoned data. As one of the earliest works on defending against backdoor attacks, Tran et al. [24] proposed an outlier removal method based on the spectral signature to distinguish and remove backdoor samples from clean samples. Hayase et al. [25] improved upon [24] by using a more robust outlier removal method. Other outlier detection methods such as activation clustering [26], prediction consistency [27], and influence functions [28] have also been used to detect backdoor samples.
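The spectral-signature idea of [24] can be summarized, per class, as scoring each sample by its projection onto the top singular direction of the centered feature matrix and removing the highest-scoring samples. The following NumPy sketch illustrates this filtering step; the `features` array, the `remove_frac` ratio, and the function name are illustrative assumptions rather than the authors' exact procedure.

```python
# Minimal sketch of spectral-signature-style outlier removal [24], assuming
# `features` is an (n, d) array of penultimate-layer representations for the
# training samples of a single class; `remove_frac` is a hypothetical knob.
import numpy as np

def spectral_filter(features: np.ndarray, remove_frac: float = 0.05) -> np.ndarray:
    """Return indices of samples kept after removing the top spectral outliers."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Top right singular vector of the centered feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_dir = vt[0]
    # Outlier score: squared projection onto the top singular direction.
    scores = (centered @ top_dir) ** 2
    n_remove = int(remove_frac * len(features))
    keep = np.argsort(scores)[: len(features) - n_remove]
    return keep
```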
In [3], a concurrent work with [24], the authors proposed another of the earliest backdoor defense methods, named Fine-Pruning (FP). Motivated by the empirical finding that clean and backdoor samples tend to activate different neurons, FP first prunes the neurons that are dormant on clean samples and then fine-tunes the pruned network on a small number of clean samples. Later works in this line improve the localization of backdoor neurons (i.e., the neurons that are activated by backdoor samples but not by clean samples) [11, 29]. Adversarial neuron perturbations (ANP) [11] prunes the neurons that are sensitive under adversarial perturbations to improve backdoor robustness. A very recent work [29] used Shapley values as a measure to prune backdoor neurons.
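As a rough illustration of the pruning step in FP [3], the PyTorch sketch below records per-channel activations of a final convolutional layer on clean data and zeroes out the least-activated (dormant) channels. The layer choice, `prune_frac`, and all names are assumptions of this sketch; FP follows this step with fine-tuning on a small clean set.

```python
# Hedged sketch of dormant-neuron pruning in the spirit of Fine-Pruning [3].
# `model`, `last_conv`, and `clean_loader` are assumed placeholders.
import torch

@torch.no_grad()
def prune_dormant_channels(model, last_conv, clean_loader, prune_frac=0.2, device="cpu"):
    acts = []
    hook = last_conv.register_forward_hook(
        lambda m, i, o: acts.append(o.mean(dim=(0, 2, 3)).cpu())  # mean activation per channel
    )
    model.eval()
    for x, _ in clean_loader:
        model(x.to(device))
    hook.remove()
    mean_act = torch.stack(acts).mean(dim=0)
    # Channels least activated by clean data are the pruning candidates.
    n_prune = int(prune_frac * mean_act.numel())
    dormant = mean_act.argsort()[:n_prune]
    last_conv.weight.data[dormant] = 0.0  # zero out dormant output channels
    if last_conv.bias is not None:
        last_conv.bias.data[dormant] = 0.0
    return dormant  # fine-tune `model` on the small clean set afterwards
```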
Robust training methods have also been used to prevent the learning of semantically incorrect backdoor correlations during model training [30–32]. For example, Du et al. [30] used differentially private training [33] to prevent the learning of backdoor correlations. Using strong data augmentation methods such as MixUp [34] and MaxUp [35] can also benefit backdoor defense [31]. Self-supervised pre-training achieves promising results against backdoor attacks developed for supervised learning [32]. However, more recent works have successfully developed backdoor attacks tailored for self-supervised learning and contrastive learning [9, 10]. Adversarial training, which was originally proposed to improve model robustness against adversarial attacks [36, 37], has also been adapted to empirically improve robustness against backdoor attacks [38]. A recent work [39] further showed theoretically that backdoor filtering and adversarially robust generalization are nearly equivalent under certain assumptions.
Neural cleanse [40] set the starting point for a new line of research [41–43]. These methods first reverse-engineer the unknown backdoor trigger from the poisoned model, and then unlearn the backdoor using the synthesized trigger. A recent work [5] improved over the above methods by using a novel unlearning method that requires fewer assumptions about the backdoor trigger. As a result, the proposed method, named implicit backdoor adversarial unlearning (I-BAU), successfully defends against a wide range of backdoor attacks with different trigger patterns. In contrast, previous methods in [41–43] all have failure cases, as shown in [5].
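The trigger-inversion step in this line of work can be sketched as optimizing a small mask and pattern so that patched clean images are classified as a candidate target label, with an L1 penalty keeping the mask small. The PyTorch sketch below is a simplified, hedged approximation of Neural-Cleanse-style inversion [40]; the parameterization (sigmoid mask, tanh pattern) and all hyper-parameters are illustrative assumptions.

```python
# Hedged sketch of trigger inversion in the spirit of Neural cleanse [40].
import torch
import torch.nn.functional as F

def invert_trigger(model, clean_loader, target_label, img_shape=(3, 32, 32),
                   steps=100, lr=0.1, lam=1e-2, device="cpu"):
    mask = torch.zeros(1, *img_shape[1:], device=device, requires_grad=True)
    pattern = torch.zeros(img_shape, device=device, requires_grad=True)
    opt = torch.optim.Adam([mask, pattern], lr=lr)
    model.eval()
    for _ in range(steps):
        for x, _ in clean_loader:
            x = x.to(device)
            m = torch.sigmoid(mask)        # keep the mask in [0, 1]
            p = torch.tanh(pattern)        # keep the pattern bounded
            patched = (1 - m) * x + m * p  # stamp the candidate trigger
            y_t = torch.full((x.size(0),), target_label, device=device, dtype=torch.long)
            # Force the target label while penalizing large masks.
            loss = F.cross_entropy(model(patched), y_t) + lam * m.abs().sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return torch.sigmoid(mask).detach(), torch.tanh(pattern).detach()
```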
Recently, Li et al. [12] proposed a novel backdoor defense method named anti-backdoor learning (ABL), which largely outperforms previous methods. Specifically, ABL uses a novel local gradient ascent (LGA) loss to isolate backdoor examples from clean training samples: when trained with the LGA loss, backdoor training samples have statistically lower loss values than clean training samples. As a result, a small number of backdoor samples can be successfully isolated, which are further used to unlearn the backdoor correlations. One limitation of ABL, as discussed in the original paper, is that the loss value can be a noisy measure for distinguishing backdoor samples in certain cases. ABL also requires careful hyper-parameter tuning [12]. There is also another line of work focusing on detecting backdoor-infected models [44–48]. Its main goal is to predict whether a given model is infected by backdoor attacks, rather than to prevent the learning of backdoor correlations.
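The isolation step of ABL [12] can be sketched as follows: after a warm-up phase trained with the LGA loss, the training samples with the lowest per-sample losses are flagged as likely backdoor samples. The PyTorch sketch below assumes the commonly cited sign-based LGA formulation with a loss floor gamma; the threshold, the isolation ratio, and the loader conventions are illustrative assumptions.

```python
# Hedged sketch of ABL-style loss-based isolation [12].
import torch
import torch.nn.functional as F

def lga_loss(logits, targets, gamma=0.5):
    """LGA-style loss (assumed formulation): ascend when a sample's loss drops below gamma."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    return (torch.sign(per_sample - gamma) * per_sample).mean()

@torch.no_grad()
def isolate_low_loss_samples(model, train_loader, isolate_frac=0.01, device="cpu"):
    """Return positions (in loader order) of the lowest-loss, i.e., suspected backdoor, samples."""
    model.eval()
    losses = []
    for x, y in train_loader:  # assumes a non-shuffled loader
        losses.append(
            F.cross_entropy(model(x.to(device)), y.to(device), reduction="none").cpu()
        )
    losses = torch.cat(losses)
    n = int(isolate_frac * losses.numel())
    return losses.argsort()[:n]  # these samples are later used to unlearn the backdoor
```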
Our T&R is a brand-new backdoor defense strategy that does not fall into any of the above categories. Among all categories, T&R is most closely related to the pruning-based methods [3, 11, 29]: we share the same ultimate goal of removing infected neurons. However, unlike those methods, which spend much effort on locating the infected neurons, our method takes the initiative to set a trap in the model to bait and trap the backdoor. As a result, we do not need to locate the infected neurons, since we know exactly where the trap is set. All we need to do is replace the infected trap (i.e., a lightweight subnetwork) with an untainted one trained on a small clean dataset.
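Purely for illustration, the sketch below shows what such a replacement step could look like if the trap were exposed as a named child module. The module name `trap`, the choice to freeze the rest of the network, and all hyper-parameters are assumptions of this sketch, not the actual T&R design, which is specified in the main paper.

```python
# Strongly hedged sketch of a trap-replacement step (illustrative only).
import torch
import torch.nn.functional as F

def replace_trap(model, fresh_trap, clean_loader, epochs=5, lr=1e-3, device="cpu"):
    """Swap the (potentially infected) trap subnetwork for a fresh one and train it on clean data."""
    model.trap = fresh_trap.to(device)  # discard the baited, infected trap (hypothetical attribute)
    for p in model.parameters():
        p.requires_grad_(False)         # freeze the rest of the network (assumption of this sketch)
    for p in model.trap.parameters():
        p.requires_grad_(True)          # only the new trap is trained
    opt = torch.optim.SGD(model.trap.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for x, y in clean_loader:       # small clean dataset
            loss = F.cross_entropy(model(x.to(device)), y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```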
More related works are discussed in Appendix A.