
Our contributions can be summarized as follows:
• We identify an overlooked problem: the accessibility of a clean base set in the presence of data poisoning.
• We systematically evaluate the performance of existing automated methods and of human inspection in distinguishing between poisoned and clean samples.
• We propose a novel splitting-based idea to sift out a clean subset from a poisoned dataset and formalize it as a bilevel optimization problem.
• We propose META-SIFT, comprising an efficient algorithm to solve the bilevel problem as well as a series of techniques to enhance sifting precision.
• We extensively evaluate META-SIFT and compare it with existing automated methods on four benchmark datasets under twelve different data poisoning attack settings. Our method significantly outperforms existing methods in both sifting precision and efficiency. At the same time, plugging our sifted samples into existing defenses achieves comparable or even better performance than plugging in randomly selected clean samples.
• We open-source the project to promote research on this topic and facilitate the successful application of existing defenses in settings without a clean base set.¹
2 Sifting Out a Clean Enough Base Set is Hard
The ability to acquire a clean base set was taken for granted in many existing data poisoning defenses [13, 14, 16–19].
For instance, a popular Trojan-Net Detection strategy is to
first synthesize potential trigger patterns from a target model
and then inspect whether there exists any suspicious pattern
[13,16]. Trigger synthesis is done by searching for a pattern
that maximally activates a certain class output when it is
patched onto the clean data. Hence, access to a clean set of
data is indispensable to this defense strategy. Another example
is defenses against Label-Flipping attacks (often referred to
as mislabeled data detection in ML literature). State-of-the-
art methods detect mislabeled data by finding a subset of
instances such that when they are excluded from training, the
prediction accuracy on a clean validation set is maximized. A
clean set of instances is needed to enable these methods.
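To make the trigger-synthesis step concrete, the sketch below shows one common way such a search can be implemented in PyTorch: optimize a small mask-and-pattern pair so that clean base-set inputs are pushed toward a candidate target class. This is a simplified illustration rather than the exact procedure of [13, 16]; `model`, `clean_loader`, `target_class`, and the penalty weight `lam` are assumed names.

```python
import torch
import torch.nn.functional as F

def synthesize_trigger(model, clean_loader, target_class,
                       epochs=10, lam=1e-2, lr=0.1, device="cpu"):
    """Search for a small (mask, pattern) trigger that pushes clean
    base-set inputs toward `target_class`; the L1 penalty on the mask
    keeps the synthesized trigger small. Illustrative sketch only."""
    # Infer the input shape from one clean batch (assumes (images, labels)).
    x0, _ = next(iter(clean_loader))
    _, c, h, w = x0.shape
    mask = torch.zeros(1, 1, h, w, device=device, requires_grad=True)
    pattern = torch.zeros(1, c, h, w, device=device, requires_grad=True)
    opt = torch.optim.Adam([mask, pattern], lr=lr)

    model.eval()
    for _ in range(epochs):
        for x, _ in clean_loader:
            x = x.to(device)
            m = torch.sigmoid(mask)              # constrain mask to [0, 1]
            p = torch.tanh(pattern) * 0.5 + 0.5  # constrain pattern to [0, 1]
            x_patched = (1 - m) * x + m * p      # patch the candidate trigger
            y_target = torch.full((x.size(0),), target_class,
                                  dtype=torch.long, device=device)
            loss = (F.cross_entropy(model(x_patched), y_target)
                    + lam * m.abs().sum())       # misclassification + size penalty
            opt.zero_grad()
            loss.backward()
            opt.step()
    return torch.sigmoid(mask).detach(), (torch.tanh(pattern) * 0.5 + 0.5).detach()
```

A Trojan-Net detector would repeat this search for every candidate class and flag the model if some class admits an anomalously small trigger, which is why poisoned samples hidden in the base set can distort the synthesized patterns.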
2.1 Defense Requires a Highly Pure Base Set
Table 1 summarizes some representative techniques that
rely on access to a clean base set in each of the aforementioned
defense categories, namely, Poison Detection, Trojan-Net De-
tection, Backdoor Removal, and Robust Training against label
noise. These techniques either achieve the state-of-the-art per-
formance (e.g., Frequency Detector [11], I-BAU [14], MW-
Net [19]) or are widely-adopted baselines (e.g., MNTD [12]
and Neural Cleanse (NC) [13]). In particular, MNTD is im-
plemented as a base strategy in an ongoing competition for
Trojan-Net Detection.²
¹ https://github.com/ruoxi-jia-group/Meta-Sift
² https://trojandetection.ai/
Conventionally, these defense techniques report their performance only with a completely clean base set. However, given the fast-advancing research on stealthy attacks, some poisoned samples may go unnoticed and be selected into the base set by mistake. Hence, it is critical to
evaluate how the performance of these defenses depends on
the ratio of the poisoned samples in the base set.
We adopt widely used metrics to measure defense perfor-
mance for each defense category. Specifically, for Poison De-
tection, we use the Poison Filtering Rate (PFR), which measures
the ratio of poisoned samples that are correctly detected. For
Trojan-Net Detection, we follow the original work of MNTD
and use the Area Under the ROC Curve (AUC) as a metric,
which measures the entire two-dimensional area underneath
the ROC curve.³ The most naive baseline for Poison Detection and Trojan-Net Detection is random deletion, which ends
up with a PFR of 50% and an AUC of 50%. The closer the
performance of the defense in the Poison Detection and the
Trojan-Net Detection category gets to 50%, the weaker the
defense is. For Backdoor Removal, we use the Attack Suc-
cess Rate (ASR), which calculates the frequency with which
non-target-class samples patched with the backdoor trigger
are misclassified into the attacker-desired target class. For
Robust Training, we use the Test Accuracy (ACC), which
measures the accuracy of the trained model on a clean test set.
The baselines for Backdoor Removal and Robust Training are
simply the deployment of no defenses at all. We report ASR or
ACC that is obtained directly from training on the poisoned
dataset. The closer the performance of a defense in these two
categories gets to these baselines, the weaker the defense is.
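For reference, the sketch below shows how two of these metrics can be computed; the function names, the detector output `is_flagged`, and the patching routine `trigger_fn` are illustrative assumptions rather than any cited implementation.

```python
import torch

def poison_filtering_rate(is_poisoned, is_flagged):
    """PFR: fraction of truly poisoned samples that the detector flags."""
    is_poisoned = torch.as_tensor(is_poisoned, dtype=torch.bool)
    is_flagged = torch.as_tensor(is_flagged, dtype=torch.bool)
    return is_flagged[is_poisoned].float().mean().item()

@torch.no_grad()
def attack_success_rate(model, non_target_images, target_class, trigger_fn):
    """ASR: fraction of non-target-class inputs that, once patched with the
    backdoor trigger, are classified as the attacker-chosen target class."""
    preds = model(trigger_fn(non_target_images)).argmax(dim=1)
    return (preds == target_class).float().mean().item()
```

AUC and ACC follow their standard definitions (e.g., an off-the-shelf ROC-AUC routine and plain test-set accuracy).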
We compare the resulting defense performance against
standard attacks (e.g., BadNets [8], Random Label-Flipping)
between clean and corrupted base sets (Table 1). For Poison Detection with the Frequency Detector, even one poisoned example sneaking into the base set is sufficient to nullify the
defensive effect, leading to a performance worse than the ran-
dom baseline. For MNTD, with 1% of poisoned examples
mixed into the base set, the AUC drops by almost 40%. Com-
paring the two techniques for Backdoor Removal, we can find
that I-BAU is more sensitive to corruption of the base set than
NC. Both techniques patch a trigger onto a portion of the samples in the base set to fine-tune the poisoned model, aiming to force the
model to “forget” the wrong association between the trigger
and the target label. Compared to NC, the design of I-BAU se-
lects fewer samples in the base set to be patched with a trigger.
Hence, the positive “forgetting” effect introduced by these
samples is more likely to be overwhelmed by the negative
effect caused by poisoned examples sneaking into the base set.
This explains the larger sensitivity of I-BAU to corruption of
the base set. For both techniques, less than 3% of corruption
in the base set is adequate to bring the ASR back above 60%.
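To make this shared “forgetting” mechanism concrete, the sketch below fine-tunes a backdoored model on base-set samples, a fraction of which are patched with the synthesized trigger while keeping their true labels. It illustrates the common idea rather than the exact NC or I-BAU procedure; `base_loader`, `trigger_fn`, and `patch_fraction` are assumed names.

```python
import torch
import torch.nn.functional as F

def unlearn_trigger(model, base_loader, trigger_fn, patch_fraction=0.2,
                    epochs=5, lr=1e-4, device="cpu"):
    """Fine-tune a backdoored model on base-set samples, a fraction of which
    are patched with the synthesized trigger but keep their TRUE labels,
    so that the model unlearns the trigger -> target-class shortcut."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for x, y in base_loader:
            x, y = x.to(device), y.to(device)
            n_patch = int(patch_fraction * x.size(0))
            if n_patch > 0:
                x = x.clone()
                # Trigger goes on, labels stay correct: the model is pushed
                # to ignore the trigger. Poisoned samples hidden in the base
                # set instead carry the target label and fight this effect.
                x[:n_patch] = trigger_fn(x[:n_patch])
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

In this view, poisoned samples that sneak into the base set enter the same fine-tuning loop with the attacker's target label and directly counteract the forgetting effect.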
For Robust Training with MW-Net, 20 mislabeled samples in
³ An ROC curve plots the true positive rate vs. the false positive rate at different classification thresholds.