
Fig. 1: Overview of the data cleansing method.
poison a dataset by: i) choosing an attack target class $t$ and obtaining a subset (of size $M$) of images from classes other than $t$: $\mathcal{D}_s = \{(x_j, y_j)\,|\,j = 1, \dots, M,\ y_j \neq t\}$, $\mathcal{D}_s \subset \mathcal{D}$, with $M \ll N$; ii) incorporating the backdoor pattern into each sample in $\mathcal{D}_s$ using the attacker's backdoor embedding function $g: \mathbb{R}^{X \times H \times W} \rightarrow \mathbb{R}^{X \times H \times W}$; iii) changing the label of each poisoned sample to the target class: $\mathcal{D}_p = \{(g(x), t)\,|\,x \in \mathcal{D}_s\}$; iv) finally, forming the poisoned dataset by putting the attacked images back into the training set: $\bar{\mathcal{D}} = (\mathcal{D} \setminus \mathcal{D}_s) \cup \mathcal{D}_p$. If the attack is successful, the victim model $f: \mathbb{R}^{X \times H \times W} \rightarrow \{1, 2, \dots, C\}$, when trained on the poisoned dataset, will have normal (good) classification accuracy on clean (backdoor-free) test samples, but will classify most backdoor-triggered test samples to the target class of the attack. In the image domain, backdoor patterns could, e.g., be: i) a small patch that replaces the original pixels of an image [1, 2, 3]; ii) a perturbation added to some pixels of an image [4, 5, 6]; or iii) a "blended" patch attack [4].
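To make steps i)-iv) concrete, the following is a minimal numpy sketch of the poisoning procedure, assuming an additive, perturbation-style backdoor pattern (type ii above); the array names, trigger, and sizes are illustrative placeholders rather than an attack configuration from the literature.

```python
import numpy as np

def poison_dataset(X, y, target_class, M, delta, seed=None):
    """Poison (X, y) following steps i)-iv): pick M non-target images,
    embed an additive backdoor pattern `delta`, and relabel them to
    `target_class`. Returns the poisoned dataset and the poisoned indices."""
    rng = np.random.default_rng(seed)

    # i) choose M samples whose labels differ from the target class
    candidates = np.flatnonzero(y != target_class)
    poison_idx = rng.choice(candidates, size=M, replace=False)

    Xp, yp = X.copy(), y.copy()
    # ii) embed the backdoor pattern, here g(x) = clip(x + delta)
    Xp[poison_idx] = np.clip(Xp[poison_idx] + delta, 0.0, 1.0)
    # iii) relabel the poisoned samples as the target class
    yp[poison_idx] = target_class
    # iv) the poisoned training set is the union of the untouched clean
    #     samples and the modified ones (already in place in Xp, yp)
    return Xp, yp, poison_idx

# Illustrative usage on random "images" in [0, 1]^{3x32x32}
X = np.random.rand(1000, 3, 32, 32).astype(np.float32)
y = np.random.randint(0, 10, size=1000)
delta = np.zeros((3, 32, 32), dtype=np.float32)
delta[:, :3, :3] = 0.5          # small perturbation in one image corner
Xp, yp, idx = poison_dataset(X, y, target_class=0, M=50, delta=delta, seed=0)
```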
On the other hand, the defender aims to obtain a classifier that has good classification accuracy on clean test samples and that also correctly classifies test samples containing the backdoor pattern. Backdoor defenses deployed post-training aim to detect whether a DNN model is a backdoor victim [2, 7, 8, 9, 6, 10] and, further, to mitigate the attack if a detection is declared [2, 11, 12]. Most post-training defenses require a relatively small clean dataset (drawn from the same distribution as the clean training data), and their performance is generally sensitive to the number of available clean samples [12, 2, 7, 10].
In this paper, alternatively, we aim to cleanse the training set prior to deep learning. Related work on training set cleansing includes [13, 14, 15, 16]. All of these methods rely on embedded feature representations of a classifier fully trained on the possibly poisoned training set ([14] suggests that an auto-encoder could be used instead). [14, 13] use a 2-component clustering approach to separate backdoor-poisoned samples from clean samples ([14] uses a singular-value decomposition while [13] uses simple 2-means clustering), whereas [15] uses a Gaussian mixture model whose number of components is chosen based on BIC [17]. Instead of clustering, [16] employs a reverse-engineered backdoor pattern estimated using a small clean dataset. DBD [18] builds a classifier on top of an encoder learned via a self-supervised contrastive loss and then fine-tunes it iteratively: in each iteration, samples identified by the classifier as "low-credible" have their labels removed, and the classifier is updated on the processed dataset in a semi-supervised manner.
3. METHODOLOGY
3.1. Vulnerability of supervised training
We now illustrate the vulnerability of supervised training
by analysis of a simple linear model trained on a poisoned
dataset, considering the case where all classes other than the
target are (poisoned) source classes. The victim classifier
forms a linear discriminant function for each class $s$, i.e., the inner product $f_s(x) = x \cdot w_s$, where $w_s \in \mathbb{R}^{X \times H \times W}$ is the vector of model weights corresponding to class $s$. Assume that, after supervised training, each training sample is classified correctly with confidence at least $\tau > 0$, as measured by the margin:
$$f_{y_i}(x_i) - \max_{c \neq y_i} f_c(x_i) \ge \tau, \quad \forall (x_i, y_i) \in \bar{\mathcal{D}}. \qquad (1)$$
Assuming that the backdoor pattern $\Delta x$ is additively incorporated, for an attack sample based on a clean $x_s$ originally from source class $s \neq t$, Eq. (1) implies
$$w_t \cdot (x_s + \Delta x) - w_s \cdot (x_s + \Delta x) \ge \tau. \qquad (2)$$
If $x_s$ is also classified to $s$ with margin $\tau$, then
$$w_s \cdot x_s - w_t \cdot x_s \ge \tau. \qquad (3)$$
Adding (2) and (3) gives
$$f_t(\Delta x) - f_s(\Delta x) = (w_t - w_s) \cdot \Delta x \ge 2\tau. \qquad (4)$$
This loosely suggests that, after training with a poisoned
training dataset, the model has stronger “affinity” between
the target class and the backdoor pattern (4) than between
the source class and the class-discriminative features of clean
source-class samples (3). This phenomenon is experimentally
verified when the model is a DNN, as shown in Apdx. A.
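As a lightweight illustration of (4) (separate from the DNN experiment in Apdx. A), the following numpy sketch trains a multinomial logistic (linear softmax) model on a poisoned toy dataset and prints $(w_t - w_s)\cdot\Delta x$ for each source class; the class geometry, pattern $\Delta x$, poisoning rate, and training hyperparameters are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d, n_per_class, target = 3, 20, 300, 0

# Clean data: one Gaussian cluster per class
means = rng.normal(0.0, 2.0, size=(C, d))
X = np.vstack([rng.normal(means[c], 1.0, size=(n_per_class, d)) for c in range(C)])
y = np.repeat(np.arange(C), n_per_class)

# Poison 10% of the non-target samples with an additive pattern, relabel to the target
delta = np.zeros(d)
delta[:3] = 1.0
poison = rng.choice(np.flatnonzero(y != target), size=60, replace=False)
X[poison] += delta
y[poison] = target

# Train a linear softmax classifier with weight matrix W (C x d) by gradient descent
W = np.zeros((C, d))
Y = np.eye(C)[y]                                   # one-hot labels
for _ in range(2000):
    logits = X @ W.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.1 * (p - Y).T @ X / len(X)              # cross-entropy gradient step

# "Affinity" of the backdoor pattern with the target class, cf. Eq. (4)
for s in range(C):
    if s != target:
        print(f"(w_t - w_{s}) . delta = {(W[target] - W[s]) @ delta:.2f}")
```

If the trained model fits the poisoned samples with a positive margin, the printed quantities should be positive, in line with (4).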
However, these strong affinities are only made possible by
the mislabeling of the backdoor-poisoned samples. Given that
usually the perturbation $\Delta x$ is small, backdoor-attacked images differ minutely from the original (clean) images. Thus, if a model is trained in a self-supervised manner, without making use of the class labels, the feature representations of $x$ and $x + \Delta x$ should be quite similar (highly proximal). Hence, in the model's representation space, poisoned samples may "stand out" as outliers in that their labels may disagree with the labels of samples in close proximity to them. This is the basic idea behind the cleansing method we now describe.
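To make this intuition concrete before describing the method itself, below is a small numpy sketch of one possible, purely illustrative instantiation: flag a sample as suspicious if its label disagrees with the majority label among its nearest neighbors in a self-supervised feature space. This is not the procedure proposed in this paper, only an illustration of the label-disagreement idea.

```python
import numpy as np

def flag_label_outliers(features, labels, k=10):
    """Flag samples whose label disagrees with the majority label of their
    k nearest neighbors (cosine similarity) in a given feature space.
    features: (N, d) self-supervised embeddings; labels: (N,) integer labels."""
    Z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = Z @ Z.T                                      # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # a sample is not its own neighbor
    nn_idx = np.argpartition(-sim, k, axis=1)[:, :k]   # k nearest neighbors per sample

    flags = np.zeros(len(labels), dtype=bool)
    for i, nbrs in enumerate(nn_idx):
        majority = np.bincount(labels[nbrs]).argmax()
        flags[i] = (majority != labels[i])             # label disagrees with neighborhood
    return flags
```

The feature space itself is obtained via self-supervised contrastive learning, described next.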
3.2. Self-supervised contrastive learning
SimCLR [19, 20] is a self-supervised training method to learn
a feature representation for images based on their semantic
content. In SimCLR, in each mini-batch, $K$ samples are randomly selected from the training dataset, and each selected sample $x_k$ is augmented to form two versions, resulting in $2K$ augmented samples. The augmented samples are then fed into the feature representation model, which is an encoder $E(\cdot)$ followed by a linear projector $L(\cdot)$, with the feature vector $z$ extracted from the last layer: $z = L(E(x))$. For simplicity we will refer to $L(E(\cdot))$ as the "encoder" hereon. The encoder is trained to minimize the following objective function:
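(For concreteness, a minimal PyTorch-style sketch of one SimCLR training step is given below, assuming the standard NT-Xent formulation of [19]; the encoder, augmentation function, and temperature are illustrative placeholders, not the configuration used in this work.)

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent loss over 2K projected features z (shape 2K x d), where rows
    2i and 2i+1 hold the two augmented views of the same original sample."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / temperature                              # pairwise cosine similarities
    mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))               # a view is not its own positive
    targets = torch.arange(len(z), device=z.device) ^ 1      # positive of row 2i is row 2i+1
    return F.cross_entropy(sim, targets)

def simclr_step(encoder, optimizer, batch, augment):
    """One training step of the 'encoder' L(E(.)) on a mini-batch of K images."""
    K = batch.shape[0]
    views = torch.cat([augment(batch), augment(batch)], dim=0)   # 2K augmented images
    z = encoder(views)                                           # 2K x d projected features
    # reorder so the two views of sample k occupy adjacent rows 2k and 2k+1
    idx = torch.arange(2 * K).reshape(2, K).T.reshape(-1)
    loss = nt_xent_loss(z[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `augment` stands in for SimCLR's stochastic data augmentation (e.g., random cropping and color distortion).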