on this poisoned training data. The attacker may add its trigger to the inputs
at inference time to achieve its desired output. However, such poisoned neural
networks behave normally on clean data. As such, defending neural networks
against backdoor attacks can be an arduous task in practice.
In the most common setting for backdoor defense, motivated by the rise of
Machine Learning as a Service (MLaaS) [7], it is assumed that the user out-
sources the training of its desired model to a third party. The adversary can then
exploit this freedom and provide a malicious, backdoored neural network to
the user [8,9,10,11,12]. From this perspective, current defense strategies
against backdoor attacks can be divided into two categories [13]. In detection-
based methods, the goal is to identify maliciously trained neural networks [14,11].
Erasing-based approaches, in contrast, try to eliminate the effects of the backdoor
data on the trained model and hence yield a backdoor-free
network [8,9,15,16,17].
A less-explored yet realistic scenario in defending neural networks against
backdoor poisonings is when the user obtains its training data from untrustwor-
thy sources. Here, the attacker can introduce backdoors into the user’s model
solely by poisoning the training data [6,18,13,19]. In this setting, existing approaches
have several disadvantages. First, these methods may require access to
a clean held-out validation dataset [18]. This assumption may not be valid in
real-world applications where collecting new, reliable data is costly. Moreover,
such approaches may need a two-step training procedure: a neural network is
first trained on the poisoned data. Then, the backdoor data is removed from
the training set using the previously trained network. After purification of the
training set, the neural network needs to be re-trained [18,19] (see the sketch below). Finally, some
methods achieve robustness by training multiple neural networks on subsets of
training data to enable a “majority-vote mechanism” [13,20,21]. These last two
requirements may also prove expensive in real-world applications, where it is more
efficient to train a single neural network only once. As a result, a standard, robust,
and end-to-end training approach, like adversarial training, is still lacking for
training on backdoor-poisoned data.
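To make this overhead concrete, the following is a minimal sketch of such a two-stage purify-then-re-train pipeline. It is written in PyTorch on toy data, and the per-sample-loss "suspicion score" used for filtering is a simplified placeholder of ours, not the scoring rule of any particular defense cited above; the sketch only illustrates that the full training cost must be paid twice.

import torch
from torch import nn
from torch.utils.data import DataLoader, Subset, TensorDataset

def train(model, loader, epochs=10, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

def suspicion_scores(model, dataset):
    # Placeholder scoring rule: per-sample loss under the trained model.
    # Actual defenses use more elaborate statistics, but any such statistic
    # is computed with an already-trained network, i.e. after a first,
    # full training pass.
    loss_fn = nn.CrossEntropyLoss(reduction="none")
    model.eval()
    scores = []
    with torch.no_grad():
        for x, y in DataLoader(dataset, batch_size=256):
            scores.append(loss_fn(model(x), y))
    return torch.cat(scores)

# Toy stand-in for a (possibly poisoned) training set.
X, y = torch.randn(1000, 20), torch.randint(0, 10, (1000,))
data = TensorDataset(X, y)

# Stage 1: train a network on the full, possibly poisoned, training set.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
model = train(model, DataLoader(data, batch_size=64, shuffle=True))

# Stage 2: purify the training set with the trained network, then re-train
# a fresh network from scratch, so the training cost is paid a second time.
keep = torch.argsort(suspicion_scores(model, data))[: int(0.8 * len(data))]
clean_loader = DataLoader(Subset(data, keep.tolist()), batch_size=64, shuffle=True)
clean_model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
clean_model = train(clean_model, clean_loader)

In contrast, the approach developed in this paper selects its coreset while a single network is being trained, so no second training pass is required.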
To address these pitfalls, in this paper we leverage the theory of coreset se-
lection [22,23,24,25,26] for end-to-end training of neural networks. In particular,
we aim to sanitize the possibly malicious training data by training the neural
network on a subset of the training data. To find this subset in an online fashion,
we exploit coreset selection guided by the distinguishing properties of poisoned data. To
formulate our coreset selection objective, we argue that the gradient space char-
acteristics and local intrinsic dimensionality (LID) of poisoned and clean data
samples are different from one another. We empirically validate these properties
using various case studies. Then, based on these two properties, we define an
appropriate coreset selection objective and effectively filter out poisoned data
samples from the training set. As we shall see, this process is done online as
the neural network is being trained. As such, we eliminate the re-training
requirement of previous methods. We empirically show the successful
performance of our method, named Collider, in training robust neural net-