2.2 Backdoor Attacks
Backdoor attacks are emerging yet critical threats to the training of deep neural networks (DNNs), where the adversary intends to embed hidden backdoors into the trained models. The attacked models behave normally on benign samples, whereas their predictions are maliciously changed whenever the adversary-specified trigger pattern appears. Due to this property, backdoor attacks have also been used as watermarking techniques for model [27, 28, 29] and dataset [3, 4] ownership verification.
In general, existing backdoor attacks can be divided into three main categories based on the adversary's capacity: 1) poison-only attacks [30, 31, 32], 2) training-controlled attacks [33, 34, 35], and 3) model-modified attacks [36, 37, 38]. In this paper, we focus only on poison-only backdoor attacks, since they are the hardest to conduct yet have the most widespread threat scenarios; moreover, only these attacks can be used to protect open-sourced datasets [3, 4]. In particular, based on the label type, existing poison-only attacks can be further divided into two main sub-types, as follows:
Poison-only Backdoor Attacks with Poisoned Labels.
In these attacks, the re-assigned labels of poisoned samples differ from their ground-truth labels. For example, a cat-like poisoned image may be labeled as a dog in the poisoned dataset released by backdoor adversaries. This is currently the most widespread attack paradigm. To the best of our knowledge, BadNets [30] is the first and most representative attack with poisoned labels. Specifically, the BadNets adversary randomly selects certain benign samples from the original benign dataset and turns them into poisoned samples by adding a specific trigger pattern to the images and changing their labels to the pre-defined target label. The adversary then combines the generated poisoned samples with the remaining benign ones to form the poisoned dataset, which is released for training the attacked models. After that, Chen et al. [39] proposed the blended attack, which suggested that a poisoned image should be similar to its benign version to ensure stealthiness. Most recently, a more stealthy and effective attack (i.e., WaNet [32]) was proposed, which exploits image warping to design trigger patterns.
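To make the above procedure concrete, the following is a minimal sketch (not the cited authors' implementation) of how a poisoned-label dataset could be generated. The corner-patch trigger, poisoning rate, and blending ratio are illustrative assumptions rather than values taken from the cited attacks.

```python
import numpy as np

def make_poisoned_dataset(images, labels, target_label, poison_rate=0.1,
                          trigger_value=255, trigger_size=3, blend_alpha=None):
    """Hypothetical BadNets-/blended-style poisoning of an image dataset.

    images: uint8 array of shape (N, H, W, C); labels: int array of shape (N,).
    A `poison_rate` fraction of samples receives the trigger and is relabeled
    to `target_label`; the rest of the dataset is left untouched.
    """
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_rate)
    poison_idx = np.random.choice(len(images), n_poison, replace=False)

    for i in poison_idx:
        if blend_alpha is None:
            # BadNets-style: stamp a small solid patch into the bottom-right corner.
            images[i, -trigger_size:, -trigger_size:, :] = trigger_value
        else:
            # Blended-style: mix the whole image with a trigger pattern.
            trigger = np.full_like(images[i], trigger_value)
            images[i] = ((1 - blend_alpha) * images[i]
                         + blend_alpha * trigger).astype(np.uint8)
        labels[i] = target_label  # poisoned label differs from the ground truth

    return images, labels
```

The victim receives only the resulting (images, labels) pair, which looks like an ordinary training set from their perspective.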
Poison-only Backdoor Attacks with Clean Labels.
Turner et al. [31] proposed the first poison-only backdoor attack with clean labels (i.e., the label-consistent attack), where the target label is the same as the ground-truth label of all poisoned samples. They argued that attacks with poisoned labels are not stealthy enough even when the trigger pattern is invisible, since users can still identify the attack by examining the image-label relation once they catch the poisoned samples. However, this attack is far less effective when the dataset has many classes or a high image resolution (e.g., GTSRB and ImageNet) [40, 41, 5]. Most recently, a more effective attack (i.e., Sleeper Agent) was proposed, which generates trigger patterns by optimization [40]. Nevertheless, these attacks remain difficult, since the 'robust features' contained in the poisoned images hinder the learning of trigger patterns [5]. How to design effective attacks with clean labels still lags far behind and is worth further exploration.
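The key operational difference from the poisoned-label setting is the sample selection: only images whose ground-truth label already equals the target label receive the trigger, so the released image-label pairs remain consistent. A minimal sketch under the same illustrative assumptions as above is given below; the additional perturbation step that label-consistent attacks apply to suppress the 'robust features' of the selected images is omitted.

```python
import numpy as np

def make_clean_label_poisoned_dataset(images, labels, target_label,
                                      poison_rate=0.1, trigger_value=255,
                                      trigger_size=3):
    """Hypothetical clean-label poisoning: only samples that already belong to
    the target class are triggered, and no label is ever changed."""
    images, labels = images.copy(), labels.copy()
    target_idx = np.where(labels == target_label)[0]
    n_poison = min(int(len(images) * poison_rate), len(target_idx))
    poison_idx = np.random.choice(target_idx, n_poison, replace=False)

    for i in poison_idx:
        # Stamp the same corner patch as before; the labels stay untouched
        # because they already equal `target_label`.
        images[i, -trigger_size:, -trigger_size:, :] = trigger_value

    return images, labels
```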
Besides, to the best of our knowledge, all existing backdoor attacks are targeted, i.e., the predictions of poisoned samples are deterministic and known by the adversaries. How to design backdoor attacks in an untargeted manner, and what their positive applications might be, remains unexplored and worth further investigation.
3 Untargeted Backdoor Watermark (UBW)
3.1 Preliminaries
Threat Model.
In this paper, we focus on poison-only backdoor attacks as the backdoor watermarks in image classification. Specifically, the backdoor adversaries are only allowed to modify some benign samples, while having neither the information about nor the ability to modify other training components (e.g., training loss, training schedule, and model structure). The generated poisoned samples, together with the remaining unmodified benign ones, are released to victims, who train their DNNs on them. In particular, we only consider poison-only backdoor attacks instead of other types of methods (e.g., training-controlled or model-modified attacks) because the latter require additional adversary capacities and therefore cannot be used to protect open-sourced datasets [3, 4].
The Main Pipeline of Existing Targeted Backdoor Attacks.
Let $\mathcal{D}=\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}$ denote the benign training set, where $\boldsymbol{x}_i \in \mathcal{X}=\{0, 1, \ldots, 255\}^{C \times W \times H}$ is the image, $y_i \in \mathcal{Y}=\{1, \ldots, K\}$ is its label, and $K$ is the number of classes. How to generate the poisoned dataset $\mathcal{D}_p$ is the cornerstone of poison-only backdoor attacks. To the best of our knowledge, almost all existing