multiple triggers into different channels (i.e., the RGB channels) of an image [50], respectively. However,
the number of possible target classes in these works is limited to 4 and 3, bounded by the attack
performance and the number of channels (i.e., 3 in RGB images). The work of DeepPayload [25]
attempts to inject the malicious logic directly through reverse engineering instead of training the
backdoor into the model, and it does not consider varying the target backdoor class as this paper does.
Our contributions are summarized below:
• We propose a new type of backdoor attack in which the adversary can flexibly attack any target label
during inference. This attack maliciously modifies the model by establishing a causal link between
the trigger function and all output classes.
• We propose a class-conditional generative trigger function that, given the target label, generates
an imperceptible trigger pattern that causes the model to predict that label. We then propose a
constrained optimization objective that effectively and efficiently learns the trigger function and
poisons the model (a conceptual sketch follows this list).
• Finally, we empirically demonstrate the effectiveness of the proposed method and its robustness
against several representative defense mechanisms. We show that the proposed method can
achieve high attack success rates for any arbitrarily chosen target class while preserving the
behavior of the model under normal conditions.
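To make the class-conditional trigger idea concrete, the sketch below illustrates one plausible form of such a trigger function in PyTorch: a small generator that takes a clean image and an arbitrary target label and outputs an imperceptibly perturbed image. This is an illustrative assumption rather than the implementation proposed in this paper; the architecture, embedding width, and perturbation bound epsilon are hypothetical, and the actual trigger function and training objective are detailed in Section 4.

    import torch
    import torch.nn as nn

    class TriggerGenerator(nn.Module):
        """Hypothetical class-conditional trigger function g(x, y) -> poisoned image."""
        def __init__(self, num_classes, image_channels=3, embed_dim=16, epsilon=8 / 255):
            super().__init__()
            self.epsilon = epsilon                      # perturbation bound for imperceptibility
            self.label_embed = nn.Embedding(num_classes, embed_dim)
            self.net = nn.Sequential(                   # toy convolutional generator
                nn.Conv2d(image_channels + embed_dim, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(32, image_channels, kernel_size=3, padding=1),
                nn.Tanh(),                              # output in [-1, 1], scaled by epsilon below
            )

        def forward(self, x, target):
            # Condition on the target label by broadcasting its embedding over the image plane.
            b, _, h, w = x.shape
            cond = self.label_embed(target).view(b, -1, 1, 1).expand(-1, -1, h, w)
            delta = self.net(torch.cat([x, cond], dim=1)) * self.epsilon
            return torch.clamp(x + delta, 0.0, 1.0)     # imperceptibly perturbed (poisoned) image

At attack time, the adversary can pick any class index, pass it to the generator together with the input image, and submit the resulting image to the victim model.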
The rest of the paper is organized as follows. We review the background of DNN backdoor attacks
in Section 2. The threat model is defined in Section 3. We present the details of the proposed
methodology in Section 4, and evaluate the performance and compare to prior works in Section 5.
Finally, Section 6 concludes the paper. We present more details about the experimental settings and
additional results in the supplementary material.
2 Background
2.1 Backdoor Attacks
In image classification tasks, backdoor attacks on DNNs seek to inject malicious behavior into the
model that associates a trigger with a target backdoor class [15, 28]; this behavior can also be interpreted
as the payload, as in malware backdoors and hardware Trojans. The injection of the backdoor is
typically achieved by poisoning the training data [15, 28] or by manipulating the training process or
model parameters [19, 12]. An important requirement for a backdoor attack is its
stealthiness, such that the existence of the backdoor in a model cannot be easily identified. Hence, a
successful backdoor attack should preserve the normal functionality, i.e., the inference accuracy on clean
images (images without the trigger).
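As a concrete illustration of the data-poisoning route, the following minimal sketch stamps a small patch trigger onto a fraction of the training images and relabels them to the attacker's target class, in the spirit of patch-based attacks; the poison rate, patch size, and HWC image layout are illustrative assumptions rather than settings taken from any cited work.

    import numpy as np

    def poison_dataset(images, labels, target_class, poison_rate=0.1, patch_size=3, seed=0):
        """Return a copy of the dataset with a fraction of samples carrying a patch trigger."""
        rng = np.random.default_rng(seed)
        images, labels = images.copy(), labels.copy()
        idx = rng.choice(len(images), int(poison_rate * len(images)), replace=False)
        # Stamp a white square in the bottom-right corner (assumes HWC images scaled to [0, 1]).
        images[idx, -patch_size:, -patch_size:, :] = 1.0
        # Relabel the poisoned samples so the model learns the trigger -> target_class association.
        labels[idx] = target_class
        return images, labels

A model trained on the returned dataset behaves normally on clean inputs but tends to predict target_class whenever the patch is present.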
The design of the trigger has been extensively studied in the literature, ranging from early, visually obvious
patch-based triggers [7, 28] to more invisible ones based on image blending [7], sinusoidal strips (SIG) [2],
reflections (ReFool) [29], single pixels [1], warping (WaNet) [35], discrete cosine transform (DCT)
steganography [50], and adversarial example generation [41, 22]. As opposed to a universal trigger,
several recent works have investigated input-aware backdoor attacks that minimize the visibility of
the trigger by generating the trigger pattern based on the content of each input image [8, 34, 10, 9].
For instance, LIRA [10] trains a generative model as the trigger function that produces a trigger for each
image, while simultaneously injecting the backdoor into the model; this approach has been shown to generate
completely invisible triggers.
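A simplified view of this joint-training idea is sketched below: a generator produces a bounded, input-dependent perturbation while the classifier is optimized to label clean images correctly and perturbed images as the target class. This is not LIRA's exact algorithm; the equal loss weighting, the perturbation bound, and the single fixed target are simplifying assumptions.

    import torch
    import torch.nn.functional as F

    def joint_training_step(classifier, generator, x, y, target_class, opt_f, opt_g, epsilon=8 / 255):
        """One step that trains the backdoored classifier and the trigger generator together."""
        delta = torch.clamp(generator(x), -epsilon, epsilon)   # input-dependent, bounded trigger
        x_poisoned = torch.clamp(x + delta, 0.0, 1.0)
        target = torch.full_like(y, target_class)              # fixed target label (all-to-one)
        # Clean objective keeps normal accuracy; backdoor objective enforces the trigger behavior.
        loss = F.cross_entropy(classifier(x), y) + F.cross_entropy(classifier(x_poisoned), target)
        opt_f.zero_grad()
        opt_g.zero_grad()
        loss.backward()
        opt_f.step()
        opt_g.step()
        return loss.item()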
All of these attacks, under either the all-to-one or the all-to-all scenario, can only manipulate the prediction
of a given input image toward a single target class. While the works in [49, 50] considered a less narrow
form of the payload, the number of possible target backdoor classes is still limited, i.e., to 3 or 4. In
contrast, this paper exploits a much stronger attack that is able to misclassify a given input image into
any arbitrary target class.
2.2 Backdoor Defenses
Meanwhile, various backdoor defenses have also been developed, aimed at either detecting [5, 44, 13]
or mitigating [26, 45, 6, 36, 24] the attacks. Popular methods include Neural Cleanse [45], which detects
the backdoor by searching for possible trigger patches; fine-pruning [26], which prunes the model to
erase the backdoor; and the spectral signature approach [44], which detects outliers based on the