not only in terms of backdoor similarity, but also in terms of
its effectiveness in evading existing detections, as observed in
our experiments. Further, we demonstrate that a backdoor with
high backdoor similarity is indeed hard to detect, through
theoretical analysis as well as extensive experimental studies
on four datasets under six representative detections, using our
TSA attack together with five representative attacks proposed
in prior research.
Contributions. Our contributions are as follows:
• New direction on backdoor analysis. Our research brings a
new aspect to backdoor research through the lens of backdoor
similarity. Our study reveals the great impact backdoor
similarity has on both backdoor attacks and detection, which
can potentially help determine the limits of the adversary's
capability in a backdoor attack and therefore enable the
development of the best possible response.
• New stealthy backdoor attack. Based upon our understanding
of backdoor similarity, we developed a novel technique, the
TSA attack, to generate a stealthy backdoor under a given
backdoor similarity constraint, helping us better understand
the adversary's potential and more effectively calibrate the
capability of backdoor detections.
2 Background
2.1 Neural Network
We model a neural network model $f$ as a mapping function from the
input space $\mathcal{X}$ to the output space $\mathcal{Y}$, i.e.,
$f: \mathcal{X} \mapsto \mathcal{Y}$. Further, the model $f$ can be
decomposed into two sub-functions: $f(x) = c(g(x))$. Specifically,
for a classification task with $L$ classes where the output space
$\mathcal{Y} = \{0, 1, \ldots, L-1\}$, we define
$g: \mathcal{X} \mapsto [0,1]^L$, $c: [0,1]^L \mapsto \mathcal{Y}$ and
$c(g(x)) = \arg\max_j g(x)_j$, where $g(x)_j$ is the $j$-th element of
$g(x)$. According to the common understanding, a well-trained $g(x)$
approximates the conditional probability of observing $y$ given $x$,
i.e., $g(x)_y \approx \Pr(y \mid x)$, for $y \in \mathcal{Y}$ and
$x \in \mathcal{X}$.
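To make the decomposition concrete, below is a minimal sketch of $f(x) = c(g(x))$; the toy two-layer network, its random weights, and the dimensions are illustrative assumptions, not any particular model used in our experiments.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: maps logits to a probability vector in [0,1]^L.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def g(x, W1, b1, W2, b2):
    # Sub-function g: X -> [0,1]^L, here a toy two-layer network with ReLU.
    h = np.maximum(0, W1 @ x + b1)
    return softmax(W2 @ h + b2)

def c(p):
    # Sub-function c: [0,1]^L -> Y, picks the class with the highest probability.
    return int(np.argmax(p))

def f(x, params):
    # Full classifier f(x) = c(g(x)).
    return c(g(x, *params))

# Example with random (untrained) parameters: input dim 8, hidden dim 16, L = 3 classes.
rng = np.random.default_rng(0)
params = (rng.normal(size=(16, 8)), np.zeros(16), rng.normal(size=(3, 16)), np.zeros(3))
x = rng.normal(size=8)
print(g(x, *params), f(x, params))  # probability vector g(x) and predicted label f(x)
```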
2.2 Backdoor Attack & Detection
Backdoor attack. In our research, we focus on targeted backdoors that
cause the backdoor-infected model $f_b$ to map a trigger-carrying
input $A(x)$ to the target label $t$, which differs from the
ground-truth label of $x$ [5,59,77,82]:
$$ f_b(A(x)) = t \neq f_P(x) \qquad (1) $$
where $f_P$ is the benign model that outputs the ground-truth label
for $x$ and $A$ is the trigger function that transforms a benign
input into its trigger-carrying counterpart. Many attack methods
have been proposed to inject backdoors, e.g., [12,14,20,47,49,50,61,72].
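For illustration only, the sketch below gives one hypothetical instance of the trigger function $A$, a BadNets-style corner patch, together with a helper that measures how often a model satisfies $f_b(A(x)) = t$ from Equation (1); the patch location, size, and value, as well as the callable `f_b`, are assumptions and not the construction used in our attack.

```python
import numpy as np

def apply_patch_trigger(x, patch_value=1.0, size=3):
    """A simple trigger function A: stamp a small square of fixed value
    into the bottom-right corner of an image x with shape (H, W, C).
    Patch location, size, and value are illustrative choices."""
    x_triggered = x.copy()
    x_triggered[-size:, -size:, :] = patch_value
    return x_triggered

def attack_success_rate(f_b, inputs, target_label):
    """Fraction of trigger-carrying inputs A(x) that a (hypothetical)
    backdoored classifier f_b maps to the target label t,
    i.e., how often f_b(A(x)) == t holds."""
    hits = sum(f_b(apply_patch_trigger(x)) == target_label for x in inputs)
    return hits / len(inputs)
```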
Backdoor detection. Backdoor detection has been extensively studied
recently [21,25,35,44,78]. The proposed approaches can be categorized
based on the model information they focus on: model outputs, model
weights, and model inputs. This categorization is used in our
research to analyze different detection approaches (Section 4).
More specifically, detection on model outputs captures backdoored
models by detecting the difference between the outputs of backdoored
models and benign models on some inputs. Such detection methods
include NC [77], K-ARM [68], MNTD [83], Spectre [27], TABOR [26],
MESA [58], STRIP [22], SentiNet [13], ABL [43], ULP [38], etc.
Detection on model weights finds a backdoored model by distinguishing
its weights from those of benign models. Such detection approaches
include ABS [48], ANP [80], NeuronInspect [31], etc. Detection on
model inputs identifies a backdoored model by detecting differences
between the inputs that cause a backdoored model and a benign model
to produce similar outputs. Prominent detections in this category
include SCAn [72], AC [11], SS [74], etc.
2.3 Threat Model
We focus on backdoors for image classification tasks, while
assuming a white-box attack scenario where the adversary can
access the training process. The attacker injects the backdoor
to accomplish the goal formally defined in Section 3.2 and to
evade backdoor detections.
The backdoor defender aims to distinguish backdoored models from
benign models. She has white-box access to those backdoored models
and owns a small set of benign inputs. Besides, the defender may
obtain a mixed set of inputs containing a large number of benign
inputs together with a few trigger-carrying inputs; however, which
inputs in this set carry the trigger is unknown to her.
3 TSA on Backdoor Attack
Not only does a backdoor attack aim to induce a victim model to
misclassify trigger-carrying inputs, but it is also meant to achieve
high stealthiness against backdoor detections. For this purpose, some
attacks [17,49] reduce the $L_p$-norm of the trigger, i.e.,
$\|A(x) - x\|_p$, to make trigger-carrying inputs similar to benign
inputs, while some others construct the trigger using benign
features [46,66]. All these tricks are designed to evade specific
detection methods. Less clear, however, is the stealthiness guarantee
that those tricks can provide against other detection methods.
Understanding such a stealthiness guarantee requires modeling the
detectability of backdoored models, which depends on measuring
fundamental differences between backdoored and benign models,
something that has not been studied before.
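As a concrete illustration of this kind of stealthiness constraint, the sketch below measures $\|A(x) - x\|_p$ and projects a trigger onto an $L_\infty$ budget; the budget $\epsilon = 8/255$ and the assumption that pixels lie in $[0, 1]$ are illustrative choices, not parameters of any specific attack discussed above.

```python
import numpy as np

def trigger_norm(x, x_triggered, p=np.inf):
    # L_p-norm of the trigger, i.e., ||A(x) - x||_p, computed over all pixels.
    return np.linalg.norm((x_triggered - x).ravel(), ord=p)

def project_trigger(x, x_triggered, eps=8 / 255):
    # Project the perturbation onto an L_inf ball of radius eps around x,
    # a common way to keep trigger-carrying inputs visually close to benign ones.
    delta = np.clip(x_triggered - x, -eps, eps)
    return np.clip(x + delta, 0.0, 1.0)  # assumes pixels are encoded in [0, 1]
```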
To fill this gap, we analyze the difference between the task a
backdoored model intends to accomplish (called the backdoor task)
and that of its benign counterpart (called the primary task), which
indicates the detectability of the backdoored model, as demonstrated
by our experimental study (see Section 4). Between these two tasks,
we define the concept of backdoor similarity, i.e., the similarity
between the primary and the backdoor task, by leveraging the task
similarity metrics