target/random label while the model maintains high accuracy for benign input x. The perturbation pattern is known as the Trojan trigger. Trojan attacks can happen during training (e.g., data poisoning) or model distribution (e.g., changing model weights in a supply-chain attack). Existing works have shown Trojan attacks against different DNN models, including computer vision models [1, 2, 24], Graph Neural Networks (GNNs) [28, 29], Reinforcement Learning (RL) [30, 31], Natural Language Processing (NLP) [32–37], recommendation systems [38], malware detection [10], pretrained models [21, 39, 40], active learning [41], and federated learning [42, 43]. The Trojan trigger can be a simple input pattern (e.g., a yellow pad) [1, 2, 24] or a complex input transformation function (e.g., a CycleGAN that changes the input style) [3, 5, 7–9]. If the trigger is a static input-space perturbation (e.g., a yellow pad), the attack is known as an input-space Trojan; if the trigger is an input feature (e.g., an image style), the attack is referred to as a feature-space Trojan.
There are different types of Trojan defenses. One line of work [13, 14, 44] attempts to remove poisoned samples by cleaning the training dataset. Training-based methods [45–47] train benign classifiers even on a poisoned dataset. These training-time approaches work for poisoning-based attacks but fail to defend against supply-chain attacks, where the adversary injects the Trojan after the model is trained. Another line of work, e.g., STRIP [15], SentiNet [16], and Februus [17], aims to detect Trojan inputs at runtime. For a given test input, however, it is hard to distinguish a misclassification from a Trojan attack. These runtime detection methods make assumptions about the attack that stronger attacks can violate. For example, STRIP fails to detect Trojan inputs when the trigger is located near the center of an image or overlaps with the main object (e.g., in feature-space attacks). Another limitation is that they subject test inputs to various heavyweight tests, significantly delaying the response time.
Trigger reverse engineering [19–23, 48–50] makes no assumptions about the attack method (e.g., poisoning or supply-chain attacks) and does not affect runtime performance. It inspects the model to check whether a Trojan exists before deployment. Given a DNN model M and a small set of clean samples X, trigger reverse engineering methods try to reconstruct the injected trigger. If reverse engineering succeeds, the model is marked as malicious. Neural Cleanse (NC) [19] proposes to perform reverse engineering by solving Eq. 1:
\[
\min_{m,t}\; L\big(M\big((1-m)\odot x + m\odot t\big),\, y_t\big) + R \tag{1}
\]
where x ∈ X, m is the trigger mask (i.e., a binary matrix of the same size as the input that determines whether each value is replaced by the trigger), t is the trigger pattern (i.e., a matrix of the same size as the input containing the trigger values), and R denotes the attack constraints (e.g., the trigger size is smaller than 1/4 of the image). L is the cross-entropy loss function. Most prior works [20–23]
follow the same methodology and inherently suffer from the same limitations. First, they assume that a trigger can be represented by an input-space perturbation, denoted by (m, t). This assumption is valid for input-space triggers but does not hold for feature-space attacks. Second, the constraints R are heuristics observed from existing attacks. For example, NC observes that most triggers are small and limits the trigger size to be no larger than a threshold; otherwise, the trigger would overlap with the main object and decrease benign accuracy. In practice, more advanced attacks can break such heuristics. For instance, DFST [3] leverages CycleGAN to transfer images from one style to another without changing their semantics, altering almost all pixels in a given image. This paper proposes a novel reverse engineering method that overcomes the above limitations for image classifiers.
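To make Eq. 1 concrete, the following is a minimal PyTorch-style sketch of NC-style trigger reverse engineering. The function name, optimizer settings, and the weight lambda_reg are illustrative placeholders rather than NC's actual implementation, which additionally schedules the constraint weight and uses further heuristics.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, clean_batch, target_label,
                             steps=500, lr=0.1, lambda_reg=1e-2):
    """Sketch of NC-style trigger reverse engineering (Eq. 1).

    `model`, `clean_batch` (an N x C x H x W tensor of clean samples),
    `target_label`, and `lambda_reg` are illustrative placeholders.
    """
    model.eval()
    n, c, h, w = clean_batch.shape
    # Optimize unconstrained parameters; sigmoid keeps mask/pattern in [0, 1].
    mask_param = torch.zeros(1, 1, h, w, requires_grad=True)
    pattern_param = torch.zeros(1, c, h, w, requires_grad=True)
    opt = torch.optim.Adam([mask_param, pattern_param], lr=lr)
    target = torch.full((n,), target_label, dtype=torch.long)

    for _ in range(steps):
        m = torch.sigmoid(mask_param)      # trigger mask m
        t = torch.sigmoid(pattern_param)   # trigger pattern t
        stamped = (1 - m) * clean_batch + m * t
        # Cross-entropy toward the target label plus a mask-size penalty that
        # plays the role of the constraint term R ("triggers are small").
        loss = F.cross_entropy(model(stamped), target) + lambda_reg * m.sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_param).detach(), torch.sigmoid(pattern_param).detach()
```

In NC, this optimization is run once per candidate target label, and a label whose reconstructed mask is anomalously small (an outlier under a median-absolute-deviation test on the mask L1 norms) is reported as Trojaned.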
3 Methodology
3.1 Threat Model
This work aims to determine whether a given model contains a Trojan by reverse-engineering the corresponding trigger. Following existing works [19, 20, 51], we assume access to the model and a small dataset containing correctly labeled benign samples of each label. In practice, such datasets can be gathered from the Internet. We make no assumptions about how the attacker injects the Trojan (poisoning or supply-chain attack). The attack can be formally defined as: M(x) = y and M(F(x)) = y_T for all x ∈ X, where M is the Trojaned model, x is a clean input sample with correct label y, y_T is the target label, and F is the function that constructs Trojan samples. Input-space triggers add static input perturbations, while feature-space triggers are input transformations. The key difference between our work and existing work is that we also consider feature-space triggers.
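For illustration only, the sketch below contrasts the two forms of F in this notation; `pattern`, `mask`, and `style_transfer` are hypothetical placeholders rather than components of any particular attack.

```python
import torch

def input_space_trigger(x, pattern, mask):
    """Input-space F: stamp a fixed patch (e.g., a yellow pad) onto x.

    `pattern` and `mask` are static tensors shared by all Trojan samples,
    so the pixel perturbation is the same for every input.
    """
    return (1 - mask) * x + mask * pattern

def feature_space_trigger(x, style_transfer):
    """Feature-space F: apply an input transformation such as a style change.

    `style_transfer` stands in for a generative model (e.g., a CycleGAN
    generator); the resulting perturbation has no fixed pixel pattern and
    may alter almost every pixel of x.
    """
    return style_transfer(x)
```

Either form of F drives the Trojaned model M to output y_T, but only the first is expressible as the (m, t) perturbation of Eq. 1, which is the gap our method targets.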