Rethinking the Reverse-engineering of Trojan
Triggers
Zhenting Wang
Rutgers University
zhenting.wang@rutgers.edu
Kai Mei
Rutgers University
kai.mei@rutgers.edu
Hailun Ding
Rutgers University
hailun.ding@rutgers.edu
Juan Zhai
Rutgers University
juan.zhai@rutgers.edu
Shiqing Ma
Rutgers University
sm2283@rutgers.edu
Abstract
Deep Neural Networks are vulnerable to Trojan (or backdoor) attacks. Reverse-
engineering methods can reconstruct the trigger and thus identify affected models.
Existing reverse-engineering methods only consider input-space constraints, e.g.,
the trigger size in the input space. Specifically, they assume the triggers are static patterns
in the input space and fail to detect models with feature-space triggers such as image
style transformations. We observe that both input-space and feature-space Trojans
are associated with feature space hyperplanes. Based on this observation, we
design a novel reverse-engineering method that exploits the feature space constraint
to reverse-engineer Trojan triggers. Results on four datasets and seven different
attacks demonstrate that our solution effectively defends both input-space and
feature-space Trojans. It outperforms state-of-the-art reverse-engineering methods
and other types of defenses in both Trojaned model detection and mitigation tasks.
On average, the detection accuracy of our method is 93%. For Trojan mitigation,
our method can reduce the ASR (attack success rate) to only 0.26% with the
BA (benign accuracy) remaining nearly unchanged. Our code can be found at
https://github.com/RU-System-Software-and-Security/FeatureRE.
1 Introduction
DNNs are vulnerable to Trojan attacks [1-6]. After injecting a Trojan into a DNN model, the adversary can manipulate the model prediction by adding a Trojan trigger to the input to obtain the target label. The adversary can inject the Trojan through a poisoning attack or a supply chain attack. In a poisoning attack, the adversary controls the training dataset and injects the Trojan by adding samples that carry the Trojan trigger and are labeled with the target label. In a supply chain attack, the adversary replaces a benign model with a Trojaned one. Trojan triggers are becoming more and more stealthy. Earlier works use static patterns, e.g., a yellow pad, as the trigger; these are known as input-space triggers. Researchers recently proposed more dynamic and input-aware techniques that generate stealthy triggers mixed with benign features; these are referred to as feature-space triggers. For example, the trigger of a feature-space Trojan can be a warping process [7] or a generative model [3, 8, 9]. The Trojan attack is a prominent threat to the trustworthiness of DNN models, especially in security-critical applications such as autonomous driving [1], malware classification [10], and face recognition [11].
Prior works have proposed several ways to defend against Trojan attacks, such as removing poisons during training [12-14], detecting Trojan samples at runtime [15-18], etc. Many of the above methods only work for one type of Trojan attack. For example, training-time and pre-training-time defenses (e.g., removing poisoning data, or training a benign model from a poisoned dataset) fail to defend against supply chain attacks. Trigger reverse-engineering [19-23] is a general method that defends against different Trojan attacks under different threat models. It searches for an input pattern that can be used as a trigger in the given model. If such a trigger is found, the model contains a corresponding Trojan and is marked as malicious, and vice versa. Existing reverse-engineering methods assume that Trojan triggers are static patterns in the input space and formulate an optimization problem that looks for an input pattern usable as the trigger. This assumption is valid for input-space attacks [1, 2, 24] that use static triggers (e.g., a colored patch). Feature-space attacks [3-5, 7-9, 25] break this assumption. Existing trigger reverse-engineering methods [19-23] also constrain the optimization with heuristics or empirical observations on existing attacks, such as pixel values lying in the range [0, 255] and the trigger being small. Such heuristics are invalid for feature-space triggers that change all pixels of an image. Reverse-engineering in the feature space is challenging: unlike the input space, there are no constraints that can be directly used.
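To make the contrast concrete, the snippet below is a small, hypothetical illustration (our own toy code, not any published attack): a static patch trigger is fully described by a mask/pattern pair (m, t), whereas a WaNet-style warp built with torch.nn.functional.grid_sample perturbs essentially every pixel, so no single (m, t) pair can represent it.

```python
import torch
import torch.nn.functional as F

def patch_trigger(x, m, t):
    # Input-space trigger: paste a static pattern t where mask m == 1.
    return (1 - m) * x + m * t

def warp_trigger(x, strength=0.05):
    # Feature-space-style trigger (illustrative): a smooth warp that moves
    # every pixel slightly, so it cannot be written as a single (m, t) pair.
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).expand(n, h, w, 2)
    offset = strength * torch.rand(n, h, w, 2) * 2 - strength  # hypothetical flow field
    return F.grid_sample(x, base + offset, align_corners=True)

x = torch.rand(4, 3, 32, 32)                      # toy batch
m = torch.zeros(1, 1, 32, 32); m[..., :4, :4] = 1.0
t = torch.ones(1, 3, 32, 32)                      # white 4x4 patch
print((patch_trigger(x, m, t) - x).abs().gt(0).float().mean())   # few pixels change
print((warp_trigger(x) - x).abs().gt(1e-6).float().mean())       # almost all pixels change
```

Running the two prints shows the patch touches only a few percent of pixels while the warp changes nearly all of them, which is exactly the regime where input-space heuristics such as a small-trigger-size penalty stop applying.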
Fig. 1: Existing reverse-engineering (RE) and ours. Existing RE searches for an input-space trigger and constrains the input space; ours searches for a feature-space trigger, constrains the feature space, and additionally searches for the input-space transformation based on the feature-space constraint.
In this paper, we propose a trigger reverse-engineering method that works for feature-space triggers. Our intuition is that the features representing the Trojan are orthogonal to other features. Because a trigger works for a set of samples (or all of them, depending on the attack type), changing the input content without removing the Trojan features will not change the prediction. That is, changing Trojan features and benign features does not affect each other. Trojan features therefore form a hyperplane in the high-dimensional feature space, which can constrain the search in the feature space. We develop our reverse-engineering method by exploiting this feature space constraint. Fig. 1 illustrates our idea. Existing reverse-engineering methods only consider input-space constraints: they conduct reverse-engineering by searching for a static trigger pattern in the input space, and thus fail to reverse-engineer feature-space Trojans whose triggers are dynamic in the input space. Instead, our idea is to exploit the feature space constraint and search for a feature-space trigger under the constraint that the Trojan features form a hyperplane. At the same time, we also reverse-engineer the input-space Trojan transformation based on the feature space constraint. To the best of our knowledge, we are the first to propose a feature-space reverse-engineering method for backdoor detection.
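As a rough, self-contained illustration of this intuition (our own toy code with a made-up linear feature extractor, not the paper's implementation): if A extracts inner features and a mask marks candidate Trojan neurons, triggered inputs should concentrate on the masked coordinates (lower standard deviation across samples) compared with benign inputs.

```python
import torch

def hyperplane_score(A, xs, mask):
    """Lower score => masked features concentrate across samples,
    i.e., they lie closer to a hyperplane {a : mask*a = mask*t}."""
    feats = A(xs)                         # (N, D) inner features
    masked = feats * mask                 # keep only candidate Trojan neurons
    return masked.std(dim=0).mean().item()

# Toy check with a random linear "feature extractor" standing in for A.
torch.manual_seed(0)
W = torch.randn(32, 3 * 8 * 8)
A = lambda x: x.flatten(1) @ W.T
benign = torch.rand(64, 3, 8, 8)
triggered = benign.clone(); triggered[:, :, :4, :4] = 1.0    # shared patch
mask = torch.zeros(32); mask[:4] = 1.0                       # hypothetical Trojan neurons
print(hyperplane_score(A, triggered, mask), hyperplane_score(A, benign, mask))
```

The standard deviation of the masked features is precisely the quantity our method later constrains (the τ2 term in Eq. 2).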
Through the reverse-engineered Trojans, we develop a Trojan detection and removal method. We implemented a prototype, FEATURERE (FEATURE-space Reverse-Engineering), in Python and PyTorch and evaluated it on the MNIST, GTSRB, CIFAR, and ImageNet datasets with seven different attacks (i.e., BadNets [1], Filter attack [20], WaNet [7], Input-aware dynamic attack [8], ISSBA [9], Clean-label attack [26], Label-specific attack [1], and SIG attack [27]). Our results show that FEATURERE is effective. On average, the detection accuracy of our method is 93%, outperforming existing techniques. For Trojan mitigation, our method reduces the ASR (attack success rate) to only 0.26% with the BA (benign accuracy) remaining nearly unchanged, using only ten clean samples per class.
Our contributions are summarized as follows. First, we identify the feature space properties of Trojaned models and reveal the relationship between Trojans and feature space hyperplanes. Second, we propose a novel Trojan trigger reverse-engineering technique leveraging the feature-space Trojan hyperplane. Third, we evaluate our prototype on four different datasets, five different network architectures, and seven advanced Trojan attacks. Results show that our method outperforms SOTA approaches.
2 Background & Motivation
A DNN classifier is a function $M: \mathcal{X} \mapsto \mathcal{Y}$, where $\mathcal{X}$ is the input domain $\mathbb{R}^m$ and $\mathcal{Y}$ is the set of labels $K$. A Trojan (or backdoor) attack against a DNN model $M$ is a malicious way of perturbing the input so that an adversarial input $x'$ (i.e., an input carrying the perturbation pattern) is classified to a target/random label while the model maintains high accuracy on benign inputs $x$. The perturbation pattern is known as the Trojan trigger. Trojan attacks can happen during training (e.g., data poisoning) or during model distribution (e.g., changing model weights, i.e., a supply-chain attack). Existing works have shown Trojan attacks against different DNN models, including computer vision models [1, 2, 24], Graph Neural Networks (GNNs) [28, 29], Reinforcement Learning (RL) [30, 31], Natural Language Processing (NLP) [32-37], recommendation systems [38], malware detection [10], pretrained models [21, 39, 40], active learning [41], and federated learning [42, 43]. The Trojan trigger can be a simple input pattern (e.g., a yellow pad) [1, 2, 24] or a complex input transformation function (e.g., a CycleGAN that changes the input style) [3, 5, 7-9]. If the trigger is a static input-space perturbation (e.g., a yellow pad), the attack is known as an input-space Trojan; if the trigger is an input feature (e.g., an image style), the attack is referred to as a feature-space Trojan.
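As a concrete toy example of the poisoning route described above, the following sketch stamps a small corner patch onto a fraction of a training batch and flips those labels to the target class; all names and parameters here are our own illustrative choices, not any specific attack's code.

```python
import torch

def poison_batch(images, labels, target_label=0, rate=0.1, patch_value=1.0):
    """Stamp a 3x3 corner patch on a random subset of a batch and relabel it."""
    images, labels = images.clone(), labels.clone()
    n = images.size(0)
    idx = torch.randperm(n)[: max(1, int(rate * n))]
    images[idx, :, -3:, -3:] = patch_value      # the "yellow pad"-style trigger
    labels[idx] = target_label                  # poison: trigger -> target label
    return images, labels

# Toy usage: poison 10% of a random batch of 3x32x32 "images".
x = torch.rand(128, 3, 32, 32)
y = torch.randint(0, 10, (128,))
px, py = poison_batch(x, y)
```

A model trained on such batches learns the patch-to-target association while behaving normally on clean inputs; a feature-space attack replaces the patch-stamping line with an input transformation such as a style transfer or warp.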
There are different types of Trojan defenses. One line of work [13, 14, 44] attempts to remove poisoned data samples by cleaning the training dataset. Training-based methods [45-47] train benign classifiers even from a poisoned dataset. These training-time approaches work for poisoning-based attacks but fail to defend against supply chain attacks, where the adversary injects the Trojan after the model is trained. Another line of work, e.g., STRIP [15], SentiNet [16], and Februus [17], aims to detect Trojan inputs at runtime. It is hard to distinguish a misclassification from a Trojan attack for a single test input, so these runtime detection methods make assumptions about the attack, which stronger attacks can violate. For example, STRIP fails to detect Trojan inputs when the Trojan trigger is located around the center of an image or overlaps with the main object (e.g., feature-space attacks). Another limitation is that they examine test inputs with various heavyweight tests, significantly delaying the response time.
Trigger reverse-engineering [19-23, 48-50] makes no assumptions about the attack method (e.g., poisoning or supply-chain attacks) and does not affect runtime performance. It inspects the model to check whether a Trojan exists before deployment. Given a DNN model $M$ and a small set of clean samples $\mathcal{X}$, trigger reverse-engineering methods try to reconstruct injected triggers. If reverse-engineering is successful, the model is marked as malicious. Neural Cleanse (NC) [19] performs reverse-engineering by solving Eq. 1:

$$\min_{m,t}\; \mathcal{L}\big(M((1-m)\odot x + m\odot t),\, y_t\big) + r_\star \qquad (1)$$

where $x \in \mathcal{X}$, $m$ is the trigger mask (i.e., a binary matrix with the same size as the input that determines whether a value is replaced by the trigger), $t$ is the trigger pattern (i.e., a matrix with the same size as the input containing the trigger values), and $r_\star$ represents attack constraints (e.g., the trigger size is smaller than 1/4 of the image). $\mathcal{L}$ is the cross-entropy loss function. Most prior works [20-23] follow the same methodology and inherently suffer from the same limitations. First, they assume that an input-space perturbation, denoted by $(m, t)$, can represent a trigger. This assumption is valid for input-space triggers but does not hold for feature-space attacks. Second, $r_\star$ encodes heuristics observed from existing attacks. For example, NC observed that most triggers are small and limits the trigger size to be no larger than a threshold; otherwise, the trigger would overlap with the main object and decrease benign accuracy. In practice, more advanced attacks break such heuristics. For instance, DFST [3] leverages CycleGAN to transfer images from one style to another without changing their semantics; it changes almost all pixels in a given image. This paper proposes a novel reverse-engineering method that overcomes the above limitations for image classifiers.
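The following is a minimal PyTorch sketch of an Eq. 1-style search (our simplified rendering, not NC's released code; the 32x32 input size, the sigmoid/tanh parameterization, and the L1 weight standing in for $r_\star$ are all assumptions):

```python
import torch

def reverse_engineer_input_trigger(model, loader, target, steps=500,
                                   lr=0.1, size_weight=1e-2, device="cpu"):
    """Eq. 1-style search: find (mask, pattern) so that stamped clean
    inputs (1-m)*x + m*t are classified as `target`."""
    model.eval()
    # Parametrize mask/pattern through sigmoid/tanh to keep them in valid ranges.
    m_raw = torch.zeros(1, 1, 32, 32, device=device, requires_grad=True)
    t_raw = torch.zeros(1, 3, 32, 32, device=device, requires_grad=True)
    opt = torch.optim.Adam([m_raw, t_raw], lr=lr)
    ce = torch.nn.CrossEntropyLoss()
    batches = iter(loader)
    for _ in range(steps):
        try:
            x, _ = next(batches)
        except StopIteration:
            batches = iter(loader)
            x, _ = next(batches)
        x = x.to(device)
        m = torch.sigmoid(m_raw)                 # mask in [0, 1]
        t = torch.tanh(t_raw) * 0.5 + 0.5        # pattern in [0, 1]
        stamped = (1 - m) * x + m * t
        y_t = torch.full((x.size(0),), target, dtype=torch.long, device=device)
        # Classification term plus a mask-size penalty standing in for r*.
        loss = ce(model(stamped), y_t) + size_weight * m.abs().sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(m_raw).detach(), (torch.tanh(t_raw) * 0.5 + 0.5).detach()
```

In an NC-style pipeline this routine would be run once per candidate target label, and an abnormally small reverse-engineered mask flags the model; for a feature-space Trojan, however, no small (m, t) pair reaches a high attack success rate, which is the failure mode our method addresses.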
3 Methodology
3.1 Threat Model
This work aims to determine whether a given model has a Trojan by reverse-engineering the corresponding trigger. Following existing works [19, 20, 51], we assume access to the model and a small dataset containing correctly labeled benign samples for each label. In practice, such datasets can be gathered from the Internet. We make no assumptions on how the attacker injects the Trojan (poisoning or supply-chain attack). The attack can be formally defined as

$$M(x) = y, \quad M(F(x)) = y_T, \quad \forall x \in \mathcal{X},$$

where $M$ is the Trojaned model, $x$ is a clean input sample, and $y_T$ is the target label. $F$ is the function that constructs Trojan samples: input-space triggers add static input perturbations, and feature-space triggers are input transformations. The key difference between our work and existing work is that we consider feature-space triggers.
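Under this definition, the attack success rate (ASR) and benign accuracy (BA) reported later can be computed as below; this is a straightforward sketch in which trigger_fn stands for the attacker's $F$ and is a placeholder of ours.

```python
import torch

@torch.no_grad()
def evaluate_asr_ba(model, loader, trigger_fn, target_label, device="cpu"):
    """BA: accuracy of M(x) on clean labels.
    ASR: fraction of non-target-class samples for which M(F(x)) == target_label."""
    model.eval()
    clean_ok = total = asr_ok = attacked = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        clean_ok += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
        keep = y != target_label              # samples already of the target class are excluded
        if keep.any():
            pred = model(trigger_fn(x[keep])).argmax(dim=1)
            asr_ok += (pred == target_label).sum().item()
            attacked += int(keep.sum())
    return clean_ok / total, asr_ok / max(attacked, 1)
```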
3.2 Observation
In DNNs, neuron activation values represent the model's functionality. Input neurons denote input-space features, and inner neurons extract more abstract inner features. Existing reverse-engineering methods constrain the optimization problem in the input space using domain-specific constraints or observations. For image classification tasks, the pixel values of an image must be valid RGB values; methods like NC observe that the trigger must be small and cannot overlap with the main object, and propose corresponding constraints. The most challenging problem for reverse-engineering feature-space triggers is how to constrain the optimization properly. Note that there exists a set of neurons that, when activated to specific values, triggers the Trojan behavior. Due to the black-box nature of DNNs, it is hard to identify which neurons are related to the Trojan behavior. Moreover, if the weights are scaled uniformly, the output of the DNN can remain the same, so it is hard to constrain concrete activation values. Without a proper constraint, we cannot form an optimization problem.

Our key observation to solve this problem is that the neuron activation values representing the Trojan behavior are orthogonal to others. Recall that one property of DNN Trojans is that when the trigger is added to any given input, the model predicts a specific label. That is, the trigger always works regardless of the actual content of the input. In the feature space, when the model recognizes the Trojan features, it predicts the target label regardless of the other features. These activation values therefore form a hyperplane in the high-dimensional space, so that they can be orthogonal to all others. Based on this intuition, we performed empirical experiments to confirm our idea. Specifically, we first use six Trojan attacks (i.e., BadNets [1], Clean-label attack [26], Filter attack [20], WaNet [7], SIG [27], and Input-aware dynamic attack [8]) to generate Trojaned ResNet18 models on CIFAR-10. We then visualize the feature space of the last convolutional layer of these models. In Fig. 2, three dimensions, X, Y, and Z, represent the feature space. We first apply PCA to obtain two eigenvectors of the benign training set and use them as the X-axis and Y-axis. For the Z-axis, we first construct Trojan inputs to activate the model's Trojan behavior and find neurons highly related to the Trojan. We then use the DNN interpretability technique SHAP [52] to estimate each neuron's importance to the Trojan behavior; the neurons in the top 3% are the compromised neurons. The Z-axis denotes the activation values of the compromised neurons, namely $z = \|A(F(x)) \odot m\|$, where $m$ denotes a mask revealing the positions of the compromised neurons. Fig. 2 shows that most Trojan inputs have similar z-values: they form a linear hyperplane in the feature space, while benign inputs do not.
Fig. 2: Feature space of Trojaned models (Clean Label, BadNets, Filter, WaNet, Input-aware Dynamic, SIG).
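The coordinates plotted in Fig. 2 can be reproduced along the following lines (a sketch under our assumptions: the SHAP-based selection of the top-3% compromised neurons is abstracted into a precomputed mask, and features are flattened):

```python
import torch

@torch.no_grad()
def trojan_scatter_coords(A, x_benign, x_trojan, mask):
    """X/Y: projections onto the top-2 PCA directions of the benign features.
    Z: norm of the compromised-neuron activations, z = ||A(x) * mask||."""
    feat_b = A(x_benign).flatten(1)          # (N, D) benign inner features
    feat_t = A(x_trojan).flatten(1)          # (M, D) Trojan inner features
    mean_b = feat_b.mean(0, keepdim=True)
    # Top-2 eigenvectors of the benign feature covariance, via SVD of centered data.
    _, _, Vh = torch.linalg.svd(feat_b - mean_b, full_matrices=False)
    basis = Vh[:2]                           # (2, D)

    def coords(f):
        xy = (f - mean_b) @ basis.T                  # PCA projections -> X, Y
        z = (f * mask).norm(dim=1, keepdim=True)     # compromised-neuron magnitude -> Z
        return torch.cat([xy, z], dim=1)             # (num_samples, 3)

    return coords(feat_b), coords(feat_t)
```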
3.3 Feature Space Trojan Hyperplane Reverse-engineering
In this paper, we use $A$ to represent the submodel from the input space to the feature space and $B$ the submodel from the feature space to the output space. We also use $a = A(x)$ to denote the inner features of the model.
Algorithm 1 Feature-space Backdoor Reverse-engineering
Input: Model $M$
Output: Trojaned or not, Trojaned pairs $T$
1: function REVERSE-ENGINEERING($M$)
2:   for (target class $y_t$, source class $y_s$) in $K$ do
3:     for $e \in E$ do
4:       $x = \mathrm{sample}(\mathcal{X}_{y_s})$
5:       $cost_1 = \mathcal{L}(B((1-m)\odot a + m\odot t),\, y_t)$
6:       if $\|F(x) - x\| \ge \tau_1$ then
7:         $cost_1 = cost_1 + w_1 \cdot \|F(x) - x\|$
8:       if $\mathrm{std}(m \odot A(F(x))) \ge \tau_2$ then
9:         $cost_1 = cost_1 + w_2 \cdot \mathrm{std}(m \odot A(F(x)))$
10:      $\nabla\theta_F = \partial cost_1 / \partial \theta_F$
11:      $\theta_F = \theta_F - lr_1 \cdot \nabla\theta_F$
12:      $cost_2 = \mathcal{L}(B((1-m)\odot a + m\odot t),\, y_t)$
13:      if $\|m\| \ge \tau_3$ then
14:        $cost_2 = cost_2 + w_3 \cdot \|m\|$
15:      $\nabla m = \partial cost_2 / \partial m$
16:      $m = m - lr_2 \cdot \nabla m$
17:    if $\mathrm{ASR}(B((1-m)\odot a + m\odot t),\, y_t) > \lambda$ then
18:      $M$ is a Trojaned model
19:      $T.\mathrm{append}((y_s, y_t))$
Similar to reverse-engineering in the input space, given a model $M$ and a small set of benign inputs $\mathcal{X}$, we use a feature-space mask $m$ and a feature-space pattern $t$ to represent the feature-space Trojan hyperplane $H = \{a \mid m \odot a = m \odot t\}$. Specifically, we can update $m$ and $t$ via the following optimization process: $\min_{m,t} \mathcal{L}(B((1-m)\odot a + m\odot t),\, y_t)$, where $y_t$ is the target label. As discussed above, reverse-engineering in the feature space is challenging. In the input space, all values have natural physical semantics and constraints, e.g., a pixel value lies in the RGB range. Values in the feature space have uninterpretable meanings and are not strictly constrained, and whether the result will have physically meaningful semantics is also uncertain. We solve these challenges by simultaneously optimizing the input-space trigger function $F$ and the feature-space Trojan hyperplane $H$ to enforce that the trigger has semantic meaning. In detail, we compute the feature-space trigger pattern as the mean inner features of the samples generated by the trigger function, i.e., $t = \mathrm{mean}(m \odot A(F(\mathcal{X})))$. We also constrain the standard deviation of $m \odot A(F(\mathcal{X}))$ to make sure the features generated by the trigger function lie on a relaxation of the reverse-engineered hyperplane. Formally, our reverse-engineering can be written as the constrained optimization problem shown in Eq. 2, where $\mathcal{X}$ is the small set of clean samples. We use a deep neural network to model the trigger function (i.e., $F = G_\theta$) because of its expressiveness [23, 53]. Specifically, we use a representative deep neural network, UNet [54]. Given a model and a small set of clean inputs, the trigger function can be smoothly reconstructed via gradient-based methods, i.e., by optimizing the generative model $G_\theta$. In our default setting, $A$ and $B$ are separated at the last convolutional layer. More discussion is in the Appendix (§ 4.5).
$$
\begin{aligned}
\min_{F,\,m}\;& \mathcal{L}\big(B((1-m)\odot a + m\odot t),\, y_t\big), \quad \text{where } t = \mathrm{mean}(m \odot A(F(\mathcal{X}))),\; a \in A(\mathcal{X}) \\
\text{s.t. }\;& \|F(\mathcal{X}) - \mathcal{X}\| \le \tau_1, \quad \mathrm{std}(m \odot A(F(\mathcal{X}))) \le \tau_2, \quad \|m\| \le \tau_3
\end{aligned}
\qquad (2)
$$
There are several constraints in the optimization problem: (1) The transformed samples should be similar to the original images due to the properties of Trojan attacks, i.e., $\|F(x) - x\| \le \tau_1$; typically, Trojan samples are visually similar to the original samples for stealthiness. In detail, we use the MSE (Mean Squared Error) to measure the distance between $F(x)$ and $x$. (2) The Trojan features should lie in a relaxation of the reverse-engineered feature-space Trojan hyperplane, i.e., $P(a \in H_\star \mid x \in F(\mathcal{X}))$ should be high. To achieve this goal, we constrain the standard deviation of $m \odot A(F(\mathcal{X}))$ to be smaller than $\tau_2$.
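Finally, a condensed PyTorch sketch of Algorithm 1's alternating updates is given below. It is our simplified rendering, not the released FeatureRE implementation: the feature split (A, B) is assumed to operate on flattened features, G is the UNet-style trigger generator, mask_raw is the unconstrained mask parameter, and the thresholds τ1-τ3 and weights w1-w3 are placeholder values.

```python
import torch

def feature_re_step(A, B, G, mask_raw, x, y_t, opt_G, opt_m,
                    tau=(0.1, 0.5, 8.0), w=(1.0, 1.0, 1.0)):
    """One alternating step of the Algorithm 1 loop:
    first update the trigger generator G (our F), then the feature-space mask."""
    ce = torch.nn.CrossEntropyLoss()
    yt = torch.full((x.size(0),), y_t, dtype=torch.long, device=x.device)
    a = A(x).flatten(1).detach()                     # benign inner features (A, B stay frozen)
    # --- update the trigger function F = G (Algorithm 1, lines 5-11) ----
    m = torch.sigmoid(mask_raw)                      # soft feature-space mask
    fx = G(x)                                        # candidate Trojan samples F(x)
    t = (m * A(fx).flatten(1)).mean(0)               # hyperplane pattern t
    cost1 = ce(B((1 - m) * a + m * t), yt)
    dist = (fx - x).pow(2).mean()                    # similarity constraint, ||F(x)-x|| <= tau1
    if dist >= tau[0]:
        cost1 = cost1 + w[0] * dist
    spread = (m * A(fx).flatten(1)).std(0).mean()    # hyperplane constraint, std <= tau2
    if spread >= tau[1]:
        cost1 = cost1 + w[1] * spread
    opt_G.zero_grad(); cost1.backward(); opt_G.step()
    # --- update the feature-space mask m (Algorithm 1, lines 12-16) -----
    m = torch.sigmoid(mask_raw)
    feat = A(G(x)).flatten(1).detach()               # freeze the generator while updating the mask
    t = (m * feat).mean(0)
    cost2 = ce(B((1 - m) * a + m * t), yt)
    if m.abs().sum() >= tau[2]:                      # mask size constraint, ||m|| <= tau3
        cost2 = cost2 + w[2] * m.abs().sum()
    opt_m.zero_grad(); cost2.backward(); opt_m.step()
    return float(cost1), float(cost2)
```

In the full procedure this step is repeated for many epochs over each (source, target) class pair, and the model is flagged as Trojaned when the attack success rate of $B((1-m)\odot a + m\odot t)$ exceeds $\lambda$, as in lines 17-19 of Algorithm 1.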