target/random label while the model maintains high accuracy for benign input x. The perturbation pattern is known as the Trojan trigger. Trojan attacks can happen during training (e.g., data poisoning) or model distribution (e.g., changing model weights in a supply-chain attack). Existing works have shown Trojan attacks against different DNN models, including computer vision models [1, 2, 24], Graph Neural Networks (GNNs) [28, 29], Reinforcement Learning (RL) [30, 31], Natural Language Processing (NLP) [32–37], recommendation systems [38], malware detection [10], pretrained models [21, 39, 40], active learning [41], and federated learning [42, 43]. The Trojan trigger can be a simple input pattern (e.g., a yellow pad) [1, 2, 24] or a complex input transformation function (e.g., a CycleGAN that changes the input style) [3, 5, 7–9]. If the trigger is a static input-space perturbation (e.g., a yellow pad), the attack is known as an input-space Trojan; if the trigger is an input feature (e.g., an image style), the attack is referred to as a feature-space Trojan.
There are different types of Trojan defenses. One line of work [13, 14, 44] attempts to remove poisoned samples by cleaning the training dataset. Training-based methods [45–47] train benign classifiers even on a poisoned dataset. These training-time approaches work for poisoning-based attacks but fail to defend against supply-chain attacks, where the adversary injects the Trojan after the model is trained. Another line of work, e.g., STRIP [15], SentiNet [16], and Februus [17], aims to detect Trojan inputs at runtime. For a given test input, however, it is hard to distinguish a misclassification from a Trojan attack. These runtime detection methods make assumptions about the attack that stronger attacks can violate. For example, STRIP fails to detect Trojan inputs when the trigger is located near the center of an image or overlaps with the main object (e.g., in feature-space attacks). Another limitation is that they subject test inputs to various heavyweight tests, significantly delaying the response time.
Trigger reverse engineering [19–23, 48–50] makes no assumptions about the attack method (e.g., poisoning or supply-chain attacks) and does not affect runtime performance. It inspects the model to check whether a Trojan exists before deployment. Given a DNN model M and a small set of clean samples X, trigger reverse engineering methods try to reconstruct the injected trigger. If reverse engineering succeeds, the model is marked as malicious. Neural Cleanse (NC) [19] proposes to perform reverse engineering by solving Eq. 1:
\[
\min_{m,t}\; L\big(M\big((1-m)\odot x + m\odot t\big),\, y_t\big) + R \tag{1}
\]
where x ∈ X, m is the trigger mask (i.e., a binary matrix of the same size as the input that determines whether each value is replaced by the trigger), t is the trigger pattern (i.e., a matrix of the same size as the input containing the trigger values), and R denotes the attack constraints (e.g., the trigger size is smaller than 1/4 of the image). L is the cross-entropy loss function. Most prior works [20–23]
follow the same methodology and inherently suffer from the same limitations. First, they assume that a trigger can be represented by an input-space perturbation, denoted by (m, t). This assumption is valid for input-space triggers but does not hold for feature-space attacks. Second, the constraints R are heuristics observed from existing attacks. For example, NC observes that most triggers are small and limits the trigger size to be no larger than a threshold; otherwise, the trigger would overlap with the main object and decrease benign accuracy. In practice, more advanced attacks can break such heuristics. For instance, DFST [3] leverages CycleGAN to transfer images from one style to another without changing their semantics, altering almost all pixels in a given image. This paper proposes a novel reverse engineering method that overcomes the above limitations for image classifiers.
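To make Eq. 1 concrete, the following is a minimal PyTorch-style sketch of NC-style trigger reverse engineering. The function name, optimizer settings, and the weight lambda_reg are illustrative placeholders rather than NC's actual implementation, which additionally schedules the constraint weight and uses further heuristics.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, clean_batch, target_label,
                             steps=500, lr=0.1, lambda_reg=1e-2):
    """Sketch of NC-style trigger reverse engineering (Eq. 1).

    `model`, `clean_batch` (an N x C x H x W tensor of clean samples),
    `target_label`, and `lambda_reg` are illustrative placeholders.
    """
    model.eval()
    n, c, h, w = clean_batch.shape
    # Optimize unconstrained parameters; sigmoid keeps mask/pattern in [0, 1].
    mask_param = torch.zeros(1, 1, h, w, requires_grad=True)
    pattern_param = torch.zeros(1, c, h, w, requires_grad=True)
    opt = torch.optim.Adam([mask_param, pattern_param], lr=lr)
    target = torch.full((n,), target_label, dtype=torch.long)

    for _ in range(steps):
        m = torch.sigmoid(mask_param)      # trigger mask m
        t = torch.sigmoid(pattern_param)   # trigger pattern t
        stamped = (1 - m) * clean_batch + m * t
        # Cross-entropy toward the target label plus a mask-size penalty that
        # plays the role of the constraint term R ("triggers are small").
        loss = F.cross_entropy(model(stamped), target) + lambda_reg * m.sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_param).detach(), torch.sigmoid(pattern_param).detach()
```

In NC, this optimization is run once per candidate target label, and a label whose reconstructed mask is anomalously small (an outlier under a median-absolute-deviation test on the mask L1 norms) is reported as Trojaned.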
3 Methodology
3.1 Threat Model
This work aims to determine whether a given model contains a Trojan by reverse-engineering the corresponding trigger. Following existing works [19, 20, 51], we assume access to the model and a small dataset containing correctly labeled benign samples of each label. In practice, such datasets can be gathered from the Internet. We make no assumptions about how the attacker injects the Trojan (poisoning or supply-chain attack). The attack can be formally defined as: M(x) = y and M(F(x)) = y_T for all x ∈ X, where M is the Trojaned model, x is a clean input sample with correct label y, y_T is the target label, and F is the function that constructs Trojan samples. Input-space triggers add static input perturbations, while feature-space triggers are input transformations. The key difference between our work and existing work is that we also consider feature-space triggers.
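For illustration only, the sketch below contrasts the two forms of F in this notation; `pattern`, `mask`, and `style_transfer` are hypothetical placeholders rather than components of any particular attack.

```python
import torch

def input_space_trigger(x, pattern, mask):
    """Input-space F: stamp a fixed patch (e.g., a yellow pad) onto x.

    `pattern` and `mask` are static tensors shared by all Trojan samples,
    so the pixel perturbation is the same for every input.
    """
    return (1 - mask) * x + mask * pattern

def feature_space_trigger(x, style_transfer):
    """Feature-space F: apply an input transformation such as a style change.

    `style_transfer` stands in for a generative model (e.g., a CycleGAN
    generator); the resulting perturbation has no fixed pixel pattern and
    may alter almost every pixel of x.
    """
    return style_transfer(x)
```

Either form of F drives the Trojaned model M to output y_T, but only the first is expressible as the (m, t) perturbation of Eq. 1, which is the gap our method targets.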