Marksman Backdoor: Backdoor Attacks with
Arbitrary Target Class
Khoa D. Doan, Yingjie Lao, Ping Li
Cognitive Computing Lab
Baidu Research
10900 NE 8th St. Bellevue, WA 98004, USA
{khoadoan106, laoyingjie, pingli98}@gmail.com
Abstract
In recent years, machine learning models have been shown to be vulnerable to
backdoor attacks. Under such attacks, an adversary embeds a stealthy backdoor
into the trained model such that the compromised model behaves normally on
clean inputs but misclassifies, as the adversary dictates, maliciously constructed
inputs that contain a trigger. While these existing attacks are very effective,
the adversary's capability is limited: given an input, such attacks can only cause
the model to misclassify toward a single pre-defined target class. In contrast,
this paper presents a novel backdoor attack with a much more powerful payload,
denoted as Marksman, where the adversary can arbitrarily choose which target
class the model will misclassify toward, given any input, during inference. To achieve this
goal, we propose to represent the trigger function as a class-conditional generative
model and to inject the backdoor within a constrained optimization framework, where
the trigger function learns to generate an optimal trigger pattern to attack any target
class at will while this generative backdoor is simultaneously embedded into the
trained model. Given the learned trigger-generation function, during inference the
adversary can specify an arbitrary target class, and an appropriate trigger that causes
the model to classify toward this target class is created accordingly.
We show empirically that the proposed framework achieves high attack performance
(e.g., 100% attack success rates in several experiments) while preserving clean-data
performance on several benchmark datasets, including MNIST, CIFAR10,
GTSRB, and TinyImageNet. The proposed Marksman backdoor attack can also
easily bypass existing backdoor defenses that were originally designed against
backdoor attacks with a single target class. Our work takes another significant step
toward understanding the extensive risks of backdoor attacks in practice.
1 Introduction
Machine learning, especially deep neural networks (DNNs), is rapidly advancing and transforming our daily
lives across various fields and applications. Such intelligence is becoming prevalent and pervasive, embedded
ubiquitously from centralized servers to fully distributed Internet-of-Things (IoT) devices. Unfortunately,
since well-trained models are now viewed as high-value assets that demand extensive computational
resources, annotated data, and machine learning expertise, they are becoming increasingly attractive
targets for cyberattacks [20, 51, 52]. Prior research has shown that deep learning algorithms are vulnerable
to a wide range of attacks, including adversarial examples [3, 31], poisoning attacks [33, 38, 17],
backdoor attacks [30, 28, 15, 11], and privacy leakage [39, 14]. Among these, backdoor attacks expose
a vulnerability in the model-building supply chain: an adversary seeks to inject a stealthy backdoor into
a model by poisoning the data or manipulating the training process [30, 28, 15].
Figure 1: The payloads of Marksman and the existing backdoor attacks (top row). Marksman can
attack an arbitrary target at will by generating a suitable trigger pattern via the class-conditional
generative trigger function to cause the classifier to predict the chosen target (details in bottom row).
Ideally, the model with an injected backdoor should behave normally on clean inputs, but an input will be
misclassified into the target class whenever the trigger is present.
Past years have seen the development of backdoor attacks with various trigger forms, such as the
patch-based triggers in BadNets [15] and TrojanNN [28], blended and dynamic triggers [7, 37], and, more recently,
input-aware, invisible triggers [8, 34, 10, 9, 35]. In the fields of malware backdoors [4] and hardware
Trojans [42, 48], these attacks are typically decomposed into two main components: the trigger, which
determines the activation mechanism, and the payload, which controls the modified malicious
behavior. Compared with the trigger mechanism, however, the payload of backdoor attacks on DNNs
is much less studied. The majority of existing approaches consider either 1) all-to-one attacks,
where all inputs with the trigger are mapped to one specific target class, or 2) all-to-all attacks,
where inputs from each true class are assigned a different target label [15]. Despite what its name
suggests, an all-to-all attack can still only map the inputs of a given true class to one target class.
In other words, these prior works are single-trigger and single-payload backdoor attacks with predefined
target class(es). Even for the input-aware attacks that generate the trigger based on the content of
the input image to minimize its perceptual distinction, the target backdoor class is still predefined.
In this paper, we explore the design of a backdoor attack whose target class can be chosen arbitrarily after
the backdoor is injected into the model. Since this requires expanding the payload capability of the backdoor,
implementing such an attack efficiently and effectively is not trivial. One may argue that an
adversary could repeatedly inject a different trigger pattern for every target class to achieve this
adversarial objective; for instance, the adversary could assign a specific trigger pattern to each target
class and inject all of these patterns into the model using the patch-based backdoor
strategy. Obviously, this method leads to a much larger model perturbation than a single-trigger,
single-payload attack: as the number of target classes increases, both the attack success rate
(ASR) and the clean-data accuracy degrade significantly.
To tackle these challenges, we follow a concept similar to invisible and input-aware backdoor
attacks [8, 34, 10] that train a generative model, usually called the trigger generator or trigger function,
during backdoor injection to generate triggers. To activate the backdoor, the adversary
feeds the input image to the trigger generator/function, which embeds an input-specific
trigger into the image. In these scenarios, the secret held by the adversary is the trigger function
itself rather than a fixed pixel pattern as in patch-based backdoor attacks. We efficiently incorporate the
malicious functionalities that link to all output classes into the trigger function to expand the
payload. Consequently, as depicted in Figure 1, the adversary can arbitrarily choose the target class toward
which any given input is misclassified during inference, significantly enhancing the adversarial capability of the
backdoor attack. The only works we are aware of that consider varying the payload are [49, 50], which
yield different target classes by controlling the intensities of the same backdoor [49] and by embedding
multiple triggers into different channels (i.e., RGB channels) of an image [50], respectively. However,
the numbers of possible target classes in these works are limited to 4 and 3, bounded by the attack
performance and the number of channels (i.e., 3 for RGB images), respectively. DeepPayload [25]
directly injects malicious logic through reverse engineering instead of training the backdoor into the
model, and it does not consider varying the target backdoor class as in this paper.
Our contributions are summarized below:
• We propose a new type of backdoor attack in which the adversary can flexibly attack any target label
during inference. This attack maliciously modifies the model by establishing a causal link between
the trigger function and all output classes.
• We propose a class-conditional generative trigger function that, given the target label, can generate
an imperceptible trigger pattern to cause the model to predict that label. We then propose a
constrained optimization objective that can effectively and efficiently learn the trigger function and
poison the model.
• Finally, we empirically demonstrate the effectiveness of the proposed method and its robustness
against several representative defense mechanisms. We show that the proposed method achieves
high attack success rates with any arbitrarily chosen target class while preserving the
behavior of the model under normal conditions.
The rest of the paper is organized as follows. We review the background on DNN backdoor attacks
in Section 2. The threat model is defined in Section 3. We present the details of the proposed
methodology in Section 4 and evaluate its performance against prior works in Section 5.
Finally, Section 6 concludes this paper. More details about the experimental settings and
additional results are provided in the supplementary material.
2 Background
2.1 Backdoor Attacks
In image classification tasks, backdoor attacks on DNNs seek to inject malicious behavior into the
model that associates a trigger with a target backdoor class [15, 28]; this association can also be interpreted
as the payload, as in malware backdoors and hardware Trojans. The injection of the backdoor is
typically achieved by poisoning the training data [15, 28] or by manipulating the training process or
model parameters [19, 12]. An important requirement for a backdoor attack is its
stealthiness, such that the existence of the backdoor in a model cannot be easily identified. Hence, a
successful backdoor attack should preserve the normal functionality, i.e., the inference accuracy on clean
images (images without the trigger).
The design of the trigger has been extensively studied in the literature, from early, visually obvious patch-
based triggers [7, 28] to more invisible ones based on blending [7], sinusoidal strips (SIG) [2],
reflection (ReFool) [29], single-pixel perturbation [1], warping (WaNet) [35], discrete cosine transform (DCT)
steganography [50], and adversarial example generation [41, 22]. As opposed to a universal trigger,
several recent works have investigated input-aware backdoor attacks that minimize the visibility of
the trigger by generating the trigger pattern based on the content of each input image [8, 34, 10, 9].
For instance, LIRA [10] trains a generative model as the trigger function for each image while
simultaneously injecting the backdoor into the model, and it has been shown to generate
completely invisible triggers.
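For contrast with the class-conditional generator sketched in Section 1, the following minimal sketch shows the general shape of an input-aware trigger function as we understand it (hypothetical names and architecture; not LIRA's actual code): it conditions only on the image, so whichever target class the backdoor maps to was fixed when the backdoor was injected.

```python
import torch
import torch.nn as nn

class InputAwareTrigger(nn.Module):
    """Hypothetical input-aware trigger function: T(x) depends only on x."""
    def __init__(self, channels: int = 3, epsilon: float = 0.05):
        super().__init__()
        self.epsilon = epsilon
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The perturbation varies per image, but there is no target argument:
        # the class the backdoor maps to was fixed during injection.
        return (x + self.epsilon * self.net(x)).clamp(0, 1)

x = torch.rand(4, 3, 32, 32)
x_backdoor = InputAwareTrigger()(x)   # always drives f toward the same predefined class
```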
All of these attacks, under either the all-to-one or the all-to-all scenario, can only manipulate the prediction
of a given input image toward one target class. While the works in [49, 50] considered a less narrow
form of payload, the number of possible target backdoor classes is still limited, i.e., 3 or 4. In
contrast, this paper develops a much stronger attack that is able to misclassify a given input image to
any arbitrary target class.
2.2 Backdoor Defenses
Meanwhile, various backdoor defense solutions have also been developed, aimed at either detecting [5, 44, 13]
or mitigating [26, 45, 6, 36, 24] the attacks. Popular methods include Neural
Cleanse [45], which detects a backdoor by searching for possible trigger patches; fine-pruning [26],
which prunes the model to erase the backdoor; spectral signatures [44], which detect outliers based on the
latent representations; and STRIP [13], which uses input perturbations to detect potential backdoor triggers.
In addition, input mitigation methods have been studied that filter images containing triggers
to prevent activation of the backdoor [30, 24].
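As a rough illustration of the perturbation-based idea behind STRIP, the sketch below (our own simplification, not the defense's reference implementation) blends a suspect input with random clean images and measures the average prediction entropy: inputs dominated by a strong trigger tend to keep predicting the target class and therefore show unusually low entropy.

```python
import torch
import torch.nn.functional as F

def strip_style_entropy(model, x, overlay_pool, num_overlays=16, alpha=0.5):
    """Average prediction entropy of `x` blended with random clean images.

    Low average entropy suggests the prediction is dominated by a trigger
    (a possible backdoor input); high entropy is typical of clean inputs.
    """
    entropies = []
    for _ in range(num_overlays):
        idx = torch.randint(len(overlay_pool), (1,)).item()
        blended = (1 - alpha) * x + alpha * overlay_pool[idx]
        probs = F.softmax(model(blended.unsqueeze(0)), dim=1)
        entropies.append(-(probs * probs.clamp_min(1e-12).log()).sum().item())
    return sum(entropies) / len(entropies)

# Hypothetical usage:
# score = strip_style_entropy(f, suspect_image, clean_images)
# if score < threshold: flag `suspect_image` as a potential backdoor input
```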
A successful backdoor attack has to be able to bypass the existing defenses. We evaluate our proposed
Marksman backdoor against representative defensive solutions in our experiments.
3 Threat Model
Consistent with prior works on input-aware backdoor attacks that train a generative model for trigger
generation [8, 34, 10, 11], we consider the threat model in which the adversary has full access to
the model. Note that this setting differs from backdoor attacks that only target poisoned-data
generation [15, 28]. During the training phase, the adversary attempts to inject the backdoor
into the model. The model is then delivered to victim users, who might employ existing
backdoor defense measures to inspect it. During the inference phase, the adversary is able
to query the victim model with any inputs.
4 Proposed Methodology: Marksman Backdoor
4.1 Preliminaries
Consider the supervised learning setting where the goal is to learn a classifier $f_\theta: \mathcal{X} \rightarrow \mathcal{Y}$ that maps an input $x \in \mathcal{X}$ to a label $y \in \mathcal{Y}$. In empirical risk minimization (ERM), the parameters $\theta$ are learned using a training dataset $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$.
In a standard backdoor attack, a subset of $M$ ($M < N$) examples is first selected from $S$ to create the poisoned subset $S_p$. Each sample $(x, y)$ in this subset is transformed into a backdoor sample $(T(x), \eta(y))$, where $T: \mathcal{X} \rightarrow \mathcal{X}$ is the trigger function and $\eta$ is the target labeling function. The trigger function $T$ determines how a trigger pattern is placed on the input $x$ to create the backdoor input $T(x)$, while the target labeling function specifies how the classifier should predict in the presence of the backdoor input. The remaining samples in $S$ comprise the clean subset $S_c$, i.e., $S_c = S \setminus S_p$.
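As a minimal sketch, the poisoned subset $S_p$ could be assembled as follows; `trigger_fn` and `target_fn` are hypothetical stand-ins for $T$ and $\eta$, and the 10% poisoning fraction is only a placeholder.

```python
import random

def build_poisoned_split(dataset, trigger_fn, target_fn, poison_fraction=0.1):
    """Split a list of (x, y) pairs into clean samples S_c and backdoor samples S_p.

    Each selected sample (x, y) becomes (T(x), eta(y)).
    """
    indices = list(range(len(dataset)))
    random.shuffle(indices)
    num_poison = int(poison_fraction * len(dataset))
    poison_idx = set(indices[:num_poison])

    clean, poisoned = [], []
    for i, (x, y) in enumerate(dataset):
        if i in poison_idx:
            poisoned.append((trigger_fn(x), target_fn(y)))  # (T(x), eta(y))
        else:
            clean.append((x, y))
    return clean, poisoned
```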
Under ERM, we can alter the behavior of the classifier $f$ (i.e., inject the backdoor) by training $f$ with both the clean samples $S_c$ and the backdoor samples $S_p$, as follows:
$$\theta = \arg\min_{\theta} \sum_{(x,y) \in S_c \cup S_p} \mathcal{L}(f_\theta(x), y),$$
where $\mathcal{L}$ is the classification loss, e.g., the cross-entropy loss. During inference, for a clean input $x$ and its true label $y$, the learned $f$ will behave as follows:
$$f(x) = y, \qquad f(T(x)) = \eta(y).$$
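The objective above can be optimized with an ordinary training loop over $S_c \cup S_p$. The PyTorch sketch below is only meant to show that the backdoor is injected through standard ERM; the model, data loaders, optimizer, and hyperparameters are assumptions, not the authors' actual training configuration.

```python
import torch
import torch.nn as nn

def train_backdoored_classifier(model, clean_loader, poisoned_loader,
                                epochs=10, lr=1e-3, device="cpu"):
    """Minimize the cross-entropy loss over both clean and backdoor samples."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.to(device).train()
    for _ in range(epochs):
        # Iterate over S_c and S_p; both contribute to the same ERM objective.
        for loader in (clean_loader, poisoned_loader):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                optimizer.zero_grad()
                loss = criterion(model(x), y)
                loss.backward()
                optimizer.step()
    return model
```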
4.2 Marksman’s Payload: Arbitrary Attack Target Class
The training process described in the previous section essentially induces the payload, i.e., the causal
association between the trigger and the target label. As discussed above, there are two common
types of payload in the backdoor domain [15]: all-to-one and all-to-all. Under the all-to-one attack,
every input with the trigger is predicted as a constant label, denoted $\hat{y}$, regardless of the original label $y$:
$$f(T(x)) = \hat{y}, \quad \forall (x, y).$$
For the all-to-all attack, an input with the trigger is predicted as a label that depends on its true
label $y$; for example, a commonly studied target function is
$$f(T(x)) = (y + 1) \bmod |\mathcal{Y}|, \quad \forall (x, y).$$
Note that, for both the all-to-one and all-to-all attacks, the attacker can only trigger one predefined
target label for a given input. During inference, it is not possible to causally force $f$ to predict an
arbitrary choice of target label.
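For concreteness, the two conventional payloads can be written as target labeling functions $\eta(y)$; in this small sketch, the constant target and the number of classes are placeholders.

```python
NUM_CLASSES = 10   # |Y|, placeholder
TARGET = 7         # the constant target y_hat for all-to-one, placeholder

def eta_all_to_one(y: int) -> int:
    # Every triggered input is mapped to the same predefined class.
    return TARGET

def eta_all_to_all(y: int) -> int:
    # Each true class is shifted to the next class (a commonly studied choice).
    return (y + 1) % NUM_CLASSES

# Neither function accepts a target argument chosen at inference time;
# Marksman's payload, in contrast, takes the desired class as an input.
```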