multiple triggers into different channels (i.e., the RGB channels) of an image [50], respectively. However,
the number of possible target classes in these works is limited to 4 and 3, bounded by the attack
performance and the number of channels (i.e., 3 in RGB images). The work of DeepPayload [25]
attempts to inject the malicious logic directly through reverse engineering instead of training the
backdoor into the model, and it does not consider varying the target backdoor class as this paper does.
Our contributions are summarized below:
• We propose a new type of backdoor attack in which the adversary can flexibly attack any target label
during inference. This attack maliciously modifies the model by establishing a causal link between
the trigger function and all output classes.
• We propose a class-conditional generative trigger function that, given the target label, generates
an imperceptible trigger pattern that causes the model to predict that label. We then propose a
constrained optimization objective that effectively and efficiently learns the trigger function and
poisons the model (a conceptual sketch follows this list).
• Finally, we empirically demonstrate the effectiveness of the proposed method and its robustness
against several representative defense mechanisms. We show that the proposed method can
achieve high attack success rates for any arbitrarily chosen target class while preserving the
behavior of the model under normal conditions.
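To make the class-conditional trigger idea concrete, the sketch below illustrates one plausible form of such a trigger function in PyTorch: a small generator that takes a clean image and an arbitrary target label and outputs an imperceptibly perturbed image. This is an illustrative assumption rather than the implementation proposed in this paper; the architecture, embedding width, and perturbation bound epsilon are hypothetical, and the actual trigger function and training objective are detailed in Section 4.

    import torch
    import torch.nn as nn

    class TriggerGenerator(nn.Module):
        """Hypothetical class-conditional trigger function g(x, y) -> poisoned image."""
        def __init__(self, num_classes, image_channels=3, embed_dim=16, epsilon=8 / 255):
            super().__init__()
            self.epsilon = epsilon                      # perturbation bound for imperceptibility
            self.label_embed = nn.Embedding(num_classes, embed_dim)
            self.net = nn.Sequential(                   # toy convolutional generator
                nn.Conv2d(image_channels + embed_dim, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(32, image_channels, kernel_size=3, padding=1),
                nn.Tanh(),                              # output in [-1, 1], scaled by epsilon below
            )

        def forward(self, x, target):
            # Condition on the target label by broadcasting its embedding over the image plane.
            b, _, h, w = x.shape
            cond = self.label_embed(target).view(b, -1, 1, 1).expand(-1, -1, h, w)
            delta = self.net(torch.cat([x, cond], dim=1)) * self.epsilon
            return torch.clamp(x + delta, 0.0, 1.0)     # imperceptibly perturbed (poisoned) image

At attack time, the adversary can pick any class index, pass it to the generator together with the input image, and submit the resulting image to the victim model.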
The rest of the paper is organized as follows. We review the background of DNN backdoor attacks
in Section 2. The threat model is defined in Section 3. We present the details of the proposed
methodology in Section 4, and evaluate the performance and compare to prior works in Section 5.
Finally, Section 6 concludes the paper. We present more details about the experimental settings and
additional results in the supplementary material.
2 Background
2.1 Backdoor Attacks
In image classification tasks, backdoor attacks on DNNs seek to inject malicious behavior into the
model that associates a trigger with a target backdoor class [15, 28]; this behavior can also be interpreted
as the payload, as in malware backdoors and hardware Trojans. The injection of the backdoor is
typically achieved by poisoning the training data [15, 28] or by manipulating the training process or
model parameters [19, 12]. An important requirement for a backdoor attack is its
stealthiness, such that the existence of the backdoor in a model cannot be easily identified. Hence, a
successful backdoor attack should preserve the normal functionality, i.e., the inference accuracy on clean
images (images without the trigger).
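As a concrete illustration of the data-poisoning route, the following minimal sketch stamps a small patch trigger onto a fraction of the training images and relabels them to the attacker's target class, in the spirit of patch-based attacks; the poison rate, patch size, and HWC image layout are illustrative assumptions rather than settings taken from any cited work.

    import numpy as np

    def poison_dataset(images, labels, target_class, poison_rate=0.1, patch_size=3, seed=0):
        """Return a copy of the dataset with a fraction of samples carrying a patch trigger."""
        rng = np.random.default_rng(seed)
        images, labels = images.copy(), labels.copy()
        idx = rng.choice(len(images), int(poison_rate * len(images)), replace=False)
        # Stamp a white square in the bottom-right corner (assumes HWC images scaled to [0, 1]).
        images[idx, -patch_size:, -patch_size:, :] = 1.0
        # Relabel the poisoned samples so the model learns the trigger -> target_class association.
        labels[idx] = target_class
        return images, labels

A model trained on the returned dataset behaves normally on clean inputs but tends to predict target_class whenever the patch is present.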
The design of the trigger has been extensively studied in the literature, ranging from early, visually obvious
patch-based triggers [7, 28] to more invisible ones based on image blending [7], sinusoidal strips (SIG) [2],
reflections (ReFool) [29], single pixels [1], warping (WaNet) [35], discrete cosine transform (DCT)
steganography [50], and adversarial example generation [41, 22]. As opposed to a universal trigger,
several recent works have investigated input-aware backdoor attacks that minimize the visibility of
the trigger by generating the trigger pattern based on the content of each input image [8, 34, 10, 9].
For instance, LIRA [10] trains a generative model as the trigger function that produces a trigger for each
image, while simultaneously injecting the backdoor into the model; this approach has been shown to generate
completely invisible triggers.
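A simplified view of this joint-training idea is sketched below: a generator produces a bounded, input-dependent perturbation while the classifier is optimized to label clean images correctly and perturbed images as the target class. This is not LIRA's exact algorithm; the equal loss weighting, the perturbation bound, and the single fixed target are simplifying assumptions.

    import torch
    import torch.nn.functional as F

    def joint_training_step(classifier, generator, x, y, target_class, opt_f, opt_g, epsilon=8 / 255):
        """One step that trains the backdoored classifier and the trigger generator together."""
        delta = torch.clamp(generator(x), -epsilon, epsilon)   # input-dependent, bounded trigger
        x_poisoned = torch.clamp(x + delta, 0.0, 1.0)
        target = torch.full_like(y, target_class)              # fixed target label (all-to-one)
        # Clean objective keeps normal accuracy; backdoor objective enforces the trigger behavior.
        loss = F.cross_entropy(classifier(x), y) + F.cross_entropy(classifier(x_poisoned), target)
        opt_f.zero_grad()
        opt_g.zero_grad()
        loss.backward()
        opt_f.step()
        opt_g.step()
        return loss.item()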
All of these attacks, under either the all-to-one or the all-to-all scenario, can only manipulate the prediction
of a given input image toward a single target class. While the works in [49, 50] considered a less narrow
form of the payload, the number of possible target backdoor classes is still limited, i.e., to 3 or 4. In
contrast, this paper exploits a much stronger attack that is able to misclassify a given input image into
any arbitrary target class.
2.2 Backdoor Defenses
Meanwhile, various backdoor defenses have also been developed, aimed at either detecting [5, 44, 13]
or mitigating [26, 45, 6, 36, 24] the attacks. Popular methods include Neural Cleanse [45], which detects
the backdoor by searching for possible trigger patches; fine-pruning [26], which prunes the model to
erase the backdoor; and the spectral signature approach [44], which detects outliers based on the