Few-shot Backdoor Attacks via Neural Tangent Kernels
Jonathan Hayase and Sewoong Oh
Paul G. Allen School of Computer Science and Engineering
University of Washington
{jhayase,sewoong}@cs.washington.edu
Abstract
In a backdoor attack, an attacker injects corrupted examples into the training set. The goal of the
attacker is to cause the final trained model to predict the attacker’s desired target label when a predefined
trigger is added to test inputs. Central to these attacks is the trade-off between the success rate of the
attack and the number of corrupted training examples injected. We pose this attack as a novel bilevel
optimization problem: construct strong poison examples that maximize the attack success rate of the
trained model. We use neural tangent kernels to approximate the training dynamics of the model being
attacked and automatically learn strong poison examples. We experiment on subclasses of CIFAR-10
and ImageNet with WideResNet-34 and ConvNeXt architectures on periodic and patch trigger attacks
and show that NTBA-designed poison examples achieve, for example, an attack success rate of 90%
with ten times fewer poison examples injected than the baseline. We provide an
interpretation of the NTBA-designed attacks using the analysis of kernel linear regression. We further
demonstrate a vulnerability in overparametrized deep neural networks, which is revealed by the shape of
the neural tangent kernel.
1 Introduction
Modern machine learning models, such as deep convolutional neural networks and transformer-based language
models, are often trained on massive datasets to achieve state-of-the-art performance. These datasets are
frequently scraped from public domains with little quality control. In other settings, models are trained on
shared data, e.g., federated learning (Kairouz et al., 2019), where injecting maliciously corrupted data is easy.
Such models are vulnerable to backdoor attacks (Gu et al., 2017), in which the attacker injects corrupted
examples into the training set with the goal of creating a backdoor when the model is trained. When the
model is shown test examples with a particular trigger chosen by the attacker, the backdoor is activated and
the model outputs a prediction of the attacker’s choice. The predictions on clean data remain the same so
that the model’s corruption will not be noticed in production.
Weaker attacks require injecting more corrupted examples into the training set, which can be challenging
and costly. For example, in cross-device federated systems, this requires tampering with many devices (Sun et al., 2019).
Further, even if the attacker has the resources to inject more corrupted examples, stronger attacks that require a smaller number of poison training examples are preferred: injecting more
poison data increases the chance of being detected by human inspection under random screening. For such
systems, there is a natural optimization problem of interest to the attacker: assuming the attacker wants to
achieve a certain success rate for a trigger of choice, how can they do so with the minimum number of corrupted
examples injected into the training set?
For a given choice of a trigger, the success of an attack is measured by the Attack Success Rate (ASR),
defined as the probability that the corrupted model predicts a target class, $y_{\text{target}}$, for an input image from
another class with the trigger applied. This is referred to as a test-time poison example. To increase ASR,
train-time poison examples are injected into the training data. A typical recipe is to mimic the test-time poison
example by randomly selecting an image from a class other than the target class, applying the trigger
function, $P : \mathbb{R}^k \to \mathbb{R}^k$, and labeling it as the target class, $y_{\text{target}}$ (Barni et al., 2019; Gu et al., 2017; Liu et al.,
2020). We refer to this as the "sampling" baseline. In (Barni et al., 2019), for example, the trigger is a
periodic image-space signal $\Delta \in \mathbb{R}^k$ that is added to the image: $P(x_{\text{truck}}) = x_{\text{truck}} + \Delta$. Example images for
this attack are shown in Fig. 2 with $y_{\text{target}} =$ "deer". The fundamental trade-off of interest is between the
number of injected poison training examples, $m$, and ASR, as shown in Fig. 1. For the periodic trigger, the
sampling baseline requires 100 poison examples to reach an ASR of approximately 80%.
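For concreteness, the following is a minimal sketch of such a periodic image-space trigger $P(x) = x + \Delta$ in JAX. The amplitude and period below are illustrative placeholders rather than the values used by Barni et al. (2019), and images are assumed to be float arrays of shape (H, W, C) with values in [0, 1].

```python
import jax.numpy as jnp

# Sketch of a periodic trigger P(x) = x + Delta, where Delta is a sinusoid
# along the image width (producing faint vertical stripes). Amplitude and
# period are placeholder values, not those of Barni et al. (2019).
def periodic_trigger(x, amplitude=0.03, period=8):
    cols = jnp.arange(x.shape[1])                                # column indices, shape (W,)
    delta = amplitude * jnp.sin(2.0 * jnp.pi * cols / period)    # periodic signal, shape (W,)
    return jnp.clip(x + delta[None, :, None], 0.0, 1.0)          # broadcast over rows and channels

# The "sampling" baseline labels the triggered image with the target class:
# x_poison, y_poison = periodic_trigger(x_truck), y_target
```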
Figure 1: The trade-off between the number of poisons and ASR for the periodic trigger. (Plot: attack success rate vs. number of poisons $m$ on a log scale from 1 to 1000, comparing NTBA (ours) with the sampling baseline.)

Figure 2: A typical poison attack takes a random sample from the source class ("truck"), adds a trigger $\Delta$ to it, and labels it as the target ("deer"). Note the faint vertical striping in Fig. 2c. (Panels: (a) clean, label "truck"; (b) clean, label "deer"; (c) poison, label "deer".)
Notice how this baseline, although widely used in the robust machine learning literature, wastes the opportunity
to construct stronger attacks. We propose to exploit an under-explored attack surface for designing strong
attacks by carefully constructing the train-time poison examples tailored to the choice of the backdoor trigger.
We want to emphasize that our goal in proving the existence of such strong backdoor attacks is to motivate
continued research into backdoor defenses and to inspire practitioners to carefully secure their machine learning
pipelines. There is a false sense of safety in systems that ensure a large number of honest data contributors
so that the fraction of corrupted contributions stays small; we show that it takes only a few examples to succeed
in a backdoor attack. We survey the related work in Appendix A.
Contributions.
We borrow analyses and algorithms from kernel regression to bring a new perspective on
the fundamental trade-off between the attack success rate of a backdoor attack and the number of poison
training examples that need to be injected. We (i) use Neural Tangent Kernels (NTKs) to introduce a new
computational tool for constructing strong backdoor attacks for training deep neural networks (Sections 2
and 3); (ii) use the analysis of the standard kernel linear regression to interpret what determines the strengths
of a backdoor attack (Section 4); and (iii) investigate the vulnerability of deep neural networks through the
lens of corresponding NTKs (Section 5).
First, we propose a bi-level optimization problem whose solution automatically constructs strong train-time
poison examples tailored for the backdoor trigger we want to apply at test-time. Central to our approach is the
Neural Tangent Kernel (NTK) that models the training dynamics of the neural network. Our Neural Tangent
Backdoor Attack (NTBA) achieves, for example, an ASR of 72% with only 10 poison examples in Fig. 1,
which is an order of magnitude more efficient. For sub-tasks from the CIFAR-10 and ImageNet datasets and two
architectures (WideResNet and ConvNeXt), we show the existence of such strong few-shot backdoor attacks
for two commonly used triggers: the periodic trigger (Section 3) and the patch trigger (Appendix C.1). We
present an ablation study showing that every component of NTBA is necessary to discover such a strong
few-shot attack (Section 2.1). Second, we provide an interpretation of the poison examples designed with
NTBA via an analysis of kernel linear regression. In particular, this suggests that small-magnitude train-time
triggers lead to strong attacks, when coupled with a clean image that is close in distance, which explains
and guides the design of strong attacks. Finally, we investigate the vulnerability of deep neural networks to
backdoor attacks by comparing the corresponding NTK to the standard Laplace kernel. NTKs allow far-away
data points to have more influence than the Laplace kernel does, a property that is exploited by few-shot
backdoor attacks.
2 NTBA: Neural Tangent Backdoor Attack
We frame the construction of strong backdoor attacks as a bi-level optimization problem and solve it using
our proposed Neural Tangent Backdoor Attack (NTBA). NTBA is composed of the following steps (with
details referenced in parentheses):
1. Model the training dynamics (Appendix C.4): Train the network to convergence on the clean data, save the network weights, and use the empirical neural tangent kernel at this choice of weights as our model of the network training dynamics.
2. Initialization (Appendix B.2): Use greedy initialization to find an initial set of poison images.
3. Optimization (Appendices B.1.2 and B.3): Improve the initial set of poison images using a gradient-based optimizer.
Background on neural tangent kernels: The NTK of a scalar-valued neural network $f$ is the kernel
associated with the feature map $\phi(x) = \nabla_\theta f(x; \theta)$. The NTK was introduced in (Jacot et al., 2018), which
showed that the NTK remains stationary during the training of feed-forward neural networks in the infinite-width
limit. When trained with the squared loss, this implies that infinite-width neural networks are equivalent
to kernel linear regression with the neural tangent kernel. Since then, the NTK has been extended to other
architectures (Li et al., 2019; Du et al., 2019b; Alemohammad et al., 2020; Yang, 2020), computed in
closed form (Li et al., 2019; Novak et al., 2020), and compared to finite neural networks (Lee et al., 2020;
Arora et al., 2019). The closed-form predictions of the NTK offer a computational convenience which has
been leveraged for data distillation (Nguyen et al., 2020, 2021), meta-learning (Zhou et al., 2021), and subset
selection (Borsos et al., 2020). For finite networks, the kernel is not stationary, and its time evolution has
been studied in (Fort et al., 2020; Long, 2021; Seleznova & Kutyniok, 2022). We call the NTK of a finite
network with $\theta$ chosen at some point during training the network's empirical NTK. Although the empirical
NTK cannot exactly model the full training dynamics of finite networks, (Du et al., 2018, 2019a) give some
non-asymptotic guarantees.
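As a concrete illustration, the following is a minimal sketch of the empirical NTK built directly from the feature map $\phi(x) = \nabla_\theta f(x; \theta)$ in JAX. The model function f and its weight pytree params are hypothetical placeholders, and materializing $\phi$ explicitly is only practical for small models; the experiments in this paper use the neural-tangents library (Novak et al., 2020) rather than this naive construction.

```python
import jax
import jax.numpy as jnp

def empirical_ntk(f, params, x1, x2):
    """Empirical NTK matrix K[i, j] = <phi(x1[i]), phi(x2[j])>, where
    phi(x) = gradient of the scalar output f(params, x) w.r.t. params."""
    def feature(x):
        grads = jax.grad(lambda p: f(p, x))(params)   # pytree of per-parameter gradients
        return jnp.concatenate([g.ravel() for g in jax.tree_util.tree_leaves(grads)])

    phi1 = jax.vmap(feature)(x1)   # (n1, num_params) feature matrix
    phi2 = jax.vmap(feature)(x2)   # (n2, num_params) feature matrix
    return phi1 @ phi2.T           # (n1, n2) kernel matrix
```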
Bi-level optimization with NTK: Let $(X_d, y_d)$ and $(X_p, y_p)$ denote the clean and poison training examples,
respectively, $(X_t, y_t)$ denote clean test examples, and $(X_a, y_a)$ denote test data with the trigger applied
and the target label. Our goal is to construct poison examples, $X_p$, with target label, $y_p = y_{\text{target}}$, that,
when trained on together with clean examples, produce a model which (i) is accurate on clean test data $X_t$
and (ii) predicts the target label for poison test data $X_a$. This naturally leads to the following bi-level
optimization problem:
$$\min_{X_p} \; \mathcal{L}_{\text{backdoor}}\Big( f\big(X_{ta};\, \operatorname*{argmin}_{\theta} \mathcal{L}\big(f(X_{dp}; \theta),\, y_{dp}\big)\big),\; y_{ta} \Big), \tag{1}$$
where we denote concatenation with subscripts, $X_{dp}^{\top} = \begin{bmatrix} X_d^{\top} & X_p^{\top} \end{bmatrix}$, and similarly for $X_{ta}$, $y_{ta}$, and $y_{dp}$. To
ensure our objective is differentiable and to permit closed-form kernel predictions, we use the squared loss
$\mathcal{L}(\hat{y}, y) = \mathcal{L}_{\text{backdoor}}(\hat{y}, y) = \frac{1}{2}\lVert \hat{y} - y \rVert_2^2$. Still, such bi-level optimizations are typically challenging to solve
(Bard, 1991, 2013). Differentiating directly through the inner optimization $\operatorname{argmin}_{\theta} \mathcal{L}(f(X_{dp}; \theta), y_{dp})$ with
respect to the corrupted training data $X_p$ is impractical for two reasons: (i) backpropagating through an
iterative process incurs a significant performance penalty, even when using advanced checkpointing techniques
(Walther & Griewank, 2004), and (ii) the gradients obtained by backpropagating through SGD are too noisy
to be useful (Hospedales et al., 2020). To overcome these challenges, we propose to use a closed-form kernel
to model the training dynamics of the neural network. This dramatically simplifies and stabilizes our loss,
which becomes
$$\mathcal{L}_{\text{backdoor}}(K_{dp,dpta},\, y_{dpta}) = \frac{1}{2}\Big\lVert\, y_{dp}^{\top} K_{dp,dp}^{-1} K_{dp,ta} - y_{ta} \,\Big\rVert_2^2, \tag{2}$$
where we plugged in the closed-form solution of the inner optimization from the kernel linear regression
model, which we can easily differentiate with respect to $K_{dp,dpta}$. We use $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ to denote a kernel
function of choice, $K(X, X')$ to denote the $|X| \times |X'|$ kernel matrix with $K(X, X')_{i,j} = K(X_i, X'_j)$, and
subscripts as shorthand for block matrices, e.g. $K_{a,dp} = \begin{bmatrix} K(X_a, X_d) & K(X_a, X_p) \end{bmatrix}$. This simplification does
not come for free, as kernel-designed poisons might not generalize to the neural network training that we
desire to backdoor. Empirically demonstrating in Section 3 that there is little loss in transferring our attack
to neural networks is one of our main goals (see Table 2).
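To make the objective concrete, the following is a minimal differentiable sketch of Eq. (2) in JAX, using the equivalent column-vector form of the kernel regression prediction. The names (backdoor_loss, kernel_fn) and the small ridge term added for numerical stability are illustrative assumptions rather than the paper's implementation; kernel_fn could be the empirical_ntk sketched above.

```python
import jax
import jax.numpy as jnp

def backdoor_loss(X_p, y_p, X_d, y_d, X_ta, y_ta, kernel_fn, ridge=1e-6):
    """Kernel-regression form of Eq. (2): fit on clean + poison data, evaluate
    on the triggered test set. The ridge term is a numerical-stability
    addition that is not part of Eq. (2)."""
    X_dp = jnp.concatenate([X_d, X_p], axis=0)
    y_dp = jnp.concatenate([y_d, y_p], axis=0)
    K_dpdp = kernel_fn(X_dp, X_dp)                # (n_d + m, n_d + m)
    K_dpta = kernel_fn(X_dp, X_ta)                # (n_d + m, n_ta)
    alpha = jnp.linalg.solve(K_dpdp + ridge * jnp.eye(K_dpdp.shape[0]), y_dp)
    preds = K_dpta.T @ alpha                      # closed-form predictions on the triggered test set
    return 0.5 * jnp.sum((preds - y_ta) ** 2)

# Gradient with respect to the poison images only (argument 0), as used by a
# gradient-based optimizer in step 3 of NTBA.
grad_wrt_poisons = jax.grad(backdoor_loss, argnums=0)
```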
Greedy initialization. The optimization problem in Eq. (1) is nonconvex. Empirically, we find that the
optimization always converges to a local minimum that is close to the initialization of the poison images. We
propose a greedy algorithm to select the initial set of images from which to start the optimization. The algorithm
proceeds by applying the trigger function $P(\cdot)$ to every image in the training set and, incrementally in a
greedy fashion, selecting the image that yields the greatest reduction in the backdoor loss when added to the
poison set. This is motivated by our analysis in Section 4, which encourages poisons with small perturbation.
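A brute-force sketch of this greedy selection, reusing the backdoor_loss sketch above, is given below; apply_trigger and y_target are hypothetical placeholders, and an efficient implementation would avoid recomputing full kernel matrices for every candidate.

```python
import jax.numpy as jnp

def greedy_init(X_train, X_d, y_d, X_ta, y_ta, kernel_fn, apply_trigger, y_target, m):
    """Greedily pick m triggered training images that most reduce the backdoor loss."""
    candidates = apply_trigger(X_train)       # trigger applied to every training image
    X_p, y_p = [], []
    for _ in range(m):
        best_loss, best_idx = jnp.inf, None
        for i in range(candidates.shape[0]):  # brute-force candidate search, for clarity only
            Xp_try = jnp.stack(X_p + [candidates[i]])
            yp_try = jnp.full(len(X_p) + 1, y_target, dtype=y_d.dtype)
            loss = backdoor_loss(Xp_try, yp_try, X_d, y_d, X_ta, y_ta, kernel_fn)
            if loss < best_loss:
                best_loss, best_idx = loss, i
        X_p.append(candidates[best_idx])
        y_p.append(y_target)
    return jnp.stack(X_p), jnp.array(y_p)
```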
2.1 Ablation study
Table 1: Ablation study under the setting of Fig. 1 with m = 10.

ablation    ASR
1+2+3       72.1%
1+3         12.0%
1+2         16.2%
1′+2+3      11.3%
1′′+2+3     23.1%
We perform an ablation study on the three components listed at the beginning of this section to demonstrate
that they are all necessary. The alternatives are: (1′) the empirical neural tangent kernel, but with weights
taken from a random initialization of the model; (1′′) the infinite-width neural tangent kernel; (removing 2)
sampling the initial set of images from a standard Gaussian; (removing 3) using the greedy initial poison set
without any optimization. ASR for various combinations is shown in Table 1. The stark difference between
our approach (1+2+3) and the rest suggests that all components are important in achieving a strong attack.
Random initialization (1+3) fails, as coupled examples that are very close to clean images in image space but
have different labels are critical to achieving strong attacks, as shown in Fig. 3. Without our proposed
optimization (1+2), the attack is weak. Attacks designed with different choices of neural tangent kernels
(1′+2+3 and 1′′+2+3) work well on the kernel models they were designed for, but the attacks fail to transfer
to the original neural network, suggesting that those kernels are less accurate models of the network training.
3 Experimental results
We attack a WideResNet-34-5 (Zagoruyko & Komodakis, 2016) ($d \approx 10^7$) with GELU activations (Hendrycks
& Gimpel, 2016) so that our network satisfies the smoothness assumption in Appendix B.1.2. Additionally,
we do not use batch normalization, which is not yet supported by the neural tangent kernel library we use
(Novak et al., 2020). Our network is trained with SGD on a 2-label subset of CIFAR-10 (Krizhevsky, 2009).
The particular pair of labels is "truck" and "deer", which was observed in Hayase et al. (2021) to be relatively
difficult to backdoor since the two classes are easy to distinguish. We consider two backdoor triggers: the
periodic image trigger of Barni et al. (2019) and a $3 \times 3$ checker patch applied at a random position in the
image. These two triggers represent sparse control over images at test time in frequency space and image space,
respectively. Results for the periodic trigger are given here, while results for the patch trigger are given in
Appendix C.1.
To fairly evaluate performance, we split the CIFAR-10 training set into an inner training set and a validation
set containing 80% and 20% of the images, respectively. We run NTBA with the inner training set as $D_d$, the
inner validation set as $D_t$, and the inner validation set with the trigger applied as $D_a$. Our neural network is
then trained on $D_d \cup D_p$ and tested on the CIFAR-10 test set.
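The split can be summarized by the following sketch, which reuses the periodic_trigger example from Section 1; x_train and y_train (the 2-label CIFAR-10 subset) and the scalar label y_target are placeholders.

```python
import jax
import jax.numpy as jnp

# Assumed inputs: x_train of shape (N, 32, 32, 3), y_train of shape (N,), and
# a scalar target label y_target. All names are illustrative placeholders.
key = jax.random.PRNGKey(0)
perm = jax.random.permutation(key, x_train.shape[0])
n_inner = int(0.8 * x_train.shape[0])
inner_idx, val_idx = perm[:n_inner], perm[n_inner:]

X_d, y_d = x_train[inner_idx], y_train[inner_idx]   # D_d: inner training set (80%)
X_t, y_t = x_train[val_idx], y_train[val_idx]       # D_t: clean inner validation set (20%)
X_a = jax.vmap(periodic_trigger)(X_t)               # D_a: inner validation set with the trigger applied
y_a = jnp.full(X_a.shape[0], y_target)              # relabeled with the target class
```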
We also attack a pretrained ConvNeXt (Liu et al., 2022) fine-tuned on a 2-label subset of ImageNet,
following the setup of Saha et al. (2020), with details given in Appendix C.2. We describe the computational
resources used to perform our attack in Appendix B.4.
3.1 NTBA makes backdoor attacks significantly more efficient
Our main results show that (i) as expected, there are some gaps in ASR when applying NTK-designed
poison examples to neural network training, but (ii) NTK-designed poison examples still manage to be
significantly stronger compared to sampling baseline. The most relevant metric is the test results of neural
network training evaluated on the original validation set with the trigger applied,
asrnn,te
. In Table 2,
to achieve
asrnn,te
= 90
.
7%, NTBA requires 30 poisons, which is an order of magnitude fewer than the
sampling baseline. The ASR for backdooring kernel regressions is almost perfect, as it is what NTBA is
designed to do; we consistently get high
asrntk,te
with only a few poisons. Perhaps surprisingly, we show that