Few-shot Backdoor Attacks via Neural Tangent Kernels
Jonathan Hayase and Sewoong Oh
Paul G. Allen School of Computer Science and Engineering
University of Washington
{jhayase,sewoong}@cs.washington.edu
Abstract
In a backdoor attack, an attacker injects corrupted examples into the training set. The goal of the
attacker is to cause the final trained model to predict the attacker’s desired target label when a predefined
trigger is added to test inputs. Central to these attacks is the trade-off between the success rate of the
attack and the number of corrupted training examples injected. We pose this attack as a novel bilevel
optimization problem: construct strong poison examples that maximize the attack success rate of the
trained model. We use neural tangent kernels to approximate the training dynamics of the model being
attacked and automatically learn strong poison examples. We experiment on subclasses of CIFAR-10
and ImageNet with WideResNet-34 and ConvNeXt architectures on periodic and patch trigger attacks
and show that NTBA-designed poison examples achieve, for example, an attack success rate of 90%
with ten times fewer poison examples injected than the baseline. We provide an
interpretation of the NTBA-designed attacks using the analysis of kernel linear regression. We further
demonstrate a vulnerability in overparametrized deep neural networks, which is revealed by the shape of
the neural tangent kernel.
1 Introduction
Modern machine learning models, such as deep convolutional neural networks and transformer-based language
models, are often trained on massive datasets to achieve state-of-the-art performance. These datasets are
frequently scraped from public domains with little quality control. In other settings, models are trained on
shared data, e.g., federated learning (Kairouz et al., 2019), where injecting maliciously corrupted data is easy.
Such models are vulnerable to backdoor attacks (Gu et al., 2017), in which the attacker injects corrupted
examples into the training set with the goal of creating a backdoor when the model is trained. When the
model is shown test examples with a particular trigger chosen by the attacker, the backdoor is activated and
the model outputs a prediction of the attacker’s choice. The predictions on clean data remain the same so
that the model’s corruption will not be noticed in production.
Weaker attacks require injecting more corrupted examples into the training set, which can be challenging
and costly. For example, in cross-device federated systems, this requires tampering with many devices (Sun et al., 2019).
Further, even if the attacker has the resources to inject more corrupted examples, stronger attacks that require a smaller number of poison training examples are preferred: injecting more
poison data increases the chance of being detected by human inspection under random screening. For such
systems, there is a natural optimization problem of interest to the attacker: assuming the attacker wants to
achieve a certain success rate for a trigger of choice, how can they do so with the minimum number of corrupted
examples injected into the training set?
For a given choice of a trigger, the success of an attack is measured by the Attack Success Rate (ASR),
defined as the probability that the corrupted model predicts a target class, $y_{\text{target}}$, for an input image from
another class with the trigger applied. This is referred to as a test-time poison example. To increase ASR,
train-time poison examples are injected into the training data. A typical recipe is to mimic the test-time poison
example by randomly selecting an image from a class other than the target class, applying the trigger
function, $P : \mathbb{R}^k \to \mathbb{R}^k$, and labeling it as the target class, $y_{\text{target}}$ (Barni et al., 2019; Gu et al., 2017; Liu et al.,
2020). We refer to this as the "sampling" baseline. In (Barni et al., 2019), for example, the trigger is a
periodic image-space signal $\Delta \in \mathbb{R}^k$ that is added to the image: $P(x_{\text{truck}}) = x_{\text{truck}} + \Delta$. Example images for
this attack are shown in Fig. 2 with $y_{\text{target}} =$ "deer". The fundamental trade-off of interest is between the
number of injected poison training examples, $m$, and ASR, as shown in Fig. 1. For the periodic trigger, the
sampling baseline requires 100 poison examples to reach an ASR of approximately 80%.
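For concreteness, the following is a minimal sketch of such a periodic image-space trigger $P(x) = x + \Delta$ in JAX. The amplitude and period below are illustrative placeholders rather than the values used by Barni et al. (2019), and images are assumed to be float arrays of shape (H, W, C) with values in [0, 1].

```python
import jax.numpy as jnp

# Sketch of a periodic trigger P(x) = x + Delta, where Delta is a sinusoid
# along the image width (producing faint vertical stripes). Amplitude and
# period are placeholder values, not those of Barni et al. (2019).
def periodic_trigger(x, amplitude=0.03, period=8):
    cols = jnp.arange(x.shape[1])                                # column indices, shape (W,)
    delta = amplitude * jnp.sin(2.0 * jnp.pi * cols / period)    # periodic signal, shape (W,)
    return jnp.clip(x + delta[None, :, None], 0.0, 1.0)          # broadcast over rows and channels

# The "sampling" baseline labels the triggered image with the target class:
# x_poison, y_poison = periodic_trigger(x_truck), y_target
```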
Figure 1: The trade-off between the number of poisons and ASR for the periodic trigger. (Plot: attack success rate vs. number of poisons $m$ on a log scale from 1 to 1000, comparing NTBA (ours) with the sampling baseline.)

Figure 2: A typical poison attack takes a random sample from the source class ("truck"), adds a trigger $\Delta$ to it, and labels it as the target ("deer"). Note the faint vertical striping in Fig. 2c. (Panels: (a) clean, label "truck"; (b) clean, label "deer"; (c) poison, label "deer".)
Notice how this baseline, although widely used in the robust machine learning literature, wastes the opportunity
to construct stronger attacks. We propose to exploit an under-explored attack surface for designing strong
attacks by carefully constructing the train-time poison examples tailored to the choice of the backdoor trigger.
We want to emphasize that our goal in proving the existence of such strong backdoor attacks is to motivate
continued research into backdoor defenses and to inspire practitioners to carefully secure their machine learning
pipelines. There is a false sense of safety in systems that ensure a large number of honest data contributors
so that the fraction of corrupted contributions stays small; we show that it takes only a few examples to succeed
in a backdoor attack. We survey the related work in Appendix A.
Contributions.
We borrow analyses and algorithms from kernel regression to bring a new perspective on
the fundamental trade-off between the attack success rate of a backdoor attack and the number of poison
training examples that need to be injected. We (i) use Neural Tangent Kernels (NTKs) to introduce a new
computational tool for constructing strong backdoor attacks for training deep neural networks (Sections 2
and 3); (ii) use the analysis of the standard kernel linear regression to interpret what determines the strengths
of a backdoor attack (Section 4); and (iii) investigate the vulnerability of deep neural networks through the
lens of corresponding NTKs (Section 5).
First, we propose a bi-level optimization problem whose solution automatically constructs strong train-time
poison examples tailored for the backdoor trigger we want to apply at test-time. Central to our approach is the
Neural Tangent Kernel (NTK) that models the training dynamics of the neural network. Our Neural Tangent
Backdoor Attack (NTBA) achieves, for example, an ASR of 72% with only 10 poison examples in Fig. 1,
which is an order of magnitude more efficient. For sub-tasks from the CIFAR-10 and ImageNet datasets and two
architectures (WideResNet and ConvNeXt), we show the existence of such strong few-shot backdoor attacks
for two commonly used triggers: the periodic trigger (Section 3) and the patch trigger (Appendix C.1). We
present an ablation study showing that every component of NTBA is necessary to discover such a strong
few-shot attack (Section 2.1). Second, we provide an interpretation of the poison examples designed with
NTBA via an analysis of kernel linear regression. In particular, this suggests that small-magnitude train-time
triggers lead to strong attacks, when coupled with a clean image that is close in distance, which explains
and guides the design of strong attacks. Finally, we investigate the vulnerability of deep neural networks to
backdoor attacks by comparing the corresponding NTK to the standard Laplace kernel. NTKs allow far-away
data points to have more influence than the Laplace kernel does, a property that is exploited by few-shot
backdoor attacks.
2 NTBA: Neural Tangent Backdoor Attack
We frame the construction of strong backdoor attacks as a bi-level optimization problem and solve it using
our proposed Neural Tangent Backdoor Attack (NTBA). NTBA is composed of the following steps (with
details referenced in parentheses):
1. Model the training dynamics (Appendix C.4): Train the network to convergence on the clean data, save the network weights, and use the empirical neural tangent kernel at this choice of weights as our model of the network training dynamics.
2. Initialization (Appendix B.2): Use greedy initialization to find an initial set of poison images.
3. Optimization (Appendices B.1.2 and B.3): Improve the initial set of poison images using a gradient-based optimizer.
Background on neural tangent kernels: The NTK of a scalar-valued neural network $f$ is the kernel
associated with the feature map $\phi(x) = \nabla_\theta f(x; \theta)$. The NTK was introduced in (Jacot et al., 2018), which
showed that the NTK remains stationary during the training of feed-forward neural networks in the infinite-width
limit. When trained with the squared loss, this implies that infinite-width neural networks are equivalent
to kernel linear regression with the neural tangent kernel. Since then, the NTK has been extended to other
architectures (Li et al., 2019; Du et al., 2019b; Alemohammad et al., 2020; Yang, 2020), computed in
closed form (Li et al., 2019; Novak et al., 2020), and compared to finite neural networks (Lee et al., 2020;
Arora et al., 2019). The closed-form predictions of the NTK offer a computational convenience which has
been leveraged for data distillation (Nguyen et al., 2020, 2021), meta-learning (Zhou et al., 2021), and subset
selection (Borsos et al., 2020). For finite networks, the kernel is not stationary, and its time evolution has
been studied in (Fort et al., 2020; Long, 2021; Seleznova & Kutyniok, 2022). We call the NTK of a finite
network with $\theta$ chosen at some point during training the network's empirical NTK. Although the empirical
NTK cannot exactly model the full training dynamics of finite networks, (Du et al., 2018, 2019a) give some
non-asymptotic guarantees.
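As a concrete illustration, the following is a minimal sketch of the empirical NTK built directly from the feature map $\phi(x) = \nabla_\theta f(x; \theta)$ in JAX. The model function f and its weight pytree params are hypothetical placeholders, and materializing $\phi$ explicitly is only practical for small models; the experiments in this paper use the neural-tangents library (Novak et al., 2020) rather than this naive construction.

```python
import jax
import jax.numpy as jnp

def empirical_ntk(f, params, x1, x2):
    """Empirical NTK matrix K[i, j] = <phi(x1[i]), phi(x2[j])>, where
    phi(x) = gradient of the scalar output f(params, x) w.r.t. params."""
    def feature(x):
        grads = jax.grad(lambda p: f(p, x))(params)   # pytree of per-parameter gradients
        return jnp.concatenate([g.ravel() for g in jax.tree_util.tree_leaves(grads)])

    phi1 = jax.vmap(feature)(x1)   # (n1, num_params) feature matrix
    phi2 = jax.vmap(feature)(x2)   # (n2, num_params) feature matrix
    return phi1 @ phi2.T           # (n1, n2) kernel matrix
```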
Bi-level optimization with NTK: Let $(X_d, y_d)$ and $(X_p, y_p)$ denote the clean and poison training examples,
respectively, $(X_t, y_t)$ denote clean test examples, and $(X_a, y_a)$ denote test data with the trigger applied
and the target label. Our goal is to construct poison examples, $X_p$, with target label, $y_p = y_{\text{target}}$, that,
when trained on together with clean examples, produce a model which (i) is accurate on clean test data $X_t$
and (ii) predicts the target label for poison test data $X_a$. This naturally leads to the following bi-level
optimization problem:
$$\min_{X_p} \; \mathcal{L}_{\text{backdoor}}\Big( f\big(X_{ta};\, \operatorname*{argmin}_{\theta} \mathcal{L}\big(f(X_{dp}; \theta),\, y_{dp}\big)\big),\; y_{ta} \Big), \tag{1}$$
where we denote concatenation with subscripts, $X_{dp}^{\top} = \begin{bmatrix} X_d^{\top} & X_p^{\top} \end{bmatrix}$, and similarly for $X_{ta}$, $y_{ta}$, and $y_{dp}$. To
ensure our objective is differentiable and to permit closed-form kernel predictions, we use the squared loss
$\mathcal{L}(\hat{y}, y) = \mathcal{L}_{\text{backdoor}}(\hat{y}, y) = \frac{1}{2}\lVert \hat{y} - y \rVert_2^2$. Still, such bi-level optimizations are typically challenging to solve
(Bard, 1991, 2013). Differentiating directly through the inner optimization $\operatorname{argmin}_{\theta} \mathcal{L}(f(X_{dp}; \theta), y_{dp})$ with
respect to the corrupted training data $X_p$ is impractical for two reasons: (i) backpropagating through an
iterative process incurs a significant performance penalty, even when using advanced checkpointing techniques
(Walther & Griewank, 2004), and (ii) the gradients obtained by backpropagating through SGD are too noisy
to be useful (Hospedales et al., 2020). To overcome these challenges, we propose to use a closed-form kernel
to model the training dynamics of the neural network. This dramatically simplifies and stabilizes our loss,
which becomes
$$\mathcal{L}_{\text{backdoor}}(K_{dp,dpta},\, y_{dpta}) = \frac{1}{2}\Big\lVert\, y_{dp}^{\top} K_{dp,dp}^{-1} K_{dp,ta} - y_{ta} \,\Big\rVert_2^2, \tag{2}$$
where we plugged in the closed-form solution of the inner optimization from the kernel linear regression
model, which we can easily differentiate with respect to $K_{dp,dpta}$. We use $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ to denote a kernel
function of choice, $K(X, X')$ to denote the $|X| \times |X'|$ kernel matrix with $K(X, X')_{i,j} = K(X_i, X'_j)$, and
subscripts as shorthand for block matrices, e.g. $K_{a,dp} = \begin{bmatrix} K(X_a, X_d) & K(X_a, X_p) \end{bmatrix}$. This simplification does
not come for free, as kernel-designed poisons might not generalize to the neural network training that we
desire to backdoor. Empirically demonstrating in Section 3 that there is little loss in transferring our attack
to neural networks is one of our main goals (see Table 2).
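To make the objective concrete, the following is a minimal differentiable sketch of Eq. (2) in JAX, using the equivalent column-vector form of the kernel regression prediction. The names (backdoor_loss, kernel_fn) and the small ridge term added for numerical stability are illustrative assumptions rather than the paper's implementation; kernel_fn could be the empirical_ntk sketched above.

```python
import jax
import jax.numpy as jnp

def backdoor_loss(X_p, y_p, X_d, y_d, X_ta, y_ta, kernel_fn, ridge=1e-6):
    """Kernel-regression form of Eq. (2): fit on clean + poison data, evaluate
    on the triggered test set. The ridge term is a numerical-stability
    addition that is not part of Eq. (2)."""
    X_dp = jnp.concatenate([X_d, X_p], axis=0)
    y_dp = jnp.concatenate([y_d, y_p], axis=0)
    K_dpdp = kernel_fn(X_dp, X_dp)                # (n_d + m, n_d + m)
    K_dpta = kernel_fn(X_dp, X_ta)                # (n_d + m, n_ta)
    alpha = jnp.linalg.solve(K_dpdp + ridge * jnp.eye(K_dpdp.shape[0]), y_dp)
    preds = K_dpta.T @ alpha                      # closed-form predictions on the triggered test set
    return 0.5 * jnp.sum((preds - y_ta) ** 2)

# Gradient with respect to the poison images only (argument 0), as used by a
# gradient-based optimizer in step 3 of NTBA.
grad_wrt_poisons = jax.grad(backdoor_loss, argnums=0)
```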
Greedy initialization. The optimization problem in Eq. (1) is nonconvex. Empirically, we find that the
optimization always converges to a local minimum that is close to the initialization of the poison images. We
propose a greedy algorithm to select the initial set of images from which to start the optimization. The algorithm
proceeds by applying the trigger function $P(\cdot)$ to every image in the training set and, incrementally in a
greedy fashion, selecting the image that yields the greatest reduction in the backdoor loss when added to the
poison set. This is motivated by our analysis in Section 4, which encourages poisons with small perturbation.
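A brute-force sketch of this greedy selection, reusing the backdoor_loss sketch above, is given below; apply_trigger and y_target are hypothetical placeholders, and an efficient implementation would avoid recomputing full kernel matrices for every candidate.

```python
import jax.numpy as jnp

def greedy_init(X_train, X_d, y_d, X_ta, y_ta, kernel_fn, apply_trigger, y_target, m):
    """Greedily pick m triggered training images that most reduce the backdoor loss."""
    candidates = apply_trigger(X_train)       # trigger applied to every training image
    X_p, y_p = [], []
    for _ in range(m):
        best_loss, best_idx = jnp.inf, None
        for i in range(candidates.shape[0]):  # brute-force candidate search, for clarity only
            Xp_try = jnp.stack(X_p + [candidates[i]])
            yp_try = jnp.full(len(X_p) + 1, y_target, dtype=y_d.dtype)
            loss = backdoor_loss(Xp_try, yp_try, X_d, y_d, X_ta, y_ta, kernel_fn)
            if loss < best_loss:
                best_loss, best_idx = loss, i
        X_p.append(candidates[best_idx])
        y_p.append(y_target)
    return jnp.stack(X_p), jnp.array(y_p)
```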
2.1 Ablation study
Table 1: Ablation study under the setting of Fig. 1 with m = 10.

ablation    ASR
1+2+3       72.1%
1+3         12.0%
1+2         16.2%
1′+2+3      11.3%
1′′+2+3     23.1%
We perform an ablation study on the three components listed at the beginning of this section to demonstrate
that they are all necessary. The alternatives are: (1′) the empirical neural tangent kernel, but with weights
taken from a random initialization of the model; (1′′) the infinite-width neural tangent kernel; (removing 2)
sampling the initial set of images from a standard Gaussian; (removing 3) using the greedy initial poison set
without any optimization. ASR for various combinations is shown in Table 1. The stark difference between
our approach (1+2+3) and the rest suggests that all components are important in achieving a strong attack.
Random initialization (1+3) fails, as coupled examples that are very close to clean images in image space but
have different labels are critical to achieving strong attacks, as shown in Fig. 3. Without our proposed
optimization (1+2), the attack is weak. Attacks designed with different choices of neural tangent kernels
(1′+2+3 and 1′′+2+3) work well on the kernel models they were designed for, but the attacks fail to transfer
to the original neural network, suggesting that those kernels are less accurate models of the network training.
3 Experimental results
We attack a WideResNet-34-5 (Zagoruyko & Komodakis, 2016) ($d \approx 10^7$) with GELU activations (Hendrycks
& Gimpel, 2016) so that our network satisfies the smoothness assumption in Appendix B.1.2. Additionally,
we do not use batch normalization, which is not yet supported by the neural tangent kernel library we use
(Novak et al., 2020). Our network is trained with SGD on a 2-label subset of CIFAR-10 (Krizhevsky, 2009).
The particular pair of labels is "truck" and "deer", which was observed in Hayase et al. (2021) to be relatively
difficult to backdoor since the two classes are easy to distinguish. We consider two backdoor triggers: the
periodic image trigger of Barni et al. (2019) and a $3 \times 3$ checker patch applied at a random position in the
image. These two triggers represent sparse control over images at test time in frequency space and image space,
respectively. Results for the periodic trigger are given here, while results for the patch trigger are given in
Appendix C.1.
To fairly evaluate performance, we split the CIFAR-10 training set into an inner training set and a validation
set containing 80% and 20% of the images, respectively. We run NTBA with the inner training set as $D_d$, the
inner validation set as $D_t$, and the inner validation set with the trigger applied as $D_a$. Our neural network is
then trained on $D_d \cup D_p$ and tested on the CIFAR-10 test set.
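The split can be summarized by the following sketch, which reuses the periodic_trigger example from Section 1; x_train and y_train (the 2-label CIFAR-10 subset) and the scalar label y_target are placeholders.

```python
import jax
import jax.numpy as jnp

# Assumed inputs: x_train of shape (N, 32, 32, 3), y_train of shape (N,), and
# a scalar target label y_target. All names are illustrative placeholders.
key = jax.random.PRNGKey(0)
perm = jax.random.permutation(key, x_train.shape[0])
n_inner = int(0.8 * x_train.shape[0])
inner_idx, val_idx = perm[:n_inner], perm[n_inner:]

X_d, y_d = x_train[inner_idx], y_train[inner_idx]   # D_d: inner training set (80%)
X_t, y_t = x_train[val_idx], y_train[val_idx]       # D_t: clean inner validation set (20%)
X_a = jax.vmap(periodic_trigger)(X_t)               # D_a: inner validation set with the trigger applied
y_a = jnp.full(X_a.shape[0], y_target)              # relabeled with the target class
```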
We also attack a pretrained ConvNeXt (Liu et al., 2022) fine-tuned on a 2-label subset of ImageNet,
following the setup of Saha et al. (2020), with details given in Appendix C.2. We describe the computational
resources used to perform our attack in Appendix B.4.
3.1 NTBA makes backdoor attacks significantly more efficient
Our main results show that (i) as expected, there are some gaps in ASR when applying NTK-designed
poison examples to neural network training, but (ii) NTK-designed poison examples still manage to be
significantly stronger compared to sampling baseline. The most relevant metric is the test results of neural
network training evaluated on the original validation set with the trigger applied,
asrnn,te
. In Table 2,
to achieve
asrnn,te
= 90
.
7%, NTBA requires 30 poisons, which is an order of magnitude fewer than the
sampling baseline. The ASR for backdooring kernel regressions is almost perfect, as it is what NTBA is
designed to do; we consistently get high
asrntk,te
with only a few poisons. Perhaps surprisingly, we show that