Pre-trained Adversarial Perturbations
Yuanhao Ban1,2, Yinpeng Dong1,3†
1Department of Computer Science & Technology, Institute for AI, BNRist Center,
Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University
2Department of Electronic Engineering, Tsinghua University 3RealAI
banyh19@mails.tsinghua.edu.cn, dongyinpeng@mail.tsinghua.edu.cn
Abstract
Self-supervised pre-training has drawn increasing attention in recent years due to
its superior performance on numerous downstream tasks after fine-tuning. However, it is well known that deep learning models lack robustness to adversarial examples, which also raises security issues for pre-trained models, although this threat remains less explored. In this paper, we delve into the robustness of pre-trained models by introducing Pre-trained Adversarial Perturbations (PAPs), which are universal perturbations crafted for pre-trained models that remain effective when attacking fine-tuned ones without any knowledge of the downstream tasks. To this end, we propose a Low-Level Layer Lifting Attack (L4A) method
to generate effective PAPs by lifting the neuron activations of low-level layers of
the pre-trained models. Equipped with an enhanced noise augmentation strategy,
L4A is effective at generating more transferable PAPs against fine-tuned models.
Extensive experiments on typical pre-trained vision models and ten downstream
tasks demonstrate that our method improves the attack success rate by a large
margin compared with state-of-the-art methods.
1 Introduction
Large-scale pre-trained models [50, 17] have recently achieved unprecedented success in a variety of fields, e.g., natural language processing [25, 34, 2] and computer vision [4, 20, 21]. A large amount of work proposes sophisticated self-supervised learning algorithms, enabling pre-trained models to extract useful knowledge from large-scale unlabeled datasets. The pre-trained models consequently facilitate downstream tasks through transfer learning or fine-tuning [46, 61, 16]. Nowadays, more practitioners without sufficient computational resources or training data tend to fine-tune the publicly available pre-trained models on their own datasets. Therefore, it has become an emerging trend to adopt the pre-training-to-fine-tuning paradigm rather than training from scratch [17].
Despite the excellent performance of deep learning models, they are incredibly vulnerable to adversarial examples [54, 15], which are generated by adding small, human-imperceptible perturbations to natural examples, but can make the target model output erroneous predictions. Adversarial examples also exhibit an intriguing property called transferability [54, 33, 40], which means that the adversarial perturbations generated for one model or a set of images can remain adversarial for others. For example, a universal adversarial perturbation (UAP) [40] can be generated for the entire distribution of data samples, demonstrating excellent cross-data transferability. Other work [33, 11, 58, 12, 42] has revealed that adversarial examples have high cross-model and cross-domain transferability, making black-box attacks practical without any knowledge of the target model or even the training data.
However, much less effort has been devoted to exploring the adversarial robustness of pre-trained
models. As these models have been broadly studied and deployed in various real-world applications,
This work was done when Yuanhao Ban was an intern at RealAI, Inc. †Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Figure 1: A demonstration of pre-trained adversarial perturbations (PAPs): An attacker first downloads pre-trained weights from the Internet and generates a PAP by lifting the neuron activations of low-level layers of the pre-trained model. We adopt a data augmentation technique called uniform Gaussian sampling to improve the transferability of the PAP. When users fine-tune the pre-trained model to complete downstream tasks, the attacker can add the PAP to the input of the fine-tuned models to fool them without knowing the specific downstream tasks.
it is of significant importance to identify their weaknesses and evaluate their robustness, especially
concerning the pre-training-to-fine-tuning procedure.
In this paper, we introduce Pre-trained Adversarial Perturbations (PAPs), a new kind of universal adversarial perturbation designed for pre-trained models. Specifically, a PAP is generated for a pre-trained model to effectively fool any downstream model obtained by fine-tuning the pre-trained one, as illustrated in Fig. 1. It works under a quasi-black-box setting where the downstream task, dataset, and fine-tuned model parameters are all unavailable. This attack setting is well suited to the pre-training-to-fine-tuning procedure since many pre-trained models are publicly available, and the adversary may generate PAPs before the pre-trained model has been fine-tuned. Although many methods [11, 58] have been proposed for improving transferability, they do not consider the specific characteristics of the pre-training-to-fine-tuning procedure, limiting their cross-finetuning transferability in our setting.
To generate more effective PAPs, we propose a Low-Level Layer Lifting Attack (L4A) method, which aims to lift the feature activations of low-level layers. Motivated by the finding that the lower the level of a layer is, the less its parameters change during fine-tuning, we generate PAPs to destroy the low-level feature representations of pre-trained models, making the attacking effects better preserved after fine-tuning. To further alleviate the overfitting of PAPs to the source domain, we improve L4A with a noise augmentation technique. We conduct extensive experiments on typical pre-trained vision models [4, 21] and ten downstream tasks. The evaluation results demonstrate that our method achieves a higher attack success rate on average compared with the alternative baselines.
2 Related work
Self-supervised learning. Self-supervised learning (SSL) enables learning from unlabeled data. To achieve this, early approaches utilize hand-crafted pretext tasks, including colorization [64], rotation prediction [14], position prediction [45], and Selfie [55]. Another approach to SSL is contrastive learning [32, 48, 4, 26], which maps the input image to a feature space and minimizes the distance between similar samples while keeping dissimilar ones far away from each other. In particular, a similar sample is obtained by applying appropriate data augmentation techniques to the original one, and augmented versions of different samples are viewed as dissimilar pairs.
Adversarial examples. With knowledge of the structure and parameters of a model, many algorithms [31, 39, 37, 47] successfully fool the target model in a white-box manner. An intriguing property of adversarial examples is their good transferability [33, 40]. Universal adversarial perturbations [40] demonstrate good cross-data transferability by optimizing under a distribution of data samples. Cross-model transferability has also been extensively studied [11, 58, 12], enabling attacks on black-box models without any knowledge of their internal working mechanisms.
Robustness of the pre-training-to-fine-tuning procedure. Due to the popularity of pre-trained models, many works [53, 60, 6] study the robustness of this setting. Among them, Dong et al. [10] propose a novel adversarial fine-tuning method from an information-theoretical perspective to retain robust features learned by the pre-trained model. Jiang et al. [24] integrate adversarial samples into the pre-training procedure to defend against attacks. Fan et al. [13] adopt ClusterFit [59] to generate pseudo-labeled data and later use them to train the model in a supervised way, which improves the robustness of the pre-trained model. The main difference between our work and theirs is that we consider the problem from an attacker's perspective.
3 Methodology
In this section, we first introduce the notations and the problem formulation of the Pre-trained
Adversarial Perturbations (PAPs). Then, we detail the Low-Level Layer Lifting Attack (L4A)
method.
3.1 Notations and problem formulation
Let $f_\theta$ denote a pre-trained model for feature extraction with parameters $\theta$. It takes an image $x \in \mathcal{D}_p$ as input and outputs a feature vector $v \in \mathcal{X}$, where $\mathcal{D}_p$ and $\mathcal{X}$ refer to the pre-training dataset and the feature space, respectively. We denote by $f_\theta^k(x)$ the $k$-th layer's feature map of $f_\theta$ for an input image $x$. In the pre-training-to-fine-tuning paradigm, a user fine-tunes the pre-trained model $f_\theta$ on a new dataset $\mathcal{D}_t$ of the downstream task and finally obtains a fine-tuned model $f_{\theta'}$ with updated parameters $\theta'$. Then, let $f_{\theta'}(x)$ be the predicted probability distribution of an image $x$ over the classes of $\mathcal{D}_t$, and $F_{\theta'}(x) = \arg\max f_{\theta'}(x)$ be the final classification result.
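To make this setup concrete, the following PyTorch sketch shows a fine-tuned model $f_{\theta'}$ as a pre-trained backbone $f_\theta$ plus a task-specific head, with $F_{\theta'}(x) = \arg\max f_{\theta'}(x)$ as the final prediction. The class and function names are illustrative placeholders, not code from our released implementation.

```python
import torch
import torch.nn as nn

class FineTunedClassifier(nn.Module):
    """A pre-trained feature extractor f_theta wrapped with a task head.

    After fine-tuning on the downstream dataset D_t, the whole module
    corresponds to f_theta' in the text."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                       # pre-trained f_theta
        self.head = nn.Linear(feat_dim, num_classes)   # task-specific classifier

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = self.backbone(x)       # feature vector v in the feature space X
        return self.head(v)        # predicted distribution over classes of D_t

def predict(model: FineTunedClassifier, x: torch.Tensor) -> torch.Tensor:
    """F_theta'(x) = argmax f_theta'(x): the final classification result."""
    with torch.no_grad():
        return model(x).argmax(dim=1)
```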
In this paper, we introduce Pre-trained Adversarial Perturbations (PAPs), which are generated for the pre-trained model $f_\theta$ but can effectively fool fine-tuned models $f_{\theta'}$ on downstream tasks. Formally, a PAP is a universal perturbation $\delta$ within a small budget $\epsilon$, crafted from $f_\theta$ and $\mathcal{D}_p$, such that $F_{\theta'}(x+\delta) \neq F_{\theta'}(x)$ for most of the instances belonging to the fine-tuning dataset $\mathcal{D}_t$. This can be formulated as the following optimization problem:
$$\max_{\delta} \ \mathbb{E}_{x \sim \mathcal{D}_t}\big[F_{\theta'}(x) \neq F_{\theta'}(x+\delta)\big], \quad \text{s.t.} \ \|\delta\|_p \leq \epsilon \ \text{and} \ x+\delta \in [0,1], \tag{1}$$
where $\|\cdot\|_p$ denotes the $\ell_p$ norm, and we take the $\ell_\infty$ norm in this work. There exist some works related to universal perturbations, such as the universal adversarial perturbation (UAP) [40] and the fast feature fool (FFF) [41], as detailed below.
UAP: Given a classifier $f$ and its dataset $\mathcal{D}$, UAP tries to generate a perturbation $\delta$ that can fool the model on most of the instances from $\mathcal{D}$, which is usually solved by an iterative method. Each time an image $x$ is sampled from the dataset $\mathcal{D}$, the attacker computes the minimal perturbation $\zeta$ that sends $x+\delta$ to the decision boundary by Eq. (2) and then adds it to $\delta$:
$$\zeta \leftarrow \arg\min_{r} \|r\|_2, \quad \text{s.t.} \ F(x+\delta+r) \neq F(x). \tag{2}$$
FFF: It aims to produce maximal spurious activations at each layer. To achieve this, FFF starts from a random $\delta$ and solves the following problem:
$$\min_{\delta} \ -\log\left(\prod_{i=1}^{K} \bar{l}_i(\delta)\right), \quad \text{s.t.} \ \|\delta\|_p \leq \epsilon, \tag{3}$$
where $\bar{l}_i(\delta)$ is the mean of the output tensor at layer $i$.
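For concreteness, a minimal PyTorch sketch of the FFF objective in Eq. (3) is shown below; `layer_outputs` is a hypothetical list of the $K$ layers' activation tensors, and the log of the product is computed as a sum of logs for numerical stability.

```python
import torch

def fff_loss(layer_outputs):
    """FFF objective of Eq. (3): minimize -log(prod_i mean activation of layer i).

    layer_outputs: list of tensors, the feature maps of the K considered layers
    obtained by feeding the perturbation delta through the network.
    """
    loss = 0.0
    for feat in layer_outputs:
        mean_act = feat.abs().mean()            # l_bar_i(delta): mean activation of layer i
        loss = loss - torch.log(mean_act + 1e-12)  # -log(prod_i) == -sum_i log(.)
    return loss
```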
3.2 Our design
However, these attacks show limited cross-finetuning transferability in our problem setting because they ignore the fine-tuning procedure. Two challenges degrade their performance.
Figure 2: Parameter change during fine-tuning for (a) ResNet50, (b) ResNet101, and (c) ViT-16. The ordinate represents the Frobenius norm of the difference between the parameters of the fine-tuned model and its corresponding pre-trained model, scaled into a range from 0 to 1 for easy comparison. The abscissa represents the level of the layer. Note that ResNet50 and ResNet101 [18] are pre-trained by SimCLRv2 [4], and ViT-16 [56] is pre-trained by MAE [21].
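The per-layer drift plotted in Fig. 2 can be computed directly from two checkpoints. The sketch below (PyTorch) assumes both state dicts share parameter names; the checkpoint paths in the usage comment are placeholders.

```python
import torch

def layerwise_drift(pretrained_state, finetuned_state):
    """Frobenius norm of the parameter difference for each shared layer,
    scaled into [0, 1] as in Fig. 2."""
    drifts = {}
    for name, w_pre in pretrained_state.items():
        if name in finetuned_state and w_pre.dtype.is_floating_point:
            diff = finetuned_state[name].float() - w_pre.float()
            drifts[name] = torch.linalg.norm(diff).item()  # Frobenius norm
    max_d = max(drifts.values()) if drifts else 1.0
    return {name: d / max_d for name, d in drifts.items()}

# Example usage with two hypothetical checkpoints:
# pre = torch.load("resnet50_pretrained.pth", map_location="cpu")
# fin = torch.load("resnet50_finetuned.pth", map_location="cpu")
# print(layerwise_drift(pre, fin))
```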
Fine-tuning Deviation. The parameters of the model can change considerably during fine-tuning. As a result, the generated adversarial samples may perform well in the feature space of the pre-trained model but fail in that of the fine-tuned ones.

Dataset Deviation. The statistics (i.e., mean and standard deviation) of different datasets can vary a lot. Generating adversarial samples using only the pre-training dataset with its fixed statistics may therefore lead to a performance drop.
To alleviate the negative effect of the above issues, we propose a
Low-Level Layer Lifting Attack
(L4A) method equipped with a uniform Gaussian sampling strategy.
Low-Level Layer Lifting Attack (L4A). Our method is motivated by the finding in Fig. 2 that the higher the level of a layer, the more its parameters change during fine-tuning. This is also consistent with the knowledge that low-level convolutional layers act as edge detectors that extract low-level features such as edges and textures and carry little high-level semantic information [46, 61]. Since images from different datasets share the same low-level features, the parameters of these layers tend to be preserved during fine-tuning. In contrast, attack algorithms based on high-level layers or on the scores predicted by the model may not transfer well in such a cross-finetuning setting, as the feature spaces of high-level layers are easily distorted during fine-tuning. The basic version of L4A can be formulated as the following problem:
$$\min_{\delta} \ \mathcal{L}_{\text{base}}(f_\theta, x, \delta) = -\mathbb{E}_{x \sim \mathcal{D}_p}\big[\|f_\theta^{k}(x+\delta)\|_F^2\big], \tag{4}$$
where $\|\cdot\|_F$ denotes the Frobenius norm of the input tensor. In our experiments, we find that the lower the layer, the better the attack performs, so we choose the first layer by default, i.e., $k = 1$. As Eq. (4) is usually a sophisticated non-convex optimization problem, we solve it using the stochastic gradient descent method.
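A minimal PyTorch sketch of this optimization is given below. The hook placement, step size, iteration count, perturbation budget, and the use of a signed-gradient step are illustrative assumptions rather than the exact settings of our experiments; the loss follows Eq. (4), with $\delta$ kept within the $\ell_\infty$ budget $\epsilon$ and the perturbed input clipped to $[0, 1]$.

```python
import torch

def l4a_base(backbone, low_level_layer, loader, eps=10/255, alpha=1/255,
             steps=1000, device="cuda"):
    """Craft a PAP by minimizing -||f_theta^k(x + delta)||_F^2 (Eq. 4).

    backbone:        pre-trained feature extractor f_theta (frozen)
    low_level_layer: the module whose activations are lifted (k = 1 by default)
    loader:          DataLoader over the pre-training dataset D_p
    """
    backbone.eval().to(device)
    for p in backbone.parameters():
        p.requires_grad_(False)

    # Capture the k-th layer's feature map with a forward hook.
    feats = {}
    low_level_layer.register_forward_hook(lambda m, i, o: feats.update(out=o))

    delta = torch.zeros(1, 3, 224, 224, device=device, requires_grad=True)
    data_iter = iter(loader)
    for _ in range(steps):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            x, _ = next(data_iter)
        x = x.to(device)

        backbone((x + delta).clamp(0, 1))                 # keep inputs in [0, 1]
        loss = -feats["out"].pow(2).sum() / x.size(0)     # -||f^k(x + delta)||_F^2
        loss.backward()

        with torch.no_grad():
            delta -= alpha * delta.grad.sign()            # signed gradient step
            delta.clamp_(-eps, eps)                       # project onto l_inf ball
            delta.grad.zero_()
    return delta.detach()
```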
We also find that fusing the adversarial losses of consecutive low-level layers can boost the performance, which gives the L4A_fuse method, obtained by solving:
$$\min_{\delta} \ \mathcal{L}_{\text{fuse}}(f_\theta, x, \delta) = -\mathbb{E}_{x \sim \mathcal{D}_p}\big[\|f_\theta^{k_1}(x+\delta)\|_F^2 + \lambda \cdot \|f_\theta^{k_2}(x+\delta)\|_F^2\big], \tag{5}$$
where $f_\theta^{k_1}(x+\delta)$ and $f_\theta^{k_2}(x+\delta)$ refer to the $k_1$-th and $k_2$-th layers' feature maps of $f_\theta$, respectively, and $\lambda$ is a balancing hyperparameter. We set $k_1 = 1$ and $k_2 = 2$ by default.
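The fused objective of Eq. (5) only changes the loss term; under the same assumptions as the previous sketch (feature maps captured by forward hooks, illustrative default for λ), it could be computed as:

```python
def l4a_fuse_loss(feat_k1, feat_k2, lam=0.1, batch_size=1):
    """Fused objective of Eq. (5):
    -(||f^{k1}(x + delta)||_F^2 + lam * ||f^{k2}(x + delta)||_F^2), averaged over the batch."""
    return -(feat_k1.pow(2).sum() + lam * feat_k2.pow(2).sum()) / batch_size
```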
Figure 3: Datasets’ statistics.
Uniform Gaussian Sampling. Nowadays, most state-of-the-art networks apply batch normalization [23] to input images for better performance. Thus, the datasets' statistics become an essential factor for training. As shown in Fig. 3, the distribution of the downstream datasets can vary significantly compared to that of the pre-training dataset. However, traditional data augmentation techniques [62, 22] are limited to the pre-training domain and cannot alleviate this problem. Thus, we propose to sample Gaussian noise with various means and standard deviations to avoid overfitting.
Table 1: The attack success rate (%) of various attack methods against ResNet101 pre-trained by SimCLRv2. Note that C10 stands for CIFAR-10 and C100 stands for CIFAR-100.

Method | Cars | Pets | Food | DTD | FGVC | CUB | SVHN | C10 | C100 | STL10 | AVG
FFF_no | 43.81 | 38.62 | 49.95 | 63.24 | 85.57 | 48.38 | 12.55 | 8.53 | 77.74 | 57.11 | 48.55
FFF_mean | 33.93 | 31.37 | 41.77 | 52.66 | 78.94 | 45.00 | 14.85 | 14.42 | 72.59 | 56.66 | 44.22
FFF_one | 31.87 | 29.74 | 39.25 | 46.92 | 74.17 | 43.87 | 9.24 | 11.77 | 65.61 | 50.21 | 40.26
DR | 36.28 | 35.54 | 47.43 | 47.45 | 75.00 | 44.15 | 12.05 | 21.35 | 65.39 | 41.65 | 42.63
SSP | 32.89 | 30.50 | 43.12 | 45.85 | 82.57 | 45.55 | 8.69 | 11.66 | 65.80 | 40.91 | 40.75
ASV | 60.75 | 19.84 | 36.33 | 56.22 | 84.16 | 55.82 | 7.11 | 7.29 | 58.10 | 80.89 | 46.64
UAP | 48.70 | 36.55 | 60.80 | 63.40 | 76.06 | 52.64 | 8.46 | 8.53 | 52.35 | 31.15 | 43.86
UAP_EPGD | 94.12 | 66.66 | 61.30 | 72.55 | 70.34 | 82.72 | 13.88 | 61.65 | 20.04 | 50.13 | 59.34
L4A_base | 94.07 | 61.57 | 71.23 | 69.20 | 96.28 | 81.07 | 11.70 | 12.68 | 80.57 | 90.49 | 66.89
L4A_fuse | 90.98 | 88.53 | 80.65 | 74.31 | 93.79 | 91.23 | 11.40 | 17.40 | 80.98 | 89.69 | 67.10
L4A_ugs | 94.24 | 94.99 | 78.28 | 77.23 | 92.92 | 91.77 | 11.40 | 14.60 | 76.50 | 90.05 | 72.20
Combining the base loss on the pre-training dataset with a new loss on uniform Gaussian noise gives the L4A_ugs method as follows:
$$\min_{\delta} \ \mathcal{L}_{\text{ugs}}(f_\theta, x, \delta) = -\mathbb{E}_{\mu,\sigma,\, n_0 \sim \mathcal{N}(\mu,\sigma)}\,\mathbb{E}_{x \sim \mathcal{D}_p}\big[\|f_\theta^{k}(x+\delta)\|_F^2 + \lambda \cdot \|f_\theta^{k}(n_0+\delta)\|_F^2\big], \tag{6}$$
where $\mu$ and $\sigma$ are drawn from the uniform distributions $U(\mu_l, \mu_h)$ and $U(\sigma_l, \sigma_h)$, respectively, and $\mu_l, \mu_h, \sigma_l, \sigma_h$ are four hyperparameters.
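A sketch of the uniform Gaussian sampling term used by L4A_ugs follows (PyTorch); the ranges for $\mu$ and $\sigma$ and the value of $\lambda$ below are placeholder values, not the tuned hyperparameters.

```python
import torch

def sample_gaussian_batch(batch_size, mu_l=-1.0, mu_h=1.0, sigma_l=0.1, sigma_h=1.0,
                          shape=(3, 224, 224), device="cuda"):
    """Draw noise images n0 ~ N(mu, sigma^2) with mu ~ U(mu_l, mu_h)
    and sigma ~ U(sigma_l, sigma_h), as in Eq. (6)."""
    mu = torch.empty(batch_size, 1, 1, 1, device=device).uniform_(mu_l, mu_h)
    sigma = torch.empty(batch_size, 1, 1, 1, device=device).uniform_(sigma_l, sigma_h)
    return mu + sigma * torch.randn(batch_size, *shape, device=device)

def l4a_ugs_loss(feat_on_images, feat_on_noise, lam=1.0, batch_size=1):
    """Eq. (6): lift activations on both pre-training images and sampled noise."""
    return -(feat_on_images.pow(2).sum() + lam * feat_on_noise.pow(2).sum()) / batch_size
```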
4 Experiments
We provide the main experimental results in this section; more results can be found in the Appendix. Our
code is publicly available at https://github.com/banyuanhao/PAP.
4.1 Settings
Pre-training methods. SimCLR [4, 5] uses the ResNet [18] backbone and pre-trains the model by contrastive learning. We download the pre-trained parameters of ResNet50 and ResNet101¹ to evaluate the generalization ability of our algorithm on different architectures. We also adopt MOCO [19] with a ResNet50 backbone². Besides convolutional neural networks, transformers [56] attract much attention nowadays for their competitive performance. Based on transformers and masked image modeling, MAE [21] has become a good alternative for pre-training. We adopt the pre-trained ViT-Base-16 model³. Moreover, vision-language pre-trained models are gaining popularity, so we also choose CLIP [51]⁴ for our study. We report the results of SimCLR and MAE in Section 4.2. More results on CLIP and MOCO can be found in Appendix A.1.
Datasets and Pre-processing. We adopt the ILSVRC 2012 dataset [52] to generate PAPs, which is also used to pre-train the models. We mainly evaluate PAPs on image classification tasks, following the settings of SimCLRv2. Ten fine-grained and coarse-grained datasets are used to test the cross-finetuning transferability of the generated PAPs. We load these datasets from torchvision (details in Appendix D). Before feeding the images to the model, we resize them to 256 × 256 and then center crop them to 224 × 224.
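For reference, this pre-processing corresponds to a standard torchvision pipeline (a sketch; per-model normalization, where applicable, is omitted here):

```python
from torchvision import transforms

# Resize to 256x256, then center crop to 224x224, as described above.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```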
Compared methods. We choose UAP [40] to test whether image-agnostic attacks also bear good cross-finetuning transferability. Since UAP needs the final classification predictions of the inputs, we fit a linear head on the pre-trained feature extractor. Furthermore, by integrating a momentum term into the iterative method, UAP_EPGD [9] is believed to enhance cross-model transferability. Thus, we adopt UAP_EPGD to study the connection between cross-model and cross-finetuning transferability.
¹ https://github.com/google-research/simclr
² https://dl.fbaipublicfiles.com/moco/
³ https://github.com/facebookresearch/mae
⁴ https://github.com/openai/CLIP