
Pre-trained Adversarial Perturbations
Yuanhao Ban1,2∗, Yinpeng Dong1,3†
1Department of Computer Science & Technology, Institute for AI, BNRist Center,
Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University
2Department of Electronic Engineering, Tsinghua University 3RealAI
banyh19@mails.tsinghua.edu.cn, dongyinpeng@mail.tsinghua.edu.cn
Abstract
Self-supervised pre-training has drawn increasing attention in recent years due to
its superior performance on numerous downstream tasks after fine-tuning. However, it is
well known that deep learning models lack robustness to adversarial examples, which also
raises security concerns for pre-trained models, although this threat remains less explored.
In this paper, we delve into the robustness of pre-trained models by introducing Pre-trained
Adversarial Perturbations (PAPs): universal perturbations crafted on a pre-trained model that
remain effective when attacking models fine-tuned from it, without any knowledge of the
downstream tasks.
tasks. To this end, we propose a Low-Level Layer Lifting Attack (L4A) method
to generate effective PAPs by lifting the neuron activations of low-level layers of
the pre-trained models. Equipped with an enhanced noise augmentation strategy,
L4A is effective at generating more transferable PAPs against fine-tuned models.
Extensive experiments on typical pre-trained vision models and ten downstream
tasks demonstrate that our method improves the attack success rate by a large
margin compared with state-of-the-art methods.
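As a schematic sketch of the idea described above (the notation here is ours and not the paper's exact objective): a PAP $\delta$ can be sought by maximizing the magnitude of a low-level feature map $f_l$ of the pre-trained encoder under an $\ell_\infty$ budget, with Gaussian noise augmentation of the inputs,
\[
\max_{\|\delta\|_\infty \le \epsilon} \ \mathbb{E}_{x \sim \mathcal{D},\; \xi \sim \mathcal{N}(0,\sigma^2 I)} \big\| f_l(x + \xi + \delta) \big\|_2^2 ,
\]
where $l$ indexes a low-level layer, $\epsilon$ is the perturbation budget, and $\sigma$ is a hypothetical noise scale; the resulting perturbation is then applied unchanged to models fine-tuned from the same encoder.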
1 Introduction
Large-scale pre-trained models [50, 17] have recently achieved unprecedented success in a variety of
fields, e.g., natural language processing [25, 34, 2] and computer vision [4, 20, 21]. A large body of
work proposes sophisticated self-supervised learning algorithms that enable pre-trained models to
extract useful knowledge from large-scale unlabeled datasets. The pre-trained models consequently
facilitate downstream tasks through transfer learning or fine-tuning [46, 61, 16]. Nowadays, more and
more practitioners without sufficient computational resources or training data fine-tune publicly
available pre-trained models on their own datasets. Adopting the pre-training then fine-tuning
paradigm, rather than training from scratch, has therefore become an emerging trend [17].
Despite their excellent performance, deep learning models are highly vulnerable to adversarial
examples [54, 15], which are crafted by adding small, human-imperceptible perturbations to
natural examples yet can make the target model output erroneous predictions. Adversarial examples
also exhibit an intriguing property called transferability [54, 33, 40]: adversarial perturbations
generated for one model or one set of images can remain adversarial for others. For example, a
universal adversarial perturbation (UAP) [40] can be generated for the entire distribution of data
samples, demonstrating excellent cross-data transferability. Other work [33, 11, 58, 12, 42] has
revealed that adversarial examples have high cross-model and cross-domain transferability, making
black-box attacks practical without any knowledge of the target model or even the training data.
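To make the notion of a UAP concrete (the notation is ours rather than taken verbatim from [40]): a UAP is a single perturbation $\delta$ with bounded norm that fools a classifier $\hat{k}$ on most samples drawn from the data distribution $\mu$, i.e.,
\[
\|\delta\|_p \le \xi, \qquad \mathbb{P}_{x \sim \mu}\big[\hat{k}(x+\delta) \neq \hat{k}(x)\big] \ge 1-\rho ,
\]
for a perturbation budget $\xi$ and a desired fooling rate $1-\rho$.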
However, much less effort has been devoted to exploring the adversarial robustness of pre-trained
models. As these models have been broadly studied and deployed in various real-world applications,
∗This work was done when Yuanhao Ban was an intern at RealAI, Inc. †Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).