
Pre-trained Adversarial Perturbations
Yuanhao Ban1,2∗, Yinpeng Dong1,3†
1Department of Computer Science & Technology, Institute for AI, BNRist Center,
Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University
2Department of Electronic Engineering, Tsinghua University 3RealAI
banyh19@mails.tsinghua.edu.cn, dongyinpeng@mail.tsinghua.edu.cn
Abstract
Self-supervised pre-training has drawn increasing attention in recent years due to
its superior performance on numerous downstream tasks after fine-tuning. However, it is
well known that deep learning models lack robustness to adversarial examples, which also
raises security concerns for pre-trained models, although this threat remains less explored.
In this paper, we delve into the robustness of pre-trained models by introducing Pre-trained
Adversarial Perturbations (PAPs): universal perturbations crafted on a pre-trained model that
remain effective when attacking models fine-tuned from it, without any knowledge of the
downstream tasks.
tasks. To this end, we propose a Low-Level Layer Lifting Attack (L4A) method
to generate effective PAPs by lifting the neuron activations of low-level layers of
the pre-trained models. Equipped with an enhanced noise augmentation strategy,
L4A is effective at generating more transferable PAPs against fine-tuned models.
Extensive experiments on typical pre-trained vision models and ten downstream
tasks demonstrate that our method improves the attack success rate by a large
margin compared with state-of-the-art methods.
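As a schematic sketch of the idea described above (the notation here is ours and not the paper's exact objective): a PAP $\delta$ can be sought by maximizing the magnitude of a low-level feature map $f_l$ of the pre-trained encoder under an $\ell_\infty$ budget, with Gaussian noise augmentation of the inputs,
\[
\max_{\|\delta\|_\infty \le \epsilon} \ \mathbb{E}_{x \sim \mathcal{D},\; \xi \sim \mathcal{N}(0,\sigma^2 I)} \big\| f_l(x + \xi + \delta) \big\|_2^2 ,
\]
where $l$ indexes a low-level layer, $\epsilon$ is the perturbation budget, and $\sigma$ is a hypothetical noise scale; the resulting perturbation is then applied unchanged to models fine-tuned from the same encoder.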
1 Introduction
Large-scale pre-trained models [50, 17] have recently achieved unprecedented success in a variety of
fields, e.g., natural language processing [25, 34, 2] and computer vision [4, 20, 21]. A large body of
work proposes sophisticated self-supervised learning algorithms that enable pre-trained models to
extract useful knowledge from large-scale unlabeled datasets. The pre-trained models consequently
facilitate downstream tasks through transfer learning or fine-tuning [46, 61, 16]. Nowadays, more and
more practitioners without sufficient computational resources or training data fine-tune publicly
available pre-trained models on their own datasets. Adopting the pre-training then fine-tuning
paradigm, rather than training from scratch, has therefore become an emerging trend [17].
Despite their excellent performance, deep learning models are highly vulnerable to adversarial
examples [54, 15], which are crafted by adding small, human-imperceptible perturbations to
natural examples yet can make the target model output erroneous predictions. Adversarial examples
also exhibit an intriguing property called transferability [54, 33, 40]: adversarial perturbations
generated for one model or one set of images can remain adversarial for others. For example, a
universal adversarial perturbation (UAP) [40] can be generated for the entire distribution of data
samples, demonstrating excellent cross-data transferability. Other work [33, 11, 58, 12, 42] has
revealed that adversarial examples have high cross-model and cross-domain transferability, making
black-box attacks practical without any knowledge of the target model or even the training data.
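To make the notion of a UAP concrete (the notation is ours rather than taken verbatim from [40]): a UAP is a single perturbation $\delta$ with bounded norm that fools a classifier $\hat{k}$ on most samples drawn from the data distribution $\mu$, i.e.,
\[
\|\delta\|_p \le \xi, \qquad \mathbb{P}_{x \sim \mu}\big[\hat{k}(x+\delta) \neq \hat{k}(x)\big] \ge 1-\rho ,
\]
for a perturbation budget $\xi$ and a desired fooling rate $1-\rho$.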
However, much less effort has been devoted to exploring the adversarial robustness of pre-trained
models. As these models have been broadly studied and deployed in various real-world applications,
∗This work was done when Yuanhao Ban was an intern at RealAI, Inc. †Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).