Hindering Adversarial Attacks with Implicit Neural Representations
Andrei A. Rusu¹, Dan A. Calian¹, Sven Gowal¹, Raia Hadsell¹
Abstract
We introduce the Lossy Implicit Network Activation Coding (LINAC) defence, an input transformation which successfully hinders several common adversarial attacks on CIFAR-10 classifiers for perturbations up to ε = 8/255 in L_∞ norm and ε = 0.5 in L_2 norm. Implicit neural representations are used to approximately encode pixel colour intensities in 2D images such that classifiers trained on transformed data appear to have robustness to small perturbations without adversarial training or large drops in performance. The seed of the random number generator used to initialise and train the implicit neural representation turns out to be necessary information for stronger generic attacks, suggesting its role as a private key. We devise a Parametric Bypass Approximation (PBA) attack strategy for key-based defences, which successfully invalidates an existing method in this category. Interestingly, our LINAC defence also hinders some transfer and adaptive attacks, including our novel PBA strategy. Our results emphasise the importance of a broad range of customised attacks despite apparent robustness according to standard evaluations. LINAC source code and parameters of the defended classifier evaluated throughout this submission are available at: https://github.com/deepmind/linac.
1. Introduction
Training Deep Neural Network (DNN) classifiers which are
accurate yet generally robust to small adversarial perturba-
tions is an open problem in computer vision and beyond,
inspiring much empirical and foundational research into
modern DNNs. Szegedy et al. (2014) showed that DNNs are
not inherently robust to imperceptible input perturbations,
which reliably cross learned decision boundaries, even those
of different models trained on similar data. With hindsight, it becomes evident that two related yet distinct design principles have been at the core of proposed defences ever since.

¹DeepMind, London, UK. Correspondence to: Andrei A. Rusu <andrei@deepmind.com>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).
Intuitively, accurate DNN classifiers could be considered ro-
bust in practice if: (I) their decision boundaries were largely
insensitive to all adversarial perturbations, and/or (II) com-
puting any successful adversarial perturbations was shown
to be expensive, ideally intractable. Early defences built
on principle (I) include the adversarial training approach of
Madry et al. (2018) and the verifiable defences of Hein &
Andriushchenko (2017); Raghunathan et al. (2018), with
many recent works continually refining such algorithms,
e.g. Cohen et al. (2019); Gowal et al. (2020); Rebuffi et al.
(2021). A wide range of defences were built, or shown
to operate, largely on principle (II), including adversarial
detection methods (Carlini & Wagner,2017a), input trans-
formations (Guo et al.,2018) and denoising strategies (Liao
et al.,2018;Niu et al.,2020). Many such approaches have
since been circumvented by more effective attacks, such as
those proposed by Carlini & Wagner (2017b), or by using
“adaptive attacks” (Athalye et al.,2018;Tramer et al.,2020).
Despite the effectiveness of recent attacks against these
defences, Garg et al. (2020) convincingly argue on a theo-
retical basis that principle (II) is sound; similarly to cryp-
tography, robust learning could rely on computational hard-
ness, even in cases where small adversarial perturbations
do exist and would be found by a hypothetical, computa-
tionally unbounded adversary. However, constructing such
robust classifiers for problems of interest, e.g. image clas-
sification, remains an open problem. Recent works have
proposed defences based on cryptographic principles, such
as the pseudo-random block pixel shuffling approach of
AprilPyone & Kiya (2021a). As we will show, employing
cryptographic principles in algorithm design is not in it-
self enough to prevent efficient attacks. Nevertheless, we
build on the concept of key-based input transformation and
propose a novel defence based on Implicit Neural Repre-
sentations (INRs). We demonstrate that our Lossy Implicit
Neural Activation Coding (LINAC) defence hinders most
standard and even adaptive attacks, more so than the related
approaches we have tested, without making any claims of
robustness about our defended classifier.
Contributions:
(1) We demonstrate empirically that lossy
INRs can be used in a standard CIFAR-10 image classifica-
tion pipeline if they are computed using the same implicit
network initialisation, a novel observation which makes our
LINAC defence possible. (2) The seed of the random num-
ber generator used for initialising and computing INRs is
shown to be an effective and compact private key, since with-
holding this information hinders a suite of standard adver-
sarial attacks widely used for robustness evaluations. (3) We
report our systematic efforts to circumvent the LINAC de-
fence with transfer and a series of adaptive attacks, designed
to expose and exploit potential weaknesses of LINAC. (4) To
the same end we propose the novel Parametric Bypass Ap-
proximation (PBA) attack strategy, valid under our threat
model, and applicable to other defences using secret keys.
We demonstrate its effectiveness by invalidating an existing
key-based defence which was previously assumed robust.
2. Related Work
Adversarial Robustness.
Much progress has been made
towards robust image classifiers along the adversarial train-
ing (Madry et al.,2018) route, which has been extensively
explored and is well reviewed, e.g. in (Schott et al.,2019;
Pang et al.,2020;Gowal et al.,2020;Rebuffi et al.,2021).
While such approaches can be effective against current at-
tacks, a complementary line of work investigates certified
defences, which offer guarantees of robustness around ex-
amples for some well defined sets (Wong & Kolter,2018;
Raghunathan et al.,2018;Cohen et al.,2019). Indeed, many
such works acknowledge the need for complementary ap-
proaches, irrespective of the success of adversarial training
and the well understood difficulties in combining methods
(He et al.,2017). The prolific work on defences against
adversarial perturbations has spurred the development of
stronger attacks (Carlini & Wagner,2017b;Brendel et al.,
2018;Andriushchenko et al.,2020) and standardisation of
evaluation strategies for threat models of interest (Athalye
et al.,2018;Croce & Hein,2020), including adaptive attacks
(Tramer et al.,2020). Alongside the empirical progress to-
wards building robust predictors, this line of research has
yielded an improved understanding of current deep learn-
ing models (Ilyas et al.,2019;Engstrom et al.,2019), the
limitations of effective adversarial robustness techniques
(Jacobsen et al.,2018), and the data required to train them
(Schmidt et al.,2018).
Athalye et al. (2018) show that a number of defences pri-
marily hinder gradient-based adversarial attacks by obfus-
cating gradients. Various forms are identified, such as gra-
dient shattering (Goodfellow et al.,2014), gradient masking
(Papernot et al.,2017), exploding and vanishing gradients
(Song et al.,2018b), stochastic gradients (Dhillon et al.,
2018) and a number of input transformations aimed at coun-
tering adversarial examples, including noise filtering ap-
proaches using PCA or image quilting (Guo et al.,2018),
the Saak transform (Song et al.,2018a), low-pass filtering
(Shaham et al.,2018), matrix estimation (Yang et al.,2019)
and JPEG compression (Dziugaite et al.,2016;Das et al.,
2017; 2018). Indeed, many such defences have been proposed, as reviewed by Niu et al. (2020); they have ranked highly in competitions (Kurakin et al., 2018), and many
have since been shown to be less robust than previously
thought, e.g. by Athalye et al. (2018) and Tramer et al.
(2020), who use adaptive attacks to demonstrate that several
input transformations offer little to no robustness.
To build on such insights, it is worth identifying the “ingre-
dients” essential to the success of adversarial attacks. Most
effective attacks, including adaptive ones, assume the ability
to approximate the outputs of the targeted model for arbi-
trary inputs. This is reasonable when applying the correct
transformation is tractable for the attacker. Hence, deny-
ing access to such computations seems to be a promising
direction for hindering adversarial attacks. AprilPyone &
Kiya (2020;2021b); MaungMaung & Kiya (2021) borrow
standard practice from cryptography and assume that an
attacker has full knowledge of the defence’s algorithm and
parameters, short of a small number of bits which make
up a private key. Another critical “weakness” of such in-
put denoising defences is that they can be approximated
by the identity mapping for the purpose of computing gra-
dients (Athalye et al.,2018). Even complex parametric
approaches, which learn stochastic generative models of
the input distribution, are susceptible to reparameterisation
and Expectation-over-Transformation (EoT) attacks in the
white-box setting. Thus, it is worth investigating whether
non-parametric, lossy and fully deterministic input transfor-
mations exist such that downstream models can still perform
tasks of interest to high accuracy, while known and novel
attack strategies are either ruled out, or at least substantially
hindered, including adaptive attacks.
Implicit Neural Representations.
Neural networks have
been used to parameterise many kinds of signals, see the
work by Sitzmann (2020) for an extensive list, with remark-
able recent advances in scene representations (Mildenhall
et al.,2020) and image processing (Sitzmann et al.,2020).
INRs have been used in isolation per image or scene, not
for generalisation across images. Some exceptions exist
in unsupervised learning, e.g. Skorokhodov et al. (2021)
parameterise GAN decoders such that they directly output
INRs of images, rather than colour intensities for all pixels.
In this paper we show that INRs can be used to discover
functional decompositions of RGB images which enable
comparable generalisation to learning on the original signal
encoding (i.e. RGB).
[Figure 1 diagram: an RGB image x, with pixel coordinates p_{i,j} = (i, j) and colour intensities x(p_{i,j}) = (r, g, b), is passed through an Implicit Neural Representation to produce the Activation Image t(x).]

Figure 1. Visual depiction of LINAC, our proposed input transformation. An RGB image x is converted into an Activation Image t(x) with identical spatial dimensions, but H channels instead of 3. A neural network model which maps pixel coordinates to RGB colour intensities is fit such that it approximates x. The resulting model parameters (after fitting) are called the Implicit Neural Representation (INR) of image x. In order to output correct RGB colour intensities for all pixels, the implicit neural network needs to compute a hierarchical functional decomposition of x. We empirically choose an intermediate representation to define our transformation. Activations in the middle hidden layer are associated with their corresponding pixel coordinates to form the output Activation Image t(x), with as many channels as there are units in the middle layer (H).
3. Hindering Adversarial Attacks with
Implicit Neural Representations
In this section we introduce LINAC, our proposed input
transformation which hinders adversarial attacks by leverag-
ing implicit neural representations, also illustrated in Fig. 1.
Setup.
We consider a supervised learning task with a dataset D ⊂ X × Y of pairs of images x and their corresponding labels y. We use a deterministic input transformation t: X → H which transforms input images, x ↦ t(x), while preserving their spatial dimensions. Further, we consider a classifier f_θ, parameterised by θ, whose parameters are estimated by Empirical Risk Minimisation (ERM) to map transformed inputs to labels, f_θ: H → Y. The model is not adversarially trained, yet finding adversarial examples for it is hindered by LINAC, as we demonstrate through extensive evaluations in Section 5.
Implicit Neural Representations.
For an image x, its implicit neural representation is given by a multi-layer perceptron (MLP) Φ = h_L ∘ h_{L−1} ∘ · · · ∘ h_0, with Φ: R² → R³ and L hidden layers, which maps spatial coordinates to their corresponding colours. Φ_φ is a solution to the implicit equation:

\Phi(p) - x(p) = 0,    (1)

where p are spatial coordinates (i.e. pixel locations) and x(p) are the corresponding image colours. Our input transformation leverages this implicit neural representation to encode images in an approximate manner.
Reconstruction Loss.
The implicit equation (1) can be translated (Sitzmann et al., 2020) into a standard reconstruction loss between image colours and the output of a multi-layer perceptron Φ_φ at each (2D) pixel location p_{i,j}:

L(\phi, x) = \sum_{i,j} \| \Phi_\phi(p_{i,j}) - x(p_{i,j}) \|_2^2.    (2)
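To make Eq. (2) concrete, the following is a minimal PyTorch sketch of a coordinate MLP standing in for Φ_φ together with the per-pixel squared-error loss; the width, depth, activation function and coordinate normalisation are illustrative assumptions rather than the exact configuration from Appendix A.

import torch
import torch.nn as nn

def make_coordinate_mlp(hidden: int = 128, layers: int = 5) -> nn.Sequential:
    """A small MLP mapping 2D pixel coordinates to RGB intensities (Phi_phi: R^2 -> R^3).

    Width, depth and ReLU activations are illustrative, not the paper's exact architecture.
    """
    blocks, in_dim = [], 2
    for _ in range(layers):
        blocks += [nn.Linear(in_dim, hidden), nn.ReLU()]
        in_dim = hidden
    blocks += [nn.Linear(in_dim, 3)]
    return nn.Sequential(*blocks)

def reconstruction_loss(mlp: nn.Sequential, image: torch.Tensor) -> torch.Tensor:
    """L(phi, x) from Eq. (2): sum of squared errors over all pixel locations."""
    I, J, _ = image.shape                                   # image is I x J x 3, values in [0, 1]
    ii, jj = torch.meshgrid(torch.arange(I), torch.arange(J), indexing="ij")
    coords = torch.stack([ii, jj], dim=-1).float().reshape(-1, 2)
    coords = coords / max(I - 1, J - 1)                     # normalise coordinates (an assumption)
    pred = mlp(coords)                                      # predicted colours, (I*J) x 3
    return ((pred - image.reshape(-1, 3)) ** 2).sum()

x = torch.rand(32, 32, 3)           # stand-in for a CIFAR-10 image
phi = make_coordinate_mlp()
print(reconstruction_loss(phi, x))  # L(phi, x) before any fitting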
Algorithm 1 The LINAC Transform
Inputs: RGB image x (with size I×J×3); private key; number of epochs N; mini-batch size M; number of MLP layers L; representation layer K; learning rate µ.
Output: Activation Image t(x) (with size I×J×H).
  rng = INIT_PRNG(private key)    ▷ Seed rng.
  φ(0) = INIT_MLP(rng, L)
  S = ⌊I·J/M⌋    ▷ Num. mini-batches per epoch.
  φ = φ(0)
  for epoch = 0 . . . N−1 do
    P = SHUFFLE_AND_SPLIT_PIXELS(x, rng, S)
    for m = 0 . . . S−1 do
      ℓ = (1/(M·I·J)) · Σ_{(i,j)∈P[m]} ‖Φ_φ(p_{i,j}) − x(p_{i,j})‖²₂
      φ = φ − µ·∇_φ ℓ
    end for
  end for
  φ̂_x = φ
  Return t(x) applying Eq. 3 using φ̂_x and layer K.
We provide pseudocode for the LINAC transform in Algorithm 1 and a discussion of computational and memory requirements in Appendix A.1.4. For each individual image x, we estimate φ̂_x, an approximate local minimiser of L(φ, x), using a stochastic iterative minimisation procedure with mini-batches of M pixels grouped into epochs, which cover the entire image in random order, for a total of N passes through all pixels.
Private Key.
A random number generator is used for: (1) generating the initial MLP parameters φ(0) and (2) for deciding which random subsets of pixels make up mini-batches in each epoch. This random number generator is seeded by a 64-bit integer which we keep secret and denote as the private key. Hence, for all inputs x we start each independent optimisation from the same set of initial parameters φ(0), and we use the same shuffling of pixels across epochs.
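The following Python sketch mirrors Algorithm 1 under stated assumptions: the private key seeds a single generator used for both parameter initialisation and pixel shuffling, and plain SGD minimises the mini-batch reconstruction loss. The MLP layout, initialisation scale and hyperparameter values are placeholders, not the released configuration.

import torch
import torch.nn as nn

def fit_inr(image: torch.Tensor, private_key: int, epochs: int = 10,
            batch_size: int = 256, hidden: int = 128, layers: int = 5,
            lr: float = 1e-2) -> nn.Sequential:
    """Fit an implicit neural representation of `image` (I x J x 3, values in [0, 1]).

    All randomness (parameter init and pixel shuffling) is driven by `private_key`,
    so the same key always yields the same phi^(0) and the same mini-batch order.
    Hyperparameters and the init scale here are illustrative, not the paper's.
    """
    gen = torch.Generator().manual_seed(private_key)        # rng = INIT_PRNG(private key)

    # phi^(0) = INIT_MLP(rng, L): build the MLP and re-sample its weights from `gen`.
    mods, in_dim = [], 2
    for _ in range(layers):
        mods += [nn.Linear(in_dim, hidden), nn.ReLU()]
        in_dim = hidden
    mods += [nn.Linear(in_dim, 3)]
    mlp = nn.Sequential(*mods)
    with torch.no_grad():
        for p in mlp.parameters():
            p.copy_(torch.randn(p.shape, generator=gen) * 0.05)

    I, J, _ = image.shape
    ii, jj = torch.meshgrid(torch.arange(I), torch.arange(J), indexing="ij")
    coords = torch.stack([ii, jj], dim=-1).float().reshape(-1, 2) / max(I - 1, J - 1)
    colours = image.reshape(-1, 3)

    opt = torch.optim.SGD(mlp.parameters(), lr=lr)
    steps_per_epoch = (I * J) // batch_size                 # S = floor(I*J / M)
    for _ in range(epochs):                                 # N passes over all pixels
        order = torch.randperm(I * J, generator=gen)        # SHUFFLE_AND_SPLIT_PIXELS
        for m in range(steps_per_epoch):
            idx = order[m * batch_size:(m + 1) * batch_size]
            loss = ((mlp(coords[idx]) - colours[idx]) ** 2).sum() / (batch_size * I * J)
            opt.zero_grad()
            loss.backward()
            opt.step()                                      # phi <- phi - mu * grad
    return mlp                                              # fitted parameters play the role of hat{phi}_x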
Lossy Implicit Network Activation Coding (LINAC).
We consider the lossy encoding of each pixel (i, j) in image x as the H-dimensional intermediate activations vector of layer K of the MLP evaluated at that pixel position: c_x(i, j) = (h_{K−1} ∘ · · · ∘ h_0)(p_{i,j}), where each layer uses the fitted parameters φ̂_x and K < L. We build the lossy implicit network activation coding transformation of an image x by stacking together the encodings of all its pixels in its 2D image grid, concatenating on the feature dimension axis. The LINAC transformation t(x) of the I×J×3 image x is given by:

t(x) = \begin{pmatrix} c_x(0, 0) & \cdots & c_x(0, J-1) \\ \vdots & \ddots & \vdots \\ c_x(I-1, 0) & \cdots & c_x(I-1, J-1) \end{pmatrix},    (3)

and has dimensionality I×J×H, where H is the number of outputs of the K-th layer of the MLP. By construction, our input transformation preserves the spatial dimensions of each image while increasing the feature dimensionality (from 3, the image's original number of colour channels, to H); this means that standard network architectures used for image classification (e.g. convolutional models) can be readily trained as the classifier f_θ.
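Given a fitted coordinate MLP (for instance from the fitting sketch above), the activation image of Eq. (3) can be read out by truncating the network at the representation layer and evaluating it at every pixel coordinate. The sketch below assumes an MLP built from alternating Linear/ReLU blocks and counts K in those blocks; this indexing convention is ours, chosen for illustration.

import torch
import torch.nn as nn

def linac_transform(mlp: nn.Sequential, I: int, J: int, K: int = 2) -> torch.Tensor:
    """Return t(x), the I x J x H activation image of Eq. (3).

    `mlp` is a fitted coordinate MLP built from alternating Linear/ReLU blocks;
    `K` counts how many (Linear, ReLU) blocks to keep, so the truncated network is
    h_{K-1} o ... o h_0 and H is the width of its last kept layer.
    """
    truncated = mlp[: 2 * K]                                # keep the first K Linear+ReLU pairs
    ii, jj = torch.meshgrid(torch.arange(I), torch.arange(J), indexing="ij")
    coords = torch.stack([ii, jj], dim=-1).float().reshape(-1, 2) / max(I - 1, J - 1)
    with torch.no_grad():
        acts = truncated(coords)                            # (I*J) x H activations c_x(i, j)
    return acts.reshape(I, J, -1)                           # stack encodings on the 2D grid

# Shape check with an (unfitted) stand-in MLP: a CIFAR-10 sized image yields a
# 32 x 32 x H activation image that a standard convolutional classifier can consume.
demo_mlp = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 3))
print(linac_transform(demo_mlp, 32, 32, K=2).shape)         # torch.Size([32, 32, 128])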
All omitted implementation details are provided in Ap-
pendix A, and sensitivity analyses of LINAC to its hyper-
parameters are reported in Appendix C.
Threat Model.
We are interested in hindering adversarial attacks on a nominally-trained classifier f_θ(t(x)), which operates on transformed inputs (i.e. on t(x) rather than on x), using a private key of our choosing. Next, we describe the threat model of interest by stating the conditions under which the LINAC defence is meant to hinder adversarial attacks on f_θ, following AprilPyone & Kiya (2021a).
We assume attackers do not have access to the private key,
the integer seed of the random number generator used for
computing the LINAC transformation, but otherwise have
full algorithmic knowledge about our approach. Specifically,
we assume an attacker has complete information about the
classification pipeline, including the architecture, training
dataset and weights of the defended classifier. This includes
full knowledge of the LINAC algorithm, the implicit net-
work architecture, parameter initialisation scheme and all
the fitting details, except for the private key.
4. Attacking the LINAC Defence
Setup.
We are interested in evaluating the apparent robustness of a LINAC-defended classifier, f_θ̂, which has been trained by ERM to classify transformed inputs from the dataset D. Specifically, its parameters θ̂ minimise E_{x,y∼D}[L_CE(f_θ(t(x)), y)], where L_CE is the cross-entropy loss and t(x) is the LINAC transformation applied to image x using the private key.
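As a rough sketch of this training setup, ERM on pre-computed activation images is ordinary supervised training whose first layer accepts H channels instead of 3; the tiny convolutional classifier, hyperparameters and random stand-in data below are placeholders, not the defended architecture evaluated in the paper.

import torch
import torch.nn as nn

H, NUM_CLASSES = 128, 10

# f_theta: any standard image classifier whose first layer accepts H channels.
classifier = nn.Sequential(
    nn.Conv2d(H, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, NUM_CLASSES),
)
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                       # L_CE

# `t_x` would be LINAC activation images computed with the private key; random
# tensors stand in here so the snippet runs.
t_x = torch.randn(16, H, 32, 32)                      # a mini-batch of t(x), N x H x I x J
y = torch.randint(0, NUM_CLASSES, (16,))

loss = loss_fn(classifier(t_x), y)                    # estimate of E[L_CE(f_theta(t(x)), y)]
opt.zero_grad(); loss.backward(); opt.step()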
Input Perturbations.
Classifiers defended by LINAC are not adversarially trained (Madry et al., 2018) to increase their robustness to specific L_p norm-bounded input perturbations. Furthermore, the LINAC defence is inherently agnostic about particular notions of maximum input perturbations. Nevertheless, to provide results comparable with a broad set of defences from the literature, we perform evaluations on standard L_p norm-bounded input perturbations with: (1) a maximum perturbation radius of ε = 8/255 in the L_∞ norm, and (2) one of ε = 0.5 in the L_2 norm.
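For reference, keeping a candidate perturbation inside either threat-model ball is a small projection step; the helper below is a generic sketch and not part of the LINAC code.

import torch

def project(delta: torch.Tensor, eps: float, norm: str) -> torch.Tensor:
    """Project a batch of perturbations onto the L_inf or L_2 ball of radius eps (per example)."""
    if norm == "linf":                                    # e.g. eps = 8/255
        return delta.clamp(-eps, eps)
    flat = delta.flatten(1)                               # e.g. eps = 0.5 in L_2
    norms = flat.norm(dim=1, keepdim=True).clamp_min(1e-12)
    factor = (eps / norms).clamp(max=1.0)
    return (flat * factor).view_as(delta)

delta = torch.randn(4, 3, 32, 32)
print(project(delta, 8 / 255, "linf").abs().max())        # <= 8/255
print(project(delta, 0.5, "l2").flatten(1).norm(dim=1))   # each norm <= 0.5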
Adapting Existing Attacks.
Without access to the pri-
vate key an attacker cannot compute the LINAC transfor-
mation exactly. However, an attacker could acquire ac-
cess to model inferences by attempting to brute-force guess
the private key. Another option would be to train surro-
gate models with LINAC, but using keys chosen by the at-
tacker, in the hope that decision boundaries of these models
would be similar enough to mount effective transfer attacks.
More advanced attackers could modify LINAC itself to en-
able strong Backward Pass Differentiable Approximation
(BPDA) (Athalye et al.,2018) attacks. We evaluate the
success of these and other standard attacks in Section 5.
Designing Adaptive Attacks.
Athalye et al. (2018) provide
an excellent set of guidelines for designing and perform-
ing successful adaptive attacks, while also standardising
results reporting and aggregation. Of particular interest for
defences based on input transformations are the BPDA and
Expectation-over-Transformation (EoT) attack strategies.
Subsequent work convincingly argues that adaptive attacks
are not meant to be general, and must be customised, or
“adapted”, to each defence in turn (Tramer et al.,2020).
While BPDA and EoT generate strong attacks on input
transformations, they both rely on being able to compute
the forward transformation or approximate it with samples.
Indeed, the authors mention that substitution of both the
forward and backward passes with approximations leads to
either completely ineffective, or much less effective attacks.
Parametric Bypass Approximation (PBA).
Inspired by the reparameterisation strategies of Athalye et al. (2018), we propose a bespoke attack by making use of several pieces of information available under our threat model: the parametric form of the defended classifier f_θ(t(x)), its training dataset D and loss function L_CE, and its trained weights θ̂.

A Parametric Bypass Approximation of an unknown nuisance transformation u: X → H is a surrogate parametric function h_ψ: X → H, parameterised by a solution to the following optimisation problem:

\psi = \arg\min_{\psi} \; E_{x,y \sim D} \left[ L_{CE}(f_{\hat{\theta}}(h_\psi(x)), y) \right].    (4)
This formulation seeks a set of parameters ψ which minimise the original classification loss while keeping the defended classifier's parameters frozen at θ̂. Similarly to classifier training, this optimisation problem can be solved efficiently using Stochastic Gradient Descent (SGD).
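A minimal sketch of this optimisation, assuming the defended classifier's weights θ̂ are available and kept frozen; the surrogate h_ψ is an arbitrary small convolutional network mapping 3-channel images to H-channel activation images, chosen purely for illustration, and the data are random stand-ins.

import torch
import torch.nn as nn

H, NUM_CLASSES = 128, 10

# Frozen defended classifier f_hat_theta (its weights would be loaded from the release;
# a random stand-in keeps the sketch runnable).
f_theta = nn.Sequential(
    nn.Conv2d(H, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, NUM_CLASSES),
)
for p in f_theta.parameters():
    p.requires_grad_(False)                       # keep theta frozen at hat{theta}

# Surrogate bypass transformation h_psi: X -> H (Eq. 4); the architecture is a placeholder.
h_psi = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, H, 3, padding=1),
)
opt = torch.optim.Adam(h_psi.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(16, 3, 32, 32)                     # training images (random stand-ins)
y = torch.randint(0, NUM_CLASSES, (16,))

# One SGD step on psi; only h_psi's parameters receive gradients.
loss = loss_fn(f_theta(h_psi(x)), y)              # L_CE(f_hat_theta(h_psi(x)), y)
opt.zero_grad(); loss.backward(); opt.step()

# Once fit, h_psi can replace the unknown transformation u in both the forward and the
# backward pass of a gradient-based attack such as PGD.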
A PBA adversarial attack can then proceed by approximating the outputs of the defended classifier f_θ̂(u(·)) with those of the bypass classifier f_θ̂(h_ψ(·)) in both forward and backward passes when computing adversarial examples, e.g. using Projected Gradient Descent (PGD).
The main advantages of the PBA strategy are that no forward passes through the nuisance transformation u(·) are required, and that it admits efficient computation of many attacks on f_θ̂, including gradient-based ones. In Section 5 we demonstrate the effectiveness of PBA beyond the LINAC defence. We show that, even though the surrogate transformation is fit on training data only, the defended classifier operating on samples passed through h_ψ (bypassing u) demonstrates nearly identical generalisation to the test set. Furthermore, we also show that PBA has greater success at finding adversarial examples for the LINAC defence compared to other methods. Lastly, we use PBA to invalidate an existing key-based defence proposed in the literature.
5. Results
5.1. Evaluation Methodology
Since LINAC makes no assumptions about adversarial perturbations, we are able to evaluate a single defended classifier model against all attack strategies considered, in contrast to much adversarial training research (Madry et al., 2018). To obtain a more comprehensive picture of apparent robustness we start from the rigorous evaluation methodology used by Gowal et al. (2019); Rebuffi et al. (2021). We perform untargeted PGD attacks with 100 steps and 10 randomised restarts, as well as multi-targeted (MT) PGD attacks using 200 steps and 20 restarts. Anticipating the danger of obfuscated gradients skewing results, we also evaluate with the Square approach of Andriushchenko et al. (2020), a powerful gradient-free attack, with 10000 evaluations and 10 restarts. For precise comparisons with the broader literature we also report evaluations using the parameter-free AutoAttack (AA) strategy of Croce & Hein (2020).

Following Athalye et al. (2018) we aggregate results across attacks by only counting as accurate robust predictions those test images for which the defended classifier predicts the correct class with and without adversarial perturbations, computed using all methods above. We report this as Best Known robust accuracy.
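This aggregation amounts to a per-example logical AND over clean correctness and the outcome of every attack; the NumPy sketch below uses hypothetical outcome arrays purely to illustrate the bookkeeping.

import numpy as np

n = 10000                                           # CIFAR-10 test set size
rng = np.random.default_rng(0)                      # hypothetical per-attack outcomes
clean_correct = rng.random(n) < 0.93
survived = {                                        # True = prediction stayed correct under that attack
    "pgd_100x10": rng.random(n) < 0.60,
    "mt_pgd_200x20": rng.random(n) < 0.58,
    "square_10000x10": rng.random(n) < 0.55,
    "autoattack": rng.random(n) < 0.57,
}

# Best Known robust accuracy: an image counts only if it is classified correctly
# without perturbation AND under every attack considered.
robust = clean_correct.copy()
for outcome in survived.values():
    robust &= outcome
print("Best Known robust accuracy:", robust.mean())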
In instances where several surrogate models are used to compute adversarial perturbations, also known as transfer attacks, we report Best Adversary results aggregated for each individual attack, which is defined as robust accuracy against all source models considered.

Figure 2. Results of direct attack on private key. A histogram of accuracies of the same defended classifier with inputs transformed using either the correct key or 100000 randomly chosen keys. An appropriate surrogate transformation is not found, invalidating attack vectors which rely on access to the outputs of the defended model on attacker-chosen inputs.

We aggregate evaluations across these two dimensions (attacks & surrogate models) by providing a single robust accuracy number against all attacks computed using all source models for each standard convention of maximum perturbation and norm, enabling easy comparisons with results in the literature.
5.2. Attacks with Surrogate Transformations & Models
A majority of adversarial attack strategies critically depend
on approximating the outputs of the defended classifier for
inputs chosen by the attacker. The private key is kept secret
in our threat model, which means that an attacker can neither
compute the precise input transformation used to train the
defended classifier, nor its outputs on novel data. Hence,
an attacker must find appropriate surrogate transformations,
or surrogate classifier models, in order to perform effective
adversarial attacks. We investigate both strategies below.
Firstly, we empirically check that the outputs of the de-
fended classifier cannot be usefully approximated without
knowledge of the private key. It is reasonable to hypothesise
that transformations with different keys may lead to simi-
lar functional representations of the input signal. We start
investigating this hypothesis by simply computing the accu-
racy of the defended model on clean input data transformed
with LINAC, but using keys chosen by the attacker, also
known as a brute-force key attack, which is valid under our
threat model. As reported in Figure 2, the accuracy of our
LINAC defended classifier on test inputs transformed with
the correct private key is over 93%. In an attempt to find a surrogate transformation, 100000 keys are picked uniformly at random. For each key, we independently evaluated the accuracy of the classifier using a batch of 100 test examples, and we report the resulting accuracy estimates for all keys with a histogram plot. The mean accuracy with random key guesses is around 30%, with a top accuracy of just 57% (see Table 4 in Appendix B.1 for a breakdown). Hence, using LINAC with incorrect keys leads to poor approximations of classifier outputs on correctly transformed data.
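The brute-force key evaluation can be sketched as a loop that samples candidate keys, transforms a small batch of test images with each, and records the defended classifier's accuracy. The names linac_transform_with_key and defended_classifier are placeholders for the released components, and the dummy stand-ins at the bottom exist only so the sketch executes with far fewer than 100000 keys.

import numpy as np

def brute_force_key_attack(defended_classifier, linac_transform_with_key,
                           images, labels, num_keys=1000, seed=0):
    """Estimate the defended classifier's accuracy under randomly guessed keys.

    `linac_transform_with_key(images, key)` and `defended_classifier(t_x)` are
    placeholders for the released LINAC transform and classifier.
    """
    rng = np.random.default_rng(seed)
    accuracies = []
    for key in rng.integers(0, 2**63 - 1, size=num_keys, dtype=np.int64):  # candidate 64-bit keys
        t_x = linac_transform_with_key(images, int(key))    # transform with a guessed key
        preds = defended_classifier(t_x).argmax(axis=-1)
        accuracies.append(float((preds == labels).mean()))
    return np.asarray(accuracies)                            # histogram these, as in Figure 2

# Dummy stand-ins so the sketch executes: a random "transform" and an identity "classifier".
if __name__ == "__main__":
    imgs = np.random.rand(100, 32, 32, 3)
    labs = np.random.randint(0, 10, size=100)
    fake_transform = lambda ims, key: np.random.default_rng(key % (2**32)).random((len(ims), 10))
    fake_classifier = lambda t: t
    accs = brute_force_key_attack(fake_classifier, fake_transform, imgs, labs, num_keys=50)
    print(accs.mean(), accs.max())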