What Can the Neural Tangent Kernel Tell Us About
Adversarial Robustness?
Nikolaos Tsilivis
Center for Data Science
New York University
nt2231@nyu.edu
Julia Kempe
Center for Data Science and
Courant Institute of Mathematical Sciences
New York University
kempe@nyu.edu
Abstract
The adversarial vulnerability of neural nets, and subsequent techniques to create robust models, have attracted significant attention; yet we still lack a full understanding of this phenomenon. Here, we study adversarial examples of trained neural networks through analytical tools afforded by recent theory advances connecting neural networks and kernel methods, namely the Neural Tangent Kernel (NTK), following a growing body of work that leverages the NTK approximation to successfully analyze important deep learning phenomena and design algorithms for new applications. We show how NTKs allow us to generate adversarial examples in a “training-free” fashion, and demonstrate that they transfer to fool their finite-width neural net counterparts in the “lazy” regime. We leverage this connection to provide an alternative view on robust and non-robust features, which have been suggested to underlie the adversarial brittleness of neural nets. Specifically, we define and study features induced by the eigendecomposition of the kernel to better understand the role of robust and non-robust features, the reliance on both for standard classification, and the robustness-accuracy trade-off. We find that such features are surprisingly consistent across architectures, and that robust features tend to correspond to the largest eigenvalues of the model, and thus are learned early during training. Our framework allows us to identify and visualize non-robust yet useful features. Finally, we shed light on the robustness mechanism underlying adversarial training of neural nets used in practice: quantifying the evolution of the associated empirical NTK, we demonstrate that its dynamics falls much earlier into the “lazy” regime and manifests a much stronger form of the well-known bias to prioritize learning features within the top eigenspaces of the kernel, compared to standard training.
1 Introduction
Despite the tremendous success of deep neural networks in many computer vision and language
modeling tasks, as well as in scientific discoveries, their properties and the reasons for their success
are still poorly understood. Focusing on computer vision, a particularly surprising phenomenon
evidencing that those machines drift away from how humans perform image recognition is the
presence of adversarial examples, images that are almost identical to the original ones, yet are
misclassified by otherwise accurate models.
Since their discovery (Szegedy et al., 2014), a vast amount of work has been devoted to understanding
the sources of adversarial examples; explanations include, but are not limited to, the close-to-linear
operating mode of neural nets (Goodfellow et al., 2015), the curse of dimensionality carried by the
input space (Goodfellow et al., 2015, Simon-Gabriel et al., 2019), insufficient model capacity (Tsipras
et al., 2019, Nakkiran, 2019) or spurious correlations found in common datasets (Ilyas et al., 2019).
Figure 1: Top. Standard setup of an adversarial attack, where a barely perceivable perturbation is added to an image to confuse an accurate classifier. Bottom. The correspondence between neural networks and kernel machines allows us to visualize a decomposition of this perturbation, each part attributed to a different feature of the model. The first few features tend to be robust.
In particular, one widespread viewpoint is that adversarial vulnerability is the result of a model’s
sensitivity to imperceptible yet well-generalizing features in the data, so-called useful non-robust
features, giving rise to a trade-off between accuracy and robustness (Tsipras et al., 2019, Zhang
et al., 2019). This gradual understanding has enabled the design of training algorithms that provide
convincing, yet partial, remedies to the problem; the most prominent of them being adversarial
training and its many variants (Goodfellow et al., 2015, Madry et al., 2018, Croce et al., 2021). Yet
we are far from a mature, unified theory of robustness that is powerful enough to universally guide
engineering choices or defense mechanisms.
In this work, we aim to get a deeper understanding of adversarial robustness (or lack thereof) by
focusing on the recently established connection of neural networks with kernel machines. Infinitely
wide neural networks, trained via gradient descent with infinitesimal learning rate, provably become
kernel machines with a data-independent but architecture-dependent kernel, its Neural Tangent Kernel (NTK), that remains constant during training (Jacot et al., 2018, Lee et al., 2019, Arora et al.,
2019b, Liu et al., 2020). The analytical tools afforded by the rich theory of kernels have resulted
in progress in understanding the optimization landscape and generalization capabilities of neural
networks (Du et al., 2019a, Arora et al., 2019a), together with the discovery of interesting deep
learning phenomena (Fort et al., 2020, Ortiz-Jiménez et al., 2021), while also inspiring practical
advances in diverse areas of applications such as the design of better classifiers (Shankar et al.,
2020), efficient neural architecture search (Chen et al., 2021), low-dimensional tasks in graphics
(Tancik et al., 2020) and dataset distillation (Nguyen et al., 2021). While the NTK approximation is
increasingly utilized, even for finite-width neural nets, little is known about the adversarial robustness
properties of these infinitely wide models.
Our contribution: Our work inscribes itself into the quest to leverage analytical tools afforded by
kernel methods, in particular spectral analysis, to track properties of interest in the associated neural
nets, in this case as they pertain to robustness. To this end, we first demonstrate that adversarial
perturbations generated analytically with the NTK can successfully lead the associated trained wide
neural networks (in the kernel-regime) to misclassify, thus allowing kernels to faithfully predict the
lack of robustness of those trained neural networks. In other words, adversarial (non-) robustness
transfers from kernels to networks; and adversarial perturbations generated via kernels resemble
those generated by the corresponding trained networks. One implication of this transferability is that
we can analytically devise adversarial examples that do not require access to the trained model and in
particular its weights; instead these “blind spots” may be calculated a priori, before training starts.
A perhaps even more crucial implication of the NTK approach to robustness relates to the understand-
ing of adversarial examples. Indeed, we show how the spectrum of the NTK provides an alternative
way to define features of the model, to classify them according to their robustness and usefulness for
Figure 2: Left: Top 5 features for 7 different kernel architectures for a car image extracted from the CIFAR10 dataset when trained on car and plane images. Right: Features according to their robustness (x-axis) and usefulness (y-axis). Larger/darker bullets correspond to larger eigenvalues. Useful features have >0.5-usefulness; shaded boxes are meant to help visualize useful-robust regions.
correct predictions and visually inspect them via their contribution to the adversarial perturbation
(see Fig. 1). This in turn allows us to verify previously conjectured properties of standard classifiers;
dependence on both robust and non-robust features in the data (Tsipras et al., 2019), and tradeoff
of accuracy and robustness during training. In particular we observe that features tend to be rather
invariable across architectures, and that robust features tend to correspond to the top of the eigenspec-
trum (see Fig. 2), and as such are learned first by the corresponding wide nets (Arora et al., 2019a,
Jacot et al., 2018). Moreover, we are able to visualize useful non-robust features of standard models
(Fig. 4). While this conceptual feature distinction has been highly influential in recent works that
study the robustness of deep neural networks (see for example (Allen-Zhu and Li, 2022, Kim et al.,
2021, Springer et al., 2021)), to the best of our knowledge, none of them has explicitly demonstrated
the dependence of networks on such feature functions (except for simple linear models (Goh, 2019)).
Rather, these works either reveal such features in some indirect fashion, or accept their existence as
an assumption. Here, we show that Neural Tangent Kernel theory endows us with a natural definition
of features through its eigendecomposition and provides a way to visualize and inspect robust and non-robust features directly on the function space of trained neural networks.
Interestingly, this connection also enables us to empirically demonstrate that robust features of
standard models alone are not enough for robust classification. Aiming to understand, then, what
makes robust models robust, we track the evolution of the data-dependent empirical NTK during
adversarial training of neural networks used in practice. Prior experimental work has found that
networks with a non-trivial width-to-depth ratio trained with large learning rates depart from the NTK regime and fall into the so-called “rich feature” regime, where the NTK changes substantially
during training (Geiger et al., 2020, Fort et al., 2020, Baratin et al., 2021, Ortiz-Jiménez et al., 2021).
In our work, which to the best of our knowledge is the first to provide insights on how the kernel
behaves during adversarial training, we find that the NTK evolves much faster compared to standard
training, simultaneously both changing its features and assigning more importance to the more robust
ones, giving direct insight into the mechanism at play during adversarial training (see Fig. 6). In
summary, the contributions of our work are the following:
• We discuss how to generate adversarial examples for infinitely-wide neural networks via the NTK, and show that they transfer to fool their associated (finite width) nets in the appropriate regime, yielding a “training-free” attack without need to access model weights (Sec. 3).
• Using the spectrum of the NTK, we give an alternative definition of features, providing a natural decomposition of perturbations into robust and non-robust parts (Tsipras et al., 2019, Ilyas et al., 2019) (Fig. 1). We confirm that robust features overwhelmingly correspond to the top part of the eigenspectrum; hence they are learned early on in training. We bolster previously conjectured hypotheses that prediction relies on both robust and non-robust features and that robustness is traded for accuracy during standard training. Further, we show that only utilizing the robust features of standard models is not sufficient for robust classification (Sec. 4).
• We turn to finite-width neural nets with standard parameters to study the dynamics of their empirical NTK during adversarial training. We show that the kernel rotates in a way that both enables new (robust) feature learning and drastically increases the importance (relative weight) of the robust features over the non-robust ones. We further highlight the structural differences of the kernel change during adversarial training versus standard training and observe that the kernel seems to enter the “lazy” regime much faster (Sec. 5).
Collectively, our findings may help explain many phenomena present in the adversarial ML literature
and further elucidate both the vulnerability of standard models and the robustness of adversarially
trained ones. We provide code to visualize features induced by kernels, giving a unique and principled
way to inspect features induced by standardly trained nets (available at https://github.com/Tsili42/adv-ntk).
Related work: To the best of our knowledge, the only prior work that leverages NTK theory to derive
perturbations in some adversarial setting is due to Yuan and Wu (2021), yet with an entirely different focus. It deals with what is coined generalization attacks: the process of altering the training data distribution to prevent models from generalizing on clean data. Bai et al. (2021) study aspects of robust
models through their linearized sub-networks, but do not leverage NTKs.
2 Preliminaries
We introduce background material and definitions important to our analysis. Here, we restrict
ourselves to binary classification, to keep notation light. We defer the multiclass case, complete
definitions and a more detailed discussion of prior work to the Appendix.
2.1 Adversarial Examples
Let $f$ be a classifier, $x$ be an input (e.g. a natural image) and $y$ its label (e.g. the image class). Then, given that $f$ is an accurate classifier on $x$, $\tilde{x}$ is an adversarial example (Szegedy et al., 2014) for $f$ if

i) the distance $d(x, \tilde{x})$ is small. Common choices in computer vision are the $\ell_p$ norms, especially the $\ell_\infty$ norm on which we focus henceforth, and

ii) $f(\tilde{x}) \neq y$. That is, the perturbed input is being misclassified.
Given a loss function $\mathcal{L}$, such as cross-entropy, one can construct an adversarial example $\tilde{x} = x + \eta$ by finding the perturbation $\eta$ that produces the maximal increase of the loss, solving
$$\eta = \arg\max_{\|\eta\| \le \epsilon} \mathcal{L}(f(x + \eta), y), \qquad (1)$$
for some $\epsilon > 0$ that quantifies the dissimilarity between the two examples. In general, this is a non-convex problem and one can resort to first order methods (Goodfellow et al., 2015)
$$\tilde{x} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x \mathcal{L}(f(x), y)\right), \qquad (2)$$
or iterative versions for solving it (Kurakin et al., 2017, Madry et al., 2018). The former method
is usually called Fast Gradient Sign Method (FGSM) and the latter Projected Gradient Descent
(PGD). These methods are able to produce examples that are being misclassified by common neural
networks with a probability that approaches 1 (Carlini and Wagner, 2017). Even more surprisingly,
it has been observed that adversarial examples crafted to “fool” one machine learning model are
consistently capable of “fooling” others (Papernot et al., 2016, 2017), a phenomenon that is known as
the transferability of adversarial examples. Finally, adversarial training refers to the alteration of the
training procedure to include adversarial samples for teaching the model to be robust (Goodfellow
et al., 2015, Madry et al., 2018) and empirically holds as the strongest defense against adversarial
examples (Madry et al., 2018, Zhang et al., 2019).
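To make Eqs. (1)-(2) concrete, here is a minimal PyTorch sketch of the two attacks described above. It is not the implementation used in this paper; the model, the cross-entropy loss, and the step sizes are illustrative assumptions. FGSM takes a single sign-gradient step of size $\epsilon$, while PGD iterates smaller steps and projects back onto the $\epsilon$-ball around $x$.

```python
# Illustrative sketch, not the paper's code: FGSM (Eq. 2) and an l_inf PGD attack.
# `model` is assumed to be any differentiable classifier returning logits.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step attack: x_adv = x + eps * sign(grad_x L(f(x), y))."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).detach()

def pgd(model, x, y, eps, alpha=None, steps=10):
    """Iterative variant of Eq. (1): ascend the loss, then project back onto
    the l_inf ball of radius eps around the clean input x."""
    alpha = alpha or eps / 4
    x_adv = x.clone()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.clamp(x_adv, x - eps, x + eps)  # projection onto the eps-ball
    return x_adv.detach()
```

In practice one would also clip the result to the valid pixel range; that detail is omitted here for brevity.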
2.2 Robust and Non-Robust Features
Despite a vast amount of research, the reasons behind the existence of adversarial examples are
not perfectly clear. A line of work has argued that a central reason is the presence of robust and
non-robust features in the data that standard models learn to rely upon (Tsipras et al., 2019, Ilyas
et al., 2019). In particular it is conjectured that reliance on useful but non-robust features during
training is responsible for the brittleness of neural nets. Here, we slightly adapt the feature definitions
of (Ilyas et al., 2019)¹, and extend them to multi-class problems (see Appendix A).
Let $\mathcal{D}$ be the data generating distribution with $x \in \mathcal{X}$ and $y \in \{\pm 1\}$. We define a feature as a function $\phi: \mathcal{X} \to \mathbb{R}$ and distinguish how they perform as classifiers. Fix $\rho, \gamma \ge 0$:

1. $\rho$-Useful feature: A feature $\phi$ is called $\rho$-useful if
$$\mathbb{E}_{x,y\sim\mathcal{D}}\,\mathbb{1}\{\mathrm{sign}[\phi(x)] = y\} = \rho \qquad (3)$$

2. $\gamma$-Robust feature: A feature $\phi$ is called $\gamma$-robust if it remains useful under any perturbation inside a bounded “ball” $\mathcal{B}$, that is if
$$\mathbb{E}_{x,y\sim\mathcal{D}} \inf_{\delta \in \mathcal{B}} \mathbb{1}\{\mathrm{sign}[\phi(x+\delta)] = y\} = \gamma \qquad (4)$$
In general, a feature adds predictive value if it gives an advantage above guessing the most likely label, i.e. $\rho > \max_{y' \in \{\pm 1\}} \mathbb{E}_{x,y\sim\mathcal{D}}[\mathbb{1}\{y' = y\}]$, and we will speak of “useful” features in this case, omitting the $\rho$. We will call such a feature useful, non-robust if it is useful, but $\gamma$-robust only for $\gamma = 0$ or very close to 0, depending on context.
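Definitions (3) and (4) can be estimated empirically on samples. Below is a minimal sketch (our illustration, not the authors' code; the feature function `phi`, the budget `eps`, and the use of a first-order attack to approximate the infimum are all assumptions) that measures the usefulness of a scalar feature on a labeled batch, together with a heuristic upper bound on its robustness under $\ell_\infty$ perturbations.

```python
# Illustrative sketch: Monte-Carlo estimates of rho-usefulness (Eq. 3) and a
# heuristic upper bound on gamma-robustness (Eq. 4) for a scalar feature phi.
import torch

def usefulness(phi, x, y):
    """Fraction of samples where sign(phi(x)) agrees with the +/-1 label y."""
    with torch.no_grad():
        return (torch.sign(phi(x)) == y).float().mean().item()

def robustness_upper_bound(phi, x, y, eps=8 / 255, steps=10):
    """Accuracy of sign(phi) under a sign-gradient attack constrained to the
    l_inf ball of radius eps; the infimum in Eq. (4) can only be lower, so
    this estimate upper-bounds gamma."""
    delta = torch.zeros_like(x)
    for _ in range(steps):
        delta = delta.detach().requires_grad_(True)
        margin = (y * phi(x + delta)).sum()  # positive when classified correctly
        grad = torch.autograd.grad(margin, delta)[0]
        delta = (delta - (eps / steps) * grad.sign()).clamp(-eps, eps)
    with torch.no_grad():
        return (torch.sign(phi(x + delta)) == y).float().mean().item()
```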
The vast majority of works imagines features as being induced by the activations of neurons in the
net, most commonly those of the penultimate layer (representation-layer features), but the previous
formal definitions are in no way restricted to activations, and we will show how to exploit them using
the eigenspectrum of the NTK. In particular, in Sec. 4, we demonstrate that the above framework
agrees perfectly with features induced by the eigenspectrum of the NTK of a network, providing a
natural way to decompose the predictions of the NTK into such feature functions; specifically, we can identify robust, useful, and, indeed, useful non-robust features.
2.3 Neural Tangent Kernel
Let $f: \mathbb{R}^d \to \mathbb{R}$ be a (scalar) neural network with a linear final layer parameterized by a set of weights $w \in \mathbb{R}^p$ and $\{\mathcal{X}, \mathcal{Y}\}$ be a dataset of size $n$, with $\mathcal{X} \in \mathbb{R}^{n \times d}$ and $\mathcal{Y} \in \{\pm 1\}^n$. Linearized training methods study the first order approximation
$$f(x; w_{t+1}) = f(x; w_t) + \nabla_w f(x; w_t)^\top (w_{t+1} - w_t). \qquad (5)$$
The network gradient $\nabla_w f(x; w_t)$ induces a kernel function $\Theta_t: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, usually referred to as the Neural Tangent Kernel (NTK) of the model,
$$\Theta_t(x, x') = \nabla_w f(x; w_t)^\top \nabla_w f(x'; w_t). \qquad (6)$$
This kernel describes the dynamics with infinitesimal learning rate (gradient flow). In general, the tangent space spanned by the $\nabla_w f(x; w_t)$ twists substantially during training, and learning with the Gram matrix of Eq. (6) (empirical NTK) corresponds to training along an intermediate tangent plane. Remarkably, however, in the infinite width limit with appropriate initialization and low learning rate, it has been shown that $f$ becomes a linear function of the parameters (Jacot et al., 2018, Liu et al., 2020), and the NTK remains constant ($\Theta_t = \Theta_0 =: \Theta$). Then, for learning with $\ell_2$ loss the training dynamics of infinitely wide networks admits a closed form solution corresponding to kernel regression (Jacot et al., 2018, Lee et al., 2019, Arora et al., 2019b)
$$f_t(x) = \Theta(x, \mathcal{X})^\top\, \Theta^{-1}(\mathcal{X}, \mathcal{X}) \left(I - e^{-\lambda \Theta(\mathcal{X},\mathcal{X}) t}\right) \mathcal{Y}, \qquad (7)$$
where $x \in \mathbb{R}^d$ is any input (training or testing), $t$ denotes the time evolution of gradient descent, $\lambda$ is the (small) learning rate and, slightly abusing notation, $\Theta(\mathcal{X}, \mathcal{X}) \in \mathbb{R}^{n \times n}$ denotes the matrix containing the pairwise training values of the NTK, $\Theta(\mathcal{X},\mathcal{X})_{ij} = \Theta(x_i, x_j)$, and similarly for $\Theta(x, \mathcal{X}) \in \mathbb{R}^n$. To be precise, Eq. (7) gives the mean output of the network using a weight-independent kernel with variance depending on the initialization².
¹ We distinguish useful and robust features based on their accuracy as classifiers, not in terms of correlation with the labels as in Ilyas et al. (2019), allowing a natural extension to the multi-class setting. For robustness, we consider any accuracy bounded away from zero as robust, quantifying that an adversary cannot drive accuracy to zero entirely.
² For that reason, in the experiments, we often compare this with the centered prediction of the actual neural network, $f - f_0$, as is commonly done in similar studies (Chizat et al., 2019).
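As a concrete illustration of Eqs. (6)-(7), the following sketch (again not the paper's released code; the scalar-output PyTorch model `f`, the explicit per-sample parameter gradients, and the eigendecomposition used to apply Eq. (7) are all assumptions, practical only for small $n$ and $p$) computes an empirical NTK Gram matrix and the corresponding kernel-regression prediction at training time $t$.

```python
# Illustrative sketch: empirical NTK (Eq. 6) and the mean linearized prediction
# of Eq. (7) for a scalar-output model `f`, a small training set (X, Y) with
# float labels in {-1, +1}, and learning rate lr.
import torch

def param_grad(f, x):
    """Flattened gradient of the scalar output f(x) with respect to all weights."""
    out = f(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, list(f.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def empirical_ntk(f, X1, X2):
    """Gram matrix Theta(X1, X2)_ij = <grad_w f(x_i), grad_w f(x'_j)>."""
    G1 = torch.stack([param_grad(f, x) for x in X1])   # (n1, p)
    G2 = torch.stack([param_grad(f, x) for x in X2])   # (n2, p)
    return G1 @ G2.T

def ntk_prediction(f, x, X, Y, t, lr):
    """Theta(x,X)^T Theta^{-1} (I - exp(-lr * Theta * t)) Y, evaluated spectrally;
    assumes the Gram matrix is positive definite."""
    K = empirical_ntk(f, X, X)                          # Theta(X, X)
    k = empirical_ntk(f, x.unsqueeze(0), X).squeeze(0)  # Theta(x, X)
    evals, evecs = torch.linalg.eigh(K)
    weights = (1.0 - torch.exp(-lr * evals * t)) / evals
    return ((k @ evecs) * weights) @ (evecs.T @ Y)
```

Splitting this spectral sum into one term per eigenvector of $\Theta(\mathcal{X},\mathcal{X})$ is what yields the feature functions discussed in Sec. 4.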