What Can the Neural Tangent Kernel Tell Us About
Adversarial Robustness?
Nikolaos Tsilivis
Center for Data Science
New York University
nt2231@nyu.edu
Julia Kempe
Center for Data Science and
Courant Institute of Mathematical Sciences
New York University
kempe@nyu.edu
Abstract
The adversarial vulnerability of neural nets, and subsequent techniques to create robust models, have attracted significant attention; yet we still lack a full understanding of this phenomenon. Here, we study adversarial examples of trained neural networks through analytical tools afforded by recent theory advances connecting neural networks and kernel methods, namely the Neural Tangent Kernel (NTK), following a growing body of work that leverages the NTK approximation to successfully analyze important deep learning phenomena and design algorithms for new applications. We show how NTKs allow us to generate adversarial examples in a “training-free” fashion, and demonstrate that they transfer to fool their finite-width neural net counterparts in the “lazy” regime. We leverage this connection to provide an alternative view on robust and non-robust features, which have been suggested to underlie the adversarial brittleness of neural nets. Specifically, we define and study features induced by the eigendecomposition of the kernel to better understand the role of robust and non-robust features, the reliance on both for standard classification, and the robustness-accuracy trade-off. We find that such features are surprisingly consistent across architectures, and that robust features tend to correspond to the largest eigenvalues of the model, and thus are learned early during training. Our framework allows us to identify and visualize non-robust yet useful features. Finally, we shed light on the robustness mechanism underlying adversarial training of neural nets used in practice: quantifying the evolution of the associated empirical NTK, we demonstrate that its dynamics falls much earlier into the “lazy” regime and manifests a much stronger form of the well-known bias to prioritize learning features within the top eigenspaces of the kernel, compared to standard training.
1 Introduction
Despite the tremendous success of deep neural networks in many computer vision and language
modeling tasks, as well as in scientific discoveries, their properties and the reasons for their success
are still poorly understood. Focusing on computer vision, a particularly surprising phenomenon
evidencing that those machines drift away from how humans perform image recognition is the
presence of adversarial examples, images that are almost identical to the original ones, yet are
misclassified by otherwise accurate models.
Since their discovery (Szegedy et al., 2014), a vast amount of work has been devoted to understanding
the sources of adversarial examples; explanations include, but are not limited to, the close-to-linear
operating mode of neural nets (Goodfellow et al., 2015), the curse of dimensionality carried by the
input space (Goodfellow et al., 2015, Simon-Gabriel et al., 2019), insufficient model capacity (Tsipras
et al., 2019, Nakkiran, 2019) or spurious correlations found in common datasets (Ilyas et al., 2019).
Figure 1: Top. Standard setup of an adversarial attack, where a barely perceivable perturbation is added to an image to confuse an accurate classifier. Bottom. The correspondence between neural networks and kernel machines allows us to visualize a decomposition of this perturbation, each part attributed to a different feature of the model. The first few features tend to be robust.
In particular, one widespread viewpoint is that adversarial vulnerability is the result of a model’s
sensitivity to imperceptible yet well-generalizing features in the data, so-called useful non-robust
features, giving rise to a trade-off between accuracy and robustness (Tsipras et al., 2019, Zhang
et al., 2019). This gradual understanding has enabled the design of training algorithms that provide
convincing, yet partial, remedies to the problem; the most prominent of them being adversarial
training and its many variants (Goodfellow et al., 2015, Madry et al., 2018, Croce et al., 2021). Yet
we are far from a mature, unified theory of robustness that is powerful enough to universally guide
engineering choices or defense mechanisms.
In this work, we aim to get a deeper understanding of adversarial robustness (or lack thereof) by
focusing on the recently established connection of neural networks with kernel machines. Infinitely
wide neural networks, trained via gradient descent with infinitesimal learning rate, provably become
kernel machines with a data-independent but architecture-dependent kernel, its Neural Tangent Kernel (NTK), that remains constant during training (Jacot et al., 2018, Lee et al., 2019, Arora et al.,
2019b, Liu et al., 2020). The analytical tools afforded by the rich theory of kernels have resulted
in progress in understanding the optimization landscape and generalization capabilities of neural
networks (Du et al., 2019a, Arora et al., 2019a), together with the discovery of interesting deep
learning phenomena (Fort et al., 2020, Ortiz-Jiménez et al., 2021), while also inspiring practical
advances in diverse areas of applications such as the design of better classifiers (Shankar et al.,
2020), efficient neural architecture search (Chen et al., 2021), low-dimensional tasks in graphics
(Tancik et al., 2020) and dataset distillation (Nguyen et al., 2021). While the NTK approximation is
increasingly utilized, even for finite-width neural nets, little is known about the adversarial robustness
properties of these infinitely wide models.
Our contribution: Our work inscribes itself into the quest to leverage analytical tools afforded by
kernel methods, in particular spectral analysis, to track properties of interest in the associated neural
nets, in this case as they pertain to robustness. To this end, we first demonstrate that adversarial
perturbations generated analytically with the NTK can successfully lead the associated trained wide
neural networks (in the kernel-regime) to misclassify, thus allowing kernels to faithfully predict the
lack of robustness of those trained neural networks. In other words, adversarial (non-) robustness
transfers from kernels to networks; and adversarial perturbations generated via kernels resemble
those generated by the corresponding trained networks. One implication of this transferability is that
we can analytically devise adversarial examples that do not require access to the trained model and in
particular its weights; instead these “blind spots” may be calculated a priori, before training starts.
A perhaps even more crucial implication of the NTK approach to robustness relates to the understand-
ing of adversarial examples. Indeed, we show how the spectrum of the NTK provides an alternative
way to define features of the model, to classify them according to their robustness and usefulness for
Figure 2: Left: Top 5 features for 7 different kernel architectures for a car image extracted from the CIFAR10 dataset when trained on car and plane images. Right: Features according to their robustness (x-axis) and usefulness (y-axis). Larger/darker bullets correspond to larger eigenvalues. Useful features have >0.5-usefulness; shaded boxes are meant to help visualize useful-robust regions.
correct predictions and visually inspect them via their contribution to the adversarial perturbation
(see Fig. 1). This in turn allows us to verify previously conjectured properties of standard classifiers;
dependence on both robust and non-robust features in the data (Tsipras et al., 2019), and tradeoff
of accuracy and robustness during training. In particular we observe that features tend to be rather
invariable across architectures, and that robust features tend to correspond to the top of the eigenspec-
trum (see Fig. 2), and as such are learned first by the corresponding wide nets (Arora et al., 2019a,
Jacot et al., 2018). Moreover, we are able to visualize useful non-robust features of standard models
(Fig. 4). While this conceptual feature distinction has been highly influential in recent works that
study the robustness of deep neural networks (see for example (Allen-Zhu and Li, 2022, Kim et al.,
2021, Springer et al., 2021)), to the best of our knowledge, none of them has explicitly demonstrated
the dependence of networks on such feature functions (except for simple linear models (Goh, 2019)).
Rather, these works either reveal such features in some indirect fashion, or accept their existence as
an assumption. Here, we show that Neural Tangent Kernel theory endows us with a natural definition
of features through its eigendecomposition and provides a way to visualize and inspect robust and non-robust features directly on the function space of trained neural networks.
Interestingly, this connection also enables us to empirically demonstrate that robust features of
standard models alone are not enough for robust classification. Aiming to understand, then, what
makes robust models robust, we track the evolution of the data-dependent empirical NTK during
adversarial training of neural networks used in practice. Prior experimental work has found that
networks with a non-trivial width-to-depth ratio trained with large learning rates depart from the NTK regime and fall into the so-called “rich feature” regime, where the NTK changes substantially
during training (Geiger et al., 2020, Fort et al., 2020, Baratin et al., 2021, Ortiz-Jiménez et al., 2021).
In our work, which to the best of our knowledge is the first to provide insights on how the kernel
behaves during adversarial training, we find that the NTK evolves much faster compared to standard
training, simultaneously both changing its features and assigning more importance to the more robust
ones, giving direct insight into the mechanism at play during adversarial training (see Fig. 6). In
summary, the contributions of our work are the following:
• We discuss how to generate adversarial examples for infinitely-wide neural networks via the NTK, and show that they transfer to fool their associated (finite width) nets in the appropriate regime, yielding a “training-free” attack without need to access model weights (Sec. 3).
• Using the spectrum of the NTK, we give an alternative definition of features, providing a natural decomposition of perturbations into robust and non-robust parts (Tsipras et al., 2019, Ilyas et al., 2019) (Fig. 1). We confirm that robust features overwhelmingly correspond to the top part of the eigenspectrum; hence they are learned early on in training. We bolster previously conjectured hypotheses that prediction relies on both robust and non-robust features and that robustness is traded for accuracy during standard training. Further, we show that only utilizing the robust features of standard models is not sufficient for robust classification (Sec. 4).
• We turn to finite-width neural nets with standard parameters to study the dynamics of their empirical NTK during adversarial training. We show that the kernel rotates in a way that both enables new (robust) feature learning and drastically increases the importance (relative weight) of the robust features over the non-robust ones. We further highlight the structural differences of the kernel change during adversarial training versus standard training and observe that the kernel seems to enter the “lazy” regime much faster (Sec. 5).
Collectively, our findings may help explain many phenomena present in the adversarial ML literature
and further elucidate both the vulnerability of standard models and the robustness of adversarially
trained ones. We provide code to visualize features induced by kernels, giving a unique and principled
way to inspect features induced by standardly trained nets (available at https://github.com/Tsili42/adv-ntk).
Related work: To the best of our knowledge, the only prior work that leverages NTK theory to derive
perturbations in some adversarial setting is due to Yuan and Wu (2021), yet with an entirely different focus. It deals with what is coined generalization attacks: the process of altering the training data distribution to prevent models from generalizing on clean data. Bai et al. (2021) study aspects of robust
models through their linearized sub-networks, but do not leverage NTKs.
2 Preliminaries
We introduce background material and definitions important to our analysis. Here, we restrict
ourselves to binary classification, to keep notation light. We defer the multiclass case, complete
definitions and a more detailed discussion of prior work to the Appendix.
2.1 Adversarial Examples
Let $f$ be a classifier, $x$ be an input (e.g. a natural image) and $y$ its label (e.g. the image class). Then, given that $f$ is an accurate classifier on $x$, $\tilde{x}$ is an adversarial example (Szegedy et al., 2014) for $f$ if

i) the distance $d(x, \tilde{x})$ is small. Common choices in computer vision are the $\ell_p$ norms, especially the $\ell_\infty$ norm on which we focus henceforth, and

ii) $f(\tilde{x}) \neq y$. That is, the perturbed input is being misclassified.
Given a loss function $\mathcal{L}$, such as cross-entropy, one can construct an adversarial example $\tilde{x} = x + \eta$ by finding the perturbation $\eta$ that produces the maximal increase of the loss, solving
$$\eta = \arg\max_{\|\eta\| \le \epsilon} \mathcal{L}(f(x + \eta), y), \qquad (1)$$
for some $\epsilon > 0$ that quantifies the dissimilarity between the two examples. In general, this is a non-convex problem and one can resort to first order methods (Goodfellow et al., 2015)
$$\tilde{x} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x \mathcal{L}(f(x), y)\right), \qquad (2)$$
or iterative versions for solving it (Kurakin et al., 2017, Madry et al., 2018). The former method
is usually called Fast Gradient Sign Method (FGSM) and the latter Projected Gradient Descent
(PGD). These methods are able to produce examples that are being misclassified by common neural
networks with a probability that approaches 1 (Carlini and Wagner, 2017). Even more surprisingly,
it has been observed that adversarial examples crafted to “fool” one machine learning model are
consistently capable of “fooling” others (Papernot et al., 2016, 2017), a phenomenon that is known as
the transferability of adversarial examples. Finally, adversarial training refers to the alteration of the
training procedure to include adversarial samples for teaching the model to be robust (Goodfellow
et al., 2015, Madry et al., 2018) and empirically holds as the strongest defense against adversarial
examples (Madry et al., 2018, Zhang et al., 2019).
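To make Eqs. (1)-(2) concrete, here is a minimal PyTorch sketch of the two attacks described above. It is not the implementation used in this paper; the model, the cross-entropy loss, and the step sizes are illustrative assumptions. FGSM takes a single sign-gradient step of size $\epsilon$, while PGD iterates smaller steps and projects back onto the $\epsilon$-ball around $x$.

```python
# Illustrative sketch, not the paper's code: FGSM (Eq. 2) and an l_inf PGD attack.
# `model` is assumed to be any differentiable classifier returning logits.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step attack: x_adv = x + eps * sign(grad_x L(f(x), y))."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).detach()

def pgd(model, x, y, eps, alpha=None, steps=10):
    """Iterative variant of Eq. (1): ascend the loss, then project back onto
    the l_inf ball of radius eps around the clean input x."""
    alpha = alpha or eps / 4
    x_adv = x.clone()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.clamp(x_adv, x - eps, x + eps)  # projection onto the eps-ball
    return x_adv.detach()
```

In practice one would also clip the result to the valid pixel range; that detail is omitted here for brevity.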
2.2 Robust and Non-Robust Features
Despite a vast amount of research, the reasons behind the existence of adversarial examples are
not perfectly clear. A line of work has argued that a central reason is the presence of robust and
non-robust features in the data that standard models learn to rely upon (Tsipras et al., 2019, Ilyas
et al., 2019). In particular it is conjectured that reliance on useful but non-robust features during
training is responsible for the brittleness of neural nets. Here, we slightly adapt the feature definitions
of (Ilyas et al., 2019)¹, and extend them to multi-class problems (see Appendix A).
Let $\mathcal{D}$ be the data generating distribution with $x \in \mathcal{X}$ and $y \in \{\pm 1\}$. We define a feature as a function $\phi: \mathcal{X} \to \mathbb{R}$ and distinguish how they perform as classifiers. Fix $\rho, \gamma \ge 0$:

1. $\rho$-Useful feature: A feature $\phi$ is called $\rho$-useful if
$$\mathbb{E}_{x,y\sim\mathcal{D}}\,\mathbb{1}\{\mathrm{sign}[\phi(x)] = y\} = \rho \qquad (3)$$

2. $\gamma$-Robust feature: A feature $\phi$ is called $\gamma$-robust if it remains useful under any perturbation inside a bounded “ball” $\mathcal{B}$, that is if
$$\mathbb{E}_{x,y\sim\mathcal{D}} \inf_{\delta \in \mathcal{B}} \mathbb{1}\{\mathrm{sign}[\phi(x+\delta)] = y\} = \gamma \qquad (4)$$
In general, a feature adds predictive value if it gives an advantage above guessing the most likely label, i.e. $\rho > \max_{y' \in \{\pm 1\}} \mathbb{E}_{x,y\sim\mathcal{D}}[\mathbb{1}\{y' = y\}]$, and we will speak of “useful” features in this case, omitting the $\rho$. We will call such a feature useful, non-robust if it is useful, but $\gamma$-robust only for $\gamma = 0$ or very close to 0, depending on context.
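Definitions (3) and (4) can be estimated empirically on samples. Below is a minimal sketch (our illustration, not the authors' code; the feature function `phi`, the budget `eps`, and the use of a first-order attack to approximate the infimum are all assumptions) that measures the usefulness of a scalar feature on a labeled batch, together with a heuristic upper bound on its robustness under $\ell_\infty$ perturbations.

```python
# Illustrative sketch: Monte-Carlo estimates of rho-usefulness (Eq. 3) and a
# heuristic upper bound on gamma-robustness (Eq. 4) for a scalar feature phi.
import torch

def usefulness(phi, x, y):
    """Fraction of samples where sign(phi(x)) agrees with the +/-1 label y."""
    with torch.no_grad():
        return (torch.sign(phi(x)) == y).float().mean().item()

def robustness_upper_bound(phi, x, y, eps=8 / 255, steps=10):
    """Accuracy of sign(phi) under a sign-gradient attack constrained to the
    l_inf ball of radius eps; the infimum in Eq. (4) can only be lower, so
    this estimate upper-bounds gamma."""
    delta = torch.zeros_like(x)
    for _ in range(steps):
        delta = delta.detach().requires_grad_(True)
        margin = (y * phi(x + delta)).sum()  # positive when classified correctly
        grad = torch.autograd.grad(margin, delta)[0]
        delta = (delta - (eps / steps) * grad.sign()).clamp(-eps, eps)
    with torch.no_grad():
        return (torch.sign(phi(x + delta)) == y).float().mean().item()
```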
The vast majority of works imagines features as being induced by the activations of neurons in the
net, most commonly those of the penultimate layer (representation-layer features), but the previous
formal definitions are in no way restricted to activations, and we will show how to exploit them using
the eigenspectrum of the NTK. In particular, in Sec. 4, we demonstrate that the above framework
agrees perfectly with features induced by the eigenspectrum of the NTK of a network, providing a
natural way to decompose the predictions of the NTK into such feature functions; specifically, we can identify robust, useful, and, indeed, useful non-robust features.
2.3 Neural Tangent Kernel
Let $f: \mathbb{R}^d \to \mathbb{R}$ be a (scalar) neural network with a linear final layer parameterized by a set of weights $w \in \mathbb{R}^p$ and $\{\mathcal{X}, \mathcal{Y}\}$ be a dataset of size $n$, with $\mathcal{X} \in \mathbb{R}^{n \times d}$ and $\mathcal{Y} \in \{\pm 1\}^n$. Linearized training methods study the first order approximation
$$f(x; w_{t+1}) = f(x; w_t) + \nabla_w f(x; w_t)^\top (w_{t+1} - w_t). \qquad (5)$$
The network gradient $\nabla_w f(x; w_t)$ induces a kernel function $\Theta_t: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, usually referred to as the Neural Tangent Kernel (NTK) of the model,
$$\Theta_t(x, x') = \nabla_w f(x; w_t)^\top \nabla_w f(x'; w_t). \qquad (6)$$
This kernel describes the dynamics with infinitesimal learning rate (gradient flow). In general, the tangent space spanned by the $\nabla_w f(x; w_t)$ twists substantially during training, and learning with the Gram matrix of Eq. (6) (empirical NTK) corresponds to training along an intermediate tangent plane. Remarkably, however, in the infinite width limit with appropriate initialization and low learning rate, it has been shown that $f$ becomes a linear function of the parameters (Jacot et al., 2018, Liu et al., 2020), and the NTK remains constant ($\Theta_t = \Theta_0 =: \Theta$). Then, for learning with $\ell_2$ loss the training dynamics of infinitely wide networks admits a closed form solution corresponding to kernel regression (Jacot et al., 2018, Lee et al., 2019, Arora et al., 2019b)
$$f_t(x) = \Theta(x, \mathcal{X})^\top\, \Theta^{-1}(\mathcal{X}, \mathcal{X}) \left(I - e^{-\lambda \Theta(\mathcal{X},\mathcal{X}) t}\right) \mathcal{Y}, \qquad (7)$$
where $x \in \mathbb{R}^d$ is any input (training or testing), $t$ denotes the time evolution of gradient descent, $\lambda$ is the (small) learning rate and, slightly abusing notation, $\Theta(\mathcal{X}, \mathcal{X}) \in \mathbb{R}^{n \times n}$ denotes the matrix containing the pairwise training values of the NTK, $\Theta(\mathcal{X},\mathcal{X})_{ij} = \Theta(x_i, x_j)$, and similarly for $\Theta(x, \mathcal{X}) \in \mathbb{R}^n$. To be precise, Eq. (7) gives the mean output of the network using a weight-independent kernel with variance depending on the initialization².
¹ We distinguish useful and robust features based on their accuracy as classifiers, not in terms of correlation with the labels as in Ilyas et al. (2019), allowing a natural extension to the multi-class setting. For robustness, we consider any accuracy bounded away from zero as robust, quantifying that an adversary cannot drive accuracy to zero entirely.
² For that reason, in the experiments, we often compare this with the centered prediction of the actual neural network, $f - f_0$, as is commonly done in similar studies (Chizat et al., 2019).
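As a concrete illustration of Eqs. (6)-(7), the following sketch (again not the paper's released code; the scalar-output PyTorch model `f`, the explicit per-sample parameter gradients, and the eigendecomposition used to apply Eq. (7) are all assumptions, practical only for small $n$ and $p$) computes an empirical NTK Gram matrix and the corresponding kernel-regression prediction at training time $t$.

```python
# Illustrative sketch: empirical NTK (Eq. 6) and the mean linearized prediction
# of Eq. (7) for a scalar-output model `f`, a small training set (X, Y) with
# float labels in {-1, +1}, and learning rate lr.
import torch

def param_grad(f, x):
    """Flattened gradient of the scalar output f(x) with respect to all weights."""
    out = f(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, list(f.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def empirical_ntk(f, X1, X2):
    """Gram matrix Theta(X1, X2)_ij = <grad_w f(x_i), grad_w f(x'_j)>."""
    G1 = torch.stack([param_grad(f, x) for x in X1])   # (n1, p)
    G2 = torch.stack([param_grad(f, x) for x in X2])   # (n2, p)
    return G1 @ G2.T

def ntk_prediction(f, x, X, Y, t, lr):
    """Theta(x,X)^T Theta^{-1} (I - exp(-lr * Theta * t)) Y, evaluated spectrally;
    assumes the Gram matrix is positive definite."""
    K = empirical_ntk(f, X, X)                          # Theta(X, X)
    k = empirical_ntk(f, x.unsqueeze(0), X).squeeze(0)  # Theta(x, X)
    evals, evecs = torch.linalg.eigh(K)
    weights = (1.0 - torch.exp(-lr * evals * t)) / evals
    return ((k @ evecs) * weights) @ (evecs.T @ Y)
```

Splitting this spectral sum into one term per eigenvector of $\Theta(\mathcal{X},\mathcal{X})$ is what yields the feature functions discussed in Sec. 4.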