
Figure 1: Left: Schematic of a generic DNN, with the addition of a single NIN connected via NIWs. Center: Example evolution displaying a decay behavior for both the NIWs and the losses within the decay phase of the system discussed below, at a fixed noise strength, $\sigma$. Two cases are shown: a NIN re-initialized at every epoch (blue points) and a fixed-value NIN (light blue points). The similar behavior of the system in both cases hints at a potent relation between the NIW evolution and generalization. The blue, green and violet stars indicate the NIW, test loss and training loss decay time scales, respectively. The three solid curves are fits to exponential decays, while the data are shown as points. Right: The different decay times as a function of the noise injection magnitude. The four shaded regions indicate the four phases of the system discussed in Sec. 3. The results are shown for a 3-hidden-layer ReLU MLP with CE loss, trained on FMNIST to 100% training accuracy.
Noise Injection Weights (NIWs). The network is subsequently trained to perform a classification/regression task using
vanilla SGD.
We study such systems both numerically and analytically, providing a detailed analytic understanding for a simple
linear network. Our main results, partly summarized in Fig. 1, are as follows:
(i) The system exhibits 4 NIN-related phases, depending mostly on the strength of the injected noise.
(ii) In two of the phases, the NIWs evolve to small values, implying that a well-trained network is able to recognize that the noise contains no useful information within it.
(iii) For those two phases, the NIW dynamics is dictated by the local curvature of the training loss function.²
Item (ii) may be expected if the NIN is re-randomized at each training epoch, yet we find essentially the same behavior even when the NIN values are generated only once and fixed before training, putting them on equal footing with actual data inputs, as shown in Fig. 1 (center, right). It appears that while the system might in principle be able to memorize the specific noise samples, the optimization dynamics still prefer to suppress them. This implies a relation between the NIN reduction mechanism and the network's ability to generalize, to be explored further in future work.
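The distinction between the two protocols is simply when the NIN value is drawn, as in the following minimal sketch (an illustration under assumed names and values, not the authors' pipeline); `sigma` plays the role of the noise strength $\sigma$:

```python
import torch

sigma = 0.5                               # injected-noise strength (illustrative value)
fixed_eps = torch.randn(()) * sigma       # fixed-value NIN: drawn once, before training

def nin_value(resample: bool) -> torch.Tensor:
    """NIN value fed to the network during the current epoch."""
    if resample:
        return torch.randn(()) * sigma    # re-initialized NIN: fresh draw every epoch
    return fixed_eps                      # fixed NIN: the same value at every epoch
```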
2 Noise Injection Weights Evolution
Consider a DNN with parameters $\theta = \{ W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell+1}},\, b^{(\ell)} \in \mathbb{R}^{d_{\ell+1}} \mid \ell = 0, \ldots, N_L - 1 \}$, corresponding to the weights and biases, and $N_L$ layers, defined by its single-sample loss function $\mathcal{L} : \mathbb{R}^{d_{\rm in}} \to \mathbb{R}$ and optimized under SGD to perform a supervised learning task. Here, at each SGD iteration, a mini-batch $\mathcal{B}$ consists of a set of labeled examples, $\{ (x_i, y_i) \}_{i=1}^{|\mathcal{B}|} \subseteq \mathbb{R}^{d_{\rm in}} \times \mathbb{R}^{d_{\rm label}}$.
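As a concrete reference point, here is a minimal sketch, not the authors' code, of the baseline setup used in Fig. 1: a 3-hidden-layer ReLU MLP with CE loss optimized by vanilla mini-batch SGD. The layer widths, learning rate, and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_in, d_hidden, d_label = 28 * 28, 128, 10     # FMNIST-sized input/label dims; hidden width assumed

# A 3-hidden-layer ReLU MLP: the N_L layers carrying the weights W^(l) and biases b^(l).
mlp = nn.Sequential(
    nn.Linear(d_in, d_hidden), nn.ReLU(),
    nn.Linear(d_hidden, d_hidden), nn.ReLU(),
    nn.Linear(d_hidden, d_hidden), nn.ReLU(),
    nn.Linear(d_hidden, d_label),
)
loss_fn = nn.CrossEntropyLoss()                          # CE loss L
optimizer = torch.optim.SGD(mlp.parameters(), lr=1e-2)   # vanilla SGD, no momentum

def sgd_step(x_batch: torch.Tensor, y_batch: torch.Tensor) -> float:
    """One SGD iteration on a mini-batch B = {(x_i, y_i)}_{i=1}^{|B|}."""
    optimizer.zero_grad()
    loss = loss_fn(mlp(x_batch), y_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```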
We study the simple case of connecting a given NIN to a specific layer, denoted as $\ell_{\rm NI}$, via a NIW vector $W_{\rm NI} \in \mathbb{R}^{1 \times d_{\ell_{\rm NI}+1}}$ (see Fig. 1, left). In this setup, the injected noise is taken as a random scalar variable, $\epsilon$, sampled repeatedly at each SGD training epoch from a chosen distribution. The NIWs' evolution is best studied via their effect on the preactivations, defined as $z^{(\ell)} = W^{(\ell)} \cdot x^{(\ell)} + b^{(\ell)}$. When a NIN is added, the preactivations at layer $\ell_{\rm NI}$ are subsequently shifted to $z^{(\ell_{\rm NI})} \to z^{(\ell_{\rm NI})} + \epsilon\, W_{\rm NI}$.
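For concreteness, the following is a minimal sketch of such a noise-injected layer, an assumption about one possible implementation rather than the paper's code: the affine preactivations at layer $\ell_{\rm NI}$ are shifted by $\epsilon\,W_{\rm NI}$, with $W_{\rm NI}$ trained alongside all other parameters and $\epsilon$ re-sampled once per epoch (or held fixed). Class and attribute names are illustrative.

```python
import torch
import torch.nn as nn

class NoiseInjectedLinear(nn.Module):
    """Linear layer whose preactivations are shifted by eps * W_NI (illustrative sketch)."""

    def __init__(self, d_in: int, d_out: int, noise_std: float = 1.0):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)              # W^(l_NI), b^(l_NI)
        self.W_NI = nn.Parameter(torch.randn(1, d_out))   # NIW vector, updated by SGD
        self.noise_std = noise_std                        # noise strength sigma
        self.resample_noise()                             # initial NIN value

    def resample_noise(self) -> None:
        # Call once per epoch for the re-initialized NIN; never call again for the fixed NIN.
        self.eps = torch.randn(()) * self.noise_std

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.linear(x)                                # z^(l_NI) = W^(l_NI) x + b^(l_NI)
        return z + self.eps * self.W_NI                   # z^(l_NI) -> z^(l_NI) + eps * W_NI
```

Substituting this layer for the nn.Linear at position $\ell_{\rm NI}$ in the MLP sketched above, and calling resample_noise() at the start of each epoch (or only once before training), reproduces the two protocols compared in Fig. 1. The shift also suggests a heuristic reading of result (iii): for zero-mean noise with variance $\sigma^2$, expanding the noise-averaged single-sample loss to second order in $\epsilon$ gives (a back-of-the-envelope sketch, not the paper's full analysis)

$$\mathbb{E}_{\epsilon}\Big[\mathcal{L}\big(z^{(\ell_{\rm NI})} + \epsilon\,W_{\rm NI}\big)\Big] \approx \mathcal{L}\big(z^{(\ell_{\rm NI})}\big) + \frac{\sigma^{2}}{2}\,W_{\rm NI}\,H\,W_{\rm NI}^{\top}, \qquad H \equiv \nabla^{2}_{z^{(\ell_{\rm NI})}}\mathcal{L},$$

so the averaged gradient with respect to $W_{\rm NI}$ is approximately $\sigma^{2}\,W_{\rm NI} H$; when the local curvature is positive, this term drives the NIWs toward zero at rates set by that curvature.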
² Recently, the relationship between the inherent noise present in SGD optimization and generalization has been of interest; see [24, 25] and references therein. Our work may be considered as a modeling of said SGD noise.