NOISE INJECTION AS A PROBE OF DEEP LEARNING DYNAMICS
Noam Levi
Raymond and Beverly Sackler School of Physics and Astronomy
Tel-Aviv University
Tel-Aviv 69978, Israel
noam@mail.tau.ac.il
Itay M. Bloch
Berkeley Center for Theoretical Physics
University of California, Berkeley, CA 94720
itay.bloch.m@gmail.com
Marat Freytsis
NHETC, Department of Physics and Astronomy
Rutgers University
Piscataway, NJ 08854, USA
marat.freytsis@rutgers.edu
Tomer Volansky
Raymond and Beverly Sackler School of Physics and Astronomy
Tel-Aviv University
Tel-Aviv 69978, Israel
tomerv@post.tau.ac.il
ABSTRACT
We propose a new method to probe the learning mechanism of Deep Neural Networks (DNNs) by
perturbing the system using Noise Injection Nodes (NINs). These nodes inject uncorrelated noise into
existing feed-forward network architectures via additional optimizable weights, without changing the
optimization algorithm. We find that the system displays distinct phases during training, dictated by
the scale of injected noise. We first derive expressions for the dynamics of the network and utilize a
simple linear model as a test case. We find that in some cases, the evolution of the noise nodes is
similar to that of the unperturbed loss, thus indicating the possibility of using NINs to learn more
about the full system in the future.
1 Introduction
Deep learning has proven exceedingly successful, leading to dramatic improvements in multiple domains. Nevertheless,
our current theoretical understanding of deep learning methods has remained unsatisfactory. Specifically, the training of
DNNs is a highly opaque procedure, with few metrics, beyond curvature evolution [1-7], available to describe how a
network evolves as it trains.
An interesting attempt at parameterizing the interplay between training dynamics and generalization was explored in
the seminal work of Ref. [8], which demonstrated that when input data was corrupted by adding random noise, the
generalization error deteriorated in correlation with its strength. Noise injection has gained further traction in recent
years, both as a means of effective regularization [9-18], as well as a route towards understanding DNN dynamics
and generalization. For instance, label noise has been shown to affect the implicit bias of Stochastic Gradient Descent
(SGD) [19-23], as sparse solutions appear to be preferred over those which reduce the Euclidean norm, in certain cases.
In this work, we take another step along this direction, by allowing the network to actively regulate the effects of the
injected noise during training. Concretely, we define Noise Injection Nodes (NINs), whose output is a random variable,
chosen sample-wise from a given distribution. These NINs are connected to existing feed-forward DNNs via trainable
Noise Injection Weights (NIWs). The network is subsequently trained to perform a classification/regression task using
vanilla SGD.

*Both authors contributed equally to this work.
Figure 1: Left: Schematic of a generic DNN, with the addition of a single NIN connected via NIWs. Center: Example evolution displaying a decay behavior for both the NIWs and the losses within the decay phase of the system discussed below and with a fixed noise strength, σ. Two cases are shown: a NIN re-initialized at every epoch, and a fixed-value NIN. The similar behavior of the system in both cases (blue and light blue points, respectively) hints at a potent relation between the NIW evolution and generalization. The blue, green and violet stars indicate the NIW, test loss and training loss decay time-scales. The three solid curves are fits to exponential decays, while the data is represented with points. Right: The different decay times as a function of the noise injection magnitude. The four shaded regions indicate the four phases of the system discussed in Sec. 3. The results are shown for a 3-hidden-layer ReLU MLP with CE loss, trained on FMNIST to 100% training accuracy.
We study such systems both numerically and analytically, providing a detailed analytic understanding for a simple
linear network. Our main results, partly summarized in Fig. 1, are as follows:
(i) The system exhibits 4 NIN-related phases, depending mostly on the strength of the injected noise.
(ii) In two of the phases, the NIWs evolve to small values, implying that a well-trained network is able to recognize that the noise contains no useful information within it.
(iii) For those two phases, the NIW dynamics is dictated by the local curvature of the training loss function.²
Item (ii) may be expected if the NIN is re-randomized at each training epoch, yet we find essentially the same
behavior repeated even when the NIN values are generated only once and fixed before training, putting them on equal
footing with actual data inputs, as shown in Fig. 1(center,right). It appears that while the system might in principle be
able to memorize the specific noise samples, optimization dynamics still prefer to suppress them. This implies a relation
between the NIN reduction mechanism and the network’s ability to generalize, to be explored further in future works.
2 Noise Injection Weights Evolution
Consider a DNN with parameters $\theta = \{W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell+1}},\, b^{(\ell)} \in \mathbb{R}^{d_{\ell+1}} \mid \ell = 0, \dots, N_L - 1\}$, corresponding to the
weights and biases, and $N_L$ layers, defined by its single-sample loss function $\mathcal{L}: \mathbb{R}^{d_{\rm in}} \to \mathbb{R}$ and optimized under SGD
to perform a supervised learning task. Here, at each SGD iteration, a mini-batch $\mathcal{B}$ consists of a set of labeled examples,
$\{(x_i, y_i)\}_{i=1}^{|\mathcal{B}|} \subset \mathbb{R}^{d_{\rm in}} \times \mathbb{R}^{d_{\rm label}}$.
We study the simple case of connecting a given NIN to a specific layer, denoted as $\ell_{\rm NI}$, via a NIW vector
$W_{\rm NI} \in \mathbb{R}^{1 \times d_{\ell_{\rm NI}+1}}$ (see Fig. 1, left). In this setup, the injected noise is taken as a random scalar variable, $\epsilon$, sampled repeatedly
at each SGD training epoch from a chosen distribution. The NIWs' evolution is best studied via their effect on
preactivations, defined as $z^{(\ell)} = W^{(\ell)} \cdot x^{(\ell)} + b^{(\ell)}$. When a NIN is added, the preactivations at layer $\ell_{\rm NI}$ are
subsequently shifted to $z^{(\ell_{\rm NI})} \to z^{(\ell_{\rm NI})} + \epsilon\, W_{\rm NI}$.
²Recently, the relationship between the inherent noise present in SGD optimization and generalization has been of interest; see [24, 25] and references therein. Our work may be considered as a modeling of said SGD noise.
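To make the setup above concrete, the following is a minimal PyTorch-style sketch (not the authors' implementation) of a feed-forward network with a single NIN connected to one hidden layer via a trainable NIW vector; the class name `NoisyMLP`, the layer sizes, and the choice of injection layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NoisyMLP(nn.Module):
    """MLP with a single Noise Injection Node (NIN).

    A scalar noise sample eps is drawn per example and coupled to the
    preactivations of one chosen layer through a trainable Noise Injection
    Weight (NIW) vector, shifting z -> z + eps * W_NI.
    """

    def __init__(self, d_in=784, d_hidden=128, d_out=10, sigma=1.0):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_hidden)
        self.fc3 = nn.Linear(d_hidden, d_out)
        self.sigma = sigma  # injected-noise strength
        # Trainable NIW vector: one weight per unit of the injected layer.
        self.w_ni = nn.Parameter(0.01 * torch.randn(d_hidden))

    def forward(self, x):
        z1 = self.fc1(x)
        # One scalar noise sample per example, injected at the first layer.
        eps = self.sigma * torch.randn(x.shape[0], 1, device=x.device)
        z1 = z1 + eps * self.w_ni
        h1 = torch.relu(z1)
        h2 = torch.relu(self.fc2(h1))
        return self.fc3(h2)
```

Training such a model with vanilla SGD and a standard classification loss leaves the optimization algorithm unchanged; only the NIW parameters and the noise source are added.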
For a single NIN connected at layer $\ell_{\rm NI}$, the batch-averaged loss function can be written as a series expansion in the
noise translation parameter³

$$\mathcal{L}(\theta, W_{\rm NI}) = \frac{1}{|\mathcal{B}|} \sum_{\{x,\epsilon,y\} \in \mathcal{B}} \mathcal{L}\big(\tilde{\theta};\, z^{(\ell_{\rm NI})} + \epsilon\, W_{\rm NI},\, y\big) \tag{1}$$

$$= \mathcal{L}(\theta) + \frac{1}{|\mathcal{B}|} \sum_{\{x,\epsilon,y\} \in \mathcal{B}} \sum_{k=1}^{\infty} \frac{\big(\epsilon\, W_{\rm NI}^{T} \cdot \nabla_{z^{(\ell_{\rm NI})}}\big)^{k}}{k!}\, \mathcal{L}(\theta; x, \epsilon, y),$$
where $\tilde{\theta} = \theta \setminus \{W^{(\ell_{\rm NI})}\}$ and $\mathcal{L}(\theta)$ is the loss function in the absence of a NIN. Focusing on a distribution with zero
mean for the NIN (e.g., $\epsilon \sim \mathcal{N}(0, \sigma^{2})$) and performing the batch averaging on each term, we arrive at the update rule
for the NIWs from the noisy loss expansion⁴

$$W_{\rm NI}^{(t+1)} = W_{\rm NI}^{(t)} - \frac{\eta\, \sigma\, \Phi}{\sqrt{|\mathcal{B}|}} \sqrt{\Big\langle \big(g^{(t)}_{\ell_{\rm NI}}\big)^{2} \Big\rangle} - \frac{\eta\, \sigma^{2}}{2} \Big\langle H^{(t)}_{\ell_{\rm NI}} \Big\rangle W_{\rm NI}^{(t)} + \dots \tag{2}$$
Here, batch averaging is denoted by $\langle \cdots \rangle$, $\Phi$ is a random variable with zero mean and unit variance, and $\sigma^{2}$ is the
variance of the injected noise. We denote the network-dependent local gradient and Hessian at the NIN layer as

$$g_{\ell_{\rm NI}} = \nabla_{z^{(\ell_{\rm NI})}} \mathcal{L}(\theta; x, y), \qquad H_{\ell_{\rm NI}} = \nabla_{z^{(\ell_{\rm NI})}} \nabla^{T}_{z^{(\ell_{\rm NI})}} \mathcal{L}(\theta; x, y). \tag{3}$$
A more complete derivation, along with a proof of the $1/\sqrt{|\mathcal{B}|}$ scaling of the odd terms in the expansion, is given in App. C.
Terminating the expansion in Eq. (2) at second order need not be valid for large $\sigma$. We thus proceed by studying a linear
test case for which the second-order expansion is exact. The persistence of analogous network behavior, and in particular
its phases, for a more realistic setup is confirmed empirically.
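As a quick numerical illustration of the $1/\sqrt{|\mathcal{B}|}$ scaling invoked above, the sketch below checks, under the assumption of i.i.d. Gaussian noise, that the batch average of the zero-mean injected noise (which multiplies the odd terms of the expansion) fluctuates with standard deviation $\sigma/\sqrt{|\mathcal{B}|}$. It is a toy Monte Carlo check, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0          # injected-noise standard deviation
n_batches = 10_000   # number of Monte Carlo batches per batch size

# The spread of the batch-averaged noise shrinks as sigma / sqrt(|B|).
for batch_size in (16, 64, 256, 1024):
    eps = rng.normal(0.0, sigma, size=(n_batches, batch_size))
    empirical = eps.mean(axis=1).std()
    predicted = sigma / np.sqrt(batch_size)
    print(f"|B|={batch_size:5d}  empirical={empirical:.4f}  predicted={predicted:.4f}")
```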
3 Linear Toy Model
Consider a two-layer DNN with linear activations, layer widths $d_{0,1} = 1$, and no biases ($b = 0$), tasked with
univariate linear regression⁵, and with a single NIN connected to the first layer ($\ell_{\rm NI} = 0$). The data consists of a
set of training samples $\{(x_i, y_i) \in \mathbb{R} \times \mathbb{R}\}_{i=1}^{m}$, and we sample $x_i$ and the noise $\epsilon_i$ from the normal distributions,
$x_i, \epsilon_i \sim \mathcal{N}(0, \sigma^{2}_{x,\epsilon})$. The corresponding data labels are given by a linear transformation of the inputs, $y_i = M \cdot x_i$, with
a fixed $M \in \mathbb{R}$. This regression problem is solved by minimizing the empirical loss, taken as the Mean Squared Error
(MSE),

$$\mathcal{L}_{\rm MSE} = \frac{1}{2|\mathcal{B}|} \sum_{i \in \mathcal{B}} \Big( w^{(1)} \big( w^{(0)} \cdot x_i + w_{\rm NI}\, \epsilon_i \big) - y_i \Big)^{2}, \tag{4}$$

with optimal solution $w^{(1)}_{*} w^{(0)}_{*} = M$, $w_{{\rm NI},*} = 0$.
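For intuition, this toy problem can also be simulated directly by running vanilla SGD on the loss of Eq. (4). The sketch below is illustrative only: the learning rate, batch size, noise scale and initial values are arbitrary choices, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
M, sigma_x, sigma = 2.0, 1.0, 0.5        # target slope, data and noise scales
eta, batch, steps = 0.05, 32, 2000       # illustrative hyperparameters

w0, w1, w_ni = 0.5, 0.5, 0.5             # w^(0), w^(1), w_NI initial values
for _ in range(steps):
    x = rng.normal(0.0, sigma_x, batch)
    eps = rng.normal(0.0, sigma, batch)
    y = M * x
    r = w1 * (w0 * x + w_ni * eps) - y    # per-sample residual of Eq. (4)
    # Batch gradients of L_MSE = (1 / 2|B|) * sum_i r_i^2
    g_w0 = np.mean(r * w1 * x)
    g_w1 = np.mean(r * (w0 * x + w_ni * eps))
    g_ni = np.mean(r * w1 * eps)
    w0, w1, w_ni = w0 - eta * g_w0, w1 - eta * g_w1, w_ni - eta * g_ni

print(f"w1*w0 = {w1 * w0:.3f} (target M = {M}),  w_NI = {w_ni:.3e}")
```

With a modest noise scale such a run typically approaches the optimal solution quoted above, with the product $w^{(1)} w^{(0)}$ nearing $M$ and the NIW decaying towards zero.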
The evolution of the system can be studied by focusing on the coupled SGD equations for the hidden-layer weight and
the NIW, parameterized as

$$w^{(1)}_{t+1} = A_t\, \sigma + w^{(1)}_{t}\big(1 - B_t\, \sigma^{2}\big) - C_t, \qquad w_{{\rm NI},\,t+1} = \tilde{A}_t\, \sigma + w_{{\rm NI},\,t}\big(1 - \tilde{B}_t\, \sigma^{2}\big). \tag{5}$$

Here, the various terms are given explicitly by⁶

$$A_t = \frac{\eta\, \Phi_t\, \sigma_x}{\sqrt{|\mathcal{B}|}}\, w_{{\rm NI},t}\big(2 w^{(1)}_{t} w^{(0)}_{t} - M\big), \qquad \tilde{A}_t = \frac{\eta\, \Phi_t\, \sigma_x}{\sqrt{|\mathcal{B}|}}\, w^{(1)}_{t}\big(w^{(1)}_{t} w^{(0)}_{t} - M\big),$$

$$B_t = \eta\, w^{2}_{{\rm NI},t}, \qquad \tilde{B}_t = \eta\, \big(w^{(1)}_{t}\big)^{2}, \qquad C_t = \eta\, \big(w^{(1)}_{t} w^{(0)}_{t} - M\big)\, w^{(0)}_{t}\, \sigma^{2}_{x}. \tag{6}$$
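The same decay can be seen by iterating the averaged recursion of Eqs. (5)-(6) directly, without sampling any data. The sketch below is a rough illustration under the simplifying assumption that $w^{(0)}$ is held fixed, with $\Phi_t$ redrawn at every step; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
M, sigma_x, sigma = 2.0, 1.0, 0.5
eta, batch, steps = 0.05, 32, 2000

w0 = 0.5                                  # w^(0) held fixed here for simplicity
w1, w_ni = 0.5, 0.5
for _ in range(steps):
    phi = rng.normal()                    # zero-mean, unit-variance Phi_t
    pref = eta * phi * sigma_x / np.sqrt(batch)
    A = pref * w_ni * (2 * w1 * w0 - M)       # Eq. (6) coefficients
    A_tilde = pref * w1 * (w1 * w0 - M)
    B = eta * w_ni ** 2
    B_tilde = eta * w1 ** 2
    C = eta * (w1 * w0 - M) * w0 * sigma_x ** 2
    # Eq. (5) updates for the hidden-layer weight and the NIW
    w1, w_ni = (A * sigma + w1 * (1 - B * sigma ** 2) - C,
                A_tilde * sigma + w_ni * (1 - B_tilde * sigma ** 2))

print(f"w1*w0 = {w1 * w0:.3f},  w_NI = {w_ni:.3e}")
```

Here the product $w^{(1)} w^{(0)}$ again settles near $M$ while the NIW is driven towards zero, mirroring the decay phase described above.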
³In practice, it is often the case that one uses piece-wise analytic activation functions such as ReLU, and so if the noise causes the crossing of a non-analytic point, the above formal expansion is invalid. This subtlety does not change any of our conclusions, and empirically we recognize the same phases when using ReLU activations.
⁴Additional $\sigma/\sqrt{|\mathcal{B}|}$ corrections coming from the variances of the even terms emerge from batch-averaging, and are assumed to be negligible throughout this work.
⁵We use this toy model as a proxy for a diagonal linear network [26]. A more general treatment including width and depth effects will be presented in future work.
⁶We match the local gradient $\sqrt{\langle g_{0}^{2} \rangle} = \sigma_x\, w^{(1)}_{t}\big(w^{(1)}_{t} w^{(0)}_{t} - M\big)$ and Hessian $\langle H^{(t)}_{0} \rangle = 2\big(w^{(1)}_{t}\big)^{2}$ to Eq. (2).