
Figure 1: Left: Schematic of a generic DNN, with the addition of a single NIN connected via NIWs. Center: Example evolution displaying a decay behavior for both the NIWs and the losses within the decay phase of the system discussed below, at a fixed noise strength, $\sigma$. Two cases are shown: a NIN re-initialized at every epoch (blue points) and a fixed-value NIN (light blue points). The similar behavior of the system in both cases hints at a potent relation between the NIW evolution and generalization. The blue, green and violet stars indicate the NIW, test loss and training loss decay time scales, respectively. The three solid curves are fits to exponential decays, while the data are shown as points. Right: The different decay times as a function of the noise injection magnitude. The four shaded regions indicate the four phases of the system discussed in Sec. 3. The results are shown for a 3-hidden-layer ReLU MLP with CE loss, trained on FMNIST to 100% training accuracy.
Noise Injection Weights (NIWs). The network is subsequently trained to perform a classification/regression task using
vanilla SGD.
We study such systems both numerically and analytically, providing a detailed analytic understanding for a simple
linear network. Our main results, partly summarized in Fig. 1, are as follows:
(i) The system exhibits 4 NIN-related phases, depending mostly on the strength of the injected noise.
(ii) In two of the phases, the NIWs evolve to small values, implying that a well-trained network is able to recognize that the noise contains no useful information within it.
(iii) For those two phases, the NIW dynamics is dictated by the local curvature of the training loss function.²
Item (ii) may be expected if the NIN is re-randomized at each training epoch, yet we find essentially the same behavior even when the NIN values are generated only once and fixed before training, putting them on equal footing with actual data inputs, as shown in Fig. 1 (center, right). It appears that while the system might in principle be able to memorize the specific noise samples, the optimization dynamics still prefer to suppress them. This implies a relation between the NIN reduction mechanism and the network's ability to generalize, to be explored further in future work.
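The distinction between the two protocols is simply when the NIN value is drawn, as in the following minimal sketch (an illustration under assumed names and values, not the authors' pipeline); `sigma` plays the role of the noise strength $\sigma$:

```python
import torch

sigma = 0.5                               # injected-noise strength (illustrative value)
fixed_eps = torch.randn(()) * sigma       # fixed-value NIN: drawn once, before training

def nin_value(resample: bool) -> torch.Tensor:
    """NIN value fed to the network during the current epoch."""
    if resample:
        return torch.randn(()) * sigma    # re-initialized NIN: fresh draw every epoch
    return fixed_eps                      # fixed NIN: the same value at every epoch
```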
2 Noise Injection Weights Evolution
Consider a DNN with parameters $\theta = \{ W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell+1}},\, b^{(\ell)} \in \mathbb{R}^{d_{\ell+1}} \mid \ell = 0, \ldots, N_L - 1 \}$, corresponding to the weights and biases, and $N_L$ layers, defined by its single-sample loss function $\mathcal{L} : \mathbb{R}^{d_{\rm in}} \to \mathbb{R}$ and optimized under SGD to perform a supervised learning task. Here, at each SGD iteration, a mini-batch $\mathcal{B}$ consists of a set of labeled examples, $\{ (x_i, y_i) \}_{i=1}^{|\mathcal{B}|} \subseteq \mathbb{R}^{d_{\rm in}} \times \mathbb{R}^{d_{\rm label}}$.
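As a concrete reference point, here is a minimal sketch, not the authors' code, of the baseline setup used in Fig. 1: a 3-hidden-layer ReLU MLP with CE loss optimized by vanilla mini-batch SGD. The layer widths, learning rate, and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_in, d_hidden, d_label = 28 * 28, 128, 10     # FMNIST-sized input/label dims; hidden width assumed

# A 3-hidden-layer ReLU MLP: the N_L layers carrying the weights W^(l) and biases b^(l).
mlp = nn.Sequential(
    nn.Linear(d_in, d_hidden), nn.ReLU(),
    nn.Linear(d_hidden, d_hidden), nn.ReLU(),
    nn.Linear(d_hidden, d_hidden), nn.ReLU(),
    nn.Linear(d_hidden, d_label),
)
loss_fn = nn.CrossEntropyLoss()                          # CE loss L
optimizer = torch.optim.SGD(mlp.parameters(), lr=1e-2)   # vanilla SGD, no momentum

def sgd_step(x_batch: torch.Tensor, y_batch: torch.Tensor) -> float:
    """One SGD iteration on a mini-batch B = {(x_i, y_i)}_{i=1}^{|B|}."""
    optimizer.zero_grad()
    loss = loss_fn(mlp(x_batch), y_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```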
We study the simple case of connecting a given NIN to a specific layer, denoted as $\ell_{\rm NI}$, via a NIW vector $W_{\rm NI} \in \mathbb{R}^{1 \times d_{\ell_{\rm NI}+1}}$ (see Fig. 1, left). In this setup, the injected noise is taken as a random scalar variable, $\epsilon$, sampled repeatedly at each SGD training epoch from a chosen distribution. The NIWs' evolution is best studied via their effect on the preactivations, defined as $z^{(\ell)} = W^{(\ell)} \cdot x^{(\ell)} + b^{(\ell)}$. When a NIN is added, the preactivations at layer $\ell_{\rm NI}$ are subsequently shifted to $z^{(\ell_{\rm NI})} \to z^{(\ell_{\rm NI})} + \epsilon\, W_{\rm NI}$.
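For concreteness, the following is a minimal sketch of such a noise-injected layer, an assumption about one possible implementation rather than the paper's code: the affine preactivations at layer $\ell_{\rm NI}$ are shifted by $\epsilon\,W_{\rm NI}$, with $W_{\rm NI}$ trained alongside all other parameters and $\epsilon$ re-sampled once per epoch (or held fixed). Class and attribute names are illustrative.

```python
import torch
import torch.nn as nn

class NoiseInjectedLinear(nn.Module):
    """Linear layer whose preactivations are shifted by eps * W_NI (illustrative sketch)."""

    def __init__(self, d_in: int, d_out: int, noise_std: float = 1.0):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)              # W^(l_NI), b^(l_NI)
        self.W_NI = nn.Parameter(torch.randn(1, d_out))   # NIW vector, updated by SGD
        self.noise_std = noise_std                        # noise strength sigma
        self.resample_noise()                             # initial NIN value

    def resample_noise(self) -> None:
        # Call once per epoch for the re-initialized NIN; never call again for the fixed NIN.
        self.eps = torch.randn(()) * self.noise_std

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.linear(x)                                # z^(l_NI) = W^(l_NI) x + b^(l_NI)
        return z + self.eps * self.W_NI                   # z^(l_NI) -> z^(l_NI) + eps * W_NI
```

Substituting this layer for the nn.Linear at position $\ell_{\rm NI}$ in the MLP sketched above, and calling resample_noise() at the start of each epoch (or only once before training), reproduces the two protocols compared in Fig. 1. The shift also suggests a heuristic reading of result (iii): for zero-mean noise with variance $\sigma^2$, expanding the noise-averaged single-sample loss to second order in $\epsilon$ gives (a back-of-the-envelope sketch, not the paper's full analysis)

$$\mathbb{E}_{\epsilon}\Big[\mathcal{L}\big(z^{(\ell_{\rm NI})} + \epsilon\,W_{\rm NI}\big)\Big] \approx \mathcal{L}\big(z^{(\ell_{\rm NI})}\big) + \frac{\sigma^{2}}{2}\,W_{\rm NI}\,H\,W_{\rm NI}^{\top}, \qquad H \equiv \nabla^{2}_{z^{(\ell_{\rm NI})}}\mathcal{L},$$

so the averaged gradient with respect to $W_{\rm NI}$ is approximately $\sigma^{2}\,W_{\rm NI} H$; when the local curvature is positive, this term drives the NIWs toward zero at rates set by that curvature.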
² Recently, the relationship between the inherent noise present in SGD optimization and generalization has been of interest; see [24, 25] and references therein. Our work may be considered as a modeling of said SGD noise.