Random orthogonal additive filters: a solution to the
vanishing/exploding gradient of deep neural
networks
Andrea Ceni
Abstract—Since the recognition in the early nineties of the vanishing/exploding (V/E) gradient issue plaguing the training of neural networks (NNs), significant efforts have been exerted to overcome this obstacle. However, a clear solution to the V/E issue has remained elusive so far. In this manuscript a new NN architecture is proposed, designed to mathematically prevent the V/E issue from occurring. The pursuit of approximate dynamical isometry, i.e. parameter configurations where the singular values of the input-output Jacobian are tightly distributed around 1, leads to the derivation of a NN architecture that shares common traits with the popular Residual Network model. Instead of skipping connections between layers, the idea is to filter the previous activations orthogonally and add them to the nonlinear activations of the next layer, realising a convex combination between them. Remarkably, the impossibility for the gradient updates to either vanish or explode is demonstrated with analytical bounds that hold even in the infinite-depth case. The effectiveness of this method is empirically proved by training, via backpropagation, an extremely deep multilayer perceptron of 50k layers, and an Elman NN to learn long-term dependencies lying 10k time steps in the past in the input. Compared with other architectures specifically devised to deal with the V/E problem, e.g. LSTMs for recurrent NNs, the proposed model is far simpler yet more effective. Surprisingly, a single-layer vanilla RNN can be enhanced to reach state-of-the-art performance while converging very fast; for instance, on the psMNIST task it is possible to reach a test accuracy of over 94% in the first epoch, and over 98% after just 10 epochs.
Index Terms—Vanishing gradient, exploding gradient, deep
learning, recurrent neural networks, machine learning.
I. INTRODUCTION
ARTIFICIAL deep neural networks (NNs) are computational models that have gathered massive interest in the last decade [1]. Usually NNs are composed of many simple units (neurons) organised in multiple layers which interact nonlinearly with each other via trainable parameters, also called weights. Roughly speaking, training a NN to solve a task means adapting its parameters in order to fit the unknown function mapping raw inputs to output labels, i.e. the data points from which the model must generalise.
Although several algorithms exist for training NN models,
the backpropagation algorithm [2], [3] with stochastic gradient
descent (or variants thereof [4]) has been established as a
standard for supervised learning.

The author is with the Department of Computer Science, University of Pisa, Largo Bruno Pontecorvo, 3 - 56127, IT (e-mail: andrea.ceni@di.unipi.it), and the Department of Mathematics, University of Exeter, Exeter EX4 4QF, UK (e-mail: ac860@exeter.ac.uk).
Manuscript received September ??, 2022; revised September ??, 2022.

Backpropagation is a
clever procedure to assess how sensitive the output error is
w.r.t. a given parameter of the model; thus, each parameter
is modified proportionally to this measure of sensitivity. The
computation involves the sequential product of many matrices whose resulting norm, similarly to the product of many real numbers, can shrink towards zero or grow to infinity exponentially with depth. This problem is known in the literature as the vanishing/exploding (V/E) gradient issue. The V/E issue was first recognised in the context of recurrent neural networks (RNNs) in the nineties [5], [6]; later, with the advent of increasingly deep architectures, it became evident in feedforward neural networks (FNNs) as well [7].
A. Related works
Extensive work has been done to combat the V/E gradient issue. Some of the main strategies are reported in this section. It has been shown that certain weight initialisations, tailored to the particular choice of activation function and known as Xavier init [7] and Kaiming init [8], generate favourable statistics in the neuronal activations that alleviate the issue.
The V/E dynamics have been recognised to be stronger with "squashing" activation functions, e.g. tanh, and gentler with hard nonlinearities such as ReLU. Moreover, the true-zero neuronal activations induced by ReLUs promote sparsity in the network, which in turn has been recognised to be beneficial for backpropagating errors, as long as some neural paths remain active [9]. Nevertheless, the issue, although mitigated, is still present with ReLUs.
Moreover, a mean-field theory analysis of linear FNNs revealed the effectiveness of orthogonal weight initialisation schemes [10]; the same holds for nonlinear FNNs, provided that the NN operates close to the so-called edge of chaos [11], [12]. In fact, the V/E gradient issue can be solved by promoting the NN to operate in an approximate dynamical isometry regime, i.e. a configuration where the distribution of the singular values of the NN's input-output Jacobian remains circumscribed around 1 without spreading too far from it. This condition ensures that signals can flow back and forth along the layered architecture. However, containing the learning process within an approximate dynamical isometry regime appears to be hard in general FNNs, due to the nonlinearities between layers. In [13] a theoretical analysis of orthogonal initialisations led to the conclusion that ReLU networks are incapable of dynamical isometry, while sigmoid and tanh networks can achieve it. Orthogonal initialisation schemes [14], [15],
although not directly solving the V/E problem, have been
successfully implemented to train extremely deep NNs.
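As an illustration of such schemes, a random orthogonal matrix can be obtained, for instance, from the QR decomposition of a Gaussian matrix. The sketch below is a common NumPy recipe given here for concreteness; it is not the specific initialisation prescribed by the cited works.

```python
import numpy as np

def random_orthogonal(n, rng=None):
    """Sample an n x n orthogonal matrix via QR decomposition of a Gaussian matrix.

    Correcting the sign of each column with the diagonal of R yields a draw
    distributed uniformly (Haar measure) over the orthogonal group.
    """
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))  # flip columns whose R diagonal entry is negative

# Orthogonality check: Q^T Q should be (numerically) the identity.
Q = random_orthogonal(256, rng=0)
assert np.allclose(Q.T @ Q, np.eye(256), atol=1e-10)
```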
Additionally, gradient clipping procedures [16] have been extensively used to keep the gradients at a controlled magnitude, although they do not address the vanishing problem. Besides, normalising the input passed from one layer to the next, in order to avoid the saturating tails of the activation function, has been shown to accelerate training while improving performance [17]. However, these are passive strategies that do not aim to prevent the V/E gradient issue from occurring.
Another breakthrough is represented by Highway Networks [18], a feedforward architecture inspired by the popular LSTM recurrent NN model and equipped with trainable gating mechanisms to regulate information flow. Finally, Residual Networks (ResNets) [19], a gateless version of Highway Networks, have revolutionised the field of computer vision. These architectures provide shortcut paths directly connecting non-consecutive layers, de facto realising ensembles of relatively shallow networks [20]. Despite the important empirical achievements of ResNets, theoretical results establishing that these architectures effectively solve the V/E gradient problem are still missing. In fact, although skip connections have been observed to drive ResNets to the edge of chaos [21], ResNets still rely heavily on batch normalisation to obtain good results with very deep architectures [22].
In practice, it is not rare in the literature to encounter several of the aforementioned techniques combined together, somewhat alchemically, to train deep learning models.
RNNs are especially known for being difficult to train to learn long-term dependencies. Probably the most popular solution proposed to combat the V/E issue of RNNs is represented by gated recurrent architectures, e.g. LSTMs [23] and GRUs [24]. However, despite the increased computational complexity of these models, they still suffer from instabilities when trained on very long sequences. Similarly to FNNs, orthogonal initialisation of the recurrent weights has been investigated and proved to be effective for RNNs [25], [26]. Interestingly, it has been shown that initialising the recurrent weights to the identity matrix in a vanilla RNN with ReLU activations may reach performance comparable to LSTMs [27].
In the seminal work of [28] the authors go beyond mere orthogonal initialisation and propose to parametrise the recurrent weights in the space of unitary matrices, thereby ensuring the orthogonality condition throughout the learning process. Then, in [29], this model is further improved by allowing the learning dynamics to access the whole manifold of unitary matrices. However, these methods come at the expense of considerable computational effort. After these achievements, a burst of works has followed in the footsteps of constraining the recurrent weights to be orthogonal, or approximately so, [30]–[34], trying to contain the run time. Although it is not understood how, and to what extent, these restrictions imposed on the recurrent weights impact the learning dynamics, it is widely recognised in the literature that imposing unitary constraints on the recurrent matrix affects its expressive power [35], [36].
As an alternative to orthogonal constraints, IndRNN [37] is a recently proposed RNN model where the hidden-to-hidden weights are constrained to be diagonal matrices, nonlinearly stacked in two or more layers to form a single recurrent block. The V/E gradient is thus prevented by bounding the coefficients of those diagonal matrices within a suitable range, which depends on the length of the temporal dependencies the IndRNN is meant to learn.
Another option is to prevent the V/E issue at a fundamental level by constraining the recurrent matrix to be constant, i.e. untrained. An emblematic example is given by Reservoir Computing machines [38]. Typically, the recurrent weights are randomly generated and left untouched, tuned only at the hyperparameter level, e.g. by rescaling the spectral radius [39], and only an output readout is trained. A further case is studied in [40], where the hidden-to-hidden weights are fixed to a specific orthogonal matrix (namely a shifting permutation matrix), while all the other weights are trained.
Other important developments come from considering continuous-time RNNs as ODEs. For instance, AntisymmetricRNN [41] uses skew-symmetric recurrent matrices to ensure stable dynamics. Based on control-theoretic results that ensure global exponential stability, LipschitzRNN [42] extends the pool of recurrent matrices over which learning occurs beyond the sole space of skew-symmetric matrices. Another promising architecture is coRNN [43], derived from a second-order ODE modelling a network of oscillators.
Other interesting works include models provided with memory cells, e.g. LMUs [44] and NRUs [45], and recurrent models based on skip connections [46]–[48].
B. Original contributions
In this paper, a new idea is proposed that makes it possible to solve the V/E gradient issue of deep learning models trained via stochastic gradient descent (SGD) methods. Empirically, it is demonstrated how to:
• enhance the vanilla FNN model to be trainable for extremely deep architectures (e.g. 50k layers), without relying on auxiliary techniques such as gradient clipping, batch normalisation, or skip connections (see Section II-D);
• enhance the vanilla RNN model to learn very long-term dependencies, outperforming the great majority of recurrent NN models (among which LSTMs) on benchmark tasks such as the Copying Memory, the Adding Problem, and the Permuted Sequential MNIST (see Section IV).
Based on the same intuition, two NN models are proposed: (i) roaFNN, which stands for random orthogonal additive Feedforward NN, and (ii) roaRNN, which stands for random orthogonal additive Recurrent NN. These models are slight variations of, respectively, a multilayer perceptron and an Elman network. The simplicity of these models permits a formal mathematical analysis of their gradient update dynamics, which holds even in the case of infinite depth. In particular, both lower and upper bounds are provided for:
• the maximum singular value of the input-output Jacobian of the roaFNN model (Theorem 2.1), demonstrating the impossibility for it to either explode or converge to zero;
• the entire set of singular values of the input-output Jacobian of the roaRNN model (Theorem 3.1), proving that a roaRNN evolves in an approximate dynamical isometry regime by design.
Remarkably, these theoretical results are achieved without any
constraint on the parameters of the model.
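To fix ideas, a minimal sketch of the kind of layer update described in the abstract is given below: the previous activations are filtered by a fixed random orthogonal matrix and convexly combined with the nonlinear activations of the next layer. The names Q and alpha, and the exact placement of the combination, are illustrative assumptions of this sketch, not the formal definition of the roaFNN/roaRNN models given later in the paper.

```python
import numpy as np

def roa_like_layer(x, W, b, Q, alpha, phi=np.tanh):
    """Illustrative constant-width layer update in the spirit described above.

    The previous activations x are filtered by a fixed orthogonal matrix Q and
    convexly combined (coefficient alpha, assumed in (0, 1)) with the nonlinear
    activations phi(W x + b) of the next layer. An orthogonal Q is square, so
    all layers share the same width in this sketch.
    """
    return alpha * phi(W @ x + b) + (1.0 - alpha) * (Q @ x)
```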
II. APPROACHING DYNAMICAL ISOMETRY IN NONLINEAR
DEEP FNNS
A. Background
A multilayer perceptron (MLP) is described via the following equations:

$y_l = W_l x_l + b_l,$  (1)

$x_{l+1} = \phi(y_l), \qquad l = 0, 1, \ldots, L-1.$  (2)

When dealing with vectors, it is assumed they are column vectors unless otherwise specified. That said, $x_l \in \mathbb{R}^{N_l}$ is the vector containing the neuronal activations of the $l$-th layer, and $x_L \in \mathbb{R}^{N_L}$ is the output vector corresponding to the input vector $x_0 \in \mathbb{R}^{N_0}$. The matrices $W_l \in \mathbb{R}^{N_{l+1} \times N_l}$ represent the connection weights from the neurons of the $l$-th layer to those of the $(l+1)$-th layer, and $b_l \in \mathbb{R}^{N_{l+1}}$ are biases. The function $\phi$ is called the activation function, and it is applied component-wise to the vector $y_l$. This function must be nonlinear, and it is usually required to be monotonically nondecreasing, continuously differentiable (almost everywhere) and, often, bounded. Common choices of $\phi$ are the sigmoid, the hyperbolic tangent, and ReLU.
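As a concrete illustration of the forward pass (1)-(2), a minimal NumPy sketch follows; the choice of tanh as $\phi$ and the list-based bookkeeping of activations are assumptions made here for readability, not part of the model definition.

```python
import numpy as np

def mlp_forward(x0, weights, biases, phi=np.tanh):
    """Forward pass of the MLP (1)-(2): y_l = W_l x_l + b_l, x_{l+1} = phi(y_l).

    Returns all activations x_0, ..., x_L and pre-activations y_0, ..., y_{L-1}.
    """
    xs, ys = [x0], []
    for W, b in zip(weights, biases):
        y = W @ xs[-1] + b   # eq. (1)
        ys.append(y)
        xs.append(phi(y))    # eq. (2)
    return xs, ys
```

The intermediate quantities returned here are exactly what the backpropagation formulas derived below operate on.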
Given a training dataset $\mathcal{D} = \{(X(i), Y(i))\}$ of input-output sample pairs, training of model (1)-(2) is achieved by tuning its parameters, i.e. all the weight matrices $W_l$ and bias vectors $b_l$, such that the "distance" between the computed output $x_L(i)$, corresponding to the input $X(i)$, and the desired output $Y(i)$ is as low as possible over the whole training set. This translates into minimising the average of the distances over the input-output samples,

$\mathcal{L}(\mathcal{D}) = \frac{1}{\#\text{samples}} \sum_i \mathcal{L}(Y(i), x_L(i));$

such a distance is commonly called the loss function, and in this manuscript it will be denoted as $\mathcal{L}$. Common choices are the Mean Squared Error for regression tasks and the Cross Entropy for classification tasks. The former, for a given input-output sample, reads

$\mathcal{L}(Y(i), x_L(i)) = \| Y(i) - x_L(i) \|^2,$

where $\|a\|$ denotes the Euclidean norm of the vector $a$.
Note that the computed output $x_L(i)$ is a function of all the parameters of the model and of the input $X(i)$, with a nested dependence on the parameters which grows with the depth of the model. Therefore, since the data points $(X(i), Y(i))$ are fixed, we can optimise the loss as a function of the parameters of the model, $\mathcal{L} = \mathcal{L}(W_0, b_0, \ldots, W_{L-1}, b_{L-1})$. To a first-order approximation, each parameter contributes to the loss in proportion to the derivative of the loss function w.r.t. it. Thus, in order to minimise the loss, the core idea is to modify each parameter by moving in the direction opposite to the gradient.
The derivative of the mean squared error loss w.r.t. the output is the following vector

$E(i) = -2\,(Y(i) - x_L(i)).$

This will be called the error vector of the specific input-output sample $(X(i), Y(i))$. However, in the following it will often be denoted simply as $E$, i.e. dropping the relation with the specific input-output sample.
The derivative of the loss function w.r.t. the weights connecting the $s$-th layer to the $(s+1)$-th layer can be compactly written as the following outer product¹

$\frac{\partial \mathcal{L}}{\partial W_s}(i) = \big[ \mathrm{diag}(d_s)\, B_s E(i) \big] \otimes x_s, \quad s = L-1, \ldots, 0,$  (3)

where $\mathrm{diag}(d_s)$ is the diagonal matrix whose diagonal is the following vector

$d_s = \phi'(y_s),$  (4)

and the linear operator $B_s$ backpropagates the error vector from the output layer back to the $(s+1)$-th layer, as follows

$B_s E(i) = \Big( \frac{\partial x_L}{\partial x_{s+1}} \Big)^{T} E(i),$  (5)

where the superscript $T$ denotes the transpose. The matrix $\frac{\partial x_L}{\partial x_{s+1}}$ can be expressed as the product of $L - s - 1$ matrices,

$\frac{\partial x_L}{\partial x_{s+1}} = \frac{\partial x_L}{\partial x_{L-1}} \frac{\partial x_{L-1}}{\partial x_{L-2}} \cdots \frac{\partial x_{s+2}}{\partial x_{s+1}},$  (6)

$\frac{\partial x_{l+1}}{\partial x_l} = \mathrm{diag}(d_l)\, W_l.$  (7)
Derivatives of the loss function w.r.t. the biases $b_s$ can be obtained simply by omitting the outer product with the vector $x_s$ in (3). In gradient-based learning methods the weights $W_s$ are updated by subtracting the matrix (3) rescaled by the learning rate $\mu$, that is

$W_s(i) = W_s(i-1) + \Delta W_s(i);$  (8)

$\Delta W_s(i) = -\mu\, \frac{\partial \mathcal{L}}{\partial W_s}(i).$  (9)
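The sketch below (continuing the NumPy notation of the earlier forward-pass sketch, with $\phi = \tanh$ so that $\phi'(y) = 1 - \tanh(y)^2$) assembles the per-sample gradients (3)-(7) and the update (8)-(9); it is a didactic illustration of the formulas, not the implementation used in the experiments.

```python
import numpy as np

def mlp_gradients(xs, ys, weights, Y, dphi=lambda y: 1.0 - np.tanh(y)**2):
    """Gradients (3)-(7) of the MSE loss for one sample, given the forward pass.

    xs: activations x_0..x_L, ys: pre-activations y_0..y_{L-1} (from mlp_forward).
    Returns one (grad_W, grad_b) pair per layer, ordered from layer 0 to L-1.
    """
    L = len(weights)
    grads = [None] * L
    E = -2.0 * (Y - xs[-1])                        # error vector, dLoss/dx_L
    delta = dphi(ys[-1]) * E                       # diag(d_{L-1}) B_{L-1} E, with B_{L-1} = Id
    for s in range(L - 1, -1, -1):
        grads[s] = (np.outer(delta, xs[s]), delta) # eq. (3) and its bias analogue
        if s > 0:                                  # eqs. (5)-(7): push the error one layer back
            delta = dphi(ys[s - 1]) * (weights[s].T @ delta)
    return grads

def sgd_step(weights, biases, grads, mu):
    """Update rule (8)-(9): subtract the gradients rescaled by the learning rate."""
    for s, (gW, gb) in enumerate(grads):
        weights[s] -= mu * gW
        biases[s] -= mu * gb
```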
In order to accelerate the learning process, training is often accomplished by dividing the dataset into batches of $m$ training samples, $\mathcal{D}_j = \{(X(j), Y(j)), \ldots, (X(j+m-1), Y(j+m-1))\}$. In that case the correction matrix (9) is averaged over a batch as follows

$\Delta W_s(j) = -\frac{\mu}{m} \sum_{i=0}^{m-1} \frac{\partial \mathcal{L}}{\partial W_s}(j+i).$  (10)
Whether through mini-batches (10) or one-by-one samples
(9), the update rule (8) realises an SGD algorithm leading
to a journey in the parameter space that will converge to a
(hopefully low enough) minimum point of the loss function.
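For completeness, the mini-batch averaging (10) can be sketched by accumulating the per-sample gradients produced by the mlp_gradients sketch above and applying a single update per batch; again, this is an illustration under the same assumptions (tanh activation, NumPy helpers defined earlier), not the paper's code.

```python
def sgd_minibatch_step(weights, biases, batch, mu, phi=np.tanh,
                       dphi=lambda y: 1.0 - np.tanh(y)**2):
    """Update rule (10): average the per-sample gradients over a batch of (X, Y) pairs."""
    m = len(batch)
    avg = None
    for X, Y in batch:
        xs, ys = mlp_forward(X, weights, biases, phi)    # forward pass, eqs. (1)-(2)
        grads = mlp_gradients(xs, ys, weights, Y, dphi)  # per-sample gradients, eqs. (3)-(7)
        if avg is None:
            avg = [(gW / m, gb / m) for gW, gb in grads]
        else:
            avg = [(aW + gW / m, ab + gb / m)
                   for (aW, ab), (gW, gb) in zip(avg, grads)]
    sgd_step(weights, biases, avg, mu)                   # apply eqs. (8)-(9) with the batch average
```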
¹The outer product of vectors $a \in \mathbb{R}^N$ and $b \in \mathbb{R}^M$ is the rank-1 $N \times M$ matrix defined as $(a \otimes b)_{ij} = a_i b_j$. Equivalently, considering $a, b$ as matrices of dimensions $(N,1)$ and $(M,1)$ respectively, and using the standard matrix multiplication operation, $a \otimes b = a b^T$.