Random orthogonal additive filters: a solution to the
vanishing/exploding gradient of deep neural
networks
Andrea Ceni
Abstract—Since the recognition in the early nineties of the vanishing/exploding (V/E) gradient issue plaguing the training of neural networks (NNs), significant efforts have been exerted to overcome this obstacle. However, a clear solution to the V/E issue has remained elusive so far. In this manuscript a new NN architecture is proposed, designed to mathematically prevent the V/E issue from occurring. The pursuit of approximate dynamical isometry, i.e. parameter configurations where the singular values of the input-output Jacobian are tightly distributed around 1, leads to the derivation of a NN architecture that shares common traits with the popular Residual Network model. Instead of skipping connections between layers, the idea is to filter the previous activations orthogonally and add them to the nonlinear activations of the next layer, realising a convex combination between them. Remarkably, the impossibility for the gradient updates to either vanish or explode is demonstrated with analytical bounds that hold even in the infinite-depth case. The effectiveness of this method is empirically proved by training, via backpropagation, an extremely deep multilayer perceptron of 50k layers, and an Elman NN to learn long-term dependencies lying 10k time steps in the past in the input. Compared with other architectures specifically devised to deal with the V/E problem, e.g. LSTMs for recurrent NNs, the proposed model is far simpler yet more effective. Surprisingly, a single-layer vanilla RNN can be enhanced to reach state-of-the-art performance while converging very fast; for instance, on the psMNIST task it is possible to reach a test accuracy of over 94% in the first epoch, and over 98% after just 10 epochs.
Index Terms—Vanishing gradient, exploding gradient, deep
learning, recurrent neural networks, machine learning.
I. INTRODUCTION
ARTIFICIAL deep neural networks (NNs) are computational models that have gathered massive interest in the last decade [1]. Usually NNs are composed of many simple units (neurons) organised in multiple layers which interact nonlinearly with each other via trainable parameters, also called weights. Roughly speaking, training a NN to solve a task means adapting its parameters in order to fit the unknown function mapping raw inputs to output labels, i.e. the data points from which the model must generalise.
Although several algorithms exist for training NN models,
the backpropagation algorithm [2], [3] with stochastic gradient
descent (or variants thereof [4]) has been established as a
standard for supervised learning.

The author is with the Department of Computer Science, University of Pisa, Largo Bruno Pontecorvo, 3 - 56127, IT (e-mail: andrea.ceni@di.unipi.it), and the Department of Mathematics, University of Exeter, Exeter EX4 4QF, UK (e-mail: ac860@exeter.ac.uk).
Manuscript received September ??, 2022; revised September ??, 2022.

Backpropagation is a
clever procedure to assess how sensitive the output error is
w.r.t. a given parameter of the model; thus, each parameter
is modified proportionally to this measure of sensitivity. The
computation involves the sequential product of many matrices whose resulting norm, similarly to the product of many real numbers, can shrink towards zero or grow to infinity exponentially with depth. This problem is known in the literature as the vanishing/exploding (V/E) gradient issue. The V/E issue was first recognised in the context of recurrent neural networks (RNNs) in the nineties [5], [6]; later, with the advent of increasingly deep architectures, it became evident in feedforward neural networks (FNNs) as well [7].
A. Related works
Extensive work has been done to combat the V/E gradient issue. Some of the main strategies are reported in this section. It has been shown that certain weight initialisations, tailored to the particular choice of activation function and known as Xavier init [7] and Kaiming init [8], generate favourable statistics in the neuronal activations that alleviate the issue.
The V/E dynamics have been recognised to be stronger with "squashing" activation functions, e.g. tanh, and gentler with hard nonlinearities such as ReLU. Moreover, the true-zero neuronal activations induced by ReLUs promote sparsity in the network, which in turn has been recognised to be beneficial for backpropagating errors, as long as some neural paths remain active [9]. Nevertheless, the issue, although mitigated, is still present with ReLUs.
Moreover, a mean-field theory analysis of linear FNNs revealed the effectiveness of orthogonal weight initialisation schemes [10]; the same holds for nonlinear FNNs, provided that the NN operates close to the so-called edge of chaos [11], [12]. In fact, the V/E gradient issue can be solved by promoting the NN to operate in an approximate dynamical isometry regime, i.e. a configuration where the distribution of the singular values of the NN's input-output Jacobian remains circumscribed around 1 without spreading too far from it. This condition ensures that signals can flow back and forth along the layered architecture. However, containing the learning process within an approximate dynamical isometry regime appears to be hard in general FNNs, due to the nonlinearities between layers. In [13] a theoretical analysis of orthogonal initialisations led to the conclusion that ReLU networks are incapable of dynamical isometry, while sigmoid and tanh networks can achieve it. Orthogonal initialisation schemes [14], [15],
although not directly solving the V/E problem, have been
successfully implemented to train extremely deep NNs.
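As an illustration of such schemes, a random orthogonal matrix can be obtained, for instance, from the QR decomposition of a Gaussian matrix. The sketch below is a common NumPy recipe given here for concreteness; it is not the specific initialisation prescribed by the cited works.

```python
import numpy as np

def random_orthogonal(n, rng=None):
    """Sample an n x n orthogonal matrix via QR decomposition of a Gaussian matrix.

    Correcting the sign of each column with the diagonal of R yields a draw
    distributed uniformly (Haar measure) over the orthogonal group.
    """
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))  # flip columns whose R diagonal entry is negative

# Orthogonality check: Q^T Q should be (numerically) the identity.
Q = random_orthogonal(256, rng=0)
assert np.allclose(Q.T @ Q, np.eye(256), atol=1e-10)
```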
Additionally, gradient clipping procedures [16] have been extensively used to keep the gradients at a controlled magnitude, although they do not address the vanishing problem. Besides, normalising the input passed from one layer to the next, in order to avoid the saturating tails of the activation function, has been shown to accelerate training while improving performance [17]. However, these are passive strategies that do not aim to prevent the V/E gradient issue from occurring.
Another breakthrough is represented by Highway Networks [18], a feedforward architecture inspired by the popular LSTM recurrent NN model and equipped with trainable gating mechanisms to regulate information flow. Finally, Residual Networks (ResNets) [19], a gateless version of Highway Networks, have revolutionised the field of computer vision. These architectures provide shortcut paths directly connecting non-consecutive layers, de facto realising ensembles of relatively shallow networks [20]. Despite the important empirical achievements of ResNets, theoretical results establishing that these architectures effectively solve the V/E gradient problem are still missing. In fact, although skip connections have been observed to drive ResNets to the edge of chaos [21], ResNets still rely heavily on batch normalisation to obtain good results with very deep architectures [22].
In practice, it is not rare in the literature to encounter several of the aforementioned techniques combined together, somewhat alchemically, to train deep learning models.
RNNs are especially known for being difficult to train to learn long-term dependencies. Probably the most popular solution proposed to combat the V/E issue of RNNs is represented by gated recurrent architectures, e.g. LSTMs [23] and GRUs [24]. However, despite the increased computational complexity of these models, they still suffer from instabilities when trained on very long sequences. Similarly to FNNs, orthogonal initialisation of the recurrent weights has been investigated and proved to be effective for RNNs [25], [26]. Interestingly, it has been shown that initialising the recurrent weights to the identity matrix in a vanilla RNN with ReLU activations may reach performance comparable to LSTMs [27].
In the seminal work of [28] the authors go beyond mere orthogonal initialisation and propose to parametrise the recurrent weights in the space of unitary matrices, thereby ensuring the orthogonality condition throughout the learning process. Then, in [29], this model is further improved by allowing the learning dynamics to access the whole manifold of unitary matrices. However, these methods come at the expense of considerable computational effort. After these achievements, a burst of works has followed in the footsteps of constraining the recurrent weights to be orthogonal, or approximately so, [30]–[34], trying to contain the run time. Although it is not understood how, and to what extent, these restrictions imposed on the recurrent weights impact the learning dynamics, it is widely recognised in the literature that imposing unitary constraints on the recurrent matrix affects its expressive power [35], [36].
As an alternative to orthogonal constraints, IndRNN [37] is a recently proposed RNN model where the hidden-to-hidden weights are constrained to be diagonal matrices, nonlinearly stacked in two or more layers to form a single recurrent block. The V/E gradient is thus prevented by bounding the coefficients of those diagonal matrices within a suitable range, which depends on the length of the temporal dependencies the IndRNN is meant to learn.
Another option is to prevent the V/E issue at a fundamental level by constraining the recurrent matrix to be constant, i.e. untrained. An emblematic example is given by Reservoir Computing machines [38]. Typically, the recurrent weights are randomly generated and left untouched, tuned only at the hyperparameter level, e.g. by rescaling the spectral radius [39], and only an output readout is trained. A further case is studied in [40], where the hidden-to-hidden weights are fixed to a specific orthogonal matrix (namely a shifting permutation matrix), while all the other weights are trained.
Other important developments come from considering continuous-time RNNs as ODEs. For instance, AntisymmetricRNN [41] uses skew-symmetric recurrent matrices to ensure stable dynamics. Based on control-theoretic results that ensure global exponential stability, LipschitzRNN [42] extends the pool of recurrent matrices over which learning occurs beyond the sole space of skew-symmetric matrices. Another promising architecture is coRNN [43], derived from a second-order ODE modelling a network of oscillators.
Other interesting works include models provided with memory cells, e.g. LMUs [44] and NRUs [45], and recurrent models based on skip connections [46]–[48].
B. Original contributions
In this paper, a new idea is proposed that makes it possible to solve the V/E gradient issue of deep learning models trained via stochastic gradient descent (SGD) methods. Empirically, it is demonstrated how to:
• enhance the vanilla FNN model to be trainable for extremely deep architectures (e.g. 50k layers), without relying on auxiliary techniques such as gradient clipping, batch normalisation, or skip connections (see Section II-D);
• enhance the vanilla RNN model to learn very long-term dependencies, outperforming the great majority of recurrent NN models (among which LSTMs) on benchmark tasks such as the Copying Memory, the Adding Problem, and the Permuted Sequential MNIST (see Section IV).
Based on the same intuition, two NN models are proposed: (i) roaFNN, which stands for random orthogonal additive Feedforward NN, and (ii) roaRNN, which stands for random orthogonal additive Recurrent NN. These models are slight variations of, respectively, a multilayer perceptron and an Elman network. The simplicity of these models permits a formal mathematical analysis of their gradient update dynamics, which holds even in the case of infinite depth. In particular, both lower and upper bounds are provided for:
• the maximum singular value of the input-output Jacobian of the roaFNN model (Theorem 2.1), demonstrating the impossibility for it to either explode or converge to zero;
• the entire set of singular values of the input-output Jacobian of the roaRNN model (Theorem 3.1), proving that a roaRNN evolves in an approximate dynamical isometry regime by design.
Remarkably, these theoretical results are achieved without any
constraint on the parameters of the model.
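To fix ideas, a minimal sketch of the kind of layer update described in the abstract is given below: the previous activations are filtered by a fixed random orthogonal matrix and convexly combined with the nonlinear activations of the next layer. The names Q and alpha, and the exact placement of the combination, are illustrative assumptions of this sketch, not the formal definition of the roaFNN/roaRNN models given later in the paper.

```python
import numpy as np

def roa_like_layer(x, W, b, Q, alpha, phi=np.tanh):
    """Illustrative constant-width layer update in the spirit described above.

    The previous activations x are filtered by a fixed orthogonal matrix Q and
    convexly combined (coefficient alpha, assumed in (0, 1)) with the nonlinear
    activations phi(W x + b) of the next layer. An orthogonal Q is square, so
    all layers share the same width in this sketch.
    """
    return alpha * phi(W @ x + b) + (1.0 - alpha) * (Q @ x)
```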
II. APPROACHING DYNAMICAL ISOMETRY IN NONLINEAR
DEEP FNNS
A. Background
A multilayer perceptron (MLP) is described via the following equations:

$y_l = W_l x_l + b_l,$  (1)

$x_{l+1} = \phi(y_l), \qquad l = 0, 1, \ldots, L-1.$  (2)

When dealing with vectors, it is assumed they are column vectors unless otherwise specified. That said, $x_l \in \mathbb{R}^{N_l}$ is the vector containing the neuronal activations of the $l$-th layer, and $x_L \in \mathbb{R}^{N_L}$ is the output vector corresponding to the input vector $x_0 \in \mathbb{R}^{N_0}$. The matrices $W_l \in \mathbb{R}^{N_{l+1} \times N_l}$ represent the connection weights from the neurons of the $l$-th layer to those of the $(l+1)$-th layer, and $b_l \in \mathbb{R}^{N_{l+1}}$ are biases. The function $\phi$ is called the activation function, and it is applied component-wise to the vector $y_l$. This function must be nonlinear, and it is usually required to be monotonically nondecreasing, continuously differentiable (almost everywhere) and, often, bounded. Common choices of $\phi$ are the sigmoid, the hyperbolic tangent, and ReLU.
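As a concrete illustration of the forward pass (1)-(2), a minimal NumPy sketch follows; the choice of tanh as $\phi$ and the list-based bookkeeping of activations are assumptions made here for readability, not part of the model definition.

```python
import numpy as np

def mlp_forward(x0, weights, biases, phi=np.tanh):
    """Forward pass of the MLP (1)-(2): y_l = W_l x_l + b_l, x_{l+1} = phi(y_l).

    Returns all activations x_0, ..., x_L and pre-activations y_0, ..., y_{L-1}.
    """
    xs, ys = [x0], []
    for W, b in zip(weights, biases):
        y = W @ xs[-1] + b   # eq. (1)
        ys.append(y)
        xs.append(phi(y))    # eq. (2)
    return xs, ys
```

The intermediate quantities returned here are exactly what the backpropagation formulas derived below operate on.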
Given a training dataset $\mathcal{D} = \{(X(i), Y(i))\}$ of input-output sample pairs, training of model (1)-(2) is achieved by tuning its parameters, i.e. all the weight matrices $W_l$ and bias vectors $b_l$, such that the "distance" between the computed output $x_L(i)$, corresponding to the input $X(i)$, and the desired output $Y(i)$ is as low as possible over the whole training set. This translates into minimising the average of the distances over the input-output samples,

$\mathcal{L}(\mathcal{D}) = \frac{1}{\#\text{samples}} \sum_i \mathcal{L}(Y(i), x_L(i));$

such a distance is commonly called the loss function, and in this manuscript it will be denoted as $\mathcal{L}$. Common choices are the Mean Squared Error for regression tasks and the Cross Entropy for classification tasks. The former, for a given input-output sample, reads

$\mathcal{L}(Y(i), x_L(i)) = \| Y(i) - x_L(i) \|^2,$

where $\|a\|$ denotes the Euclidean norm of the vector $a$.
Note that the computed output $x_L(i)$ is a function of all the parameters of the model and of the input $X(i)$, with a nested dependence on the parameters which grows with the depth of the model. Therefore, since the data points $(X(i), Y(i))$ are fixed, we can optimise the loss as a function of the parameters of the model, $\mathcal{L} = \mathcal{L}(W_0, b_0, \ldots, W_{L-1}, b_{L-1})$. To a first-order approximation, each parameter contributes to the loss in proportion to the derivative of the loss function w.r.t. it. Thus, in order to minimise the loss, the core idea is to modify each parameter by moving in the direction opposite to the gradient.
The derivative of the mean squared error loss w.r.t. the output is the following vector

$E(i) = -2\,(Y(i) - x_L(i)).$

This will be called the error vector of the specific input-output sample $(X(i), Y(i))$. However, in the following it will often be denoted simply as $E$, i.e. dropping the relation with the specific input-output sample.
The derivative of the loss function w.r.t. the weights connecting the $s$-th layer to the $(s+1)$-th layer can be compactly written as the following outer product¹

$\frac{\partial \mathcal{L}}{\partial W_s}(i) = \big[ \mathrm{diag}(d_s)\, B_s E(i) \big] \otimes x_s, \quad s = L-1, \ldots, 0,$  (3)

where $\mathrm{diag}(d_s)$ is the diagonal matrix whose diagonal is the following vector

$d_s = \phi'(y_s),$  (4)

and the linear operator $B_s$ backpropagates the error vector from the output layer back to the $(s+1)$-th layer, as follows

$B_s E(i) = \Big( \frac{\partial x_L}{\partial x_{s+1}} \Big)^{T} E(i),$  (5)

where the superscript $T$ denotes the transpose. The matrix $\frac{\partial x_L}{\partial x_{s+1}}$ can be expressed as the product of $L - s - 1$ matrices,

$\frac{\partial x_L}{\partial x_{s+1}} = \frac{\partial x_L}{\partial x_{L-1}} \frac{\partial x_{L-1}}{\partial x_{L-2}} \cdots \frac{\partial x_{s+2}}{\partial x_{s+1}},$  (6)

$\frac{\partial x_{l+1}}{\partial x_l} = \mathrm{diag}(d_l)\, W_l.$  (7)
Derivatives of the loss function w.r.t. the biases $b_s$ can be obtained simply by omitting the outer product with the vector $x_s$ in (3). In gradient-based learning methods the weights $W_s$ are updated by subtracting the matrix (3) rescaled by the learning rate $\mu$, that is

$W_s(i) = W_s(i-1) + \Delta W_s(i);$  (8)

$\Delta W_s(i) = -\mu\, \frac{\partial \mathcal{L}}{\partial W_s}(i).$  (9)
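The sketch below (continuing the NumPy notation of the earlier forward-pass sketch, with $\phi = \tanh$ so that $\phi'(y) = 1 - \tanh(y)^2$) assembles the per-sample gradients (3)-(7) and the update (8)-(9); it is a didactic illustration of the formulas, not the implementation used in the experiments.

```python
import numpy as np

def mlp_gradients(xs, ys, weights, Y, dphi=lambda y: 1.0 - np.tanh(y)**2):
    """Gradients (3)-(7) of the MSE loss for one sample, given the forward pass.

    xs: activations x_0..x_L, ys: pre-activations y_0..y_{L-1} (from mlp_forward).
    Returns one (grad_W, grad_b) pair per layer, ordered from layer 0 to L-1.
    """
    L = len(weights)
    grads = [None] * L
    E = -2.0 * (Y - xs[-1])                        # error vector, dLoss/dx_L
    delta = dphi(ys[-1]) * E                       # diag(d_{L-1}) B_{L-1} E, with B_{L-1} = Id
    for s in range(L - 1, -1, -1):
        grads[s] = (np.outer(delta, xs[s]), delta) # eq. (3) and its bias analogue
        if s > 0:                                  # eqs. (5)-(7): push the error one layer back
            delta = dphi(ys[s - 1]) * (weights[s].T @ delta)
    return grads

def sgd_step(weights, biases, grads, mu):
    """Update rule (8)-(9): subtract the gradients rescaled by the learning rate."""
    for s, (gW, gb) in enumerate(grads):
        weights[s] -= mu * gW
        biases[s] -= mu * gb
```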
In order to accelerate the learning process, training is often accomplished by dividing the dataset into batches of $m$ training samples, $\mathcal{D}_j = \{(X(j), Y(j)), \ldots, (X(j+m-1), Y(j+m-1))\}$. In that case the correction matrix (9) is averaged over a batch as follows

$\Delta W_s(j) = -\frac{\mu}{m} \sum_{i=0}^{m-1} \frac{\partial \mathcal{L}}{\partial W_s}(j+i).$  (10)
Whether through mini-batches (10) or one-by-one samples
(9), the update rule (8) realises an SGD algorithm leading
to a journey in the parameter space that will converge to a
(hopefully low enough) minimum point of the loss function.
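For completeness, the mini-batch averaging (10) can be sketched by accumulating the per-sample gradients produced by the mlp_gradients sketch above and applying a single update per batch; again, this is an illustration under the same assumptions (tanh activation, NumPy helpers defined earlier), not the paper's code.

```python
def sgd_minibatch_step(weights, biases, batch, mu, phi=np.tanh,
                       dphi=lambda y: 1.0 - np.tanh(y)**2):
    """Update rule (10): average the per-sample gradients over a batch of (X, Y) pairs."""
    m = len(batch)
    avg = None
    for X, Y in batch:
        xs, ys = mlp_forward(X, weights, biases, phi)    # forward pass, eqs. (1)-(2)
        grads = mlp_gradients(xs, ys, weights, Y, dphi)  # per-sample gradients, eqs. (3)-(7)
        if avg is None:
            avg = [(gW / m, gb / m) for gW, gb in grads]
        else:
            avg = [(aW + gW / m, ab + gb / m)
                   for (aW, ab), (gW, gb) in zip(avg, grads)]
    sgd_step(weights, biases, avg, mu)                   # apply eqs. (8)-(9) with the batch average
```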
¹The outer product of vectors $a \in \mathbb{R}^N$ and $b \in \mathbb{R}^M$ is the rank-1 $N \times M$ matrix defined as $(a \otimes b)_{ij} = a_i b_j$. Equivalently, considering $a, b$ as matrices of dimensions $(N,1)$ and $(M,1)$ respectively, and using the standard matrix multiplication operation, $a \otimes b = a b^T$.