although not directly solving the V/E problem, have been
successfully implemented to train extremely deep NNs.
Additionally, gradient clipping procedures [16] have been
extensively used to keep the gradients at a controlled magnitude,
although they do not address the vanishing problem. Moreover,
normalising the input that each layer passes to the next,
so as to avoid the saturating tails of the activation functions,
has been shown to accelerate training while improving performance [17].
However, these are passive strategies that do not aim to prevent
the V/E gradient issue from occurring.
Another breakthrough is represented by Highway Networks
[18], a feedforward architecture inspired by the popular
LSTM recurrent NN model and equipped with trainable gating
mechanisms that regulate information flow. Finally, Residual
Networks (ResNets) [19], a gateless version of Highway
Networks, have revolutionised the field of computer vision.
These architectures provide shortcut paths directly connecting
non-consecutive layers, de facto realising ensembles of relatively
shallow networks [20]. Despite the important empirical
achievements of ResNets, theoretical results establishing that
these architectures effectively solve the V/E gradient problem
are still missing. In fact, although skip connections
have been observed to drive ResNets to the edge of chaos
[21], ResNets still rely heavily on batch normalisation to obtain
good results with very deep architectures [22].
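To fix ideas, the following minimal sketch (illustrative only, not the exact formulation of [19]; square weight matrices and ReLU activations are assumptions) shows how an identity shortcut adds the block's input back to its transformed output, so that signals and gradients can bypass the nonlinear branch.

```python
import numpy as np

def residual_block(x, W1, W2):
    # Residual branch: two linear maps with a ReLU in between.
    h = np.maximum(W1 @ x, 0.0)
    # Identity shortcut: the input is added back unchanged,
    # giving the gradient a direct path around the nonlinearity.
    return x + W2 @ h
```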
In practice, it is not rare in the literature to encounter
several of the aforementioned techniques combined, somewhat
alchemically, to train deep learning models.
RNNs are especially known for being difficult to train
to learn long-term dependencies. Probably the
most popular solution proposed to combat the V/E issue of
RNNs is represented by gated recurrent architectures, such as
LSTMs [23] and GRUs [24]. However, despite their increased
computational complexity, these models still suffer
from instabilities when trained on very long sequences.
Similarly to FNNs, orthogonal initialisation of the recurrent
weights has been investigated and proved effective
for RNNs [25], [26]. Interestingly, it has been shown that
initialising the recurrent weights to the identity matrix in a
vanilla RNN with ReLU activations can reach performance
comparable to that of LSTMs [27].
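As a minimal sketch of this identity-initialisation idea (hypothetical variable names, not the code of [27]; the small-variance input weights are an assumption):

```python
import numpy as np

def irnn_init(n_hidden, n_input, rng):
    # Recurrent weights start as the identity, so the untrained state
    # update h_t = ReLU(h_{t-1} + W_in x_t + b) roughly accumulates
    # inputs, which helps preserve long-range information.
    W_rec = np.eye(n_hidden)
    W_in = rng.normal(0.0, 0.001, size=(n_hidden, n_input))
    b = np.zeros(n_hidden)
    return W_rec, W_in, b

def irnn_step(h, x, W_rec, W_in, b):
    # Vanilla RNN step with ReLU activation.
    return np.maximum(W_rec @ h + W_in @ x + b, 0.0)
```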
In the seminal work of [28], the authors go beyond
mere orthogonal initialisation and propose parametrising the
recurrent weights in the space of unitary matrices, thereby
ensuring the orthogonality condition throughout the learning process.
In [29], this model is further improved by allowing
the learning dynamics to access the whole manifold of
unitary matrices. However, these methods come at the expense
of considerable computational effort. Following these achievements,
a burst of works has appeared that constrains the recurrent
weights to be orthogonal, or approximately so, while trying
to contain the run time [30]–[34]. Although it is not
understood how, and to what extent, these restrictions imposed
on the recurrent weights impact the learning dynamics, it is widely
recognised in the literature that imposing unitary constraints
on the recurrent matrix affects its expressive power [35], [36].
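A common way to keep the recurrent matrix exactly orthogonal throughout training is sketched below (general idea only, not the specific parametrisation used in [28]–[34]): an unconstrained matrix A is learned, and the recurrent matrix is obtained as the exponential of its skew-symmetric part, which is always orthogonal.

```python
import numpy as np
from scipy.linalg import expm

def orthogonal_from_unconstrained(A):
    # The skew-symmetric part S satisfies S.T == -S, and expm(S) is
    # orthogonal, so gradient steps on the free parameter A can never
    # push the recurrent matrix off the orthogonal manifold.
    S = A - A.T
    return expm(S)
```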
As an alternative to orthogonal constraints, IndRNN [37] is a
recently proposed RNN model whose hidden-to-hidden weights
are constrained to be diagonal matrices, nonlinearly stacked in two
or more layers to form a single recurrent block. The V/E gradient issue
is thus prevented by bounding the coefficients of those diagonal matrices
within a suitable range, which depends on the length of the temporal
dependencies the IndRNN is meant to learn.
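In formulae, a single IndRNN layer updates each hidden unit independently through an element-wise recurrent coefficient; a minimal sketch following the description above (illustrative names, not the exact formulation of [37]):

```python
import numpy as np

def indrnn_step(h, x, u, W_in, b):
    # u is the vector of diagonal recurrent coefficients: each hidden
    # unit only sees its own past state, scaled by its entry of u.
    # Keeping |u| within a range tied to the sequence length bounds
    # how fast gradients can grow or shrink through time.
    return np.maximum(u * h + W_in @ x + b, 0.0)
```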
Another option is to prevent the V/E issue at a fundamental
level by constraining the recurrent matrix to be constant,
i.e. untrained. An emblematic example is given by Reservoir
Computing machines [38]. Typically, the recurrent weights
are randomly generated and left untouched, tuned only at the
hyperparameter level, e.g. by rescaling the spectral radius
[39], and only an output readout is trained. A further case is
studied in [40], where the hidden-to-hidden weights are fixed
to a specific orthogonal matrix (namely a shifting permutation
matrix), while all the other weights are trained.
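A minimal reservoir-computing sketch (illustrative, not the exact setup of [38]–[40]; the uniform input weights are an assumption): the recurrent weights are drawn once, rescaled to a target spectral radius, and then frozen; only a linear readout on top of the states would be trained.

```python
import numpy as np

def make_reservoir(n_hidden, n_input, spectral_radius, rng):
    W_rec = rng.normal(size=(n_hidden, n_hidden))
    # Rescale so the largest eigenvalue modulus equals the target
    # spectral radius; W_rec is then left untrained.
    W_rec *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W_rec)))
    W_in = rng.uniform(-1.0, 1.0, size=(n_hidden, n_input))
    return W_rec, W_in

def reservoir_step(h, x, W_rec, W_in):
    # State update with fixed (untrained) recurrent and input weights.
    return np.tanh(W_rec @ h + W_in @ x)
```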
Other important developments come from viewing
continuous-time RNNs as ODEs. For instance, AntisymmetricRNN
[41] uses skew-symmetric recurrent matrices to ensure
stable dynamics. Building on control-theoretic results that ensure
global exponential stability, LipschitzRNN [42] extends the
pool of admissible recurrent matrices beyond the sole space
of skew-symmetric ones. Another promising architecture is coRNN [43],
derived from a second-order ODE modelling a network of oscillators.
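As an illustration of the ODE viewpoint (a sketch based on the description of [41] above; the damping parameter gamma and step size eps are assumptions, not necessarily the paper's notation), a forward-Euler discretisation with a skew-symmetric recurrent matrix reads:

```python
import numpy as np

def antisymmetric_rnn_step(h, x, W, V, b, gamma=0.01, eps=0.1):
    # W - W.T is skew-symmetric, so the continuous-time dynamics are
    # marginally stable; the -gamma * I term adds slight damping and
    # eps is the Euler step size of the discretised ODE.
    A = (W - W.T) - gamma * np.eye(W.shape[0])
    return h + eps * np.tanh(A @ h + V @ x + b)
```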
Other interesting works include models equipped with memory
cells, e.g. LMUs [44] and NRUs [45], and recurrent models
based on skip connections [46]–[48].
B. Original contributions
In this paper, a new idea is proposed that makes it possible to
solve the V/E gradient issue of deep learning models trained via
stochastic gradient descent (SGD) methods. Empirically, it
is demonstrated how to:
• enhance the vanilla FNN model so that it is trainable at
extreme depths (e.g. 50k layers), without
relying on auxiliary techniques such as gradient clipping, batch
normalisation, or skip connections (see Section
II-D);
• enhance the vanilla RNN model to learn very long-term
dependencies, outperforming the great majority of recurrent
NN models (among which LSTMs) on benchmark
tasks such as the Copying Memory task, the Adding Problem, and
the Permuted Sequential MNIST (see Section IV).
Based on the same intuition, two NN models are proposed:
(i) roaFNN, which stands for random orthogonal additive
Feedforward NN, and (ii) roaRNN, which stands for random
orthogonal additive Recurrent NN. These models are slight
variations of, respectively, a multilayer perceptron and an
Elman network. The simplicity of these models permits a formal
mathematical analysis of their gradient update dynamics, which
holds even in the case of infinite depth. In particular, both
lower and upper bounds are provided for
• the maximum singular value of the input-output Jacobian
of the roaFNN model (Theorem 2.1), demonstrating that it
can neither explode nor converge to zero;