3 Related works
Convergence theorems for deep linear networks with the square cost are considered in [4, 2]. In [22], it is proved that the tangent kernel of a multi-layer perceptron (MLP) becomes approximately constant over all of parameter space as width goes to infinity, and is positive-definite for certain data distributions, which by Theorem 2.4 implies that all critical points are global minima. Strictly speaking, however, [22] does not prove convergence of gradient descent: the authors consider only gradient flow and leave Lipschitz concerns untouched. The papers [14, 13, 1, 52, 51, 36] prove that overparameterised neural nets of varying architectures can be optimised to global minima close to initialisation by assuming sufficient width in several layers. While [1] does consider the cross-entropy cost, convergence to a global optimum is not proved: it is instead shown that perfect classification accuracy can be achieved close to initialisation during training. Improvements on these works have been made in [33, 32, 7], wherein large width is required of only a single layer.
It is identified in [28] that linearity of the final layer is key in establishing the approximate constancy of the tangent kernel for wide networks that was used in [14, 13]. By making explicit the implicit use of the PŁ condition in previous works [14, 13, 1, 52, 51, 36], [29] proves a convergence theorem even with nonlinear output layers. The theory explicated in [29] is formalised and generalised in [45]. A key weakness of all of the works mentioned thus far (bar the purely formal [45]) is that their hypotheses imply that optimisation trajectories always remain close to initialisation. Without this, there is no obvious way to guarantee the PŁ inequality along the optimisation trajectory, and hence no way to guarantee that one does not converge to a suboptimal critical point. However, such training is not possible with the cross-entropy cost, whose global minima exist only at infinity. There is also evidence to suggest that such training must be avoided for state-of-the-art test performance [8, 15, 26]. In contrast, our theory gives convergence guarantees even for trajectories that travel arbitrarily far from initialisation, and is, to our knowledge, the only work that can make this claim.
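For the reader's convenience, we recall the PŁ inequality in one standard form (the notation here is generic and chosen only for illustration): a loss $L$ with infimum $L^*$ satisfies the PŁ inequality with constant $\mu > 0$ on a set $S$ of parameters if
\begin{equation*}
    \tfrac{1}{2}\,\lVert \nabla L(\theta) \rVert^2 \;\ge\; \mu \bigl( L(\theta) - L^* \bigr) \qquad \text{for all } \theta \in S,
\end{equation*}
so that gradient descent iterates remaining in $S$ cannot stall at a suboptimal critical point.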
Among the tools that make our theory work are skip connections [20] and weight normalisation [41]. The smoothness properties of normalisation schemes have previously been studied [42, 40]; however, these works give only pointwise estimates comparing normalised to non-normalised layers, and do not provide a global analysis of Lipschitz properties as we do. The regularising effect of skip connections on the loss landscape has previously been studied in [35]; however, that study is not tightly linked to optimisation theory. Skip connections have also been shown to enable the interpretation of a neural network as coordinate transformations of data manifolds [19]. Mean field analyses of skip connections have been conducted in [49, 30], but these necessitate large width; our own analysis does not. A similarly general framework to ours is given in [47, 48]; while both encapsulate all presently used architectures, that of [47, 48] is designed for the study of infinite-width tangent kernels, while ours is designed specifically for optimisation theory. Our empirical singular value analysis of skip connections complements existing theoretical work using random matrix theory [18, 37, 39, 34, 16]. These works have not yet considered the shifting effect of skip connections on layer Jacobians that we observe empirically.
Our theory also links nicely to the intuitive notions of gradient propagation [20] and dynamical isometry already present in the literature. By tying Jacobian singular values rigorously to loss regularity in the sense of the Polyak-Łojasiewicz inequality, our theory provides a new link between dynamical isometry and optimisation theory [43, 38, 46]: specifically, dynamical isometry ensures better PŁ conditioning and therefore faster and more reliable convergence to global minima. In linking this productive strand of the literature to optimisation theory, our work may open up new possibilities for convergence proofs in the optimisation theory of deep networks. We leave further exploration of this topic to future work.
Due to this relationship with the notion of dynamical isometry, our work also provides optimisation-theoretic support for the empirical analyses of [24, 6], which study the importance of layerwise dynamical isometry for trainability and for neural architecture search [27, 44, 25]. Recent work on deep kernel shaping shows, via careful tuning of initialisation and activation functions, that while skip connections and normalisation layers may be sufficient for good trainability, they are not necessary [31, 50]. Other recent work has shown benefits to inference performance from removing skip connections from the trained model using a parameter transformation [11], or from removing them from the model altogether and incorporating them only into the optimiser [12].