On skip connections and normalisation layers in deep optimisation

Lachlan E. MacDonald∗
Mathematical Institute for Data Science
Johns Hopkins University
lemacdonald@protonmail.com

Jack Valmadre
Australian Institute for Machine Learning
University of Adelaide

Hemanth Saratchandran
Australian Institute for Machine Learning
University of Adelaide

Simon Lucey
Australian Institute for Machine Learning
University of Adelaide
Abstract
We introduce a general theoretical framework, designed for the study of gradient
optimisation of deep neural networks, that encompasses ubiquitous architecture
choices including batch normalisation, weight normalisation and skip connections.
Our framework determines the curvature and regularity properties of multilayer
loss landscapes in terms of their constituent layers, thereby elucidating the roles
played by normalisation layers and skip connections in globalising these properties.
We then demonstrate the utility of this framework in two respects. First, we give
the only proof of which we are aware that a class of deep neural networks can be
trained using gradient descent to global optima even when such optima only exist
at infinity, as is the case for the cross-entropy cost. Second, we identify a novel
causal mechanism by which skip connections accelerate training, which we verify
predictively with ResNets on MNIST, CIFAR10, CIFAR100 and ImageNet.
1 Introduction
Deep, overparameterised neural networks are efficiently trainable to global optima using simple first
order methods. That this is true is immensely surprising from a theoretical perspective: modern
datasets and deep neural network architectures are so complex and irregular that they are essentially
opaque from the perspective of classical (convex) optimisation theory. A recent surge in inspired
theoretical works [22, 14, 13, 1, 52, 51, 36, 28, 29, 33, 32] has elucidated this phenomenon, showing
linear convergence of gradient descent on certain classes of neural networks, with certain cost
functions, to global optima. The formal principles underlying these works are identical. By taking
width sufficiently large, one guarantees uniform bounds on curvature (via a Lipschitz-gradients-type
property) and regularity (via a Polyak-Łojasiewicz-type inequality) in a neighbourhood of
initialisation. Convergence to a global optimum in that neighbourhood then follows from a well-known
chain of estimates [23].
Despite significant progress, the theory of deep learning optimisation extant in the literature presents
at least three significant shortcomings:
∗Most work done while at the Australian Institute for Machine Learning, University of Adelaide.
37th Conference on Neural Information Processing Systems (NeurIPS 2023). arXiv:2210.05371v4 [cs.LG], 4 Dec 2023.
1. It lacks a formal framework in which to compare common practical architecture choices. Indeed,
none of the aforementioned works consider the impact of ubiquitous (weight/batch) normalisation
layers. Moreover, where common architectural modifications such as skip connections are studied, it
is unclear exactly what impact they have on optimisation. For instance, while in [13] it is shown that
skip connections enable convergence with width polynomial in the number of layers, as compared
with exponential width for chain networks, in [1] polynomial width is shown to be sufficient for
convergence of both chain and residual networks.
2. It lacks theoretical flexibility. The consistent use of uniform curvature and regularity bounds is
insufficiently flexible to enable optimisation guarantees far away from initialisation, where the
local, uniform bounds used in previous theory no longer hold. In particular, proving globally optimal
convergence for deep neural nets with the cross-entropy cost was (until now) an open problem [5].
3. It lacks practical utility. Although it is presently unreasonable to demand quantitatively predictive
bounds on practical performance, existing optimisation theory has been largely unable to inform
architecture design even qualitatively. This is in part due to the first shortcoming, since practical
architectures typically differ substantially from those considered for theoretical purposes.
Our purpose in this article is to take a step in addressing these shortcomings. Specifically:
1. We provide a formal framework, inspired by [45], for the study of multilayer optimisation. Our
framework is sufficiently general to include all commonly used neural network layers, and contains
formal results relating the curvature and regularity properties of multilayer loss landscapes to those
of their constituent layers. As instances, we prove novel results on the global curvature and regularity
properties enabled by normalisation layers and skip connections respectively, in contrast to the local
bounds provided in previous work.
2. Using these novel, global bounds, we identify a class of weight-normalised residual networks
for which, given a linear independence assumption on the data, training by gradient descent provably
converges to a global optimum arbitrarily far away from initialisation. From a regularity perspective,
our analysis is strictly more flexible than the uniform analysis considered in previous works, and in
particular solves the open problem of proving global optimality for the training of deep nets with the
cross-entropy cost.
3. Using our theoretical insight that skip connections aid loss regularity, we conduct a systematic
empirical analysis of the singular value distributions of layer Jacobians for practical layers (a minimal
numerical illustration of this effect appears below). We are thereby able to predict that simple
modifications to the classic ResNet architecture [20] will improve training speed. We verify our
predictions on MNIST, CIFAR10, CIFAR100 and ImageNet.
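To preview the mechanism behind the third contribution: when the Jacobian $J$ of a layer's residual branch has spectral norm below one, the identity contributed by the skip connection shifts every singular value of the block Jacobian $I + J$ towards 1, whereas $J$ alone typically has singular values near zero. The following sketch is our own minimal illustration (the Gaussian Jacobian and its scaling are assumptions for demonstration, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Hypothetical Jacobian of a residual branch, scaled so its spectral norm is below 1.
J = 0.4 * rng.standard_normal((n, n)) / np.sqrt(n)          # spectral norm roughly 0.8

sv_plain = np.linalg.svd(J, compute_uv=False)                # singular values of the plain block
sv_resid = np.linalg.svd(np.eye(n) + J, compute_uv=False)    # singular values with a skip connection

# sigma_min(I + J) >= 1 - ||J|| stays bounded away from zero, while sigma_min(J)
# of a random square matrix is typically close to zero.
print(f"plain block:    sigma_min = {sv_plain[-1]:.3f}, sigma_max = {sv_plain[0]:.3f}")
print(f"residual block: sigma_min = {sv_resid[-1]:.3f}, sigma_max = {sv_resid[0]:.3f}")
```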
2 Background
In this section we give a summary of the principles underlying recent theoretical advances in neural
network optimisation. We discuss related works after this summary for greater clarity.
2.1 Smoothness and the PŁ-inequality
Gradient descent on a possibly non-convex function $\ell : \mathbb{R}^p \to \mathbb{R}_{\geq 0}$ can be guaranteed to converge
to a global optimum by insisting that $\ell$ have Lipschitz gradients and satisfy the Polyak-Łojasiewicz
inequality. We recall these well-known properties here for convenience.

Definition 2.1. Let $\beta > 0$. A continuously differentiable function $\ell : \mathbb{R}^p \to \mathbb{R}$ is said to have
$\beta$-Lipschitz gradients, or is said to be $\beta$-smooth, over a set $S \subseteq \mathbb{R}^p$ if the vector field $\nabla\ell : \mathbb{R}^p \to \mathbb{R}^p$
is $\beta$-Lipschitz. If $S$ is convex, having $\beta$-Lipschitz gradients implies that the inequality
$$\ell(\theta_2) - \ell(\theta_1) \leq \nabla\ell(\theta_1)^T(\theta_2 - \theta_1) + \frac{\beta}{2}\,\|\theta_2 - \theta_1\|^2 \tag{1}$$
holds for all $\theta_1, \theta_2 \in S$.
The $\beta$-smoothness of $\ell$ over $S$ can be thought of as a uniform bound on the curvature of $\ell$ over $S$:
if $\ell$ is twice continuously differentiable, then it has Lipschitz gradients over any compact set $K$, with
(possibly loose) Lipschitz constant given by
$$\beta := \sup_{\theta \in K} \|D^2\ell(\theta)\|, \tag{2}$$
where $D^2\ell$ is the Hessian and $\|\cdot\|$ denotes any matrix norm.
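Formula (2) is easy to estimate numerically. The sketch below is our own toy example (the loss $\ell(a, b) = (ab - 1)^2$ and the box $K = [-2, 2]^2$ are illustrative assumptions): it approximates a Lipschitz constant for $\nabla\ell$ by sampling the spectral norm of the Hessian on a grid covering $K$.

```python
import numpy as np

def hessian(a, b):
    # Hessian of the toy loss l(a, b) = (a*b - 1)^2.
    return np.array([[2 * b * b, 4 * a * b - 2],
                     [4 * a * b - 2, 2 * a * a]])

# Approximate beta = sup_{theta in K} ||D^2 l(theta)|| over the compact box K = [-2, 2]^2
# by taking the maximum spectral norm of the Hessian over a fine grid.
grid = np.linspace(-2.0, 2.0, 201)
beta = max(np.linalg.norm(hessian(a, b), 2) for a in grid for b in grid)
print(f"estimated Lipschitz constant beta over K: {beta:.2f}")
```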
Definition 2.2. Let $\mu > 0$. A differentiable function $\ell : \mathbb{R}^p \to \mathbb{R}_{\geq 0}$ is said to satisfy the $\mu$-Polyak-
Łojasiewicz inequality, or is said to be $\mu$-PŁ, over a set $S \subseteq \mathbb{R}^p$ if
$$\|\nabla\ell(\theta)\|^2 \geq \mu\,\Big(\ell(\theta) - \inf_{\theta' \in S}\ell(\theta')\Big) \tag{3}$$
for all $\theta \in S$.
The PŁ condition on $\ell$ over $S$ is a uniform guarantee of regularity, which implies that all critical
points of $\ell$ over $S$ are $S$-global minima; however, such a function need not be convex. Synthesising
these definitions leads easily to the following result (cf. Theorem 1 of [23]).
Theorem 2.3. Let $\ell : \mathbb{R}^p \to \mathbb{R}_{\geq 0}$ be a continuously differentiable function that is $\beta$-smooth and
$\mu$-PŁ over a convex set $S$. Suppose that $\theta_0 \in S$ and let $\{\theta_t\}_{t=0}^{\infty}$ be the trajectory taken by gradient
descent, with step size $\eta < 2\beta^{-1}$, starting at $\theta_0$. If $\{\theta_t\}_{t=0}^{\infty} \subseteq S$, then $\ell(\theta_t)$ converges to an $S$-global
minimum at a linear rate:
$$\ell(\theta_t) \leq \Big(1 - \mu\eta\Big(1 - \frac{\beta\eta}{2}\Big)\Big)^t\,\ell(\theta_0) \tag{4}$$
for all $t \in \mathbb{N}$.
Essentially, while the Lipschitz constant of the gradients controls whether or not gradient descent
with a given step size can be guaranteed to decrease the loss at each step, the PŁ constant determines
by how much the loss will decrease. These ideas can be applied to the optimisation of deep neural
nets as follows.
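As a concrete sanity check of Theorem 2.3 (our own toy example, not from the paper): the overparameterised least-squares loss $\ell(\theta) = \frac{1}{2}\|X\theta - y\|^2$ with $X \in \mathbb{R}^{N \times p}$ of full row rank is $\beta$-smooth with $\beta = \lambda_{\max}(X^T X)$ and $\mu$-PŁ with $\mu = 2\lambda_{\min}(X X^T)$ and infimum zero, so the loss observed under gradient descent can be compared directly against the bound (4).

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 100                              # overparameterised: p >> N
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

loss = lambda theta: 0.5 * np.sum((X @ theta - y) ** 2)
grad = lambda theta: X.T @ (X @ theta - y)

eigs = np.linalg.eigvalsh(X @ X.T)
beta = eigs.max()                           # = lambda_max(X^T X): Lipschitz constant of the gradient
mu = 2 * eigs.min()                         # PL constant; positive since X has full row rank
eta = 1.0 / beta                            # any step size below 2 / beta is admissible

theta, T = np.zeros(p), 100
l0 = loss(theta)
for _ in range(T):
    theta -= eta * grad(theta)

bound = (1 - mu * eta * (1 - beta * eta / 2)) ** T * l0
print(f"loss after {T} steps: {loss(theta):.3e}  (Theorem 2.3 bound: {bound:.3e})")
```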
2.2 Application to model optimisation
The above theory can be applied to parameterised models in the following fashion. Let $f : \mathbb{R}^p \times \mathbb{R}^{d_0} \to \mathbb{R}^{d_L}$
be a differentiable, $\mathbb{R}^p$-parameterised family of functions $\mathbb{R}^{d_0} \to \mathbb{R}^{d_L}$ (in later sections, $L$
will denote the number of layers of a deep neural network). Given $N$ training data $\{(x_i, y_i)\}_{i=1}^{N} \subset \mathbb{R}^{d_0} \times \mathbb{R}^{d_L}$,
let $F : \mathbb{R}^p \to \mathbb{R}^{d_L \times N}$ be the corresponding parameter-function map defined by
$$F(\theta)_i := f(\theta, x_i). \tag{5}$$
Any differentiable cost function $c : \mathbb{R}^{d_L} \times \mathbb{R}^{d_L} \to \mathbb{R}_{\geq 0}$, convex in the first variable, extends to a
differentiable, convex function $\gamma : \mathbb{R}^{d_L \times N} \to \mathbb{R}_{\geq 0}$ defined by
$$\gamma\big((z_i)_{i=1}^{N}\big) := \frac{1}{N}\sum_{i=1}^{N} c(z_i, y_i), \tag{6}$$
and one is then concerned with the optimisation of the composite $\ell := \gamma \circ F : \mathbb{R}^p \to \mathbb{R}_{\geq 0}$ via
gradient descent.
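To make the notation concrete, the following is a minimal sketch of the parameter-function map and the composite loss $\ell = \gamma \circ F$ for a toy two-layer tanh model, with the square cost standing in for $c$ (our example; the architecture and dimensions are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d0, h, dL, N = 3, 8, 2, 5                  # input dim, hidden width, output dim, sample count
X = rng.standard_normal((N, d0))           # training inputs x_i
Y = rng.standard_normal((N, dL))           # training targets y_i
p = d0 * h + h * dL                        # total number of parameters

def f(theta, x):
    """The parameterised model f : R^p x R^{d0} -> R^{dL} (a toy two-layer tanh network)."""
    W1 = theta[: d0 * h].reshape(d0, h)
    W2 = theta[d0 * h:].reshape(h, dL)
    return np.tanh(x @ W1) @ W2

def F(theta):
    """Parameter-function map F : R^p -> R^{dL x N}, stacking f(theta, x_i) over the data (eq. 5)."""
    return np.stack([f(theta, x) for x in X], axis=1)

def gamma(Z):
    """Convex extension (eq. 6) of the square cost c(z, y) = 0.5 * ||z - y||^2, averaged over the data."""
    return np.mean(0.5 * np.sum((Z.T - Y) ** 2, axis=1))

loss = lambda theta: gamma(F(theta))       # the composite loss gamma o F
print(loss(rng.standard_normal(p)))
```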
To apply Theorem 2.3, one needs to determine the smoothness and regularity properties of $\ell$. By the
chain rule, the former can be determined given sufficient conditions on the derivatives $D\gamma \circ F$ and
$DF$ (cf. Lemma 2 of [45]). The latter can be bounded by Lemma 3 of [45], which we recall below
and prove in the appendix for the reader's convenience.
Theorem 2.4. Let $S \subseteq \mathbb{R}^p$ be a set. Suppose that $\gamma : \mathbb{R}^{d_L \times N} \to \mathbb{R}_{\geq 0}$ is $\mu$-PŁ over $F(S)$ with
minimum $\gamma^*_S$. Let $\lambda(DF(\theta))$ denote the smallest eigenvalue of $DF(\theta)\,DF(\theta)^T$. Then
$$\|\nabla\ell(\theta)\|^2 \geq \mu\,\lambda(DF(\theta))\,\big(\ell(\theta) - \gamma^*_S\big) \tag{7}$$
for all $\theta \in S$.
Note that Theorem 2.4 is vacuous ($\lambda(DF(\theta)) = 0$ for all $\theta$) unless one is in the overparameterised
regime ($p \geq d_L N$). Even in this regime, however, Theorem 2.4 does not imply that $\ell$ is PŁ unless
$\lambda(DF(\theta))$ can be uniformly lower bounded by a positive constant over $S$. Although universally utilised
in previous literature, such a uniform lower bound will not be possible in our global analysis, and our
convergence theorem does not follow from Theorem 2.3, in contrast to previous work. Our theorem
requires additional argumentation, which we believe may be of independent utility.
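The quantity $\lambda(DF(\theta))$ is straightforward to estimate for small models. The sketch below (ours; $DF$ is approximated by central finite differences, and the toy tanh model mirrors the one above) illustrates that $\lambda(DF(\theta))$ is generically strictly positive at a random $\theta$ when $p \geq d_L N$:

```python
import numpy as np

rng = np.random.default_rng(0)

def min_eig_DF(F, theta, eps=1e-5):
    """Smallest eigenvalue of DF(theta) DF(theta)^T, with DF estimated by central differences."""
    out_dim = F(theta).size
    DF = np.zeros((out_dim, theta.size))
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        DF[:, j] = (F(theta + e).ravel() - F(theta - e).ravel()) / (2 * eps)
    return np.linalg.eigvalsh(DF @ DF.T).min()

# Toy two-layer tanh model on N fixed inputs: p = 160 parameters versus d_L * N = 10 outputs.
d0, h, dL, N = 3, 32, 2, 5
X = rng.standard_normal((N, d0))

def F(theta):
    W1 = theta[: d0 * h].reshape(d0, h)
    W2 = theta[d0 * h:].reshape(h, dL)
    return np.tanh(X @ W1) @ W2            # shape (N, dL)

theta0 = rng.standard_normal(d0 * h + h * dL) / np.sqrt(h)
print("lambda(DF(theta0)) =", min_eig_DF(F, theta0))
```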
3 Related works
Convergence theorems for deep linear networks with the square cost are considered in [4, 2]. In
[22], it is proved that the tangent kernel of a multi-layer perceptron (MLP) becomes approximately
constant over all of parameter space as width goes to infinity, and is positive-definite for certain
data distributions, which by Theorem 2.4 implies that all critical points are global minima. Strictly
speaking, however, [22] does not prove convergence of gradient descent: the authors consider only
gradient flow, and leave Lipschitz concerns untouched. The papers [14, 13, 1, 52, 51, 36] prove that
overparameterised neural nets of varying architectures can be optimised to global minima close to
initialisation by assuming sufficient width of several layers. While [1] does consider the cross-entropy
cost, convergence to a global optimum is not proved: it is instead shown that perfect classification
accuracy can be achieved close to initialisation during training. Improvements on these works have
been made in [33, 32, 7], wherein large width is required of only a single layer.
It is identified in [28] that linearity of the final layer is key in establishing the approximate constancy of
the tangent kernel for wide networks that was used in [14, 13]. By making explicit the implicit use of
the PŁ condition present in previous works [14, 13, 1, 52, 51, 36], [29] proves a convergence theorem
even with nonlinear output layers. The theory explicated in [29] is formalised and generalised in
[45]. A key weakness of all of the works mentioned thus far (bar the purely formal [45]) is that their
hypotheses imply that optimisation trajectories are always close to initialisation. Without this, there
is no obvious way to guarantee the PŁ-inequality along the optimisation trajectory, and hence no
way to guarantee one does not converge to a suboptimal critical point. However, such training is
not possible with the cross-entropy cost, whose global minima only exist at infinity. There is also
evidence to suggest that such training must be avoided for state-of-the-art test performance [8, 15, 26].
In contrast, our theory gives convergence guarantees even for trajectories that travel arbitrarily far
from initialisation, and is the only work of which we are aware that can make this claim.
Among the tools that make our theory work are skip connections [20] and weight normalisation
[41]. The smoothness properties of normalisation schemes have previously been studied [42, 40];
however, these works only give pointwise estimates comparing normalised to non-normalised layers, and
do not provide a global analysis of Lipschitz properties as we do. The regularising effect of skip
connections on the loss landscape has previously been studied in [35]; however, this study is not tightly
linked to optimisation theory. Skip connections have also been shown to enable the interpretation
of a neural network as coordinate transformations of data manifolds [19]. Mean field analyses of
skip connections have been conducted in [49, 30], which necessitate large width; our own analysis
does not. A similarly general framework to the one we supply is given in [47, 48]; while both
encapsulate all presently used architectures, that of [47, 48] is designed for the study of infinite-width
tangent kernels, while ours is designed specifically for optimisation theory. Our empirical singular
value analysis of skip connections complements existing theoretical work using random matrix theory
[18, 37, 39, 34, 16]. These works have not yet considered the shifting effect of skip connections on
layer Jacobians that we observe empirically.
Our theory also links nicely to the intuitive notions of gradient propagation [20] and dynamical
isometry already present in the literature. In tying Jacobian singular values rigorously to loss
regularity in the sense of the Polyak-Łojasiewicz inequality, our theory provides a new link between
dynamical isometry and optimisation theory [43, 38, 46]: specifically, dynamical isometry ensures
better PŁ conditioning and therefore faster and more reliable convergence to global minima. In
linking this productive section of the literature to optimisation theory, our work may open up new
possibilities for convergence proofs for deep networks. We leave further exploration of this topic to
future work.
Due to this relationship with the notion of dynamical isometry, our work also provides optimisation-
theoretic support for the empirical analyses of [24, 6], which study the importance of layerwise dynamical
isometry for trainability and neural architecture search [27, 44, 25]. Recent work on deep kernel
shaping shows, via careful tuning of initialisation and activation functions, that while skip connections
and normalisation layers may be sufficient for good trainability, they are not necessary [31, 50]. Other
recent work has also shown benefits to inference performance from removing skip connections from the
trained model using a parameter transformation [11], or from removing them from the model altogether
and incorporating them only into the optimiser [12].