On skip connections and normalisation layers in deep optimisation

Lachlan E. MacDonald∗
Mathematical Institute for Data Science
Johns Hopkins University
lemacdonald@protonmail.com

Jack Valmadre
Australian Institute for Machine Learning
University of Adelaide

Hemanth Saratchandran
Australian Institute for Machine Learning
University of Adelaide

Simon Lucey
Australian Institute for Machine Learning
University of Adelaide
Abstract
We introduce a general theoretical framework, designed for the study of gradient
optimisation of deep neural networks, that encompasses ubiquitous architecture
choices including batch normalisation, weight normalisation and skip connections.
Our framework determines the curvature and regularity properties of multilayer
loss landscapes in terms of their constituent layers, thereby elucidating the roles
played by normalisation layers and skip connections in globalising these properties.
We then demonstrate the utility of this framework in two respects. First, we give
the only proof of which we are aware that a class of deep neural networks can be
trained using gradient descent to global optima even when such optima only exist
at infinity, as is the case for the cross-entropy cost. Second, we identify a novel
causal mechanism by which skip connections accelerate training, which we verify
predictively with ResNets on MNIST, CIFAR10, CIFAR100 and ImageNet.
1 Introduction
Deep, overparameterised neural networks are efficiently trainable to global optima using simple first
order methods. That this is true is immensely surprising from a theoretical perspective: modern
datasets and deep neural network architectures are so complex and irregular that they are essentially
opaque from the perspective of classical (convex) optimisation theory. A recent surge in inspired
theoretical works [22, 14, 13, 1, 52, 51, 36, 28, 29, 33, 32] has elucidated this phenomenon, showing
linear convergence of gradient descent on certain classes of neural networks, with certain cost
functions, to global optima. The formal principles underlying these works are identical. By taking
width sufficiently large, one guarantees uniform bounds on curvature (via a Lipschitz-gradients-type
property) and regularity (via a Polyak-Łojasiewicz-type inequality) in a neighbourhood of
initialisation. Convergence to a global optimum in that neighbourhood then follows from a well-known
chain of estimates [23].
Despite significant progress, the theory of deep learning optimisation extant in the literature presents
at least three significant shortcomings:
∗Most work done while at the Australian Institute for Machine Learning, University of Adelaide.
37th Conference on Neural Information Processing Systems (NeurIPS 2023). arXiv:2210.05371v4 [cs.LG], 4 Dec 2023.
1. It lacks a formal framework in which to compare common practical architecture choices. Indeed,
none of the aforementioned works consider the impact of ubiquitous (weight/batch) normalisation
layers. Moreover, where common architectural modifications such as skip connections are studied, it
is unclear exactly what impact they have on optimisation. For instance, while in [13] it is shown that
skip connections enable convergence with width polynomial in the number of layers, as compared
with exponential width for chain networks, in [1] polynomial width is shown to be sufficient for
convergence of both chain and residual networks.
2. It lacks theoretical flexibility. The consistent use of uniform curvature and regularity bounds is
insufficiently flexible to enable optimisation guarantees far away from initialisation, where the
local, uniform bounds used in previous theory no longer hold. In particular, proving globally optimal
convergence for deep neural nets with the cross-entropy cost was (until now) an open problem [5].
3. It lacks practical utility. Although it is presently unreasonable to demand quantitatively predictive
bounds on practical performance, existing optimisation theory has been largely unable to inform
architecture design even qualitatively. This is in part due to the first shortcoming, since practical
architectures typically differ substantially from those considered for theoretical purposes.
Our purpose in this article is to take a step in addressing these shortcomings. Specifically:
1. We provide a formal framework, inspired by [45], for the study of multilayer optimisation. Our
framework is sufficiently general to include all commonly used neural network layers, and contains
formal results relating the curvature and regularity properties of multilayer loss landscapes to those
of their constituent layers. As instances, we prove novel results on the global curvature and regularity
properties enabled by normalisation layers and skip connections respectively, in contrast to the local
bounds provided in previous work.
2. Using these novel, global bounds, we identify a class of weight-normalised residual networks
for which, given a linear independence assumption on the data, training by gradient descent provably
converges to a global optimum arbitrarily far away from initialisation. From a regularity perspective,
our analysis is strictly more flexible than the uniform analysis considered in previous works, and in
particular solves the open problem of proving global optimality for the training of deep nets with the
cross-entropy cost.
3. Using our theoretical insight that skip connections aid loss regularity, we conduct a systematic
empirical analysis of the singular value distributions of layer Jacobians for practical layers (a minimal
numerical illustration of this effect appears below). We are thereby able to predict that simple
modifications to the classic ResNet architecture [20] will improve training speed. We verify our
predictions on MNIST, CIFAR10, CIFAR100 and ImageNet.
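To preview the mechanism behind the third contribution: when the Jacobian $J$ of a layer's residual branch has spectral norm below one, the identity contributed by the skip connection shifts every singular value of the block Jacobian $I + J$ towards 1, whereas $J$ alone typically has singular values near zero. The following sketch is our own minimal illustration (the Gaussian Jacobian and its scaling are assumptions for demonstration, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Hypothetical Jacobian of a residual branch, scaled so its spectral norm is below 1.
J = 0.4 * rng.standard_normal((n, n)) / np.sqrt(n)          # spectral norm roughly 0.8

sv_plain = np.linalg.svd(J, compute_uv=False)                # singular values of the plain block
sv_resid = np.linalg.svd(np.eye(n) + J, compute_uv=False)    # singular values with a skip connection

# sigma_min(I + J) >= 1 - ||J|| stays bounded away from zero, while sigma_min(J)
# of a random square matrix is typically close to zero.
print(f"plain block:    sigma_min = {sv_plain[-1]:.3f}, sigma_max = {sv_plain[0]:.3f}")
print(f"residual block: sigma_min = {sv_resid[-1]:.3f}, sigma_max = {sv_resid[0]:.3f}")
```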
2 Background
In this section we give a summary of the principles underlying recent theoretical advances in neural
network optimisation. We discuss related works after this summary for greater clarity.
2.1 Smoothness and the PŁ-inequality
Gradient descent on a possibly non-convex function $\ell : \mathbb{R}^p \to \mathbb{R}_{\geq 0}$ can be guaranteed to converge
to a global optimum by insisting that $\ell$ have Lipschitz gradients and satisfy the Polyak-Łojasiewicz
inequality. We recall these well-known properties here for convenience.

Definition 2.1. Let $\beta > 0$. A continuously differentiable function $\ell : \mathbb{R}^p \to \mathbb{R}$ is said to have
$\beta$-Lipschitz gradients, or is said to be $\beta$-smooth, over a set $S \subseteq \mathbb{R}^p$ if the vector field $\nabla\ell : \mathbb{R}^p \to \mathbb{R}^p$
is $\beta$-Lipschitz. If $S$ is convex, having $\beta$-Lipschitz gradients implies that the inequality
$$\ell(\theta_2) - \ell(\theta_1) \leq \nabla\ell(\theta_1)^T(\theta_2 - \theta_1) + \frac{\beta}{2}\,\|\theta_2 - \theta_1\|^2 \tag{1}$$
holds for all $\theta_1, \theta_2 \in S$.
The $\beta$-smoothness of $\ell$ over $S$ can be thought of as a uniform bound on the curvature of $\ell$ over $S$:
if $\ell$ is twice continuously differentiable, then it has Lipschitz gradients over any compact set $K$, with
(possibly loose) Lipschitz constant given by
$$\beta := \sup_{\theta \in K} \|D^2\ell(\theta)\|, \tag{2}$$
where $D^2\ell$ is the Hessian and $\|\cdot\|$ denotes any matrix norm.
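Formula (2) is easy to estimate numerically. The sketch below is our own toy example (the loss $\ell(a, b) = (ab - 1)^2$ and the box $K = [-2, 2]^2$ are illustrative assumptions): it approximates a Lipschitz constant for $\nabla\ell$ by sampling the spectral norm of the Hessian on a grid covering $K$.

```python
import numpy as np

def hessian(a, b):
    # Hessian of the toy loss l(a, b) = (a*b - 1)^2.
    return np.array([[2 * b * b, 4 * a * b - 2],
                     [4 * a * b - 2, 2 * a * a]])

# Approximate beta = sup_{theta in K} ||D^2 l(theta)|| over the compact box K = [-2, 2]^2
# by taking the maximum spectral norm of the Hessian over a fine grid.
grid = np.linspace(-2.0, 2.0, 201)
beta = max(np.linalg.norm(hessian(a, b), 2) for a in grid for b in grid)
print(f"estimated Lipschitz constant beta over K: {beta:.2f}")
```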
Definition 2.2. Let $\mu > 0$. A differentiable function $\ell : \mathbb{R}^p \to \mathbb{R}_{\geq 0}$ is said to satisfy the $\mu$-Polyak-
Łojasiewicz inequality, or is said to be $\mu$-PŁ, over a set $S \subseteq \mathbb{R}^p$ if
$$\|\nabla\ell(\theta)\|^2 \geq \mu\,\Big(\ell(\theta) - \inf_{\theta' \in S}\ell(\theta')\Big) \tag{3}$$
for all $\theta \in S$.
The PŁ condition on $\ell$ over $S$ is a uniform guarantee of regularity, which implies that all critical
points of $\ell$ over $S$ are $S$-global minima; however, such a function need not be convex. Synthesising
these definitions leads easily to the following result (cf. Theorem 1 of [23]).
Theorem 2.3. Let $\ell : \mathbb{R}^p \to \mathbb{R}_{\geq 0}$ be a continuously differentiable function that is $\beta$-smooth and
$\mu$-PŁ over a convex set $S$. Suppose that $\theta_0 \in S$ and let $\{\theta_t\}_{t=0}^{\infty}$ be the trajectory taken by gradient
descent, with step size $\eta < 2\beta^{-1}$, starting at $\theta_0$. If $\{\theta_t\}_{t=0}^{\infty} \subseteq S$, then $\ell(\theta_t)$ converges to an $S$-global
minimum at a linear rate:
$$\ell(\theta_t) \leq \Big(1 - \mu\eta\Big(1 - \frac{\beta\eta}{2}\Big)\Big)^t\,\ell(\theta_0) \tag{4}$$
for all $t \in \mathbb{N}$.
Essentially, while the Lipschitz constant of the gradients controls whether or not gradient descent
with a given step size can be guaranteed to decrease the loss at each step, the PŁ constant determines
by how much the loss will decrease. These ideas can be applied to the optimisation of deep neural
nets as follows.
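As a concrete sanity check of Theorem 2.3 (our own toy example, not from the paper): the overparameterised least-squares loss $\ell(\theta) = \frac{1}{2}\|X\theta - y\|^2$ with $X \in \mathbb{R}^{N \times p}$ of full row rank is $\beta$-smooth with $\beta = \lambda_{\max}(X^T X)$ and $\mu$-PŁ with $\mu = 2\lambda_{\min}(X X^T)$ and infimum zero, so the loss observed under gradient descent can be compared directly against the bound (4).

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 100                              # overparameterised: p >> N
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

loss = lambda theta: 0.5 * np.sum((X @ theta - y) ** 2)
grad = lambda theta: X.T @ (X @ theta - y)

eigs = np.linalg.eigvalsh(X @ X.T)
beta = eigs.max()                           # = lambda_max(X^T X): Lipschitz constant of the gradient
mu = 2 * eigs.min()                         # PL constant; positive since X has full row rank
eta = 1.0 / beta                            # any step size below 2 / beta is admissible

theta, T = np.zeros(p), 100
l0 = loss(theta)
for _ in range(T):
    theta -= eta * grad(theta)

bound = (1 - mu * eta * (1 - beta * eta / 2)) ** T * l0
print(f"loss after {T} steps: {loss(theta):.3e}  (Theorem 2.3 bound: {bound:.3e})")
```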
2.2 Application to model optimisation
The above theory can be applied to parameterised models in the following fashion. Let $f : \mathbb{R}^p \times \mathbb{R}^{d_0} \to \mathbb{R}^{d_L}$
be a differentiable, $\mathbb{R}^p$-parameterised family of functions $\mathbb{R}^{d_0} \to \mathbb{R}^{d_L}$ (in later sections, $L$
will denote the number of layers of a deep neural network). Given $N$ training data $\{(x_i, y_i)\}_{i=1}^{N} \subset \mathbb{R}^{d_0} \times \mathbb{R}^{d_L}$,
let $F : \mathbb{R}^p \to \mathbb{R}^{d_L \times N}$ be the corresponding parameter-function map defined by
$$F(\theta)_i := f(\theta, x_i). \tag{5}$$
Any differentiable cost function $c : \mathbb{R}^{d_L} \times \mathbb{R}^{d_L} \to \mathbb{R}_{\geq 0}$, convex in the first variable, extends to a
differentiable, convex function $\gamma : \mathbb{R}^{d_L \times N} \to \mathbb{R}_{\geq 0}$ defined by
$$\gamma\big((z_i)_{i=1}^{N}\big) := \frac{1}{N}\sum_{i=1}^{N} c(z_i, y_i), \tag{6}$$
and one is then concerned with the optimisation of the composite $\ell := \gamma \circ F : \mathbb{R}^p \to \mathbb{R}_{\geq 0}$ via
gradient descent.
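To make the notation concrete, the following is a minimal sketch of the parameter-function map and the composite loss $\ell = \gamma \circ F$ for a toy two-layer tanh model, with the square cost standing in for $c$ (our example; the architecture and dimensions are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d0, h, dL, N = 3, 8, 2, 5                  # input dim, hidden width, output dim, sample count
X = rng.standard_normal((N, d0))           # training inputs x_i
Y = rng.standard_normal((N, dL))           # training targets y_i
p = d0 * h + h * dL                        # total number of parameters

def f(theta, x):
    """The parameterised model f : R^p x R^{d0} -> R^{dL} (a toy two-layer tanh network)."""
    W1 = theta[: d0 * h].reshape(d0, h)
    W2 = theta[d0 * h:].reshape(h, dL)
    return np.tanh(x @ W1) @ W2

def F(theta):
    """Parameter-function map F : R^p -> R^{dL x N}, stacking f(theta, x_i) over the data (eq. 5)."""
    return np.stack([f(theta, x) for x in X], axis=1)

def gamma(Z):
    """Convex extension (eq. 6) of the square cost c(z, y) = 0.5 * ||z - y||^2, averaged over the data."""
    return np.mean(0.5 * np.sum((Z.T - Y) ** 2, axis=1))

loss = lambda theta: gamma(F(theta))       # the composite loss gamma o F
print(loss(rng.standard_normal(p)))
```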
To apply Theorem 2.3, one needs to determine the smoothness and regularity properties of $\ell$. By the
chain rule, the former can be determined given sufficient conditions on the derivatives $D\gamma \circ F$ and
$DF$ (cf. Lemma 2 of [45]). The latter can be bounded by Lemma 3 of [45], which we recall below
and prove in the appendix for the reader's convenience.
Theorem 2.4. Let $S \subseteq \mathbb{R}^p$ be a set. Suppose that $\gamma : \mathbb{R}^{d_L \times N} \to \mathbb{R}_{\geq 0}$ is $\mu$-PŁ over $F(S)$ with
minimum $\gamma^*_S$. Let $\lambda(DF(\theta))$ denote the smallest eigenvalue of $DF(\theta)\,DF(\theta)^T$. Then
$$\|\nabla\ell(\theta)\|^2 \geq \mu\,\lambda(DF(\theta))\,\big(\ell(\theta) - \gamma^*_S\big) \tag{7}$$
for all $\theta \in S$.
Note that Theorem 2.4 is vacuous ($\lambda(DF(\theta)) = 0$ for all $\theta$) unless one is in the overparameterised
regime ($p \geq d_L N$). Even in this regime, however, Theorem 2.4 does not imply that $\ell$ is PŁ unless
$\lambda(DF(\theta))$ can be uniformly lower bounded by a positive constant over $S$. Although universally utilised
in previous literature, such a uniform lower bound will not be possible in our global analysis, and our
convergence theorem does not follow from Theorem 2.3, in contrast to previous work. Our theorem
requires additional argumentation, which we believe may be of independent utility.
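The quantity $\lambda(DF(\theta))$ is straightforward to estimate for small models. The sketch below (ours; $DF$ is approximated by central finite differences, and the toy tanh model mirrors the one above) illustrates that $\lambda(DF(\theta))$ is generically strictly positive at a random $\theta$ when $p \geq d_L N$:

```python
import numpy as np

rng = np.random.default_rng(0)

def min_eig_DF(F, theta, eps=1e-5):
    """Smallest eigenvalue of DF(theta) DF(theta)^T, with DF estimated by central differences."""
    out_dim = F(theta).size
    DF = np.zeros((out_dim, theta.size))
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        DF[:, j] = (F(theta + e).ravel() - F(theta - e).ravel()) / (2 * eps)
    return np.linalg.eigvalsh(DF @ DF.T).min()

# Toy two-layer tanh model on N fixed inputs: p = 160 parameters versus d_L * N = 10 outputs.
d0, h, dL, N = 3, 32, 2, 5
X = rng.standard_normal((N, d0))

def F(theta):
    W1 = theta[: d0 * h].reshape(d0, h)
    W2 = theta[d0 * h:].reshape(h, dL)
    return np.tanh(X @ W1) @ W2            # shape (N, dL)

theta0 = rng.standard_normal(d0 * h + h * dL) / np.sqrt(h)
print("lambda(DF(theta0)) =", min_eig_DF(F, theta0))
```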
3 Related works
Convergence theorems for deep linear networks with the square cost are considered in [4, 2]. In
[22], it is proved that the tangent kernel of a multi-layer perceptron (MLP) becomes approximately
constant over all of parameter space as width goes to infinity, and is positive-definite for certain
data distributions, which by Theorem 2.4 implies that all critical points are global minima. Strictly
speaking, however, [22] does not prove convergence of gradient descent: the authors consider only
gradient flow, and leave Lipschitz concerns untouched. The papers [14, 13, 1, 52, 51, 36] prove that
overparameterised neural nets of varying architectures can be optimised to global minima close to
initialisation by assuming sufficient width of several layers. While [1] does consider the cross-entropy
cost, convergence to a global optimum is not proved: it is instead shown that perfect classification
accuracy can be achieved close to initialisation during training. Improvements on these works have
been made in [33, 32, 7], wherein large width is required of only a single layer.
It is identified in [28] that linearity of the final layer is key in establishing the approximate constancy of
the tangent kernel for wide networks that was used in [14, 13]. By making explicit the implicit use of
the PŁ condition present in previous works [14, 13, 1, 52, 51, 36], [29] proves a convergence theorem
even with nonlinear output layers. The theory explicated in [29] is formalised and generalised in
[45]. A key weakness of all of the works mentioned thus far (bar the purely formal [45]) is that their
hypotheses imply that optimisation trajectories are always close to initialisation. Without this, there
is no obvious way to guarantee the PŁ-inequality along the optimisation trajectory, and hence no
way to guarantee one does not converge to a suboptimal critical point. However, such training is
not possible with the cross-entropy cost, whose global minima only exist at infinity. There is also
evidence to suggest that such training must be avoided for state-of-the-art test performance [8, 15, 26].
In contrast, our theory gives convergence guarantees even for trajectories that travel arbitrarily far
from initialisation, and is the only work of which we are aware that can make this claim.
Among the tools that make our theory work are skip connections [20] and weight normalisation
[41]. The smoothness properties of normalisation schemes have previously been studied [42, 40];
however, these works only give pointwise estimates comparing normalised to non-normalised layers, and
do not provide a global analysis of Lipschitz properties as we do. The regularising effect of skip
connections on the loss landscape has previously been studied in [35]; however, this study is not tightly
linked to optimisation theory. Skip connections have also been shown to enable the interpretation
of a neural network as coordinate transformations of data manifolds [19]. Mean field analyses of
skip connections have been conducted in [49, 30], which necessitate large width; our own analysis
does not. A similarly general framework to the one we supply is given in [47, 48]; while both
encapsulate all presently used architectures, that of [47, 48] is designed for the study of infinite-width
tangent kernels, while ours is designed specifically for optimisation theory. Our empirical singular
value analysis of skip connections complements existing theoretical work using random matrix theory
[18, 37, 39, 34, 16]. These works have not yet considered the shifting effect of skip connections on
layer Jacobians that we observe empirically.
Our theory also links nicely to the intuitive notions of gradient propagation [20] and dynamical
isometry already present in the literature. In tying Jacobian singular values rigorously to loss
regularity in the sense of the Polyak-Łojasiewicz inequality, our theory provides a new link between
dynamical isometry and optimisation theory [43, 38, 46]: specifically, dynamical isometry ensures
better PŁ conditioning and therefore faster and more reliable convergence to global minima. In
linking this productive section of the literature to optimisation theory, our work may open up new
possibilities for convergence proofs for deep networks. We leave further exploration of this topic to
future work.
Due to this relationship with the notion of dynamical isometry, our work also provides optimisation-
theoretic support for the empirical analyses of [24, 6], which study the importance of layerwise dynamical
isometry for trainability and neural architecture search [27, 44, 25]. Recent work on deep kernel
shaping shows, via careful tuning of initialisation and activation functions, that while skip connections
and normalisation layers may be sufficient for good trainability, they are not necessary [31, 50]. Other
recent work has also shown benefits to inference performance from removing skip connections from the
trained model using a parameter transformation [11], or from removing them from the model altogether
and incorporating them only into the optimiser [12].