On the infinite-depth limit of finite-width neural networks
Soufiane Hayou hayou@nus.edu.sg
Department of Mathematics
National University of Singapore
Abstract
In this paper, we study the infinite-depth limit of finite-width residual neural networks
with random Gaussian weights. With proper scaling, we show that by fixing the width
and taking the depth to infinity, the pre-activations converge in distribution to a zero-drift
diffusion process. Unlike the infinite-width limit, where the pre-activations converge weakly
to a Gaussian random variable, we show that the infinite-depth limit yields different
distributions depending on the choice of the activation function. We document two cases where
these distributions have closed-form (different) expressions. We further show an intriguing
change of regime phenomenon of the post-activation norms when the width increases from
3 to 4. Lastly, we study the sequential limit infinite-depth-then-infinite-width and compare
it with the more commonly studied infinite-width-then-infinite-depth limit.
Peer-reviewed version: The first version of this paper was published in Transactions
of Machine Learning Research (TMLR, https://openreview.net/forum?id=RbLsYz1Az9).
This version contains some updates and improvements in the proofs.
1. Introduction
The empirical success of over-parameterized neural networks has sparked a growing interest
in the theoretical understanding of these models. The large number of parameters –
millions if not billions – and the complex non-linear nature of the neural computations
make this hypothesis space highly non-trivial. However, in certain
situations, increasing the number of parameters has the effect of ‘placing’ the network
in some ‘average’ regime that simplifies the theoretical analysis. This is the case with the
infinite-width asymptotics of random neural networks. The infinite-width limit of neural
network architectures has been extensively studied in the literature, and has led to many
interesting theoretical and algorithmic innovations. We summarize these results below.
Initialization schemes: the infinite-width limit of different neural architectures has been
extensively studied in the literature. In particular, for multi-layer perceptrons (MLP),
a new initialization scheme that stabilizes forward and backward propagation (in the
infinite-width limit) was derived in [1,2]. This initialization scheme is known as the Edge
of Chaos, and empirical results show that it significantly improves performance. In [3,
4], the authors derived similar results for the ResNet architecture, and showed that this
architecture is placed by default on the Edge of Chaos for any choice of the variances of
the initialization weights (Gaussian weights). In [5], the authors showed that an MLP
that is initialized on the Edge of Chaos exhibits similar properties to ResNets, which
might partially explain the benefits of the Edge of Chaos initialization.
Gaussian process behaviour: Multiple papers (e.g. [6–10]) studied the weak limit of
neural networks when the width goes to infinity. The results show that a randomly
initialized neural network (with Gaussian weights) has a similar behaviour to that of a
Gaussian process, for a wide range of neural architectures, and under mild conditions
on the activation function. In [7], the authors leveraged this result and introduced the
neural network Gaussian process (NNGP), which is a Gaussian process model with a
neural kernel that depends on the architecture and the activation function. Bayesian
regression experiments with the NNGP showed that it surprisingly achieves performance close
to that of an SGD-trained finite-width neural network.
The large depth limit of this Gaussian process was studied in [4], where the authors
showed that with proper scaling, the infinite-depth (weak) limit is a Gaussian process
with a universal kernel (a kernel is called universal when any continuous function on some
compact set can be approximated arbitrarily well with kernel features).
Neural Tangent Kernel (NTK): the infinite-width limit of the NTK is the so-called NTK
regime or Lazy-training regime. This topic has been extensively studied in the literature.
The optimization and generalization properties (and some other aspects) of the NTK
have been studied in [11–14]. The large depth asymptotics of the NTK have been studied
in [15–18]. We refer the reader to [19] for a comprehensive discussion on the NTK.
Others: the theory of infinite-width neural networks has also been utilized for network
pruning [20,21], regularization [22], feature learning [23], and ensembling methods [24]
(this is by no means an exhaustive list).
The theoretical analysis of infinite-width neural networks has certainly led to many in-
teresting (theoretical and practical) discoveries. However, most works on this limit consider
a fixed depth network. What about infinite-depth? Existing works on the infinite-depth
limit can generally be divided into three categories:
Infinite-width-then-infinite-depth limit: in this case, the width is taken to infinity first,
then the depth is taken to infinity. This is the infinite-depth limit of infinite-width neural
networks. This limit was particularly used to derive the Edge of Chaos initialization
scheme [1,2], study the impact of the activation function [5], the behaviour of the NTK
[15,18], kernel shaping [25,26] etc.
The joint infinite-width-and-depth limit: in this case, the depth-to-width ratio is fixed,
and therefore, the width and depth are jointly taken to infinity. There
are only a few works that study the joint width-depth limit. For instance, in [27], the authors
showed that for a special form of residual neural networks (ResNet), the network output
exhibits a (scaled) log-normal behaviour in this joint limit. This is different from the
sequential limit where width is taken to infinity first, followed by the depth, in which
case the distribution of the network output is asymptotically normal ([2,5]). In [28], the
authors studied the covariance kernel of an MLP in the joint limit, and showed that it
converges weakly to the solution of a Stochastic Differential Equation (SDE). In [29], the
authors showed that in the joint limit case, the NTK of an MLP remains random when
the width and depth jointly go to infinity. This is different from the deterministic limit
of the NTK where the width is taken to infinity before depth [15]. More recently, in [30],
the author explored the impact of the depth-to-width ratio on the correlation kernel and
the gradient norms in the case of an MLP architecture, and showed that this ratio can
be interpreted as an effective network depth.
Infinite-depth limit of finite-width neural networks: in both previous limits (infinite-width-
then-infinite-depth limit, and the joint infinite-width-depth limit), the width goes to
infinity. Naturally, one might ask: what happens if the width is fixed and the depth goes to
infinity? What is the limiting distribution of the network output at initialization? In
[31], the author showed that neural networks with bounded width are still universal
approximators, which motivates the study of finite-width large depth neural networks.
In [32], the authors showed that the pre-activations of a particular ResNet architecture
converge weakly to a diffusion process in the infinite-depth limit. This stems from the
fact that ResNets can be seen as discretizations of SDEs (see Section 2).
In the present paper, we study the infinite-depth limit of finite-width ResNet with random
Gaussian weights (an architecture that is different from the one studied in [32]). We are
particularly interested in the asymptotic behaviour of the pre/post-activation values. Our
contributions are four-fold:
1. Unlike the infinite-width limit, we show that the resulting distribution of the pre-
activations in the infinite-depth limit is not necessarily Gaussian. In the simple case
of networks of width 1, we study two cases where we obtain known but completely
different distributions by carefully choosing the activation function.
2. For ReLU activation function, we introduce and discuss the phenomenon of network
collapse. This phenomenon occurs when the pre-activations in some hidden layer
all have non-positive values, which results in zero post-activations. This leads to a
stagnant network where increasing the depth beyond a certain level has no effect on
the network output. For any fixed width, we show that in the infinite-depth limit,
network collapse is a zero-probability event, meaning that almost surely, all post-
activations in the network are non-zero.
3. For networks with general width, where the distribution of the pre-activations is gen-
erally intractable, we focus on the norm of the post-activations with the ReLU activation
function, and show that this norm approximately follows Geometric Brownian Motion
(GBM) dynamics; we call this Quasi-GBM. We also shed light on a regime-change
phenomenon that occurs when the width $n$ increases from 3 to 4: for width $n \leq 3$,
resp. $n \geq 4$, the logarithmic growth factor of the post-activations is negative, resp. positive.
4. We study the sequential limit infinite-depth-then-infinite-width, which is the converse
of the more commonly studied infinite-width-then-infinite-depth limit, and show some
key differences between these limits. We particularly show that the pre-activations
converge to the solution of a McKean-Vlasov process, which has marginal Gaussian
distributions, and thus we recover the Gaussian behaviour in this limit. We compare
the two sequential limits and discuss some differences.
The proofs of the theoretical results are provided in the appendix and referenced after
each result. Empirical evaluations of these theoretical findings are also provided.
2. The infinite-depth limit
Hereafter, we denote the width, resp. depth, of the network by $n$, resp. $L$. We also denote
the input dimension by $d$. Let $d, n, L \geq 1$, and consider the following ResNet architecture
of width $n$ and depth $L$:
$$Y_0 = W_{in} x, \qquad x \in \mathbb{R}^d,$$
$$Y_l = Y_{l-1} + \frac{1}{\sqrt{L}}\, W_l\, \phi(Y_{l-1}), \qquad l = 1, \dots, L, \qquad (1)$$
where $\phi : \mathbb{R} \to \mathbb{R}$ is the activation function, $L \geq 1$ is the network depth, $W_{in} \in \mathbb{R}^{n \times d}$, and
$W_l \in \mathbb{R}^{n \times n}$ is the weight matrix in the $l$-th layer. We assume that the weights are randomly
initialized with iid Gaussian variables $W_l^{ij} \sim \mathcal{N}(0, \tfrac{1}{n})$ and $W_{in}^{ij} \sim \mathcal{N}(0, \tfrac{1}{d})$. For the sake of
simplicity, we only consider networks with no bias, and we omit the dependence of $Y_l$
on $n$ in the notation. While the activation function is only defined for real numbers, we
abuse notation and write $\phi(z) = (\phi(z_1), \dots, \phi(z_k))$ for any $k$-dimensional vector
$z = (z_1, \dots, z_k) \in \mathbb{R}^k$, $k \geq 1$. We refer to the vectors $\{Y_l,\ l = 0, \dots, L\}$ as the pre-activations
and to the vectors $\{\phi(Y_l),\ l = 0, \dots, L\}$ as the post-activations. Hereafter, $x \in \mathbb{R}^d$
is fixed, and we assume that $x \neq 0$.
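To make the setup concrete, the following is a minimal NumPy sketch (ours, not from the paper) of one forward pass through the architecture in Eq. (1) at initialization; the function name resnet_forward and the ReLU choice are illustrative assumptions.

```python
import numpy as np

def resnet_forward(x, n, L, phi, rng):
    """One forward pass of the ResNet in Eq. (1) at initialization.

    x   : input vector in R^d
    n   : width
    L   : depth
    phi : activation function, applied entrywise
    """
    d = x.shape[0]
    # W_in^{ij} ~ N(0, 1/d)
    W_in = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d))
    Y = W_in @ x                                  # Y_0 = W_in x
    for _ in range(L):
        # W_l^{ij} ~ N(0, 1/n), sampled independently at every layer
        W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))
        Y = Y + (W @ phi(Y)) / np.sqrt(L)         # Y_l = Y_{l-1} + (1/sqrt(L)) W_l phi(Y_{l-1})
    return Y

rng = np.random.default_rng(0)
x = np.array([1.0, -0.5, 2.0])                    # d = 3
relu = lambda z: np.maximum(z, 0.0)
print(resnet_forward(x, n=4, L=1000, phi=relu, rng=rng))
```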
The $1/\sqrt{L}$ scaling in Eq. (1) is not arbitrary. This specific scaling was shown to stabilize
the norm of $Y_l$ as well as the gradient norms in the large-depth limit (e.g. [4,20,33]). In the
next result, we show that the infinite-depth limit of Eq. (1) (in the sense of distribution)
exists and has the same distribution as the solution of a stochastic differential equation. In
the case of a single input, this has already been shown in [32]. The details are provided in
Appendix A. We also generalize this result to the case of multiple inputs and obtain similar
SDE dynamics (see Proposition 5 in the Appendix).
Proposition 1 Assume that the activation function $\phi$ is Lipschitz on $\mathbb{R}^n$. Then, in the
limit $L \to \infty$, the process $X^L_t = Y_{\lfloor tL \rfloor}$, $t \in [0, 1]$, converges in distribution to the solution of
the following SDE
$$dX_t = \frac{1}{\sqrt{n}}\, \|\phi(X_t)\|\, dB_t, \qquad X_0 = W_{in} x, \qquad (2)$$
where $(B_t)_{t \geq 0}$ is a Brownian motion (Wiener process), independent from $W_{in}$. Moreover,
we have that for any $t \in [0, 1]$ and any Lipschitz function $\Psi : \mathbb{R}^n \to \mathbb{R}$,
$$\mathbb{E}\,\Psi(Y_{\lfloor tL \rfloor}) = \mathbb{E}\,\Psi(X_t) + O(L^{-1/2}),$$
where the constant in the $O$ term does not depend on $t$.
Moreover, if the activation function $\phi$ is only locally Lipschitz, then $X^L_t$ converges locally
to $X_t$. More precisely, for any fixed $r > 0$, we consider the stopping times
$$\tau_L = \inf\{t \geq 0 : \|X^L_t\| \geq r\}, \qquad \tau = \inf\{t \geq 0 : \|X_t\| \geq r\};$$
then the stopped process $X^L_{t \wedge \tau_L}$ converges in distribution to the stopped solution $X_{t \wedge \tau}$ of the
above SDE.
The proof of Proposition 1 is provided in Appendix A.6. We use classical results on
the numerical approximation of SDEs. Proposition 1 shows that the infinite-depth limit
of the finite-width ResNet (Eq. (1)) has a similar behaviour to the solution of the SDE given
in Eq. (7). In this limit, $Y_{\lfloor tL \rfloor}$ converges in distribution to $X_t$. Hence, properties of the
solutions of Eq. (7) should theoretically be ‘shared’ by the pre-activations $Y_{\lfloor tL \rfloor}$ when the
depth is large. For the rest of the paper, we study some properties of the solutions of
Eq. (7). This requires the definition of filtered probability spaces, which we omit here. All
the technical details are provided in Appendix A. We compare the theoretical findings with
empirical results obtained by simulating the pre/post-activations of the original network
Eq. (1). We refer to $X_t$, the solution of Eq. (7), as the infinite-depth network.
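As an empirical sanity check of this correspondence (our own sketch, not taken from the paper), one can compare a Monte Carlo estimate of $\mathbb{E}\,\Psi(Y_L)$ for the network in Eq. (1) with an Euler-Maruyama discretization of the SDE in Eq. (2); the test function $\Psi(y) = \tanh(y_1)$ is an arbitrary Lipschitz choice. By Proposition 1, the two estimates should agree up to $O(L^{-1/2})$ plus Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 3
x = np.array([1.0, -0.5, 2.0])
phi = lambda z: np.maximum(z, 0.0)      # ReLU (Lipschitz)
Psi = lambda y: np.tanh(y[0])           # arbitrary Lipschitz test function

def network_sample(L):
    """One sample of Y_L from the ResNet in Eq. (1)."""
    Y = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d)) @ x
    for _ in range(L):
        W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))
        Y = Y + (W @ phi(Y)) / np.sqrt(L)
    return Y

def sde_sample(n_steps):
    """Euler-Maruyama sample of X_1 for dX_t = (1/sqrt(n)) ||phi(X_t)|| dB_t."""
    dt = 1.0 / n_steps
    X = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d)) @ x
    for _ in range(n_steps):
        dB = rng.normal(0.0, np.sqrt(dt), size=n)
        X = X + np.linalg.norm(phi(X)) / np.sqrt(n) * dB
    return X

m = 2000
print("E Psi(Y_L) ~", np.mean([Psi(network_sample(200)) for _ in range(m)]))
print("E Psi(X_1) ~", np.mean([Psi(sde_sample(500)) for _ in range(m)]))
```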
The distribution of $X_1$ (the last layer in the infinite-depth limit) is generally intractable,
unlike in the infinite-width-then-infinite-depth limit (Gaussian, [4]) or the joint infinite-depth-
and-width limit (which involves a log-normal distribution in the case of an MLP architecture,
[27]). Intuitively, one should not expect a universal behaviour (e.g. the Gaussian behaviour
in the infinite-width case) of the solution of Eq. (7), as the latter is highly sensitive to the
choice of the activation function, and different activation functions might yield completely
different distributions of $X_1$. We demonstrate this in the next section by showing that we
can recover closed-form distributions by carefully choosing the activation function. The
main ingredient is Itô's lemma. See Appendix A for more details.
3. Different behaviours depending on the activation function
In this section, we restrict our analysis to a width-1 ResNet with one-dimensional inputs,
where each layer consists of a single neuron, i.e. $d = n = 1$. In this case, the process
$(X_t)_{0 \leq t \leq 1}$ is one-dimensional and is the solution of the following SDE
$$dX_t = |\phi(X_t)|\, dB_t, \qquad X_0 = W_{in} x.$$
We can get rid of the absolute value in the equation above since the process $X_t$ has the
same distribution as $\tilde{X}_t$, the solution of the SDE $d\tilde{X}_t = \phi(\tilde{X}_t)\, dB_t$. The intuition behind
this is that the infinitesimal increment $dB_t$ is Gaussian with zero mean and variance $dt$;
it is therefore a symmetric random variable and can absorb the sign of $\phi(X_t)$. The rigorous
justification of this fact is provided in Theorem 7 in the Appendix. Hereafter in this section,
we consider the process $X$, the solution of the SDE
$$dX_t = \phi(X_t)\, dB_t, \qquad X_0 = W_{in} x.$$
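The claim that the absolute value can be dropped is easy to check numerically. The sketch below is ours; $\phi = \tanh$ and a deterministic starting point $X_0 = 0.5$ (instead of sampling $W_{in}x$) are arbitrary illustrative choices. It simulates both SDEs with the Euler-Maruyama scheme and compares a few empirical quantiles of $X_1$, which should match up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = np.tanh                 # arbitrary Lipschitz activation, for illustration
x0, n_steps, m = 0.5, 500, 20000
dt = 1.0 / n_steps

def terminal_values(use_abs):
    """m Euler-Maruyama samples of X_1 for dX = |phi(X)| dB (use_abs=True) or dX = phi(X) dB."""
    X = np.full(m, x0)
    for _ in range(n_steps):
        coeff = np.abs(phi(X)) if use_abs else phi(X)
        X = X + coeff * rng.normal(0.0, np.sqrt(dt), size=m)
    return X

qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print("with |phi| :", np.round(np.quantile(terminal_values(True), qs), 3))
print("with  phi  :", np.round(np.quantile(terminal_values(False), qs), 3))
```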
Given a function $g \in \mathcal{C}^2(\mathbb{R})$ (the space of twice-differentiable functions $g : \mathbb{R} \to \mathbb{R}$ with
continuous second derivative), we use Itô's lemma (Lemma 4 in the appendix) to derive the
dynamics of the process $g(X_t)$. We obtain
$$dg(X_t) = \underbrace{\phi(X_t)\, g'(X_t)}_{\sigma(X_t)}\, dB_t + \underbrace{\tfrac{1}{2}\, \phi(X_t)^2\, g''(X_t)}_{\mu(X_t)}\, dt. \qquad (3)$$
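As a simple illustration of how Eq. (3) is used (our own example, not necessarily one of the two cases documented in this section), take the identity activation $\phi(x) = x$ and assume $X_0 = W_{in} x > 0$. Choosing $g = \log$, so that $g'(x) = 1/x$ and $g''(x) = -1/x^2$, Eq. (3) gives
$$d\log X_t = \underbrace{X_t \cdot \tfrac{1}{X_t}}_{\sigma(X_t)}\, dB_t + \underbrace{\tfrac{1}{2}\, X_t^2 \cdot \big(-\tfrac{1}{X_t^2}\big)}_{\mu(X_t)}\, dt = dB_t - \tfrac{1}{2}\, dt,$$
so that $X_1 = X_0 \exp(B_1 - \tfrac{1}{2})$, i.e. $X_1$ is log-normal conditionally on $X_0$. A different activation changes both $\sigma$ and $\mu$ in Eq. (3) and hence the law of $X_1$, which is the mechanism behind the closed-form cases discussed in this section.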