On the infinite-depth limit of finite-width neural networks
Soufiane Hayou hayou@nus.edu.sg
Department of Mathematics
National University of Singapore
Abstract
In this paper, we study the infinite-depth limit of finite-width residual neural networks
with random Gaussian weights. With proper scaling, we show that by fixing the width
and taking the depth to infinity, the pre-activations converge in distribution to a zero-drift
diffusion process. Unlike the infinite-width limit, where the pre-activations converge weakly
to a Gaussian random variable, we show that the infinite-depth limit yields different
distributions depending on the choice of the activation function. We document two cases where
these distributions have closed-form (different) expressions. We further show an intriguing
change of regime phenomenon of the post-activation norms when the width increases from
3 to 4. Lastly, we study the sequential limit infinite-depth-then-infinite-width and compare
it with the more commonly studied infinite-width-then-infinite-depth limit.
Peer-reviewed version: The first version of this paper was published in Transactions
of Machine Learning Research (TMLR, https://openreview.net/forum?id=RbLsYz1Az9).
This version contains some updates and improvements in the proofs.
1. Introduction
The empirical success of over-parameterized neural networks has sparked a growing interest
in the theoretical understanding of these models. The large number of parameters –
millions if not billions – and the complex non-linear nature of the neural computations
make this hypothesis space highly non-trivial. However, in certain
situations, increasing the number of parameters has the effect of ‘placing’ the network
in some ‘average’ regime that simplifies the theoretical analysis. This is the case with the
infinite-width asymptotics of random neural networks. The infinite-width limit of neural
network architectures has been extensively studied in the literature, and has led to many
interesting theoretical and algorithmic innovations. We summarize these results below.
Initialization schemes: the infinite-width limit of different neural architectures has been
extensively studied in the literature. In particular, for multi-layer perceptrons (MLP),
a new initialization scheme that stabilizes forward and backward propagation (in the
infinite-width limit) was derived in [1,2]. This initialization scheme is known as the Edge
of Chaos, and empirical results show that it significantly improves performance. In [3,
4], the authors derived similar results for the ResNet architecture, and showed that this
architecture is placed by default on the Edge of Chaos for any choice of the variances of
the initialization weights (Gaussian weights). In [5], the authors showed that an MLP
that is initialized on the Edge of Chaos exhibits similar properties to ResNets, which
might partially explain the benefits of the Edge of Chaos initialization.
Gaussian process behaviour: Multiple papers (e.g. [6–10]) studied the weak limit of
neural networks when the width goes to infinity. The results show that a randomly
initialized neural network (with Gaussian weights) has a similar behaviour to that of a
Gaussian process, for a wide range of neural architectures, and under mild conditions
on the activation function. In [7], the authors leveraged this result and introduced the
neural network Gaussian process (NNGP), which is a Gaussian process model with a
neural kernel that depends on the architecture and the activation function. Bayesian
regression experiments with the NNGP showed that it surprisingly achieves performance close
to that of an SGD-trained finite-width neural network.
The large depth limit of this Gaussian process was studied in [4], where the authors
showed that with proper scaling, the infinite-depth (weak) limit is a Gaussian process
with a universal kernel (a kernel is called universal when any continuous function on some
compact set can be approximated arbitrarily well with kernel features).
Neural Tangent Kernel (NTK): the infinite-width limit of the NTK is the so-called NTK
regime or Lazy-training regime. This topic has been extensively studied in the literature.
The optimization and generalization properties (and some other aspects) of the NTK
have been studied in [11–14]. The large depth asymptotics of the NTK have been studied
in [15–18]. We refer the reader to [19] for a comprehensive discussion on the NTK.
Others: the theory of infinite-width neural networks has also been utilized for network
pruning [20,21], regularization [22], feature learning [23], and ensembling methods [24]
(this is by no means an exhaustive list).
The theoretical analysis of infinite-width neural networks has certainly led to many in-
teresting (theoretical and practical) discoveries. However, most works on this limit consider
a fixed depth network. What about infinite-depth? Existing works on the infinite-depth
limit can generally be divided into three categories:
Infinite-width-then-infinite-depth limit: in this case, the width is taken to infinity first,
then the depth is taken to infinity. This is the infinite-depth limit of infinite-width neural
networks. This limit was particularly used to derive the Edge of Chaos initialization
scheme [1,2], study the impact of the activation function [5], the behaviour of the NTK
[15,18], kernel shaping [25,26] etc.
The joint infinite-width-and-depth limit: in this case, the depth-to-width ratio is fixed,
and therefore, the width and depth are jointly taken to infinity. There
are only a few works that study the joint width-depth limit. For instance, in [27], the authors
showed that for a special form of residual neural networks (ResNet), the network output
exhibits a (scaled) log-normal behaviour in this joint limit. This is different from the
sequential limit where width is taken to infinity first, followed by the depth, in which
case the distribution of the network output is asymptotically normal ([2,5]). In [28], the
authors studied the covariance kernel of an MLP in the joint limit, and showed that it
converges weakly to the solution of a Stochastic Differential Equation (SDE). In [29], the
authors showed that in the joint limit case, the NTK of an MLP remains random when
the width and depth jointly go to infinity. This is different from the deterministic limit
of the NTK where the width is taken to infinity before depth [15]. More recently, in [30],
the author explored the impact of the depth-to-width ratio on the correlation kernel and
the gradient norms in the case of an MLP architecture, and showed that this ratio can
be interpreted as an effective network depth.
Infinite-depth limit of finite-width neural networks: in both previous limits (infinite-width-
then-infinite-depth limit, and the joint infinite-width-depth limit), the width goes to
infinity. Naturally, one might ask: what happens if the width is fixed and the depth goes to
infinity? What is the limiting distribution of the network output at initialization? In
[31], the author showed that neural networks with bounded width are still universal
approximators, which motivates the study of finite-width large depth neural networks.
In [32], the authors showed that the pre-activations of a particular ResNet architecture
converge weakly to a diffusion process in the infinite-depth limit. This stems from the
fact that ResNets can be seen as discretizations of SDEs (see Section 2).
In the present paper, we study the infinite-depth limit of finite-width ResNet with random
Gaussian weights (an architecture that is different from the one studied in [32]). We are
particularly interested in the asymptotic behaviour of the pre/post-activation values. Our
contributions are four-fold:
1. Unlike the infinite-width limit, we show that the resulting distribution of the pre-
activations in the infinite-depth limit is not necessarily Gaussian. In the simple case
of networks of width 1, we study two cases where we obtain known but completely
different distributions by carefully choosing the activation function.
2. For ReLU activation function, we introduce and discuss the phenomenon of network
collapse. This phenomenon occurs when the pre-activations in some hidden layer
all have non-positive values, which results in zero post-activations. This leads to a
stagnant network where increasing the depth beyond a certain level has no effect on
the network output. For any fixed width, we show that in the infinite-depth limit,
network collapse is a zero-probability event, meaning that almost surely, all post-
activations in the network are non-zero.
3. For networks with general width, where the distribution of the pre-activations is gen-
erally intractable, we focus on the norm of the post-activations with the ReLU activation
function, and show that this norm approximately follows Geometric Brownian Motion
(GBM) dynamics; we call this Quasi-GBM. We also shed light on a regime-change
phenomenon that occurs when the width $n$ increases from 3 to 4: for width $n \leq 3$,
resp. $n \geq 4$, the logarithmic growth factor of the post-activations is negative, resp. positive.
4. We study the sequential limit infinite-depth-then-infinite-width, which is the converse
of the more commonly studied infinite-width-then-infinite-depth limit, and show some
key differences between these limits. We particularly show that the pre-activations
converge to the solution of a McKean-Vlasov process, which has marginal Gaussian
distributions, and thus we recover the Gaussian behaviour in this limit. We compare
the two sequential limits and discuss some differences.
The proofs of the theoretical results are provided in the appendix and referenced after
each result. Empirical evaluations of these theoretical findings are also provided.
2. The infinite-depth limit
Hereafter, we denote the width, resp. depth, of the network by $n$, resp. $L$. We also denote
the input dimension by $d$. Let $d, n, L \geq 1$, and consider the following ResNet architecture
of width $n$ and depth $L$:
$$Y_0 = W_{in} x, \qquad x \in \mathbb{R}^d,$$
$$Y_l = Y_{l-1} + \frac{1}{\sqrt{L}}\, W_l\, \phi(Y_{l-1}), \qquad l = 1, \dots, L, \qquad (1)$$
where $\phi : \mathbb{R} \to \mathbb{R}$ is the activation function, $L \geq 1$ is the network depth, $W_{in} \in \mathbb{R}^{n \times d}$, and
$W_l \in \mathbb{R}^{n \times n}$ is the weight matrix in the $l$-th layer. We assume that the weights are randomly
initialized with iid Gaussian variables $W_l^{ij} \sim \mathcal{N}(0, \tfrac{1}{n})$ and $W_{in}^{ij} \sim \mathcal{N}(0, \tfrac{1}{d})$. For the sake of
simplicity, we only consider networks with no bias, and we omit the dependence of $Y_l$
on $n$ in the notation. While the activation function is only defined for real numbers, we
abuse notation and write $\phi(z) = (\phi(z_1), \dots, \phi(z_k))$ for any $k$-dimensional vector
$z = (z_1, \dots, z_k) \in \mathbb{R}^k$, $k \geq 1$. We refer to the vectors $\{Y_l,\ l = 0, \dots, L\}$ as the pre-activations
and to the vectors $\{\phi(Y_l),\ l = 0, \dots, L\}$ as the post-activations. Hereafter, $x \in \mathbb{R}^d$
is fixed, and we assume that $x \neq 0$.
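To make the setup concrete, the following is a minimal NumPy sketch (ours, not from the paper) of one forward pass through the architecture in Eq. (1) at initialization; the function name resnet_forward and the ReLU choice are illustrative assumptions.

```python
import numpy as np

def resnet_forward(x, n, L, phi, rng):
    """One forward pass of the ResNet in Eq. (1) at initialization.

    x   : input vector in R^d
    n   : width
    L   : depth
    phi : activation function, applied entrywise
    """
    d = x.shape[0]
    # W_in^{ij} ~ N(0, 1/d)
    W_in = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d))
    Y = W_in @ x                                  # Y_0 = W_in x
    for _ in range(L):
        # W_l^{ij} ~ N(0, 1/n), sampled independently at every layer
        W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))
        Y = Y + (W @ phi(Y)) / np.sqrt(L)         # Y_l = Y_{l-1} + (1/sqrt(L)) W_l phi(Y_{l-1})
    return Y

rng = np.random.default_rng(0)
x = np.array([1.0, -0.5, 2.0])                    # d = 3
relu = lambda z: np.maximum(z, 0.0)
print(resnet_forward(x, n=4, L=1000, phi=relu, rng=rng))
```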
The $1/\sqrt{L}$ scaling in Eq. (1) is not arbitrary. This specific scaling was shown to stabilize
the norm of $Y_l$ as well as the gradient norms in the large-depth limit (e.g. [4,20,33]). In the
next result, we show that the infinite-depth limit of Eq. (1) (in the sense of distribution)
exists and has the same distribution as the solution of a stochastic differential equation. In
the case of a single input, this has already been shown in [32]. The details are provided in
Appendix A. We also generalize this result to the case of multiple inputs and obtain similar
SDE dynamics (see Proposition 5 in the Appendix).
Proposition 1 Assume that the activation function $\phi$ is Lipschitz on $\mathbb{R}^n$. Then, in the
limit $L \to \infty$, the process $X^L_t = Y_{\lfloor tL \rfloor}$, $t \in [0, 1]$, converges in distribution to the solution of
the following SDE
$$dX_t = \frac{1}{\sqrt{n}}\, \|\phi(X_t)\|\, dB_t, \qquad X_0 = W_{in} x, \qquad (2)$$
where $(B_t)_{t \geq 0}$ is a Brownian motion (Wiener process), independent from $W_{in}$. Moreover,
we have that for any $t \in [0, 1]$ and any Lipschitz function $\Psi : \mathbb{R}^n \to \mathbb{R}$,
$$\mathbb{E}\,\Psi(Y_{\lfloor tL \rfloor}) = \mathbb{E}\,\Psi(X_t) + O(L^{-1/2}),$$
where the constant in the $O$ term does not depend on $t$.
Moreover, if the activation function $\phi$ is only locally Lipschitz, then $X^L_t$ converges locally
to $X_t$. More precisely, for any fixed $r > 0$, we consider the stopping times
$$\tau_L = \inf\{t \geq 0 : \|X^L_t\| \geq r\}, \qquad \tau = \inf\{t \geq 0 : \|X_t\| \geq r\};$$
then the stopped process $X^L_{t \wedge \tau_L}$ converges in distribution to the stopped solution $X_{t \wedge \tau}$ of the
above SDE.
The proof of Proposition 1 is provided in Appendix A.6. We use classical results on
the numerical approximation of SDEs. Proposition 1 shows that the infinite-depth limit
of the finite-width ResNet (Eq. (1)) has a similar behaviour to the solution of the SDE given
in Eq. (7). In this limit, $Y_{\lfloor tL \rfloor}$ converges in distribution to $X_t$. Hence, properties of the
solutions of Eq. (7) should theoretically be ‘shared’ by the pre-activations $Y_{\lfloor tL \rfloor}$ when the
depth is large. For the rest of the paper, we study some properties of the solutions of
Eq. (7). This requires the definition of filtered probability spaces, which we omit here. All
the technical details are provided in Appendix A. We compare the theoretical findings with
empirical results obtained by simulating the pre/post-activations of the original network
Eq. (1). We refer to $X_t$, the solution of Eq. (7), as the infinite-depth network.
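As an empirical sanity check of this correspondence (our own sketch, not taken from the paper), one can compare a Monte Carlo estimate of $\mathbb{E}\,\Psi(Y_L)$ for the network in Eq. (1) with an Euler-Maruyama discretization of the SDE in Eq. (2); the test function $\Psi(y) = \tanh(y_1)$ is an arbitrary Lipschitz choice. By Proposition 1, the two estimates should agree up to $O(L^{-1/2})$ plus Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 3
x = np.array([1.0, -0.5, 2.0])
phi = lambda z: np.maximum(z, 0.0)      # ReLU (Lipschitz)
Psi = lambda y: np.tanh(y[0])           # arbitrary Lipschitz test function

def network_sample(L):
    """One sample of Y_L from the ResNet in Eq. (1)."""
    Y = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d)) @ x
    for _ in range(L):
        W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))
        Y = Y + (W @ phi(Y)) / np.sqrt(L)
    return Y

def sde_sample(n_steps):
    """Euler-Maruyama sample of X_1 for dX_t = (1/sqrt(n)) ||phi(X_t)|| dB_t."""
    dt = 1.0 / n_steps
    X = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d)) @ x
    for _ in range(n_steps):
        dB = rng.normal(0.0, np.sqrt(dt), size=n)
        X = X + np.linalg.norm(phi(X)) / np.sqrt(n) * dB
    return X

m = 2000
print("E Psi(Y_L) ~", np.mean([Psi(network_sample(200)) for _ in range(m)]))
print("E Psi(X_1) ~", np.mean([Psi(sde_sample(500)) for _ in range(m)]))
```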
The distribution of $X_1$ (the last layer in the infinite-depth limit) is generally intractable,
unlike in the infinite-width-then-infinite-depth limit (Gaussian, [4]) or the joint infinite-depth-
and-width limit (which involves a log-normal distribution in the case of an MLP architecture,
[27]). Intuitively, one should not expect a universal behaviour (e.g. the Gaussian behaviour
in the infinite-width case) of the solution of Eq. (7), as the latter is highly sensitive to the
choice of the activation function, and different activation functions might yield completely
different distributions of $X_1$. We demonstrate this in the next section by showing that we
can recover closed-form distributions by carefully choosing the activation function. The
main ingredient is Itô's lemma. See Appendix A for more details.
3. Different behaviours depending on the activation function
In this section, we restrict our analysis to a width-1 ResNet with one-dimensional inputs,
where each layer consists of a single neuron, i.e. $d = n = 1$. In this case, the process
$(X_t)_{0 \leq t \leq 1}$ is one-dimensional and is the solution of the following SDE
$$dX_t = |\phi(X_t)|\, dB_t, \qquad X_0 = W_{in} x.$$
We can get rid of the absolute value in the equation above since the process $X_t$ has the
same distribution as $\tilde{X}_t$, the solution of the SDE $d\tilde{X}_t = \phi(\tilde{X}_t)\, dB_t$. The intuition behind
this is that the infinitesimal increment $dB_t$ is Gaussian with zero mean and variance $dt$;
it is therefore a symmetric random variable and can absorb the sign of $\phi(X_t)$. The rigorous
justification of this fact is provided in Theorem 7 in the Appendix. Hereafter in this section,
we consider the process $X$, the solution of the SDE
$$dX_t = \phi(X_t)\, dB_t, \qquad X_0 = W_{in} x.$$
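The claim that the absolute value can be dropped is easy to check numerically. The sketch below is ours; $\phi = \tanh$ and a deterministic starting point $X_0 = 0.5$ (instead of sampling $W_{in}x$) are arbitrary illustrative choices. It simulates both SDEs with the Euler-Maruyama scheme and compares a few empirical quantiles of $X_1$, which should match up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = np.tanh                 # arbitrary Lipschitz activation, for illustration
x0, n_steps, m = 0.5, 500, 20000
dt = 1.0 / n_steps

def terminal_values(use_abs):
    """m Euler-Maruyama samples of X_1 for dX = |phi(X)| dB (use_abs=True) or dX = phi(X) dB."""
    X = np.full(m, x0)
    for _ in range(n_steps):
        coeff = np.abs(phi(X)) if use_abs else phi(X)
        X = X + coeff * rng.normal(0.0, np.sqrt(dt), size=m)
    return X

qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print("with |phi| :", np.round(np.quantile(terminal_values(True), qs), 3))
print("with  phi  :", np.round(np.quantile(terminal_values(False), qs), 3))
```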
Given a function $g \in \mathcal{C}^2(\mathbb{R})$ (the space of twice-differentiable functions $g : \mathbb{R} \to \mathbb{R}$ with
continuous second derivative), we use Itô's lemma (Lemma 4 in the appendix) to derive the
dynamics of the process $g(X_t)$. We obtain
$$dg(X_t) = \underbrace{\phi(X_t)\, g'(X_t)}_{\sigma(X_t)}\, dB_t + \underbrace{\tfrac{1}{2}\, \phi(X_t)^2\, g''(X_t)}_{\mu(X_t)}\, dt. \qquad (3)$$
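As a simple illustration of how Eq. (3) is used (our own example, not necessarily one of the two cases documented in this section), take the identity activation $\phi(x) = x$ and assume $X_0 = W_{in} x > 0$. Choosing $g = \log$, so that $g'(x) = 1/x$ and $g''(x) = -1/x^2$, Eq. (3) gives
$$d\log X_t = \underbrace{X_t \cdot \tfrac{1}{X_t}}_{\sigma(X_t)}\, dB_t + \underbrace{\tfrac{1}{2}\, X_t^2 \cdot \big(-\tfrac{1}{X_t^2}\big)}_{\mu(X_t)}\, dt = dB_t - \tfrac{1}{2}\, dt,$$
so that $X_1 = X_0 \exp(B_1 - \tfrac{1}{2})$, i.e. $X_1$ is log-normal conditionally on $X_0$. A different activation changes both $\sigma$ and $\mu$ in Eq. (3) and hence the law of $X_1$, which is the mechanism behind the closed-form cases discussed in this section.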