DYNAMICAL ISOMETRY FOR RESIDUAL NETWORKS
Advait Gadhikar
CISPA Helmholtz Center for Information Security
Saarbrücken 66123, Germany
advait.gadhikar@cispa.de
Rebekka Burkholz
CISPA Helmholtz Center for Information Security
Saarbrücken 66123, Germany
burkholz@cispa.de
ABSTRACT
The training success, training speed, and generalization ability of neural networks rely crucially on the choice of random parameter initialization. It has been shown for multiple architectures that initial dynamical isometry is particularly advantageous. Known initialization schemes for residual blocks, however, miss this property and suffer from degrading separability of different inputs for increasing depth and instability without Batch Normalization, or lack feature diversity. We propose a random initialization scheme, RISOTTO, that achieves perfect dynamical isometry for residual networks with ReLU activation functions even for finite depth and width. It balances the contributions of the residual and skip branches unlike other schemes, which initially bias towards the skip connections. In experiments, we demonstrate that in most cases our approach outperforms initialization schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit, and facilitates stable training. Also in combination with Batch Normalization, we find that RISOTTO often achieves the overall best result.
1 INTRODUCTION
The random initialization of weights in a neural network plays a crucial role in determining the final performance of the network. This effect becomes even more pronounced for very deep models that seem to be able to solve many complex tasks more effectively. An important building block of many models is the residual block (He et al., 2016), in which skip connections between non-consecutive layers are added to ease signal propagation (Balduzzi et al., 2017) and allow for faster training. ResNets, which consist of multiple residual blocks, have since become a popular centerpiece of many deep learning applications (Bello et al., 2021).
Batch Normalization (BN) (Ioffe & Szegedy, 2015) is a key ingredient to train ResNets on large
datasets. It allows training with larger learning rates, often improves generalization, and makes
the training success robust to different choices of parameter initializations. It has furthermore been
shown to smoothen the loss landscape (Santurkar et al., 2018) and to improve signal propagation
(De & Smith, 2020). However, BN also has several drawbacks: it breaks the independence of samples in a minibatch and adds considerable computational costs. Sufficiently large batch sizes to compute robust statistics can be infeasible if the input data requires a lot of memory. Moreover, BN also prevents adversarial training (Wang et al., 2022). For that reason, finding alternatives to BN remains an active area of research (Zhang et al., 2018; Brock et al., 2021b). A combination of Scaled Weight Standardization and gradient clipping has recently outperformed BN (Brock et al., 2021b). However, a random parameter initialization scheme that can achieve all the benefits of BN is still an open problem. An initialization scheme gives deep learning systems the flexibility to drop into existing setups without modifying pipelines. For that reason, it is still necessary to develop initialization schemes that enable learning very deep neural network models without normalization or standardization methods.
A direction of research pioneered by Saxe et al. (2013) and Pennington et al. (2017) has analyzed the
signal propagation through randomly parameterized neural networks in the infinite width limit using
random matrix theory. They have argued that parameter initialization approaches that have the
dynamical isometry (DI) property avoid exploding or vanishing gradients, as the singular values of
the input-output Jacobian are close to unity. DI is key to stable and fast training (Du et al., 2019;
Hu et al., 2020). While Pennington et al. (2017) showed that it is not possible to achieve DI in
networks with ReLU activations with independent weights or orthogonal weight matrices, Burkholz
& Dubatovka (2019); Balduzzi et al. (2017) derived a way to attain perfect DI even in finite ReLU
networks by parameter sharing. This approach can also be combined (Blumenfeld et al., 2020;
Balduzzi et al., 2017) with orthogonal initialization schemes for convolutional layers (Xiao et al.,
2018). The main idea is to design a random initial network that represents a linear isometric map.
We transfer a similar idea to ResNets but have to overcome the additional challenge of integrating
residual connections and, in particular, potentially non-trainable identity mappings while balancing
skip and residual connections and creating initial feature diversity. We propose an initialization
scheme, RISOTTO (Residual dynamical isometry by initial orthogonality), that achieves dynamical
isometry (DI) for ResNets (He et al., 2016) with convolutional or fully-connected layers and ReLU
activation functions exactly. RISOTTO achieves this for networks of finite width and finite depth and
not only in expectation but exactly. We provide theoretical and empirical evidence that highlight the
advantages of our approach. In contrast to other initialization schemes that aim to improve signal
propagation in ResNets, RISOTTO can achieve performance gains even in combination with BN. We
further demonstrate that RISOTTO can successfully train ResNets without BN and achieve the same
or better performance than Zhang et al. (2018); Brock et al. (2021b).
1.1 CONTRIBUTIONS
- To explain the drawbacks of most initialization schemes for residual blocks, we derive signal propagation results for finite networks without requiring mean field approximations and highlight input separability issues for large depths.
- We propose a solution, RISOTTO, which is an initialization scheme for residual blocks that provably achieves dynamical isometry (exactly for finite networks and not only approximately). A residual block is initialized so that it acts as an orthogonal, norm- and distance-preserving transform.
- In experiments on multiple standard benchmark datasets, we demonstrate that our approach achieves competitive results in comparison with alternatives:
  - We show that RISOTTO facilitates training ResNets without BN or any other normalization method and often outperforms existing BN-free methods including Fixup, SkipInit, and NF ResNets.
  - It outperforms standard initialization schemes for ResNets with BN on Tiny ImageNet and CIFAR100.
1.2 RELATED WORK
Preserving Signal Propagation Random initialization schemes have been designed for a multitude
of neural network architectures and activation functions. Early work has focused on the layerwise
preservation of average squared signal norms (Glorot & Bengio, 2010; He et al., 2015; Hanin, 2018)
and their variance (Hanin & Rolnick, 2018). The mean field theory of infinitely wide networks
has also integrated signal covariances into the analysis and further generated practical insights into
good choices that avoid exploding or vanishing gradients and enable feature learning (Yang & Hu,
2021) if the parameters are drawn independently (Poole et al., 2016; Raghu et al., 2017; Schoenholz
et al., 2017; Yang & Schoenholz, 2017; Xiao et al., 2018). Indirectly, these works demand that
the average eigenvalue of the signal input-output Jacobian is steered towards 1. Yet, in this set-up,
ReLU activation functions fail to support parameter choices that lead to good trainability of very
deep networks, as outputs corresponding to different inputs become more similar for increasing
depth (Poole et al., 2016; Burkholz & Dubatovka, 2019). Yang & Schoenholz (2017) could show that ResNets can mitigate this effect and enable the training of deeper networks, but eventually they also fail to distinguish different inputs.
However, there are exceptions. Balanced networks (Li et al., 2021) can improve interlayer correlations and reduce the variance of the output. A more effective option is to remove the contribution of the residual part entirely, as proposed in successful ResNet initialization schemes like Fixup (Zhang et al., 2018) and SkipInit (De & Smith, 2020). This, however, significantly limits the initial feature diversity that is usually crucial for training success (Blumenfeld et al., 2020). A way to address the issue for other architectures with ReLUs, like fully-connected (Burkholz & Dubatovka, 2019) and convolutional (Balduzzi et al., 2017) layers, is a looks-linear weight matrix structure (Shang et al., 2016). This idea has not been transferred to residual blocks yet but has the advantage that it can be combined with orthogonal submatrices. These matrices induce perfect dynamical isometry (Saxe et al., 2013; Mishkin & Matas, 2015; Poole et al., 2016; Pennington et al., 2017), meaning that the eigenvalues of the initial input-output Jacobian are identical to 1 or −1 and not just close to unity on average. This property has been shown to enable the training of very deep neural networks (Xiao et al., 2018) and can improve their generalization ability (Hayase & Karakida, 2021) and training speed (Pennington et al., 2017; 2018). ResNets equipped with ReLUs can currently only achieve this property approximately and without a practical initialization scheme (Tarnowski et al., 2019), or with reduced feature diversity (Blumenfeld et al., 2020) and potential training instabilities (Zhang et al., 2018; De & Smith, 2020).
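To illustrate the looks-linear idea mentioned above, the following NumPy sketch (our own minimal example, not the construction used by the cited works or by RISOTTO) builds a two-layer mirrored ReLU block from two orthogonal matrices and verifies that it acts as an exact linear isometry despite the nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64

# Two random orthogonal matrices obtained via QR decompositions.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))

def relu(a):
    return np.maximum(a, 0.0)

def looks_linear_block(x):
    # First layer duplicates the pre-activations with opposite signs: [U; -U] x.
    h = relu(np.concatenate([U @ x, -U @ x]))
    # Second layer recombines with mirrored weights [V, -V]; since
    # relu(a) - relu(-a) = a, the block computes V @ (U @ x) exactly.
    return V @ h[:n] - V @ h[n:]

x = rng.standard_normal(n)
y = looks_linear_block(x)
print(np.allclose(y, V @ (U @ x)))                        # True: the map is linear
print(np.isclose(np.linalg.norm(y), np.linalg.norm(x)))   # True: norm is preserved
```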
ResNet Initialization Approaches Fixup (Zhang et al., 2018), SkipInit (De & Smith, 2020), and ReZero (Bachlechner et al., 2021) have been designed to enable training without requiring BN, yet they usually cannot achieve equal performance. Training-data-informed approaches have also been successful (Zhu et al., 2021; Dauphin & Schoenholz, 2019), but they require computing the gradient of the input minibatches. Yet, most methods only work well in combination with BN (Ioffe & Szegedy, 2015), as it seems to improve ill-conditioned initializations (Glorot & Bengio, 2010; He et al., 2016) according to Bjorck et al. (2018), allows training with larger learning rates (Santurkar et al., 2018), and might initially bias the residual block towards the identity, enabling signal to flow through (De & Smith, 2020). The additional computational and memory costs of BN, however, have motivated research on alternatives, including different normalization methods (Wu & He, 2018; Salimans & Kingma, 2016; Ulyanov et al., 2016). Only recently has it been possible to outperform BN in generalization performance using scaled weight standardization and gradient clipping (Brock et al., 2021b;a), but this requires careful hyperparameter tuning. In experiments, we compare our initialization proposal RISOTTO with all three types of approaches: normalization-free methods, BN, and normalization alternatives (e.g., NF ResNets).
2 RESNET INITIALIZATION
2.1 BACKGROUND AND NOTATION
The object of our study is a general residual network that is defined by
$$\mathbf{z}^0 := W^0 * \mathbf{x}, \qquad \mathbf{x}^l = \phi(\mathbf{z}^{l-1}), \qquad \mathbf{z}^l := \alpha_l f^l(\mathbf{x}^l) + \beta_l h^l(\mathbf{x}^l); \qquad \mathbf{z}^{\mathrm{out}} := W^{\mathrm{out}} P(\mathbf{x}^L) \qquad (1)$$
for $1 \le l \le L$. $P(\cdot)$ denotes an optional pooling operation like maxpool or average pool, $f(\cdot)$ the residual connections, and $h(\cdot)$ the skip connections, which usually represent an identity mapping or a projection. For simplicity, we assume in our derivations and arguments that these functions are parameterized as $f^l(\mathbf{x}^l) = W^l_2 * \phi(W^l_1 * \mathbf{x}^l + \mathbf{b}^l_1) + \mathbf{b}^l_2$ and $h^l(\mathbf{x}^l) = W^l_{\mathrm{skip}} * \mathbf{x}^l + \mathbf{b}^l_{\mathrm{skip}}$ ($*$ denotes convolution), but our arguments also transfer to residual blocks in which more than one layer is skipped. Optionally, batch normalization (BN) layers are placed before or after the nonlinear activation function $\phi(\cdot)$. We focus on ReLUs $\phi(x) = \max\{0, x\}$ (Krizhevsky et al., 2012), which are among the most commonly used activation functions in practice. All biases $\mathbf{b}^l_2 \in \mathbb{R}^{N_{l+1}}$, $\mathbf{b}^l_1 \in \mathbb{R}^{N_{ml}}$, and $\mathbf{b}^l_{\mathrm{skip}} \in \mathbb{R}^{N_l}$ are assumed to be trainable and set initially to zero. We ignore them in the following, since we are primarily interested in the neuron states and signal propagation at initialization. The parameters $\alpha$ and $\beta$ balance the contribution of the residual and the skip branch, respectively. Note that $\alpha$ is a trainable parameter, while $\beta$ is just mentioned for convenience. Both parameters could also be integrated into the weight parameters $W^l_2 \in \mathbb{R}^{N_{l+1} \times N_{ml} \times k^l_{2,1} \times k^l_{2,2}}$, $W^l_1 \in \mathbb{R}^{N_{ml} \times N_l \times k^l_{1,1} \times k^l_{1,2}}$, and $W^l_{\mathrm{skip}} \in \mathbb{R}^{N_{l+1} \times N_l \times 1 \times 1}$, but keeping them explicit makes the discussion of different initialization schemes more convenient and simplifies the comparison with standard He initialization approaches (He et al., 2015).
Residual Blocks Following the definition by He et al. (2016), we distinguish two types of residual blocks, Type B and Type C (see Figure 1a), which differ in the choice of $W^l_{\mathrm{skip}}$. The Type C residual block is defined as $\mathbf{z}^l = \alpha f^l(\mathbf{x}^l) + h^l(\mathbf{x}^l)$, so that the shortcuts $h(\cdot)$ are projections with a $1 \times 1$ kernel and trainable parameters. The Type B residual block has identity skip connections, $\mathbf{z}^l = \alpha f^l(\mathbf{x}^l) + \mathbf{x}^l$. Thus, $W^l_{\mathrm{skip}}$ represents the identity and is not trainable.
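To make the notation concrete, here is a minimal PyTorch-style sketch of a residual block following Eq. (1) with the Type B (identity) and Type C ($1\times 1$ projection) skip variants; the class and argument names are our own illustrative choices, not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """z^l = alpha * f(x) + beta * h(x) with f(x) = W2 * relu(W1 * x), cf. Eq. (1)."""

    def __init__(self, in_ch, mid_ch, out_ch, block_type="C", alpha=1.0, beta=1.0):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)   # W1
        self.conv2 = nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1)  # W2
        self.alpha = nn.Parameter(torch.tensor(float(alpha)))  # trainable, as in the text
        self.beta = float(beta)                                 # bookkeeping scalar only
        if block_type == "C":
            # Type C: trainable 1x1 projection W_skip on the shortcut.
            self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        else:
            # Type B: identity shortcut; requires in_ch == out_ch, not trainable.
            self.skip = nn.Identity()

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))                # f^l(x^l)
        return self.alpha * residual + self.beta * self.skip(x)     # z^l
```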
2.2 SIGNAL PROPAGATION FOR NORMAL RESNET INITIALIZATION
Most initialization methods for ResNets draw weight entries independently at random, including
Fixup and SkipInit. To simplify the theoretical analysis of the induced random networks and to
highlight the shortcomings of the independence assumption, we assume:
Definition 2.1 (Normally Distributed ResNet Parameters). All biases are initialized as zero and all weight matrix entries are independently normally distributed with $w^l_{ij,2} \sim \mathcal{N}(0, \sigma^2_{l,2})$, $w^l_{ij,1} \sim \mathcal{N}(0, \sigma^2_{l,1})$, and $w^l_{ij,\mathrm{skip}} \sim \mathcal{N}(0, \sigma^2_{l,\mathrm{skip}})$.
Most studies further focus on special cases of the following set of parameter choices.
Definition 2.2 (Normal ResNet Initialization). The choice $\sigma_{l,1} = \sqrt{2/(N_{ml}\, k^l_{1,1} k^l_{1,2})}$, $\sigma_{l,2} = \sqrt{2/(N_{l+1}\, k^l_{2,1} k^l_{2,2})}$, and $\sigma_{l,\mathrm{skip}} = \sqrt{2/N_{l+1}}$ as used in Definition 2.1, and $\alpha_l, \beta_l \ge 0$ that fulfill $\alpha^2_l + \beta^2_l = 1$.
Another common choice is $W_{\mathrm{skip}} = I$ instead of random entries. If $\beta_l = 1$, sometimes also $\alpha_l \neq 0$ is still common if it accounts for the depth $L$ of the network. In case $\alpha_l$ and $\beta_l$ are the same for each layer, we drop the subscript $l$. For instance, Fixup (Zhang et al., 2018) and SkipInit (De & Smith, 2020) satisfy the above condition with $\alpha = 0$ and $\beta = 1$. De & Smith (2020) argue that BN also suppresses the residual branch effectively. However, in combination with He initialization (He et al., 2015) it becomes more similar to $\alpha = \beta = \sqrt{0.5}$. Li et al. (2021) study the case of free $\alpha_l$ but focus their analysis on identity mappings $W^l_1 = I$ and $W^l_{\mathrm{skip}} = I$.
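As a concrete reference, the following sketch shows how the Normal ResNet Initialization of Definition 2.2 could be instantiated for the convolutional weights of one block, assuming the ResidualBlock layout sketched above; the default $\alpha = \sqrt{0.5}$ is only an illustrative choice, and Fixup and SkipInit would correspond to $\alpha = 0$, $\beta = 1$.

```python
import math
import torch
import torch.nn as nn

def normal_resnet_init(block, alpha=math.sqrt(0.5)):
    """Sketch of the Normal ResNet Initialization (Definition 2.2) for one block.

    Assumes a block laid out like the ResidualBlock sketch above, with
    convolutions conv1 (W1), conv2 (W2) and a 1x1 projection skip (W_skip).
    """
    beta = math.sqrt(1.0 - alpha ** 2)      # enforce alpha^2 + beta^2 = 1

    def init_conv(conv):
        # sigma^2 = 2 / (out_channels * k_1 * k_2), as in Definition 2.2,
        # where the relevant width is the number of output channels.
        k1, k2 = conv.kernel_size
        sigma = math.sqrt(2.0 / (conv.out_channels * k1 * k2))
        nn.init.normal_(conv.weight, mean=0.0, std=sigma)
        if conv.bias is not None:
            nn.init.zeros_(conv.bias)       # biases start at zero

    for conv in (block.conv1, block.conv2, block.skip):
        if isinstance(conv, nn.Conv2d):     # a Type B identity skip has no weights
            init_conv(conv)

    with torch.no_grad():
        block.alpha.fill_(alpha)            # Fixup/SkipInit instead use alpha = 0
    block.beta = beta
    return block
```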
Like other theoretical work, we focus our following investigations on fully-connected layers to simplify the exposition. Similar insights would transfer to convolutional layers but would require extra effort (Yang & Schoenholz, 2017). The motivation for the general choice in Definition 2.2 is that it ensures that the average squared $l_2$-norm of the neuron states is identical in every layer. This has been shown by Li et al. (2021) for the special choice $W^l_1 = I$ and $W^l_{\mathrm{skip}} = I$, $\beta = 1$, and by Yang & Schoenholz (2017) in the mean field limit with a missing ReLU so that $\mathbf{x}^l = \mathbf{z}^{l-1}$. Hanin & Rolnick (2018) have also observed for $W^l_{\mathrm{skip}} = I$ and $\beta = 1$ that the squared signal norm increases in $\sum_l \alpha_l$. For completeness, we present the most general case next and prove it in the appendix.
Figure 1: (a) The two types of considered residual blocks. In Type C the skip connection is a projection with a 1×1 kernel, while in Type B the input is directly added to the residual branch via the skip connection. Both these blocks have been described by He et al. (2016). (b) The correlation between two inputs for different initializations as they pass through a residual network consisting of a convolution filter followed by 5 residual blocks (Type C), an average pool, and a linear layer on CIFAR10. Only RISOTTO maintains constant correlations after each residual block, while the correlation increases with depth for the other initializations. (c) Performance of RISOTTO for different values of alpha (α) for ResNet 18 (C) on CIFAR10. Note that α = 0 is equivalent to SkipInit and achieves the lowest accuracy. Initializing α = 1 clearly improves performance.
Theorem 2.3 (Norm preservation). Let a neural network consist of fully-connected residual blocks as defined by Eq. (1) that start with a fully-connected layer $W^0$ at the beginning, which contains $N_1$ output channels. Assume that all biases are initialized as $0$ and that all weight matrix entries are independently normally distributed with $w^l_{ij,2} \sim \mathcal{N}(0, \sigma^2_{l,2})$, $w^l_{ij,1} \sim \mathcal{N}(0, \sigma^2_{l,1})$, and $w^l_{ij,\mathrm{skip}} \sim \mathcal{N}(0, \sigma^2_{l,\mathrm{skip}})$. Then the expected squared norm of the output after one fully-connected layer and $L$ residual blocks applied to input $\mathbf{x}$ is given by
$$\mathbb{E}\left[\left\| \mathbf{x}^L \right\|^2\right] = \frac{N_1}{2}\, \sigma^2_0 \prod_{l=1}^{L-1} \frac{N_{l+1}}{2} \left( \alpha^2_l\, \sigma^2_{l,2}\, \sigma^2_{l,1}\, \frac{N_{ml}}{2} + \beta^2_l\, \sigma^2_{l,\mathrm{skip}} \right) \left\| \mathbf{x} \right\|^2.$$
Note that this result does not rely on any (mean field) approximations and applies also to other parameter distributions that have zero mean and are symmetric around zero. Inserting the parameters of Definition 2.2 for fully-connected networks with $k = 1$ leads to the following insight, which explains why this is the preferred initialization choice.
Insight 2.4 (Norm preserving initialization). According to Theorem 2.3, the normal ResNet initialization (Definition 2.2) preserves the average squared signal norm for arbitrary depth $L$.
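The following NumPy sketch (our own illustration for the fully-connected case with $k = 1$) checks Insight 2.4 numerically: with the standard deviations of Definition 2.2 and $\alpha^2 + \beta^2 = 1$, the average squared norm of $\mathbf{x}^L$ stays close to $\|\mathbf{x}\|^2$ regardless of the number of blocks.

```python
import numpy as np

def mean_squared_norm(L=20, n=256, trials=100, alpha=0.5 ** 0.5, seed=0):
    """Monte Carlo check of Theorem 2.3 / Insight 2.4, fully-connected case (k = 1)."""
    beta = (1.0 - alpha ** 2) ** 0.5
    rng = np.random.default_rng(seed)
    relu = lambda a: np.maximum(a, 0.0)
    x_in = rng.standard_normal(n)
    x_in /= np.linalg.norm(x_in)                       # ||x||^2 = 1
    sq_norms = []
    for _ in range(trials):
        # initial fully-connected layer with sigma_0^2 = 2 / N_1
        x = relu(rng.normal(0.0, np.sqrt(2.0 / n), (n, n)) @ x_in)
        for _ in range(L):                             # residual blocks, Definition 2.2
            W1 = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))
            W2 = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))
            Wskip = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))
            x = relu(alpha * W2 @ relu(W1 @ x) + beta * Wskip @ x)
        sq_norms.append(np.sum(x ** 2))
    return float(np.mean(sq_norms))

print(mean_squared_norm())   # stays close to ||x||^2 = 1, independent of L
```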
Even though this initialization setting is able to avoid exploding or vanishing signals, it still induces
considerable issues, as the analysis of the joint signal corresponding to different inputs reveals.
According to the next theorem, the signal covariance fulfills a layerwise recurrence relationship that
leads to the observation that signals become more similar with increasing depth.
Theorem 2.5 (Layerwise signal covariance). Let a fully-connected residual block be given as defined by Eq. (1) with random parameters according to Definition 2.2. Let $\mathbf{x}^{l+1}$ denote the neuron states of layer $l+1$ for input $\mathbf{x}$ and $\tilde{\mathbf{x}}^{l+1}$ the same neurons but for input $\tilde{\mathbf{x}}$. Then their covariance given all parameters of the previous layers satisfies
$$\mathbb{E}_l\!\left[\langle \mathbf{x}^{l+1}, \tilde{\mathbf{x}}^{l+1} \rangle\right] \ge \frac{1}{4}\, \frac{N_{l+1}}{2} \left( \alpha^2 \sigma^2_{l,2}\, \sigma^2_{l,1}\, \frac{N_{ml}}{2} + 2\beta^2 \sigma^2_{l,\mathrm{skip}} \right) \langle \mathbf{x}^l, \tilde{\mathbf{x}}^l \rangle + \frac{c}{4}\, \alpha^2 N_{l+1}\, \sigma^2_{l,2}\, \sigma^2_{l,1}\, N_{ml}\, \|\mathbf{x}^l\| \, \|\tilde{\mathbf{x}}^l\| \qquad (2)$$
$$\quad + \mathbb{E}_{W^l_1}\!\left[\sqrt{\left( \alpha^2 \sigma^2_{l,2} \left\|\phi(W^l_1 \mathbf{x}^l)\right\|^2 + \beta^2 \sigma^2_{l,\mathrm{skip}} \|\mathbf{x}^l\|^2 \right)\left( \alpha^2 \sigma^2_{l,2} \left\|\phi(W^l_1 \tilde{\mathbf{x}}^l)\right\|^2 + \beta^2 \sigma^2_{l,\mathrm{skip}} \|\tilde{\mathbf{x}}^l\|^2 \right)}\right],$$
where the expectation $\mathbb{E}_l$ is taken with respect to the initial parameters $W^l_2$, $W^l_1$, and $W^l_{\mathrm{skip}}$, and the constant $c$ fulfills $0.24 \le c \le 0.25$.
Note that this statement holds even for finite networks. To clarify what that means for the separability
of inputs, we have to compute the expectation with respect to the parameters of $W^l_1$. To gain an
intuition, we employ an approximation that holds for a wide intermediary network.
Insight 2.6 (Covariance of signal for different inputs increases with depth). Let a fully-connected ResNet with random parameters as in Definition 2.2 be given. It follows from Theorem 2.5 that the outputs corresponding to different inputs become more difficult to distinguish for increasing depth $L$. For simplicity, let us assume that $\|\mathbf{x}\| = \|\tilde{\mathbf{x}}\| = 1$. Then, in the mean field limit $N_{ml} \to \infty$, the covariance of the signals is lower bounded by
$$\mathbb{E}\left[\langle \mathbf{x}^L, \tilde{\mathbf{x}}^L \rangle\right] \ge \gamma_1^L \langle \mathbf{x}, \tilde{\mathbf{x}} \rangle + \gamma_2 \sum_{k=0}^{L-1} \gamma_1^k = \gamma_1^L \langle \mathbf{x}, \tilde{\mathbf{x}} \rangle + \gamma_2\, \frac{1 - \gamma_1^L}{1 - \gamma_1} \qquad (3)$$
for $\gamma_1 = \frac{1+\beta^2}{4} \le \frac{1}{2}$ and $\gamma_2 = c\,(\alpha^2 + 2) \approx \frac{\alpha^2}{4} + \frac{1}{2}$, using $\mathbb{E}_{l-1}\left[\|\mathbf{x}^l\|\,\|\tilde{\mathbf{x}}^l\|\right] \approx 1$.
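For the reader's convenience, here is a short intermediate computation (ours, for the fully-connected case with $k = 1$) showing how the standard deviations of Definition 2.2 turn the coefficient of $\langle \mathbf{x}^l, \tilde{\mathbf{x}}^l \rangle$ in Eq. (2) into the constant $\gamma_1$:
$$\gamma_1 = \frac{1}{4} \cdot \frac{N_{l+1}}{2}\left( \alpha^2 \cdot \frac{2}{N_{l+1}} \cdot \frac{2}{N_{ml}} \cdot \frac{N_{ml}}{2} + 2\beta^2 \cdot \frac{2}{N_{l+1}} \right) = \frac{\alpha^2 + 2\beta^2}{4} = \frac{1 + \beta^2}{4},$$
where the last step uses $\alpha^2 + \beta^2 = 1$.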
Since $\gamma_1 < 1$, the contribution of the original input correlations $\langle \mathbf{x}, \tilde{\mathbf{x}} \rangle$ vanishes for increasing depth $L$. Meanwhile, by adding a constant contribution in every layer, irrespective of the input correlations, $\mathbb{E}\left[\langle \mathbf{x}^L, \tilde{\mathbf{x}}^L \rangle\right]$ increases with $L$ and converges to the maximum value $1$ (or a slightly smaller value in case of smaller width $N_{ml}$); indeed, the lower bound in Eq. (3) converges to $\gamma_2/(1-\gamma_1) = 4c \in [0.96, 1]$. Thus, deep models essentially map every input to almost the same output vector, which makes it impossible for the initial network to distinguish different inputs and provide information for meaningful gradients. Fig. 1b demonstrates this trend and compares it with our initialization proposal RISOTTO, which does not suffer from this problem.
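A small NumPy simulation (our own fully-connected analogue of the setting in Fig. 1b) illustrates Insight 2.6: under the normal ResNet initialization of Definition 2.2, the cosine similarity between the representations of two initially almost orthogonal inputs grows towards 1 with depth.

```python
import numpy as np

def correlation_through_depth(L=30, n=512, alpha=0.5 ** 0.5, seed=0):
    """Cosine similarity of two inputs after each block (Definition 2.2 init)."""
    beta = (1.0 - alpha ** 2) ** 0.5
    rng = np.random.default_rng(seed)
    relu = lambda a: np.maximum(a, 0.0)

    x, xt = rng.standard_normal(n), rng.standard_normal(n)   # nearly orthogonal inputs
    x, xt = x / np.linalg.norm(x), xt / np.linalg.norm(xt)

    W0 = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))            # first layer
    x, xt = relu(W0 @ x), relu(W0 @ xt)

    cosines = []
    for _ in range(L):
        W1 = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))
        W2 = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))
        Ws = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))
        x = relu(alpha * W2 @ relu(W1 @ x) + beta * Ws @ x)
        xt = relu(alpha * W2 @ relu(W1 @ xt) + beta * Ws @ xt)
        cosines.append(float(x @ xt / (np.linalg.norm(x) * np.linalg.norm(xt))))
    return cosines   # creeps towards 1 as depth grows

print(correlation_through_depth()[::5])
```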
While the general trend holds for residual as well as standard fully-connected feed-forward networks ($\beta = 0$), interestingly, we still note a mitigation for a strong skip branch ($\beta = 1$). The contribution by the input correlations decreases more slowly and the constant contribution is reduced for larger $\beta$. Thus, residual networks make the training of deeper models feasible, as they were designed to do (He et al., 2016). This observation is in line with the findings of Yang & Schoenholz