DYNAMICAL ISOMETRY FOR RESIDUAL NETWORKS
Advait Gadhikar
CISPA Helmholtz Center for Information Security
Saarbrücken 66123, Germany
advait.gadhikar@cispa.de
Rebekka Burkholz
CISPA Helmholtz Center for Information Security
Saarbrücken 66123, Germany
burkholz@cispa.de
ABSTRACT
The training success, training speed, and generalization ability of neural networks rely crucially on the choice of random parameter initialization. It has been shown for multiple architectures that initial dynamical isometry is particularly advantageous. Known initialization schemes for residual blocks, however, miss this property and suffer from degrading separability of different inputs for increasing depth and instability without Batch Normalization, or lack feature diversity. We propose a random initialization scheme, RISOTTO, that achieves perfect dynamical isometry for residual networks with ReLU activation functions even for finite depth and width. It balances the contributions of the residual and skip branches unlike other schemes, which initially bias towards the skip connections. In experiments, we demonstrate that in most cases our approach outperforms initialization schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit, and facilitates stable training. Also in combination with Batch Normalization, we find that RISOTTO often achieves the overall best result.
1 INTRODUCTION
The random initialization of weights in a neural network plays a crucial role in determining the final performance of the network. This effect becomes even more pronounced for very deep models that seem to be able to solve many complex tasks more effectively. An important building block of many models is the residual block (He et al., 2016), in which skip connections between non-consecutive layers are added to ease signal propagation (Balduzzi et al., 2017) and allow for faster training. ResNets, which consist of multiple residual blocks, have since become a popular centerpiece of many deep learning applications (Bello et al., 2021).
Batch Normalization (BN) (Ioffe & Szegedy, 2015) is a key ingredient to train ResNets on large
datasets. It allows training with larger learning rates, often improves generalization, and makes
the training success robust to different choices of parameter initializations. It has furthermore been
shown to smoothen the loss landscape (Santurkar et al., 2018) and to improve signal propagation
(De & Smith, 2020). However, BN also has several drawbacks: it breaks the independence of samples in a minibatch and adds considerable computational costs. Sufficiently large batch sizes to compute robust statistics can be infeasible if the input data requires a lot of memory. Moreover, BN also prevents adversarial training (Wang et al., 2022). For that reason, finding alternatives to BN remains an active area of research (Zhang et al., 2018; Brock et al., 2021b). A combination of Scaled Weight Standardization and gradient clipping has recently outperformed BN (Brock et al., 2021b). However, a random parameter initialization scheme that can achieve all the benefits of BN is still an open problem. An initialization scheme gives deep learning systems the flexibility to drop into existing setups without modifying pipelines. For that reason, it is still necessary to develop initialization schemes that enable learning very deep neural network models without normalization or standardization methods.
A direction of research pioneered by Saxe et al. (2013) and Pennington et al. (2017) has analyzed the
signal propagation through randomly parameterized neural networks in the infinite width limit using
random matrix theory. They have argued that parameter initialization approaches that have the
dynamical isometry (DI) property avoid exploding or vanishing gradients, as the singular values of
the input-output Jacobian are close to unity. DI is key to stable and fast training (Du et al., 2019;
Hu et al., 2020). While Pennington et al. (2017) showed that it is not possible to achieve DI in
networks with ReLU activations with independent weights or orthogonal weight matrices, Burkholz
& Dubatovka (2019); Balduzzi et al. (2017) derived a way to attain perfect DI even in finite ReLU
networks by parameter sharing. This approach can also be combined (Blumenfeld et al., 2020;
Balduzzi et al., 2017) with orthogonal initialization schemes for convolutional layers (Xiao et al.,
2018). The main idea is to design a random initial network that represents a linear isometric map.
We transfer a similar idea to ResNets but have to overcome the additional challenge of integrating
residual connections and, in particular, potentially non-trainable identity mappings while balancing
skip and residual connections and creating initial feature diversity. We propose an initialization
scheme, RISOTTO (Residual dynamical isometry by initial orthogonality), that achieves dynamical
isometry (DI) for ResNets (He et al., 2016) with convolutional or fully-connected layers and ReLU
activation functions exactly. RISOTTO achieves this for networks of finite width and finite depth and
not only in expectation but exactly. We provide theoretical and empirical evidence that highlight the
advantages of our approach. In contrast to other initialization schemes that aim to improve signal
propagation in ResNets, RISOTTO can achieve performance gains even in combination with BN. We
further demonstrate that RISOTTO can successfully train ResNets without BN and achieve the same
or better performance than Zhang et al. (2018); Brock et al. (2021b).
1.1 CONTRIBUTIONS
- To explain the drawbacks of most initialization schemes for residual blocks, we derive signal propagation results for finite networks without requiring mean field approximations and highlight input separability issues for large depths.
- We propose a solution, RISOTTO, which is an initialization scheme for residual blocks that provably achieves dynamical isometry (exactly for finite networks and not only approximately). A residual block is initialized so that it acts as an orthogonal, norm- and distance-preserving transform.
- In experiments on multiple standard benchmark datasets, we demonstrate that our approach achieves competitive results in comparison with alternatives:
  - We show that RISOTTO facilitates training ResNets without BN or any other normalization method and often outperforms existing BN-free methods including Fixup, SkipInit, and NF ResNets.
  - It outperforms standard initialization schemes for ResNets with BN on Tiny ImageNet and CIFAR100.
1.2 RELATED WORK
Preserving Signal Propagation Random initialization schemes have been designed for a multitude
of neural network architectures and activation functions. Early work has focused on the layerwise
preservation of average squared signal norms (Glorot & Bengio, 2010; He et al., 2015; Hanin, 2018)
and their variance (Hanin & Rolnick, 2018). The mean field theory of infinitely wide networks
has also integrated signal covariances into the analysis and further generated practical insights into
good choices that avoid exploding or vanishing gradients and enable feature learning (Yang & Hu,
2021) if the parameters are drawn independently (Poole et al., 2016; Raghu et al., 2017; Schoenholz
et al., 2017; Yang & Schoenholz, 2017; Xiao et al., 2018). Indirectly, these works demand that
the average eigenvalue of the signal input-output Jacobian is steered towards 1. Yet, in this set-up,
ReLU activation functions fail to support parameter choices that lead to good trainability of very
deep networks, as outputs corresponding to different inputs become more similar for increasing
depth (Poole et al., 2016; Burkholz & Dubatovka, 2019). Yang & Schoenholz (2017) could show that ResNets can mitigate this effect and enable the training of deeper networks, but eventually they also fail to distinguish different inputs.
However, there are exceptions. Balanced networks (Li et al., 2021) can improve interlayer correlations and reduce the variance of the output. A more effective option is to remove the contribution of the residual part entirely, as proposed in successful ResNet initialization schemes like Fixup (Zhang et al., 2018) and SkipInit (De & Smith, 2020). This, however, significantly limits the initial feature diversity that is usually crucial for training success (Blumenfeld et al., 2020). A way to address the issue for other architectures with ReLUs, like fully-connected (Burkholz & Dubatovka, 2019) and convolutional (Balduzzi et al., 2017) layers, is a looks-linear weight matrix structure (Shang et al., 2016). This idea has not been transferred to residual blocks yet but has the advantage that it can be combined with orthogonal submatrices. These matrices induce perfect dynamical isometry (Saxe et al., 2013; Mishkin & Matas, 2015; Poole et al., 2016; Pennington et al., 2017), meaning that the eigenvalues of the initial input-output Jacobian are identical to 1 or −1 and not just close to unity on average. This property has been shown to enable the training of very deep neural networks (Xiao et al., 2018) and can improve their generalization ability (Hayase & Karakida, 2021) and training speed (Pennington et al., 2017; 2018). ResNets equipped with ReLUs can currently only achieve this property approximately and without a practical initialization scheme (Tarnowski et al., 2019), or with reduced feature diversity (Blumenfeld et al., 2020) and potential training instabilities (Zhang et al., 2018; De & Smith, 2020).
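To illustrate the looks-linear idea mentioned above, the following NumPy sketch (our own minimal example, not the construction used by the cited works or by RISOTTO) builds a two-layer mirrored ReLU block from two orthogonal matrices and verifies that it acts as an exact linear isometry despite the nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64

# Two random orthogonal matrices obtained via QR decompositions.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))

def relu(a):
    return np.maximum(a, 0.0)

def looks_linear_block(x):
    # First layer duplicates the pre-activations with opposite signs: [U; -U] x.
    h = relu(np.concatenate([U @ x, -U @ x]))
    # Second layer recombines with mirrored weights [V, -V]; since
    # relu(a) - relu(-a) = a, the block computes V @ (U @ x) exactly.
    return V @ h[:n] - V @ h[n:]

x = rng.standard_normal(n)
y = looks_linear_block(x)
print(np.allclose(y, V @ (U @ x)))                        # True: the map is linear
print(np.isclose(np.linalg.norm(y), np.linalg.norm(x)))   # True: norm is preserved
```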
ResNet Initialization Approaches Fixup (Zhang et al., 2018), SkipInit (De & Smith, 2020), and ReZero (Bachlechner et al., 2021) have been designed to enable training without requiring BN, yet they usually cannot achieve equal performance. Training-data-informed approaches have also been successful (Zhu et al., 2021; Dauphin & Schoenholz, 2019), but they require computing the gradient of the input minibatches. Yet, most methods only work well in combination with BN (Ioffe & Szegedy, 2015), as it seems to improve ill-conditioned initializations (Glorot & Bengio, 2010; He et al., 2016) according to Bjorck et al. (2018), allows training with larger learning rates (Santurkar et al., 2018), and might initially bias the residual block towards the identity, enabling signal to flow through (De & Smith, 2020). The additional computational and memory costs of BN, however, have motivated research on alternatives, including different normalization methods (Wu & He, 2018; Salimans & Kingma, 2016; Ulyanov et al., 2016). Only recently has it been possible to outperform BN in generalization performance using scaled weight standardization and gradient clipping (Brock et al., 2021b;a), but this requires careful hyperparameter tuning. In experiments, we compare our initialization proposal RISOTTO with all three types of approaches: normalization-free methods, BN, and normalization alternatives (e.g., NF ResNets).
2 RESNET INITIALIZATION
2.1 BACKGROUND AND NOTATION
The object of our study is a general residual network that is defined by
$$\mathbf{z}^0 := W^0 * \mathbf{x}, \qquad \mathbf{x}^l = \phi(\mathbf{z}^{l-1}), \qquad \mathbf{z}^l := \alpha_l f^l(\mathbf{x}^l) + \beta_l h^l(\mathbf{x}^l); \qquad \mathbf{z}^{\mathrm{out}} := W^{\mathrm{out}} P(\mathbf{x}^L) \qquad (1)$$
for $1 \le l \le L$. $P(\cdot)$ denotes an optional pooling operation like maxpool or average pool, $f(\cdot)$ the residual connections, and $h(\cdot)$ the skip connections, which usually represent an identity mapping or a projection. For simplicity, we assume in our derivations and arguments that these functions are parameterized as $f^l(\mathbf{x}^l) = W^l_2 * \phi(W^l_1 * \mathbf{x}^l + \mathbf{b}^l_1) + \mathbf{b}^l_2$ and $h^l(\mathbf{x}^l) = W^l_{\mathrm{skip}} * \mathbf{x}^l + \mathbf{b}^l_{\mathrm{skip}}$ ($*$ denotes convolution), but our arguments also transfer to residual blocks in which more than one layer is skipped. Optionally, batch normalization (BN) layers are placed before or after the nonlinear activation function $\phi(\cdot)$. We focus on ReLUs $\phi(x) = \max\{0, x\}$ (Krizhevsky et al., 2012), which are among the most commonly used activation functions in practice. All biases $\mathbf{b}^l_2 \in \mathbb{R}^{N_{l+1}}$, $\mathbf{b}^l_1 \in \mathbb{R}^{N_{ml}}$, and $\mathbf{b}^l_{\mathrm{skip}} \in \mathbb{R}^{N_l}$ are assumed to be trainable and set initially to zero. We ignore them in the following, since we are primarily interested in the neuron states and signal propagation at initialization. The parameters $\alpha$ and $\beta$ balance the contribution of the residual and the skip branch, respectively. Note that $\alpha$ is a trainable parameter, while $\beta$ is just mentioned for convenience. Both parameters could also be integrated into the weight parameters $W^l_2 \in \mathbb{R}^{N_{l+1} \times N_{ml} \times k^l_{2,1} \times k^l_{2,2}}$, $W^l_1 \in \mathbb{R}^{N_{ml} \times N_l \times k^l_{1,1} \times k^l_{1,2}}$, and $W^l_{\mathrm{skip}} \in \mathbb{R}^{N_{l+1} \times N_l \times 1 \times 1}$, but keeping them explicit makes the discussion of different initialization schemes more convenient and simplifies the comparison with standard He initialization approaches (He et al., 2015).
Residual Blocks Following the definition by He et al. (2016), we distinguish two types of residual blocks, Type B and Type C (see Figure 1a), which differ in the choice of $W^l_{\mathrm{skip}}$. The Type C residual block is defined as $\mathbf{z}^l = \alpha f^l(\mathbf{x}^l) + h^l(\mathbf{x}^l)$, so that the shortcuts $h(\cdot)$ are projections with a $1 \times 1$ kernel and trainable parameters. The Type B residual block has identity skip connections, $\mathbf{z}^l = \alpha f^l(\mathbf{x}^l) + \mathbf{x}^l$. Thus, $W^l_{\mathrm{skip}}$ represents the identity and is not trainable.
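To make the notation concrete, here is a minimal PyTorch-style sketch of a residual block following Eq. (1) with the Type B (identity) and Type C ($1\times 1$ projection) skip variants; the class and argument names are our own illustrative choices, not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """z^l = alpha * f(x) + beta * h(x) with f(x) = W2 * relu(W1 * x), cf. Eq. (1)."""

    def __init__(self, in_ch, mid_ch, out_ch, block_type="C", alpha=1.0, beta=1.0):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)   # W1
        self.conv2 = nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1)  # W2
        self.alpha = nn.Parameter(torch.tensor(float(alpha)))  # trainable, as in the text
        self.beta = float(beta)                                 # bookkeeping scalar only
        if block_type == "C":
            # Type C: trainable 1x1 projection W_skip on the shortcut.
            self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        else:
            # Type B: identity shortcut; requires in_ch == out_ch, not trainable.
            self.skip = nn.Identity()

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))                # f^l(x^l)
        return self.alpha * residual + self.beta * self.skip(x)     # z^l
```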
2.2 SIGNAL PROPAGATION FOR NORMAL RESNET INITIALIZATION
Most initialization methods for ResNets draw weight entries independently at random, including
Fixup and SkipInit. To simplify the theoretical analysis of the induced random networks and to
highlight the shortcomings of the independence assumption, we assume:
Definition 2.1 (Normally Distributed ResNet Parameters). All biases are initialized as zero and all weight matrix entries are independently normally distributed with $w^l_{ij,2} \sim \mathcal{N}(0, \sigma^2_{l,2})$, $w^l_{ij,1} \sim \mathcal{N}(0, \sigma^2_{l,1})$, and $w^l_{ij,\mathrm{skip}} \sim \mathcal{N}(0, \sigma^2_{l,\mathrm{skip}})$.
Most studies further focus on special cases of the following set of parameter choices.
Definition 2.2 (Normal ResNet Initialization). The choice $\sigma_{l,1} = \sqrt{2/(N_{ml}\, k^l_{1,1} k^l_{1,2})}$, $\sigma_{l,2} = \sqrt{2/(N_{l+1}\, k^l_{2,1} k^l_{2,2})}$, and $\sigma_{l,\mathrm{skip}} = \sqrt{2/N_{l+1}}$ as used in Definition 2.1, and $\alpha_l, \beta_l \ge 0$ that fulfill $\alpha^2_l + \beta^2_l = 1$.
Another common choice is $W_{\mathrm{skip}} = I$ instead of random entries. If $\beta_l = 1$, sometimes also $\alpha_l \neq 0$ is still common if it accounts for the depth $L$ of the network. In case $\alpha_l$ and $\beta_l$ are the same for each layer, we drop the subscript $l$. For instance, Fixup (Zhang et al., 2018) and SkipInit (De & Smith, 2020) satisfy the above condition with $\alpha = 0$ and $\beta = 1$. De & Smith (2020) argue that BN also suppresses the residual branch effectively. However, in combination with He initialization (He et al., 2015) it becomes more similar to $\alpha = \beta = \sqrt{0.5}$. Li et al. (2021) study the case of free $\alpha_l$ but focus their analysis on identity mappings $W^l_1 = I$ and $W^l_{\mathrm{skip}} = I$.
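As a concrete reference, the following sketch shows how the Normal ResNet Initialization of Definition 2.2 could be instantiated for the convolutional weights of one block, assuming the ResidualBlock layout sketched above; the default $\alpha = \sqrt{0.5}$ is only an illustrative choice, and Fixup and SkipInit would correspond to $\alpha = 0$, $\beta = 1$.

```python
import math
import torch
import torch.nn as nn

def normal_resnet_init(block, alpha=math.sqrt(0.5)):
    """Sketch of the Normal ResNet Initialization (Definition 2.2) for one block.

    Assumes a block laid out like the ResidualBlock sketch above, with
    convolutions conv1 (W1), conv2 (W2) and a 1x1 projection skip (W_skip).
    """
    beta = math.sqrt(1.0 - alpha ** 2)      # enforce alpha^2 + beta^2 = 1

    def init_conv(conv):
        # sigma^2 = 2 / (out_channels * k_1 * k_2), as in Definition 2.2,
        # where the relevant width is the number of output channels.
        k1, k2 = conv.kernel_size
        sigma = math.sqrt(2.0 / (conv.out_channels * k1 * k2))
        nn.init.normal_(conv.weight, mean=0.0, std=sigma)
        if conv.bias is not None:
            nn.init.zeros_(conv.bias)       # biases start at zero

    for conv in (block.conv1, block.conv2, block.skip):
        if isinstance(conv, nn.Conv2d):     # a Type B identity skip has no weights
            init_conv(conv)

    with torch.no_grad():
        block.alpha.fill_(alpha)            # Fixup/SkipInit instead use alpha = 0
    block.beta = beta
    return block
```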
Like other theoretical work, we focus our following investigations on fully-connected layers to simplify the exposition. Similar insights would transfer to convolutional layers but would require extra effort (Yang & Schoenholz, 2017). The motivation for the general choice in Definition 2.2 is that it ensures that the average squared $l_2$-norm of the neuron states is identical in every layer. This has been shown by Li et al. (2021) for the special choice $W^l_1 = I$ and $W^l_{\mathrm{skip}} = I$, $\beta = 1$, and by Yang & Schoenholz (2017) in the mean field limit with a missing ReLU so that $\mathbf{x}^l = \mathbf{z}^{l-1}$. Hanin & Rolnick (2018) have also observed for $W^l_{\mathrm{skip}} = I$ and $\beta = 1$ that the squared signal norm increases in $\sum_l \alpha_l$. For completeness, we present the most general case next and prove it in the appendix.
Figure 1: (a) The two types of considered residual blocks. In Type C the skip connection is a projection with a 1×1 kernel, while in Type B the input is directly added to the residual branch via the skip connection. Both these blocks have been described by He et al. (2016). (b) The correlation between two inputs for different initializations as they pass through a residual network consisting of a convolution filter followed by 5 residual blocks (Type C), an average pool, and a linear layer on CIFAR10. Only RISOTTO maintains constant correlations after each residual block, while the correlation increases with depth for the other initializations. (c) Performance of RISOTTO for different values of alpha (α) for ResNet 18 (C) on CIFAR10. Note that α = 0 is equivalent to SkipInit and achieves the lowest accuracy. Initializing α = 1 clearly improves performance.
Theorem 2.3 (Norm preservation). Let a neural network consist of fully-connected residual blocks as defined by Eq. (1) that start with a fully-connected layer $W^0$ at the beginning, which contains $N_1$ output channels. Assume that all biases are initialized as $0$ and that all weight matrix entries are independently normally distributed with $w^l_{ij,2} \sim \mathcal{N}(0, \sigma^2_{l,2})$, $w^l_{ij,1} \sim \mathcal{N}(0, \sigma^2_{l,1})$, and $w^l_{ij,\mathrm{skip}} \sim \mathcal{N}(0, \sigma^2_{l,\mathrm{skip}})$. Then the expected squared norm of the output after one fully-connected layer and $L$ residual blocks applied to input $\mathbf{x}$ is given by
$$\mathbb{E}\left[\left\| \mathbf{x}^L \right\|^2\right] = \frac{N_1}{2}\, \sigma^2_0 \prod_{l=1}^{L-1} \frac{N_{l+1}}{2} \left( \alpha^2_l\, \sigma^2_{l,2}\, \sigma^2_{l,1}\, \frac{N_{ml}}{2} + \beta^2_l\, \sigma^2_{l,\mathrm{skip}} \right) \left\| \mathbf{x} \right\|^2.$$
Note that this result does not rely on any (mean field) approximations and applies also to other parameter distributions that have zero mean and are symmetric around zero. Inserting the parameters of Definition 2.2 for fully-connected networks with $k = 1$ leads to the following insight, which explains why this is the preferred initialization choice.
Insight 2.4 (Norm preserving initialization). According to Theorem 2.3, the normal ResNet initialization (Definition 2.2) preserves the average squared signal norm for arbitrary depth $L$.
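The following NumPy sketch (our own illustration for the fully-connected case with $k = 1$) checks Insight 2.4 numerically: with the standard deviations of Definition 2.2 and $\alpha^2 + \beta^2 = 1$, the average squared norm of $\mathbf{x}^L$ stays close to $\|\mathbf{x}\|^2$ regardless of the number of blocks.

```python
import numpy as np

def mean_squared_norm(L=20, n=256, trials=100, alpha=0.5 ** 0.5, seed=0):
    """Monte Carlo check of Theorem 2.3 / Insight 2.4, fully-connected case (k = 1)."""
    beta = (1.0 - alpha ** 2) ** 0.5
    rng = np.random.default_rng(seed)
    relu = lambda a: np.maximum(a, 0.0)
    x_in = rng.standard_normal(n)
    x_in /= np.linalg.norm(x_in)                       # ||x||^2 = 1
    sq_norms = []
    for _ in range(trials):
        # initial fully-connected layer with sigma_0^2 = 2 / N_1
        x = relu(rng.normal(0.0, np.sqrt(2.0 / n), (n, n)) @ x_in)
        for _ in range(L):                             # residual blocks, Definition 2.2
            W1 = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))
            W2 = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))
            Wskip = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))
            x = relu(alpha * W2 @ relu(W1 @ x) + beta * Wskip @ x)
        sq_norms.append(np.sum(x ** 2))
    return float(np.mean(sq_norms))

print(mean_squared_norm())   # stays close to ||x||^2 = 1, independent of L
```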
Even though this initialization setting is able to avoid exploding or vanishing signals, it still induces
considerable issues, as the analysis of the joint signal corresponding to different inputs reveals.
According to the next theorem, the signal covariance fulfills a layerwise recurrence relationship that
leads to the observation that signals become more similar with increasing depth.
Theorem 2.5 (Layerwise signal covariance). Let a fully-connected residual block be given as defined by Eq. (1) with random parameters according to Definition 2.2. Let $\mathbf{x}^{l+1}$ denote the neuron states of layer $l+1$ for input $\mathbf{x}$ and $\tilde{\mathbf{x}}^{l+1}$ the same neurons but for input $\tilde{\mathbf{x}}$. Then their covariance given all parameters of the previous layers satisfies
$$\mathbb{E}_l\!\left[\langle \mathbf{x}^{l+1}, \tilde{\mathbf{x}}^{l+1} \rangle\right] \ge \frac{1}{4}\, \frac{N_{l+1}}{2} \left( \alpha^2 \sigma^2_{l,2}\, \sigma^2_{l,1}\, \frac{N_{ml}}{2} + 2\beta^2 \sigma^2_{l,\mathrm{skip}} \right) \langle \mathbf{x}^l, \tilde{\mathbf{x}}^l \rangle + \frac{c}{4}\, \alpha^2 N_{l+1}\, \sigma^2_{l,2}\, \sigma^2_{l,1}\, N_{ml}\, \|\mathbf{x}^l\| \, \|\tilde{\mathbf{x}}^l\| \qquad (2)$$
$$\quad + \mathbb{E}_{W^l_1}\!\left[\sqrt{\left( \alpha^2 \sigma^2_{l,2} \left\|\phi(W^l_1 \mathbf{x}^l)\right\|^2 + \beta^2 \sigma^2_{l,\mathrm{skip}} \|\mathbf{x}^l\|^2 \right)\left( \alpha^2 \sigma^2_{l,2} \left\|\phi(W^l_1 \tilde{\mathbf{x}}^l)\right\|^2 + \beta^2 \sigma^2_{l,\mathrm{skip}} \|\tilde{\mathbf{x}}^l\|^2 \right)}\right],$$
where the expectation $\mathbb{E}_l$ is taken with respect to the initial parameters $W^l_2$, $W^l_1$, and $W^l_{\mathrm{skip}}$, and the constant $c$ fulfills $0.24 \le c \le 0.25$.
Note that this statement holds even for finite networks. To clarify what that means for the separability
of inputs, we have to compute the expectation with respect to the parameters of $W^l_1$. To gain an
intuition, we employ an approximation that holds for a wide intermediary network.
Insight 2.6 (Covariance of signal for different inputs increases with depth). Let a fully-connected ResNet with random parameters as in Definition 2.2 be given. It follows from Theorem 2.5 that the outputs corresponding to different inputs become more difficult to distinguish for increasing depth $L$. For simplicity, let us assume that $\|\mathbf{x}\| = \|\tilde{\mathbf{x}}\| = 1$. Then, in the mean field limit $N_{ml} \to \infty$, the covariance of the signals is lower bounded by
$$\mathbb{E}\left[\langle \mathbf{x}^L, \tilde{\mathbf{x}}^L \rangle\right] \ge \gamma_1^L \langle \mathbf{x}, \tilde{\mathbf{x}} \rangle + \gamma_2 \sum_{k=0}^{L-1} \gamma_1^k = \gamma_1^L \langle \mathbf{x}, \tilde{\mathbf{x}} \rangle + \gamma_2\, \frac{1 - \gamma_1^L}{1 - \gamma_1} \qquad (3)$$
for $\gamma_1 = \frac{1+\beta^2}{4} \le \frac{1}{2}$ and $\gamma_2 = c\,(\alpha^2 + 2) \approx \frac{\alpha^2}{4} + \frac{1}{2}$, using $\mathbb{E}_{l-1}\left[\|\mathbf{x}^l\|\,\|\tilde{\mathbf{x}}^l\|\right] \approx 1$.
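For the reader's convenience, here is a short intermediate computation (ours, for the fully-connected case with $k = 1$) showing how the standard deviations of Definition 2.2 turn the coefficient of $\langle \mathbf{x}^l, \tilde{\mathbf{x}}^l \rangle$ in Eq. (2) into the constant $\gamma_1$:
$$\gamma_1 = \frac{1}{4} \cdot \frac{N_{l+1}}{2}\left( \alpha^2 \cdot \frac{2}{N_{l+1}} \cdot \frac{2}{N_{ml}} \cdot \frac{N_{ml}}{2} + 2\beta^2 \cdot \frac{2}{N_{l+1}} \right) = \frac{\alpha^2 + 2\beta^2}{4} = \frac{1 + \beta^2}{4},$$
where the last step uses $\alpha^2 + \beta^2 = 1$.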
Since $\gamma_1 < 1$, the contribution of the original input correlations $\langle \mathbf{x}, \tilde{\mathbf{x}} \rangle$ vanishes for increasing depth $L$. Meanwhile, by adding a constant contribution in every layer, irrespective of the input correlations, $\mathbb{E}\left[\langle \mathbf{x}^L, \tilde{\mathbf{x}}^L \rangle\right]$ increases with $L$ and converges to the maximum value $1$ (or a slightly smaller value in case of smaller width $N_{ml}$); indeed, the lower bound in Eq. (3) converges to $\gamma_2/(1-\gamma_1) = 4c \in [0.96, 1]$. Thus, deep models essentially map every input to almost the same output vector, which makes it impossible for the initial network to distinguish different inputs and provide information for meaningful gradients. Fig. 1b demonstrates this trend and compares it with our initialization proposal RISOTTO, which does not suffer from this problem.
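A small NumPy simulation (our own fully-connected analogue of the setting in Fig. 1b) illustrates Insight 2.6: under the normal ResNet initialization of Definition 2.2, the cosine similarity between the representations of two initially almost orthogonal inputs grows towards 1 with depth.

```python
import numpy as np

def correlation_through_depth(L=30, n=512, alpha=0.5 ** 0.5, seed=0):
    """Cosine similarity of two inputs after each block (Definition 2.2 init)."""
    beta = (1.0 - alpha ** 2) ** 0.5
    rng = np.random.default_rng(seed)
    relu = lambda a: np.maximum(a, 0.0)

    x, xt = rng.standard_normal(n), rng.standard_normal(n)   # nearly orthogonal inputs
    x, xt = x / np.linalg.norm(x), xt / np.linalg.norm(xt)

    W0 = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))            # first layer
    x, xt = relu(W0 @ x), relu(W0 @ xt)

    cosines = []
    for _ in range(L):
        W1 = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))
        W2 = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))
        Ws = rng.normal(0.0, np.sqrt(2.0 / n), (n, n))
        x = relu(alpha * W2 @ relu(W1 @ x) + beta * Ws @ x)
        xt = relu(alpha * W2 @ relu(W1 @ xt) + beta * Ws @ xt)
        cosines.append(float(x @ xt / (np.linalg.norm(x) * np.linalg.norm(xt))))
    return cosines   # creeps towards 1 as depth grows

print(correlation_through_depth()[::5])
```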
While the general trend holds for residual as well as standard fully-connected feed-forward networks ($\beta = 0$), interestingly, we still note a mitigation for a strong skip branch ($\beta = 1$). The contribution by the input correlations decreases more slowly and the constant contribution is reduced for larger $\beta$. Thus, residual networks make the training of deeper models feasible, as they were designed to do (He et al., 2016). This observation is in line with the findings of Yang & Schoenholz