
et al., 2016). This idea has not yet been transferred to residual blocks but has the advantage that it
can be combined with orthogonal submatrices. These matrices induce perfect dynamical isometry
(Saxe et al., 2013; Mishkin & Matas, 2015; Poole et al., 2016; Pennington et al., 2017), meaning that
the eigenvalues of the initial input-output Jacobian are identical to 1 or −1 and not just close to unity
on average. This property has been shown to enable the training of very deep neural networks (Xiao
et al., 2018) and can improve their generalization ability (Hayase & Karakida, 2021) and training
speed (Pennington et al., 2017; 2018). ResNets equipped with ReLUs can currently achieve this
property only approximately, and only without a practical initialization scheme (Tarnowski et al., 2019),
with reduced feature diversity (Blumenfeld et al., 2020), or with potential training instabilities (Zhang
et al., 2018; De & Smith, 2020).
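To make the notion of perfect dynamical isometry concrete, the following toy sketch (our own illustration, not taken from the cited works) builds a deep linear chain with orthogonal weights and checks that all singular values of its input-output Jacobian are exactly one; inserting ReLUs would break this exactness, which is precisely the difficulty for ResNets discussed above.

```python
import torch

# Illustrative check of perfect dynamical isometry (toy example):
# a deep *linear* chain of orthogonal weights has an input-output
# Jacobian whose singular values are all exactly 1.
torch.manual_seed(0)
width, depth = 32, 10
weights = []
for _ in range(depth):
    W = torch.empty(width, width)
    torch.nn.init.orthogonal_(W)      # orthogonal initialization
    weights.append(W)

def net(x):
    for W in weights:
        x = x @ W.T                   # no nonlinearity: isometry is exact
    return x

x = torch.randn(width)
J = torch.autograd.functional.jacobian(net, x)   # input-output Jacobian
print(torch.linalg.svdvals(J))                   # all entries ~1.0
```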
ResNet Initialization Approaches Fixup (Zhang et al., 2018), SkipInit (De & Smith, 2020), and
ReZero (Bachlechner et al., 2021) have been designed to enable training without requiring BN,
yet they usually cannot achieve equal performance. Approaches informed by the training data have also been
successful (Zhu et al., 2021; Dauphin & Schoenholz, 2019), but they require computing the gradient
of the input minibatches. Yet, most methods only work well in combination with BN (Ioffe &
Szegedy, 2015), as it seems to improve ill-conditioned initializations (Glorot & Bengio, 2010; He
et al., 2016) according to Bjorck et al. (2018), allows training with larger learning rates (Santurkar
et al., 2018), and might initially bias the residual block towards the identity, enabling signal to flow
through (De & Smith, 2020). The additional computational and memory costs of BN, however,
have motivated research on alternatives, including different normalization methods (Wu & He, 2018;
Salimans & Kingma, 2016; Ulyanov et al., 2016). Only recently has it been possible to outperform
BN in generalization performance using scaled weight standardization and gradient clipping (Brock
et al., 2021b;a), but this requires careful hyperparameter tuning. In experiments, we compare our
initialization proposal RISOTTO with all three approaches: normalization-free methods, BN, and
normalization alternatives (e.g., NF ResNet).
2 RESNET INITIALIZATION
2.1 BACKGROUND AND NOTATION
The object of our study is a general residual network that is defined by
\[
z^0 := W^0 \ast x, \qquad x^l = \phi(z^{l-1}), \qquad z^l := \alpha_l f_l(x^l) + \beta_l h_l(x^l); \qquad z^{\text{out}} := W^{\text{out}} P(x^L) \tag{1}
\]
for $1 \le l \le L$. $P(\cdot)$ denotes an optional pooling operation like maxpool or average pool, $f_l(\cdot)$ the
residual connections, and $h_l(\cdot)$ the skip connections, which usually represent an identity mapping or
a projection. For simplicity, we assume in our derivations and arguments that these functions are
parameterized as $f_l(x^l) = W^l_2 \ast \phi(W^l_1 \ast x^l + b^l_1) + b^l_2$ and $h_l(x^l) = W^l_{\text{skip}} \ast x^l + b^l_{\text{skip}}$ (where $\ast$ denotes convolution),
but our arguments also transfer to residual blocks in which more than one layer is skipped.
Optionally, batch normalization (BN) layers are placed before or after the nonlinear activation function
$\phi(\cdot)$. We focus on ReLUs, $\phi(x) = \max\{0, x\}$ (Krizhevsky et al., 2012), which are among the
most commonly used activation functions in practice. All biases $b^l_2 \in \mathbb{R}^{N_{l+1}}$, $b^l_1 \in \mathbb{R}^{N_{m_l}}$, and
$b^l_{\text{skip}} \in \mathbb{R}^{N_l}$ are assumed to be trainable and are initially set to zero. We ignore them in the following,
since we are primarily interested in the neuron states and signal propagation at initialization. The
parameters $\alpha$ and $\beta$ balance the contributions of the residual and the skip branch, respectively. Note
that $\alpha$ is a trainable parameter, while $\beta$ is introduced only for notational convenience. Both parameters could also be
integrated into the weight parameters $W^l_2 \in \mathbb{R}^{N_{l+1} \times N_{m_l} \times k^l_{2,1} \times k^l_{2,2}}$, $W^l_1 \in \mathbb{R}^{N_{m_l} \times N_l \times k^l_{1,1} \times k^l_{1,2}}$,
and $W^l_{\text{skip}} \in \mathbb{R}^{N_{l+1} \times N_l \times 1 \times 1}$, but keeping them explicit makes the discussion of different initialization schemes more
convenient and simplifies the comparison with standard He initialization approaches (He et al., 2015).
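To make this parameterization concrete, the following PyTorch sketch (our illustration; module and argument names are ours, not the authors' code) implements a single residual block $z^l = \alpha f_l(x^l) + \beta h_l(x^l)$ with a trainable $\alpha$, a fixed $\beta$, zero-initialized biases, and either an identity or a $1 \times 1$ projection skip branch.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of z^l = alpha * f_l(x^l) + beta * h_l(x^l) from Eq. (1).

    f_l(x) = W2 * phi(W1 * x + b1) + b2, h_l(x) = W_skip * x + b_skip
    or the identity. Illustrative code, not the authors' implementation.
    """
    def __init__(self, n_in, n_mid, n_out, beta=1.0, projection=True):
        super().__init__()
        self.conv1 = nn.Conv2d(n_in, n_mid, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(n_mid, n_out, kernel_size=3, padding=1)
        # skip branch: 1x1 projection or plain identity
        self.skip = (nn.Conv2d(n_in, n_out, kernel_size=1)
                     if projection else nn.Identity())
        self.alpha = nn.Parameter(torch.tensor(1.0))  # trainable balance
        self.beta = beta                              # fixed, notational only
        self.phi = nn.ReLU()
        # all biases are trainable but start at zero
        for m in (self.conv1, self.conv2, self.skip):
            if isinstance(m, nn.Conv2d):
                nn.init.zeros_(m.bias)

    def forward(self, x):
        f = self.conv2(self.phi(self.conv1(x)))       # residual branch f_l
        h = self.skip(x)                              # skip branch h_l
        return self.alpha * f + self.beta * h
```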
Residual Blocks Following the definition by He et al. (2015), we distinguish two types of residual
blocks, Type B and Type C (see Figure 1a), which differ in the choice of $W^l_{\text{skip}}$. The Type C
residual block is defined as $z^l = \alpha f_l(x^l) + h_l(x^l)$, so that the shortcuts $h_l(\cdot)$ are projections with a
$1 \times 1$ kernel and trainable parameters. The Type B residual block has identity skip connections,
$z^l = \alpha f_l(x^l) + x^l$; thus, $W^l_{\text{skip}}$ represents the identity and is not trainable.
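In terms of the sketch above (again purely illustrative), the two block types differ only in the skip branch:

```python
# Type C: 1x1 projection with trainable W_skip
block_c = ResidualBlock(n_in=64, n_mid=64, n_out=64, projection=True)
# Type B: identity skip connection, W_skip is not trainable
block_b = ResidualBlock(n_in=64, n_mid=64, n_out=64, projection=False)
```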