shows many challenges. First, the test performance is mostly decided by the converged parameters
after training, while the training dynamics are more related to the parameters at initialization or during
training. Second, to efficiently find a better starting point, the quantity is expected to be differentiable
to enable the optimization in the continuous parameter space. To this end, we leverage the recent
advances in optimization landscape analysis [2] to propose a novel differentiable quantity and develop a corresponding algorithm for its optimization at initialization.
Specifically, our quantity is inspired by analyzing the optimization landscapes of individual training
samples [59]. Through generalizing prior theoretical results on batch-wise optimization [2] to sample-wise optimization, we prove that both the network's training and generalization errors are upper bounded by a theoretical quantity that correlates with the cosine similarity of the sample-wise local optima. Moreover, this quantity also relates to the training dynamics since it reflects the optimization path consistency [35, 37] from the starting point. Unfortunately, the sample-wise local optima are
intractable. Under the hypothesis that the sample-wise local optima can be reached by a first-order approximation from the initial parameters, we can approximate the quantity via the sample-wise
gradients at initialization. Our final result shows that, under a limited gradient norm, both the
training and test performance of a network can be improved by maximizing the cosine similarity of sample-wise gradients, which we name GradCosine; this quantity is differentiable and easy to implement.
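As a concrete, simplified illustration, the sketch below shows one way GradCosine could be computed at initialization: flatten each sample's gradient and average the pairwise cosine similarities. The helper names, the per-sample loop, and the pairwise-averaging definition are our assumptions for exposition, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def sample_gradients(model, loss_fn, xs, ys):
    """Flattened gradient of the loss for each individual sample (illustrative helper)."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        g = torch.autograd.grad(loss, params)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    return torch.stack(grads)                     # shape: (num_samples, num_params)

def grad_cosine(grads):
    """Mean pairwise cosine similarity of sample-wise gradients (assumed definition)."""
    g = F.normalize(grads, dim=1)                 # unit-norm gradient per sample
    sim = g @ g.t()                               # all pairwise cosine similarities
    n = sim.size(0)
    return (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))   # exclude the diagonal
```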
We then propose the Neural Initialization Optimization (NIO) algorithm based on GradCosine to find
a better initialization agnostic to the architecture. We generalize the algorithm from the sample-wise analysis to the batch-wise setting by dividing a batch into sub-batches for ease of implementation. Following [8, 60], we use gradient descent to learn a set of scalar coefficients for the initialized parameters. These coefficients are optimized to maximize GradCosine for better training dynamics and expected performance, while constraining the gradient norm to prevent explosion.
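The sketch below illustrates how such coefficient learning might look in code, reusing grad_cosine from the previous sketch. The initial parameters are rescaled by learnable coefficients (one per parameter tensor here, an assumption for brevity), sub-batch gradients are computed through torch.func.functional_call (PyTorch 2.x) with create_graph=True so that GradCosine stays differentiable with respect to the coefficients, and a simple norm penalty stands in for the gradient-norm constraint. Hyperparameters, helper names, and the data variables are placeholders, not the paper's actual settings.

```python
import torch
from torch.func import functional_call

def nio_objective(model, loss_fn, sub_batches, alphas, init_params, norm_penalty=1e-3):
    """Negative GradCosine plus a gradient-norm penalty, differentiable w.r.t. alphas."""
    # Rescale the frozen initialization with the learnable coefficients.
    params = {name: a * p0 for (name, p0), a in zip(init_params.items(), alphas)}
    grads = []
    for x, y in sub_batches:                      # list of (inputs, labels) sub-batches
        loss = loss_fn(functional_call(model, params, (x,)), y)
        g = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    grads = torch.stack(grads)
    cos = grad_cosine(grads)                      # quantity to maximize (see above)
    norm = grads.norm(dim=1).mean()               # keep gradients from exploding
    return -cos + norm_penalty * norm

# Hypothetical outer loop: model, loss_fn, and sub_batches come from the user's setup.
init_params = {k: v.detach().clone() for k, v in model.named_parameters()}
alphas = [torch.ones(1, requires_grad=True) for _ in init_params]
opt = torch.optim.Adam(alphas, lr=1e-2)
for _ in range(50):
    opt.zero_grad()
    nio_objective(model, loss_fn, sub_batches, alphas, init_params).backward()
    opt.step()
with torch.no_grad():                             # write the rescaled init back
    for (name, p), a in zip(model.named_parameters(), alphas):
        p.copy_(a * init_params[name])
```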
Experiments show that for a variety of deep architectures including ResNet [19], DenseNet [21], and WideResNet [56], our method achieves better classification results on CIFAR-10/100 [27] than prior heuristic [18] and learning-based [8, 60] initialization methods. Our method also yields better ImageNet [9] performance when initializing ResNet-50 [19]. Moreover, our method is able to help the recently proposed Swin-Transformer [32] achieve stable training and competitive results on ImageNet even without warmup [17], which is crucial for the successful training of Transformer architectures [31, 52].
2 Related Work
2.1 Network Initialization
Existing initialization methods are designed to control the norms of network parameters via Gaussian
initialization [16, 18] or orthonormal matrix initialization [40, 36] with different variance patterns.
These analyses are most effective for simple feed-forward networks without skip connections or normalization layers. Recently, initialization techniques tailored to specific complex architectures have been proposed. For example, [58] studied how to initialize networks with skip connections, and [23] generalized the results to the Transformer architecture [49]. However, these heuristic methods are
restricted to specific architectures. Automated machine learning has achieved success in searching for hyperparameters [3, 13] and architectures [62, 38, 5, 54, 55, 22], while similar techniques for neural network initialization remain underexplored. Current learning-based initialization methods [8, 60]
optimize the curvature [8] or the loss reduction of the first stochastic step [60] using gradient descent to tune the norms of the initial parameters. However, these methods lack a theoretical connection to model performance. Different from these methods, our proposed GradCosine is derived from a theoretical quantity that upper bounds both the training and generalization errors.
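For reference, the variance-scaling Gaussian schemes cited at the beginning of this subsection typically amount to a few lines of code. The snippet below uses PyTorch's built-in Kaiming (He) initializer purely as an illustration of this family of methods, not as part of NIO.

```python
import torch.nn as nn

def he_gaussian_init(module):
    """Fan-in scaled Gaussian initialization (variance 2 / fan_in for ReLU networks)."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage: model.apply(he_gaussian_init)
```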
2.2 Evaluating Model Performance at Initialization
Evaluating the performance of a network at initialization is an important and challenging problem, with wide application in zero-shot neural architecture search [1, 34, 6, 42, 59, 41, 29] and pruning [50, 28, 44].
The evaluation quantities in these studies are mainly based on the initial gradient norm [44, 41], the eigenvalues of the neural tangent kernel [6, 41], and the Fisher information matrix [47, 48, 45]. However, these quantities cannot reflect the optimization landscape, which is crucial for training dynamics and generalization [4, 12, 14, 43, 30, 2]. [2] provided theoretical evidence that for a sufficiently large neighborhood of a random initialization, the optimization landscape is nearly convex and semi-