Towards Theoretically Inspired Neural Initialization
Optimization
Yibo Yang1, Hong Wang2, Haobo Yuan3, Zhouchen Lin2,4,5
1JD Explore Academy, Beijing, China
2Key Lab. of Machine Perception (MoE), School of Intelligence Science and Technology, Peking University
3Institute of Artificial Intelligence and School of Computer Science, Wuhan University
4Institute for Artificial Intelligence, Peking University
5Pazhou Laboratory, Guangzhou, China
Abstract
Automated machine learning has been widely explored to reduce the human effort of designing neural architectures and searching for proper hyperparameters. In the
domain of neural initialization, however, similar automated techniques have rarely
been studied. Most existing initialization methods are handcrafted and highly
dependent on specific architectures. In this paper, we propose a differentiable
quantity, named GradCosine, with theoretical insights to evaluate the initial state of
a neural network. Specifically, GradCosine is the cosine similarity of sample-wise
gradients with respect to the initialized parameters. By analyzing the sample-
wise optimization landscape, we show that both the training and test performance
of a network can be improved by maximizing GradCosine under gradient norm
constraint. Based on this observation, we further propose the neural initialization
optimization (NIO) algorithm. Generalized from the sample-wise analysis into
the real batch setting, NIO is able to automatically look for a better initialization
with negligible cost compared with the training time. With NIO, we improve
the classification performance of a variety of neural architectures on CIFAR-10,
CIFAR-100, and ImageNet. Moreover, we find that our method can even help to
train a large vision Transformer architecture without warmup.
1 Introduction
For a deep neural network, the architecture [19, 20, 21] and the parameter initialization [18, 16] are two initial elements that largely account for the final model performance. Considerable human effort has been devoted to improving both aspects. To automatically produce better architectures, neural architecture search [62, 38, 5] has been a research focus. In contrast, using automated techniques for parameter initialization has rarely been studied.
Previous initialization methods are mostly handcrafted. They focus on finding proper variance patterns for randomly initialized weights [18, 16, 40, 36] or rely on empirical evidence derived from certain architectures [58, 23, 15, 7]. Recently, [60, 8] proposed learning-based initialization that tunes the norms of the initial weights so as to minimize a quantity intimately related to favorable training dynamics. Despite being architecture-agnostic and gradient-based, these methods do not consider the sample-wise landscape, and their derived quantities lack theoretical support linking them to model performance. It is unclear whether optimizing these quantities can indeed lead to better training or generalization performance.
In order to find a better initialization, a theoretically sound quantity that intimately evaluates both the
training and test performance should be designed. Finding such a quantity is a non-trivial problem and
poses many challenges. First, the test performance is mostly decided by the converged parameters after training, while the training dynamic is more related to the parameters at initialization or during training. Second, to efficiently find a better starting point, the quantity is expected to be differentiable so that it can be optimized in the continuous parameter space. To this end, we leverage recent advances in optimization landscape analysis [2] to propose a novel differentiable quantity and develop a corresponding algorithm for its optimization at initialization.
Specifically, our quantity is inspired by analyzing the optimization landscapes of individual training samples [59]. By generalizing prior theoretical results on batch-wise optimization [2] to sample-wise optimization, we prove that both the training and generalization errors of a network are upper bounded by a theoretical quantity that correlates with the cosine similarity of the sample-wise local optima. Moreover, this quantity also relates to the training dynamic, since it reflects the optimization path consistency [35, 37] from the starting point. Unfortunately, the sample-wise local optima are intractable. Under the hypothesis that the sample-wise local optima can be reached by a first-order approximation from the initial parameters, we can approximate the quantity via the sample-wise gradients at initialization. Our final result shows that, under a limited gradient norm, both the training and test performance of a network can be improved by maximizing the cosine similarity of sample-wise gradients, which we name GradCosine; it is differentiable and easy to implement.
We then propose the Neural Initialization Optimization (NIO) algorithm based on GradCosine to find a better initialization in an architecture-agnostic manner. We generalize the algorithm from the sample-wise analysis to the batch-wise setting by dividing a batch into sub-batches for an implementation-friendly formulation. Following [8, 60], we use gradient descent to learn a set of scalar coefficients for the initialized parameters. These coefficients are optimized to maximize GradCosine, for a better training dynamic and expected performance, while constraining the gradient norm to prevent explosion.
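As a rough illustration of this search (the sub-batch construction and the exact form of the gradient-norm constraint are detailed later in the paper), the sketch below optimizes one positive scale coefficient per initialized parameter tensor. The objective callable, its penalty formulation, and the hyperparameter names are our own assumptions rather than the paper's implementation.

```python
import torch

def optimize_init_scales(init_objective, num_tensors: int,
                         steps: int = 50, lr: float = 0.01) -> torch.Tensor:
    """Gradient-descent search over per-tensor scale coefficients of an initialization.

    init_objective: a differentiable callable that maps a (num_tensors,) vector of
        positive scales to a scalar, e.g. -GradCosine plus a gradient-norm penalty
        evaluated on a mini-batch under the rescaled initialization.
    Returns the learned scales, which are then multiplied into the initial weights.
    """
    log_scales = torch.zeros(num_tensors, requires_grad=True)  # exp(0) = 1, i.e. unchanged init
    optimizer = torch.optim.SGD([log_scales], lr=lr)
    for _ in range(steps):
        loss = init_objective(log_scales.exp())  # exp keeps the scales positive
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return log_scales.exp().detach()
```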
Experiments show that for a variety of deep architectures, including ResNet [19], DenseNet [21], and WideResNet [56], our method achieves better classification results on CIFAR-10/100 [27] than prior heuristic [18] and learning-based [8, 60] initialization methods. We can also initialize ResNet-50 [19] on ImageNet [9] for better performance. Moreover, our method helps the recently proposed Swin-Transformer [32] achieve stable training and competitive results on ImageNet even without warmup [17], which is otherwise crucial for the successful training of Transformer architectures [31, 52].
2 Related Work
2.1 Network Initialization
Existing initialization methods are designed to control the norms of network parameters via Gaussian initialization [16, 18] or orthonormal matrix initialization [40, 36] with different variance patterns. These analyses are most effective for simple feed-forward networks without skip connections or normalization layers. Recently, initialization techniques tailored to certain complex architectures have been proposed. For example, [58] studied how to initialize networks with skip connections, and [23] generalized the results to the Transformer architecture [49]. However, these heuristic methods are restricted to specific architectures. Automated machine learning has achieved success in searching for hyperparameters [3, 13] and architectures [62, 38, 5, 54, 55, 22], while similar techniques for neural network initialization deserve more exploration. Current learning-based initialization methods [8, 60] optimize the curvature [8] or the loss reduction of the first stochastic step [60] using gradient descent to tune the norms of the initial parameters. However, these methods lack a theoretical foundation relating them to model performance. Different from these methods, our proposed GradCosine is derived from a theoretical quantity that upper bounds both the training and generalization errors.
2.2 Evaluating Model Performance at Initialization
Evaluating the performance of a network at initialization is an important challenge and has been widely applied in zero-shot neural architecture search [1, 34, 6, 42, 59, 41, 29] and pruning [50, 28, 44]. The evaluation quantities in these studies are mainly based on the initial gradient norm [44, 41], the eigenvalues of the neural tangent kernel [6, 41], and the Fisher information matrix [47, 48, 45]. However, these quantities cannot reflect the optimization landscape, which is crucial for the training dynamic and generalization [4, 12, 14, 43, 30, 2]. [2] provided theoretical evidence that for a sufficiently large neighborhood of a random initialization, the optimization landscape is nearly convex and semi-smooth.
Figure 1: (a) An optimization landscape with sparser sample-wise local optima corresponds to a worse $\theta^*$ (larger loss $l$). (b) An optimization landscape with denser sample-wise local optima corresponds to a better $\theta^*$ (smaller loss $l$). However, the density of sample-wise local optima cannot reflect the training path. We further leverage the cosine similarity of sample-wise local optima. Under the same local optima density, (d) corresponds to a better training dynamic than (c), since (d) enjoys better optimization path consistency (smaller cosine distance between $\theta^*_1 - \theta_0$ and $\theta^*_2 - \theta_0$). We give more detailed explanations and discussions in Appendix C.
Based on this result, [59] proposed to use the density of sample-wise local optima to evaluate and rank neural architectures. Our study also performs a sample-wise landscape analysis, but differs from [59] in that our proposed quantity is differentiable and reflects optimization path consistency, whereas the quantity in [59] is non-differentiable and therefore cannot serve for initialization optimization. We make a more detailed comparison in Appendix C.
3 Theoretical Foundations
3.1 Sample-wise Optimization Landscape Analysis
Conventional optimization landscape analyses [30, 2] mainly focus on the objective across a mini-batch of samples and miss potential evidence hidden in the optimization landscapes of individual samples. As recently pointed out in [59], by decomposing a mini-batch objective into the summation of sample-wise objectives over the individual samples in the mini-batch, one finds that a network with denser sample-wise local optima tends to reach a better local optimum for the mini-batch objective, as shown in Figure 1(a)-(b). Based on this insight, [59] proposes to use the sample-wise local optima density to judge model performance at initialization.
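In our notation (with batch size $n$), the decomposition in question is simply that the mini-batch objective averages the sample-wise losses:
$$L_{\text{batch}}(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(f_\theta(x_i), y_i),$$
so each summand defines its own sample-wise landscape with its own local optimum near $\theta_0$.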
Density of sample-wise local optima. For a batch of training samples $S=\{(x_i, y_i)\}_{i\in[n]}$, a loss function $l(\hat{y}_i, y_i)$, and a network $f_\theta(\cdot)$ parameterized by $\theta\in\mathbb{R}^m$, the sample-wise local optima density $\Psi_{S,l}(f_{\theta_0}(\cdot))$ is measured by the averaged Manhattan distance between the pair-wise local optima $\{\theta^*_i\}_{i\in[n]}$ of all $n$ samples near the random initialization $\theta_0$ [59], i.e.,
$$\Psi_{S,l}(f_{\theta_0}(\cdot)) = \frac{H}{n^2}\sum_{i,j} \|\theta^*_i - \theta^*_j\|_1, \quad i, j \in [1, n], \qquad (1)$$
where $H$ is the smoothness upper bound, $\forall k\in[m],\ \forall i\in[n],\ [\nabla^2 l(f_\theta(x_i), y_i)]_{k,k} \le H$, which always exists under the smoothness assumption that for a neighborhood $\Gamma_{\theta_0}$ of a random initialization $\theta_0$, the sample-wise optimization landscapes are nearly convex and semi-smooth [2]. Based on this assumption, the training error of the network can be upper bounded by $\frac{n^3}{2}\Psi^2_{S,l}(f_{\theta_0}(\cdot))$. Moreover, with probability $1-\delta$, the generalization error measured by the population loss $\mathbb{E}_{(x_u, y_u)\sim\mathcal{D}}[l(f_{\theta^*}(x_u), y_u)]$ is upper bounded by $\frac{n^3}{2}\Psi^2_{S,l}(f_{\theta_0}(\cdot)) + \sigma$, where $\mathcal{D}=\{(x_u, y_u)\}_{u\in[U]}$ is the underlying data distribution of the test samples and $\sigma^2$ is the upper bound of $\mathrm{Var}_{(x_u, y_u)\sim\mathcal{D}}[\|\theta^* - \theta^*_u\|^2_1]$ [59].
Although the sample-wise local optima density is theoretically related to both the network training and generalization error, we point out that it may not be a suitable quantity for evaluating a network initialization, for the following reasons. First, it ignores the optimization path consistency from the initialization $\theta_0$ to each sample-wise local optimum $\theta^*_i$. As shown in Figure 1(c)-(d), while both (c) and (d) have the same local optima density, in Figure 1(d) the optimization paths from the initialization
to the local optima of the two samples are more consistent. Training networks on samples with more consistent optimization paths using batch gradient descent naturally corresponds to a better training dynamic that enjoys faster and more stable convergence [39, 37]. Second, the sample-wise local optima in Eq. (1) are intractable. They can be approximated by measuring the consistency of sample-wise gradient signs [59], but that approximation is non-differentiable and cannot serve for initialization optimization.
Based on these observations, we aim to derive a new quantity that directly reflects optimization path consistency and is a differentiable function of the initialization $\theta_0$. Our proposed quantity is based on the cosine similarity of the paths from the initialization to the sample-wise local optima.
Cosine similarity of sample-wise local optima. Concretely, our quantity can be formulated as:
$$\Theta_{S,l}(f_{\theta_0}(\cdot)) = \frac{H\alpha^2}{n}\sum_{i,j}\left(\frac{\alpha}{\beta} - \cos\left(\theta^*_i - \theta_0,\ \theta^*_j - \theta_0\right)\right), \quad i, j \in [1, n], \qquad (2)$$
where $\alpha$ and $\beta$ are the maximal and minimal $\ell_2$-norms of the sample-wise optimization paths, i.e., $\alpha = \max_i \|\theta^*_i - \theta_0\|_2$ and $\beta = \min_i \|\theta^*_i - \theta_0\|_2$ for $i\in[1, n]$, and $\cos(\theta^*_i - \theta_0, \theta^*_j - \theta_0)$ denotes the cosine similarity of the paths from the initialization $\theta_0$ to the sample-wise local optima $\theta^*_i$ and $\theta^*_j$. The cosine term in Eq. (2) reflects the optimization path consistency. Together with the distance term $\frac{\alpha}{\beta}$, it is also able to measure the density of sample-wise local optima. In the ideal case, when all the local optima are located at the same point, $\Theta_{S,l}(f_{\theta_0}(\cdot)) = \Psi_{S,l}(f_{\theta_0}(\cdot)) = 0$. Hence, compared with $\Psi$, our $\Theta$ is more suitable for evaluating the initialization quality.
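As with Eq. (1), a short sketch of Eq. (2) under the hypothetical assumption that the sample-wise optima are available; the variable names below are ours.

```python
import torch

def theta_quantity(sample_optima: torch.Tensor, theta0: torch.Tensor, H: float) -> torch.Tensor:
    """Eq. (2): path-consistency quantity Theta_{S,l}(f_{theta_0}(.))."""
    paths = sample_optima - theta0                      # rows are theta*_i - theta_0
    norms = paths.norm(dim=1)                           # ||theta*_i - theta_0||_2
    alpha, beta = norms.max(), norms.min()
    unit = paths / norms.clamp_min(1e-12).unsqueeze(1)
    cos = unit @ unit.t()                               # cos(theta*_i - theta_0, theta*_j - theta_0)
    n = sample_optima.shape[0]
    return H * alpha**2 / n * (alpha / beta - cos).sum()
```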
3.2 Main Results
In this subsection, we theoretically illustrate how minimizing the quantity $\Theta$ in Eq. (2) corresponds to better training and generalization performance. Similar to [59], our derivations are also based on the evidence that there exists a neighborhood of a random initialization in which the sample-wise optimization landscapes are nearly convex and semi-smooth [2].
Lemma 1.
There exists no saddle point in a sample-wise optimization landscape and every local
optimum is a global optimum [59].
Based on Lemma 1, we can draw a relation between the training error and $\Theta_{S,l}(f_{\theta_0}(\cdot))$. Moreover, we show that the proposed quantity is also related to the generalization performance as an upper bound of the population error. We present the following two theoretical results.
Theorem 2. The training loss $L=\frac{1}{n}\sum_i l(f_{\theta^*}(x_i), y_i)$ of a trained network $f_{\theta^*}$ on a dataset $S=\{(x_i, y_i)\}_{i\in[n]}$ is upper bounded by $\Theta_{S,l}(f_{\theta_0}(\cdot))$, and the bound is tight when $\Theta_{S,l}(f_{\theta_0}(\cdot)) = 0$.
Theorem 3. Suppose that $\sigma^2$ is the upper bound of $\mathrm{Var}_{(x_u, y_u)\sim\mathcal{D}}[\|\theta^* - \theta^*_u\|^2_2]$, where $\theta^*_u$ is the local optimum in the convex neighborhood of $\theta_0$ for the test sample $(x_u, y_u)$. With probability $1-\delta$, the population loss $\mathbb{E}_{(x_u, y_u)\sim\mathcal{D}}[l(f_{\theta^*}(x_u), y_u)]$ is upper bounded by $\Theta_{S,l}(f_{\theta_0}(\cdot)) + \sigma$.
We provide proofs of these two theorems in Appendix A. Combining both theorems, we conclude that $\Theta_{S,l}(f_{\theta_0}(\cdot))$ upper bounds both the training and generalization errors of the network $f_{\theta^*}$. Therefore, minimizing $\Theta$ theoretically helps to improve the model performance.
Albeit theoretically sound, $\Theta_{S,l}(f_{\theta_0}(\cdot))$ requires the sample-wise optima $\theta^*_i$, which are intractable at initialization. To this end, we will show how to develop a differentiable and tractable objective based on Eq. (2) in Section 4, and introduce the initialization optimization algorithm in Section 5.
4 GradCosine
4.1 First-Order Approximation of Sample-Wise Optimization
Since we are dealing with sample-wise optimization, it is reasonable to calculate each sample-wise optimum by a first-order approximation. We hypothesize that each sample-wise optimum can be reached via only one step of gradient descent from the initialized parameters. The rationale is that it is very easy for a deep neural network to fit the optimum of a single training sample with gradient descent. Based on this hypothesis, we can approximate each local optimum as:
$$\theta^*_i \approx \theta_0 - \eta g_i, \quad i \in [1, n], \qquad (3)$$
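Under this approximation, the path $\theta^*_i - \theta_0$ is parallel to the negative sample-wise gradient, so the cosine term in Eq. (2) reduces to the cosine similarity between sample-wise gradients. Below is a minimal sketch of that GradCosine estimate, assuming $g_i$ denotes the gradient of the loss on sample $i$ with respect to the initialized parameters and $\eta$ a step size; it illustrates the idea and is not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def grad_cosine(model: torch.nn.Module, samples, loss_fn) -> torch.Tensor:
    """Average pairwise cosine similarity of sample-wise gradients at initialization.

    samples: an iterable of (input, target) pairs, one training sample each.
    Under Eq. (3), theta*_i - theta_0 is parallel to -g_i, so the cosine term of
    Eq. (2) equals the cosine similarity between the sample-wise gradients g_i.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for x, y in samples:
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        g = torch.autograd.grad(loss, params)
        grads.append(torch.cat([gi.flatten() for gi in g]))
    G = F.normalize(torch.stack(grads), dim=1)   # (n, m) unit-norm sample-wise gradients
    cos = G @ G.t()                              # pairwise cosine similarities
    n = cos.shape[0]
    return (cos.sum() - n) / (n * (n - 1))       # average over i != j (diagonal is trivially 1)
```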