shows many challenges. First, the test performance is mostly decided by the converged parameters
after training, while the training dynamics are more related to the parameters at initialization or during
training. Second, to efficiently find a better starting point, the quantity is expected to be differentiable
to enable the optimization in the continuous parameter space. To this end, we leverage the recent
advances in optimization landscape analysis [2] to propose a novel differentiable quantity and develop a corresponding algorithm for its optimization at initialization.
Specifically, our quantity is inspired by analyzing the optimization landscapes of individual training
samples [59]. Through generalizing prior theoretical results on batch-wise optimization [2] to sample-wise optimization, we prove that both the network's training and generalization errors are upper bounded by a theoretical quantity that correlates with the cosine similarity of the sample-wise local optima. Moreover, this quantity also relates to the training dynamics since it reflects the optimization path consistency [35, 37] from the starting point. Unfortunately, the sample-wise local optima are
intractable. Under the hypothesis that the sample-wise local optima can be reached by a first-order approximation from the initial parameters, we can approximate the quantity via the sample-wise
gradients at initialization. Our final result shows that, under a limited gradient norm, both the
training and test performance of a network can be improved by maximizing the cosine similarity of sample-wise gradients, which we name GradCosine; this quantity is differentiable and easy to implement.
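As a concrete, simplified illustration, the sketch below shows one way GradCosine could be computed at initialization: flatten each sample's gradient and average the pairwise cosine similarities. The helper names, the per-sample loop, and the pairwise-averaging definition are our assumptions for exposition, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def sample_gradients(model, loss_fn, xs, ys):
    """Flattened gradient of the loss for each individual sample (illustrative helper)."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        g = torch.autograd.grad(loss, params)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    return torch.stack(grads)                     # shape: (num_samples, num_params)

def grad_cosine(grads):
    """Mean pairwise cosine similarity of sample-wise gradients (assumed definition)."""
    g = F.normalize(grads, dim=1)                 # unit-norm gradient per sample
    sim = g @ g.t()                               # all pairwise cosine similarities
    n = sim.size(0)
    return (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))   # exclude the diagonal
```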
We then propose the Neural Initialization Optimization (NIO) algorithm based on GradCosine to find
a better initialization agnostic to the architecture. We generalize the algorithm from the sample-wise analysis to the batch-wise setting by dividing a batch into sub-batches for ease of implementation. Following [8, 60], we use gradient descent to learn a set of scalar coefficients for the initialized parameters. These coefficients are optimized to maximize GradCosine for better training dynamics and expected performance, while constraining the gradient norm to prevent explosion.
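The sketch below illustrates how such coefficient learning might look in code, reusing grad_cosine from the previous sketch. The initial parameters are rescaled by learnable coefficients (one per parameter tensor here, an assumption for brevity), sub-batch gradients are computed through torch.func.functional_call (PyTorch 2.x) with create_graph=True so that GradCosine stays differentiable with respect to the coefficients, and a simple norm penalty stands in for the gradient-norm constraint. Hyperparameters, helper names, and the data variables are placeholders, not the paper's actual settings.

```python
import torch
from torch.func import functional_call

def nio_objective(model, loss_fn, sub_batches, alphas, init_params, norm_penalty=1e-3):
    """Negative GradCosine plus a gradient-norm penalty, differentiable w.r.t. alphas."""
    # Rescale the frozen initialization with the learnable coefficients.
    params = {name: a * p0 for (name, p0), a in zip(init_params.items(), alphas)}
    grads = []
    for x, y in sub_batches:                      # list of (inputs, labels) sub-batches
        loss = loss_fn(functional_call(model, params, (x,)), y)
        g = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    grads = torch.stack(grads)
    cos = grad_cosine(grads)                      # quantity to maximize (see above)
    norm = grads.norm(dim=1).mean()               # keep gradients from exploding
    return -cos + norm_penalty * norm

# Hypothetical outer loop: model, loss_fn, and sub_batches come from the user's setup.
init_params = {k: v.detach().clone() for k, v in model.named_parameters()}
alphas = [torch.ones(1, requires_grad=True) for _ in init_params]
opt = torch.optim.Adam(alphas, lr=1e-2)
for _ in range(50):
    opt.zero_grad()
    nio_objective(model, loss_fn, sub_batches, alphas, init_params).backward()
    opt.step()
with torch.no_grad():                             # write the rescaled init back
    for (name, p), a in zip(model.named_parameters(), alphas):
        p.copy_(a * init_params[name])
```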
Experiments show that for a variety of deep architectures including ResNet [19], DenseNet [21], and WideResNet [56], our method achieves better classification results on CIFAR-10/100 [27] than prior heuristic [18] and learning-based [8, 60] initialization methods. Our method also yields better ImageNet [9] performance when initializing ResNet-50 [19]. Moreover, our method is able to help the recently proposed Swin-Transformer [32] achieve stable training and competitive results on ImageNet even without warmup [17], which is crucial for the successful training of Transformer architectures [31, 52].
2 Related Work
2.1 Network Initialization
Existing initialization methods are designed to control the norms of network parameters via Gaussian
initialization [16, 18] or orthonormal matrix initialization [40, 36] with different variance patterns.
These analyses are most effective for simple feed-forward networks without skip connections or normalization layers. Recently, initialization techniques tailored to specific complex architectures have been proposed. For example, [58] studied how to initialize networks with skip connections, and [23] generalized the results to the Transformer architecture [49]. However, these heuristic methods are
restricted to specific architectures. Automated machine learning has achieved success in searching for hyperparameters [3, 13] and architectures [62, 38, 5, 54, 55, 22], while similar techniques for neural network initialization remain underexplored. Current learning-based initialization methods [8, 60]
optimize the curvature [8] or the loss reduction of the first stochastic step [60] using gradient descent to tune the norms of the initial parameters. However, these methods lack a theoretical connection to model performance. Different from these methods, our proposed GradCosine is derived from a theoretical quantity that upper bounds both the training and generalization errors.
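For reference, the variance-scaling Gaussian schemes cited at the beginning of this subsection typically amount to a few lines of code. The snippet below uses PyTorch's built-in Kaiming (He) initializer purely as an illustration of this family of methods, not as part of NIO.

```python
import torch.nn as nn

def he_gaussian_init(module):
    """Fan-in scaled Gaussian initialization (variance 2 / fan_in for ReLU networks)."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage: model.apply(he_gaussian_init)
```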
2.2 Evaluating Model Performance at Initialization
Evaluating the performance of a network at initialization is an important and challenging problem, with wide application in zero-shot neural architecture search [1, 34, 6, 42, 59, 41, 29] and pruning [50, 28, 44].
The evaluation quantities in these studies are mainly based on the initial gradient norm [44, 41], the eigenvalues of the neural tangent kernel [6, 41], and the Fisher information matrix [47, 48, 45]. However, these quantities cannot reflect the optimization landscape, which is crucial for training dynamics and generalization [4, 12, 14, 43, 30, 2]. [2] provided theoretical evidence that for a sufficiently large neighborhood of a random initialization, the optimization landscape is nearly convex and semi-