Section V. Finally, we summarize and discuss the future
of our method in Section VI.
II. Related Works
First-order optimization methods commonly rely on adaptive learning-rate strategies: Adam, AdaMax, RMSProp, AdaDelta, AdaGrad, SGD with Nesterov's accelerated gradient, Nadam [2]–[7], etc. Some of them, for instance Adam and AdaGrad [2], use a moving average of squared gradients as a heuristic approximation of the FIM, the matrix that captures the quadratic curvature of the loss function. Their prominent use in practice and strong empirical results underline the usefulness of quadratic curvature information in network optimization.
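As an illustration of this heuristic, the Adam update [2] maintains an exponential moving average of squared gradients; the notation below (gradient \(g_t\), moments \(m_t, v_t\), decay rates \(\beta_1, \beta_2\), step size \(\eta\)) follows the standard presentation of Adam rather than any notation defined in this paper:
\[
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t,
\qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2},
\qquad
\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
\]
with bias-corrected estimates \(\hat{m}_t = m_t/(1-\beta_1^t)\) and \(\hat{v}_t = v_t/(1-\beta_2^t)\). The element-wise division by \(\sqrt{\hat{v}_t}\) acts as the diagonal, FIM-like curvature heuristic mentioned above.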
The idea of optimizing networks using quadratic curvature while dividing the network parameters into derivative-independent components was first proposed by Kurita et al. in 1993 [19]. The authors' motivation was to apply the iterative weighted least squares algorithm, commonly used in Generalized Linear Model training, to network optimization. However, this approach supports only dense networks with a single hidden layer, and the experiments were limited to a simple XOR problem that is rarely used as a benchmark today. Nonetheless, it inspired the recently introduced unit-wise FIM approximation [20].
NGD [8], proposed in 1998, later became the most prominent second-order method in neural network optimization. NGD uses the FIM as the preconditioner of the gradient-descent update in place of the Hessian matrix. Whereas the Hessian-preconditioned update measures distances in Euclidean parameter space, the FIM preconditioner in NGD is designed to follow the information geometry: it minimizes the distance in the space of probability distributions, where distance is defined by the Kullback-Leibler divergence. Although the original motivation of NGD was not to approximate the Hessian, it was later shown that the FIM coincides with a generalized Gauss-Newton approximation [21] of the Hessian in the overwhelming majority of practical cases [22]. This coincidence explicitly places NGD among second-order optimization methods.
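For reference, the NGD update and the FIM it relies on can be written in a standard, generic form (the notation here is illustrative; the formal definitions used by our method are given in Section III):
\[
F(\theta) = \mathbb{E}\!\left[\nabla_\theta \log p_\theta(y \mid x)\, \nabla_\theta \log p_\theta(y \mid x)^{\top}\right],
\qquad
\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1}\, \nabla_\theta \mathcal{L}(\theta_t),
\]
so each step requires solving a linear system with the full \(n \times n\) matrix \(F\), which is exactly the cost addressed by the approximation schemes discussed below.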
NGD can produce a more efficient learning step than almost any first-order optimization method. However, inverting the FIM in NGD requires significant computation time, which restricts the application of NGD to simple neural network models. To remedy this problem, several existing works [14]–[18], [20], [23], [24] have proposed approximation schemes that make the FIM cheap to invert. For instance, in 2015, inspired by [19] and theoretically evaluated by Amari et al. [24], Ollivier et al. [20] extended NGD to the unit-wise NGD method. The authors introduced a theoretical framework for building invariant algorithms, in which NGD is treated as an invariant method that optimizes distances in a Riemannian geometry. Based on this framework, they grouped the weights by output node and computed each group's FIM for the update, thereby ignoring the interactions among the node groups. However, the approach does not apply to convolutional layers and is limited to simple neural network models due to its inefficient gradient calculation.
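Schematically, and in our own notation rather than that of [20], unit-wise NGD replaces the full FIM with a block-diagonal approximation, one block per output node:
\[
F \;\approx\; \operatorname{diag}\!\left(F_1, \dots, F_m\right),
\qquad
\Delta\theta^{(i)} = -\eta\, F_i^{-1}\, \nabla_{\theta^{(i)}} \mathcal{L},
\]
where \(\theta^{(i)}\) collects the weights feeding output node \(i\) and \(F_i\) is the FIM restricted to that group; the cross-group blocks are simply dropped.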
A more modern and practicable extension of NGD is KFAC, initially introduced in 2015 by Martens et al. [14]–[17]. It is considered the state-of-the-art second-order method in neural network optimization. KFAC works by approximating the off-diagonal blocks of the FIM as zero and then factorizing each diagonal block as a Kronecker product of two smaller matrices. The second step reduces the computational cost of inverting the FIM; however, it also considerably reduces the accuracy of the approximation.
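As a sketch (using the conventional KFAC factors rather than notation from this paper), the diagonal block of the FIM associated with a dense layer \(\ell\) is approximated as a Kronecker product of the input-activation and output-gradient second-moment matrices:
\[
F_\ell \;\approx\; A_{\ell-1} \otimes G_\ell,
\qquad
F_\ell^{-1} \;\approx\; A_{\ell-1}^{-1} \otimes G_\ell^{-1},
\]
so only the two small factors need to be inverted, which is what makes the preconditioner tractable at the price of the accuracy loss noted above.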
Besides NGD, which uses the FIM as the preconditioner, another research direction in second-order optimization is to approximate the Hessian preconditioner. The most well-known methods are Broyden-Fletcher-Goldfarb-Shanno (BFGS) and L-BFGS [25]. These methods work on any differentiable function and, in practice, are also used to optimize neural network loss functions. Instead of computing the Hessian at every step, the former iteratively updates an approximation of the Hessian using information from previous iterations. Although this is more computationally efficient than directly deriving the second-order derivatives of the loss function, it requires storing a dense approximation whose size grows quadratically with the number of parameters, and is hence very memory-consuming. The latter was introduced to remedy that problem by storing only the update vectors of the last k iterations. Nevertheless, these methods are not used in practice for large-scale problems (large networks or large datasets) because they do not perform well in mini-batch settings and produce low-accuracy results [26].
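For completeness, the BFGS recursion for the Hessian approximation \(B_k\), with \(s_k = \theta_{k+1} - \theta_k\) and \(y_k = \nabla\mathcal{L}(\theta_{k+1}) - \nabla\mathcal{L}(\theta_k)\) (standard notation from [25], not ours), is
\[
B_{k+1} = B_k - \frac{B_k s_k s_k^{\top} B_k}{s_k^{\top} B_k s_k} + \frac{y_k y_k^{\top}}{y_k^{\top} s_k}.
\]
L-BFGS avoids storing the dense \(B_k\) (or its inverse) by reconstructing its action on a vector from only the most recent \((s_i, y_i)\) pairs.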
In the case of dense layers, CW-NGD is identical to the unit-wise NGD introduced in [13], [20], [24]. We explicitly state our additional contributions as follows. Firstly, CW-NGD analyzes the layer's structure to derive an efficient FIM-approximation scheme, whereas unit-wise NGD groups the network weights by output node; the two happen to yield the same approximation scheme in the case of dense layers. Secondly, CW-NGD is applicable to convolutional layers, while unit-wise NGD is not. Thirdly, studies on unit-wise NGD [13], [20], [24] focus on theoretical analysis of the method and lack practical evaluation. In contrast, CW-NGD is validated on the MNIST dataset [27], which demonstrates its greater practical applicability compared to unit-wise NGD.
III. Methodology
In this section, we first introduce basic notation and the neural network definition. In the next subsection, we briefly review the NGD method, the definition of the FIM, and layer-wise NGD. After that, we describe our proposed method, Component-Wise Natural Gradient Descent (CW-NGD), in the following two subsections for the cases of dense and convolutional layers,