Component-Wise Natural Gradient Descent - An
Efficient Neural Network Optimization
1st Tran Van Sang
The University of Tokyo
Japan
0000-0002-4211-049X
2nd Mhd Irvan
The University of Tokyo
Japan
irvan@yamagula.ic.i.u-tokyo.ac.jp
3rd Rie Shigetomi Yamaguchi
The University of Tokyo
Japan
0000-0002-6359-2221
4th Toshiyuki Nakata
The University of Tokyo
Japan
0000-0001-6383-7105
Abstract—Natural Gradient Descent (NGD) is a second-order neural network training method that preconditions the gradient descent with the inverse of the Fisher Information Matrix (FIM). Although NGD provides an efficient preconditioner, it is not practicable due to the expensive computation required to invert the FIM. This paper proposes a new NGD variant
algorithm named Component-Wise Natural Gradient
Descent (CW-NGD). CW-NGD is composed of two steps. Similar to several existing works, the first step is to treat the FIM as a block-diagonal matrix whose
diagonal blocks correspond to the FIM of each layer’s
weights. In the second step, unique to CW-NGD, we
analyze the layer’s structure and further decompose
the layer’s FIM into smaller segments whose derivatives
are approximately independent. As a result, individual
layers’ FIMs are approximated in a block-diagonal
form that can be trivially inverted. The segment decomposition strategy varies with the layer structure. Specifically, we analyze dense and convolutional layers and design their decomposition strategies accordingly. In an experiment training a network containing these two types of layers, we empirically show that CW-NGD requires fewer iterations to converge than state-of-the-art first-order and second-order methods.
Index Terms—neural network, network optimization, Hessian matrix, Fisher information matrix, block-diagonal matrix, quadratic curvature, convolutional layer, natural gradient descent
I. Introduction
Recently, Machine Learning has been growing rapidly and plays an indispensable role in modern society. Among various machine learning architectures, the Neural Network is the most powerful and versatile architecture contributing to the success of machine learning today. Especially since 2015, when ResNet [1], one of the first successful very deep neural networks, was introduced, Neural Networks have been intensively investigated with regard to the number of layers, architectures, and network complexity, becoming one of the best-performing architectures in Machine Learning. Neural Networks have been used in a wide range of
applications, including image recognition, text recognition,
speech recognition, natural language processing, machine
translation, computer vision, and many others. Alongside the development of new network models, a crucial component of neural networks that attracts much research attention is network training (optimization), the research field investigating algorithms that find the optimal weights of the network.
Gradient Descent (GD) is a popular family of network optimization methods that optimize the weights by following the
steepest descent direction of the loss function. First-order
GD methods precondition the direction with a scalar
learning rate, usually defined via an adaptive strategy [2]–
[7], while second-order GD methods precondition the di-
rection with the inverse of the Hessian matrix. Natural
Gradient Descent (NGD) [8] is a second-order GD variant
that uses the Fisher Information Matrix (FIM) [9] in the
place of the Hessian. A great number of works [10]–[17]
have shown that NGD and its variants produce faster
convergence than first-order GD. In addition, NGD requires fewer hyperparameters, which simplifies the tuning process. However, the application of NGD is limited due to the burdensome computation necessary for inverting the FIM. Reducing the amount of computation in NGD and making it applicable in practice is an active open area of research.
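For concreteness, the update rules compared above can be written in standard notation (not taken from the paper's later formal development): with weights $\theta_t$ at iteration $t$, learning rate $\eta$, loss $\mathcal{L}$, Hessian $H$, and FIM $F$,

\[ \theta_{t+1} = \theta_t - \eta\, \nabla_\theta \mathcal{L}(\theta_t) \quad \text{(first-order GD)} \]
\[ \theta_{t+1} = \theta_t - \eta\, H^{-1} \nabla_\theta \mathcal{L}(\theta_t) \quad \text{(second-order GD)} \]
\[ \theta_{t+1} = \theta_t - \eta\, F^{-1} \nabla_\theta \mathcal{L}(\theta_t) \quad \text{(NGD)} \]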
To tackle this challenge, we investigate a novel method named Component-Wise Natural Gradient Descent (CW-NGD), an improved variant of NGD that produces high-accuracy results with a reasonable running time. CW-NGD is composed of two steps. The first step, similar to several existing works [13]–[18], is to treat the FIM as a block-diagonal matrix whose diagonal blocks
correspond to the FIM of each layer’s weights. In the
second step, which is unique to CW-NGD, for dense and
convolutional layers, we analyze the layer’s structure and
further decompose the layer’s FIM into smaller segments
whose derivatives are approximately independent. As a
result, individual layers' FIMs are approximated in a block-diagonal form that can be trivially inverted.
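As an illustration of the resulting structure, the following NumPy sketch inverts a block-diagonal FIM approximation segment by segment. It is a minimal sketch under assumed shapes and a hypothetical `seg_grads` input (per-example gradients already split into derivative-independent segments); it is not the authors' implementation and omits the layer-specific decomposition described in Section III.

```python
import numpy as np

def cwngd_preconditioned_grads(seg_grads, damping=1e-3):
    """Invert a block-diagonal FIM approximation segment by segment.

    seg_grads: list of arrays, each of shape (batch, segment_size), holding
    per-example gradients of the loss w.r.t. one derivative-independent
    segment of a layer's weights (hypothetical input format).
    Returns one preconditioned gradient vector per segment.
    """
    updates = []
    for g in seg_grads:
        batch, seg_size = g.shape
        fim_block = g.T @ g / batch                # empirical FIM of this segment only
        fim_block += damping * np.eye(seg_size)    # damping keeps the block invertible
        mean_grad = g.mean(axis=0)                 # mini-batch gradient for the segment
        updates.append(np.linalg.solve(fim_block, mean_grad))
    return updates
```

Because each block is small, the per-segment solves are cheap compared to inverting one full per-layer FIM.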
In an experiment with a network containing convo-
lutional and dense layers on the MNIST dataset, CW-
NGD converges in fewer iterations than the best-
known first-order and second-order optimization algo-
rithms, Adam [2] and KFAC [14]–[17], respectively.
In the next section, Section II, we discuss several exist-
ing works in the literature before describing our method
in Section III. The experiment details are provided in Section IV. Next, we discuss the results of our method in
Section V. Finally, we summarize and discuss the future
of our method in Section VI.
II. Related Works
First-order optimization involves the use of a learning-rate adaptation strategy: Adam, AdaMax, RMSProp, AdaDelta, AdaGrad, SGD with Nesterov's accelerated gradient, Nadam [2]–[7], etc. Some of them, for instance Adam and AdaGrad [2], use the moving average of squared gradients as a heuristic approximation of the FIM, the matrix that captures the quadratic curvature of the loss function. Their widespread use in practice and strong results highlight the usefulness of quadratic curvature information in network optimization.
The initial idea of optimizing networks using quadratic
curvature and dividing the network parameters into
derivative-independent components was first proposed by
Kurita et al. in 1993 [19]. The motivation of the authors
was to apply the iterative weighted least squares algo-
rithm, which is usually used in Generalized Linear Model
training, to network optimization. However, this approach
supports only dense networks with a single hidden layer,
and the experiment was limited to a simple XOR problem, which is no longer a common benchmark. Nonetheless, it inspired
the recently introduced unit-wise FIM approximation [20].
NGD [8], proposed in 1998, later became the most prominent second-order method in Neural
Network optimization. NGD uses the FIM as the pre-
conditioner of the gradient descent update in the place
of the Hessian matrix. While the Hessian preconditioner
optimizes the Euclidean distance, the FIM preconditioner in NGD is designed to adapt to the information geometry. It aims to minimize the distance in the space of probability distributions, where distance is measured by the Kullback-Leibler divergence. Although the initial motivation of NGD is
not to approximate the Hessian, it was later proven that
the FIM coincides with a generalized Gauss-Newton [21]
approximation of the Hessian in the overwhelming majority of practical cases [22]. This coincidence justifies the classification of NGD as a second-order optimization method.
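For reference, the FIM used as the preconditioner is typically defined as follows (standard definition, with $q(x)$ the input distribution and $p_\theta(y \mid x)$ the model's predictive distribution):

\[ F(\theta) = \mathbb{E}_{x \sim q(x),\; y \sim p_\theta(y \mid x)} \left[ \nabla_\theta \log p_\theta(y \mid x)\, \nabla_\theta \log p_\theta(y \mid x)^{\top} \right], \]

which, for common losses such as cross-entropy, coincides with the generalized Gauss-Newton matrix mentioned above.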
NGD can produce a more efficient learning step com-
pared to almost any first-order optimization method. How-
ever, the FIM inversion in NGD requires significant computation time, which restricts the application of NGD to simple neural network models. To remedy this problem, several existing works [14]–[18], [20], [23], [24] have proposed various approximation schemes of the FIM that allow a cheap inversion. For instance, in 2015, inspired
by [19] and theoretically evaluated by Amari et al. [24],
Ollivier et al. [20] extended NGD with the unit-wise NGD
method. The authors introduced a theoretical framework for building invariant algorithms, in which they treat NGD as an invariant method that optimizes the distance in Riemannian geometry. Based on the suggested framework, they grouped the updates by output node and calculated each group's FIM for each update, thus ignoring the interactions among the node groups. However, the approach does not apply to convolutional layers and is limited to simple neural network models due to its inefficient gradient calculation.
A more modern and practicable extension of NGD is KFAC, initially introduced in 2015 by Martens et al. [14]–[17]. It is considered the state-of-the-art second-order
method in neural network optimization. KFAC works
by approximating the off-diagonal blocks of the FIM as
zero, then factorizing the diagonal blocks as a Kronecker
product of two smaller matrices. The second step reduces
the computational cost of inverting the FIM. However, it also greatly reduces the accuracy of the approximation.
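The computational benefit of the Kronecker factorization rests on the identity $(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}$: inverting one large per-layer block reduces to inverting two much smaller factors. The following NumPy sketch merely illustrates that identity with synthetic positive-definite factors; it is not KFAC itself, which estimates $A$ and $G$ from activation and gradient statistics.

```python
import numpy as np

# Illustrative sizes only: for a dense layer with 20 inputs and 10 outputs,
# the per-layer FIM block is 200 x 200, but KFAC represents it as A (x) G
# with A of size 20 x 20 and G of size 10 x 10.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20)); A = A @ A.T + np.eye(20)   # stands in for the activation factor
G = rng.standard_normal((10, 10)); G = G @ G.T + np.eye(10)   # stands in for the gradient factor

# Inverting the Kronecker product is equivalent to inverting the two factors,
# so only small matrices ever need to be inverted.
full_inverse = np.linalg.inv(np.kron(A, G))
factored_inverse = np.kron(np.linalg.inv(A), np.linalg.inv(G))
assert np.allclose(full_inverse, factored_inverse, atol=1e-8)
```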
Besides NGD, which uses the FIM as the preconditioner,
another research direction of second-order optimization is
to approximate the Hessian preconditioner. The most well-known methods are the Broyden-Fletcher-Goldfarb-Shanno (BFGS) and L-BFGS methods [25]. These methods work on any differentiable function and in practice are used to optimize the neural network loss function. Instead of calculating the Hessian in every step, the former works by iteratively updating an approximation of the Hessian from previous iterations. Although this is more computationally efficient than directly deriving the second-order derivative of the loss function, it requires storing the update vectors from every step and is therefore very memory-consuming. The latter was introduced to remedy that problem by storing only the update vectors of the last k iterations. Nevertheless, these
methods are not used in practice for large-scale problems
(large networks or large datasets), because they do not
perform well in mini-batch settings and produce low-
accuracy results [26].
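For completeness, a minimal example of L-BFGS in practice is shown below, using SciPy's implementation on a toy quadratic loss rather than a neural network; the limited-memory option `maxcor` corresponds to the number k of stored update pairs mentioned above.

```python
import numpy as np
from scipy.optimize import minimize

# Toy quadratic "loss" with ill-conditioned curvature; illustrative only.
A = np.diag([1.0, 10.0, 100.0])
loss = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

result = minimize(loss, x0=np.ones(3), jac=grad, method="L-BFGS-B",
                  options={"maxcor": 10})  # keep curvature pairs from the last 10 steps only
print(result.x)  # close to the minimizer at the origin
```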
In the case of dense layers, CW-NGD is identical to unit-
wise NGD introduced in [13], [20], [24]. We explicitly state our additional contributions as follows. Firstly, CW-NGD analyzes the layer's structure to provide an efficient FIM-approximation scheme, while unit-wise NGD groups the network weights by output node; the two approaches happen to coincide in the case of dense layers. Secondly, CW-NGD is applicable to convolutional layers, while unit-wise NGD is not. Thirdly, studies on unit-wise NGD [13], [20], [24] focus on theoretical analysis of the method and lack analysis of practical applications. In contrast, CW-NGD is verified on the MNIST dataset [27], which demonstrates the greater practicability of CW-NGD compared to unit-wise NGD.
III. Methodology
In this section, we first introduce several basic notations
and the neural network definition in the first subsection.
In the next subsection, we briefly introduce the NGD method, the FIM definition, and layer-wise NGD. After that, we describe our proposed method, Component-Wise Natural Gradient Descent (CW-NGD), in the following two subsections for the cases of dense and convolutional layers, respectively.