Section V. Finally, we summarize and discuss the future
of our method in Section VI.
II. Related Works
First-order optimization methods commonly rely on adaptive learning-rate strategies: Adam, AdaMax, RMSProp, AdaDelta, AdaGrad, SGD with Nesterov's accelerated gradient, Nadam [2]–[7], etc. Some of them, for instance Adam and AdaGrad [2], use a moving average of squared gradients as a heuristic approximation of the FIM, the matrix that captures the quadratic curvature of the loss function. Their prominent use in practice and strong empirical results underline the usefulness of quadratic curvature information in network optimization.
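As an illustration of this heuristic, the Adam update [2] maintains an exponential moving average of squared gradients; the notation below (gradient \(g_t\), moments \(m_t, v_t\), decay rates \(\beta_1, \beta_2\), step size \(\eta\)) follows the standard presentation of Adam rather than any notation defined in this paper:
\[
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t,
\qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2},
\qquad
\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
\]
with bias-corrected estimates \(\hat{m}_t = m_t/(1-\beta_1^t)\) and \(\hat{v}_t = v_t/(1-\beta_2^t)\). The element-wise division by \(\sqrt{\hat{v}_t}\) acts as the diagonal, FIM-like curvature heuristic mentioned above.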
The idea of optimizing networks using quadratic curvature while dividing the network parameters into derivative-independent components was first proposed by Kurita et al. in 1993 [19]. The authors' motivation was to apply the iterative weighted least squares algorithm, commonly used in Generalized Linear Model training, to network optimization. However, this approach supports only dense networks with a single hidden layer, and the experiments were limited to a simple XOR problem that is rarely used as a benchmark today. Nonetheless, it inspired the recently introduced unit-wise FIM approximation [20].
NGD [8], proposed in 1998, later became the most prominent second-order method in neural network optimization. NGD uses the FIM as the preconditioner of the gradient-descent update in place of the Hessian matrix. Whereas the Hessian-preconditioned update measures distances in Euclidean parameter space, the FIM preconditioner in NGD is designed to follow the information geometry: it minimizes the distance in the space of probability distributions, where distance is defined by the Kullback-Leibler divergence. Although the original motivation of NGD was not to approximate the Hessian, it was later shown that the FIM coincides with a generalized Gauss-Newton approximation [21] of the Hessian in the overwhelming majority of practical cases [22]. This coincidence explicitly places NGD among second-order optimization methods.
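For reference, the NGD update and the FIM it relies on can be written in a standard, generic form (the notation here is illustrative; the formal definitions used by our method are given in Section III):
\[
F(\theta) = \mathbb{E}\!\left[\nabla_\theta \log p_\theta(y \mid x)\, \nabla_\theta \log p_\theta(y \mid x)^{\top}\right],
\qquad
\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1}\, \nabla_\theta \mathcal{L}(\theta_t),
\]
so each step requires solving a linear system with the full \(n \times n\) matrix \(F\), which is exactly the cost addressed by the approximation schemes discussed below.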
NGD can produce a more efficient learning step than almost any first-order optimization method. However, inverting the FIM in NGD requires significant computation time, which restricts the application of NGD to simple neural network models. To remedy this problem, several existing works [14]–[18], [20], [23], [24] have proposed approximation schemes that make the FIM cheap to invert. For instance, in 2015, inspired by [19] and theoretically evaluated by Amari et al. [24], Ollivier et al. [20] extended NGD to the unit-wise NGD method. The authors introduced a theoretical framework for building invariant algorithms, in which NGD is treated as an invariant method that optimizes distances in a Riemannian geometry. Based on this framework, they grouped the weights by output node and computed each group's FIM for the update, thereby ignoring the interactions among the node groups. However, the approach does not apply to convolutional layers and is limited to simple neural network models due to its inefficient gradient calculation.
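Schematically, and in our own notation rather than that of [20], unit-wise NGD replaces the full FIM with a block-diagonal approximation, one block per output node:
\[
F \;\approx\; \operatorname{diag}\!\left(F_1, \dots, F_m\right),
\qquad
\Delta\theta^{(i)} = -\eta\, F_i^{-1}\, \nabla_{\theta^{(i)}} \mathcal{L},
\]
where \(\theta^{(i)}\) collects the weights feeding output node \(i\) and \(F_i\) is the FIM restricted to that group; the cross-group blocks are simply dropped.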
A more modern and practicable extension of NGD is KFAC, initially introduced in 2015 by Martens et al. [14]–[17]. It is considered the state-of-the-art second-order method in neural network optimization. KFAC works by approximating the off-diagonal blocks of the FIM as zero and then factorizing each diagonal block as a Kronecker product of two smaller matrices. The second step reduces the computational cost of inverting the FIM; however, it also considerably reduces the accuracy of the approximation.
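As a sketch (using the conventional KFAC factors rather than notation from this paper), the diagonal block of the FIM associated with a dense layer \(\ell\) is approximated as a Kronecker product of the input-activation and output-gradient second-moment matrices:
\[
F_\ell \;\approx\; A_{\ell-1} \otimes G_\ell,
\qquad
F_\ell^{-1} \;\approx\; A_{\ell-1}^{-1} \otimes G_\ell^{-1},
\]
so only the two small factors need to be inverted, which is what makes the preconditioner tractable at the price of the accuracy loss noted above.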
Besides NGD, which uses the FIM as the preconditioner, another research direction in second-order optimization is to approximate the Hessian preconditioner. The most well-known methods are Broyden-Fletcher-Goldfarb-Shanno (BFGS) and L-BFGS [25]. These methods work on any differentiable function and, in practice, are also used to optimize neural network loss functions. Instead of computing the Hessian at every step, the former iteratively updates an approximation of the Hessian using information from previous iterations. Although this is more computationally efficient than directly deriving the second-order derivatives of the loss function, it requires storing a dense approximation whose size grows quadratically with the number of parameters, and is hence very memory-consuming. The latter was introduced to remedy that problem by storing only the update vectors of the last k iterations. Nevertheless, these methods are not used in practice for large-scale problems (large networks or large datasets) because they do not perform well in mini-batch settings and produce low-accuracy results [26].
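For completeness, the BFGS recursion for the Hessian approximation \(B_k\), with \(s_k = \theta_{k+1} - \theta_k\) and \(y_k = \nabla\mathcal{L}(\theta_{k+1}) - \nabla\mathcal{L}(\theta_k)\) (standard notation from [25], not ours), is
\[
B_{k+1} = B_k - \frac{B_k s_k s_k^{\top} B_k}{s_k^{\top} B_k s_k} + \frac{y_k y_k^{\top}}{y_k^{\top} s_k}.
\]
L-BFGS avoids storing the dense \(B_k\) (or its inverse) by reconstructing its action on a vector from only the most recent \((s_i, y_i)\) pairs.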
In the case of dense layers, CW-NGD is identical to the unit-wise NGD introduced in [13], [20], [24]. We explicitly state our additional contributions as follows. Firstly, CW-NGD analyzes the layer's structure to derive an efficient FIM-approximation scheme, whereas unit-wise NGD groups the network weights by output node; the two happen to yield the same approximation scheme in the case of dense layers. Secondly, CW-NGD is applicable to convolutional layers, while unit-wise NGD is not. Thirdly, studies on unit-wise NGD [13], [20], [24] focus on theoretical analysis of the method and lack practical evaluation. In contrast, CW-NGD is validated on the MNIST dataset [27], which demonstrates its greater practical applicability compared to unit-wise NGD.
III. Methodology
In this section, we first introduce basic notation and the neural network definition. In the next subsection, we briefly review the NGD method, the definition of the FIM, and layer-wise NGD. After that, we describe our proposed method, Component-Wise Natural Gradient Descent (CW-NGD), in the following two subsections for the cases of dense and convolutional layers,