
For instance, RMSProp and Adagrad use $m_k = \nabla J_{B_k}(\Theta_k)$, whereas Adam maintains in $m_k$ an exponential moving average of the past evaluations of the gradient. The RMSProp, Adam and Adagrad optimizers build $P_k$ such that $P_k^2$ is a diagonal matrix whose elements are exponential moving averages of the squares of the past gradients (see [35] for example). It is an estimator of the diagonal part of the Fisher information matrix.
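As an illustration, the diagonal preconditioner shared by these optimizers can be sketched as follows. This is a minimal NumPy sketch with our own naming (the function `update_preconditioner`, the decay rate `beta2` and the damping `eps` are illustrative, not taken from the paper); in Adam the raw gradient would be replaced by the momentum term $m_k$.

```python
import numpy as np

def update_preconditioner(v, grad, beta2=0.999, eps=1e-8):
    """Maintain the diagonal of P_k^2 as an exponential moving average of
    squared gradients (RMSProp/Adam-style) and return the elementwise
    preconditioned direction, i.e. an approximation of P_k^{-1} grad."""
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # EMA of squared gradients = diag(P_k^2)
    direction = grad / (np.sqrt(v) + eps)       # divide elementwise by P_k = sqrt(diag(P_k^2))
    return v, direction
```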
All these methods can be incorporated in our framework, as we consider the choice of $P_k$ as a preconditioning technique whose step is yet to be found. In a nutshell, if $P_k$ approximates the Hessian up to an unknown multiplicative factor, our method is able to find this multiplicative factor.
Barzilai-Borwein: The Barzilai-Borwein (BB) class of methods [4, 34, 11, 47, 6, 23] may be interpreted as methods which aim at estimating the curvature in (6) by numerical differences using past gradient computations. In the stochastic convex setting, the BB method is introduced in [42] for the choice $\dot\Theta_k = \nabla J(\Theta_k)$, and also for variance-reducing methods [18]. It has been extended in [27] to non-convex problems and in [24] to DNNs. Due to the variance of the gradient, and possibly to a poor estimation of the curvature by numerical differences, these methods only allow prescribing a new step at each epoch. In [48, 8], the step is prescribed at each iteration at the cost of computing two mini-batch gradients per iteration. Moreover, in [48] the gradient over all the data needs to be computed at the beginning of each epoch, whereas [8] maintains an exponential moving average to avoid this extra computation. The downside of [8] is that the learning rate and its decay factor still need to be tuned, and that the method has not been tried on choices other than $P_k = \mathrm{Id}$.
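To make the "numerical differences" concrete, a classical BB step can be sketched as follows. This is a minimal NumPy illustration with our own variable names, not the specific stochastic variants of [42, 27, 24, 48, 8]; the clamping by `eps` is our safeguard.

```python
import numpy as np

def bb_step(theta_prev, theta_curr, grad_prev, grad_curr, eps=1e-12):
    """Classical BB1 step: the curvature along s = theta_curr - theta_prev is
    estimated by the secant ratio <s, y> / <s, s>, with y the gradient change,
    and the step is its inverse."""
    s = theta_curr - theta_prev                        # parameter difference
    y = grad_curr - grad_prev                          # gradient difference
    curvature = np.dot(s, y) / (np.dot(s, s) + eps)    # secant curvature estimate
    return 1.0 / max(curvature, eps)                   # BB1 step = <s, s> / <s, y>
```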
Our belief is that approximating the curvature by numerical differences in a stochastic setting suffers too much from the variance of the data and from the approximation error. Hence we advocate in this study for exact computation of the curvature (6).
Automatic differentiation: The theory that allows computing the matrix-vector product of the Hessian with a given direction is well studied [45, 9, 15, 31]; it costs 4 passes (2 forward and 2 backward passes) and 3 units of memory footprint, whereas the computation of the gradient costs 2 passes (1 forward and 1 backward pass) and 1 unit of memory footprint. We study the cost of computing the curvature defined in (6), which, to the best of our knowledge, has never been studied. Our method has a numerical cost that is always lower than that of the best BB method [8].
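For illustration, a Hessian-vector product, and hence the curvature along a direction, can be obtained with one extra backward pass through the gradient graph. The following PyTorch sketch assumes that (6) denotes the normalized quadratic form $\langle \dot\Theta, \nabla^2 J(\Theta)\,\dot\Theta\rangle / \|\dot\Theta\|^2$ appearing in the Taylor expansion of Section 2.1; the function name and its arguments are ours, not the paper's implementation.

```python
import torch

def curvature_along_direction(loss, params, direction):
    """Exact curvature <d, H d> / ||d||^2 along `direction`, where H is the
    Hessian of `loss` w.r.t. `params`, computed by double backward without
    ever forming H. `params` and `direction` are matching lists of tensors,
    and `loss` must still hold its computational graph."""
    grads = torch.autograd.grad(loss, params, create_graph=True)   # first backward, kept differentiable
    dot = sum((g * d).sum() for g, d in zip(grads, direction))     # <grad, d>
    hvp = torch.autograd.grad(dot, params)                         # second backward: H d
    num = sum((h * d).sum() for h, d in zip(hvp, direction))       # <d, H d>
    nrm = sum((d * d).sum() for d in direction)                    # ||d||^2
    return (num / nrm).item()
```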
1.3 Our contributions
We propose a change of point of view. While most of the methods presented above use first-order information to develop second-order algorithms, we use second-order information to tune a first-order method. The curvature (6) is computed using automatic differentiation in order to estimate the local Lipschitz constant of the gradient and to choose a step accordingly. Our contribution is threefold:
• We propose a method that automatically rescales the learning rate using curvature information in Section 2.1 and we discuss the heuristics of this method in Section 2.2. The rescaling allows the practitioner to choose between three different physical regimes, coined hyperexploration, exploration/convergence trade-off and hyperconvergence.
• We study the automatic differentiation of the curvature in Section 3, as well as its computational cost.
• Numerical tests are provided in Section 4, with a discussion of the three different physical regimes introduced in Section 2.1.
2 Rescaling the learning rate
2.1 Presentation and guideline for rescaling
The second-order analysis of Section 1 relies on the Taylor expansion
\[
J(\Theta_k - \tau_k \dot\Theta_k) \simeq J(\Theta_k) - \tau_k \langle \dot\Theta_k, \nabla J(\Theta_k) \rangle + \frac{\tau_k^2}{2}\, c(\Theta_k, \dot\Theta_k)\, \|\dot\Theta_k\|^2,
\]
with $c(\Theta_k, \dot\Theta_k)$ given by (6).
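For the reader's convenience, minimizing the right-hand side of this quadratic model with respect to $\tau_k$ (assuming $c(\Theta_k, \dot\Theta_k) > 0$) gives the step that we presume corresponds to (5), which is not reproduced here:
\[
\tau_k = \frac{\langle \dot\Theta_k, \nabla J(\Theta_k) \rangle}{c(\Theta_k, \dot\Theta_k)\, \|\dot\Theta_k\|^2}.
\]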
This Taylor expansion yields the following algorithm: given $\Theta_k$ and an update direction $\dot\Theta_k$, compute $c(\Theta_k, \dot\Theta_k)$ by (6), the step $\tau_k$ by (5), and finally update the parameters