Adaptive scaling of the learning rate by second order
automatic differentiation
Frédéric de Gournay^{1,2} (degourna@insa-toulouse.fr)
Alban Gossard^{1,3} (alban.paul.gossard@gmail.com)
^1 Institut de Mathématiques de Toulouse; UMR5219; Université de Toulouse; CNRS
^2 INSA, F-31077 Toulouse, France
^3 UPS, F-31062 Toulouse Cedex 9, France
October 27, 2022
Abstract
In the context of the optimization of Deep Neural Networks, we propose to rescale the learning rate using a new technique of automatic differentiation. This technique relies on the computation of the curvature, a second order quantity whose computational complexity lies between that of the gradient and that of the Hessian-vector product. If (1C, 1M) represents respectively the computational time and memory footprint of the gradient method, the new technique increases the overall cost to either (1.5C, 2M) or (2C, 1M). This rescaling has the appealing characteristic of having a natural interpretation: it allows the practitioner to choose between exploration of the parameter set and convergence of the algorithm. The rescaling is adaptive; it depends on the data and on the direction of descent. The numerical experiments highlight the different exploration/convergence regimes.
1 Introduction
The optimization of Deep Neural Networks (DNNs) has received tremendous attention over the past
years. Training DNNs amounts to minimizing the expectation of non-convex random functions in a high-dimensional space $\mathbb{R}^d$. If $J:\mathbb{R}^d\to\mathbb{R}$ denotes this expectation, the problem reads
$$ \min_{\Theta\in\mathbb{R}^d} J(\Theta), \qquad (1) $$
with $\Theta$ the parameters. Optimization algorithms iteratively compute $\Theta_k$, an approximation of a minimizer of (1) at iteration $k$, by the update rule
$$ \Theta_{k+1} = \Theta_k - \tau_k\,\dot\Theta_k, \qquad (2) $$
where $\tau_k$ is the learning rate and $\dot\Theta_k$ is the update direction. The choice of $\dot\Theta_k$ encodes the type of algorithm used. This work focuses on the choice of the learning rate $\tau_k$.
There is a trade-off in the choice of this learning rate. Indeed, high values of $\tau_k$ allow exploration of the parameter space, whereas a slowly decaying step size ensures convergence in accordance with the famous Robbins-Monro algorithm [36]. This decaying condition may be met by defining the step as $\tau_k = \tau_0 k^{-\alpha}$, with $\tau_0$ the initial step size and $1/2 < \alpha < 1$ a constant. The choice of the initial learning rate and its decay are left to practitioners, and these hyperparameters have to be tuned manually in order to obtain the best rate of convergence. For instance, they can be optimized using a grid search or by using more intricate strategies [40], but in all generality tuning the learning rate and its decay factor is difficult and time consuming. The main issue is that the learning rate has no natural scaling. The goal of this work is to propose an algorithm that, given a direction $\dot\Theta_k$, automatically finds a scaling of the learning rate. This rescaling has the following advantages:
- The scaling is adaptive: it depends on the data and on the choice of the direction $\dot\Theta_k$.
- The scaling expresses the convergence vs. exploration trade-off. Multiplying the rescaled learning rate by $1/2$ enforces convergence, whereas multiplying it by $1$ allows for exploration of the space of parameters.
This rescaling comes at a cost and it has the following disadvantages:
- The computational cost and memory footprint of the algorithm go from (1C, 1M) to (1.5C, 2M) or (2C, 1M).
- The rescaling method is only available to algorithms that yield directions of descent; it excludes momentum methods and notably Adam-flavored algorithms.
- Rescaling is theoretically limited to functions whose second order derivative exists and does not vanish. This non-vanishing condition can be compensated by L2-regularization.
1.1 Foreword
First recall that second order methods for the minimization of a deterministic $\mathcal{C}^2$ function $\Theta\mapsto J(\Theta)$, with a Hessian that we denote $\nabla^2 J$, are based on the second order Taylor expansion at iteration $k$:
$$ J(\Theta_k - \tau_k\dot\Theta_k) \simeq J(\Theta_k) - \tau_k\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle + \frac{\tau_k^2}{2}\langle\nabla^2 J(\Theta_k)\dot\Theta_k,\dot\Theta_k\rangle. \qquad (3) $$
If the Hessian of $J$ is positive definite, the minimization of the right-hand side leads to the choice
$$ \dot\Theta_k = P_k^{-1}\nabla J(\Theta_k) \quad\text{with}\quad P_k \simeq \nabla^2 J(\Theta_k). \qquad (4) $$
Once a direction $\dot\Theta_k$ is chosen, another minimization in $\tau_k$ gives
$$ \tau_k = \frac{\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle}{\|\dot\Theta_k\|^2\, c(\Theta_k,\dot\Theta_k)}, \qquad (5) $$
where $c$ is the curvature of the function, defined as
$$ c(\Theta_k,\dot\Theta_k) \overset{\mathrm{def}}{=} \frac{\langle\nabla^2 J(\Theta_k)\dot\Theta_k,\dot\Theta_k\rangle}{\|\dot\Theta_k\|^2}. \qquad (6) $$
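As a concrete check of (5)-(6), the following NumPy sketch (illustrative only; the quadratic test function, variable names and random seed are not from the paper) verifies that, for $J(\Theta) = \frac{1}{2}\langle A\Theta,\Theta\rangle - \langle b,\Theta\rangle$, the curvature (6) is the Rayleigh quotient of the Hessian $A$ along the direction, and the step (5) is the exact line-search minimizer along that direction.

```python
# Sketch (not from the paper): sanity check of (5)-(6) on a quadratic
# J(theta) = 0.5 <A theta, theta> - <b, theta>, where the curvature c is the
# Rayleigh quotient <A d, d> / ||d||^2 and (5) is the exact line search step.
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)            # symmetric positive definite Hessian
b = rng.standard_normal(d)

def J(theta):
    return 0.5 * theta @ A @ theta - b @ theta

theta = rng.standard_normal(d)
grad = A @ theta - b                    # gradient of J at theta
direction = grad                        # plain gradient direction (P_k = Id)

curv = direction @ A @ direction / (direction @ direction)    # eq. (6)
tau = (direction @ grad) / ((direction @ direction) * curv)   # eq. (5)

# tau should minimize t -> J(theta - t * direction) along the slice
ts = np.linspace(0.0, 2.0 * tau, 201)
values = [J(theta - t * direction) for t in ts]
assert abs(ts[int(np.argmin(values))] - tau) <= 2 * (ts[1] - ts[0])
print("curvature:", curv, "step (5):", tau)
```

For a non-quadratic $J$, (5) only minimizes the local quadratic model, which is precisely why Section 2 replaces the curvature $c$ by the bound $L_k$.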
A second-order driven algorithm can be decomposed into two steps: i) the choice of $P_k$ in (4), provided this choice leads to an update which is a direction of ascent, that is, $\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle > 0$; and ii) a choice of $\tau_k$ by a heuristic inspired from (6) and (5).
In the stochastic setting, we denote by $s\mapsto J_s$ the mapping of the random function. At iteration $k$, only information on $(J_s)_{s\in\mathcal{B}_k}$ can be computed, where $(\mathcal{B}_k)_k$ is a sequence of independently drawn mini-batches. If $\mathbb{E}_{s\in\mathcal{B}_k}$ denotes the empirical average over the mini-batch, we define $J_{\mathcal{B}_k} = \mathbb{E}_{s\in\mathcal{B}_k}[J_s]$. Given $\Theta$, the quantity $J(\Theta)$ is deterministic, and $J$ is the expectation of $J_s$ with respect to $s$.
1.2 Related works
Choice of $P_k$: The choice $P_k = \nabla^2 J_{\mathcal{B}_k}(\Theta_k)$ in (4) leads to the choice $\tau_k = 1$ and to the so-called Newton method. It is possible in theory to compute the Hessian by automatic differentiation if it is sparse [45], but to our knowledge this has not been implemented yet. In [28], the authors solve $\dot\Theta_k = \nabla^2 J_{\mathcal{B}_k}(\Theta_k)^{-1}\nabla J_{\mathcal{B}_k}(\Theta_k)$ by a conjugate gradient method, which requires only matrix-vector products that are affordable by automatic differentiation [9, 31]. This point of view, as well as some variants [44, 20], suffers from a high computational cost per batch and goes through less data in a comparable amount of time, leading to slower convergence at the beginning of the optimization.
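As an illustration of the Hessian-free point of view of [28] described above, the sketch below solves $Hx = g$ using only products $v\mapsto Hv$. It is a generic matrix-free conjugate gradient, not the implementation of [28]; all names are illustrative, and the explicit matrix merely stands in for Hessian-vector products obtained by automatic differentiation.

```python
# Sketch (illustrative): solving Hessian * direction = gradient with a
# matrix-free conjugate gradient, where only Hessian-vector products
# hvp(v) are available (e.g. from automatic differentiation).
import numpy as np

def conjugate_gradient(hvp, grad, max_iter=10, tol=1e-8):
    """Approximately solve H x = grad using only products v -> H v."""
    x = np.zeros_like(grad)
    r = grad.copy()                 # residual grad - H x (x = 0 initially)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy usage with an explicit SPD matrix standing in for the mini-batch Hessian.
rng = np.random.default_rng(1)
M = rng.standard_normal((8, 8))
H = M @ M.T + 8 * np.eye(8)
g = rng.standard_normal(8)
step = conjugate_gradient(lambda v: H @ v, g, max_iter=50)
print(np.allclose(H @ step, g, atol=1e-5))
```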
Another choice is to set $P_k \simeq \nabla^2 J_{\mathcal{B}_k}(\Theta_k)$ in (4), which is coined as the “Quasi-Newton” approach. These methods directly invert a diagonal, block-diagonal or low-rank approximation of the Hessian [5, 38, 37, 30, 29, 49]. In most of these works, the Hessian is approximated by $\mathbb{E}[\nabla J_s(\Theta_k)\nabla J_s(\Theta_k)^T]$, the so-called Fisher information matrix, which leads to the natural gradient method [3]. Note also the use of a low-rank approximation of the true Hessian for variance reduction in [14].
Finally, there is an interpretation of adaptive methods as Quasi-Newton methods. Amongst the adaptive methods, let us cite RMSProp [43], Adam [19], Adagrad [13] and Adadelta [50]. For all these methods, $P_k$ is a diagonal preconditioner that reduces the variability of the step size across the different layers. This class of methods can be written
$$ \dot\Theta_k = P_k^{-1} m_k, \quad m_k \simeq \nabla J(\Theta_k) \quad\text{and}\quad P_k \simeq \nabla^2 J(\Theta_k). \qquad (7) $$
For instance, RMSProp and Adagrad use $m_k = \nabla J_{\mathcal{B}_k}(\Theta_k)$, whereas Adam maintains in $m_k$ an exponential moving average of the past evaluations of the gradient. The RMSProp, Adam and Adagrad optimizers build $P_k$ such that $P_k^2$ is a diagonal matrix whose elements are exponential moving averages of the squares of the past gradients (see [35] for example). It is an estimator of the diagonal part of the Fisher information matrix.
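As a minimal illustration of (7) for this family of optimizers, the sketch below builds an RMSProp-style diagonal preconditioner from an exponential moving average of squared gradients. It assumes plain NumPy and generic hyperparameter names; it is not the exact update rule of any specific library.

```python
# Sketch (illustrative): the diagonal preconditioner behind RMSProp/Adam-type
# methods in the sense of (7). diag(P_k^2) holds an exponential moving average
# of the squared gradients; the update direction is P_k^{-1} m_k.
import numpy as np

def rmsprop_direction(grad, state, beta=0.99, eps=1e-8):
    """Return P_k^{-1} grad where diag(P_k^2) is an EMA of grad**2."""
    state["v"] = beta * state.get("v", np.zeros_like(grad)) + (1 - beta) * grad**2
    precond = np.sqrt(state["v"]) + eps        # diagonal of P_k
    return grad / precond                      # direction P_k^{-1} m_k

# Usage: the returned direction satisfies <direction, grad> > 0 whenever
# grad != 0, so it is eligible for the rescaling of Section 2.
state = {}
g = np.array([0.5, -2.0, 0.01])
direction = rmsprop_direction(g, state)
print(direction, float(direction @ g) > 0)
```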
All these methods can be incorporated in our framework, as we consider the choice of $P_k$ as a preconditioning technique whose step is yet to be found. In a nutshell, if $P_k$ approximates the Hessian up to an unknown multiplicative factor, our method is able to find this multiplicative factor.
Barzilai-Borwein: The Barzilai-Borwein (BB) class of methods [4, 34, 11, 47, 6, 23] may be interpreted as methods which aim at estimating the curvature in (6) by numerical differences using past gradient computations. In the stochastic convex setting, the BB method is introduced in [42] for the choice $\dot\Theta_k = \nabla J(\Theta_k)$ and also for variance-reducing methods [18]. It has been extended in [27] to non-convex problems and in [24] to DNNs. Due to the variance of the gradient and possibly to a poor estimation of the curvature by numerical differences, these methods allow prescribing a new step at each epoch only. In [48, 8], the step is prescribed at each iteration at the cost of computing two mini-batch gradients per iteration. Moreover, in [48] the gradient over all the data needs to be computed at the beginning of each epoch, whereas [8] maintains an exponential moving average to avoid this extra computation. The downside of [8] is that the learning rate and its decay factor still need to be tuned, and that the method has not been tried on choices other than $P_k = \mathrm{Id}$.
Our belief is that approximating the curvature by numerical differences in a stochastic setting suffers too much from the variance of the data and from the approximation error. Hence we advocate in this study for exact computation of the curvature (6).
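For concreteness, the sketch below shows the kind of secant-based curvature estimate these BB methods rely on, and checks it against the exact curvature (6) on a quadratic. It is illustrative only and is not the algorithm of [42], [48] or [8].

```python
# Sketch (illustrative): Barzilai-Borwein-type curvature estimate by numerical
# differences, c_k ~ <g_k - g_{k-1}, T_k - T_{k-1}> / ||T_k - T_{k-1}||^2.
# In the stochastic setting this quotient inherits the gradient noise, which
# is the motivation given above for computing (6) exactly instead.
import numpy as np

def bb_curvature(theta_prev, theta_curr, grad_prev, grad_curr):
    s = theta_curr - theta_prev               # parameter difference
    y = grad_curr - grad_prev                 # gradient difference
    return (y @ s) / (s @ s)

# On a quadratic J(T) = 0.5 <A T, T>, the estimate is exact along s:
rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)
t0, t1 = rng.standard_normal(4), rng.standard_normal(4)
c_bb = bb_curvature(t0, t1, A @ t0, A @ t1)
c_exact = (t1 - t0) @ A @ (t1 - t0) / ((t1 - t0) @ (t1 - t0))   # eq. (6)
print(np.isclose(c_bb, c_exact))
```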
Automatic differentiation: The theory that allows computing the matrix-vector product of the Hessian with a given direction is well studied [45, 9, 15, 31]; it costs 4 passes (2 forward and 2 backward passes) and 3 memory footprints, whereas the computation of the gradient costs 2 passes (1 forward and 1 backward pass) and 1 memory footprint. We study the cost of computing the curvature defined in (6), which, to the best of our knowledge, has never been studied. Our method has a numerical cost that is always lower than that of the best BB method [8].
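The standard double-backward Hessian-vector product referred to above can be sketched as follows in PyTorch. Note that this is only the usual 4-pass evaluation of (6); the cheaper curvature pass developed in the paper is the subject of Section 3 and is not reproduced here.

```python
# Sketch (assumption: PyTorch autograd). Evaluates the curvature (6) for a
# small model through the standard double-backward Hessian-vector product.
import torch

def curvature(loss, params, direction):
    """c = <H d, d> / ||d||^2 with H the Hessian of `loss` w.r.t. `params`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # directional derivative <grad, d>, differentiated once more -> H d
    dot = sum((g * d).sum() for g, d in zip(grads, direction))
    hvps = torch.autograd.grad(dot, params)
    num = sum((h * d).sum() for h, d in zip(hvps, direction))
    den = sum((d * d).sum() for d in direction)
    return (num / den).item()

# Toy usage on a one-layer model with a quadratic loss.
torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x, y = torch.randn(8, 3), torch.randn(8)
loss = ((x @ w - y) ** 2).mean()
d = [torch.randn(3)]
print(curvature(loss, [w], d))
```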
1.3 Our contributions
We propose a change of point of view. While most of the methods presented above use first order information to develop second order algorithms, we use second order information to tune a first order method. The curvature (6) is computed using automatic differentiation in order to estimate the local Lipschitz constant of the gradient and to choose a step accordingly. Our contribution is threefold:
- We propose a method that automatically rescales the learning rate using curvature information in Section 2.1 and we discuss the heuristics of this method in Section 2.2. The rescaling allows the practitioner to choose between three different physical regimes coined as: hyperexploration, exploration/convergence trade-off and hyperconvergence.
- We study the automatic differentiation of the curvature in Section 3 and its computational costs.
- Numerical tests are provided in Section 4 with a discussion on the three different physical regimes introduced in Section 2.1.
2 Rescaling the learning rate
2.1 Presentation and guideline for rescaling
The second order analysis of Section 1 relies on the Taylor expansion
$$ J(\Theta_k - \tau_k\dot\Theta_k) \simeq J(\Theta_k) - \tau_k\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle + \frac{\tau_k^2}{2}\, c(\Theta_k,\dot\Theta_k)\,\|\dot\Theta_k\|^2, $$
with $c(\Theta_k,\dot\Theta_k)$ given by (6). This Taylor expansion yields the following algorithm: given $\Theta_k$ and an update direction $\dot\Theta_k$, compute $c(\Theta_k,\dot\Theta_k)$ by (6), the step $\tau_k$ by (5) and finally update the parameters $\Theta_k$ by (2). The first order analysis is slightly different. Starting with the second order exact Taylor expansion in integral form:
$$ J(\Theta_k - \tau_k\dot\Theta_k) = J(\Theta_k) - \tau_k\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle + \int_{t=0}^{\tau_k}(\tau_k - t)\, c(\Theta_k - t\dot\Theta_k,\dot\Theta_k)\,\|\dot\Theta_k\|^2\,\mathrm{d}t, $$
we introduce the local directional Lipschitz constant of the gradient
$$ L_k = \max_{t\in[0,\tau_k]}\big|\,c(\Theta_k - t\dot\Theta_k,\dot\Theta_k)\,\big|, \qquad (8) $$
in order to bound the right-hand side of the Taylor expansion. One obtains
$$ J(\Theta_k - \tau_k\dot\Theta_k) \le J(\Theta_k) - \tau_k\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle + \frac{\tau_k^2}{2}\,L_k\,\|\dot\Theta_k\|^2. \qquad (9) $$
Introducing the rescaling $r_k$
$$ r_k = \frac{2\,\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle}{\|\dot\Theta_k\|^2\, L_k} \qquad (10) $$
and writing $\tau_k = \ell\, r_k$, Equation (9) turns into
$$ J(\Theta_k - \ell r_k\dot\Theta_k) \le J(\Theta_k) + \frac{\ell^2-\ell}{2}\, L_k\,\|r_k\dot\Theta_k\|^2. \qquad (11) $$
Any choice of $\ell$ in $]0,1[$ leads to a decrease of $J$ in (11). The choice $\ell = \frac{1}{2}$ allows a faster decrease of the right-hand side of (11). We coin the choice $\ell = 1$ in (11) as the exploration choice and the choice $\ell = \frac{1}{2}$ as the convergence choice. The only difficulty in computing (10) lies in the computation of $L_k$. Indeed, $L_k$ is a maximum over an unknown interval and, in the stochastic setting, we only estimate the function $J$ and its derivative on a batch $\mathcal{B}_k$.
We propose to build $\tilde L_k$, an estimator of $L_k$, by the following rules.
- Replace the maximum over the unknown interval $[0,\tau_k]$ in (8) by the value at $t = 0$. This is reminiscent of Newton's method.
- Perform an exponential moving average on the past computations of the curvature $c_k$ in order to average over the data previously seen.
- Use the maximum of this latter exponential moving average and the current estimate in order to stabilize $\tilde L_k$.
The algorithm reads as follows:
Algorithm 1 Rescaling of the learning rate
1: Hyperparameter: $\beta_3 = 0.9$ (exponential moving average).
2: Initialization: $\hat c_0 = 0$.
3: Input (at each iteration $k$): a batch $\mathcal{B}_k$, $g_k = \mathbb{E}_{s\in\mathcal{B}_k}[\nabla J_s(\Theta_k)]$ and $\dot\Theta_k$ a direction that verifies $\langle g_k,\dot\Theta_k\rangle > 0$.
4: $c_k = \mathbb{E}_{s\in\mathcal{B}_k}\big[\langle\nabla^2 J_s(\Theta_k)\dot\Theta_k,\dot\Theta_k\rangle\big] \,/\, \|\dot\Theta_k\|^2$ (local curvature)
5: $\hat c_k = \beta_3\,\hat c_{k-1} + (1-\beta_3)\,c_k$ and $\tilde c_k = \hat c_k/(1-\beta_3^k)$ (moving average)
6: $\tilde L_k = \max(\tilde c_k, c_k)$ (stabilization)
7: $r_k = 2\,\langle\dot\Theta_k, g_k\rangle \,/\, \big(\|\dot\Theta_k\|^2\,\tilde L_k\big)$ (rescaling factor)
8: Output (at each iteration $k$): $r_k$, a rescaling of the direction $\dot\Theta_k$.
9: Usage of rescaling: the practitioner should use the update rule $\Theta_{k+1} = \Theta_k - \ell\, r_k\,\dot\Theta_k$, where $\ell$ follows the Rescaling guidelines (see below).
Note that the curvature $c_k$ is computed with the same batch as the one used to compute $g_k$ and $\dot\Theta_k$.
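A minimal Python sketch of Algorithm 1 as reconstructed above is given below. It assumes that the mini-batch gradient, the direction and the mini-batch curvature of line 4 are supplied by the training loop (how $c_k$ is obtained cheaply by automatic differentiation is the subject of Section 3), and the placement of the factor 2 in line 7 follows the reconstruction of (10)-(11); variable names are illustrative.

```python
# Sketch of Algorithm 1 (rescaling of the learning rate).
import numpy as np

class LearningRateRescaler:
    def __init__(self, beta3=0.9):
        self.beta3 = beta3          # line 1: EMA hyperparameter
        self.c_hat = 0.0            # line 2: initialization of the EMA
        self.k = 0

    def rescaling(self, g, direction, curvature):
        """Return r_k for a direction verifying <g_k, direction> > 0 (line 3)."""
        self.k += 1
        sq_norm = float(direction @ direction)
        c_k = curvature                                    # line 4 (given)
        self.c_hat = self.beta3 * self.c_hat + (1 - self.beta3) * c_k
        c_tilde = self.c_hat / (1 - self.beta3 ** self.k)  # line 5: debiased EMA
        L_tilde = max(c_tilde, c_k)                        # line 6: stabilization
        return 2.0 * float(direction @ g) / (sq_norm * L_tilde)   # line 7

# Usage (toy deterministic quadratic, direction = gradient, ell = 1/2):
rng = np.random.default_rng(3)
A = np.diag(rng.uniform(1.0, 5.0, size=4))
theta = rng.standard_normal(4)
rescaler = LearningRateRescaler(beta3=0.0)   # beta3 = 0: no averaging
ell = 0.5
for _ in range(20):
    g = A @ theta                             # gradient of 0.5 <A theta, theta>
    d = g
    c = d @ A @ d / (d @ d)                   # exact curvature (6)
    r = rescaler.rescaling(g, d, c)
    theta = theta - ell * r * d               # update rule of line 9
print(np.linalg.norm(theta))                  # small: see comment below
```

On this quadratic, the convergence choice $\ell = 1/2$ with $\beta_3 = 0$ reproduces an exact line search along the gradient direction, which is the deterministic behavior discussed in Section 2.2.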
Rescaling guidelines. Given a descent direction $\dot\Theta_k$, the update rule is given by
$$ \Theta_{k+1} = \Theta_k - \ell\, r_k\,\dot\Theta_k, $$
where $\ell$ is the learning rate that has to be chosen by the practitioner and $r_k$ is the rescaling computed by Algorithm 1. In the deterministic case, $\ell$ has a physical interpretation:
- $\frac{1}{2} \le \ell \le 1$ (Convergence/exploration trade-off). The choice $\ell = 1$ (exploration) is the largest step that keeps the loss function non-increasing. The choice $\ell = 1/2$ (convergence) ensures the fastest convergence to the closest local minimum. It is advised to start from $\ell = 1$ and decrease to $\ell = 1/2$ (see Section 4.1; a minimal schedule sketch is given after this list).
- $\ell > 1$ (Hyperexploration). The expected behavior is an increase of the loss function and large variations of the parameters. This mode can be used to escape local basins of attraction in annealing methods (see Section 4.2).
- $0 < \ell < 1/2$ (Hyperconvergence). This mode slows down the convergence. In the stochastic setting, if the practitioner has to resort to setting $\ell < 1/2$ in order to obtain convergence, then some stochastic effects are of importance in the optimization procedure (see Section 4.3).
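A minimal sketch of the advised schedule (start from $\ell = 1$, decrease to $\ell = 1/2$) could look as follows; the linear ramp and its length are arbitrary choices of this note, not prescribed by the guidelines.

```python
# Sketch (assumption: any schedule interpolating from exploration to
# convergence works; the linear ramp below is only an example).
def ell_schedule(k, n_iterations, ell_start=1.0, ell_end=0.5):
    """Linearly interpolate the factor ell between exploration and convergence."""
    t = min(k / max(n_iterations - 1, 1), 1.0)
    return (1 - t) * ell_start + t * ell_end

# Example: theta_next = theta - ell_schedule(k, N) * r_k * direction
print([round(ell_schedule(k, 5), 2) for k in range(5)])   # [1.0, 0.88, 0.75, 0.62, 0.5]
```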
2.2 Analysis of the rescaling
Several remarks are necessary to understand the limitations and applications of rescaling.
The algorithm does not converge in the deterministic setting. Note that in the deterministic one-dimensional case, when $J$ is convex (i.e. the curvature is positive), $\beta_3 = 0$ and $\ell = 1/2$, the algorithm boils down to the Newton method. It is known that the Newton method may fail to converge, even for strictly convex smooth functions. For example, if we choose
$$ J(\Theta) = \sqrt{1+\Theta^2}, $$
the iterates of the Newton method are given by $\Theta_{k+1} = -\Theta_k^3$, which diverges as soon as $|\Theta_0| > 1$. This problem comes from the fact that the curvature $c(\Theta_k - t\dot\Theta_k,\dot\Theta_k)$ has to be computed for each $t\in[0,\tau_k]$ in order to estimate $L_k$ in (8), but this maximum is estimated by its value at $t = 0$. In this example, $c(\Theta_k,\dot\Theta_k)$ is a bad estimator of $L_k$ as it is too small, and the resulting step is too large.
Another issue in DNNs is the massive use of piecewise linear activation functions, which can make the Hessian vanish; in this case, the rescaled algorithm may diverge. For example, if the loss function is locally linear, then $c_k = 0$ in line 4, and if we choose $\beta_3 = 0$ then $r_k = +\infty$ in line 7.
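The divergence of this Newton example is easy to reproduce numerically; the sketch below simply iterates the closed-form Newton step for $J(\Theta) = \sqrt{1+\Theta^2}$.

```python
# Numerical illustration (not from the paper's experiments) of the divergence
# of the Newton iterates T_{k+1} = -T_k^3 for J(T) = sqrt(1 + T^2).
import math

def newton_step(theta):
    grad = theta / math.sqrt(1 + theta**2)          # J'(T)
    hess = 1.0 / (1 + theta**2) ** 1.5              # J''(T), i.e. the curvature
    return theta - grad / hess                      # equals -T^3

theta = 1.1                                          # |T_0| > 1
for k in range(5):
    theta = newton_step(theta)
    print(k, theta)                                  # magnitude grows like |T_0|**(3**k)
```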
The algorithm does not converge in the stochastic setting. Let $X$ be a vector-valued random variable and let $J$ be the function
$$ J(\Theta) = \frac{1}{2}\,\mathbb{E}\,\|\Theta - X\|^2. $$
Then, for any value of $\beta_3$ and for the choice $\ell = \frac{1}{2}$, the rescaled algorithm yields the update $\Theta_{k+1} = \mathbb{E}_{\mathcal{B}_k}[X]$, whereas the optimal value is $\Theta^\star = \mathbb{E}[X]$. The algorithm oscillates around $\Theta^\star$, with oscillations depending on the variance of the gradient. This oscillating stochastic effect is well known and underlies the basic analysis of SGD. Since the proposed rescaling analysis is performed in a deterministic setting, it is not designed to offer any solution to this problem.
Enforcing convergence by Robbins-Monro conditions. In order to enforce convergence, we can use the results of [36]. It is then sufficient to impose conditions such as
$$ \alpha \le r_k\, k^{\delta} \le \beta, \qquad (12) $$
with fixed $\alpha, \beta > 0$ and $\delta \in\, ]1/2, 1[$. This is the choice followed by [8], for instance. Note that convergence analysis for a curvature-dependent step is, to our knowledge, studied only in [2], for the non-stochastic time-continuous setting.
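In practice, condition (12) amounts to clipping the computed rescaling into the interval $[\alpha k^{-\delta}, \beta k^{-\delta}]$; a minimal sketch (with placeholder constants, not values from the paper) is given below.

```python
# Sketch (illustrative): enforcing the Robbins-Monro-type condition (12) by
# clipping the computed rescaling r_k into [alpha * k**(-delta), beta * k**(-delta)].
def clip_rescaling(r_k, k, alpha=1e-3, beta=1.0, delta=0.75):
    lower = alpha * k ** (-delta)
    upper = beta * k ** (-delta)
    return min(max(r_k, lower), upper)

print(clip_rescaling(0.5, k=1), clip_rescaling(0.5, k=10_000))
```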