Adaptive scaling of the learning rate by second order
automatic differentiation
Frédéric de Gournay^{1,2} (degourna@insa-toulouse.fr)
Alban Gossard^{1,3} (alban.paul.gossard@gmail.com)
^1 Institut de Mathématiques de Toulouse; UMR5219; Université de Toulouse; CNRS
^2 INSA, F-31077 Toulouse, France
^3 UPS, F-31062 Toulouse Cedex 9, France
October 27, 2022
Abstract
In the context of the optimization of Deep Neural Networks, we propose to rescale the learning rate using a new technique of automatic differentiation. This technique relies on the computation of the curvature, a second order quantity whose computational complexity lies between that of the gradient and that of the Hessian-vector product. If (1C, 1M) represents respectively the computational time and memory footprint of the gradient method, the new technique increases the overall cost to either (1.5C, 2M) or (2C, 1M). This rescaling has the appealing characteristic of having a natural interpretation: it allows the practitioner to choose between exploration of the parameter set and convergence of the algorithm. The rescaling is adaptive; it depends on the data and on the direction of descent. The numerical experiments highlight the different exploration/convergence regimes.
1 Introduction
The optimization of Deep Neural Networks (DNNs) has received tremendous attention over the past
years. Training DNNs amounts to minimizing the expectation of non-convex random functions in a high-dimensional space $\mathbb{R}^d$. If $J:\mathbb{R}^d\to\mathbb{R}$ denotes this expectation, the problem reads
$$ \min_{\Theta\in\mathbb{R}^d} J(\Theta), \qquad (1) $$
with $\Theta$ the parameters. Optimization algorithms iteratively compute $\Theta_k$, an approximation of a minimizer of (1) at iteration $k$, by the update rule
$$ \Theta_{k+1} = \Theta_k - \tau_k\,\dot\Theta_k, \qquad (2) $$
where $\tau_k$ is the learning rate and $\dot\Theta_k$ is the update direction. The choice of $\dot\Theta_k$ encodes the type of algorithm used. This work focuses on the choice of the learning rate $\tau_k$.
There is a trade-off in the choice of this learning rate. Indeed, high values of $\tau_k$ allow exploration of the parameter space, whereas a slowly decaying step size ensures convergence in accordance with the famous Robbins-Monro algorithm [36]. This decaying condition may be met by defining the step as $\tau_k = \tau_0 k^{-\alpha}$, with $\tau_0$ the initial step size and $1/2 < \alpha < 1$ a constant. The choice of the initial learning rate and its decay are left to practitioners, and these hyperparameters have to be tuned manually in order to obtain the best rate of convergence. For instance, they can be optimized using a grid search or by using more intricate strategies [40], but in all generality tuning the learning rate and its decay factor is difficult and time consuming. The main issue is that the learning rate has no natural scaling. The goal of this work is to propose an algorithm that, given a direction $\dot\Theta_k$, automatically finds a scaling of the learning rate. This rescaling has the following advantages:
- The scaling is adaptive: it depends on the data and on the choice of the direction $\dot\Theta_k$.
- The scaling expresses the convergence vs. exploration trade-off. Multiplying the rescaled learning rate by $1/2$ enforces convergence, whereas multiplying it by $1$ allows for exploration of the space of parameters.
This rescaling comes at a cost and it has the following disadvantages:
- The computational cost and memory footprint of the algorithm go from (1C, 1M) to (1.5C, 2M) or (2C, 1M).
- The rescaling method is only available to algorithms that yield directions of descent; it excludes momentum methods and notably Adam-flavored algorithms.
- Rescaling is theoretically limited to functions whose second order derivative exists and does not vanish. This non-vanishing condition can be compensated by L2-regularization.
1.1 Foreword
First recall that second order methods for the minimization of a deterministic $\mathcal{C}^2$ function $\Theta\mapsto J(\Theta)$, with a Hessian that we denote $\nabla^2 J$, are based on the second order Taylor expansion at iteration $k$:
$$ J(\Theta_k - \tau_k\dot\Theta_k) \simeq J(\Theta_k) - \tau_k\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle + \frac{\tau_k^2}{2}\langle\nabla^2 J(\Theta_k)\dot\Theta_k,\dot\Theta_k\rangle. \qquad (3) $$
If the Hessian of $J$ is positive definite, the minimization of the right-hand side leads to the choice
$$ \dot\Theta_k = P_k^{-1}\nabla J(\Theta_k) \quad\text{with}\quad P_k \simeq \nabla^2 J(\Theta_k). \qquad (4) $$
Once a direction $\dot\Theta_k$ is chosen, another minimization in $\tau_k$ gives
$$ \tau_k = \frac{\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle}{\|\dot\Theta_k\|^2\, c(\Theta_k,\dot\Theta_k)}, \qquad (5) $$
where $c$ is the curvature of the function, defined as
$$ c(\Theta_k,\dot\Theta_k) \overset{\mathrm{def}}{=} \frac{\langle\nabla^2 J(\Theta_k)\dot\Theta_k,\dot\Theta_k\rangle}{\|\dot\Theta_k\|^2}. \qquad (6) $$
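As a concrete check of (5)-(6), the following NumPy sketch (illustrative only; the quadratic test function, variable names and random seed are not from the paper) verifies that, for $J(\Theta) = \frac{1}{2}\langle A\Theta,\Theta\rangle - \langle b,\Theta\rangle$, the curvature (6) is the Rayleigh quotient of the Hessian $A$ along the direction, and the step (5) is the exact line-search minimizer along that direction.

```python
# Sketch (not from the paper): sanity check of (5)-(6) on a quadratic
# J(theta) = 0.5 <A theta, theta> - <b, theta>, where the curvature c is the
# Rayleigh quotient <A d, d> / ||d||^2 and (5) is the exact line search step.
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)            # symmetric positive definite Hessian
b = rng.standard_normal(d)

def J(theta):
    return 0.5 * theta @ A @ theta - b @ theta

theta = rng.standard_normal(d)
grad = A @ theta - b                    # gradient of J at theta
direction = grad                        # plain gradient direction (P_k = Id)

curv = direction @ A @ direction / (direction @ direction)    # eq. (6)
tau = (direction @ grad) / ((direction @ direction) * curv)   # eq. (5)

# tau should minimize t -> J(theta - t * direction) along the slice
ts = np.linspace(0.0, 2.0 * tau, 201)
values = [J(theta - t * direction) for t in ts]
assert abs(ts[int(np.argmin(values))] - tau) <= 2 * (ts[1] - ts[0])
print("curvature:", curv, "step (5):", tau)
```

For a non-quadratic $J$, (5) only minimizes the local quadratic model, which is precisely why Section 2 replaces the curvature $c$ by the bound $L_k$.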
A second-order driven algorithm can be decomposed into two steps: i) the choice of $P_k$ in (4), provided this choice leads to an update which is a direction of ascent, that is, $\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle > 0$; and ii) a choice of $\tau_k$ by a heuristic inspired from (6) and (5).
In the stochastic setting, we denote by $s\mapsto J_s$ the mapping of the random function. At iteration $k$, only information on $(J_s)_{s\in\mathcal{B}_k}$ can be computed, where $(\mathcal{B}_k)_k$ is a sequence of independently drawn mini-batches. If $\mathbb{E}_{s\in\mathcal{B}_k}$ denotes the empirical average over the mini-batch, we define $J_{\mathcal{B}_k} = \mathbb{E}_{s\in\mathcal{B}_k}[J_s]$. Given $\Theta$, the quantity $J(\Theta)$ is deterministic, and $J$ is the expectation of $J_s$ with respect to $s$.
1.2 Related works
Choice of $P_k$: The choice $P_k = \nabla^2 J_{\mathcal{B}_k}(\Theta_k)$ in (4) leads to the choice $\tau_k = 1$ and to the so-called Newton method. It is possible in theory to compute the Hessian by automatic differentiation if it is sparse [45], but to our knowledge this has not been implemented yet. In [28], the authors solve $\dot\Theta_k = \nabla^2 J_{\mathcal{B}_k}(\Theta_k)^{-1}\nabla J_{\mathcal{B}_k}(\Theta_k)$ by a conjugate gradient method, which requires only matrix-vector products that are affordable by automatic differentiation [9, 31]. This point of view, as well as some variants [44, 20], suffers from a high computational cost per batch and goes through less data in a comparable amount of time, leading to slower convergence at the beginning of the optimization.
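As an illustration of the Hessian-free point of view of [28] described above, the sketch below solves $Hx = g$ using only products $v\mapsto Hv$. It is a generic matrix-free conjugate gradient, not the implementation of [28]; all names are illustrative, and the explicit matrix merely stands in for Hessian-vector products obtained by automatic differentiation.

```python
# Sketch (illustrative): solving Hessian * direction = gradient with a
# matrix-free conjugate gradient, where only Hessian-vector products
# hvp(v) are available (e.g. from automatic differentiation).
import numpy as np

def conjugate_gradient(hvp, grad, max_iter=10, tol=1e-8):
    """Approximately solve H x = grad using only products v -> H v."""
    x = np.zeros_like(grad)
    r = grad.copy()                 # residual grad - H x (x = 0 initially)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy usage with an explicit SPD matrix standing in for the mini-batch Hessian.
rng = np.random.default_rng(1)
M = rng.standard_normal((8, 8))
H = M @ M.T + 8 * np.eye(8)
g = rng.standard_normal(8)
step = conjugate_gradient(lambda v: H @ v, g, max_iter=50)
print(np.allclose(H @ step, g, atol=1e-5))
```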
Another choice is to set $P_k \simeq \nabla^2 J_{\mathcal{B}_k}(\Theta_k)$ in (4), which is coined as the “Quasi-Newton” approach. These methods directly invert a diagonal, block-diagonal or low-rank approximation of the Hessian [5, 38, 37, 30, 29, 49]. In most of these works, the Hessian is approximated by $\mathbb{E}[\nabla J_s(\Theta_k)\nabla J_s(\Theta_k)^T]$, the so-called Fisher information matrix, which leads to the natural gradient method [3]. Note also the use of a low-rank approximation of the true Hessian for variance reduction in [14].
Finally, there is an interpretation of adaptive methods as Quasi-Newton methods. Amongst the adaptive methods, let us cite RMSProp [43], Adam [19], Adagrad [13] and Adadelta [50]. For all these methods, $P_k$ is a diagonal preconditioner that reduces the variability of the step size across the different layers. This class of methods can be written
$$ \dot\Theta_k = P_k^{-1} m_k, \quad m_k \simeq \nabla J(\Theta_k) \quad\text{and}\quad P_k \simeq \nabla^2 J(\Theta_k). \qquad (7) $$
For instance, RMSProp and Adagrad use $m_k = \nabla J_{\mathcal{B}_k}(\Theta_k)$, whereas Adam maintains in $m_k$ an exponential moving average of the past evaluations of the gradient. The RMSProp, Adam and Adagrad optimizers build $P_k$ such that $P_k^2$ is a diagonal matrix whose elements are exponential moving averages of the squares of the past gradients (see [35] for example). It is an estimator of the diagonal part of the Fisher information matrix.
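As a minimal illustration of (7) for this family of optimizers, the sketch below builds an RMSProp-style diagonal preconditioner from an exponential moving average of squared gradients. It assumes plain NumPy and generic hyperparameter names; it is not the exact update rule of any specific library.

```python
# Sketch (illustrative): the diagonal preconditioner behind RMSProp/Adam-type
# methods in the sense of (7). diag(P_k^2) holds an exponential moving average
# of the squared gradients; the update direction is P_k^{-1} m_k.
import numpy as np

def rmsprop_direction(grad, state, beta=0.99, eps=1e-8):
    """Return P_k^{-1} grad where diag(P_k^2) is an EMA of grad**2."""
    state["v"] = beta * state.get("v", np.zeros_like(grad)) + (1 - beta) * grad**2
    precond = np.sqrt(state["v"]) + eps        # diagonal of P_k
    return grad / precond                      # direction P_k^{-1} m_k

# Usage: the returned direction satisfies <direction, grad> > 0 whenever
# grad != 0, so it is eligible for the rescaling of Section 2.
state = {}
g = np.array([0.5, -2.0, 0.01])
direction = rmsprop_direction(g, state)
print(direction, float(direction @ g) > 0)
```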
All these methods can be incorporated in our framework, as we consider the choice of $P_k$ as a preconditioning technique whose step is yet to be found. In a nutshell, if $P_k$ approximates the Hessian up to an unknown multiplicative factor, our method is able to find this multiplicative factor.
Barzilai-Borwein: The Barzilai-Borwein (BB) class of methods [4, 34, 11, 47, 6, 23] may be interpreted as methods which aim at estimating the curvature in (6) by numerical differences using past gradient computations. In the stochastic convex setting, the BB method is introduced in [42] for the choice $\dot\Theta_k = \nabla J(\Theta_k)$ and also for variance-reducing methods [18]. It has been extended in [27] to non-convex problems and in [24] to DNNs. Due to the variance of the gradient and possibly to a poor estimation of the curvature by numerical differences, these methods allow prescribing a new step at each epoch only. In [48, 8], the step is prescribed at each iteration at the cost of computing two mini-batch gradients per iteration. Moreover, in [48] the gradient over all the data needs to be computed at the beginning of each epoch, whereas [8] maintains an exponential moving average to avoid this extra computation. The downside of [8] is that the learning rate and its decay factor still need to be tuned, and that the method has not been tried on choices other than $P_k = \mathrm{Id}$.
Our belief is that approximating the curvature by numerical differences in a stochastic setting suffers too much from the variance of the data and from the approximation error. Hence we advocate in this study for exact computation of the curvature (6).
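For concreteness, the sketch below shows the kind of secant-based curvature estimate these BB methods rely on, and checks it against the exact curvature (6) on a quadratic. It is illustrative only and is not the algorithm of [42], [48] or [8].

```python
# Sketch (illustrative): Barzilai-Borwein-type curvature estimate by numerical
# differences, c_k ~ <g_k - g_{k-1}, T_k - T_{k-1}> / ||T_k - T_{k-1}||^2.
# In the stochastic setting this quotient inherits the gradient noise, which
# is the motivation given above for computing (6) exactly instead.
import numpy as np

def bb_curvature(theta_prev, theta_curr, grad_prev, grad_curr):
    s = theta_curr - theta_prev               # parameter difference
    y = grad_curr - grad_prev                 # gradient difference
    return (y @ s) / (s @ s)

# On a quadratic J(T) = 0.5 <A T, T>, the estimate is exact along s:
rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)
t0, t1 = rng.standard_normal(4), rng.standard_normal(4)
c_bb = bb_curvature(t0, t1, A @ t0, A @ t1)
c_exact = (t1 - t0) @ A @ (t1 - t0) / ((t1 - t0) @ (t1 - t0))   # eq. (6)
print(np.isclose(c_bb, c_exact))
```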
Automatic differentiation: The theory that allows computing the matrix-vector product of the Hessian with a given direction is well studied [45, 9, 15, 31]; it costs 4 passes (2 forward and 2 backward passes) and 3 memory footprints, whereas the computation of the gradient costs 2 passes (1 forward and 1 backward pass) and 1 memory footprint. We study the cost of computing the curvature defined in (6), which, to the best of our knowledge, has never been studied. Our method has a numerical cost that is always lower than that of the best BB method [8].
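The standard double-backward Hessian-vector product referred to above can be sketched as follows in PyTorch. Note that this is only the usual 4-pass evaluation of (6); the cheaper curvature pass developed in the paper is the subject of Section 3 and is not reproduced here.

```python
# Sketch (assumption: PyTorch autograd). Evaluates the curvature (6) for a
# small model through the standard double-backward Hessian-vector product.
import torch

def curvature(loss, params, direction):
    """c = <H d, d> / ||d||^2 with H the Hessian of `loss` w.r.t. `params`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # directional derivative <grad, d>, differentiated once more -> H d
    dot = sum((g * d).sum() for g, d in zip(grads, direction))
    hvps = torch.autograd.grad(dot, params)
    num = sum((h * d).sum() for h, d in zip(hvps, direction))
    den = sum((d * d).sum() for d in direction)
    return (num / den).item()

# Toy usage on a one-layer model with a quadratic loss.
torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x, y = torch.randn(8, 3), torch.randn(8)
loss = ((x @ w - y) ** 2).mean()
d = [torch.randn(3)]
print(curvature(loss, [w], d))
```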
1.3 Our contributions
We propose a change of point of view. While most of the methods presented above use first order information to develop second order algorithms, we use second order information to tune a first order method. The curvature (6) is computed using automatic differentiation in order to estimate the local Lipschitz constant of the gradient and to choose a step accordingly. Our contribution is threefold:
- We propose a method that automatically rescales the learning rate using curvature information in Section 2.1 and we discuss the heuristics of this method in Section 2.2. The rescaling allows the practitioner to choose between three different physical regimes coined as: hyperexploration, exploration/convergence trade-off and hyperconvergence.
- We study the automatic differentiation of the curvature in Section 3 and its computational costs.
- Numerical tests are provided in Section 4 with a discussion on the three different physical regimes introduced in Section 2.1.
2 Rescaling the learning rate
2.1 Presentation and guideline for rescaling
The second order analysis of Section 1 relies on the Taylor expansion
$$ J(\Theta_k - \tau_k\dot\Theta_k) \simeq J(\Theta_k) - \tau_k\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle + \frac{\tau_k^2}{2}\, c(\Theta_k,\dot\Theta_k)\,\|\dot\Theta_k\|^2, $$
with $c(\Theta_k,\dot\Theta_k)$ given by (6). This Taylor expansion yields the following algorithm: given $\Theta_k$ and an update direction $\dot\Theta_k$, compute $c(\Theta_k,\dot\Theta_k)$ by (6), the step $\tau_k$ by (5) and finally update the parameters $\Theta_k$ by (2). The first order analysis is slightly different. Starting with the second order exact Taylor expansion in integral form:
$$ J(\Theta_k - \tau_k\dot\Theta_k) = J(\Theta_k) - \tau_k\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle + \int_{t=0}^{\tau_k}(\tau_k - t)\, c(\Theta_k - t\dot\Theta_k,\dot\Theta_k)\,\|\dot\Theta_k\|^2\,\mathrm{d}t, $$
we introduce the local directional Lipschitz constant of the gradient
$$ L_k = \max_{t\in[0,\tau_k]}\big|\,c(\Theta_k - t\dot\Theta_k,\dot\Theta_k)\,\big|, \qquad (8) $$
in order to bound the right-hand side of the Taylor expansion. One obtains
$$ J(\Theta_k - \tau_k\dot\Theta_k) \le J(\Theta_k) - \tau_k\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle + \frac{\tau_k^2}{2}\,L_k\,\|\dot\Theta_k\|^2. \qquad (9) $$
Introducing the rescaling $r_k$
$$ r_k = \frac{2\,\langle\dot\Theta_k,\nabla J(\Theta_k)\rangle}{\|\dot\Theta_k\|^2\, L_k} \qquad (10) $$
and writing $\tau_k = \ell\, r_k$, Equation (9) turns into
$$ J(\Theta_k - \ell r_k\dot\Theta_k) \le J(\Theta_k) + \frac{\ell^2-\ell}{2}\, L_k\,\|r_k\dot\Theta_k\|^2. \qquad (11) $$
Any choice of $\ell$ in $]0,1[$ leads to a decrease of $J$ in (11). The choice $\ell = \frac{1}{2}$ allows a faster decrease of the right-hand side of (11). We coin the choice $\ell = 1$ in (11) as the exploration choice and the choice $\ell = \frac{1}{2}$ as the convergence choice. The only difficulty in computing (10) lies in the computation of $L_k$. Indeed, $L_k$ is a maximum over an unknown interval and, in the stochastic setting, we only estimate the function $J$ and its derivative on a batch $\mathcal{B}_k$.
We propose to build $\tilde L_k$, an estimator of $L_k$, by the following rules.
- Replace the maximum over the unknown interval $[0,\tau_k]$ in (8) by the value at $t = 0$. This is reminiscent of Newton's method.
- Perform an exponential moving average on the past computations of the curvature $c_k$ in order to average over the data previously seen.
- Use the maximum of this latter exponential moving average and the current estimate in order to stabilize $\tilde L_k$.
The algorithm reads as follows:
Algorithm 1 Rescaling of the learning rate
1: Hyperparameter: $\beta_3 = 0.9$ (exponential moving average).
2: Initialization: $\hat c_0 = 0$.
3: Input (at each iteration $k$): a batch $\mathcal{B}_k$, $g_k = \mathbb{E}_{s\in\mathcal{B}_k}[\nabla J_s(\Theta_k)]$ and $\dot\Theta_k$ a direction that verifies $\langle g_k,\dot\Theta_k\rangle > 0$.
4: $c_k = \mathbb{E}_{s\in\mathcal{B}_k}\big[\langle\nabla^2 J_s(\Theta_k)\dot\Theta_k,\dot\Theta_k\rangle\big] \,/\, \|\dot\Theta_k\|^2$ (local curvature)
5: $\hat c_k = \beta_3\,\hat c_{k-1} + (1-\beta_3)\,c_k$ and $\tilde c_k = \hat c_k/(1-\beta_3^k)$ (moving average)
6: $\tilde L_k = \max(\tilde c_k, c_k)$ (stabilization)
7: $r_k = 2\,\langle\dot\Theta_k, g_k\rangle \,/\, \big(\|\dot\Theta_k\|^2\,\tilde L_k\big)$ (rescaling factor)
8: Output (at each iteration $k$): $r_k$, a rescaling of the direction $\dot\Theta_k$.
9: Usage of rescaling: the practitioner should use the update rule $\Theta_{k+1} = \Theta_k - \ell\, r_k\,\dot\Theta_k$, where $\ell$ follows the Rescaling guidelines (see below).
Note that the curvature $c_k$ is computed with the same batch as the one used to compute $g_k$ and $\dot\Theta_k$.
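A minimal Python sketch of Algorithm 1 as reconstructed above is given below. It assumes that the mini-batch gradient, the direction and the mini-batch curvature of line 4 are supplied by the training loop (how $c_k$ is obtained cheaply by automatic differentiation is the subject of Section 3), and the placement of the factor 2 in line 7 follows the reconstruction of (10)-(11); variable names are illustrative.

```python
# Sketch of Algorithm 1 (rescaling of the learning rate).
import numpy as np

class LearningRateRescaler:
    def __init__(self, beta3=0.9):
        self.beta3 = beta3          # line 1: EMA hyperparameter
        self.c_hat = 0.0            # line 2: initialization of the EMA
        self.k = 0

    def rescaling(self, g, direction, curvature):
        """Return r_k for a direction verifying <g_k, direction> > 0 (line 3)."""
        self.k += 1
        sq_norm = float(direction @ direction)
        c_k = curvature                                    # line 4 (given)
        self.c_hat = self.beta3 * self.c_hat + (1 - self.beta3) * c_k
        c_tilde = self.c_hat / (1 - self.beta3 ** self.k)  # line 5: debiased EMA
        L_tilde = max(c_tilde, c_k)                        # line 6: stabilization
        return 2.0 * float(direction @ g) / (sq_norm * L_tilde)   # line 7

# Usage (toy deterministic quadratic, direction = gradient, ell = 1/2):
rng = np.random.default_rng(3)
A = np.diag(rng.uniform(1.0, 5.0, size=4))
theta = rng.standard_normal(4)
rescaler = LearningRateRescaler(beta3=0.0)   # beta3 = 0: no averaging
ell = 0.5
for _ in range(20):
    g = A @ theta                             # gradient of 0.5 <A theta, theta>
    d = g
    c = d @ A @ d / (d @ d)                   # exact curvature (6)
    r = rescaler.rescaling(g, d, c)
    theta = theta - ell * r * d               # update rule of line 9
print(np.linalg.norm(theta))                  # small: see comment below
```

On this quadratic, the convergence choice $\ell = 1/2$ with $\beta_3 = 0$ reproduces an exact line search along the gradient direction, which is the deterministic behavior discussed in Section 2.2.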
Rescaling guidelines. Given a descent direction $\dot\Theta_k$, the update rule is given by
$$ \Theta_{k+1} = \Theta_k - \ell\, r_k\,\dot\Theta_k, $$
where $\ell$ is the learning rate that has to be chosen by the practitioner and $r_k$ is the rescaling computed by Algorithm 1. In the deterministic case, $\ell$ has a physical interpretation:
- $\frac{1}{2} \le \ell \le 1$ (Convergence/exploration trade-off). The choice $\ell = 1$ (exploration) is the largest step that keeps the loss function non-increasing. The choice $\ell = 1/2$ (convergence) ensures the fastest convergence to the closest local minimum. It is advised to start from $\ell = 1$ and decrease to $\ell = 1/2$ (see Section 4.1; a minimal schedule sketch is given after this list).
- $\ell > 1$ (Hyperexploration). The expected behavior is an increase of the loss function and large variations of the parameters. This mode can be used to escape local basins of attraction in annealing methods (see Section 4.2).
- $0 < \ell < 1/2$ (Hyperconvergence). This mode slows down the convergence. In the stochastic setting, if the practitioner has to resort to setting $\ell < 1/2$ in order to obtain convergence, then some stochastic effects are of importance in the optimization procedure (see Section 4.3).
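A minimal sketch of the advised schedule (start from $\ell = 1$, decrease to $\ell = 1/2$) could look as follows; the linear ramp and its length are arbitrary choices of this note, not prescribed by the guidelines.

```python
# Sketch (assumption: any schedule interpolating from exploration to
# convergence works; the linear ramp below is only an example).
def ell_schedule(k, n_iterations, ell_start=1.0, ell_end=0.5):
    """Linearly interpolate the factor ell between exploration and convergence."""
    t = min(k / max(n_iterations - 1, 1), 1.0)
    return (1 - t) * ell_start + t * ell_end

# Example: theta_next = theta - ell_schedule(k, N) * r_k * direction
print([round(ell_schedule(k, 5), 2) for k in range(5)])   # [1.0, 0.88, 0.75, 0.62, 0.5]
```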
2.2 Analysis of the rescaling
Several remarks are necessary to understand the limitations and applications of rescaling.
The algorithm does not converge in the deterministic setting. Note that in the deterministic one-dimensional case, when $J$ is convex (i.e. the curvature is positive), $\beta_3 = 0$ and $\ell = 1/2$, the algorithm boils down to the Newton method. It is known that the Newton method may fail to converge, even for strictly convex smooth functions. For example, if we choose
$$ J(\Theta) = \sqrt{1+\Theta^2}, $$
the iterates of the Newton method are given by $\Theta_{k+1} = -\Theta_k^3$, which diverges as soon as $|\Theta_0| > 1$. This problem comes from the fact that the curvature $c(\Theta_k - t\dot\Theta_k,\dot\Theta_k)$ has to be computed for each $t\in[0,\tau_k]$ in order to estimate $L_k$ in (8), but this maximum is estimated by its value at $t = 0$. In this example, $c(\Theta_k,\dot\Theta_k)$ is a bad estimator of $L_k$ as it is too small, and the resulting step is too large.
Another issue in DNNs is the massive use of piecewise linear activation functions, which can make the Hessian vanish; in this case, the rescaled algorithm may diverge. For example, if the loss function is locally linear, then $c_k = 0$ in line 4, and if we choose $\beta_3 = 0$ then $r_k = +\infty$ in line 7.
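The divergence of this Newton example is easy to reproduce numerically; the sketch below simply iterates the closed-form Newton step for $J(\Theta) = \sqrt{1+\Theta^2}$.

```python
# Numerical illustration (not from the paper's experiments) of the divergence
# of the Newton iterates T_{k+1} = -T_k^3 for J(T) = sqrt(1 + T^2).
import math

def newton_step(theta):
    grad = theta / math.sqrt(1 + theta**2)          # J'(T)
    hess = 1.0 / (1 + theta**2) ** 1.5              # J''(T), i.e. the curvature
    return theta - grad / hess                      # equals -T^3

theta = 1.1                                          # |T_0| > 1
for k in range(5):
    theta = newton_step(theta)
    print(k, theta)                                  # magnitude grows like |T_0|**(3**k)
```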
The algorithm does not converge in the stochastic setting. Let $X$ be a vector-valued random variable and let $J$ be the function
$$ J(\Theta) = \frac{1}{2}\,\mathbb{E}\,\|\Theta - X\|^2. $$
Then, for any value of $\beta_3$ and for the choice $\ell = \frac{1}{2}$, the rescaled algorithm yields the update $\Theta_{k+1} = \mathbb{E}_{\mathcal{B}_k}[X]$, whereas the optimal value is $\Theta^\star = \mathbb{E}[X]$. The algorithm oscillates around $\Theta^\star$, with oscillations depending on the variance of the gradient. This oscillating stochastic effect is well known and underlies the basic analysis of SGD. Since the proposed rescaling analysis is performed in a deterministic setting, it is not designed to offer any solution to this problem.
Enforcing convergence by Robbins-Monro conditions. In order to enforce convergence, we can use the results of [36]. It is then sufficient to impose conditions such as
$$ \alpha \le r_k\, k^{\delta} \le \beta, \qquad (12) $$
with fixed $\alpha, \beta > 0$ and $\delta \in\, ]1/2, 1[$. This is the choice followed by [8], for instance. Note that convergence analysis for a curvature-dependent step is, to our knowledge, studied only in [2], for the non-stochastic time-continuous setting.
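In practice, condition (12) amounts to clipping the computed rescaling into the interval $[\alpha k^{-\delta}, \beta k^{-\delta}]$; a minimal sketch (with placeholder constants, not values from the paper) is given below.

```python
# Sketch (illustrative): enforcing the Robbins-Monro-type condition (12) by
# clipping the computed rescaling r_k into [alpha * k**(-delta), beta * k**(-delta)].
def clip_rescaling(r_k, k, alpha=1e-3, beta=1.0, delta=0.75):
    lower = alpha * k ** (-delta)
    upper = beta * k ** (-delta)
    return min(max(r_k, lower), upper)

print(clip_rescaling(0.5, k=1), clip_rescaling(0.5, k=10_000))
```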