Non-Convergence and Limit Cycles in the Adam optimizer*
Sebastian Bock¹ and Martin Weiß¹
OTH Regensburg, Prüfeninger Str. 58, 93049 Regensburg, Germany
{sebastian2.bock,martin.weiss}@oth-regensburg.de
Abstract. One of the most popular training algorithms for deep neural networks is the Adaptive Moment Estimation (Adam) introduced by Kingma and Ba. Despite its success in many applications there is no satisfactory convergence analysis: only local convergence can be shown for batch mode under some restrictions on the hyperparameters, and counterexamples exist for incremental mode. Recent results show that for simple quadratic objective functions limit cycles of period 2 exist in batch mode, but only for atypical hyperparameters, and only for the algorithm without bias correction. We extend the convergence analysis for Adam in batch mode with bias correction and show that even for quadratic objective functions, as the simplest case of convex functions, 2-limit-cycles exist, for all choices of the hyperparameters. We analyze the stability of these limit cycles and relate our analysis to other results where approximate convergence was shown, but under the additional assumption of bounded gradients, which does not apply to quadratic functions. The investigation heavily relies on the use of computer algebra due to the complexity of the equations.
Keywords: Adam optimizer · convergence · computer algebra · dynamical system · limit cycle
1 Introduction
Adaptive Moment Estimation (Adam), originally presented by Kingma and Ba [7], is probably the most widely used training algorithm for neural networks, especially convolutional neural networks. Implementations exist in all popular machine learning frameworks like Tensorflow or PyTorch. Despite its apparent success the theoretical basis is weak: the original proof in [7] is wrong, as has been noted by several authors, see [1, 10]. Of course a faulty proof does not imply that the Adam optimizer does not converge, and indeed local convergence can be shown for batch mode under reasonable restrictions on the hyperparameters, see [2]. Furthermore, [9] gives an example in incremental mode where the regret does not converge, and neither do the arguments of the objective function.
* This paper presents results of the project "LeaP – Learning Poses" supported by the Bavarian Ministry of Science and Art under Kap. 15 49 TG 78.
Several results exist which show ε-bounds on the gradients, ‖∇f(w_t)‖ < ε for ε > 0 arbitrarily small and all t sufficiently large, see [5, Theorem 3.4], [3], [11, Theorem 3.3]. Other results show asymptotic bounds on the regret, or that the function values come close to the minimum, f(w_t) − f(w*) < ε, see [3] for example. To the best of the authors' knowledge, [2] is the only (partial) result on weight convergence in the standard mathematical sense lim_{t→∞} w_t = w*, however only in a local sense.
Contrary to these results, [4] shows that 2-cycles exist for the Adam optimizer without bias correction for the simple case of a scalar quadratic objective function f(w) = ½ w². Quadratic objective functions are a natural benchmark for any optimization algorithm in convex analysis: the standard gradient descent algorithm converges for a sufficiently small learning rate, see [8], so this behaviour should be replicated by more sophisticated gradient-motivated adaptive algorithms like Adam. However, [4, Proposition 3.3] only deals with the case β_1 = 0, which means that the first moments are not adapted at all – this case can hardly be called Adam any more.
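For the case β_1 = 0 such a 2-cycle can be made explicit by a short hand calculation, at least for the variant of Adam used in this paper (ε inside the square root, no bias correction): on a candidate cycle w alternates in sign, v stays constant at w², and the update forces |w| = √(α²/4 − ε), provided α² > 4ε. The following Python snippet is our own illustration of this observation, not part of [4], and the hyperparameters are chosen arbitrarily:

import numpy as np

# Candidate 2-cycle of Adam without bias correction, beta_1 = 0, f(w) = w^2/2.
# On the cycle: w alternates in sign, v = w^2 stays constant,
# and |w| = sqrt(alpha^2/4 - eps)  (requires alpha^2 > 4*eps).
alpha, beta1, beta2, eps = 0.01, 0.0, 0.999, 1e-8

def adam_step_no_bias(m, v, w):
    g = w                                   # gradient of f(w) = w^2/2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    w = w - alpha * m / np.sqrt(v + eps)    # eps inside the square root
    return m, v, w

w0 = np.sqrt(alpha**2 / 4 - eps)            # candidate cycle point
m, v, w = w0, w0**2, w0
for _ in range(4):
    m, v, w = adam_step_no_bias(m, v, w)
    print(w)                                # alternates between -w0 and w0

The iterate flips sign in every step and returns to the starting state after two steps, so the trajectory neither converges to the minimizer w = 0 nor diverges.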
We extend the results of [4] to the general case of hyperparameters α > 0, 0 < β_1 < 1, 0 < β_2 < 1, and show the existence of 2-limit-cycles for scalar objective functions f(w) = ½ c w², c > 0, which easily generalizes by diagonalization to strictly convex quadratic functions f(w) = ½ wᵀCw, C positive definite. This is done for the Adam algorithm in batch mode only, but including the bias correction. We give numerical evidence that for typical values of β_1, β_2 near 1 these 2-cycles are unstable, and stable for β_1, β_2 near 0. The analysis of the limit cycles is not exhaustive: more 2-cycles may exist, as well as cycles of larger period. However, our results suffice to clarify the global non-convergence of Adam even for strictly convex functions under the fairly standard assumption of bounds on the Hessian like 0 < l·I_n ≤ ∇²f(w) ≤ L·I_n for all w ∈ ℝⁿ.
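To see why the scalar case is the essential one, note that all operations in the Adam iteration act component-wise. For a diagonal matrix C = diag(c_1, …, c_n) the n-dimensional iteration therefore splits into n independent scalar iterations (a sketch of this reduction, written with the update of Algorithm 1 below):

∇f(w) = Cw, so for each coordinate i = 1, …, n:
  m_{t+1,i} = β_1 m_{t,i} + (1 − β_1) c_i w_{t,i}
  v_{t+1,i} = β_2 v_{t,i} + (1 − β_2) c_i² w_{t,i}²
  w_{t+1,i} = w_{t,i} − α · (√(1 − β_2^{t+1}) / (1 − β_1^{t+1})) · m_{t+1,i} / √(v_{t+1,i} + ε)

i.e. each coordinate runs the scalar Adam iteration for f_i(w_i) = ½ c_i w_i².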
The outline of the paper is as follows: In Section 2 we define our variant of the Adam algorithm and explain the steps of our proof. In Section 3 these steps are carried out. The Maple code used can be obtained from the publisher's web site¹. Section 4 shows numerical simulations suggesting that a Hopf bifurcation occurs, before we state some conclusions and relate our results to other research.
Notation: With Mat_n we denote the set of all real n-by-n matrices. The symbol ᵀ denotes the transpose of a vector or matrix. The class of k-times continuously differentiable functions from ℝⁿ to ℝᵐ is denoted by Cᵏ(ℝⁿ, ℝᵐ), with ∇f the gradient and ∇²f the Hessian for scalar valued functions. Throughout this paper we assume f: ℝⁿ → ℝ to be at least C¹, and C² for some results. In the numerical tests we denote the machine accuracy by eps = 2.2204·10⁻¹⁶, i.e. standard IEEE floating point numbers with double precision.
¹ URL missing – We can provide the code for the referees of course.
2 Outline of the Proof
2.1 Definition of the Algorithm
Over time, various versions of the Adam optimizer have been developed. We use the version of Adam shown in Algorithm 1 (the symbols ⊗, ⊘ and ⊕ denote the component-wise multiplication and division of vectors, as well as the component-wise addition of a vector and a scalar). The main points worth noting are:
Originally, Kingma and Ba [7] use √v̂ ⊕ ε in the denominator and the bias correction in m̂ and v̂. Other publications like [9, 3] do not use an ε to avoid division by zero, and neither does [12], but the latter initializes v_0 = ε with essentially the same effect. We use the variant with ε in the denominator; otherwise the initial value v_0 = 0 would have to be excluded in all results, and one could not talk about stability of a fixed point w* of the iteration corresponding to a minimum of the objective function.
Also we use √(v ⊕ ε) as in [4, 2], instead of √v ⊕ ε as in the original publication [7]. Our variant has the advantage that the iteration is continuously differentiable for all v ≥ 0, whereas the ε outside of the square root leads to a non-differentiable exception set. The numerical differences of the two variants are marginal and are described in more detail in [2].
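As a quick plausibility check of the last statement (our own snippet, not taken from [2]), one can compare the two denominators for typical magnitudes of v and the common choice ε = 10⁻⁸:

import numpy as np

# Relative difference between sqrt(v + eps) (our variant) and sqrt(v) + eps
# (the original variant of [7]) for a range of typical values of v.
eps = 1e-8
for v in [1e-6, 1e-4, 1e-2, 1.0]:
    inside = np.sqrt(v + eps)      # eps under the square root
    outside = np.sqrt(v) + eps     # eps outside the square root
    print(v, abs(inside - outside) / outside)

The difference shrinks rapidly as v grows and is noticeable only for v close to zero, which is exactly where our variant has the advantage of remaining continuously differentiable.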
All cited publications, with the exception of [4], apply a bias correction in the learning rate α_t; we use the same bias correction as described in [7, Section 2].
Algorithm 1 Adam Optimization
Require: α ∈ ℝ₊, ε ∈ ℝ₊, β_1, β_2 ∈ (0, 1), w_0 ∈ ℝⁿ and the function f ∈ C²(ℝⁿ, ℝ)
1: m_0 = 0
2: v_0 = 0
3: t = 0
4: while w not converged do
5:   m_{t+1} = β_1 m_t + (1 − β_1) ∇_w f(w_t)
6:   v_{t+1} = β_2 v_t + (1 − β_2) ∇_w f(w_t) ⊗ ∇_w f(w_t)
7:   w_{t+1} = w_t − α · (√(1 − β_2^{t+1}) / (1 − β_1^{t+1})) · m_{t+1} ⊘ √(v_{t+1} ⊕ ε)
8:   t = t + 1
9: end while
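For reference, the following is our own direct transcription of Algorithm 1 into Python (the paper's symbolic analysis in Section 3 uses Maple), applied in batch mode to a quadratic objective f(w) = ½ wᵀCw:

import numpy as np

# Batch-mode Adam exactly as in Algorithm 1: bias correction folded into the
# step size and eps placed inside the square root.
def adam(grad, w0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(steps):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        alpha_t = alpha * np.sqrt(1 - beta2 ** (t + 1)) / (1 - beta1 ** (t + 1))
        w = w - alpha_t * m / np.sqrt(v + eps)
    return w

C = np.diag([1.0, 10.0])                    # strictly convex quadratic example
print(adam(lambda w: C @ w, np.array([1.0, -1.0])))

Whether such a run approaches the minimizer w = 0 or gets caught in oscillations depends on the hyperparameters; the existence of 2-cycles shown below means that convergence cannot be taken for granted.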
We denote by x = (m, v, w) the state of the Adam iteration, and interpret the algorithm as a discrete time dynamical system x_{t+1} = T(t, x_t; p) with T = T(t, x; p) = T(t, m, v, w; α, β_1, β_2, ε) to express the dependence of the iteration on the state x = (m, v, w) and the hyperparameters p = (α, β_1, β_2, ε).
We write Adam without bias correction in the same way as T̄ = T̄(x; p) = T̄(m, v, w; α, β_1, β_2, ε). This gives an autonomous dynamical system; the right hand side does not explicitly depend on t. The difference between the two systems is denoted by Θ(t, x; p), so we have, analogous to [2],

x_{t+1} = T(t, x_t; p) = T̄(x_t; p) + Θ(t, x_t; p)    (1)