Non-Convergence and Limit Cycles in the Adam optimizer*
Sebastian Bock¹ and Martin Weiß¹
OTH Regensburg, Prüfeninger Str. 58, 93049 Regensburg, Germany
{sebastian2.bock,martin.weiss}@oth-regensburg.de
Abstract. One of the most popular training algorithms for deep neural networks is the Adaptive Moment Estimation (Adam) introduced by Kingma and Ba. Despite its success in many applications there is no satisfactory convergence analysis: only local convergence can be shown for batch mode under some restrictions on the hyperparameters, and counterexamples exist for incremental mode. Recent results show that for simple quadratic objective functions limit cycles of period 2 exist in batch mode, but only for atypical hyperparameters, and only for the algorithm without bias correction. We extend the convergence analysis for Adam in batch mode with bias correction and show that even for quadratic objective functions, as the simplest case of convex functions, 2-limit-cycles exist, for all choices of the hyperparameters. We analyze the stability of these limit cycles and relate our analysis to other results where approximate convergence was shown, but under the additional assumption of bounded gradients, which does not apply to quadratic functions. The investigation heavily relies on the use of computer algebra due to the complexity of the equations.
Keywords: Adam optimizer · convergence · computer algebra · dynamical system · limit cycle
1 Introduction
Adaptive Moment Estimation (Adam), originally presented by Kingma and Ba [7], is probably the most widely used training algorithm for neural networks, especially convolutional neural networks. Implementations exist in all popular machine learning frameworks like Tensorflow or PyTorch. Despite its apparent success the theoretical basis is weak: the original proof in [7] is wrong, as has been noted by several authors, see [1, 10]. Of course a faulty proof does not imply that the Adam optimizer does not converge, and indeed local convergence can be shown for batch mode under reasonable restrictions on the hyperparameters, see [2]. Furthermore, [9] gives an example in incremental mode where the regret does not converge, and neither do the arguments of the objective function.
* This paper presents results of the project "LeaP – Learning Poses" supported by the Bavarian Ministry of Science and Art under Kap. 15 49 TG 78.
Several results exist which show ε-bounds on the gradients, ‖∇f(w_t)‖ < ε for ε > 0 arbitrarily small and all t sufficiently large, see [5, Theorem 3.4], [3], [11, Theorem 3.3]. Other results show asymptotic bounds on the regret, or that the function values come close to the minimum, f(w_t) − f(w*) < ε, see [3] for example. To the best of the authors' knowledge, [2] is the only (partial) result on weight convergence in the standard mathematical sense lim_{t→∞} w_t = w*, however only in a local sense.
Contrary to these results, [4] shows that 2-cycles exist for the Adam optimizer without bias correction for the simple case of a scalar quadratic objective function f(w) = ½ w². Quadratic objective functions are a natural benchmark for any optimization algorithm in convex analysis: the standard gradient descent algorithm converges for a sufficiently small learning rate, see [8], so this behaviour should be replicated by more sophisticated gradient-motivated adaptive algorithms like Adam. However, [4, Proposition 3.3] only deals with the case β_1 = 0, which means that the first moments are not adapted at all – this case can hardly be called Adam any more.
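For the case β_1 = 0 such a 2-cycle can be made explicit by a short hand calculation, at least for the variant of Adam used in this paper (ε inside the square root, no bias correction): on a candidate cycle w alternates in sign, v stays constant at w², and the update forces |w| = √(α²/4 − ε), provided α² > 4ε. The following Python snippet is our own illustration of this observation, not part of [4], and the hyperparameters are chosen arbitrarily:

import numpy as np

# Candidate 2-cycle of Adam without bias correction, beta_1 = 0, f(w) = w^2/2.
# On the cycle: w alternates in sign, v = w^2 stays constant,
# and |w| = sqrt(alpha^2/4 - eps)  (requires alpha^2 > 4*eps).
alpha, beta1, beta2, eps = 0.01, 0.0, 0.999, 1e-8

def adam_step_no_bias(m, v, w):
    g = w                                   # gradient of f(w) = w^2/2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    w = w - alpha * m / np.sqrt(v + eps)    # eps inside the square root
    return m, v, w

w0 = np.sqrt(alpha**2 / 4 - eps)            # candidate cycle point
m, v, w = w0, w0**2, w0
for _ in range(4):
    m, v, w = adam_step_no_bias(m, v, w)
    print(w)                                # alternates between -w0 and w0

The iterate flips sign in every step and returns to the starting state after two steps, so the trajectory neither converges to the minimizer w = 0 nor diverges.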
We extend the results of [4] to the general case of hyperparameters α > 0, 0 < β_1 < 1, 0 < β_2 < 1, and show the existence of 2-limit-cycles for scalar objective functions f(w) = ½ c w², c > 0, which easily generalizes by diagonalization to strictly convex quadratic functions f(w) = ½ wᵀCw, C positive definite. This is done for the Adam algorithm in batch mode only, but including the bias correction. We give numerical evidence that for typical values of β_1, β_2 near 1 these 2-cycles are unstable, and stable for β_1, β_2 near 0. The analysis of the limit cycles is not exhaustive: more 2-cycles may exist, as well as cycles of larger period. However, our results suffice to clarify the global non-convergence of Adam even for strictly convex functions under the fairly standard assumption of bounds on the Hessian like 0 < l·I_n ≤ ∇²f(w) ≤ L·I_n for all w ∈ ℝⁿ.
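To see why the scalar case is the essential one, note that all operations in the Adam iteration act component-wise. For a diagonal matrix C = diag(c_1, …, c_n) the n-dimensional iteration therefore splits into n independent scalar iterations (a sketch of this reduction, written with the update of Algorithm 1 below):

∇f(w) = Cw, so for each coordinate i = 1, …, n:
  m_{t+1,i} = β_1 m_{t,i} + (1 − β_1) c_i w_{t,i}
  v_{t+1,i} = β_2 v_{t,i} + (1 − β_2) c_i² w_{t,i}²
  w_{t+1,i} = w_{t,i} − α · (√(1 − β_2^{t+1}) / (1 − β_1^{t+1})) · m_{t+1,i} / √(v_{t+1,i} + ε)

i.e. each coordinate runs the scalar Adam iteration for f_i(w_i) = ½ c_i w_i².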
The outline of the paper is as follows: In Section 2 we define our variant of the Adam algorithm and explain the steps of our proof. In Section 3 these steps are carried out. The Maple code used can be obtained from the publisher's web site¹. Section 4 shows numerical simulations suggesting that a Hopf bifurcation occurs, before we state some conclusions and relate our results to other research.
Notation: With Mat_n we denote the set of all real n-by-n matrices. The symbol ᵀ denotes the transpose of a vector or matrix. The class of k-times continuously differentiable functions from ℝⁿ to ℝᵐ is denoted by Cᵏ(ℝⁿ, ℝᵐ), with ∇f the gradient and ∇²f the Hessian for scalar valued functions. Throughout this paper we assume f: ℝⁿ → ℝ to be at least C¹, and C² for some results. In the numerical tests we denote the machine accuracy by eps = 2.2204·10⁻¹⁶, i.e. standard IEEE floating point numbers with double precision.
¹ URL missing – We can provide the code for the referees of course.
2 Outline of the Proof
2.1 Definition of the Algorithm
Over time, various versions of the Adam optimizer have been developed. We use the version of Adam shown in Algorithm 1 (the symbols ⊗, ⊘ and ⊕ denote the component-wise multiplication and division of vectors, as well as the component-wise addition of a vector and a scalar). The main points worth noting are:
Originally, Kingma and Ba [7] use √v̂ ⊕ ε in the denominator and the bias correction in m̂ and v̂. Other publications like [9, 3] do not use an ε to avoid division by zero, and neither does [12], but the latter initializes v_0 = ε with essentially the same effect. We use the variant with ε in the denominator; otherwise the initial value v_0 = 0 would have to be excluded in all results, and one could not talk about stability of a fixed point w* of the iteration corresponding to a minimum of the objective function.
Also we use √(v ⊕ ε) as in [4, 2], instead of √v ⊕ ε as in the original publication [7]. Our variant has the advantage that the iteration is continuously differentiable for all v ≥ 0, whereas the ε outside of the square root leads to a non-differentiable exception set. The numerical differences of the two variants are marginal and are described in more detail in [2].
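As a quick plausibility check of the last statement (our own snippet, not taken from [2]), one can compare the two denominators for typical magnitudes of v and the common choice ε = 10⁻⁸:

import numpy as np

# Relative difference between sqrt(v + eps) (our variant) and sqrt(v) + eps
# (the original variant of [7]) for a range of typical values of v.
eps = 1e-8
for v in [1e-6, 1e-4, 1e-2, 1.0]:
    inside = np.sqrt(v + eps)      # eps under the square root
    outside = np.sqrt(v) + eps     # eps outside the square root
    print(v, abs(inside - outside) / outside)

The difference shrinks rapidly as v grows and is noticeable only for v close to zero, which is exactly where our variant has the advantage of remaining continuously differentiable.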
All cited publications, with the exception of [4], apply a bias correction in the learning rate α_t; we use the same bias correction as described in [7, Section 2].
Algorithm 1 Adam Optimization
Require: α ∈ ℝ₊, ε ∈ ℝ₊, β_1, β_2 ∈ (0, 1), w_0 ∈ ℝⁿ and the function f ∈ C²(ℝⁿ, ℝ)
1: m_0 = 0
2: v_0 = 0
3: t = 0
4: while w not converged do
5:   m_{t+1} = β_1 m_t + (1 − β_1) ∇_w f(w_t)
6:   v_{t+1} = β_2 v_t + (1 − β_2) ∇_w f(w_t) ⊗ ∇_w f(w_t)
7:   w_{t+1} = w_t − α · (√(1 − β_2^{t+1}) / (1 − β_1^{t+1})) · m_{t+1} ⊘ √(v_{t+1} ⊕ ε)
8:   t = t + 1
9: end while
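For reference, the following is our own direct transcription of Algorithm 1 into Python (the paper's symbolic analysis in Section 3 uses Maple), applied in batch mode to a quadratic objective f(w) = ½ wᵀCw:

import numpy as np

# Batch-mode Adam exactly as in Algorithm 1: bias correction folded into the
# step size and eps placed inside the square root.
def adam(grad, w0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(steps):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        alpha_t = alpha * np.sqrt(1 - beta2 ** (t + 1)) / (1 - beta1 ** (t + 1))
        w = w - alpha_t * m / np.sqrt(v + eps)
    return w

C = np.diag([1.0, 10.0])                    # strictly convex quadratic example
print(adam(lambda w: C @ w, np.array([1.0, -1.0])))

Whether such a run approaches the minimizer w = 0 or gets caught in oscillations depends on the hyperparameters; the existence of 2-cycles shown below means that convergence cannot be taken for granted.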
We denote by x = (m, v, w) the state of the Adam iteration, and interpret the algorithm as a discrete time dynamical system x_{t+1} = T(t, x_t; p) with T = T(t, x; p) = T(t, m, v, w; α, β_1, β_2, ε) to express the dependence of the iteration on the state x = (m, v, w) and the hyperparameters p = (α, β_1, β_2, ε).
We write Adam without bias correction in the same way as T̄ = T̄(x; p) = T̄(m, v, w; α, β_1, β_2, ε). This gives an autonomous dynamical system; the right hand side does not explicitly depend on t. The difference between the two systems is denoted by Θ(t, x; p), so we have, analogous to [2],

x_{t+1} = T(t, x_t; p) = T̄(x_t; p) + Θ(t, x_t; p)    (1)