A Momentum Accelerated Adaptive Cubic Regularization Method for
Nonconvex Optimization
Yihang Gao, Michael K. Ng
Department of Mathematics, The University of Hong Kong, Pokfulam, Hong Kong SAR
Abstract

The cubic regularization method (CR) and its adaptive version (ARC) are popular Newton-type methods for solving unconstrained non-convex optimization problems, due to their global convergence to local minima under mild conditions. The main aim of this paper is to develop a momentum-accelerated adaptive cubic regularization method (ARCm) to improve convergence performance. With a proper choice of momentum step size, we show the global convergence of ARCm, and its local convergence can also be guaranteed under the KŁ property. Such global and local convergence can also be established when inexact solvers with low computational costs are employed in the iteration procedure. Numerical results for non-convex logistic regression and robust linear regression models are reported to demonstrate that the proposed ARCm significantly outperforms state-of-the-art cubic regularization methods (e.g., CR, momentum-based CR, ARC) and the trust region method. In particular, the number of iterations required by ARCm is only 10% to 50% of that required by the most competitive method (ARC) in the experiments.

Keywords. Adaptive cubic regularization, momentum, KŁ property, inexact solvers, global and local convergence
1 Introduction
Most machine learning tasks involve solving challenging
(non-convex) optimization problems
(1.1) $\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x}).$
Second-order critical points are usually the preferred so-
lutions for (1.1) since many machine learning problems
have been shown to have only strict saddle points and
global minima without spurious local minima [8, 16].
Second-order methods that exploit Hessian information
have been proposed to escape saddle points [15, 5] and
enjoy faster local convergence [14, 19, 21]. The cubic regularization method (CR), one of these second-order methods, was first proposed by Griewank [9] and later independently by Nesterov and Polyak [15] and by Weiser et al. [18]; it is widely applied to inverse problems [4, 12], regression models [19, 20], and minimax problems [11].
The vanilla CR method is formulated as
$$\mathbf{s}_k \in \arg\min_{\mathbf{s}} \; \nabla f(\mathbf{x}_k)^\top \mathbf{s} + \frac{1}{2}\mathbf{s}^\top \nabla^2 f(\mathbf{x}_k)\,\mathbf{s} + \frac{M_k}{6}\|\mathbf{s}\|_2^3, \qquad \mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{s}_k,$$
where $M_k := M$ is a fixed, pre-defined constant for all iterations. In recent decades, various techniques have been adopted to improve the performance of CR. Wang et al. [17] accelerated the vanilla CR by a momentum term (CRm); the method works well mainly by enlarging the step size of $\mathbf{s}_k$ when the cubic penalty parameter $M$ is over-estimated.
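For concreteness, below is a minimal sketch of one vanilla CR step. It hands the cubic subproblem to a generic first-order solver (scipy.optimize.minimize) purely for illustration; the function name cr_step and this choice of subproblem solver are our own, not the solvers used in the paper (dedicated CRS solvers are discussed in Section 3).

```python
import numpy as np
from scipy.optimize import minimize

def cr_step(grad, hess, x, M):
    """One vanilla CR step: minimize the cubic model m(s) and return x + s.

    grad, hess: callables returning the gradient vector and Hessian matrix.
    M: fixed cubic penalty parameter (assumed given).
    """
    g, H = grad(x), hess(x)

    def model(s):
        ns = np.linalg.norm(s)
        return g @ s + 0.5 * s @ H @ s + (M / 6.0) * ns ** 3

    def model_grad(s):
        # Gradient of the cubic model; setting it to zero recovers the
        # CRS first-order condition g + H s + (M/2)||s|| s = 0.
        ns = np.linalg.norm(s)
        return g + H @ s + (M / 2.0) * ns * s

    # Generic solver as a stand-in for a dedicated CRS solver.
    res = minimize(model, np.zeros_like(x), jac=model_grad, method="L-BFGS-B")
    return x + res.x
```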
Cartis et al. [4] proposed the adaptive cubic regularization method (ARC), which adaptively assigns $M_k$ based on the quality of the step $\mathbf{s}_k$, in analogy to the trust region method (TR) [5]. Stochastic (subsampled) variants of ARC (e.g., Kohler et al. [13] and Zhou et al. [20]) were developed to reduce the overall computation, where the gradient $\nabla f(\mathbf{x}_k)$ and the Hessian $\nabla^2 f(\mathbf{x}_k)$ are inexactly evaluated. To the best of our knowledge, ARC behaves better in most applications compared with CR methods that fix $M$ (e.g., vanilla CR and CRm). A natural question is how to further accelerate ARC with little extra cost per iteration.
In this paper, we investigate a momentum-
accelerated adaptive cubic regularization method
(ARCm). Here are our contributions and the outline
of the paper.
We develop a momentum-accelerated adaptive cubic regularization method (ARCm). We adopt a general scheme for momentum, which is more suitable for ARC than for CR (see Theorem 2.1). The extra computation is cheap and negligible, as both the momentum and its step size depend solely on $\{\mathbf{s}_k\}$ and are free of gradient or Hessian evaluations.
With a proper setting of the momentum step size (see Algorithm 1), the global convergence of ARCm to second-order critical points is guaranteed (see Theorem 2.2).
Algorithm 1 Adaptive cubic regularization with momentum (ARCm)

Input: $\mathbf{x}_0$, $\mathbf{v}_{-1} = \mathbf{0}$, $\gamma_1 > 1 \ge \gamma_2 > \gamma_3 > 0$, $1 > \eta_2 > \eta_1 > 0$, $\sigma_0 > \sigma_{\min} > 0$, $1 > \tau > 0$, $\beta > 0$, and $\alpha_1, \alpha_2 > 0$.
Output: $\{\mathbf{x}_k\}_{k=0}^{T}$.
1: for $k = 0$ to $T-1$ do
2:   Solve the cubic subproblem:
     $$\mathbf{s}_k \in \arg\min_{\mathbf{s}} \; m_k(\mathbf{s}) := f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^\top \mathbf{s} + \frac{1}{2}\mathbf{s}^\top \nabla^2 f(\mathbf{x}_k)\,\mathbf{s} + \frac{\sigma_k}{6}\|\mathbf{s}\|_2^3.$$
3:   Compute $f(\mathbf{x}_k + \mathbf{s}_k)$ and $\rho_k = \dfrac{f(\mathbf{x}_k) - f(\mathbf{x}_k + \mathbf{s}_k)}{f(\mathbf{x}_k) - m_k(\mathbf{s}_k)}$.
4:   if $\rho_k > \eta_1$ (successful update) then
5:     $\mathbf{y}_{k+1} = \mathbf{x}_k + \mathbf{s}_k$
6:     Momentum step: select $\beta_k \in \left[0, \min\{\tau, \alpha_1\|\mathbf{s}_k\|_2, \alpha_2\|\mathbf{s}_k\|_2^2\}\right]$ such that $f(\mathbf{z}_{k+1}) \le f(\mathbf{y}_{k+1})$, where $\mathbf{v}_k = \beta_k \mathbf{v}_{k-1} + \mathbf{s}_k$ and $\mathbf{z}_{k+1} = \mathbf{x}_k + \mathbf{v}_k$.
7:     $\mathbf{x}_{k+1} = \mathbf{z}_{k+1}$
8:     if $\rho_k > \eta_2$ (very successful update) then
9:       $\sigma_{k+1} = \max(\sigma_{\min}, \gamma_3 \sigma_k)$
10:    else
11:      $\sigma_{k+1} = \gamma_2 \sigma_k$
12:    end if
13:  else (unsuccessful update)
14:    $\mathbf{x}_{k+1} = \mathbf{x}_k$ and $\mathbf{v}_k = \mathbf{v}_{k-1}$
15:    $\sigma_{k+1} = \gamma_1 \sigma_k$
16:  end if
17: end for
We further show that the proposed ARCm enjoys local convergence under the KŁ (Kurdyka–Łojasiewicz) property (see Theorem 2.3), which is one of the advantages of second-order methods over first-order methods.
We also study the global convergence of ARCm with inexact solutions of the cubic regularized subproblem (CRS), since in practice the CRS is usually solved only approximately (see Theorem 3.1). The local convergence is preserved if the CRS error decreases with $\|\mathbf{s}_k\|_2^3$ (see Corollary 3.1), but may be destroyed otherwise.
We conduct experiments on high-dimensional and large-scale non-convex logistic regression and robust linear regression problems. Experimental results show that ARCm significantly outperforms CR, CRm, ARC, and TR, and is 10%-50% faster than ARC (the most competitive of these methods) in terms of iterations to convergence.
We acknowledge CRm [17], which first accelerated CR by momentum, and we highlight our main differences from it. Firstly, the momentum scheme in ARCm differs from that in CRm due to the adaptive selection strategy for $M_k$. Secondly, the momentum step size in ARCm is free of gradient evaluations, whereas CRm requires them. Thirdly, besides global convergence, we also study the local convergence of ARCm under the KŁ property, which is more general than the local error-bound condition studied for CRm. Furthermore, we analyze ARCm with inexact CRS solutions, which is more relevant in practice.
2 The Proposed ARCm Algorithm

The detailed pseudocode for ARCm is shown in Algorithm 1. Here, we first assume that the cubic subproblem in Step 2 of Algorithm 1 is solved exactly. In Section 3, we will analyze the convergence properties of ARCm with inexact solutions in Step 2.
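As a concrete companion to the pseudocode, the following is a minimal Python sketch of the ARCm iteration. The helpers solve_crs (an approximate CRS solver, e.g., the generic one sketched in Section 1) and select_beta (one realization is sketched after the next paragraph) are assumptions passed in as callables, and the default parameter values are our own illustrative choices, not values from the paper.

```python
import numpy as np

def arcm(f, grad, hess, x0, solve_crs, select_beta, T=100,
         sigma0=1.0, sigma_min=1e-4, gamma1=2.0, gamma2=1.0, gamma3=0.5,
         eta1=0.1, eta2=0.9, tau=0.9, alpha1=1.0, alpha2=1.0):
    """Sketch of Algorithm 1 (ARCm).

    solve_crs(g, H, sigma): returns an (approximate) minimizer of the
        cubic model built from gradient g, Hessian H, and penalty sigma.
    select_beta(f, x, s, v, cap): returns beta_k in [0, cap] with
        f(x + beta*v + s) <= f(x + s).
    """
    x = np.asarray(x0, dtype=float)
    v, sigma = np.zeros_like(x), sigma0
    for k in range(T):
        g, H = grad(x), hess(x)
        s = solve_crs(g, H, sigma)                          # Step 2
        model_val = (f(x) + g @ s + 0.5 * s @ H @ s
                     + sigma / 6.0 * np.linalg.norm(s) ** 3)  # m_k(s_k)
        rho = (f(x) - f(x + s)) / (f(x) - model_val)        # Step 3
        if rho > eta1:                                      # successful update
            ns = np.linalg.norm(s)
            cap = min(tau, alpha1 * ns, alpha2 * ns ** 2)
            beta = select_beta(f, x, s, v, cap)             # Step 6
            v = beta * v + s                                # v_k
            x = x + v                                       # x_{k+1} = z_{k+1}
            sigma = max(sigma_min, gamma3 * sigma) if rho > eta2 else gamma2 * sigma
        else:                                               # unsuccessful update
            sigma = gamma1 * sigma                          # x and v unchanged
    return x
```

Note the design of the unsuccessful branch: both $\mathbf{x}_k$ and $\mathbf{v}_{k-1}$ are carried over unchanged, and only $\sigma_k$ is inflated, mirroring lines 14-15 of Algorithm 1.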
Firstly, the momentum $\mathbf{v}_{k-1}$ is an aggregation of previously accepted descent steps. It is always uniformly bounded if $\beta_k \le \tau < 1$ and $\mathbf{s}_k$ is bounded for all $k \le T$ (we will prove this in Lemma 2.3). At the beginning of the algorithm (i.e., $k \ll T$), $\|\mathbf{s}_k\|_2$ is usually relatively large. Therefore, $\alpha_1\|\mathbf{s}_k\|_2$ and $\alpha_2\|\mathbf{s}_k\|_2^2$ dominate the term $\min\{\tau, \alpha_1\|\mathbf{s}_k\|_2, \alpha_2\|\mathbf{s}_k\|_2^2\}$, i.e., $\tau = \min\{\tau, \alpha_1\|\mathbf{s}_k\|_2, \alpha_2\|\mathbf{s}_k\|_2^2\}$ is the common case for the momentum step size bound. When $\mathbf{x}_k$ approaches a local minimum (so that $\|\mathbf{s}_k\|_2 \to 0$), we cannot expect momentum with a large step size to work for second-order methods. Therefore, we adopt $\alpha_1\|\mathbf{s}_k\|_2$ and $\alpha_2\|\mathbf{s}_k\|_2^2$ in selecting $\beta_k$ (where $\beta_k \ge 0$) in order to preserve the local convergence property of the second-order method. Secondly, we require that $f(\mathbf{z}_{k+1}) \le f(\mathbf{y}_{k+1})$ since we do not want to violate the sufficient descent of the objective established for CR and ARC. Such a $\mathbf{z}_{k+1}$ always exists (e.g., $\beta_k = 0$), and we may use the bisection method to search for an appropriate $\beta_k$. The following Theorem 2.1 shows that a $\mathbf{z}_{k+1}$ with nonzero $\beta_k$ may exist in two cases, where the momentum term helps the convergence of ARCm (i.e., $f(\mathbf{z}_{k+1}) < f(\mathbf{y}_{k+1})$ holds).
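The $\beta_k$ search can be realized in several ways; below is a minimal backtracking variant that halves $\beta_k$ from its cap until the acceptance test $f(\mathbf{z}_{k+1}) \le f(\mathbf{y}_{k+1})$ holds, falling back to $\beta_k = 0$. This is one possible implementation of Step 6 of Algorithm 1, with the function name and halving schedule being our own illustrative choices, not necessarily the authors' exact procedure.

```python
import numpy as np

def select_beta(f, x, s, v_prev, cap, max_halvings=20):
    """Backtracking choice of the momentum step size beta_k in [0, cap].

    Returns the largest tried beta with f(x + beta*v_prev + s) <= f(x + s);
    beta = 0 (no momentum) always passes the test, so it is the fallback.
    """
    f_y = f(x + s)            # objective at the pure ARC iterate y_{k+1}
    beta = cap
    for _ in range(max_halvings):
        if f(x + beta * v_prev + s) <= f_y:
            return beta
        beta *= 0.5           # shrink toward the safe choice beta = 0
    return 0.0
```

A strict bisection on $\beta_k$ would also work; the halving schedule above simply biases toward larger momentum whenever the descent test permits it.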
Theorem 2.1. Assume that the Hessian $\nabla^2 f(\mathbf{x})$ of $f(\mathbf{x})$ is $L_k$-Lipschitz on the line segment $[\mathbf{x}_k, \mathbf{x}_k + \mathbf{s}_k]$ (i.e., for all $\mathbf{x}, \mathbf{y} \in [\mathbf{x}_k, \mathbf{x}_k + \mathbf{s}_k]$ we have $\|\nabla^2 f(\mathbf{x}) - \nabla^2 f(\mathbf{y})\|_2 \le L_k\|\mathbf{x} - \mathbf{y}\|_2$). Then in the following two cases the momentum may help the convergence, i.e., there exists a small enough $\beta_k$ such that $f(\mathbf{z}_{k+1}) < f(\mathbf{y}_{k+1})$:

(i) $L_k < \sigma_k$ and $\mathbf{s}_k^\top \mathbf{v}_{k-1} > 0$;

(ii) $L_k > \sigma_k$ and $\mathbf{s}_k^\top \mathbf{v}_{k-1} < 0$.
Proof. Suppose that the Hessian of $f(\mathbf{x})$ is $L$-Lipschitz on a ball centered at $\mathbf{y}_{k+1}$ with radius $\tau\|\mathbf{v}_{k-1}\|_2$. Then, by [15, Lemma 1] (we also provide this useful result in the supplementary material (A.2)), we have
$$f(\mathbf{z}_{k+1}) - f(\mathbf{y}_{k+1}) \le \beta\,\nabla f(\mathbf{y}_{k+1})^\top \mathbf{v}_{k-1} + \frac{\beta^2}{2}\,\mathbf{v}_{k-1}^\top \nabla^2 f(\mathbf{y}_{k+1})\,\mathbf{v}_{k-1} + \frac{L\beta^3}{6}\|\mathbf{v}_{k-1}\|_2^3.$$
If $\nabla f(\mathbf{y}_{k+1})^\top \mathbf{v}_{k-1} < 0$, then there exists a small enough $\beta > 0$ such that $f(\mathbf{z}_{k+1}) < f(\mathbf{y}_{k+1})$. In the remaining part, we show that in the above two cases $\nabla f(\mathbf{y}_{k+1})^\top \mathbf{v}_{k-1} < 0$ may hold. Using the properties of the cubic regularization method [15, Lemma 1 & (2.5)], we have
$$\begin{aligned}
\nabla f(\mathbf{y}_{k+1})^\top \mathbf{v}_{k-1}
&= (\nabla f(\mathbf{y}_{k+1}) - \nabla f(\mathbf{x}_k))^\top \mathbf{v}_{k-1} + \nabla f(\mathbf{x}_k)^\top \mathbf{v}_{k-1} \\
&= (\nabla f(\mathbf{y}_{k+1}) - \nabla f(\mathbf{x}_k))^\top \mathbf{v}_{k-1} - \Big(\nabla^2 f(\mathbf{x}_k)\mathbf{s}_k + \frac{1}{2}\sigma_k\|\mathbf{s}_k\|_2\,\mathbf{s}_k\Big)^\top \mathbf{v}_{k-1} \\
&= -\frac{1}{2}\sigma_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} + \big(\nabla f(\mathbf{y}_{k+1}) - \nabla f(\mathbf{x}_k) - \nabla^2 f(\mathbf{x}_k)\mathbf{s}_k\big)^\top \mathbf{v}_{k-1} \\
&\le -\frac{1}{2}\sigma_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} + \frac{1}{2}L_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1}.
\end{aligned} \tag{2.2}$$
In the last inequality of (2.2), we use the bound
$$\big(\nabla f(\mathbf{y}_{k+1}) - \nabla f(\mathbf{x}_k) - \nabla^2 f(\mathbf{x}_k)\mathbf{s}_k\big)^\top \mathbf{v}_{k-1} \le \frac{1}{2}L_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1}.$$
If $L_k < \sigma_k$ (i.e., $L_k \le \sigma_k - \epsilon$ for some $\epsilon > 0$) and $\mathbf{s}_k^\top \mathbf{v}_{k-1} > 0$, then
$$\nabla f(\mathbf{y}_{k+1})^\top \mathbf{v}_{k-1} \le -\frac{1}{2}\sigma_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} + \frac{1}{2}L_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} \le -\frac{\epsilon}{2}\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} < 0.$$
If $L_k > \sigma_k$, $\mathbf{s}_k^\top \mathbf{v}_{k-1} < 0$, and
$$\big(\nabla f(\mathbf{y}_{k+1}) - \nabla f(\mathbf{x}_k) - \nabla^2 f(\mathbf{x}_k)\mathbf{s}_k\big)^\top \mathbf{v}_{k-1} = \frac{1}{2}\widetilde{L}_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1}$$
for some $\sigma_k + \epsilon \le \widetilde{L}_k \le L_k$ with $\epsilon > 0$, then
$$\begin{aligned}
\nabla f(\mathbf{y}_{k+1})^\top \mathbf{v}_{k-1}
&= -\frac{1}{2}\sigma_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} + \big(\nabla f(\mathbf{y}_{k+1}) - \nabla f(\mathbf{x}_k) - \nabla^2 f(\mathbf{x}_k)\mathbf{s}_k\big)^\top \mathbf{v}_{k-1} \\
&= -\frac{1}{2}\sigma_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} + \frac{1}{2}\widetilde{L}_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} \\
&\le \frac{\epsilon}{2}\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} < 0.
\end{aligned}$$
Remark 2.1. The two potential cases may happen since ARCm adaptively selects $M_k := \sigma_k$ by the criterion $\rho_k$, so $\sigma_k$ may be over-estimated or under-estimated. When $\sigma_k > L_k$, the step $\mathbf{s}_k$ is too conservative, and the new step $\mathbf{x}_{k+1} - \mathbf{x}_k$ contributes to a lower objective value if $\mathbf{v}_{k-1}$ is also a descent direction (i.e., $\mathbf{s}_k^\top \mathbf{v}_{k-1} > 0$). Conversely, if $\sigma_k < L_k$, the step $\mathbf{s}_k$ may be too aggressive, and then momentum in the opposite direction (i.e., $\mathbf{s}_k^\top \mathbf{v}_{k-1} < 0$) may correct the imperfect step $\mathbf{s}_k$. In CRm, the momentum $\mathbf{v}_{k-1}$ is highly correlated with $\mathbf{s}_k$, so that $\mathbf{s}_k^\top \mathbf{v}_{k-1} > 0$ and $M > L_k$; hence only the first case can happen there. Therefore, the momentum scheme in ARCm is more appropriate for ARC, since ARC is more ambitious than the vanilla CR.
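To illustrate case (i) of Theorem 2.1 numerically, the following toy check uses a quadratic objective (constant Hessian, hence $L_k = 0 < \sigma_k$) with a deliberately over-estimated $\sigma_k$. The fixed-point iteration for the CR step and the choice $\mathbf{v}_{k-1} = \mathbf{s}_k$ are our own illustrative simplifications, not from the paper.

```python
import numpy as np

# Case (i) on a toy quadratic f(x) = 0.5 x^T A x: the Hessian is constant,
# so its Lipschitz constant is L_k = 0 < sigma_k for any sigma_k > 0.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x

x_k = np.array([1.0, 1.0])
sigma = 5.0  # deliberately over-estimates L_k = 0, so s_k is conservative

# Exact CR step for a quadratic: solve (A + 0.5*sigma*||s|| I) s = -g
# via a fixed-point iteration on ||s|| (a simple stand-in for a CRS solver).
g = A @ x_k
s = -np.linalg.solve(A, g)
for _ in range(50):
    s = -np.linalg.solve(A + 0.5 * sigma * np.linalg.norm(s) * np.eye(2), g)

v_prev = s.copy()       # previous accepted step, assumed aligned with s_k
assert s @ v_prev > 0   # case (i): s_k^T v_{k-1} > 0

y = x_k + s
for beta in [0.5, 0.25, 0.1, 0.05]:
    z = x_k + beta * v_prev + s
    # Sufficiently small beta should give strict descent f(z) < f(y);
    # a too-large beta may overshoot and fail the test.
    print(beta, f(z) < f(y))
```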
2.1 Global Convergence

To prove global convergence, the following mild assumptions are essential.

Assumption 2.1. We make the following assumptions on the objective $f(\mathbf{x})$:

1. $f(\mathbf{x})$ is twice differentiable with respect to $\mathbf{x}$ everywhere.

2. $f(\mathbf{x})$ is bounded below, i.e., $f^\ast = \inf_{\mathbf{x}} f(\mathbf{x}) > -\infty$.