A Momentum Accelerated Adaptive Cubic Regularization Method for
Nonconvex Optimization
Yihang Gao, Michael K. Ng
Department of Mathematics, The University of Hong Kong, Pokfulam, Hong Kong SAR
Abstract

The cubic regularization method (CR) and its adaptive version (ARC) are popular Newton-type methods for solving unconstrained non-convex optimization problems, due to their global convergence to local minima under mild conditions. The main aim of this paper is to develop a momentum-accelerated adaptive cubic regularization method (ARCm) to improve convergence performance. With a proper choice of momentum step size, we show the global convergence of ARCm, and its local convergence can also be guaranteed under the KŁ property. Such global and local convergence can also be established when inexact solvers with low computational costs are employed in the iteration procedure. Numerical results for non-convex logistic regression and robust linear regression models are reported to demonstrate that the proposed ARCm significantly outperforms state-of-the-art cubic regularization methods (e.g., CR, momentum-based CR, ARC) and the trust region method. In particular, the number of iterations required by ARCm is only 10% to 50% of that required by the most competitive method (ARC) in the experiments.

Keywords. Adaptive cubic regularization, momentum, KŁ property, inexact solvers, global and local convergence
1 Introduction
Most machine learning tasks involve solving challenging
(non-convex) optimization problems
(1.1) $\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x}).$
Second-order critical points are usually the preferred so-
lutions for (1.1) since many machine learning problems
have been shown to have only strict saddle points and
global minima without spurious local minima [8, 16].
Second-order methods that exploit Hessian information
have been proposed to escape saddle points [15, 5] and
enjoy faster local convergence [14, 19, 21]. The cubic regularization method (CR), one of these second-order methods, was first proposed by Griewank [9] and later independently by Nesterov and Polyak [15] and by Weiser et al. [18]; it is widely applied to inverse problems [4, 12], regression models [19, 20], and minimax problems [11].
The vanilla CR method is formulated as
$$\mathbf{s}_k \in \arg\min_{\mathbf{s}} \; \nabla f(\mathbf{x}_k)^\top \mathbf{s} + \frac{1}{2}\mathbf{s}^\top \nabla^2 f(\mathbf{x}_k)\,\mathbf{s} + \frac{M_k}{6}\|\mathbf{s}\|_2^3, \qquad \mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{s}_k,$$
where $M_k := M$ is a fixed, pre-defined constant for all iterations. In recent decades, various techniques have been adopted to improve the performance of CR. Wang et al. [17] accelerated the vanilla CR by a momentum term (CRm); the method works well mainly by enlarging the step size of $\mathbf{s}_k$ when the cubic penalty parameter $M$ is over-estimated.
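For concreteness, below is a minimal sketch of one vanilla CR step. It hands the cubic subproblem to a generic first-order solver (scipy.optimize.minimize) purely for illustration; the function name cr_step and this choice of subproblem solver are our own, not the solvers used in the paper (dedicated CRS solvers are discussed in Section 3).

```python
import numpy as np
from scipy.optimize import minimize

def cr_step(grad, hess, x, M):
    """One vanilla CR step: minimize the cubic model m(s) and return x + s.

    grad, hess: callables returning the gradient vector and Hessian matrix.
    M: fixed cubic penalty parameter (assumed given).
    """
    g, H = grad(x), hess(x)

    def model(s):
        ns = np.linalg.norm(s)
        return g @ s + 0.5 * s @ H @ s + (M / 6.0) * ns ** 3

    def model_grad(s):
        # Gradient of the cubic model; setting it to zero recovers the
        # CRS first-order condition g + H s + (M/2)||s|| s = 0.
        ns = np.linalg.norm(s)
        return g + H @ s + (M / 2.0) * ns * s

    # Generic solver as a stand-in for a dedicated CRS solver.
    res = minimize(model, np.zeros_like(x), jac=model_grad, method="L-BFGS-B")
    return x + res.x
```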
Cartis et al. [4] proposed the adaptive cubic regularization method (ARC), which adaptively assigns $M_k$ based on the quality of the step $\mathbf{s}_k$, in analogy to the trust region method (TR) [5]. Stochastic (subsampled) variants of ARC (e.g., Kohler et al. [13] and Zhou et al. [20]) were developed to reduce the overall computation, where the gradient $\nabla f(\mathbf{x}_k)$ and the Hessian $\nabla^2 f(\mathbf{x}_k)$ are inexactly evaluated. To the best of our knowledge, ARC behaves better in most applications compared with CR methods that fix $M$ (e.g., vanilla CR and CRm). A natural question is how to further accelerate ARC with little extra cost per iteration.
In this paper, we investigate a momentum-
accelerated adaptive cubic regularization method
(ARCm). Here are our contributions and the outline
of the paper.
We develop a momentum-accelerated adaptive cubic regularization method (ARCm). We adopt a general scheme for momentum, which is more suitable for ARC than for CR (see Theorem 2.1). The extra computation is cheap and negligible, as both the momentum and its step size depend solely on $\{\mathbf{s}_k\}$ and are free of gradient or Hessian evaluations.
With a proper setting of the momentum step size (see Algorithm 1), the global convergence of ARCm to second-order critical points is guaranteed (see Theorem 2.2).
Algorithm 1 Adaptive cubic regularization with momentum (ARCm)

Input: $\mathbf{x}_0$, $\mathbf{v}_{-1} = \mathbf{0}$, $\gamma_1 > 1 \ge \gamma_2 > \gamma_3 > 0$, $1 > \eta_2 > \eta_1 > 0$, $\sigma_0 > \sigma_{\min} > 0$, $1 > \tau > 0$, $\beta > 0$, and $\alpha_1, \alpha_2 > 0$.
Output: $\{\mathbf{x}_k\}_{k=0}^{T}$.
1: for $k = 0$ to $T-1$ do
2:   Solve the cubic subproblem:
     $$\mathbf{s}_k \in \arg\min_{\mathbf{s}} \; m_k(\mathbf{s}) := f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^\top \mathbf{s} + \frac{1}{2}\mathbf{s}^\top \nabla^2 f(\mathbf{x}_k)\,\mathbf{s} + \frac{\sigma_k}{6}\|\mathbf{s}\|_2^3.$$
3:   Compute $f(\mathbf{x}_k + \mathbf{s}_k)$ and $\rho_k = \dfrac{f(\mathbf{x}_k) - f(\mathbf{x}_k + \mathbf{s}_k)}{f(\mathbf{x}_k) - m_k(\mathbf{s}_k)}$.
4:   if $\rho_k > \eta_1$ (successful update) then
5:     $\mathbf{y}_{k+1} = \mathbf{x}_k + \mathbf{s}_k$
6:     Momentum step: select $\beta_k \in \left[0, \min\{\tau, \alpha_1\|\mathbf{s}_k\|_2, \alpha_2\|\mathbf{s}_k\|_2^2\}\right]$ such that $f(\mathbf{z}_{k+1}) \le f(\mathbf{y}_{k+1})$, where $\mathbf{v}_k = \beta_k \mathbf{v}_{k-1} + \mathbf{s}_k$ and $\mathbf{z}_{k+1} = \mathbf{x}_k + \mathbf{v}_k$.
7:     $\mathbf{x}_{k+1} = \mathbf{z}_{k+1}$
8:     if $\rho_k > \eta_2$ (very successful update) then
9:       $\sigma_{k+1} = \max(\sigma_{\min}, \gamma_3 \sigma_k)$
10:    else
11:      $\sigma_{k+1} = \gamma_2 \sigma_k$
12:    end if
13:  else (unsuccessful update)
14:    $\mathbf{x}_{k+1} = \mathbf{x}_k$ and $\mathbf{v}_k = \mathbf{v}_{k-1}$
15:    $\sigma_{k+1} = \gamma_1 \sigma_k$
16:  end if
17: end for
We further show that the proposed ARCm enjoys local convergence under the KŁ (Kurdyka–Łojasiewicz) property (see Theorem 2.3), which is one of the advantages of second-order methods over first-order methods.
We also study the global convergence of ARCm with inexact solutions of the cubic regularized subproblem (CRS), since in practice the CRS is usually solved only approximately (see Theorem 3.1). The local convergence is preserved if the CRS error decreases with $\|\mathbf{s}_k\|_2^3$ (see Corollary 3.1), but may be destroyed otherwise.
We conduct experiments on high-dimensional and large-scale non-convex logistic regression and robust linear regression problems. Experimental results show that ARCm significantly outperforms CR, CRm, ARC, and TR, and is 10%-50% faster than ARC (the most competitive of these methods) in terms of iterations to convergence.
We acknowledge CRm [17], which first accelerated CR by momentum, and we highlight our main differences from it. Firstly, the momentum scheme in ARCm differs from that in CRm due to the adaptive selection strategy for $M_k$. Secondly, the momentum step size in ARCm is free of gradient evaluations, whereas CRm requires them. Thirdly, besides global convergence, we also study the local convergence of ARCm under the KŁ property, which is more general than the local error-bound condition studied for CRm. Furthermore, we analyze ARCm with inexact CRS solutions, which is more relevant in practice.
2 The Proposed ARCm Algorithm

The detailed pseudocode for ARCm is shown in Algorithm 1. Here, we first assume that the cubic subproblem in Step 2 of Algorithm 1 is solved exactly. In Section 3, we will analyze the convergence properties of ARCm with inexact solutions in Step 2.
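As a concrete companion to the pseudocode, the following is a minimal Python sketch of the ARCm iteration. The helpers solve_crs (an approximate CRS solver, e.g., the generic one sketched in Section 1) and select_beta (one realization is sketched after the next paragraph) are assumptions passed in as callables, and the default parameter values are our own illustrative choices, not values from the paper.

```python
import numpy as np

def arcm(f, grad, hess, x0, solve_crs, select_beta, T=100,
         sigma0=1.0, sigma_min=1e-4, gamma1=2.0, gamma2=1.0, gamma3=0.5,
         eta1=0.1, eta2=0.9, tau=0.9, alpha1=1.0, alpha2=1.0):
    """Sketch of Algorithm 1 (ARCm).

    solve_crs(g, H, sigma): returns an (approximate) minimizer of the
        cubic model built from gradient g, Hessian H, and penalty sigma.
    select_beta(f, x, s, v, cap): returns beta_k in [0, cap] with
        f(x + beta*v + s) <= f(x + s).
    """
    x = np.asarray(x0, dtype=float)
    v, sigma = np.zeros_like(x), sigma0
    for k in range(T):
        g, H = grad(x), hess(x)
        s = solve_crs(g, H, sigma)                          # Step 2
        model_val = (f(x) + g @ s + 0.5 * s @ H @ s
                     + sigma / 6.0 * np.linalg.norm(s) ** 3)  # m_k(s_k)
        rho = (f(x) - f(x + s)) / (f(x) - model_val)        # Step 3
        if rho > eta1:                                      # successful update
            ns = np.linalg.norm(s)
            cap = min(tau, alpha1 * ns, alpha2 * ns ** 2)
            beta = select_beta(f, x, s, v, cap)             # Step 6
            v = beta * v + s                                # v_k
            x = x + v                                       # x_{k+1} = z_{k+1}
            sigma = max(sigma_min, gamma3 * sigma) if rho > eta2 else gamma2 * sigma
        else:                                               # unsuccessful update
            sigma = gamma1 * sigma                          # x and v unchanged
    return x
```

Note the design of the unsuccessful branch: both $\mathbf{x}_k$ and $\mathbf{v}_{k-1}$ are carried over unchanged, and only $\sigma_k$ is inflated, mirroring lines 14-15 of Algorithm 1.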
Firstly, the momentum $\mathbf{v}_{k-1}$ is an aggregation of previously accepted descent steps. It is always uniformly bounded if $\beta_k \le \tau < 1$ and $\mathbf{s}_k$ is bounded for all $k \le T$ (we will prove this in Lemma 2.3). At the beginning of the algorithm (i.e., $k \ll T$), $\|\mathbf{s}_k\|_2$ is usually relatively large. Therefore, $\alpha_1\|\mathbf{s}_k\|_2$ and $\alpha_2\|\mathbf{s}_k\|_2^2$ dominate the term $\min\{\tau, \alpha_1\|\mathbf{s}_k\|_2, \alpha_2\|\mathbf{s}_k\|_2^2\}$, i.e., $\tau = \min\{\tau, \alpha_1\|\mathbf{s}_k\|_2, \alpha_2\|\mathbf{s}_k\|_2^2\}$ is the common case for the momentum step size bound. When $\mathbf{x}_k$ approaches a local minimum (so that $\|\mathbf{s}_k\|_2 \to 0$), we cannot expect momentum with a large step size to work for second-order methods. Therefore, we adopt $\alpha_1\|\mathbf{s}_k\|_2$ and $\alpha_2\|\mathbf{s}_k\|_2^2$ in selecting $\beta_k$ (where $\beta_k \ge 0$) in order to preserve the local convergence property of the second-order method. Secondly, we require that $f(\mathbf{z}_{k+1}) \le f(\mathbf{y}_{k+1})$ since we do not want to violate the sufficient descent of the objective established for CR and ARC. Such a $\mathbf{z}_{k+1}$ always exists (e.g., $\beta_k = 0$), and we may use the bisection method to search for an appropriate $\beta_k$. The following Theorem 2.1 shows that a $\mathbf{z}_{k+1}$ with nonzero $\beta_k$ may exist in two cases, where the momentum term helps the convergence of ARCm (i.e., $f(\mathbf{z}_{k+1}) < f(\mathbf{y}_{k+1})$ holds).
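The $\beta_k$ search can be realized in several ways; below is a minimal backtracking variant that halves $\beta_k$ from its cap until the acceptance test $f(\mathbf{z}_{k+1}) \le f(\mathbf{y}_{k+1})$ holds, falling back to $\beta_k = 0$. This is one possible implementation of Step 6 of Algorithm 1, with the function name and halving schedule being our own illustrative choices, not necessarily the authors' exact procedure.

```python
import numpy as np

def select_beta(f, x, s, v_prev, cap, max_halvings=20):
    """Backtracking choice of the momentum step size beta_k in [0, cap].

    Returns the largest tried beta with f(x + beta*v_prev + s) <= f(x + s);
    beta = 0 (no momentum) always passes the test, so it is the fallback.
    """
    f_y = f(x + s)            # objective at the pure ARC iterate y_{k+1}
    beta = cap
    for _ in range(max_halvings):
        if f(x + beta * v_prev + s) <= f_y:
            return beta
        beta *= 0.5           # shrink toward the safe choice beta = 0
    return 0.0
```

A strict bisection on $\beta_k$ would also work; the halving schedule above simply biases toward larger momentum whenever the descent test permits it.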
Theorem 2.1. Assume that the Hessian $\nabla^2 f(\mathbf{x})$ of $f(\mathbf{x})$ is $L_k$-Lipschitz on the line segment $[\mathbf{x}_k, \mathbf{x}_k + \mathbf{s}_k]$ (i.e., for all $\mathbf{x}, \mathbf{y} \in [\mathbf{x}_k, \mathbf{x}_k + \mathbf{s}_k]$ we have $\|\nabla^2 f(\mathbf{x}) - \nabla^2 f(\mathbf{y})\|_2 \le L_k\|\mathbf{x} - \mathbf{y}\|_2$). Then in the following two cases the momentum may help the convergence, i.e., there exists a small enough $\beta_k$ such that $f(\mathbf{z}_{k+1}) < f(\mathbf{y}_{k+1})$:

(i) $L_k < \sigma_k$ and $\mathbf{s}_k^\top \mathbf{v}_{k-1} > 0$;

(ii) $L_k > \sigma_k$ and $\mathbf{s}_k^\top \mathbf{v}_{k-1} < 0$.
Proof. Suppose that the Hessian of $f(\mathbf{x})$ is $L$-Lipschitz on a ball centered at $\mathbf{y}_{k+1}$ with radius $\tau\|\mathbf{v}_{k-1}\|_2$. Then, by [15, Lemma 1] (we also provide this useful result in the supplementary material (A.2)), we have
$$f(\mathbf{z}_{k+1}) - f(\mathbf{y}_{k+1}) \le \beta\,\nabla f(\mathbf{y}_{k+1})^\top \mathbf{v}_{k-1} + \frac{\beta^2}{2}\,\mathbf{v}_{k-1}^\top \nabla^2 f(\mathbf{y}_{k+1})\,\mathbf{v}_{k-1} + \frac{L\beta^3}{6}\|\mathbf{v}_{k-1}\|_2^3.$$
If $\nabla f(\mathbf{y}_{k+1})^\top \mathbf{v}_{k-1} < 0$, then there exists a small enough $\beta > 0$ such that $f(\mathbf{z}_{k+1}) < f(\mathbf{y}_{k+1})$. In the remaining part, we show that in the above two cases $\nabla f(\mathbf{y}_{k+1})^\top \mathbf{v}_{k-1} < 0$ may hold. Using the properties of the cubic regularization method [15, Lemma 1 & (2.5)], we have
$$\begin{aligned}
\nabla f(\mathbf{y}_{k+1})^\top \mathbf{v}_{k-1}
&= (\nabla f(\mathbf{y}_{k+1}) - \nabla f(\mathbf{x}_k))^\top \mathbf{v}_{k-1} + \nabla f(\mathbf{x}_k)^\top \mathbf{v}_{k-1} \\
&= (\nabla f(\mathbf{y}_{k+1}) - \nabla f(\mathbf{x}_k))^\top \mathbf{v}_{k-1} - \Big(\nabla^2 f(\mathbf{x}_k)\mathbf{s}_k + \frac{1}{2}\sigma_k\|\mathbf{s}_k\|_2\,\mathbf{s}_k\Big)^\top \mathbf{v}_{k-1} \\
&= -\frac{1}{2}\sigma_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} + \big(\nabla f(\mathbf{y}_{k+1}) - \nabla f(\mathbf{x}_k) - \nabla^2 f(\mathbf{x}_k)\mathbf{s}_k\big)^\top \mathbf{v}_{k-1} \\
&\le -\frac{1}{2}\sigma_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} + \frac{1}{2}L_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1}.
\end{aligned} \tag{2.2}$$
In the last inequality of (2.2), we use the bound
$$\big(\nabla f(\mathbf{y}_{k+1}) - \nabla f(\mathbf{x}_k) - \nabla^2 f(\mathbf{x}_k)\mathbf{s}_k\big)^\top \mathbf{v}_{k-1} \le \frac{1}{2}L_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1}.$$
If $L_k < \sigma_k$ (i.e., $L_k \le \sigma_k - \epsilon$ for some $\epsilon > 0$) and $\mathbf{s}_k^\top \mathbf{v}_{k-1} > 0$, then
$$\nabla f(\mathbf{y}_{k+1})^\top \mathbf{v}_{k-1} \le -\frac{1}{2}\sigma_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} + \frac{1}{2}L_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} \le -\frac{\epsilon}{2}\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} < 0.$$
If $L_k > \sigma_k$, $\mathbf{s}_k^\top \mathbf{v}_{k-1} < 0$, and
$$\big(\nabla f(\mathbf{y}_{k+1}) - \nabla f(\mathbf{x}_k) - \nabla^2 f(\mathbf{x}_k)\mathbf{s}_k\big)^\top \mathbf{v}_{k-1} = \frac{1}{2}\widetilde{L}_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1}$$
for some $\sigma_k + \epsilon \le \widetilde{L}_k \le L_k$ with $\epsilon > 0$, then
$$\begin{aligned}
\nabla f(\mathbf{y}_{k+1})^\top \mathbf{v}_{k-1}
&= -\frac{1}{2}\sigma_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} + \big(\nabla f(\mathbf{y}_{k+1}) - \nabla f(\mathbf{x}_k) - \nabla^2 f(\mathbf{x}_k)\mathbf{s}_k\big)^\top \mathbf{v}_{k-1} \\
&= -\frac{1}{2}\sigma_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} + \frac{1}{2}\widetilde{L}_k\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} \\
&\le \frac{\epsilon}{2}\|\mathbf{s}_k\|_2 \cdot \mathbf{s}_k^\top \mathbf{v}_{k-1} < 0.
\end{aligned}$$
Remark 2.1. The two potential cases may happen since ARCm adaptively selects $M_k := \sigma_k$ by the criterion $\rho_k$, so $\sigma_k$ may be over-estimated or under-estimated. When $\sigma_k > L_k$, the step $\mathbf{s}_k$ is too conservative, and the new step $\mathbf{x}_{k+1} - \mathbf{x}_k$ contributes to a lower objective value if $\mathbf{v}_{k-1}$ is also a descent direction (i.e., $\mathbf{s}_k^\top \mathbf{v}_{k-1} > 0$). Conversely, if $\sigma_k < L_k$, the step $\mathbf{s}_k$ may be too aggressive, and then momentum in the opposite direction (i.e., $\mathbf{s}_k^\top \mathbf{v}_{k-1} < 0$) may correct the imperfect step $\mathbf{s}_k$. In CRm, the momentum $\mathbf{v}_{k-1}$ is highly correlated with $\mathbf{s}_k$, so that $\mathbf{s}_k^\top \mathbf{v}_{k-1} > 0$ and $M > L_k$; hence only the first case can happen there. Therefore, the momentum scheme in ARCm is more appropriate for ARC, since ARC is more ambitious than the vanilla CR.
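To illustrate case (i) of Theorem 2.1 numerically, the following toy check uses a quadratic objective (constant Hessian, hence $L_k = 0 < \sigma_k$) with a deliberately over-estimated $\sigma_k$. The fixed-point iteration for the CR step and the choice $\mathbf{v}_{k-1} = \mathbf{s}_k$ are our own illustrative simplifications, not from the paper.

```python
import numpy as np

# Case (i) on a toy quadratic f(x) = 0.5 x^T A x: the Hessian is constant,
# so its Lipschitz constant is L_k = 0 < sigma_k for any sigma_k > 0.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x

x_k = np.array([1.0, 1.0])
sigma = 5.0  # deliberately over-estimates L_k = 0, so s_k is conservative

# Exact CR step for a quadratic: solve (A + 0.5*sigma*||s|| I) s = -g
# via a fixed-point iteration on ||s|| (a simple stand-in for a CRS solver).
g = A @ x_k
s = -np.linalg.solve(A, g)
for _ in range(50):
    s = -np.linalg.solve(A + 0.5 * sigma * np.linalg.norm(s) * np.eye(2), g)

v_prev = s.copy()       # previous accepted step, assumed aligned with s_k
assert s @ v_prev > 0   # case (i): s_k^T v_{k-1} > 0

y = x_k + s
for beta in [0.5, 0.25, 0.1, 0.05]:
    z = x_k + beta * v_prev + s
    # Sufficiently small beta should give strict descent f(z) < f(y);
    # a too-large beta may overshoot and fail the test.
    print(beta, f(z) < f(y))
```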
2.1 Global Convergence

To prove global convergence, the following mild assumptions are essential.

Assumption 2.1. We make the following assumptions on the objective $f(\mathbf{x})$:

1. $f(\mathbf{x})$ is twice differentiable with respect to $\mathbf{x}$ everywhere.

2. $f(\mathbf{x})$ is bounded below, i.e., $f^\ast = \inf_{\mathbf{x}} f(\mathbf{x}) > -\infty$.