Zeroth-Order Negative Curvature Finding: Escaping
Saddle Points without Gradients
Hualin Zhang^1, Huan Xiong^{2,3}, Bin Gu^{1,3}
1Nanjing University of Information Science & Technology
2Harbin Institute of Technology
3Mohamed bin Zayed University of Artificial Intelligence
{zhanghualin98,huan.xiong.math,jsgubin}@gmail.com
Abstract
We consider escaping saddle points of nonconvex problems where only function evaluations can be accessed. Although a variety of works have been proposed, the majority of them require either second- or first-order information, and only a few of them exploit zeroth-order methods, in particular the technique of negative curvature finding with zeroth-order methods, which has been proven to be the most efficient for escaping saddle points. To fill this gap, in this paper we propose two zeroth-order negative curvature finding frameworks that can replace Hessian-vector product computations without increasing the iteration complexity. We apply the proposed frameworks to ZO-GD, ZO-SGD, ZO-SCSG, and ZO-SPIDER and prove that these ZO algorithms can converge to $(\epsilon, \delta)$-approximate second-order stationary points with lower query complexity than prior zeroth-order works for finding local minima.
1 Introduction
Nonconvex optimization has received wide attention in recent years due to its popularity in modern
machine learning (ML) and deep learning (DL) tasks. Specifically, in this paper, we study the
following unconstrained optimization problem:
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1)$$
where both $f_i(\cdot)$ and $f(\cdot)$ can be nonconvex. In general, finding the global optima of nonconvex functions is NP-hard. Fortunately, finding local optima is a practical alternative, because it has been shown in theory and practice that local optima have performance comparable to global optima in many machine learning problems [18, 19, 30, 21, 20, 23, 31]. Gradient-based methods have been shown to be able to find an $\epsilon$-approximate first-order stationary point ($\|\nabla f(x)\| \le \epsilon$) efficiently, both in the deterministic setting (e.g., gradient descent [37]; accelerated gradient descent [8, 33]) and the stochastic setting (e.g., stochastic gradient descent [37, 43]; SCSG [32]; SPIDER [16]). However, in nonconvex settings, first-order stationary points can be local minima, global minima, or even saddle points. Converging to saddle points leads to highly suboptimal solutions [24, 45] and degrades the model's performance. Thus, escaping saddle points has recently become an important research topic in nonconvex optimization.
Several classical results have shown that, for $\rho$-Hessian Lipschitz functions (see Definition 1), using second-order information such as computing the Hessian [39] or Hessian-vector products [1, 9, 2], one can find an $\epsilon$-approximate second-order stationary point (SOSP, $\|\nabla f(x)\| \le \epsilon$ and $\nabla^2 f(x) \succeq -\sqrt{\rho\epsilon}\, I$). However, when the dimension of $x$ is large, even a single access to the Hessian
is computationally infeasible. A recent line of work shows that, by adding uniform random perturbations, first-order (FO) methods can efficiently escape saddle points and converge to an SOSP. In the deterministic setting, [26] proposed the perturbed gradient descent (PGD) algorithm with gradient query complexity $\tilde{O}(\log^4 d/\epsilon^2)$ by adding uniform random perturbations to the standard gradient descent algorithm. This complexity was later improved to $\tilde{O}(\log^6 d/\epsilon^{1.75})$ by perturbed accelerated gradient descent [28], which replaces the gradient descent step in PGD with Nesterov's accelerated gradient descent.
Table 1: A summary of the results of finding $(\epsilon, \delta)$-approximate SOSPs (see Definition 2) by zeroth-order algorithms. (CoordGE, GaussGE, and RandGE are abbreviations of "coordinate-wise gradient estimator", "Gaussian random gradient estimator", and "uniform random gradient estimator", respectively. RP, RS, and CR are abbreviations of "random perturbation", "random search", and "cubic regularization", respectively.)

| Algorithm | Setting | ZO Oracle | Main Techniques | Function Queries |
|---|---|---|---|---|
| ZPSGD [27] | Deterministic | GaussGE + Noise | RP | $\tilde{O}(d^2/\epsilon^5)$ † |
| PAGD [47] | Deterministic | CoordGE | RP | $O(d\log^4 d/\epsilon^2)$ † |
| RSPI [35] | Deterministic | CoordGE | RS + NCF | $O(d\log d/\epsilon^{8/3})$ ‡ |
| Theorem 4 | Deterministic | CoordGE | NCF | $O(d/\epsilon^2 + d\log d/\delta^{3.5})$ |
| ZO-SCRN [5] | Stochastic | GaussGE | CR | $\tilde{O}(d/\epsilon^{3.5} + d^4/\epsilon^{2.5})$ † |
| Theorem 3 | Stochastic | CoordGE | NCF | $\tilde{O}(d/\epsilon^4 + d/(\epsilon^2\delta^3) + d/\delta^5)$ |
| Theorem 5 | Stochastic | CoordGE + (RandGE) | NCF | $\tilde{O}(d/\epsilon^{10/3} + d/(\epsilon^2\delta^3) + d/\delta^5)$ |
| Theorem 6 | Stochastic | CoordGE | NCF | $\tilde{O}(d/\epsilon^3 + d/(\epsilon^2\delta^2) + d/\delta^5)$ |

† guarantees an $(\epsilon, O(\sqrt{\epsilon}))$-approximate SOSP; ‡ guarantees an $(\epsilon, \epsilon^{2/3})$-approximate SOSP.
Another line of work for escaping saddle points is to utilize negative curvature finding (NCF), which can be combined with $\epsilon$-approximate first-order stationary point (FOSP) finding algorithms to find an $(\epsilon, \delta)$-approximate SOSP. The main task of NCF is to compute an approximation of the smallest eigenvector of the Hessian at a given point. Classical methods for solving NCF, such as the power method and Oja's method, require the computation of Hessian-vector products. Based on the fact that the Hessian-vector product can be approximated by the finite difference of two gradients, [49, 4] proposed the FO NCF frameworks Neon+ and Neon2, respectively. In general, adding perturbations in the negative curvature direction can escape saddle points more efficiently than adding random perturbations by a factor of $\tilde{O}(\mathrm{poly}(\log d))$ in theory. Specifically, in the deterministic setting, CDHS [9] combined with Neon2 can find an $(\epsilon, \delta)$-approximate SOSP with gradient query complexity $\tilde{O}(\log d/\epsilon^{1.75})$. Recently, the same result was achieved by a simple single-loop algorithm [51], which combines the techniques of perturbed accelerated gradient descent and accelerated negative curvature finding. In the online stochastic setting, the best gradient query complexity, $\tilde{O}(1/\epsilon^3)$, is achieved by SPIDER-SFO$^+$ [16], which combines the near-optimal $\epsilon$-approximate FOSP finding algorithm SPIDER with the NCF framework Neon2 to find an $(\epsilon, \delta)$-approximate SOSP.
However, gradient information is not always accessible. Many machine learning and deep learning applications encounter situations where the calculation of explicit gradients is expensive or even infeasible, such as black-box adversarial attacks on deep neural networks [42, 36, 13, 6, 46] and policy search in reinforcement learning [44, 14, 29]. Thus, zeroth-order (ZO) optimization, an important black-box method that uses function values to estimate explicit gradients, is one of the best options for solving this type of ML/DL problem. A considerable body of work has shown that ZO algorithms based on gradient estimation have convergence rates comparable to their gradient-based counterparts. Although many gradient estimation-based ZO algorithms have been proposed in recent years, most of them focus on the performance of converging to FOSPs [40, 22, 25, 16], and only a few on SOSPs [27, 47, 35, 5].
As mentioned above, although there have been several works on finding local minima via ZO methods, they utilize the techniques of random perturbation [27, 47], random search [35], and cubic regularization [5], as shown in Table 1, which are not the most efficient ways of escaping saddle points, as discussed before. Specifically, in the deterministic setting, [27] proposed the ZO perturbed stochastic gradient (ZPSGD) method, which uses a batch of Gaussian-smoothing-based stochastic ZO gradient estimators and adds a random perturbation in each iteration. As a result, ZPSGD can find an $\epsilon$-approximate SOSP using $\tilde{O}(d^2/\epsilon^5)$ function queries. [47] proposed the perturbed approximate gradient descent (PAGD) method, which iteratively conducts gradient descent steps using the forward difference version of the coordinate-wise gradient estimator until it reaches a point with a small gradient. Then, PAGD adds a uniform perturbation and continues the gradient descent steps. The total number of function queries for PAGD to find an $\epsilon$-approximate SOSP is $\tilde{O}(d\log^4 d/\epsilon^2)$. Recently, [35] proposed the random search power iteration (RSPI) method, which alternately performs random search steps and power iteration steps. The power iteration step contains an inexact power iteration subroutine using only the ZO oracle to conduct the NCF, and the core idea is to use a finite difference approach to approximate the Hessian-vector product. In the stochastic setting, [5] proposed a zeroth-order stochastic cubic regularization Newton (ZO-SCRN) method with function query complexity $\tilde{O}(d/\epsilon^{7/2})$ using Gaussian-sampling-based gradient and Hessian estimators. Unfortunately, each iteration of ZO-SCRN needs to solve a cubic minimization subproblem, which does not have a closed-form solution. Typically, inexact solvers for the cubic minimization subproblem need additional computations of Hessian-vector products [1] or gradients [7].
Thus, it is natural to explore faster ZO algorithms based on negative curvature finding to make escaping saddle points more efficient. To the best of our knowledge, negative curvature finding algorithms with access only to a ZO oracle are still missing in the stochastic setting. Inspired by the fact that the gradient can be approximated by finite differences of function queries with high accuracy, a natural question is: Can we turn FO NCF methods (especially the state-of-the-art Neon2) into ZO methods without increasing the iteration complexity, and turn ZO algorithms for finding FOSPs into ones for finding SOSPs?
Contributions. We summarize our main contributions as follows:
• We give an affirmative answer to the above question. We propose two ZO negative curvature finding frameworks, which use only function queries and can detect whether there is a negative curvature direction at a given point $x$ of a smooth, Hessian-Lipschitz function $f: \mathbb{R}^d \to \mathbb{R}$, in the offline deterministic and online stochastic settings, respectively.
• We apply the proposed frameworks to four ZO algorithms, namely ZO-GD, ZO-SGD, ZO-SCSG, and ZO-SPIDER, and prove that these ZO algorithms converge to $(\epsilon, \delta)$-approximate SOSPs.
• In the deterministic setting, compared with the classical setting where $\delta = O(\sqrt{\epsilon})$ [26, 28, 27, 47], or the special case $\delta = \epsilon^{2/3}$ [35], our Theorem 4 is never worse than the other algorithms in Table 1. In the online stochastic setting, none of our algorithms needs to solve the cubic subproblem required by ZO-SCRN, and our Theorem 6 improves the best known function query complexity by a factor of $\tilde{O}(1/\epsilon)$.
2 Preliminaries
Throughout this paper, we use $\|\cdot\|$ to denote the Euclidean norm of a vector and the spectral norm of a matrix. We use $\tilde{O}(\cdot)$ to hide poly-logarithmic terms. For a given set $S$ drawn from $[n] := \{1, 2, \ldots, n\}$, define $f_S(\cdot) := \frac{1}{|S|}\sum_{i \in S} f_i(\cdot)$.
Definition 1. For a twice differentiable nonconvex function $f: \mathbb{R}^d \to \mathbb{R}$,
• $f$ is $\ell$-Lipschitz smooth if $\forall x, y \in \mathbb{R}^d$, $\|\nabla f(x) - \nabla f(y)\| \le \ell \|x - y\|$.
• $f$ is $\rho$-Hessian Lipschitz if $\forall x, y \in \mathbb{R}^d$, $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le \rho \|x - y\|$.

Definition 2. For a twice differentiable nonconvex function $f: \mathbb{R}^d \to \mathbb{R}$, we say
• $x \in \mathbb{R}^d$ is an $\epsilon$-approximate first-order stationary point if $\|\nabla f(x)\| \le \epsilon$.
• $x \in \mathbb{R}^d$ is an $(\epsilon, \delta)$-approximate second-order stationary point if $\|\nabla f(x)\| \le \epsilon$ and $\nabla^2 f(x) \succeq -\delta I$.
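As a concrete illustration of Definition 2 (not part of the original paper), the following minimal Python sketch checks the two conditions on a toy quadratic saddle $f(x) = x_1^2 - x_2^2$; the helper `is_approx_sosp` and its thresholds are illustrative assumptions.

```python
import numpy as np

def is_approx_sosp(grad, hess, eps, delta):
    """Check the (eps, delta)-approximate SOSP conditions of Definition 2:
    ||grad f(x)|| <= eps and lambda_min(hess f(x)) >= -delta."""
    grad_ok = np.linalg.norm(grad) <= eps
    curv_ok = np.linalg.eigvalsh(hess).min() >= -delta
    return grad_ok and curv_ok

# Toy example: f(x) = x1^2 - x2^2 has a strict saddle at the origin.
x = np.zeros(2)
grad = np.array([2 * x[0], -2 * x[1]])          # gradient of f at x
hess = np.array([[2.0, 0.0], [0.0, -2.0]])      # Hessian of f (constant)

print(np.linalg.norm(grad) <= 0.1)                     # True: the origin is an FOSP
print(is_approx_sosp(grad, hess, eps=0.1, delta=0.1))  # False: lambda_min = -2 < -0.1
```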
We need the following assumptions, which are standard in the literature on finding SOSPs [4, 16, 51].
Assumption 1. We assume that $f(\cdot)$ in (1) satisfies:
• $\Delta_f := f(x_0) - f(x^*) < \infty$, where $x^* := \arg\min_x f(x)$.
• Each component function $f_i(x)$ is $\ell$-Lipschitz smooth and $\rho$-Hessian Lipschitz.
• (For the online case only) The variance of the stochastic gradient is bounded: $\forall x \in \mathbb{R}^d$, $\mathbb{E}\|\nabla f_i(x) - \nabla f(x)\|^2 \le \sigma^2$.
We will also need the following more stringent assumption to obtain high-probability convergence results for ZO-SPIDER.

Assumption 2. We assume that Assumption 1 holds and, in addition, the gradient of each component function $f_i(x)$ satisfies $\forall i, \forall x \in \mathbb{R}^d$, $\|\nabla f_i(x) - \nabla f(x)\|^2 \le \sigma^2$.
2.1 ZO Gradient Estimators
Given a smooth, Hessian-Lipschitz function $f$, the central difference version of the deterministic coordinate-wise gradient estimator is defined by
$$\hat{\nabla}_{\mathrm{coord}} f(x) = \sum_{i=1}^{d} \frac{f(x + \mu e_i) - f(x - \mu e_i)}{2\mu} e_i, \qquad \text{(CoordGradEst)}$$
where $e_i$ denotes the standard basis vector with $1$ at its $i$-th coordinate and $0$ otherwise, and $\mu$ is the smoothing parameter, a sufficiently small positive constant. The central difference version of the random gradient estimator is defined by
$$\hat{\nabla}_{\mathrm{rand}} f(x) = d\,\frac{f(x + \mu u) - f(x - \mu u)}{2\mu}\, u, \qquad \text{(RandGradEst)}$$
where $u \in \mathbb{R}^d$ is a random direction drawn from the uniform distribution over the unit sphere, and $\mu$ is the smoothing parameter, a sufficiently small positive constant.
Remark 1. Deterministic vs. Random: CoordGradEst needs $d$ times more function queries than RandGradEst. However, as will be discussed in Section 4, it has a lower approximation error and thus can reduce the iteration complexity. Central Difference vs. Forward Difference (please refer to Appendix A.1): Under the Hessian Lipschitz assumption, a smaller approximation error bound can be obtained by the central difference version of both CoordGradEst and RandGradEst.
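To make the two estimators concrete, here is a minimal NumPy sketch (not from the paper) of CoordGradEst and RandGradEst exactly as written above; the function names and the test objective are illustrative assumptions.

```python
import numpy as np

def coord_grad_est(f, x, mu=1e-4):
    """Central-difference coordinate-wise gradient estimator (CoordGradEst).
    Uses 2d function queries per call."""
    d = x.size
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = 1.0
        g[i] = (f(x + mu * e) - f(x - mu * e)) / (2.0 * mu)
    return g

def rand_grad_est(f, x, mu=1e-4, rng=np.random.default_rng(0)):
    """Central-difference random gradient estimator (RandGradEst).
    Uses only 2 function queries per call, at the cost of higher variance."""
    d = x.size
    u = rng.normal(size=d); u /= np.linalg.norm(u)   # uniform direction on the unit sphere
    return d * (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u

# Quick check on the smooth test function f(x) = ||x||^2, whose gradient is 2x.
f = lambda x: float(x @ x)
x = np.array([1.0, -2.0, 0.5])
print(coord_grad_est(f, x))   # close to [2, -4, 1]
print(rand_grad_est(f, x))    # noisy single-direction estimate
```

This also makes the trade-off in Remark 1 visible: the coordinate-wise estimator pays $2d$ queries for a near-exact gradient, while the random estimator pays $2$ queries for a high-variance one.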
2.2 ZO Hessian-Vector Product Estimator
By the definition of the derivative, $\nabla^2 f(x) \cdot v = \lim_{\mu \to 0} \frac{\nabla f(x + \mu v) - \nabla f(x)}{\mu}$, so $\nabla^2 f(x) \cdot v$ can be approximated by the difference of two gradients, $\nabla f(x + v) - \nabla f(x)$, for some $v$ with small magnitude. On the other hand, $\nabla f(x + v)$ and $\nabla f(x)$ can be approximated with high accuracy by $\hat{\nabla}_{\mathrm{coord}} f(x + v)$ and $\hat{\nabla}_{\mathrm{coord}} f(x)$, respectively. The coordinate-wise Hessian-vector product estimator is then defined by
$$\mathcal{H}_f(x)v \triangleq \sum_{i=1}^{d} \frac{f(x + v + \mu e_i) - f(x + v - \mu e_i) + f(x - \mu e_i) - f(x + \mu e_i)}{2\mu}\, e_i. \qquad (2)$$
Note that we do not need to know an explicit representation of $\mathcal{H}_f(x)$; it is merely notation for a virtual matrix and can be viewed as the Hessian $\nabla^2 f(x)$ with minor perturbations. As stated in the following lemma, the approximation error is efficiently upper bounded.
Lemma 1. Assume that $f$ is $\rho$-Hessian Lipschitz. Then for any smoothing parameter $\mu$ and any $x \in \mathbb{R}^d$, we have
$$\|\mathcal{H}_f(x)v - \nabla^2 f(x)v\| \le \frac{\rho\|v\|^2}{2} + \frac{\sqrt{d}\,\rho\mu^2}{3}. \qquad (3)$$
The ZO Hessian-vector product estimator was previously studied in [50, 35], but we provide a tighter bound than that of Lemma 6 in [35]. This is because we utilize properties of the central difference version of the coordinate-wise gradient estimator under the Hessian Lipschitz assumption. It then follows directly that if $f(\cdot)$ is quadratic, we have $\rho = 0$ and $\|\mathcal{H}_f(x)v - \nabla^2 f(x)v\| = 0$.
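As an illustration of estimator (2), here is a sketch under the paper's definitions (not the authors' code) that builds the ZO Hessian-vector product from four function queries per coordinate and compares it with the exact Hessian-vector product on a small Hessian-Lipschitz test function; the test function and parameter values are assumptions.

```python
import numpy as np

def zo_hvp(f, x, v, mu=1e-3):
    """Coordinate-wise ZO Hessian-vector product estimator, Eq. (2):
    H_f(x) v ~= sum_i [f(x+v+mu e_i) - f(x+v-mu e_i) + f(x-mu e_i) - f(x+mu e_i)] / (2 mu) e_i."""
    d = x.size
    hv = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = 1.0
        hv[i] = (f(x + v + mu * e) - f(x + v - mu * e)
                 + f(x - mu * e) - f(x + mu * e)) / (2.0 * mu)
    return hv

# Test on f(x) = sum(x^3)/6 + ||x||^2/2, whose Hessian is diag(x) + I (rho-Hessian Lipschitz, rho = 1).
f = lambda x: float(np.sum(x**3) / 6.0 + 0.5 * x @ x)
x = np.array([0.3, -0.7, 1.1])
v = 1e-2 * np.array([1.0, 2.0, -1.0])
exact = (np.diag(x) + np.eye(3)) @ v
print(np.linalg.norm(zo_hvp(f, x, v) - exact))   # small, consistent with the bound in Lemma 1
```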
3 Zeroth-Order Negative Curvature Finding
In this section, we introduce how to find a negative curvature direction near a saddle point using zeroth-order methods. Recently, based on the fact that the Hessian-vector product $\nabla^2 f(x) \cdot v$ can be approximated by $\nabla f(x + v) - \nabla f(x)$ with approximation error up to $O(\|v\|^2)$, [4] proposed a FO framework named Neon2 that replaces the Hessian-vector product computations in the NCF subroutine with gradient computations and thus turns a FO algorithm for finding FOSPs into a FO algorithm for finding SOSPs. Enlightened by Neon2, we propose two zeroth-order NCF frameworks (i.e., ZO-NCF-Online and ZO-NCF-Deterministic) using only function queries to solve nonconvex problems in the online stochastic setting and the offline deterministic setting, respectively.
3.1 Stochastic Setting
In this subsection, we focus on solving the NCF problem with zeroth-order methods in the online stochastic setting and propose ZO-NCF-Online. Before introducing ZO-NCF-Online, we first introduce ZO-NCF-Online-Weak, which solves the NCF problem with weak confidence $2/3$.
We summarize ZO-NCF-Online-Weak in Algorithm 1. Specifically, ZO-NCF-Online-Weak consists of at most $T = O(\frac{\log^2 d}{\delta^2})$ iterations and works as follows. Given a detection point $x_0$, add a random perturbation with small magnitude $\sigma$ to obtain the starting point. At the $t$-th iteration, where $t = 1, \ldots, T$, set $\mu_t = \|x_t - x_0\|$ to be the smoothing parameter $\mu$ in (2). Then we keep updating $x_{t+1} = x_t - \eta\,\mathcal{H}_{f_i}(x_0)(x_t - x_0)$, where $\mathcal{H}_{f_i}(x_0)(x_t - x_0)$ is the ZO Hessian-vector product estimator, and stop whenever $\|x_{t+1} - x_0\| \ge r$ or the maximum iteration number $T$ is reached. Thus, as long as Algorithm 1 does not terminate, the approximation error $\|\mathcal{H}_{f_i}(x_0)(x_t - x_0) - \nabla^2 f_i(x_0)(x_t - x_0)\|$ can be bounded by $O(\sqrt{d}\,r^2)$ according to Lemma 1. Note that, although this error bound is worse by a factor of $O(\sqrt{d})$ than that of $\text{Neon}^{\text{online}}_{\text{weak}}$ in [4], which uses the difference of two gradients to approximate the Hessian-vector product and achieves an approximation error up to $O(r^2)$, with our choice of $r$ in Algorithm 1 the error term is still efficiently upper bounded.
Algorithm 1 ZO-NCF-Online-Weak($f$, $x_0$, $\delta$)
1: $\eta \leftarrow \frac{\delta}{C_0^2 \ell^2 \log(100d)}$, $T \leftarrow \frac{C_0^2 \log(100d)}{\eta\delta}$, $\sigma \leftarrow \frac{\eta^2\delta^3}{(100d)^{3C_0}\rho}$, $r \leftarrow (100d)^{C_0}\sigma$
2: $\xi \leftarrow \sigma\frac{\xi_0}{\|\xi_0\|}$, with $\xi_0 \sim \mathcal{N}(0, I)$
3: $x_1 \leftarrow x_0 + \xi$
4: for $t = 1, \ldots, T$ do
5:   $\mu_t \leftarrow \|x_t - x_0\|$
6:   $x_{t+1} = x_t - \eta\,\mathcal{H}_{f_i}(x_0)(x_t - x_0)$ with $\mu = \mu_t$ and $i \in [n]$
7:   if $\|x_{t+1} - x_0\| \ge r$ then return $v = \frac{x_s - x_0}{\|x_s - x_0\|}$ for a uniformly random $s \in [t]$
8: return $v = \perp$
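For concreteness, here is a minimal Python sketch of Algorithm 1 built on the `zo_hvp` sketch from Section 2.2 (an illustrative implementation, not the authors' code); the constant `C0`, the list-of-components oracle interface, and the random-number handling are assumptions, and the theoretical constants are very conservative in practice.

```python
import numpy as np
# Assumes zo_hvp from the Section 2.2 sketch is in scope.

def zo_ncf_online_weak(f_list, x0, delta, ell, rho, C0=4.0, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1 (ZO-NCF-Online-Weak).
    f_list: component functions f_i; returns a unit negative-curvature direction or None (⊥)."""
    d = x0.size
    eta = delta / (C0**2 * ell**2 * np.log(100 * d))
    T = int(np.ceil(C0**2 * np.log(100 * d) / (eta * delta)))
    sigma = eta**2 * delta**3 / ((100 * d)**(3 * C0) * rho)   # very small in theory
    r = (100 * d)**C0 * sigma

    xi = rng.normal(size=d); xi = sigma * xi / np.linalg.norm(xi)
    xs = [x0 + xi]                                   # stores x_1, x_2, ...
    for t in range(1, T + 1):
        x_t = xs[-1]
        mu_t = np.linalg.norm(x_t - x0)              # smoothing parameter of Eq. (2)
        i = rng.integers(len(f_list))                # draw a random component f_i
        x_next = x_t - eta * zo_hvp(f_list[i], x0, x_t - x0, mu=mu_t)
        if np.linalg.norm(x_next - x0) >= r:
            s = rng.integers(1, t + 1)               # uniformly random s in [t]
            v = xs[s - 1] - x0
            return v / np.linalg.norm(v)
        xs.append(x_next)
    return None                                      # corresponds to v = ⊥
```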
Other than the additional error term caused by the ZO approximation, the motivation of ZO-NCF-Online-Weak is almost the same as that of $\text{Neon}^{\text{online}}_{\text{weak}}$. That is, under reasonable control of the approximation error of the Hessian-vector product, it uses the update rule of Oja's method [41] to approximately calculate the eigenvector corresponding to the minimum eigenvalue of $\nabla^2 f(x_0) = \frac{1}{n}\sum_{i=1}^{n} \nabla^2 f_i(x_0)$.
By a similar analysis, we conclude that as long as the minimum eigenvalue of $\nabla^2 f(x_0)$ satisfies $\lambda_{\min}(\nabla^2 f(x_0)) \le -\delta$, ZO-NCF-Online-Weak will stop before iteration $T$ and find a negative curvature direction that aligns well with the eigenvector corresponding to the minimum eigenvalue of $\nabla^2 f(x_0)$.
Then we have the following lemma:
Lemma 2 (ZO-NCF-Online-Weak). The output $v$ of Algorithm 1 satisfies: if $\lambda_{\min}(\nabla^2 f(x_0)) \le -\delta$, then with probability at least $2/3$, $v \neq \perp$ and $v^\top \nabla^2 f(x_0) v \le -\frac{3}{4}\delta$.
We summarize ZO-NCF-Online in Algorithm 2. Specifically, ZO-NCF-Online repeatedly calls ZO-NCF-Online-Weak $\Theta(\log(1/p))$ times to boost the confidence of solving the NCF problem from $2/3$ to $1 - p$. We have the following results:
Lemma 3. In the same setting as in Algorithm 2, define $z = \frac{1}{m}\sum_{j=1}^{m} v^\top \big(\mathcal{H}_{f_{i_j}}(x_0) v\big)$. Then, if $\|v\| \le \frac{\delta}{16}$ and $m = \Theta(\frac{\ell^2}{\delta^2})$, with probability at least $1 - p$ we have
$$\left|\frac{z}{\|v\|^2} - \frac{v^\top \nabla^2 f(x)\, v}{\|v\|^2}\right| \le \frac{\delta}{4}.$$
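To illustrate how ZO-NCF-Online can use these two results, here is a hypothetical sketch of the boosting wrapper (the paper's Algorithm 2 is not reproduced here, so the structure, the curvature threshold, and the rescaling step are assumptions): it calls the weak routine $\Theta(\log(1/p))$ times and verifies each candidate direction with the Rayleigh-quotient estimate $z$ from Lemma 3, built from the `zo_hvp` and `zo_ncf_online_weak` sketches above.

```python
import numpy as np
# Assumes zo_hvp and zo_ncf_online_weak from the earlier sketches are in scope.

def zo_ncf_online(f_list, x0, delta, ell, rho, p=0.01, rng=np.random.default_rng(0)):
    """Hypothetical boosting wrapper: repeat the weak NCF routine and keep a
    candidate only if its estimated curvature is sufficiently negative (Lemma 3)."""
    repeats = int(np.ceil(np.log(1.0 / p)))          # Theta(log(1/p)) weak calls
    m = int(np.ceil((ell / delta)**2))               # m = Theta(ell^2 / delta^2) samples
    for _ in range(repeats):
        v = zo_ncf_online_weak(f_list, x0, delta, ell, rho, rng=rng)
        if v is None:
            continue
        v_small = (delta / 16.0) * v                 # rescale so that ||v_small|| <= delta/16
        idx = rng.integers(len(f_list), size=m)
        z = np.mean([v_small @ zo_hvp(f_list[i], x0, v_small) for i in idx])
        if z / (v_small @ v_small) <= -0.5 * delta:  # curvature check (threshold illustrative)
            return v                                  # verified negative curvature direction
    return None                                       # report "no negative curvature found"
```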