Zeroth-Order Negative Curvature Finding: Escaping
Saddle Points without Gradients
Hualin Zhang^1, Huan Xiong^{2,3}, Bin Gu^{1,3}
1Nanjing University of Information Science & Technology
2Harbin Institute of Technology
3Mohamed bin Zayed University of Artificial Intelligence
{zhanghualin98,huan.xiong.math,jsgubin}@gmail.com
Abstract
We consider escaping saddle points of nonconvex problems where only function evaluations can be accessed. Although a variety of works have been proposed, the majority of them require either second- or first-order information, and only a few of them exploit zeroth-order methods, in particular the technique of negative curvature finding with zeroth-order methods, which has been proven to be the most efficient for escaping saddle points. To fill this gap, in this paper we propose two zeroth-order negative curvature finding frameworks that can replace Hessian-vector product computations without increasing the iteration complexity. We apply the proposed frameworks to ZO-GD, ZO-SGD, ZO-SCSG, and ZO-SPIDER and prove that these ZO algorithms can converge to $(\epsilon, \delta)$-approximate second-order stationary points with lower query complexity than prior zeroth-order works for finding local minima.
1 Introduction
Nonconvex optimization has received wide attention in recent years due to its popularity in modern
machine learning (ML) and deep learning (DL) tasks. Specifically, in this paper, we study the
following unconstrained optimization problem:
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1)$$
where both $f_i(\cdot)$ and $f(\cdot)$ can be nonconvex. In general, finding the global optima of nonconvex functions is NP-hard. Fortunately, finding local optima is a practical alternative, because it has been shown in theory and practice that local optima have performance comparable to global optima in many machine learning problems [18, 19, 30, 21, 20, 23, 31]. Gradient-based methods have been shown to be able to find an $\epsilon$-approximate first-order stationary point ($\|\nabla f(x)\| \le \epsilon$) efficiently, both in the deterministic setting (e.g., gradient descent [37]; accelerated gradient descent [8, 33]) and the stochastic setting (e.g., stochastic gradient descent [37, 43]; SCSG [32]; SPIDER [16]). However, in nonconvex settings, first-order stationary points can be local minima, global minima, or even saddle points. Converging to saddle points leads to highly suboptimal solutions [24, 45] and degrades the model's performance. Thus, escaping saddle points has recently become an important research topic in nonconvex optimization.
Several classical results have shown that, for $\rho$-Hessian Lipschitz functions (see Definition 1), using second-order information such as computing the Hessian [39] or Hessian-vector products [1, 9, 2], one can find an $\epsilon$-approximate second-order stationary point (SOSP, $\|\nabla f(x)\| \le \epsilon$ and $\nabla^2 f(x) \succeq -\sqrt{\rho\epsilon}\, I$). However, when the dimension of $x$ is large, even a single access to the Hessian
is computationally infeasible. A recent line of work shows that, by adding uniform random perturbations, first-order (FO) methods can efficiently escape saddle points and converge to an SOSP. In the deterministic setting, [26] proposed the perturbed gradient descent (PGD) algorithm with gradient query complexity $\tilde{O}(\log^4 d/\epsilon^2)$ by adding uniform random perturbations to the standard gradient descent algorithm. This complexity was later improved to $\tilde{O}(\log^6 d/\epsilon^{1.75})$ by perturbed accelerated gradient descent [28], which replaces the gradient descent step in PGD with Nesterov's accelerated gradient descent.
Table 1: A summary of the results of finding $(\epsilon, \delta)$-approximate SOSPs (see Definition 2) by zeroth-order algorithms. (CoordGE, GaussGE, and RandGE are abbreviations of "coordinate-wise gradient estimator", "Gaussian random gradient estimator", and "uniform random gradient estimator", respectively. RP, RS, and CR are abbreviations of "random perturbation", "random search", and "cubic regularization", respectively.)

| Algorithm | Setting | ZO Oracle | Main Techniques | Function Queries |
|---|---|---|---|---|
| ZPSGD [27] | Deterministic | GaussGE + Noise | RP | $\tilde{O}(d^2/\epsilon^5)$ † |
| PAGD [47] | Deterministic | CoordGE | RP | $O(d\log^4 d/\epsilon^2)$ † |
| RSPI [35] | Deterministic | CoordGE | RS + NCF | $O(d\log d/\epsilon^{8/3})$ ‡ |
| Theorem 4 | Deterministic | CoordGE | NCF | $O(d/\epsilon^2 + d\log d/\delta^{3.5})$ |
| ZO-SCRN [5] | Stochastic | GaussGE | CR | $\tilde{O}(d/\epsilon^{3.5} + d^4/\epsilon^{2.5})$ † |
| Theorem 3 | Stochastic | CoordGE | NCF | $\tilde{O}(d/\epsilon^4 + d/(\epsilon^2\delta^3) + d/\delta^5)$ |
| Theorem 5 | Stochastic | CoordGE + (RandGE) | NCF | $\tilde{O}(d/\epsilon^{10/3} + d/(\epsilon^2\delta^3) + d/\delta^5)$ |
| Theorem 6 | Stochastic | CoordGE | NCF | $\tilde{O}(d/\epsilon^3 + d/(\epsilon^2\delta^2) + d/\delta^5)$ |

† guarantees an $(\epsilon, O(\sqrt{\epsilon}))$-approximate SOSP; ‡ guarantees an $(\epsilon, \epsilon^{2/3})$-approximate SOSP.
Another line of work for escaping saddle points is to utilize negative curvature finding (NCF), which can be combined with $\epsilon$-approximate first-order stationary point (FOSP) finding algorithms to find an $(\epsilon, \delta)$-approximate SOSP. The main task of NCF is to compute an approximation of the smallest eigenvector of the Hessian at a given point. Classical methods for solving NCF, such as the power method and Oja's method, require the computation of Hessian-vector products. Based on the fact that the Hessian-vector product can be approximated by the finite difference of two gradients, [49, 4] proposed the FO NCF frameworks Neon+ and Neon2, respectively. In general, adding perturbations in the negative curvature direction can escape saddle points more efficiently than adding random perturbations by a factor of $\tilde{O}(\mathrm{poly}(\log d))$ in theory. Specifically, in the deterministic setting, CDHS [9] combined with Neon2 can find an $(\epsilon, \delta)$-approximate SOSP with gradient query complexity $\tilde{O}(\log d/\epsilon^{1.75})$. Recently, the same result was achieved by a simple single-loop algorithm [51], which combines the techniques of perturbed accelerated gradient descent and accelerated negative curvature finding. In the online stochastic setting, the best gradient query complexity, $\tilde{O}(1/\epsilon^3)$, is achieved by SPIDER-SFO$^+$ [16], which combines the near-optimal $\epsilon$-approximate FOSP finding algorithm SPIDER with the NCF framework Neon2 to find an $(\epsilon, \delta)$-approximate SOSP.
However, gradient information is not always accessible. Many machine learning and deep learning applications encounter situations where the calculation of explicit gradients is expensive or even infeasible, such as black-box adversarial attacks on deep neural networks [42, 36, 13, 6, 46] and policy search in reinforcement learning [44, 14, 29]. Thus, zeroth-order (ZO) optimization, an important black-box method that uses function values to estimate explicit gradients, is one of the best options for solving this type of ML/DL problem. A considerable body of work has shown that ZO algorithms based on gradient estimation have convergence rates comparable to their gradient-based counterparts. Although many gradient estimation-based ZO algorithms have been proposed in recent years, most of them focus on the performance of converging to FOSPs [40, 22, 25, 16], and only a few on SOSPs [27, 47, 35, 5].
As mentioned above, although there have been several works on finding local minima via ZO methods, they utilize the techniques of random perturbation [27, 47], random search [35], and cubic regularization [5], as shown in Table 1, which are not the most efficient ways of escaping saddle points, as discussed before. Specifically, in the deterministic setting, [27] proposed the ZO perturbed stochastic gradient (ZPSGD) method, which uses a batch of Gaussian-smoothing-based stochastic ZO gradient estimators and adds a random perturbation in each iteration. As a result, ZPSGD can find an $\epsilon$-approximate SOSP using $\tilde{O}(d^2/\epsilon^5)$ function queries. [47] proposed the perturbed approximate gradient descent (PAGD) method, which iteratively conducts gradient descent steps using the forward difference version of the coordinate-wise gradient estimator until it reaches a point with a small gradient. Then, PAGD adds a uniform perturbation and continues the gradient descent steps. The total number of function queries for PAGD to find an $\epsilon$-approximate SOSP is $\tilde{O}(d\log^4 d/\epsilon^2)$. Recently, [35] proposed the random search power iteration (RSPI) method, which alternately performs random search steps and power iteration steps. The power iteration step contains an inexact power iteration subroutine using only the ZO oracle to conduct the NCF, and the core idea is to use a finite difference approach to approximate the Hessian-vector product. In the stochastic setting, [5] proposed a zeroth-order stochastic cubic regularization Newton (ZO-SCRN) method with function query complexity $\tilde{O}(d/\epsilon^{7/2})$ using Gaussian-sampling-based gradient and Hessian estimators. Unfortunately, each iteration of ZO-SCRN needs to solve a cubic minimization subproblem, which does not have a closed-form solution. Typically, inexact solvers for the cubic minimization subproblem need additional computations of Hessian-vector products [1] or gradients [7].
Thus, it is natural to explore faster ZO algorithms based on negative curvature finding to make escaping saddle points more efficient. To the best of our knowledge, negative curvature finding algorithms with access only to a ZO oracle are still missing in the stochastic setting. Inspired by the fact that the gradient can be approximated by finite differences of function queries with high accuracy, a natural question is: Can we turn FO NCF methods (especially the state-of-the-art Neon2) into ZO methods without increasing the iteration complexity, and turn ZO algorithms for finding FOSPs into ones for finding SOSPs?
Contributions. We summarize our main contributions as follows:
• We give an affirmative answer to the above question. We propose two ZO negative curvature finding frameworks, which use only function queries and can detect whether there is a negative curvature direction at a given point $x$ of a smooth, Hessian-Lipschitz function $f: \mathbb{R}^d \to \mathbb{R}$, in the offline deterministic and online stochastic settings, respectively.
• We apply the proposed frameworks to four ZO algorithms, namely ZO-GD, ZO-SGD, ZO-SCSG, and ZO-SPIDER, and prove that these ZO algorithms converge to $(\epsilon, \delta)$-approximate SOSPs.
• In the deterministic setting, compared with the classical setting where $\delta = O(\sqrt{\epsilon})$ [26, 28, 27, 47], or the special case $\delta = \epsilon^{2/3}$ [35], our Theorem 4 is never worse than the other algorithms in Table 1. In the online stochastic setting, none of our algorithms needs to solve the cubic subproblem required by ZO-SCRN, and our Theorem 6 improves the best known function query complexity by a factor of $\tilde{O}(1/\epsilon)$.
2 Preliminaries
Throughout this paper, we use $\|\cdot\|$ to denote the Euclidean norm of a vector and the spectral norm of a matrix. We use $\tilde{O}(\cdot)$ to hide poly-logarithmic terms. For a given set $S$ drawn from $[n] := \{1, 2, \ldots, n\}$, define $f_S(\cdot) := \frac{1}{|S|}\sum_{i \in S} f_i(\cdot)$.
Definition 1. For a twice differentiable nonconvex function $f: \mathbb{R}^d \to \mathbb{R}$,
• $f$ is $\ell$-Lipschitz smooth if $\forall x, y \in \mathbb{R}^d$, $\|\nabla f(x) - \nabla f(y)\| \le \ell \|x - y\|$.
• $f$ is $\rho$-Hessian Lipschitz if $\forall x, y \in \mathbb{R}^d$, $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le \rho \|x - y\|$.

Definition 2. For a twice differentiable nonconvex function $f: \mathbb{R}^d \to \mathbb{R}$, we say
• $x \in \mathbb{R}^d$ is an $\epsilon$-approximate first-order stationary point if $\|\nabla f(x)\| \le \epsilon$.
• $x \in \mathbb{R}^d$ is an $(\epsilon, \delta)$-approximate second-order stationary point if $\|\nabla f(x)\| \le \epsilon$ and $\nabla^2 f(x) \succeq -\delta I$.
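As a concrete illustration of Definition 2 (not part of the original paper), the following minimal Python sketch checks the two conditions on a toy quadratic saddle $f(x) = x_1^2 - x_2^2$; the helper `is_approx_sosp` and its thresholds are illustrative assumptions.

```python
import numpy as np

def is_approx_sosp(grad, hess, eps, delta):
    """Check the (eps, delta)-approximate SOSP conditions of Definition 2:
    ||grad f(x)|| <= eps and lambda_min(hess f(x)) >= -delta."""
    grad_ok = np.linalg.norm(grad) <= eps
    curv_ok = np.linalg.eigvalsh(hess).min() >= -delta
    return grad_ok and curv_ok

# Toy example: f(x) = x1^2 - x2^2 has a strict saddle at the origin.
x = np.zeros(2)
grad = np.array([2 * x[0], -2 * x[1]])          # gradient of f at x
hess = np.array([[2.0, 0.0], [0.0, -2.0]])      # Hessian of f (constant)

print(np.linalg.norm(grad) <= 0.1)                     # True: the origin is an FOSP
print(is_approx_sosp(grad, hess, eps=0.1, delta=0.1))  # False: lambda_min = -2 < -0.1
```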
We need the following assumptions, which are standard in the literature on finding SOSPs [4, 16, 51].
Assumption 1. We assume that $f(\cdot)$ in (1) satisfies:
• $\Delta_f := f(x_0) - f(x^*) < \infty$, where $x^* := \arg\min_x f(x)$.
• Each component function $f_i(x)$ is $\ell$-Lipschitz smooth and $\rho$-Hessian Lipschitz.
• (For the online case only) The variance of the stochastic gradient is bounded: $\forall x \in \mathbb{R}^d$, $\mathbb{E}\|\nabla f_i(x) - \nabla f(x)\|^2 \le \sigma^2$.
We will also need the following more stringent assumption to obtain high-probability convergence results for ZO-SPIDER.

Assumption 2. We assume that Assumption 1 holds and, in addition, the gradient of each component function $f_i(x)$ satisfies $\forall i, \forall x \in \mathbb{R}^d$, $\|\nabla f_i(x) - \nabla f(x)\|^2 \le \sigma^2$.
2.1 ZO Gradient Estimators
Given a smooth, Hessian-Lipschitz function $f$, the central difference version of the deterministic coordinate-wise gradient estimator is defined by
$$\hat{\nabla}_{\mathrm{coord}} f(x) = \sum_{i=1}^{d} \frac{f(x + \mu e_i) - f(x - \mu e_i)}{2\mu} e_i, \qquad \text{(CoordGradEst)}$$
where $e_i$ denotes the standard basis vector with $1$ at its $i$-th coordinate and $0$ otherwise, and $\mu$ is the smoothing parameter, a sufficiently small positive constant. The central difference version of the random gradient estimator is defined by
$$\hat{\nabla}_{\mathrm{rand}} f(x) = d\,\frac{f(x + \mu u) - f(x - \mu u)}{2\mu}\, u, \qquad \text{(RandGradEst)}$$
where $u \in \mathbb{R}^d$ is a random direction drawn from the uniform distribution over the unit sphere, and $\mu$ is the smoothing parameter, a sufficiently small positive constant.
Remark 1. Deterministic vs. Random: CoordGradEst needs $d$ times more function queries than RandGradEst. However, as will be discussed in Section 4, it has a lower approximation error and thus can reduce the iteration complexity. Central Difference vs. Forward Difference (please refer to Appendix A.1): Under the Hessian Lipschitz assumption, a smaller approximation error bound can be obtained by the central difference version of both CoordGradEst and RandGradEst.
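To make the two estimators concrete, here is a minimal NumPy sketch (not from the paper) of CoordGradEst and RandGradEst exactly as written above; the function names and the test objective are illustrative assumptions.

```python
import numpy as np

def coord_grad_est(f, x, mu=1e-4):
    """Central-difference coordinate-wise gradient estimator (CoordGradEst).
    Uses 2d function queries per call."""
    d = x.size
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = 1.0
        g[i] = (f(x + mu * e) - f(x - mu * e)) / (2.0 * mu)
    return g

def rand_grad_est(f, x, mu=1e-4, rng=np.random.default_rng(0)):
    """Central-difference random gradient estimator (RandGradEst).
    Uses only 2 function queries per call, at the cost of higher variance."""
    d = x.size
    u = rng.normal(size=d); u /= np.linalg.norm(u)   # uniform direction on the unit sphere
    return d * (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u

# Quick check on the smooth test function f(x) = ||x||^2, whose gradient is 2x.
f = lambda x: float(x @ x)
x = np.array([1.0, -2.0, 0.5])
print(coord_grad_est(f, x))   # close to [2, -4, 1]
print(rand_grad_est(f, x))    # noisy single-direction estimate
```

This also makes the trade-off in Remark 1 visible: the coordinate-wise estimator pays $2d$ queries for a near-exact gradient, while the random estimator pays $2$ queries for a high-variance one.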
2.2 ZO Hessian-Vector Product Estimator
By the definition of the derivative, $\nabla^2 f(x) \cdot v = \lim_{\mu \to 0} \frac{\nabla f(x + \mu v) - \nabla f(x)}{\mu}$, so $\nabla^2 f(x) \cdot v$ can be approximated by the difference of two gradients, $\nabla f(x + v) - \nabla f(x)$, for some $v$ with small magnitude. On the other hand, $\nabla f(x + v)$ and $\nabla f(x)$ can be approximated with high accuracy by $\hat{\nabla}_{\mathrm{coord}} f(x + v)$ and $\hat{\nabla}_{\mathrm{coord}} f(x)$, respectively. The coordinate-wise Hessian-vector product estimator is then defined by
$$\mathcal{H}_f(x)v \triangleq \sum_{i=1}^{d} \frac{f(x + v + \mu e_i) - f(x + v - \mu e_i) + f(x - \mu e_i) - f(x + \mu e_i)}{2\mu}\, e_i. \qquad (2)$$
Note that we do not need to know an explicit representation of $\mathcal{H}_f(x)$; it is merely notation for a virtual matrix and can be viewed as the Hessian $\nabla^2 f(x)$ with minor perturbations. As stated in the following lemma, the approximation error is efficiently upper bounded.
Lemma 1. Assume that $f$ is $\rho$-Hessian Lipschitz. Then for any smoothing parameter $\mu$ and any $x \in \mathbb{R}^d$, we have
$$\|\mathcal{H}_f(x)v - \nabla^2 f(x)v\| \le \frac{\rho\|v\|^2}{2} + \frac{\sqrt{d}\,\rho\mu^2}{3}. \qquad (3)$$
The ZO Hessian-vector product estimator was previously studied in [50, 35], but we provide a tighter bound than that of Lemma 6 in [35]. This is because we utilize properties of the central difference version of the coordinate-wise gradient estimator under the Hessian Lipschitz assumption. It then follows directly that if $f(\cdot)$ is quadratic, we have $\rho = 0$ and $\|\mathcal{H}_f(x)v - \nabla^2 f(x)v\| = 0$.
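As an illustration of estimator (2), here is a sketch under the paper's definitions (not the authors' code) that builds the ZO Hessian-vector product from four function queries per coordinate and compares it with the exact Hessian-vector product on a small Hessian-Lipschitz test function; the test function and parameter values are assumptions.

```python
import numpy as np

def zo_hvp(f, x, v, mu=1e-3):
    """Coordinate-wise ZO Hessian-vector product estimator, Eq. (2):
    H_f(x) v ~= sum_i [f(x+v+mu e_i) - f(x+v-mu e_i) + f(x-mu e_i) - f(x+mu e_i)] / (2 mu) e_i."""
    d = x.size
    hv = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = 1.0
        hv[i] = (f(x + v + mu * e) - f(x + v - mu * e)
                 + f(x - mu * e) - f(x + mu * e)) / (2.0 * mu)
    return hv

# Test on f(x) = sum(x^3)/6 + ||x||^2/2, whose Hessian is diag(x) + I (rho-Hessian Lipschitz, rho = 1).
f = lambda x: float(np.sum(x**3) / 6.0 + 0.5 * x @ x)
x = np.array([0.3, -0.7, 1.1])
v = 1e-2 * np.array([1.0, 2.0, -1.0])
exact = (np.diag(x) + np.eye(3)) @ v
print(np.linalg.norm(zo_hvp(f, x, v) - exact))   # small, consistent with the bound in Lemma 1
```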
3 Zeroth-Order Negative Curvature Finding
In this section, we introduce how to find a negative curvature direction near a saddle point using zeroth-order methods. Recently, based on the fact that the Hessian-vector product $\nabla^2 f(x) \cdot v$ can be approximated by $\nabla f(x + v) - \nabla f(x)$ with approximation error up to $O(\|v\|^2)$, [4] proposed a FO framework named Neon2 that replaces the Hessian-vector product computations in the NCF subroutine with gradient computations and thus turns a FO algorithm for finding FOSPs into a FO algorithm for finding SOSPs. Enlightened by Neon2, we propose two zeroth-order NCF frameworks (i.e., ZO-NCF-Online and ZO-NCF-Deterministic) using only function queries to solve nonconvex problems in the online stochastic setting and the offline deterministic setting, respectively.
3.1 Stochastic Setting
In this subsection, we focus on solving the NCF problem with zeroth-order methods in the online stochastic setting and propose ZO-NCF-Online. Before introducing ZO-NCF-Online, we first introduce ZO-NCF-Online-Weak, which solves the NCF problem with weak confidence $2/3$.
We summarize ZO-NCF-Online-Weak in Algorithm 1. Specifically, ZO-NCF-Online-Weak consists of at most $T = O(\frac{\log^2 d}{\delta^2})$ iterations and works as follows. Given a detection point $x_0$, add a random perturbation with small magnitude $\sigma$ to obtain the starting point. At the $t$-th iteration, where $t = 1, \ldots, T$, set $\mu_t = \|x_t - x_0\|$ to be the smoothing parameter $\mu$ in (2). Then we keep updating $x_{t+1} = x_t - \eta\,\mathcal{H}_{f_i}(x_0)(x_t - x_0)$, where $\mathcal{H}_{f_i}(x_0)(x_t - x_0)$ is the ZO Hessian-vector product estimator, and stop whenever $\|x_{t+1} - x_0\| \ge r$ or the maximum iteration number $T$ is reached. Thus, as long as Algorithm 1 does not terminate, the approximation error $\|\mathcal{H}_{f_i}(x_0)(x_t - x_0) - \nabla^2 f_i(x_0)(x_t - x_0)\|$ can be bounded by $O(\sqrt{d}\,r^2)$ according to Lemma 1. Note that, although this error bound is worse by a factor of $O(\sqrt{d})$ than that of $\text{Neon}^{\text{online}}_{\text{weak}}$ in [4], which uses the difference of two gradients to approximate the Hessian-vector product and achieves an approximation error up to $O(r^2)$, with our choice of $r$ in Algorithm 1 the error term is still efficiently upper bounded.
Algorithm 1 ZO-NCF-Online-Weak($f$, $x_0$, $\delta$)
1: $\eta \leftarrow \frac{\delta}{C_0^2 \ell^2 \log(100d)}$, $T \leftarrow \frac{C_0^2 \log(100d)}{\eta\delta}$, $\sigma \leftarrow \frac{\eta^2\delta^3}{(100d)^{3C_0}\rho}$, $r \leftarrow (100d)^{C_0}\sigma$
2: $\xi \leftarrow \sigma\frac{\xi_0}{\|\xi_0\|}$, with $\xi_0 \sim \mathcal{N}(0, I)$
3: $x_1 \leftarrow x_0 + \xi$
4: for $t = 1, \ldots, T$ do
5:   $\mu_t \leftarrow \|x_t - x_0\|$
6:   $x_{t+1} = x_t - \eta\,\mathcal{H}_{f_i}(x_0)(x_t - x_0)$ with $\mu = \mu_t$ and $i \in [n]$
7:   if $\|x_{t+1} - x_0\| \ge r$ then return $v = \frac{x_s - x_0}{\|x_s - x_0\|}$ for a uniformly random $s \in [t]$
8: return $v = \perp$
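For concreteness, here is a minimal Python sketch of Algorithm 1 built on the `zo_hvp` sketch from Section 2.2 (an illustrative implementation, not the authors' code); the constant `C0`, the list-of-components oracle interface, and the random-number handling are assumptions, and the theoretical constants are very conservative in practice.

```python
import numpy as np
# Assumes zo_hvp from the Section 2.2 sketch is in scope.

def zo_ncf_online_weak(f_list, x0, delta, ell, rho, C0=4.0, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1 (ZO-NCF-Online-Weak).
    f_list: component functions f_i; returns a unit negative-curvature direction or None (⊥)."""
    d = x0.size
    eta = delta / (C0**2 * ell**2 * np.log(100 * d))
    T = int(np.ceil(C0**2 * np.log(100 * d) / (eta * delta)))
    sigma = eta**2 * delta**3 / ((100 * d)**(3 * C0) * rho)   # very small in theory
    r = (100 * d)**C0 * sigma

    xi = rng.normal(size=d); xi = sigma * xi / np.linalg.norm(xi)
    xs = [x0 + xi]                                   # stores x_1, x_2, ...
    for t in range(1, T + 1):
        x_t = xs[-1]
        mu_t = np.linalg.norm(x_t - x0)              # smoothing parameter of Eq. (2)
        i = rng.integers(len(f_list))                # draw a random component f_i
        x_next = x_t - eta * zo_hvp(f_list[i], x0, x_t - x0, mu=mu_t)
        if np.linalg.norm(x_next - x0) >= r:
            s = rng.integers(1, t + 1)               # uniformly random s in [t]
            v = xs[s - 1] - x0
            return v / np.linalg.norm(v)
        xs.append(x_next)
    return None                                      # corresponds to v = ⊥
```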
Other than the additional error term caused by the ZO approximation, the motivation of ZO-NCF-Online-Weak is almost the same as that of $\text{Neon}^{\text{online}}_{\text{weak}}$. That is, under reasonable control of the approximation error of the Hessian-vector product, it uses the update rule of Oja's method [41] to approximately calculate the eigenvector corresponding to the minimum eigenvalue of $\nabla^2 f(x_0) = \frac{1}{n}\sum_{i=1}^{n} \nabla^2 f_i(x_0)$.
By a similar analysis, we conclude that as long as the minimum eigenvalue of $\nabla^2 f(x_0)$ satisfies $\lambda_{\min}(\nabla^2 f(x_0)) \le -\delta$, ZO-NCF-Online-Weak will stop before iteration $T$ and find a negative curvature direction that aligns well with the eigenvector corresponding to the minimum eigenvalue of $\nabla^2 f(x_0)$.
Then we have the following lemma:
Lemma 2 (ZO-NCF-Online-Weak). The output $v$ of Algorithm 1 satisfies: if $\lambda_{\min}(\nabla^2 f(x_0)) \le -\delta$, then with probability at least $2/3$, $v \neq \perp$ and $v^\top \nabla^2 f(x_0) v \le -\frac{3}{4}\delta$.
We summarize ZO-NCF-Online in Algorithm 2. Specifically, ZO-NCF-Online repeatedly calls ZO-NCF-Online-Weak $\Theta(\log(1/p))$ times to boost the confidence of solving the NCF problem from $2/3$ to $1 - p$. We have the following results:
Lemma 3. In the same setting as in Algorithm 2, define $z = \frac{1}{m}\sum_{j=1}^{m} v^\top \big(\mathcal{H}_{f_{i_j}}(x_0) v\big)$. Then, if $\|v\| \le \frac{\delta}{16}$ and $m = \Theta(\frac{\ell^2}{\delta^2})$, with probability at least $1 - p$ we have
$$\left|\frac{z}{\|v\|^2} - \frac{v^\top \nabla^2 f(x)\, v}{\|v\|^2}\right| \le \frac{\delta}{4}.$$
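To illustrate how ZO-NCF-Online can use these two results, here is a hypothetical sketch of the boosting wrapper (the paper's Algorithm 2 is not reproduced here, so the structure, the curvature threshold, and the rescaling step are assumptions): it calls the weak routine $\Theta(\log(1/p))$ times and verifies each candidate direction with the Rayleigh-quotient estimate $z$ from Lemma 3, built from the `zo_hvp` and `zo_ncf_online_weak` sketches above.

```python
import numpy as np
# Assumes zo_hvp and zo_ncf_online_weak from the earlier sketches are in scope.

def zo_ncf_online(f_list, x0, delta, ell, rho, p=0.01, rng=np.random.default_rng(0)):
    """Hypothetical boosting wrapper: repeat the weak NCF routine and keep a
    candidate only if its estimated curvature is sufficiently negative (Lemma 3)."""
    repeats = int(np.ceil(np.log(1.0 / p)))          # Theta(log(1/p)) weak calls
    m = int(np.ceil((ell / delta)**2))               # m = Theta(ell^2 / delta^2) samples
    for _ in range(repeats):
        v = zo_ncf_online_weak(f_list, x0, delta, ell, rho, rng=rng)
        if v is None:
            continue
        v_small = (delta / 16.0) * v                 # rescale so that ||v_small|| <= delta/16
        idx = rng.integers(len(f_list), size=m)
        z = np.mean([v_small @ zo_hvp(f_list[i], x0, v_small) for i in idx])
        if z / (v_small @ v_small) <= -0.5 * delta:  # curvature check (threshold illustrative)
            return v                                  # verified negative curvature direction
    return None                                       # report "no negative curvature found"
```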