
is computationally infeasible. A recent line of work shows that, by adding uniform random perturbations, first-order (FO) methods can efficiently escape saddle points and converge to SOSPs. In the deterministic setting, [26] proposed the perturbed gradient descent (PGD) algorithm, which adds a uniform random perturbation to the standard gradient descent iterate and achieves a gradient query complexity of $\tilde{O}(\log^4 d/\epsilon^2)$. This complexity was later improved to $\tilde{O}(\log^6 d/\epsilon^{1.75})$ by perturbed accelerated gradient descent [28], which replaces the gradient descent step in PGD with Nesterov's accelerated gradient descent.
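To make the mechanism concrete, the following is a minimal sketch of perturbed gradient descent in the spirit of PGD: run gradient descent and, whenever the gradient is small (a candidate saddle point), inject a uniform random perturbation drawn from a small ball. The step size, perturbation radius, and escape schedule below are illustrative placeholders, not the tuned constants of [26].

```python
# Minimal perturbed-gradient-descent sketch (illustrative parameters only).
import numpy as np

def perturbed_gradient_descent(grad_f, x0, eta=0.01, eps=1e-3, r=1e-3,
                               t_escape=50, max_iter=10_000, seed=0):
    """Gradient descent; when the gradient is small (possible saddle point),
    add a uniform random perturbation from a ball of radius r, then continue."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    last_perturb = -np.inf  # iteration index of the last perturbation
    for t in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps and t - last_perturb > t_escape:
            xi = rng.uniform(-1.0, 1.0, size=x.shape)
            xi *= r / max(np.linalg.norm(xi), 1e-12)  # rescale into the r-ball
            x = x + xi
            last_perturb = t
        else:
            x = x - eta * g
    return x
```

On the toy saddle $f(x) = x_1^2 - x_2^2$, plain gradient descent started at the origin never moves, whereas the perturbation lets the iterate slide off along the negative-curvature direction.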
Table 1: A summary of the results of finding $(\epsilon, \delta)$-approximate SOSPs (see Definition 2) by zeroth-order algorithms. (CoordGE, GaussGE, and RandGE are abbreviations of "coordinate-wise gradient estimator", "Gaussian random gradient estimator", and "uniform random gradient estimator", respectively. RP, RS, NCF, and CR are abbreviations of "random perturbation", "random search", "negative curvature finding", and "cubic regularization", respectively.)
| Algorithm | Setting | ZO Oracle | Main Techniques | Function Queries |
|---|---|---|---|---|
| ZPSGD [27] | Deterministic | GaussGE + Noise | RP | $\tilde{O}\left(\frac{d^2}{\epsilon^5}\right)$† |
| PAGD [47] | Deterministic | CoordGE | RP | $O\left(\frac{d\log^4 d}{\epsilon^2}\right)$† |
| RSPI [35] | Deterministic | CoordGE | RS + NCF | $O\left(\frac{d\log d}{\epsilon^{8/3}}\right)$‡ |
| Theorem 4 | Deterministic | CoordGE | NCF | $O\left(\frac{d}{\epsilon^2} + \frac{d\log d}{\delta^{3.5}}\right)$ |
| ZO-SCRN [5] | Stochastic | GaussGE | CR | $\tilde{O}\left(\frac{d}{\epsilon^{3.5}} + \frac{d^4}{\epsilon^{2.5}}\right)$† |
| Theorem 3 | Stochastic | CoordGE | NCF | $\tilde{O}\left(\frac{d}{\epsilon^4} + \frac{d}{\epsilon^2\delta^3} + \frac{d}{\delta^5}\right)$ |
| Theorem 5 | Stochastic | CoordGE + (RandGE) | NCF | $\tilde{O}\left(\frac{d}{\epsilon^{10/3}} + \frac{d}{\epsilon^2\delta^3} + \frac{d}{\delta^5}\right)$ |
| Theorem 6 | Stochastic | CoordGE | NCF | $\tilde{O}\left(\frac{d}{\epsilon^3} + \frac{d}{\epsilon^2\delta^2} + \frac{d}{\delta^5}\right)$ |

†guarantees an $(\epsilon, O(\sqrt{\epsilon}))$-approximate SOSP, and ‡guarantees an $(\epsilon, \epsilon^{2/3})$-approximate SOSP.
Another line of work for escaping saddle points utilizes negative curvature finding (NCF), which can be combined with $\epsilon$-approximate first-order stationary point (FOSP) finding algorithms to find an $(\epsilon, \delta)$-approximate SOSP. The main task of NCF is to compute an approximate smallest eigenvector of the Hessian at a given point. Classical methods for NCF, such as the power method and Oja's method, require Hessian-vector products. Based on the fact that a Hessian-vector product can be approximated by the finite difference of two gradients, [49, 4] proposed the FO NCF frameworks Neon+ and Neon2, respectively. In general, adding perturbations in the negative curvature direction escapes saddle points more efficiently than adding random perturbations, by a factor of $\tilde{O}(\mathrm{poly}(\log d))$ in theory. Specifically, in the deterministic setting, CDHS [9] combined with Neon2 can find an $(\epsilon, \delta)$-approximate SOSP with gradient query complexity $\tilde{O}(\log d/\epsilon^{1.75})$. Recently, the same result was achieved by a simple single-loop algorithm [51], which combines perturbed accelerated gradient descent with accelerated negative curvature finding. In the online stochastic setting, the best gradient query complexity, $\tilde{O}(1/\epsilon^3)$, is achieved by SPIDER-SFO$^+$ [16], which combines the near-optimal $\epsilon$-approximate FOSP finding algorithm SPIDER with the NCF framework Neon2 to find an $(\epsilon, \delta)$-approximate SOSP.
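To illustrate the gradient-based NCF step concretely, the following is a minimal sketch that assumes only gradient access: the Hessian-vector product is approximated by a finite difference of two gradients and fed into a power iteration on the shifted operator $L I - \nabla^2 f(x)$, whose top eigenvector is the eigenvector of the Hessian with the most negative eigenvalue. The smoothness bound `L`, the finite-difference radius, and the iteration count are illustrative assumptions, not the exact parameters of Neon+ or Neon2.

```python
# Minimal negative-curvature-finding sketch via finite-difference Hessian-vector products.
import numpy as np

def hvp_fd(grad_f, x, v, r=1e-5):
    """Approximate H(x) v by the finite difference (grad f(x + r v) - grad f(x)) / r."""
    return (grad_f(x + r * v) - grad_f(x)) / r

def negative_curvature_direction(grad_f, x, L=1.0, iters=200, seed=0):
    """Power iteration on (L*I - H): its top eigenvector corresponds to the
    smallest eigenvalue of H, i.e. the most negative curvature direction."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(x.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = L * v - hvp_fd(grad_f, x, v)   # (L*I - H) v using two gradient queries
        v = w / max(np.linalg.norm(w), 1e-12)
    curvature = v @ hvp_fd(grad_f, x, v)   # Rayleigh quotient, estimates lambda_min(H)
    return v, curvature
```

If the returned curvature estimate is below $-\delta$, the iterate can be moved along (or against) the returned direction to escape the saddle point; otherwise the point is already an approximate SOSP.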
However, gradient information is not always accessible. Many machine learning and deep learning applications encounter situations where computing explicit gradients is expensive or even infeasible, such as black-box adversarial attacks on deep neural networks [42, 36, 13, 6, 46] and policy search in reinforcement learning [44, 14, 29]. Thus, zeroth-order (ZO) optimization, an important gradient-estimation-based black-box method that uses only function values to estimate gradients, is one of the best options for solving this type of ML/DL problem. A considerable body of work has shown that ZO algorithms based on gradient estimation have convergence rates comparable to their gradient-based counterparts. Although many gradient estimation-based ZO algorithms have been proposed in recent years, most of them focus on the performance of converging to FOSPs [40, 22, 25, 16], and only a few on SOSPs [27, 47, 35, 5].
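As a concrete illustration of the ZO oracles referenced in Table 1, the sketch below shows two standard gradient estimators built purely from function evaluations; the smoothing parameter and sample count are illustrative assumptions rather than the tuned values used in the cited works.

```python
# Minimal sketches of the coordinate-wise (CoordGE) and Gaussian (GaussGE) gradient estimators.
import numpy as np

def coord_grad_estimator(f, x, mu=1e-5):
    """CoordGE: central finite differences along each coordinate (2d function queries)."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = 1.0
        g[i] = (f(x + mu * e) - f(x - mu * e)) / (2 * mu)
    return g

def gauss_grad_estimator(f, x, mu=1e-3, num_samples=20, seed=0):
    """GaussGE: average of Gaussian-smoothed directional differences (Nesterov-Spokoiny style)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f(x)) / mu * u
    return g / num_samples
```

CoordGE spends $2d$ queries per estimate but has a deterministic, dimension-wise error bound, while GaussGE trades query count for variance; this trade-off underlies the dimension factors in the complexities of Table 1.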
As mentioned above, although there have been several works on finding local minima via ZO methods, they rely on random perturbations [27, 47], random search [35], or cubic regularization [5], as shown in Table 1, which, as discussed above, are not the most efficient techniques for escaping saddle points. Specifically, in the deterministic setting, [27] proposed the ZO perturbed