
Table 1: Complexity of sparsity-enforcing algorithms. We give the query complexity for a precision ε, up to the system error (see Section 4). For first-order (FO) algorithms, we give it in terms of the number of first-order oracle calls (#IFO), that is, calls to ∇f(x, ξ), and for ZO algorithms, in terms of calls to f(ξ, ·). Here κ denotes the condition number L/ν, where L is the smoothness (or RSS) constant and ν is the strong-convexity (or RSC) constant.
Type  | Name        | Assumptions                                                 | #IZO/#IFO                  | #HT ops.
FO/ℓ0 | StoIHT [31] | RSS, RSC                                                    | O(κ log(1/ε))              | O(κ log(1/ε))
ZO/ℓ1 | RSPGF [14]  | smooth³                                                     | O(d/ε²)                    | —
ZO/ℓ1 | ZSCG² [2]   | convex, smooth                                              | O(d/ε²)                    | —
ZO/ℓ1 | ZORO [7]    | s-sparse gradient, weakly sparse Hessian, smooth³, RSC_bis¹ | O(s log(d) log(1/ε))       | —
ZO/ℓ0 | SZOHT       | RSS, RSC                                                    | O((k + d/s²) κ² log(1/ε))  | O(κ² log(1/ε))
ZO/ℓ0 | SZOHT       | smooth, RSC                                                 | O(k κ² log(1/ε))           | O(κ² log(1/ε))
¹ The definition of Restricted Strong Convexity from [7] is different from ours and that of [31], hence the bis subscript.
² We refer to the modified version of ZSCG (Algorithm 3 in [2]).
³ RSPGF and ZORO minimize f(x) + λ∥x∥₁: only f needs to be smooth.
Nguyen et al. [31], who developed a stochastic gradient descent (SGD) version of hard thresholding (StoIHT), and furthermore with Zhou et al. [47], Shen and Li [36], and Li et al. [20], who used variance reduction techniques to improve upon StoIHT.
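For concreteness, the following is a minimal NumPy sketch of the hard-thresholding operator at the core of IHT-style methods such as StoIHT: the projection onto the set {x : ∥x∥0 ≤ k}, obtained by keeping the k largest-magnitude coordinates and zeroing out the rest (the function name and the example values are ours, for illustration only).

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest.

    This is the Euclidean projection onto the set {x : ||x||_0 <= k},
    applied by IHT/StoIHT-type methods after each gradient step.
    """
    out = np.zeros_like(x)
    if k <= 0:
        return out
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

# Example: keep the 2 largest-magnitude coordinates.
x = np.array([0.1, -3.0, 0.5, 2.0])
print(hard_threshold(x, 2))  # [ 0. -3.  0.  2.]
```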
However, the first-order gradients used in the above methods may be either unavailable or expensive to compute in many real-world problems. For example, in certain graphical modeling tasks [42], obtaining the gradient of the objective function is computationally hard. Even worse, in some settings the gradient is inaccessible by nature, for instance in bandit problems [35], black-box adversarial attacks [40, 9, 10], or reinforcement learning [34, 27, 11]. To tackle those problems, ZO optimization methods have been developed [30]. Those methods usually replace the inaccessible gradient by a finite-difference approximation that can be computed from function evaluations only, following the idea that for a differentiable function f: ℝ → ℝ, we have f′(x) = lim_{h→0} [f(x + h) − f(x)]/h.
Later on, ZO methods have been adapted to handle convex constraint sets, and can therefore be used to solve the ℓ1 convex relaxation of problem (1). To that end, Ghadimi et al. [14] and Cai et al. [7] introduce proximal ZO algorithms, Liu et al. [24] introduce a ZO projected gradient algorithm, and Balasubramanian and Ghadimi [2] introduce a ZO conditional gradient [19] algorithm. We provide a review of those results in Table 1. As can be seen from the table, their query complexity is high (linear in d), except for [7], which has a complexity of O(s log(d) log(1/ε)) but assumes that gradients are sparse. In addition, those methods must introduce a hyperparameter λ (the strength of the ℓ1 penalty) or R (the radius of the ℓ1 ball), which needs to be tuned to find the value that ensures the desired sparsity level. Therefore, it would be interesting to use the hard-thresholding techniques described in the previous paragraph instead of those convex relaxations.
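To make the finite-difference idea above concrete, here is a generic two-point zeroth-order gradient estimator in NumPy, which averages directional finite differences over random unit directions. It is a standard construction shown for illustration only, not the exact estimator of any of the cited methods; the smoothing parameter mu and the number of sampled directions are placeholder values.

```python
import numpy as np

def zo_gradient_estimate(f, x, mu=1e-4, num_dirs=10, rng=None):
    """Generic two-point zeroth-order gradient estimate.

    Averages d * (f(x + mu*u) - f(x)) / mu * u over random unit directions u,
    using only function evaluations (no access to the true gradient).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    fx = f(x)
    g = np.zeros(d)
    for _ in range(num_dirs):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)              # random direction on the unit sphere
        g += (f(x + mu * u) - fx) / mu * u  # directional finite difference
    return (d / num_dirs) * g

# Example on a simple quadratic, whose true gradient at x is 2*x.
f = lambda v: np.sum(v ** 2)
x = np.array([1.0, -2.0, 0.5])
print(zo_gradient_estimate(f, x, num_dirs=200))  # roughly [ 2., -4.,  1.]
```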
Unfortunately, ZO hard-thresholding gradient algorithms have not been explored formally. Moreover, whether ZO gradients can work with the hard-thresholding operator at all is still an open problem. Although a related algorithm was recently proposed by Balasubramanian and Ghadimi [2], it does not target the ℓ0 constrained optimization problem and, importantly, relies on strong assumptions in its convergence analysis. Indeed, they assume that the gradients, as well as the solution of the unconstrained problem, are s-sparse: ∥∇f(x)∥0 ≤ s and ∥x*∥0 ≤ s* ≈ s, where x* = arg min_x f(x). In addition, it was recently shown by Cai et al. [7] that they must in fact assume that the support of the gradient is fixed for all x ∈ X for their convergence result to hold, which is a hard limitation, since it amounts to saying that the function f depends on only s fixed variables.
To fill this gap, in this paper we focus on ℓ0 constrained black-box stochastic optimization problems and propose a novel stochastic zeroth-order gradient hard-thresholding (SZOHT) algorithm. Specifically, we propose a dimension-friendly ZO gradient estimator powered by a novel random support sampling technique, and then embed it into the standard hard-thresholding operator.
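As an illustration of this overall idea only (not the exact SZOHT algorithm, whose estimator, step sizes, and guarantees are specified later in the paper), a simplified sketch combining coordinate-wise finite differences on a randomly sampled support with the hard-thresholding operator could look as follows; all parameter values and the sparse quadratic test function are placeholders.

```python
import numpy as np

def zo_ht_sketch(f, d, k, s, step=0.1, mu=1e-4, iters=200, rng=None):
    """Simplified sketch of a zeroth-order hard-thresholding loop.

    Each iteration estimates the gradient by coordinate-wise finite
    differences on a randomly sampled support of size s, takes a gradient
    step, and keeps only the k largest-magnitude coordinates. Illustrative
    only: this is not the exact SZOHT estimator or its tuning.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(d)
    for _ in range(iters):
        support = rng.choice(d, size=s, replace=False)  # random support sampling
        fx = f(x)
        g = np.zeros(d)
        for j in support:                               # finite differences on the support
            e = np.zeros(d)
            e[j] = 1.0
            g[j] = (f(x + mu * e) - fx) / mu
        g *= d / s                                      # rescale for the sampled coordinates
        z = x - step * g                                # (estimated) gradient step
        x = np.zeros(d)                                 # hard thresholding: keep top-k
        idx = np.argpartition(np.abs(z), -k)[-k:]
        x[idx] = z[idx]
    return x

# Example: sparse quadratic f(x) = ||x - x_star||^2 with a 3-sparse minimizer.
d = 50
x_star = np.zeros(d)
x_star[[3, 17, 42]] = [1.0, -2.0, 0.5]
f = lambda v: np.sum((v - x_star) ** 2)
x_hat = zo_ht_sketch(f, d, k=3, s=10)
print(np.nonzero(x_hat)[0])  # ideally recovers the support {3, 17, 42}
```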