Zeroth-Order Hard-Thresholding:
Gradient Error vs. Expansivity
William de Vazelhes1, Hualin Zhang2, Huimin Wu2, Xiao-Tong Yuan2, Bin Gu1
1Mohamed bin Zayed University of Artificial Intelligence
2Nanjing University of Information Science & Technology
{wdevazelhes,zhanghualin98,xtyuan1980,jsgubin}@gmail.com,
wuhuimin@nuist.edu.cn
Abstract
$\ell_0$-constrained optimization is prevalent in machine learning, particularly for high-dimensional problems, because it is a fundamental approach to achieve sparse learning. Hard-thresholding gradient descent is a dominant technique to solve this problem. However, first-order gradients of the objective function may be either unavailable or expensive to calculate in many real-world problems, where zeroth-order (ZO) gradients could be a good surrogate. Unfortunately, whether ZO gradients can work with the hard-thresholding operator is still an unsolved problem. To solve this puzzle, in this paper we focus on $\ell_0$-constrained black-box stochastic optimization problems, and propose a new stochastic zeroth-order gradient hard-thresholding (SZOHT) algorithm with a general ZO gradient estimator powered by a novel random support sampling. We provide the convergence analysis of SZOHT under standard assumptions. Importantly, we reveal a conflict between the deviation of ZO estimators and the expansivity of the hard-thresholding operator, and provide a theoretical minimal value for the number of random directions in ZO gradients. In addition, we find that the query complexity of SZOHT is independent of or weakly dependent on the dimensionality under different settings. Finally, we illustrate the utility of our method on a portfolio optimization problem as well as black-box adversarial attacks.
1 Introduction
$\ell_0$-constrained optimization is prevalent in machine learning, particularly for high-dimensional problems, because it is a fundamental approach to achieve sparse learning. In addition to improving the memory, computational and environmental footprint of the models, these sparsity constraints help reduce overfitting and obtain consistent statistical estimation [46, 5, 33, 29]. We formulate the problem as follows:
$$\min_{x\in\mathbb{R}^d} \left\{ f(x) := \mathbb{E}_\xi f(x,\xi) \right\} \quad \text{s.t. } \|x\|_0 \le k \qquad (1)$$
where $f(\cdot,\xi): \mathbb{R}^d \to \mathbb{R}$ is a differentiable function and $\xi$ is a noise term, for instance related to an underlying finite-sum structure in $f$, of the form $\mathbb{E}_\xi f(x,\xi) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$.
The hard-thresholding gradient algorithm [17, 31, 45] is a dominant technique to solve this problem. It generally consists in alternating between a gradient step and a hard-thresholding operation which only keeps the $k$ largest components (in absolute value) of the current iterate. The advantage of hard-thresholding over its convex relaxations [39, 41] is that it can often attain similar precision, but is more computationally efficient, since it can directly ensure a desired sparsity level instead of tuning an $\ell_1$ penalty or constraint. The only expensive computation in hard-thresholding is the hard-thresholding step itself, which requires finding the top $k$ elements of the current iterate. Hard-thresholding was originally developed in its full-gradient form [17], but has later been extended to the stochastic setting by
Table 1: Complexity of sparsity-enforcing algorithms. We give the query complexity for a precision $\varepsilon$, up to the system error (see Section 4). For first-order (FO) algorithms, we give it in terms of the number of first-order oracle calls (#IFO), that is, calls to $\nabla f(x,\xi)$, and for ZO algorithms, in terms of calls to $f(\cdot,\xi)$ (#IZO). Here $\kappa$ denotes the condition number $\frac{L}{\nu}$, where $L$ is the smoothness (or RSS) constant and $\nu$ is the strong-convexity (or RSC) constant.

| Type | Name | Assumptions | #IZO/#IFO | #HT ops. |
|---|---|---|---|---|
| FO / $\ell_0$ | StoIHT [31] | RSS, RSC | $O(\kappa\log(\frac{1}{\varepsilon}))$ | $O(\kappa\log(\frac{1}{\varepsilon}))$ |
| ZO / $\ell_1$ | RSPGF [14] | smooth$^3$ | $O(\frac{d}{\varepsilon^2})$ | — |
| ZO / $\ell_1$ | ZSCG$^2$ [2] | convex, smooth | $O(\frac{d}{\varepsilon^2})$ | — |
| ZO / $\ell_1$ | ZORO [7] | $s$-sparse gradient, weakly sparse Hessian, smooth$^3$, RSC$_{\text{bis}}$$^1$ | $O(s\log(d)\log(\frac{1}{\varepsilon}))$ | — |
| ZO / $\ell_0$ | SZOHT | RSS, RSC | $O((k+\frac{d}{s_2})\kappa^2\log(\frac{1}{\varepsilon}))$ | $O(\kappa^2\log(\frac{1}{\varepsilon}))$ |
| ZO / $\ell_0$ | SZOHT | smooth, RSC | $O(k\kappa^2\log(\frac{1}{\varepsilon}))$ | $O(\kappa^2\log(\frac{1}{\varepsilon}))$ |

$^1$ The definition of Restricted Strong Convexity from [7] is different from ours and from that of [31], hence the "bis" subscript.
$^2$ We refer to the modified version of ZSCG (Algorithm 3 in [2]).
$^3$ RSPGF and ZORO minimize $f(x) + \lambda\|x\|_1$: only $f$ needs to be smooth.
Nguyen et al. [31], who developed a stochastic gradient descent (SGD) version of hard thresholding (StoIHT), and furthermore by Zhou et al. [47], Shen and Li [36], and Li et al. [20], who used variance reduction techniques to improve upon StoIHT.
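For concreteness, the hard-thresholding step and a StoIHT-style update can be sketched in a few lines of NumPy. This is an illustrative sketch with a toy least-squares objective, not the implementation used in the papers cited above.

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of the top-k |x_i|
    out[idx] = x[idx]
    return out

# One StoIHT-style iteration on a toy objective f_i(x) = 0.5 * (a_i @ x - b_i)^2
rng = np.random.default_rng(0)
d, k, eta = 100, 5, 0.1
A, b = rng.standard_normal((1000, d)), rng.standard_normal(1000)
x = np.zeros(d)
i = rng.integers(len(b))                        # sample one stochastic component
grad_i = (A[i] @ x - b[i]) * A[i]               # its first-order stochastic gradient
x = hard_threshold(x - eta * grad_i, k)         # gradient step, then hard thresholding
```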
However, the first-order gradients used in the above methods may be either unavailable or expensive to calculate in many real-world problems. For example, in certain graphical modeling tasks [42], obtaining the gradient of the objective function is computationally hard. Even worse, in some settings the gradient is inaccessible by nature, for instance in bandit problems [35], black-box adversarial attacks [40, 9, 10], or reinforcement learning [34, 27, 11]. To tackle those problems, ZO optimization methods have been developed [30]. Those methods usually replace the inaccessible gradient by a finite-difference approximation which can be computed from function evaluations only, following the idea that for a differentiable function $f:\mathbb{R}\to\mathbb{R}$ we have $f'(x) = \lim_{h\to 0}\frac{f(x+h)-f(x)}{h}$. Later on, ZO methods have been adapted to deal with a convex constraint set, and can therefore be used to solve the $\ell_1$ convex relaxation of problem (1). To that end, Ghadimi et al. [14] and Cai et al. [7] introduce proximal ZO algorithms, Liu et al. [24] introduce a ZO projected gradient algorithm, and Balasubramanian and Ghadimi [2] introduce a ZO conditional gradient [19] algorithm. We provide a review of those results in Table 1. As can be seen from the table, their query complexity is high (linear in $d$), except for [7], which has a complexity of $O(s\log(d)\log(\frac{1}{\varepsilon}))$ but assumes that gradients are sparse. In addition, those methods must introduce a hyperparameter $\lambda$ (the strength of the $\ell_1$ penalty) or $R$ (the radius of the $\ell_1$ ball), which needs to be tuned to find the value that ensures the right sparsity level. Therefore, it would be interesting to use the hard-thresholding techniques described in the previous paragraph instead of those convex relaxations.
Unfortunately, ZO hard-thresholding gradient algorithms have not been explored formally. Even more, whether ZO gradients can work with the hard-thresholding operator is still an open problem. Although one related algorithm was proposed recently by Balasubramanian and Ghadimi [2], it does not target the $\ell_0$-constrained optimization problem and, importantly, relies on strong assumptions in its convergence analysis. Indeed, they assume that the gradients, as well as the solution of the unconstrained problem, are $s$-sparse: $\|\nabla f(x)\|_0 \le s$ and $\|x^*\|_0 \le s^* \le s$, where $x^* = \arg\min_x f(x)$. In addition, it was recently shown by Cai et al. [7] that they must in fact assume that the support of the gradient is fixed for all $x \in \mathcal{X}$ for their convergence result to hold, which is a hard limitation, since it amounts to saying that the function $f$ depends on $s$ fixed variables.
To fill this gap, in this paper we focus on $\ell_0$-constrained black-box stochastic optimization problems, and propose a novel stochastic zeroth-order gradient hard-thresholding (SZOHT) algorithm. Specifically, we propose a dimension-friendly ZO gradient estimator powered by a novel random support sampling technique, and then embed it into the standard hard-thresholding operator.
We then provide the convergence and complexity analysis of SZOHT under the standard assumptions of sparse learning, namely restricted strong smoothness (RSS) and restricted strong convexity (RSC) [31, 36], to retain generality, therefore providing a positive answer to the question of whether ZO gradients can work with the hard-thresholding operator. Crucial to our analysis are carefully tuned requirements on the parameters $q$ (the number of random directions used to estimate the gradient, further defined in Section 3.1) and $k$. Finally, we illustrate the utility of our method on a portfolio optimization problem as well as black-box adversarial attacks, by showing that it can achieve competitive performance in comparison to the state-of-the-art sparsity-enforcing zeroth-order algorithms described in Table 1, such as [14, 2, 7].
Importantly, we also show that in the smooth case, the query complexity of SZOHT is independent of the dimensionality, which is significantly different from the dimensionality-dependent results for most existing ZO algorithms. Indeed, it is known from Jamieson et al. [18] that the worst-case query complexity of ZO optimization over the class $\mathcal{F}_{\nu,L}$ of $\nu$-strongly convex and $L$-smooth functions defined over a convex set $\mathcal{X}$ is linear in $d$. Our work is thus in line with other works achieving dimension-insensitive query complexity in zeroth-order optimization, such as [15, 37, 44, 7, 6, 2, 22, 18], but contrary to those, instead of making further assumptions (i.e. restricting the class $\mathcal{F}_{\nu,L}$ to a smaller class), we bypass the impossibility result by replacing the convex feasible set $\mathcal{X}$ by a non-convex set (the $\ell_0$ ball), which is how we can avoid making stringent assumptions on the class of functions $f$.
Contributions. We summarize the main contributions of our paper as follows:
1. We propose a new algorithm, SZOHT, which is, to our knowledge, the first zeroth-order sparsity-constrained algorithm that is dimension-independent under the smoothness assumption, without assuming any gradient sparsity.
2. We reveal an interesting conflict between the error of zeroth-order estimates and the hard-thresholding operator, which results in a minimal value for the number of random directions $q$ that must be ensured at each iteration.
3. We also provide the convergence analysis of our algorithm in the more general RSS setting, providing, to our knowledge, the first zeroth-order algorithm that can work with the usual RSS/RSC assumptions from the hard-thresholding literature.
2 Preliminaries
Throughout this paper, we denote by $\|x\|$ the Euclidean norm of a vector $x\in\mathbb{R}^d$, by $\|x\|_\infty$ the maximum absolute component of that vector, and by $\|x\|_0$ the $\ell_0$ norm (which is not a proper norm). For simplicity, we denote $f_\xi(\cdot) := f(\cdot,\xi)$. We call $u_F$ (resp. $\nabla_F f(x)$) the vector which sets all coordinates $i \notin F$ of $u$ (resp. $\nabla f(x)$) to $0$. We also denote by $x^*$ the solution of problem (1) defined in the introduction, for some target sparsity $k^*$ which could be smaller than $k$. To derive our results, we will need the following assumptions on $f$.
Assumption 1 ($(\nu_s, s)$-RSC, [17, 28, 26, 45, 20, 36, 31]). $f$ is said to be $\nu_s$ restricted strongly convex with sparsity parameter $s$ if it is differentiable, and there exists a generic constant $\nu_s$ such that for all $x, y \in \mathbb{R}^d$ with $\|x - y\|_0 \le s$:
$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\nu_s}{2}\|x - y\|^2$$
Assumption 2 ($(L_s, s)$-RSS, [36, 31]). For almost any $\xi$, $f_\xi$ is said to be $L_s$ restricted smooth with sparsity level $s$ if it is differentiable, and there exists a generic constant $L_s$ such that for all $x, y \in \mathbb{R}^d$ with $\|x - y\|_0 \le s$:
$$\|\nabla f_\xi(x) - \nabla f_\xi(y)\| \le L_s \|x - y\|$$
Assumption 3 ($\sigma^2$-FGN, [16], Assumption 2.3 (Finite Gradient Noise)). $f$ is said to have $\sigma$-finite gradient noise if for almost any $\xi$, $f_\xi$ is differentiable and the gradient noise $\sigma = \sigma(f, \xi)$ defined below is finite:
$$\sigma^2 = \mathbb{E}_\xi\left[\|\nabla f_\xi(x^*)\|_\infty^2\right]$$
Remark 1. Even though the original version of [16] uses the $\ell_2$ norm, we use the $\ell_\infty$ norm here, in order to give more insightful results in terms of $k$ and $d$, as is done classically in $\ell_0$ optimization, similarly to [47]. We also note that in [16], $x^*$ denotes an unconstrained minimum, whereas in our case it denotes the constrained minimum for some sparsity $k^*$.
For Corollary 2, we will also need the more usual smoothness assumption:
Assumption 4 ($L$-smooth). For almost any $\xi$, $f_\xi$ is said to be $L$-smooth if it is differentiable and for all $x, y \in \mathbb{R}^d$:
$$\|\nabla f_\xi(x) - \nabla f_\xi(y)\| \le L\|x - y\|$$
3 Algorithm
3.1 Random support Zeroth-Order estimate
In this section, we describe our zeroth-order gradient estimator. It is basically composed of a random support sampling step, followed by a random direction with uniform smoothing on the sphere supported by this support. We also use the technique of averaging our estimator over $q$ random directions, as described in [25]. More formally, our gradient estimator is defined as:
$$\hat\nabla f_\xi(x) = \frac{d}{q\mu}\sum_{i=1}^{q}\left(f_\xi(x+\mu u_i) - f_\xi(x)\right)u_i \qquad (2)$$
where each random direction $u_i$ is a unit vector sampled uniformly from the set $\mathcal{S}^d_{s_2} := \{u \in \mathbb{R}^d : \|u\|_0 \le s_2, \|u\| = 1\}$. We can obtain such vectors $u$ by first sampling a random support $S$ (i.e. a set of coordinates) of size $s_2$ from $[d]$ (denoted $S \sim \mathcal{U}([d]_{s_2})$ in Algorithm 1), and then sampling a random unit vector $u$ supported on that support $S$, that is, uniformly sampled from the set $\mathcal{S}^d_S := \{u \in \mathbb{R}^d : u_{[d]\setminus S} = 0, \|u\| = 1\}$ (denoted $u \sim \mathcal{U}(\mathcal{S}^d_S)$ in Algorithm 1). The original uniform smoothing technique on the sphere is described in more detail in [12]. However, in our case, the sphere along which we sample is restricted to a random support of size $s_2$. Through the setting of the variable $s_2$, our general estimator can take several forms, which are similar to pre-existing gradient estimators from the literature, described below:
- If $s_2 = d$, $\hat\nabla f_\xi(x)$ is the usual vanilla estimator with uniform smoothing on the sphere [12].
- If $1 \le s_2 \le d$, our estimator is similar to the random block-coordinate gradient estimator from Lian et al. [21], except that the blocks are not fixed at initialization but chosen randomly, and that we use uniform smoothing with forward differences on the given block instead of a coordinate-wise estimation with central differences. This random support technique allows us to give a convergence analysis under the classical assumptions of the hard-thresholding literature (see Remark 3), and to deal with huge-scale optimization, when sampling uniformly from a unit $d$-sphere is costly [7, 6]: in the distributed setting for instance, each worker would just need to sample an $s_2$-sparse random vector, and only the centralized server would materialize the full gradient approximation containing up to $qs_2$ non-zero entries.
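A minimal NumPy sketch of the estimator (2) could look as follows. The helper name and the toy sanity check are ours, not the authors' code; the scheme is the random support sampling followed by uniform smoothing on that support described above.

```python
import numpy as np

def zo_grad_random_support(f, x, q, s2, mu=1e-3, rng=None):
    """Estimator of eq. (2): average of q forward differences, each along a
    unit direction supported on a random set S of s2 coordinates."""
    rng = rng or np.random.default_rng(0)
    d = x.shape[0]
    g = np.zeros(d)
    fx = f(x)
    for _ in range(q):
        S = rng.choice(d, size=s2, replace=False)   # random support S ~ U([d]_{s2})
        u = np.zeros(d)
        u[S] = rng.standard_normal(s2)
        u /= np.linalg.norm(u)                      # unit vector supported on S
        g += (f(x + mu * u) - fx) * u
    return d / (q * mu) * g

# Sanity check on a smooth toy function: f(x) = 0.5 * ||x||^2, grad f(x0) = x0
f = lambda x: 0.5 * np.dot(x, x)
x0 = np.arange(1.0, 6.0)
print(zo_grad_random_support(f, x0, q=2000, s2=2))  # roughly [1, 2, 3, 4, 5]
```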
Error Bounds of the Zeroth-Order Estimator. We now derive error bounds on the gradient estimator that will be useful in the convergence rate proof, with the particularity that we consider only the restriction of the error to some support $F$ (that is, we consider a subset of components of the gradient/estimator). Indeed, proofs in the hard-thresholding literature (see for instance [45]) are usually written only on that support. This is the key idea that explains how the dimensionality dependence is reduced in SZOHT compared to vanilla ZO optimization. We give more insight on the shape of the original distribution of gradient estimators, and the distribution of their projection onto a hyperplane $F$, in Figure 5 in Appendix E. We can observe that even if the original gradient estimator is poor, in the projected space the estimation error is reduced, which we quantify in the proposition below.
Proposition 1 (Proof in Appendix C.3). Let us consider any support $F \subseteq [d]$ of size $s$ ($|F| = s$). For the ZO gradient estimator in (2), with $q$ random directions and random supports of size $s_2$, and assuming that each $f_\xi$ is $(L_{s_2}, s_2)$-RSS, we have, with $\hat\nabla_F f_\xi(x)$ denoting the hard thresholding of the gradient estimate $\hat\nabla f_\xi(x)$ on $F$ (that is, we set all coordinates not in $F$ to 0):
(a) $\left\|\mathbb{E}\,\hat\nabla_F f_\xi(x) - \nabla_F f_\xi(x)\right\|^2 \le \varepsilon_\mu \mu^2$
(b) $\mathbb{E}\left\|\hat\nabla_F f_\xi(x)\right\|^2 \le \varepsilon_F \left\|\nabla_F f_\xi(x)\right\|^2 + \varepsilon_{F^c}\left\|\nabla_{F^c} f_\xi(x)\right\|^2 + \varepsilon_{\mathrm{abs}}\,\mu^2$
(c) $\mathbb{E}\left\|\hat\nabla_F f_\xi(x) - \nabla_F f_\xi(x)\right\|^2 \le 2(\varepsilon_F + 1)\left\|\nabla_F f_\xi(x)\right\|^2 + 2\varepsilon_{F^c}\left\|\nabla_{F^c} f_\xi(x)\right\|^2 + 2\varepsilon_{\mathrm{abs}}\,\mu^2$
with
$$\varepsilon_\mu = L_{s_2}^2 s d, \qquad \varepsilon_F = \frac{2d}{q(s_2+2)}\left(\frac{(s-1)(s_2-1)}{d-1} + 3\right) + 2,$$
$$\varepsilon_{F^c} = \frac{2d}{q(s_2+2)}\cdot\frac{s(s_2-1)}{d-1} \quad\text{and}\quad \varepsilon_{\mathrm{abs}} = \frac{2dL_{s_2}^2 s s_2}{q}\left(\frac{(s-1)(s_2-1)}{d-1} + 1\right) + L_{s_2}^2 s d \qquad (3)$$
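To get a feel for how $q$ controls these constants, the small script below evaluates $\varepsilon_F$ and $\varepsilon_{F^c}$ for a few values of $q$. The expressions are our transcription of (3) above, so treat the exact constants as indicative: as $q$ grows, $\varepsilon_F$ decays towards its constant floor, which is the quantitative counterpart of the minimal number of random directions $q$ discussed in Section 4.

```python
# Error factors of Proposition 1, transcribed from (3); illustrative only.
def eps_F(d, q, s, s2):
    return 2 * d / (q * (s2 + 2)) * ((s - 1) * (s2 - 1) / (d - 1) + 3) + 2

def eps_Fc(d, q, s, s2):
    return 2 * d / (q * (s2 + 2)) * s * (s2 - 1) / (d - 1)

d, s, s2 = 10_000, 60, 50
for q in (1, 10, 100, 1000):
    print(f"q={q:5d}  eps_F={eps_F(d, q, s, s2):9.2f}  eps_Fc={eps_Fc(d, q, s, s2):9.2f}")
```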
3.2 SZOHT Algorithm
We now present our full algorithm to optimize problem (1), which we name SZOHT (Stochastic Zeroth-Order Hard-Thresholding). Each iteration of our algorithm is composed of two steps: (i) the gradient estimation step, and (ii) the hard-thresholding step. The gradient estimation step is the one described in the section above, and the hard-thresholding step is described in more detail in the following paragraph. We give the full formal description of our algorithm in Algorithm 1.
In the hard-thresholding step, we only keep the $k$ largest (in magnitude) components of the current iterate $x_t$. This ensures that all our iterates (including the last one) are $k$-sparse. This hard-thresholding operator has been studied for instance in [36], and possesses several interesting properties. First, it can be seen as a projection onto the $\ell_0$ ball. Second, and importantly, it is not non-expansive, contrary to other operators like the soft-thresholding operator [36]. That expansivity plays an important role in the analysis of our algorithm, as we will see later.
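The expansivity is easy to observe numerically; in the toy check below (our own illustration, reusing a top-$k$ truncation as sketched in the introduction), two nearby points are pushed further apart by the operator:

```python
import numpy as np

def hard_threshold(x, k):                       # top-k (in magnitude) truncation
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

x, y, k = np.array([1.0, 1.01]), np.array([1.0, 0.0]), 1
print(np.linalg.norm(x - y))                                        # 1.01
print(np.linalg.norm(hard_threshold(x, k) - hard_threshold(y, k)))  # ~1.42 > 1.01
```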
Compared to previous works, our algorithm can be seen as a variant of Stochastic Hard Thresholding (StoIHT from [31]), where we replace the true gradient of $f_\xi$ by the estimator $\hat\nabla f_\xi(x)$. It is also very close to Algorithm 5 from Balasubramanian and Ghadimi [2] (Truncated-ZSGD), with just a different zeroth-order gradient estimator: we use a uniform-smoothing, random-block estimator instead of their Gaussian-smoothing, full-support vanilla estimator. This allows us to deal with very large dimensionalities, on the order of millions, similarly to Cai et al. [6]. Furthermore, as described in the introduction, and contrary to Balasubramanian and Ghadimi [2], we provide the analysis of our algorithm without any gradient sparsity assumption.
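Putting the pieces together, a minimal sketch of the SZOHT iteration could look like the following. This is our own illustrative code on a toy sparse least-squares instance of (1); the parameter values are ad hoc and the helpers condense the estimator and operator sketched earlier, so it should be read as a sketch rather than the authors' implementation.

```python
import numpy as np

def hard_threshold(x, k):
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def zo_grad(f, x, q, s2, mu, rng):
    """Random-support forward-difference estimator, as in eq. (2)."""
    d = x.shape[0]
    g = np.zeros(d)
    fx = f(x)
    for _ in range(q):
        S = rng.choice(d, size=s2, replace=False)
        u = np.zeros(d)
        u[S] = rng.standard_normal(s2)
        u /= np.linalg.norm(u)
        g += (f(x + mu * u) - fx) * u
    return d / (q * mu) * g

def szoht(sample_f_xi, d, k, q, s2, eta, mu, T, seed=0):
    """Sketch of the SZOHT iteration: ZO gradient step on a sampled f(., xi),
    followed by hard thresholding back onto the l0 ball of radius k."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for _ in range(T):
        f_xi = sample_f_xi(rng)
        x = hard_threshold(x - eta * zo_grad(f_xi, x, q, s2, mu, rng), k)
    return x

# Toy instance of (1): recover a k-sparse x_star from noisy linear measurements,
# accessing the minibatch losses only through function evaluations.
rng = np.random.default_rng(1)
d, k, n = 100, 5, 500
x_star = np.zeros(d)
x_star[:k] = rng.standard_normal(k)
A = rng.standard_normal((n, d))
b = A @ x_star + 0.01 * rng.standard_normal(n)

def sample_f_xi(rng, batch=64):
    idx = rng.integers(n, size=batch)
    return lambda x: 0.5 * np.mean((A[idx] @ x - b[idx]) ** 2)

x_hat = szoht(sample_f_xi, d, k, q=100, s2=50, eta=0.05, mu=1e-4, T=500)
print(np.linalg.norm(x_hat - x_star))   # typically a small fraction of ||x_star||
```

On such a well-conditioned toy instance, moderate values of $q$ already suffice; as the next section shows, $q$ cannot be taken arbitrarily small in general.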
The key challenge arising in our analysis is described in Figure 1: the hard-thresholding operator being expansive [36], each approximate gradient step must approach the solution enough to stay close to it even after hard-thresholding. Therefore, it is a priori unclear whether the zeroth-order estimate can be accurate enough to guarantee the convergence of SZOHT. Fortunately, as we will see in the next section, we can indeed ensure convergence, as long as we carefully choose the value of $q$.
Figure 1: Conflict between the hard-thresholding operator and the zeroth-order estimate.
4 Convergence analysis
In this section, we provide the convergence analysis of SZOHT using the assumptions from Section 2, and discuss an interesting property of the combination of the zeroth-order gradient estimate and the hard-thresholding operator, providing a positive answer to the question from the previous section.
Theorem 1 (Proof in Appendix D.1). Assume that each $f_\xi$ is $(L_{s'}, s')$-RSS with $s' := \max(s_2, s)$, and that $f$ is $(\nu_s, s)$-RSC and $\sigma$-FGN, with $s = 2k + k^*$ and $d \ge k \ge 2k^*\rho^2/(1-\rho^2)^2$, with $\rho$ defined as below. Suppose that we run SZOHT with random supports of size $s_2$, $q$ random directions, a learning rate of $\eta = \frac{\nu_s}{(4\varepsilon_F+1)L_{s'}^2}$, and $k$ coordinates kept at each iteration. Then we have a geometric convergence rate of the following form, with $x^{(t)}$ denoting the $t$-th iterate of SZOHT:
$$\mathbb{E}\left\|x^{(t)} - x^*\right\| \le (\gamma\rho)^t\left\|x^{(0)} - x^*\right\| + \frac{\gamma a}{1-\gamma\rho}\,\sigma + \frac{\gamma b}{1-\gamma\rho}\,\mu$$