Zeroth-Order Hard-Thresholding:
Gradient Error vs. Expansivity
William de Vazelhes1, Hualin Zhang2, Huimin Wu2, Xiao-Tong Yuan2, Bin Gu1
1Mohamed bin Zayed University of Artificial Intelligence
2Nanjing University of Information Science & Technology
{wdevazelhes,zhanghualin98,xtyuan1980,jsgubin}@gmail.com,
wuhuimin@nuist.edu.cn
Abstract
$\ell_0$-constrained optimization is prevalent in machine learning, particularly for high-dimensional problems, because it is a fundamental approach to achieve sparse learning. Hard-thresholding gradient descent is a dominant technique to solve this problem. However, first-order gradients of the objective function may be either unavailable or expensive to calculate in many real-world problems, where zeroth-order (ZO) gradients could be a good surrogate. Unfortunately, whether ZO gradients can work with the hard-thresholding operator is still an unsolved problem. To solve this puzzle, in this paper we focus on $\ell_0$-constrained black-box stochastic optimization problems, and propose a new stochastic zeroth-order gradient hard-thresholding (SZOHT) algorithm with a general ZO gradient estimator powered by a novel random support sampling. We provide the convergence analysis of SZOHT under standard assumptions. Importantly, we reveal a conflict between the deviation of ZO estimators and the expansivity of the hard-thresholding operator, and provide a theoretical minimal value for the number of random directions in ZO gradients. In addition, we find that the query complexity of SZOHT is independent of or weakly dependent on the dimensionality under different settings. Finally, we illustrate the utility of our method on a portfolio optimization problem as well as black-box adversarial attacks.
1 Introduction
$\ell_0$-constrained optimization is prevalent in machine learning, particularly for high-dimensional problems, because it is a fundamental approach to achieve sparse learning. In addition to improving the memory, computational and environmental footprint of the models, these sparsity constraints help reduce overfitting and obtain consistent statistical estimation [46, 5, 33, 29]. We formulate the problem as follows:
$$\min_{x\in\mathbb{R}^d} \left\{ f(x) := \mathbb{E}_\xi f(x,\xi) \right\} \quad \text{s.t. } \|x\|_0 \le k \qquad (1)$$
where $f(\cdot,\xi): \mathbb{R}^d \to \mathbb{R}$ is a differentiable function and $\xi$ is a noise term, for instance related to an underlying finite-sum structure in $f$, of the form $\mathbb{E}_\xi f(x,\xi) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$.
The hard-thresholding gradient algorithm [17, 31, 45] is a dominant technique to solve this problem. It generally consists in alternating between a gradient step and a hard-thresholding operation which only keeps the $k$ largest components (in absolute value) of the current iterate. The advantage of hard-thresholding over its convex relaxations [39, 41] is that it can often attain similar precision, but is more computationally efficient, since it can directly ensure a desired sparsity level instead of tuning an $\ell_1$ penalty or constraint. The only expensive computation in hard-thresholding is the hard-thresholding step itself, which requires finding the top $k$ elements of the current iterate. Hard-thresholding was originally developed in its full-gradient form [17], but has later been extended to the stochastic setting by
Table 1: Complexity of sparsity-enforcing algorithms. We give the query complexity for a precision $\varepsilon$, up to the system error (see Section 4). For first-order (FO) algorithms, we give it in terms of the number of first-order oracle calls (#IFO), that is, calls to $\nabla f(x,\xi)$, and for ZO algorithms, in terms of calls to $f(\cdot,\xi)$ (#IZO). Here $\kappa$ denotes the condition number $\frac{L}{\nu}$, where $L$ is the smoothness (or RSS) constant and $\nu$ is the strong-convexity (or RSC) constant.

| Type | Name | Assumptions | #IZO/#IFO | #HT ops. |
|---|---|---|---|---|
| FO / $\ell_0$ | StoIHT [31] | RSS, RSC | $O(\kappa\log(\frac{1}{\varepsilon}))$ | $O(\kappa\log(\frac{1}{\varepsilon}))$ |
| ZO / $\ell_1$ | RSPGF [14] | smooth$^3$ | $O(\frac{d}{\varepsilon^2})$ | — |
| ZO / $\ell_1$ | ZSCG$^2$ [2] | convex, smooth | $O(\frac{d}{\varepsilon^2})$ | — |
| ZO / $\ell_1$ | ZORO [7] | $s$-sparse gradient, weakly sparse Hessian, smooth$^3$, RSC$_{\text{bis}}$$^1$ | $O(s\log(d)\log(\frac{1}{\varepsilon}))$ | — |
| ZO / $\ell_0$ | SZOHT | RSS, RSC | $O((k+\frac{d}{s_2})\kappa^2\log(\frac{1}{\varepsilon}))$ | $O(\kappa^2\log(\frac{1}{\varepsilon}))$ |
| ZO / $\ell_0$ | SZOHT | smooth, RSC | $O(k\kappa^2\log(\frac{1}{\varepsilon}))$ | $O(\kappa^2\log(\frac{1}{\varepsilon}))$ |

$^1$ The definition of Restricted Strong Convexity from [7] is different from ours and from that of [31], hence the "bis" subscript.
$^2$ We refer to the modified version of ZSCG (Algorithm 3 in [2]).
$^3$ RSPGF and ZORO minimize $f(x) + \lambda\|x\|_1$: only $f$ needs to be smooth.
Nguyen et al. [31], who developed a stochastic gradient descent (SGD) version of hard thresholding (StoIHT), and furthermore by Zhou et al. [47], Shen and Li [36], and Li et al. [20], who used variance reduction techniques to improve upon StoIHT.
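For concreteness, the hard-thresholding step and a StoIHT-style update can be sketched in a few lines of NumPy. This is an illustrative sketch with a toy least-squares objective, not the implementation used in the papers cited above.

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of the top-k |x_i|
    out[idx] = x[idx]
    return out

# One StoIHT-style iteration on a toy objective f_i(x) = 0.5 * (a_i @ x - b_i)^2
rng = np.random.default_rng(0)
d, k, eta = 100, 5, 0.1
A, b = rng.standard_normal((1000, d)), rng.standard_normal(1000)
x = np.zeros(d)
i = rng.integers(len(b))                        # sample one stochastic component
grad_i = (A[i] @ x - b[i]) * A[i]               # its first-order stochastic gradient
x = hard_threshold(x - eta * grad_i, k)         # gradient step, then hard thresholding
```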
However, the first-order gradients used in the above methods may be either unavailable or expensive to calculate in many real-world problems. For example, in certain graphical modeling tasks [42], obtaining the gradient of the objective function is computationally hard. Even worse, in some settings the gradient is inaccessible by nature, for instance in bandit problems [35], black-box adversarial attacks [40, 9, 10], or reinforcement learning [34, 27, 11]. To tackle those problems, ZO optimization methods have been developed [30]. Those methods usually replace the inaccessible gradient by a finite-difference approximation which can be computed from function evaluations only, following the idea that for a differentiable function $f:\mathbb{R}\to\mathbb{R}$ we have $f'(x) = \lim_{h\to 0}\frac{f(x+h)-f(x)}{h}$. Later on, ZO methods have been adapted to deal with a convex constraint set, and can therefore be used to solve the $\ell_1$ convex relaxation of problem (1). To that end, Ghadimi et al. [14] and Cai et al. [7] introduce proximal ZO algorithms, Liu et al. [24] introduce a ZO projected gradient algorithm, and Balasubramanian and Ghadimi [2] introduce a ZO conditional gradient [19] algorithm. We provide a review of those results in Table 1. As can be seen from the table, their query complexity is high (linear in $d$), except for [7], which has a complexity of $O(s\log(d)\log(\frac{1}{\varepsilon}))$ but assumes that gradients are sparse. In addition, those methods must introduce a hyperparameter $\lambda$ (the strength of the $\ell_1$ penalty) or $R$ (the radius of the $\ell_1$ ball), which needs to be tuned to find the value that ensures the right sparsity level. Therefore, it would be interesting to use the hard-thresholding techniques described in the previous paragraph instead of those convex relaxations.
Unfortunately, ZO hard-thresholding gradient algorithms have not been explored formally. Even more, whether ZO gradients can work with the hard-thresholding operator is still an open problem. Although one related algorithm was proposed recently by Balasubramanian and Ghadimi [2], it does not target the $\ell_0$-constrained optimization problem and, importantly, relies on strong assumptions in its convergence analysis. Indeed, they assume that the gradients, as well as the solution of the unconstrained problem, are $s$-sparse: $\|\nabla f(x)\|_0 \le s$ and $\|x^*\|_0 \le s^* \le s$, where $x^* = \arg\min_x f(x)$. In addition, it was recently shown by Cai et al. [7] that they must in fact assume that the support of the gradient is fixed for all $x \in \mathcal{X}$ for their convergence result to hold, which is a hard limitation, since it amounts to saying that the function $f$ depends on $s$ fixed variables.
To fill this gap, in this paper we focus on $\ell_0$-constrained black-box stochastic optimization problems, and propose a novel stochastic zeroth-order gradient hard-thresholding (SZOHT) algorithm. Specifically, we propose a dimension-friendly ZO gradient estimator powered by a novel random support sampling technique, and then embed it into the standard hard-thresholding operator.
We then provide the convergence and complexity analysis of SZOHT under the standard assumptions of sparse learning, namely restricted strong smoothness (RSS) and restricted strong convexity (RSC) [31, 36], to retain generality, therefore providing a positive answer to the question of whether ZO gradients can work with the hard-thresholding operator. Crucial to our analysis are carefully tuned requirements on the parameters $q$ (the number of random directions used to estimate the gradient, further defined in Section 3.1) and $k$. Finally, we illustrate the utility of our method on a portfolio optimization problem as well as black-box adversarial attacks, by showing that it can achieve competitive performance in comparison to the state-of-the-art sparsity-enforcing zeroth-order algorithms described in Table 1, such as [14, 2, 7].
Importantly, we also show that in the smooth case, the query complexity of SZOHT is independent of the dimensionality, which is significantly different from the dimensionality-dependent results for most existing ZO algorithms. Indeed, it is known from Jamieson et al. [18] that the worst-case query complexity of ZO optimization over the class $\mathcal{F}_{\nu,L}$ of $\nu$-strongly convex and $L$-smooth functions defined over a convex set $\mathcal{X}$ is linear in $d$. Our work is thus in line with other works achieving dimension-insensitive query complexity in zeroth-order optimization, such as [15, 37, 44, 7, 6, 2, 22, 18], but contrary to those, instead of making further assumptions (i.e. restricting the class $\mathcal{F}_{\nu,L}$ to a smaller class), we bypass the impossibility result by replacing the convex feasible set $\mathcal{X}$ by a non-convex set (the $\ell_0$ ball), which is how we can avoid making stringent assumptions on the class of functions $f$.
Contributions. We summarize the main contributions of our paper as follows:
1. We propose a new algorithm, SZOHT, which is, to our knowledge, the first zeroth-order sparsity-constrained algorithm that is dimension-independent under the smoothness assumption, without assuming any gradient sparsity.
2. We reveal an interesting conflict between the error of zeroth-order estimates and the hard-thresholding operator, which results in a minimal value for the number of random directions $q$ that must be ensured at each iteration.
3. We also provide the convergence analysis of our algorithm in the more general RSS setting, providing, to our knowledge, the first zeroth-order algorithm that can work with the usual RSS/RSC assumptions from the hard-thresholding literature.
2 Preliminaries
Throughout this paper, we denote by $\|x\|$ the Euclidean norm of a vector $x\in\mathbb{R}^d$, by $\|x\|_\infty$ the maximum absolute component of that vector, and by $\|x\|_0$ the $\ell_0$ norm (which is not a proper norm). For simplicity, we denote $f_\xi(\cdot) := f(\cdot,\xi)$. We call $u_F$ (resp. $\nabla_F f(x)$) the vector which sets all coordinates $i \notin F$ of $u$ (resp. $\nabla f(x)$) to $0$. We also denote by $x^*$ the solution of problem (1) defined in the introduction, for some target sparsity $k^*$ which could be smaller than $k$. To derive our results, we will need the following assumptions on $f$.
Assumption 1 ($(\nu_s, s)$-RSC, [17, 28, 26, 45, 20, 36, 31]). $f$ is said to be $\nu_s$ restricted strongly convex with sparsity parameter $s$ if it is differentiable, and there exists a generic constant $\nu_s$ such that for all $x, y \in \mathbb{R}^d$ with $\|x - y\|_0 \le s$:
$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\nu_s}{2}\|x - y\|^2$$
Assumption 2 ($(L_s, s)$-RSS, [36, 31]). For almost any $\xi$, $f_\xi$ is said to be $L_s$ restricted smooth with sparsity level $s$ if it is differentiable, and there exists a generic constant $L_s$ such that for all $x, y \in \mathbb{R}^d$ with $\|x - y\|_0 \le s$:
$$\|\nabla f_\xi(x) - \nabla f_\xi(y)\| \le L_s \|x - y\|$$
Assumption 3 ($\sigma^2$-FGN, [16], Assumption 2.3 (Finite Gradient Noise)). $f$ is said to have $\sigma$-finite gradient noise if for almost any $\xi$, $f_\xi$ is differentiable and the gradient noise $\sigma = \sigma(f, \xi)$ defined below is finite:
$$\sigma^2 = \mathbb{E}_\xi\left[\|\nabla f_\xi(x^*)\|_\infty^2\right]$$
Remark 1. Even though the original version of [16] uses the $\ell_2$ norm, we use the $\ell_\infty$ norm here, in order to give more insightful results in terms of $k$ and $d$, as is done classically in $\ell_0$ optimization, similarly to [47]. We also note that in [16], $x^*$ denotes an unconstrained minimum, whereas in our case it denotes the constrained minimum for some sparsity $k^*$.
For Corollary 2, we will also need the more usual smoothness assumption:
Assumption 4 ($L$-smooth). For almost any $\xi$, $f_\xi$ is said to be $L$-smooth if it is differentiable and for all $x, y \in \mathbb{R}^d$:
$$\|\nabla f_\xi(x) - \nabla f_\xi(y)\| \le L\|x - y\|$$
3 Algorithm
3.1 Random support Zeroth-Order estimate
In this section, we describe our zeroth-order gradient estimator. It is basically composed of a random support sampling step, followed by a random direction with uniform smoothing on the sphere supported by this support. We also use the technique of averaging our estimator over $q$ random directions, as described in [25]. More formally, our gradient estimator is defined as:
$$\hat\nabla f_\xi(x) = \frac{d}{q\mu}\sum_{i=1}^{q}\left(f_\xi(x+\mu u_i) - f_\xi(x)\right)u_i \qquad (2)$$
where each random direction $u_i$ is a unit vector sampled uniformly from the set $\mathcal{S}^d_{s_2} := \{u \in \mathbb{R}^d : \|u\|_0 \le s_2, \|u\| = 1\}$. We can obtain such vectors $u$ by first sampling a random support $S$ (i.e. a set of coordinates) of size $s_2$ from $[d]$ (denoted $S \sim \mathcal{U}([d]_{s_2})$ in Algorithm 1), and then sampling a random unit vector $u$ supported on that support $S$, that is, uniformly sampled from the set $\mathcal{S}^d_S := \{u \in \mathbb{R}^d : u_{[d]\setminus S} = 0, \|u\| = 1\}$ (denoted $u \sim \mathcal{U}(\mathcal{S}^d_S)$ in Algorithm 1). The original uniform smoothing technique on the sphere is described in more detail in [12]. However, in our case, the sphere along which we sample is restricted to a random support of size $s_2$. Through the setting of the variable $s_2$, our general estimator can take several forms, which are similar to pre-existing gradient estimators from the literature, described below:
- If $s_2 = d$, $\hat\nabla f_\xi(x)$ is the usual vanilla estimator with uniform smoothing on the sphere [12].
- If $1 \le s_2 \le d$, our estimator is similar to the random block-coordinate gradient estimator from Lian et al. [21], except that the blocks are not fixed at initialization but chosen randomly, and that we use uniform smoothing with forward differences on the given block instead of a coordinate-wise estimation with central differences. This random support technique allows us to give a convergence analysis under the classical assumptions of the hard-thresholding literature (see Remark 3), and to deal with huge-scale optimization, when sampling uniformly from a unit $d$-sphere is costly [7, 6]: in the distributed setting for instance, each worker would just need to sample an $s_2$-sparse random vector, and only the centralized server would materialize the full gradient approximation containing up to $qs_2$ non-zero entries.
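A minimal NumPy sketch of the estimator (2) could look as follows. The helper name and the toy sanity check are ours, not the authors' code; the scheme is the random support sampling followed by uniform smoothing on that support described above.

```python
import numpy as np

def zo_grad_random_support(f, x, q, s2, mu=1e-3, rng=None):
    """Estimator of eq. (2): average of q forward differences, each along a
    unit direction supported on a random set S of s2 coordinates."""
    rng = rng or np.random.default_rng(0)
    d = x.shape[0]
    g = np.zeros(d)
    fx = f(x)
    for _ in range(q):
        S = rng.choice(d, size=s2, replace=False)   # random support S ~ U([d]_{s2})
        u = np.zeros(d)
        u[S] = rng.standard_normal(s2)
        u /= np.linalg.norm(u)                      # unit vector supported on S
        g += (f(x + mu * u) - fx) * u
    return d / (q * mu) * g

# Sanity check on a smooth toy function: f(x) = 0.5 * ||x||^2, grad f(x0) = x0
f = lambda x: 0.5 * np.dot(x, x)
x0 = np.arange(1.0, 6.0)
print(zo_grad_random_support(f, x0, q=2000, s2=2))  # roughly [1, 2, 3, 4, 5]
```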
Error Bounds of the Zeroth-Order Estimator. We now derive error bounds on the gradient estimator that will be useful in the convergence rate proof, with the particularity that we consider only the restriction of the error to some support $F$ (that is, we consider a subset of components of the gradient/estimator). Indeed, proofs in the hard-thresholding literature (see for instance [45]) are usually written only on that support. This is the key idea that explains how the dimensionality dependence is reduced in SZOHT compared to vanilla ZO optimization. We give more insight on the shape of the original distribution of gradient estimators, and the distribution of their projection onto a hyperplane $F$, in Figure 5 in Appendix E. We can observe that even if the original gradient estimator is poor, in the projected space the estimation error is reduced, which we quantify in the proposition below.
Proposition 1 (Proof in Appendix C.3). Let us consider any support $F \subseteq [d]$ of size $s$ ($|F| = s$). For the ZO gradient estimator in (2), with $q$ random directions and random supports of size $s_2$, and assuming that each $f_\xi$ is $(L_{s_2}, s_2)$-RSS, we have, with $\hat\nabla_F f_\xi(x)$ denoting the hard thresholding of the gradient estimate $\hat\nabla f_\xi(x)$ on $F$ (that is, we set all coordinates not in $F$ to 0):
(a) $\left\|\mathbb{E}\,\hat\nabla_F f_\xi(x) - \nabla_F f_\xi(x)\right\|^2 \le \varepsilon_\mu \mu^2$
(b) $\mathbb{E}\left\|\hat\nabla_F f_\xi(x)\right\|^2 \le \varepsilon_F \left\|\nabla_F f_\xi(x)\right\|^2 + \varepsilon_{F^c}\left\|\nabla_{F^c} f_\xi(x)\right\|^2 + \varepsilon_{\mathrm{abs}}\,\mu^2$
(c) $\mathbb{E}\left\|\hat\nabla_F f_\xi(x) - \nabla_F f_\xi(x)\right\|^2 \le 2(\varepsilon_F + 1)\left\|\nabla_F f_\xi(x)\right\|^2 + 2\varepsilon_{F^c}\left\|\nabla_{F^c} f_\xi(x)\right\|^2 + 2\varepsilon_{\mathrm{abs}}\,\mu^2$
with
$$\varepsilon_\mu = L_{s_2}^2 s d, \qquad \varepsilon_F = \frac{2d}{q(s_2+2)}\left(\frac{(s-1)(s_2-1)}{d-1} + 3\right) + 2,$$
$$\varepsilon_{F^c} = \frac{2d}{q(s_2+2)}\cdot\frac{s(s_2-1)}{d-1} \quad\text{and}\quad \varepsilon_{\mathrm{abs}} = \frac{2dL_{s_2}^2 s s_2}{q}\left(\frac{(s-1)(s_2-1)}{d-1} + 1\right) + L_{s_2}^2 s d \qquad (3)$$
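To get a feel for how $q$ controls these constants, the small script below evaluates $\varepsilon_F$ and $\varepsilon_{F^c}$ for a few values of $q$. The expressions are our transcription of (3) above, so treat the exact constants as indicative: as $q$ grows, $\varepsilon_F$ decays towards its constant floor, which is the quantitative counterpart of the minimal number of random directions $q$ discussed in Section 4.

```python
# Error factors of Proposition 1, transcribed from (3); illustrative only.
def eps_F(d, q, s, s2):
    return 2 * d / (q * (s2 + 2)) * ((s - 1) * (s2 - 1) / (d - 1) + 3) + 2

def eps_Fc(d, q, s, s2):
    return 2 * d / (q * (s2 + 2)) * s * (s2 - 1) / (d - 1)

d, s, s2 = 10_000, 60, 50
for q in (1, 10, 100, 1000):
    print(f"q={q:5d}  eps_F={eps_F(d, q, s, s2):9.2f}  eps_Fc={eps_Fc(d, q, s, s2):9.2f}")
```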
3.2 SZOHT Algorithm
We now present our full algorithm to optimize problem (1), which we name SZOHT (Stochastic Zeroth-Order Hard-Thresholding). Each iteration of our algorithm is composed of two steps: (i) the gradient estimation step, and (ii) the hard-thresholding step. The gradient estimation step is the one described in the section above, and the hard-thresholding step is described in more detail in the following paragraph. We give the full formal description of our algorithm in Algorithm 1.
In the hard-thresholding step, we only keep the $k$ largest (in magnitude) components of the current iterate $x_t$. This ensures that all our iterates (including the last one) are $k$-sparse. This hard-thresholding operator has been studied for instance in [36], and possesses several interesting properties. First, it can be seen as a projection onto the $\ell_0$ ball. Second, and importantly, it is not non-expansive, contrary to other operators like the soft-thresholding operator [36]. That expansivity plays an important role in the analysis of our algorithm, as we will see later.
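The expansivity is easy to observe numerically; in the toy check below (our own illustration, reusing a top-$k$ truncation as sketched in the introduction), two nearby points are pushed further apart by the operator:

```python
import numpy as np

def hard_threshold(x, k):                       # top-k (in magnitude) truncation
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

x, y, k = np.array([1.0, 1.01]), np.array([1.0, 0.0]), 1
print(np.linalg.norm(x - y))                                        # 1.01
print(np.linalg.norm(hard_threshold(x, k) - hard_threshold(y, k)))  # ~1.42 > 1.01
```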
Compared to previous works, our algorithm can be seen as a variant of Stochastic Hard Thresholding (StoIHT from [31]), where we replace the true gradient of $f_\xi$ by the estimator $\hat\nabla f_\xi(x)$. It is also very close to Algorithm 5 from Balasubramanian and Ghadimi [2] (Truncated-ZSGD), with just a different zeroth-order gradient estimator: we use a uniform-smoothing, random-block estimator instead of their Gaussian-smoothing, full-support vanilla estimator. This allows us to deal with very large dimensionalities, on the order of millions, similarly to Cai et al. [6]. Furthermore, as described in the introduction, and contrary to Balasubramanian and Ghadimi [2], we provide the analysis of our algorithm without any gradient sparsity assumption.
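Putting the pieces together, a minimal sketch of the SZOHT iteration could look like the following. This is our own illustrative code on a toy sparse least-squares instance of (1); the parameter values are ad hoc and the helpers condense the estimator and operator sketched earlier, so it should be read as a sketch rather than the authors' implementation.

```python
import numpy as np

def hard_threshold(x, k):
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def zo_grad(f, x, q, s2, mu, rng):
    """Random-support forward-difference estimator, as in eq. (2)."""
    d = x.shape[0]
    g = np.zeros(d)
    fx = f(x)
    for _ in range(q):
        S = rng.choice(d, size=s2, replace=False)
        u = np.zeros(d)
        u[S] = rng.standard_normal(s2)
        u /= np.linalg.norm(u)
        g += (f(x + mu * u) - fx) * u
    return d / (q * mu) * g

def szoht(sample_f_xi, d, k, q, s2, eta, mu, T, seed=0):
    """Sketch of the SZOHT iteration: ZO gradient step on a sampled f(., xi),
    followed by hard thresholding back onto the l0 ball of radius k."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for _ in range(T):
        f_xi = sample_f_xi(rng)
        x = hard_threshold(x - eta * zo_grad(f_xi, x, q, s2, mu, rng), k)
    return x

# Toy instance of (1): recover a k-sparse x_star from noisy linear measurements,
# accessing the minibatch losses only through function evaluations.
rng = np.random.default_rng(1)
d, k, n = 100, 5, 500
x_star = np.zeros(d)
x_star[:k] = rng.standard_normal(k)
A = rng.standard_normal((n, d))
b = A @ x_star + 0.01 * rng.standard_normal(n)

def sample_f_xi(rng, batch=64):
    idx = rng.integers(n, size=batch)
    return lambda x: 0.5 * np.mean((A[idx] @ x - b[idx]) ** 2)

x_hat = szoht(sample_f_xi, d, k, q=100, s2=50, eta=0.05, mu=1e-4, T=500)
print(np.linalg.norm(x_hat - x_star))   # typically a small fraction of ||x_star||
```

On such a well-conditioned toy instance, moderate values of $q$ already suffice; as the next section shows, $q$ cannot be taken arbitrarily small in general.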
The key challenge arising in our analysis is described in Figure 1: the hard-thresholding operator being expansive [36], each approximate gradient step must approach the solution enough to stay close to it even after hard-thresholding. Therefore, it is a priori unclear whether the zeroth-order estimate can be accurate enough to guarantee the convergence of SZOHT. Fortunately, as we will see in the next section, we can indeed ensure convergence, as long as we carefully choose the value of $q$.
Figure 1: Conflict between the hard-thresholding operator and the zeroth-order estimate.
4 Convergence analysis
In this section, we provide the convergence analysis of SZOHT using the assumptions from Section 2, and discuss an interesting property of the combination of the zeroth-order gradient estimate and the hard-thresholding operator, providing a positive answer to the question from the previous section.
Theorem 1 (Proof in Appendix D.1). Assume that each $f_\xi$ is $(L_{s'}, s')$-RSS with $s' := \max(s_2, s)$, and that $f$ is $(\nu_s, s)$-RSC and $\sigma$-FGN, with $s = 2k + k^*$ and $d \ge k \ge 2k^*\rho^2/(1-\rho^2)^2$, with $\rho$ defined as below. Suppose that we run SZOHT with random supports of size $s_2$, $q$ random directions, a learning rate of $\eta = \frac{\nu_s}{(4\varepsilon_F+1)L_{s'}^2}$, and $k$ coordinates kept at each iteration. Then we have a geometric convergence rate of the following form, with $x^{(t)}$ denoting the $t$-th iterate of SZOHT:
$$\mathbb{E}\left\|x^{(t)} - x^*\right\| \le (\gamma\rho)^t\left\|x^{(0)} - x^*\right\| + \frac{\gamma a}{1-\gamma\rho}\,\sigma + \frac{\gamma b}{1-\gamma\rho}\,\mu$$