Stochastic Zeroth-order Functional Constrained Optimization:
Oracle Complexity and Applications
Anthony Nguyen* Krishnakumar Balasubramanian†
October 11, 2022
Abstract
Functionally constrained stochastic optimization problems, where neither the objective function nor
the constraint functions are analytically available, arise frequently in machine learning applications.
In this work, assuming we only have access to the noisy evaluations of the objective and constraint
functions, we propose and analyze stochastic zeroth-order algorithms for solving the above class of
stochastic optimization problems. When the domain of the functions is $\mathbb{R}^n$, assuming there are $m$ constraint functions, we establish oracle complexities of order $\mathcal{O}((m+1)n/\epsilon^2)$ and $\mathcal{O}((m+1)n/\epsilon^3)$, respectively, in the convex and nonconvex settings, where $\epsilon$ represents the accuracy of the solutions required in appropriately defined metrics. The established oracle complexities are, to our knowledge, the first such
results in the literature for functionally constrained stochastic zeroth-order optimization problems. We
demonstrate the applicability of our algorithms by illustrating their superior performance on the problem of
hyperparameter tuning for sampling algorithms and neural network training.
1 Introduction
We develop and analyze stochastic zeroth-order algorithms for solving the following non-linear optimization problem with functional constraints:
$$\min_{x \in X} f_0(x) \quad \text{such that} \quad f_i(x) \leq 0, \quad i \in \{1, \ldots, m\}, \tag{1}$$
where, for $i \in \{0, 1, \ldots, m\}$, $f_i : \mathbb{R}^n \to \mathbb{R}$ are continuous functions, not necessarily convex, defined as $f_i(x) = \mathbb{E}_{\xi_i}[F_i(x, \xi_i)]$, with $\xi_i$ denoting the noise vector associated with function $f_i$, and $X \subseteq \mathbb{R}^n$ is a
convex compact set that represents known constraints (i.e., constraints that are analytically available). In the
stochastic zeroth-order setting, we neither observe the objective function
f0
nor the constraint functions
fi
analytically. We only have access to noisy function evaluations of them. The study of stochastic zeroth-order
optimization algorithms for unconstrained optimization problems goes back to the early works of Kiefer and
Wolfowitz (1952), Blum (1954), Hooke and Jeeves (1961), Spendley et al. (1962), Powell (1964), Nelder
and Mead (1965), Nemirovski and Yudin (1983), Spall (1987). Such zeroth-order algorithms have proved to
be extremely useful for hyperparameter tuning (Snoek et al. 2012, Hernández-Lobato et al. 2015, Gelbart
et al. 2014, Ruan et al. 2019, Golovin et al. 2017), reinforcement learning (Mania et al. 2018, Salimans et al.
2017, Gao et al. 2020, Choromanski et al. 2020) and robotics (Jaquier et al. 2020, Jaquier and Rozo 2020).
However, the study of zeroth-order algorithms and their oracle complexities for constrained problems as in (1) is limited, despite the fact that several real-world machine learning problems fall under the setting of (1). We now describe two such applications that serve as our main motivation for developing stochastic zeroth-order optimization algorithms for solving (1), and analyzing their oracle complexity.
*Department of Mathematics, University of California, Davis. antngu@ucdavis.edu.
†Department of Statistics, University of California, Davis. kbala@ucdavis.edu.
arXiv:2210.04273v1 [math.OC] 9 Oct 2022
1.1 Motivating application I: Hyperparameter tuning for HMC
Hamiltonian Monte Carlo (HMC) algorithm, proposed by Duane et al. (1987) and popularized in the
statistical machine learning community by Neal (2011), is a gradient-based sampling algorithm that works
by discretizing the continuous time degenerate Langevin diffusion (Leimkuhler and Matthews 2015). It
has been used successfully as a state-of-the-art sampler or a numerical integrator in the Bayesian statistical
machine learning community by Hoffman and Gelman (2014), Wang et al. (2013), Girolami and Calderhead
(2011), Chen et al. (2014), Carpenter et al. (2017). However, in order to obtain successful performance
in practice using HMC, several hyperparameters need to be tuned optimally. Typically, the functional
relationship between the hyperparameters that need to be tuned and the performance measure used is not
available in an analytical form. We can only evaluate the performance of the sampler for various settings of
the hyperparameter. Furthermore, in practice several constraints, for example, constraints on running times
and constraints that enforce the generated samples to pass certain standard diagnostic tests (Geweke 1991,
Gelman and Rubin 1992), are enforced in the hyperparameter tuning process. The functional relationship
between such constraints and the hyperparameters is also not available analytically. This makes the problem
of optimally setting the hyperparameters for HMC a constrained zeroth-order optimization problem. As a
preview, in Section 4.1, we show that our approach provides significant improvements over existing methods
of Mahendran et al. (2012), Gelbart et al. (2014), Hernández-Lobato et al. (2015), which are based on
Bayesian optimization techniques for tuning HMC, when we measure performance using the widely
used effective sample size metric (Kass et al. 1998).
1.2 Motivating application II: Hyperparameter tuning for neural networks
Deep learning has achieved state-of-the-art performance in recent years on various prediction tasks (Goodfellow et al. 2016). Among the various factors behind the success of deep learning, hyperparameter tuning is one of the primary ones (Snoek et al. 2012, Bergstra and Bengio 2012, Li et al. 2017, Hazan et al.
2018, Elsken et al. 2019). However, most of the existing methods for tuning the hyperparameters do not
enforce any constraints on the prediction time required on the validation set or memory constraints on the
training algorithm. Such constraints are typically required to make deep learning methods widely applicable
to problems arising in several consumer applications based on tiny devices (Perera et al. 2015, Latré et al. 2011,
Yang et al. 2008). As in the above motivating application, the functional relationship between such constraints
and the hyperparameters is not available analytically. As a preview, in Section 4.2, we show that our approach
provides significant improvements over the existing works of Gelbart et al. (2014), Hernández-Lobato et al.
(2015), Ariafar et al. (2019) that developed hyperparameter tuning techniques which explicitly take into
account time/memory constraints.
1.3 Related works
In the operations research and statistics communities, zeroth-order optimization techniques are well-studied
under the name of derivative-free optimization. Interested readers are referred to Conn et al. (2009) and Audet
and Hare (2017). In the machine learning community, Bayesian optimization techniques have been developed
for optimizing functions with only noisy function evaluations. We refer the reader to Mockus (1994), Kolda
et al. (2003), Spall (2005), Conn et al. (2009), Mockus (2012), Brent (2013), Shahriari et al. (2015), Audet
and Hare (2017), Larson et al. (2019), Frazier (2018), Archetti and Candelieri (2019), Liu et al. (2020) for
more details. In what follows, we focus on relevant literature from zeroth-order optimization and Bayesian
optimization literature for known constrained optimization problems (i.e., problems with constraints that are
analytically available). When the constraint set is analytically available and only the objective function is
not, Lewis and Torczon (2002) and Bueno et al. (2013) considered an augmented Lagrangian approach and
an inexact restoration method, respectively, and provided convergence analyses. Furthermore, Kolda et al. (2003), Amaioua et al. (2018), Audet et al. (2015) extended the popular mesh adaptive direct search to this setting. Projection-free methods based on Frank-Wolfe methods have been considered in Balasubramanian and Ghadimi (2018), Sahu et al. (2019) for the case when the constraint set is a convex subset of $\mathbb{R}^n$. Furthermore, Li et al. (2020) considered the case when the constraint set is a Riemannian submanifold embedded in $\mathbb{R}^n$ (and the function is defined only over the manifold). None of the above works are directly applicable to the case of unknown constraints that we consider in this work.
We now discuss some existing methods for solving (variants of) problem (1) in the zeroth-order setting. For solving (1) in the deterministic setting (i.e., when we can obtain exact evaluations of the objective and the constraint functions at a given point), filter methods which reduce the objective function while trying to
constraint functions at a given point), filter methods which reduce the objective function while trying to
reduce constraint violations were proposed and analyzed in Audet and Dennis Jr (2004), Echebest et al.
(2017), Pourmohamad and Lee (2020). Barrier methods in the zeroth-order setting were considered in Audet
and Dennis Jr (2006, 2009), Liuzzi and Lucidi (2009), Gratton and Vicente (2014), Fasano et al. (2014),
Liuzzi et al. (2010), Dzahini et al. (2020), with some works also developing line search approaches for setting
the tuning parameters. Model based approaches were considered in the works of Müller and Woodbury
(2017), Tröltzsch (2016), Augustin and Marzouk (2014), Gramacy et al. (2016), Conn and Le Digabel (2013).
Furthermore, Bűrmen et al. (2006), Audet and Tribes (2018) developed extensions of the Nelder–Mead algorithm to the constrained setting.
Several works in the statistical machine learning community also considered Bayesian optimization
methods in the constrained setting, in both the noiseless and noisy setting. We refer the reader, for example,
to Gardner et al. (2014), Gelbart et al. (2016), Ariafar et al. (2019), Balandat et al. (2020), Bachoc et al.
(2020), Greenhill et al. (2020), Eriksson and Poloczek (2020), Letham et al. (2019), Hernández-Lobato et al.
(2015), Lam and Willcox (2017), Picheny et al. (2016), Acerbi and Ma (2017). On one hand, the above works
demonstrate the interest in the optimization and machine learning communities for developing algorithms for
constrained zeroth-order optimization problems. On the other hand, most of the above works are not designed
to handle stochastic zeroth-order constrained optimization that we consider. Furthermore, a majority of the
above works are methodological, and the few works that develop convergence analysis do so only in the
asymptotic setting. A recent work by Usmanova et al. (2019) considered the case when the constraints are
linear functions (but unknown), and provided a Frank-Wolfe based algorithm with estimated constraints.
However, the proposed approach is limited to only linear constraints, and the oracle complexities established
are highly sub-optimal. To the best of our knowledge, there is no rigorous non-asymptotic analysis of the oracle complexity of stochastic zeroth-order optimization when the constraints and the objective values are available only via noisy function evaluations.
1.4 Methodology and Main Contributions:
Our methodology is based on a novel constraint extrapolation technique developed for the zeroth-order setting, extending the work of Boob et al. (2022) in the first-order setting, and on Gaussian smoothing based zeroth-order stochastic gradient estimators. Specifically, we propose the SZO-ConEX method in Algorithm 1 for solving problems of the form in (1). We theoretically characterize how to set the tuning parameters of the algorithm so as to mitigate the issues caused by the bias in the stochastic zeroth-order gradient estimates, and obtain the oracle complexity of our algorithm. More specifically, we make the following main contributions:

When the functions $f_i$, $i = 0, \ldots, m$, are convex, in Theorem 3.1, we show that the number of calls to the stochastic zeroth-order oracle to achieve an appropriately defined $\epsilon$-optimal solution of (1) (see Definition 3.1) is of order $\mathcal{O}((m+1)n/\epsilon^2)$.

When the functions are nonconvex, in Proposition 3.1, we show that the number of calls to the stochastic zeroth-order oracle to achieve an appropriately defined $\epsilon$-optimal KKT solution of (1) (see Definition 3.2) is of order $\mathcal{O}((m+1)n/\epsilon^3)$.
To our knowledge, these are the first non-asymptotic oracle complexity results for stochastic zeroth-order
optimization with stochastic zeroth-order functional constraints. We illustrate the practical applicability of the
developed methodology by testing its performance on hyperparameter tuning for the HMC sampling algorithm (Section 4.1) and for a 3-layer neural network (Section 4.2).
2 Preliminaries and Methodology
Notations: Let $\mathbf{0}$ denote the vector with all elements $0$, and let $[m] := \{1, \ldots, m\}$. Let $f(x) := [f_1(x), \ldots, f_m(x)]^T$; then, the constraints in (1) can be expressed as $f(x) \leq \mathbf{0}$. We use $\xi := [\xi_1, \cdots, \xi_m]$ to denote the random vectors in the constraints. Furthermore, $\|\cdot\|$ denotes a general norm and $\|\cdot\|_*$ denotes its dual norm, defined as $\|z\|_* := \sup\{z^T x : \|x\| \leq 1\}$. Furthermore, $[x]_+ := \max\{x, 0\}$ for any $x \in \mathbb{R}$. For any vector $x \in \mathbb{R}^k$, we define $[x]_+$ as the element-wise application of $[\cdot]_+$.
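As a small illustration of this notation, the stacked constraint vector $f(x)$ and the element-wise operator $[\cdot]_+$ (which measures constraint violation) can be sketched in NumPy as follows; the two constraint functions are hypothetical examples, not from the paper:

```python
import numpy as np

# Hypothetical constraint functions f_1, f_2 (illustrative, not from the paper).
constraints = [lambda x: np.sum(x**2) - 1.0,   # f_1(x) = ||x||^2 - 1
               lambda x: x[0] - 0.5]           # f_2(x) = x_1 - 1/2

def f(x):
    """Stacked constraint vector f(x) = [f_1(x), ..., f_m(x)]^T."""
    return np.array([fi(x) for fi in constraints])

def plus(v):
    """Element-wise [v]_+ = max{v, 0}."""
    return np.maximum(v, 0.0)

x = np.array([0.25, 0.25])
print(f(x))        # -> [-0.875 -0.25 ], both constraints satisfied
print(plus(f(x)))  # -> [0. 0.], no constraint violation
```

The constraint $f(x) \leq \mathbf{0}$ holds exactly when `plus(f(x))` is the zero vector.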
We now describe the precise assumption made on the stochastic zeroth-order oracle in this work.
Assumption 2.1. Let $\|\cdot\|$ be a norm on $\mathbb{R}^n$. For $i \in \{0, \ldots, m\}$ and for any $x \in \mathbb{R}^n$, the zeroth-order oracle outputs an estimator $F_i(x, \xi_i)$ of $f_i(x)$ such that $\mathbb{E}[F_i(x, \xi_i)] = f_i(x)$, $\mathbb{E}[(F_i(x, \xi_i) - f_i(x))^2] \leq \sigma_{f_i}^2$, $\mathbb{E}[\nabla F_i(x, \xi_i)] = \nabla f_i(x)$, and $\mathbb{E}[\|\nabla F_i(x, \xi_i) - \nabla f_i(x)\|_*^2] \leq \sigma_i^2$, where $\|\cdot\|_*$ denotes the dual norm.
The assumption above states that we have access to a stochastic zeroth-order oracle which provides unbiased function evaluations with bounded variance. It is worth noting that in the above assumption, we do not necessarily assume the noise $\xi_i$ is additive. Furthermore, we allow for different noise models for the objective function and the $m$ constraint functions, which is a significantly more general model compared to several existing works such as Digabel and Wild (2015). Our gradient estimator is then constructed by leveraging the Gaussian smoothing technique proposed in Nemirovski and Yudin (1983), Nesterov and Spokoiny (2017).
For $\nu_i \in (0, \infty)$, we introduce the smoothed function $f_{i,\nu_i}(x) = \mathbb{E}_{u_i}[f_i(x + \nu_i u_i)]$, where $u_i \sim N(0, I_n)$, independent across $i$. We can estimate the gradient of this smoothed function using function evaluations of $f_i$. Specifically, we define the stochastic zeroth-order gradient of $f_{i,\nu_i}(x)$ as
$$G_{i,\nu_i}(x, \xi_i, u_i) = \frac{F_i(x + \nu_i u_i, \xi_i) - F_i(x, \xi_i)}{\nu_i}\, u_i, \tag{2}$$
which is an unbiased estimator of $\nabla f_{i,\nu_i}(x)$, i.e., we have $\mathbb{E}_{u_i,\xi_i}[G_{i,\nu_i}(x, \xi_i, u_i)] = \nabla f_{i,\nu_i}(x)$. However, it is well-known that $G_{i,\nu_i}(x, \xi_i, u_i)$ is a biased estimator of $\nabla f_i(x)$. An interpretation of the gradient estimator in (2) as a consequence of Gaussian Stein's identity, popular in the statistics literature (Stein 1972), was provided in Balasubramanian and Ghadimi (2022).
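To make the construction concrete, here is a minimal NumPy sketch of the two-point estimator in (2), assuming additive observation noise for illustration (the paper's oracle model is more general). For the quadratic test function $f(x) = \frac{1}{2}\|x\|_2^2$ one can check that $\nabla f_{\nu}(x) = x$, so averaging many draws of the estimator should recover $x$ up to Monte Carlo error:

```python
import numpy as np

def two_point_gradient(F, x, nu, rng, sigma=0.01):
    """One draw of the two-point estimator in (2).

    F is a noiseless function; additive Gaussian noise with standard
    deviation sigma models the stochastic oracle (an illustrative
    assumption). The same noise realization xi is used at both points,
    which is exactly the two-point property -- with additive noise it
    cancels in the difference.
    """
    u = rng.standard_normal(x.shape)     # u ~ N(0, I_n)
    xi = sigma * rng.standard_normal()   # shared noise realization
    return (F(x + nu * u) + xi - (F(x) + xi)) / nu * u

rng = np.random.default_rng(0)
f = lambda x: 0.5 * np.dot(x, x)         # f(x) = ||x||^2 / 2
x = np.array([1.0, -2.0, 0.5])
nu = 1e-2

# Average of many draws approximates grad f_nu(x) = x for this quadratic.
est = np.mean([two_point_gradient(f, x, nu, rng) for _ in range(100000)],
              axis=0)
print(np.linalg.norm(est - x))           # small Monte Carlo error
```

A single draw is a high-variance estimate (its second moment grows with $n$, as quantified in Lemma 2.2 below only for the smoothed gradient); only the average is close to $\nabla f_\nu(x)$.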
The gradient estimator in (2) is referred to as the two-point estimator in the literature. The reason is that, for a given random vector $\xi_i$, it is assumed that the stochastic function in (2) can be evaluated at two points, $F_i(x + \nu_i u_i, \xi_i)$ and $F_i(x, \xi_i)$. Such an assumption is satisfied in several statistics, machine learning, simulation-based optimization, and sampling problems; see for example Spall (2005), Mokkadem and Pelletier (2007), Dippon (2003), Agarwal et al. (2010), Duchi et al. (2015), Ghadimi and Lan (2013), Nesterov and Spokoiny (2017). Yet another estimator in the literature is the one-point estimator, which assumes that for each $\xi_i$, we observe only one noisy function evaluation $F_i(x + \nu_i u_i, \xi_i)$. It is well-known that the one-point setting is more challenging than the two-point setting (Shamir 2013). From a theoretical point of view, the use of the two-point evaluation based gradient estimator is primarily motivated by the sub-optimality (in terms of oracle complexity) of one-point feedback based stochastic zeroth-order optimization methods, either in terms of the approximation accuracy or the dimension dependency. For the rest of this work, we focus on the two-point setting and leave the question of obtaining results in the one-point setting as future work.
We now describe our assumptions on the objective and constraint functions.
Assumption 2.2. Function $F_i$ has Lipschitz continuous gradient with constant $L_i$, almost surely for any $\xi_i$, i.e., $\|\nabla F_i(y, \xi_i) - \nabla F_i(x, \xi_i)\|_* \leq L_i \|y - x\|$, which consequently implies that $|F_i(y, \xi_i) - F_i(x, \xi_i) - \langle \nabla F_i(x, \xi_i), y - x \rangle| \leq \frac{L_i}{2} \|y - x\|^2$, for $i \in \{0, 1, \ldots, m\}$.
Assumption 2.3. Function $F_i$ is Lipschitz continuous with constant $M_i$, almost surely for any $\xi_i$, i.e., $|F_i(y, \xi_i) - F_i(x, \xi_i)| \leq M_i \|y - x\|$, for $i \in \{0, 1, \ldots, m\}$.
The above smoothness assumptions are standard in the literature on stochastic zeroth-order optimization and are made in several works (Nesterov and Spokoiny 2017, Ghadimi and Lan 2013, Balasubramanian and Ghadimi 2022) for obtaining oracle complexity results. It is easy to see that Assumption 2.2 implies that for $i \in \{0, \ldots, m\}$, $f_i$ has Lipschitz continuous gradient with constant $L_i$, since $\|\nabla f_i(y) - \nabla f_i(x)\|_* \leq \mathbb{E}[\|\nabla F_i(y, \xi_i) - \nabla F_i(x, \xi_i)\|_*] \leq L_i \|y - x\|$, due to Jensen's inequality for the dual norm. By similar reasoning and Assumption 2.3, we also see that $f_i$ is Lipschitz continuous with constant $M_i$. Due to Assumptions 2.2 and 2.3, we also have $\|f(x_1) - f(x_2)\|_2 \leq M_f \|x_1 - x_2\|$, $\|\nabla f(x_2)^T (x_1 - x_2)\|_2 \leq M_f \|x_1 - x_2\|$ and $\|f(x_1) - f(x_2) - \nabla f(x_2)^T (x_1 - x_2)\|_2 \leq \frac{L_f}{2} \|x_1 - x_2\|^2$, for all $x_1, x_2 \in \mathbb{R}^n$, where $\nabla f(\cdot) := [\nabla f_1(\cdot), \ldots, \nabla f_m(\cdot)] \in \mathbb{R}^{n \times m}$ and the constants $M_f$ and $L_f$ are defined as
$$M_f := \sqrt{\sum_{i=1}^m M_i^2} \quad \text{and} \quad L_f := \sqrt{\sum_{i=1}^m L_i^2}. \tag{3}$$
We now state the definitions of the prox-function and the prox-operator. The class of algorithms based on prox-operators is called proximal algorithms. Such algorithms have turned out to be particularly useful for efficiently solving various machine learning problems in the recent past. We refer the interested reader to Parikh and Boyd (2014), Beck (2017) for more details.
Definition 2.1. Let $\omega : X \to \mathbb{R}$ be a continuously differentiable, $L_\omega$-Lipschitz-gradient smooth function that is $1$-strongly convex with respect to $\|\cdot\|$. We define the prox-function associated with $\omega(\cdot)$, for $x, y \in \mathbb{R}^n$, as $W(y, x) := \omega(y) - \omega(x) - \langle \nabla \omega(x), y - x \rangle$. Based on the smoothness and strong convexity of $\omega(x)$, we have $W(y, x) \leq \frac{L_\omega}{2} \|x - y\|^2 \leq L_\omega W(x, y)$, for all $x, y \in \mathbb{R}^n$. For any $v \in \mathbb{R}^n$, we define the following prox-operator as
$$\mathrm{prox}(v, \tilde{x}, \eta) := \arg\min_{x \in X} \{ \langle v, x \rangle + \eta W(x, \tilde{x}) \}.$$
The function $W$ is also called the Bregman divergence in the literature. A canonical example of $W$ is the Euclidean distance function $\|x - y\|^2$, which is useful when $X = \mathbb{R}^n$. We will see in Section 2.1 that our algorithm is based on the above prox-operator. Finally, we have the following results, which will prove useful for subsequent calculations. Let $u := [u_1, \cdots, u_m]$ and let $D_X := \sup_{x, y \in X} \sqrt{W(x, y)}$ be the diameter of the set $X$.
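For intuition about the prox-operator, consider the Euclidean case $W(x, \tilde{x}) = \frac{1}{2}\|x - \tilde{x}\|_2^2$ with $X$ a box (an illustrative choice, not from the paper). The objective $\langle v, x \rangle + \eta W(x, \tilde{x})$ is then separable across coordinates, so $\mathrm{prox}(v, \tilde{x}, \eta)$ has the closed form $\Pi_X(\tilde{x} - v/\eta)$, a projected gradient-type step:

```python
import numpy as np

def prox_euclidean_box(v, x_tilde, eta, lo, hi):
    """prox(v, x~, eta) = argmin_{x in X} <v, x> + (eta/2) ||x - x~||^2
    for the Euclidean prox-function and a box X = [lo, hi]^n.
    Separability makes the minimizer the unconstrained solution
    x~ - v/eta, clipped coordinate-wise to the box.
    """
    return np.clip(x_tilde - v / eta, lo, hi)

v = np.array([0.2, -0.4])
x_tilde = np.array([0.5, 0.5])
eta = 2.0
x_star = prox_euclidean_box(v, x_tilde, eta, 0.0, 1.0)
print(x_star)  # -> [0.4 0.7]; interior point, so it equals x~ - v/eta
```

For a general $\omega$ the prox-operator has no closed form, but it remains cheap whenever $X$ and $W$ are simple, which is what makes proximal algorithms practical.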
Lemma 2.1. Let $\nu := [\nu_1, \cdots, \nu_m]$, $F_\nu(x, \xi, u) := [F_1(x + \nu_1 u_1, \xi_1), \ldots, F_m(x + \nu_m u_m, \xi_m)]^T$ and $f_\nu(x) := [f_{1,\nu_1}(x), \ldots, f_{m,\nu_m}(x)]^T$. Under Assumption 2.3, we have $\mathbb{E}_{u,\xi}[\|F_\nu(x, \xi, u) - f_\nu(x)\|^2] \leq \bar{\sigma}_f^2$, where $\bar{\sigma}_f^2 := \sum_{i=1}^m \left( 4(n+2) M_i^2 \nu_i^2 + L_i^2 \nu_i^4 n^2 \right) + 2\sigma_f^2$, with $\sigma_f^2 := \sum_{i=1}^m \sigma_{f_i}^2$.
Lemma 2.2. Let $\tilde{B}_i := \frac{\nu_i}{2} L_i (n+3)^{3/2} + L_i D_X + M_i$. Under Assumptions 2.1 and 2.2, we have
$$\mathbb{E}_{u,\xi}[\|G_{i,\nu_i}(x, \xi_i, u_i) - \nabla f_{i,\nu_i}(x)\|^2] \leq \sigma_{i,\nu_i}^2, \tag{4}$$
where $\sigma_{i,\nu_i}^2 := \nu_i^2 L_i^2 (n+6)^3 + 10(n+4)[\sigma_i^2 + \tilde{B}_i^2]$.
2.1 Algorithmic Methodology
We now present the SZO-ConEX algorithm for solving the stochastic zeroth-order functional constrained optimization problem (1). The constraint extrapolation framework is a novel primal-dual method that proceeds