Stochastic Zeroth-order Functional Constrained Optimization:
Oracle Complexity and Applications
Anthony Nguyen* Krishnakumar Balasubramanian†
October 11, 2022
Abstract
Functionally constrained stochastic optimization problems, where neither the objective function nor
the constraint functions are analytically available, arise frequently in machine learning applications.
In this work, assuming we only have access to the noisy evaluations of the objective and constraint
functions, we propose and analyze stochastic zeroth-order algorithms for solving the above class of
stochastic optimization problems. When the domain of the functions is $\mathbb{R}^n$, assuming there are $m$ constraint functions, we establish oracle complexities of order $\mathcal{O}((m+1)n/\epsilon^2)$ and $\mathcal{O}((m+1)n/\epsilon^3)$, respectively, in the convex and nonconvex settings, where $\epsilon$ represents the accuracy of the solutions required in appropriately defined metrics. The established oracle complexities are, to our knowledge, the first such
results in the literature for functionally constrained stochastic zeroth-order optimization problems. We
demonstrate the applicability of our algorithms by illustrating their superior performance on the problem of
hyperparameter tuning for sampling algorithms and neural network training.
1 Introduction
We develop and analyze stochastic zeroth-order algorithms for solving the following non-linear optimization problem with functional constraints:
$$\min_{x \in X} f_0(x) \quad \text{such that} \quad f_i(x) \leq 0, \quad i \in \{1, \ldots, m\}, \tag{1}$$
where, for $i \in \{0, 1, \ldots, m\}$, $f_i : \mathbb{R}^n \to \mathbb{R}$ are continuous functions, not necessarily convex, defined as $f_i(x) = \mathbb{E}_{\xi_i}[F_i(x, \xi_i)]$, with $\xi_i$ denoting the noise vector associated with function $f_i$, and $X \subseteq \mathbb{R}^n$ is a
convex compact set that represents known constraints (i.e., constraints that are analytically available). In the
stochastic zeroth-order setting, we neither observe the objective function
f0
nor the constraint functions
fi
analytically. We only have access to noisy function evaluations of them. The study of stochastic zeroth-order
optimization algorithms for unconstrained optimization problems goes back to the early works of Kiefer and
Wolfowitz (1952), Blum (1954), Hooke and Jeeves (1961), Spendley et al. (1962), Powell (1964), Nelder
and Mead (1965), Nemirovski and Yudin (1983), Spall (1987). Such zeroth-order algorithms have proved to
be extremely useful for hyperparameter tuning (Snoek et al. 2012, Hernández-Lobato et al. 2015, Gelbart
et al. 2014, Ruan et al. 2019, Golovin et al. 2017), reinforcement learning (Mania et al. 2018, Salimans et al.
2017, Gao et al. 2020, Choromanski et al. 2020) and robotics (Jaquier et al. 2020, Jaquier and Rozo 2020).
However, the study of zeroth-order algorithms and their oracle complexities for constrained problems as in (1) is limited, despite the fact that several real-world machine learning problems fall under the setting of (1). We now describe two such applications that serve as our main motivation for developing stochastic zeroth-order optimization algorithms for solving (1), and analyzing their oracle complexity.
*Department of Mathematics, University of California, Davis. antngu@ucdavis.edu.
†Department of Statistics, University of California, Davis. kbala@ucdavis.edu.
arXiv:2210.04273v1 [math.OC] 9 Oct 2022
1.1 Motivating application I: Hyperparameter tuning for HMC
Hamiltonian Monte Carlo (HMC) algorithm, proposed by Duane et al. (1987) and popularized in the
statistical machine learning community by Neal (2011), is a gradient-based sampling algorithm that works
by discretizing the continuous time degenerate Langevin diffusion (Leimkuhler and Matthews 2015). It
has been used successfully as a state-of-the-art sampler or a numerical integrator in the Bayesian statistical
machine learning community by Hoffman and Gelman (2014), Wang et al. (2013), Girolami and Calderhead
(2011), Chen et al. (2014), Carpenter et al. (2017). However, in order to obtain successful performance
in practice using HMC, several hyperparameters need to be tuned optimally. Typically, the functional
relationship between the hyperparameters that need to be tuned and the performance measure used is not
available in an analytical form. We can only evaluate the performance of the sampler for various settings of
the hyperparameter. Furthermore, in practice several constraints, for example, constraints on running times
and constraints that enforce the generated samples to pass certain standard diagnostic tests (Geweke 1991,
Gelman and Rubin 1992), are enforced in the hyperparameter tuning process. The functional relationship
between such constraints and the hyperparameters is also not available analytically. This makes the problem
of optimally setting the hyperparameters for HMC a constrained zeroth-order optimization problem. As a
preview, in Section 4.1, we show that our approach provides significant improvements over existing methods
of Mahendran et al. (2012), Gelbart et al. (2014), Hernández-Lobato et al. (2015), which are based on
Bayesian optimization techniques for tuning HMC, when we measure performance using the widely
used effective sample size metric (Kass et al. 1998).
1.2 Motivating application II: Hyperparameter tuning for neural networks
Deep learning has achieved state-of-the-art performance in recent years on various prediction tasks (Goodfellow et al. 2016). Among the various factors behind the success of deep learning, hyperparameter tuning is one of the primary ones (Snoek et al. 2012, Bergstra and Bengio 2012, Li et al. 2017, Hazan et al.
2018, Elsken et al. 2019). However, most of the existing methods for tuning the hyperparameters do not
enforce any constraints on the prediction time required on the validation set or memory constraints on the
training algorithm. Such constraints are typically required to make deep learning methods widely applicable
to problems arising in several consumer applications based on tiny devices (Perera et al. 2015, Latré et al. 2011,
Yang et al. 2008). As in the above motivating application, the functional relationship between such constraints
and the hyperparameters is not available analytically. As a preview, in Section 4.2, we show that our approach
provides significant improvements over the existing works of Gelbart et al. (2014), Hernández-Lobato et al.
(2015), Ariafar et al. (2019) that developed hyperparameter tuning techniques which explicitly take into
account time/memory constraints.
1.3 Related works
In the operations research and statistics communities, zeroth-order optimization techniques are well-studied
under the name of derivative-free optimization. Interested readers are referred to Conn et al. (2009) and Audet
and Hare (2017). In the machine learning community, Bayesian optimization techniques have been developed
for optimizing functions with only noisy function evaluations. We refer the reader to Mockus (1994), Kolda
et al. (2003), Spall (2005), Conn et al. (2009), Mockus (2012), Brent (2013), Shahriari et al. (2015), Audet
and Hare (2017), Larson et al. (2019), Frazier (2018), Archetti and Candelieri (2019), Liu et al. (2020) for
more details. In what follows, we focus on relevant literature from zeroth-order optimization and Bayesian
optimization literature for known constrained optimization problems (i.e., problems with constraints that are
analytically available). When the constraint set is analytically available and only the objective function is
not, Lewis and Torczon (2002) and Bueno et al. (2013) considered an augmented Lagrangian approach and
an inexact restoration method, respectively, and provided convergence analyses. Furthermore, Kolda et al. (2003), Amaioua et al. (2018), Audet et al. (2015) extended the popular mesh adaptive direct search to this setting. Projection-free methods based on Frank-Wolfe methods have been considered in Balasubramanian and Ghadimi (2018), Sahu et al. (2019) for the case when the constraint set is a convex subset of $\mathbb{R}^n$. Furthermore, Li et al. (2020) considered the case when the constraint set is a Riemannian submanifold embedded in $\mathbb{R}^n$ (and the function is defined only over the manifold). None of the above works are directly applicable to the case of unknown constraints that we consider in this work.
We now discuss some existing methods for solving (variants of) problem (1) in the zeroth-order setting. For solving (1) in the deterministic setting (i.e., when we can obtain exact evaluations of the objective and the constraint functions at a given point), filter methods which reduce the objective function while trying to
constraint functions at a given point), filter methods which reduce the objective function while trying to
reduce constraint violations were proposed and analyzed in Audet and Dennis Jr (2004), Echebest et al.
(2017), Pourmohamad and Lee (2020). Barrier methods in the zeroth-order setting were considered in Audet
and Dennis Jr (2006, 2009), Liuzzi and Lucidi (2009), Gratton and Vicente (2014), Fasano et al. (2014),
Liuzzi et al. (2010), Dzahini et al. (2020), with some works also developing line search approaches for setting
the tuning parameters. Model based approaches were considered in the works of Müller and Woodbury
(2017), Tröltzsch (2016), Augustin and Marzouk (2014), Gramacy et al. (2016), Conn and Le Digabel (2013).
Furthermore, Bűrmen et al. (2006), Audet and Tribes (2018) developed extensions of the Nelder–Mead algorithm to the constrained setting.
Several works in the statistical machine learning community also considered Bayesian optimization
methods in the constrained setting, in both the noiseless and noisy setting. We refer the reader, for example,
to Gardner et al. (2014), Gelbart et al. (2016), Ariafar et al. (2019), Balandat et al. (2020), Bachoc et al.
(2020), Greenhill et al. (2020), Eriksson and Poloczek (2020), Letham et al. (2019), Hernández-Lobato et al.
(2015), Lam and Willcox (2017), Picheny et al. (2016), Acerbi and Ma (2017). On one hand, the above works
demonstrate the interest in the optimization and machine learning communities for developing algorithms for
constrained zeroth-order optimization problems. On the other hand, most of the above works are not designed
to handle stochastic zeroth-order constrained optimization that we consider. Furthermore, a majority of the
above works are methodological, and the few works that develop convergence analysis do so only in the
asymptotic setting. A recent work by Usmanova et al. (2019) considered the case when the constraints are
linear functions (but unknown), and provided a Frank-Wolfe based algorithm with estimated constraints.
However, the proposed approach is limited to only linear constraints, and the oracle complexities established
are highly sub-optimal. To the best of our knowledge, there is no rigorous non-asymptotic analysis of the oracle complexity of stochastic zeroth-order optimization when the constraints and the objective values are available only via noisy function evaluations.
1.4 Methodology and Main Contributions:
Our methodology is based on a novel constraint extrapolation technique developed for the zeroth-order setting, extending the work of Boob et al. (2022) in the first-order setting, and on Gaussian smoothing based zeroth-order stochastic gradient estimators. Specifically, we propose the SZO-ConEX method in Algorithm 1 for solving problems of the form in (1). We theoretically characterize how to set the tuning parameters of the algorithm so as to mitigate the issues caused by the bias in the stochastic zeroth-order gradient estimates, and obtain the oracle complexity of our algorithm. More specifically, we make the following main contributions:

When the functions $f_i$, $i = 0, \ldots, m$, are convex, in Theorem 3.1, we show that the number of calls to the stochastic zeroth-order oracle to achieve an appropriately defined $\epsilon$-optimal solution of (1) (see Definition 3.1) is of order $\mathcal{O}((m+1)n/\epsilon^2)$.

When the functions are nonconvex, in Proposition 3.1, we show that the number of calls to the stochastic zeroth-order oracle to achieve an appropriately defined $\epsilon$-optimal KKT solution of (1) (see Definition 3.2) is of order $\mathcal{O}((m+1)n/\epsilon^3)$.
To our knowledge, these are the first non-asymptotic oracle complexity results for stochastic zeroth-order
optimization with stochastic zeroth-order functional constraints. We illustrate the practical applicability of the
developed methodology by testing its performance on hyperparameter tuning for the HMC sampling algorithm (Section 4.1) and for a 3-layer neural network (Section 4.2).
2 Preliminaries and Methodology
Notations: Let $\mathbf{0}$ denote the vector with all elements $0$, and let $[m] := \{1, \ldots, m\}$. Let $f(x) := [f_1(x), \ldots, f_m(x)]^T$; then, the constraints in (1) can be expressed as $f(x) \leq \mathbf{0}$. We use $\xi := [\xi_1, \cdots, \xi_m]$ to denote the random vectors in the constraints. Furthermore, $\|\cdot\|$ denotes a general norm and $\|\cdot\|_*$ denotes its dual norm, defined as $\|z\|_* := \sup\{z^T x : \|x\| \leq 1\}$. Furthermore, $[x]_+ := \max\{x, 0\}$ for any $x \in \mathbb{R}$. For any vector $x \in \mathbb{R}^k$, we define $[x]_+$ as the element-wise application of $[\cdot]_+$.
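As a small illustration of this notation, the stacked constraint vector $f(x)$ and the element-wise operator $[\cdot]_+$ (which measures constraint violation) can be sketched in NumPy as follows; the two constraint functions are hypothetical examples, not from the paper:

```python
import numpy as np

# Hypothetical constraint functions f_1, f_2 (illustrative, not from the paper).
constraints = [lambda x: np.sum(x**2) - 1.0,   # f_1(x) = ||x||^2 - 1
               lambda x: x[0] - 0.5]           # f_2(x) = x_1 - 1/2

def f(x):
    """Stacked constraint vector f(x) = [f_1(x), ..., f_m(x)]^T."""
    return np.array([fi(x) for fi in constraints])

def plus(v):
    """Element-wise [v]_+ = max{v, 0}."""
    return np.maximum(v, 0.0)

x = np.array([0.25, 0.25])
print(f(x))        # -> [-0.875 -0.25 ], both constraints satisfied
print(plus(f(x)))  # -> [0. 0.], no constraint violation
```

The constraint $f(x) \leq \mathbf{0}$ holds exactly when `plus(f(x))` is the zero vector.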
We now describe the precise assumption made on the stochastic zeroth-order oracle in this work.
Assumption 2.1. Let $\|\cdot\|$ be a norm on $\mathbb{R}^n$. For $i \in \{0, \ldots, m\}$ and for any $x \in \mathbb{R}^n$, the zeroth-order oracle outputs an estimator $F_i(x, \xi_i)$ of $f_i(x)$ such that $\mathbb{E}[F_i(x, \xi_i)] = f_i(x)$, $\mathbb{E}[(F_i(x, \xi_i) - f_i(x))^2] \leq \sigma_{f_i}^2$, $\mathbb{E}[\nabla F_i(x, \xi_i)] = \nabla f_i(x)$, and $\mathbb{E}[\|\nabla F_i(x, \xi_i) - \nabla f_i(x)\|_*^2] \leq \sigma_i^2$, where $\|\cdot\|_*$ denotes the dual norm.
The assumption above states that we have access to a stochastic zeroth-order oracle which provides unbiased function evaluations with bounded variance. It is worth noting that in the above assumption, we do not necessarily assume the noise $\xi_i$ is additive. Furthermore, we allow for different noise models for the objective function and the $m$ constraint functions, which is a significantly more general model compared to several existing works such as Digabel and Wild (2015). Our gradient estimator is then constructed by leveraging the Gaussian smoothing technique proposed in Nemirovski and Yudin (1983), Nesterov and Spokoiny (2017).
For $\nu_i \in (0, \infty)$, we introduce the smoothed function $f_{i,\nu_i}(x) = \mathbb{E}_{u_i}[f_i(x + \nu_i u_i)]$, where $u_i \sim N(0, I_n)$, independent across $i$. We can estimate the gradient of this smoothed function using function evaluations of $f_i$. Specifically, we define the stochastic zeroth-order gradient of $f_{i,\nu_i}(x)$ as
$$G_{i,\nu_i}(x, \xi_i, u_i) = \frac{F_i(x + \nu_i u_i, \xi_i) - F_i(x, \xi_i)}{\nu_i}\, u_i, \tag{2}$$
which is an unbiased estimator of $\nabla f_{i,\nu_i}(x)$, i.e., we have $\mathbb{E}_{u_i,\xi_i}[G_{i,\nu_i}(x, \xi_i, u_i)] = \nabla f_{i,\nu_i}(x)$. However, it is well-known that $G_{i,\nu_i}(x, \xi_i, u_i)$ is a biased estimator of $\nabla f_i(x)$. An interpretation of the gradient estimator in (2) as a consequence of Gaussian Stein's identity, popular in the statistics literature (Stein 1972), was provided in Balasubramanian and Ghadimi (2022).
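To make the construction concrete, here is a minimal NumPy sketch of the two-point estimator in (2), assuming additive observation noise for illustration (the paper's oracle model is more general). For the quadratic test function $f(x) = \frac{1}{2}\|x\|_2^2$ one can check that $\nabla f_{\nu}(x) = x$, so averaging many draws of the estimator should recover $x$ up to Monte Carlo error:

```python
import numpy as np

def two_point_gradient(F, x, nu, rng, sigma=0.01):
    """One draw of the two-point estimator in (2).

    F is a noiseless function; additive Gaussian noise with standard
    deviation sigma models the stochastic oracle (an illustrative
    assumption). The same noise realization xi is used at both points,
    which is exactly the two-point property -- with additive noise it
    cancels in the difference.
    """
    u = rng.standard_normal(x.shape)     # u ~ N(0, I_n)
    xi = sigma * rng.standard_normal()   # shared noise realization
    return (F(x + nu * u) + xi - (F(x) + xi)) / nu * u

rng = np.random.default_rng(0)
f = lambda x: 0.5 * np.dot(x, x)         # f(x) = ||x||^2 / 2
x = np.array([1.0, -2.0, 0.5])
nu = 1e-2

# Average of many draws approximates grad f_nu(x) = x for this quadratic.
est = np.mean([two_point_gradient(f, x, nu, rng) for _ in range(100000)],
              axis=0)
print(np.linalg.norm(est - x))           # small Monte Carlo error
```

A single draw is a high-variance estimate (its second moment grows with $n$, as quantified in Lemma 2.2 below only for the smoothed gradient); only the average is close to $\nabla f_\nu(x)$.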
The gradient estimator in (2) is referred to as the two-point estimator in the literature. The reason is that, for a given random vector $\xi_i$, it is assumed that the stochastic function in (2) can be evaluated at two points, $F_i(x + \nu_i u_i, \xi_i)$ and $F_i(x, \xi_i)$. Such an assumption is satisfied in several statistics, machine learning, simulation-based optimization, and sampling problems; see for example Spall (2005), Mokkadem and Pelletier (2007), Dippon (2003), Agarwal et al. (2010), Duchi et al. (2015), Ghadimi and Lan (2013), Nesterov and Spokoiny (2017). Yet another estimator in the literature is the one-point estimator, which assumes that for each $\xi_i$, we observe only one noisy function evaluation $F_i(x + \nu_i u_i, \xi_i)$. It is well-known that the one-point setting is more challenging than the two-point setting (Shamir 2013). From a theoretical point of view, the use of the two-point evaluation based gradient estimator is primarily motivated by the sub-optimality (in terms of oracle complexity) of one-point feedback based stochastic zeroth-order optimization methods, either in terms of the approximation accuracy or the dimension dependency. For the rest of this work, we focus on the two-point setting and leave the question of obtaining results in the one-point setting as future work.
We now describe our assumptions on the objective and constraint functions.
Assumption 2.2. Function $F_i$ has Lipschitz continuous gradient with constant $L_i$, almost surely for any $\xi_i$, i.e., $\|\nabla F_i(y, \xi_i) - \nabla F_i(x, \xi_i)\|_* \leq L_i \|y - x\|$, which consequently implies that $|F_i(y, \xi_i) - F_i(x, \xi_i) - \langle \nabla F_i(x, \xi_i), y - x \rangle| \leq \frac{L_i}{2} \|y - x\|^2$, for $i \in \{0, 1, \ldots, m\}$.
Assumption 2.3. Function $F_i$ is Lipschitz continuous with constant $M_i$, almost surely for any $\xi_i$, i.e., $|F_i(y, \xi_i) - F_i(x, \xi_i)| \leq M_i \|y - x\|$, for $i \in \{0, 1, \ldots, m\}$.
The above smoothness assumptions are standard in the literature on stochastic zeroth-order optimization and are made in several works (Nesterov and Spokoiny 2017, Ghadimi and Lan 2013, Balasubramanian and Ghadimi 2022) for obtaining oracle complexity results. It is easy to see that Assumption 2.2 implies that for $i \in \{0, \ldots, m\}$, $f_i$ has Lipschitz continuous gradient with constant $L_i$, since $\|\nabla f_i(y) - \nabla f_i(x)\|_* \leq \mathbb{E}[\|\nabla F_i(y, \xi_i) - \nabla F_i(x, \xi_i)\|_*] \leq L_i \|y - x\|$, due to Jensen's inequality for the dual norm. By similar reasoning and Assumption 2.3, we also see that $f_i$ is Lipschitz continuous with constant $M_i$. Due to Assumptions 2.2 and 2.3, we also have $\|f(x_1) - f(x_2)\|_2 \leq M_f \|x_1 - x_2\|$, $\|\nabla f(x_2)^T (x_1 - x_2)\|_2 \leq M_f \|x_1 - x_2\|$ and $\|f(x_1) - f(x_2) - \nabla f(x_2)^T (x_1 - x_2)\|_2 \leq \frac{L_f}{2} \|x_1 - x_2\|^2$, for all $x_1, x_2 \in \mathbb{R}^n$, where $\nabla f(\cdot) := [\nabla f_1(\cdot), \ldots, \nabla f_m(\cdot)] \in \mathbb{R}^{n \times m}$ and the constants $M_f$ and $L_f$ are defined as
$$M_f := \sqrt{\sum_{i=1}^m M_i^2} \quad \text{and} \quad L_f := \sqrt{\sum_{i=1}^m L_i^2}. \tag{3}$$
We now state the definitions of the prox-function and the prox-operator. The class of algorithms based on prox-operators is called proximal algorithms. Such algorithms have turned out to be particularly useful for efficiently solving various machine learning problems in the recent past. We refer the interested reader to Parikh and Boyd (2014), Beck (2017) for more details.
Definition 2.1. Let $\omega : X \to \mathbb{R}$ be a continuously differentiable, $L_\omega$-Lipschitz-gradient smooth function that is $1$-strongly convex with respect to $\|\cdot\|$. We define the prox-function associated with $\omega(\cdot)$, for $x, y \in \mathbb{R}^n$, as $W(y, x) := \omega(y) - \omega(x) - \langle \nabla \omega(x), y - x \rangle$. Based on the smoothness and strong convexity of $\omega(x)$, we have $W(y, x) \leq \frac{L_\omega}{2} \|x - y\|^2 \leq L_\omega W(x, y)$, for all $x, y \in \mathbb{R}^n$. For any $v \in \mathbb{R}^n$, we define the following prox-operator as
$$\mathrm{prox}(v, \tilde{x}, \eta) := \arg\min_{x \in X} \{ \langle v, x \rangle + \eta W(x, \tilde{x}) \}.$$
The function $W$ is also called the Bregman divergence in the literature. A canonical example of $W$ is the Euclidean distance function $\|x - y\|^2$, which is useful when $X = \mathbb{R}^n$. We will see in Section 2.1 that our algorithm is based on the above prox-operator. Finally, we have the following results, which will prove useful for subsequent calculations. Let $u := [u_1, \cdots, u_m]$ and let $D_X := \sup_{x, y \in X} \sqrt{W(x, y)}$ be the diameter of the set $X$.
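For intuition about the prox-operator, consider the Euclidean case $W(x, \tilde{x}) = \frac{1}{2}\|x - \tilde{x}\|_2^2$ with $X$ a box (an illustrative choice, not from the paper). The objective $\langle v, x \rangle + \eta W(x, \tilde{x})$ is then separable across coordinates, so $\mathrm{prox}(v, \tilde{x}, \eta)$ has the closed form $\Pi_X(\tilde{x} - v/\eta)$, a projected gradient-type step:

```python
import numpy as np

def prox_euclidean_box(v, x_tilde, eta, lo, hi):
    """prox(v, x~, eta) = argmin_{x in X} <v, x> + (eta/2) ||x - x~||^2
    for the Euclidean prox-function and a box X = [lo, hi]^n.
    Separability makes the minimizer the unconstrained solution
    x~ - v/eta, clipped coordinate-wise to the box.
    """
    return np.clip(x_tilde - v / eta, lo, hi)

v = np.array([0.2, -0.4])
x_tilde = np.array([0.5, 0.5])
eta = 2.0
x_star = prox_euclidean_box(v, x_tilde, eta, 0.0, 1.0)
print(x_star)  # -> [0.4 0.7]; interior point, so it equals x~ - v/eta
```

For a general $\omega$ the prox-operator has no closed form, but it remains cheap whenever $X$ and $W$ are simple, which is what makes proximal algorithms practical.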
Lemma 2.1. Let $\nu := [\nu_1, \cdots, \nu_m]$, $F_\nu(x, \xi, u) := [F_1(x + \nu_1 u_1, \xi_1), \ldots, F_m(x + \nu_m u_m, \xi_m)]^T$ and $f_\nu(x) := [f_{1,\nu_1}(x), \ldots, f_{m,\nu_m}(x)]^T$. Under Assumption 2.3, we have $\mathbb{E}_{u,\xi}[\|F_\nu(x, \xi, u) - f_\nu(x)\|^2] \leq \bar{\sigma}_f^2$, where $\bar{\sigma}_f^2 := \sum_{i=1}^m \left( 4(n+2) M_i^2 \nu_i^2 + L_i^2 \nu_i^4 n^2 \right) + 2\sigma_f^2$, with $\sigma_f^2 := \sum_{i=1}^m \sigma_{f_i}^2$.
Lemma 2.2. Let $\tilde{B}_i := \frac{\nu_i}{2} L_i (n+3)^{3/2} + L_i D_X + M_i$. Under Assumptions 2.1 and 2.2, we have
$$\mathbb{E}_{u,\xi}[\|G_{i,\nu_i}(x, \xi_i, u_i) - \nabla f_{i,\nu_i}(x)\|^2] \leq \sigma_{i,\nu_i}^2, \tag{4}$$
where $\sigma_{i,\nu_i}^2 := \nu_i^2 L_i^2 (n+6)^3 + 10(n+4)[\sigma_i^2 + \tilde{B}_i^2]$.
2.1 Algorithmic Methodology
We now present the SZO-ConEX algorithm for solving the stochastic zeroth-order functional constrained optimization problem (1). The constraint extrapolation framework is a novel primal-dual method that proceeds