Zero-Order One-Point Estimate with Distributed Stochastic
Gradient-Tracking Technique
Elissa Mhanna elissa.mhanna@centralesupelec.fr
Mohamad Assaad mohamad.assaad@centralesupelec.fr
Laboratoire des Signaux & Systèmes
CentraleSupélec
3 rue Joliot Curie
91190 Gif-sur-Yvette, France
Abstract
In this work, we consider a distributed multi-agent stochastic optimization problem, where
each agent holds a local objective function that is smooth and convex, and that is subject
to a stochastic process. The goal is for all agents to collaborate to find a common solution
that optimizes the sum of these local functions. With the practical assumption that agents
can only obtain noisy numerical function queries at exactly one point at a time, we extend
the distributed stochastic gradient-tracking method to the bandit setting, where we do not
have an estimate of the gradient, and we introduce a zero-order (ZO) one-point estimate
(1P-DSGT). We analyze the convergence of this novel technique for smooth and convex
objectives using stochastic approximation tools, and we prove that it converges almost
surely to the optimum. We then study the convergence rate when the objectives are
additionally strongly convex. We obtain a rate of $O(1/\sqrt{k})$ after a sufficient number of
iterations $k > K_2$, which is typically optimal for techniques utilizing one-point estimators.
We also provide a regret bound of $O(\sqrt{k})$, which compares favorably to the
aforementioned techniques. We further illustrate the usefulness of the proposed technique
using numerical experiments.
Keywords: Gradient-free optimization, stochastic framework, distributed algorithms,
consensus, gradient tracking, convergence analysis
1. Introduction
Gradient-free optimization is a long-standing topic in the research community; however,
interest in it has grown recently, especially in machine learning applications, where
optimization problems are typically solved with gradient descent algorithms. Successful
applications of gradient-free methods in machine learning include competing with an adversary
in bandit problems (Flaxman et al., 2004; Agarwal et al., 2010), generating adversarial
attacks for deep learning models (Chen et al., 2019; Liu et al., 2019) and reinforcement
learning (Vemula et al., 2019). Gradient-free optimization aims to solve optimization prob-
lems with only functional (zero-order) information rather than first-order (FO) gradient
information. These techniques are essential in settings where explicit gradient computation
may be impractical, expensive, or impossible. Such settings include high-dimensional data,
function differentiation that strains time or resource budgets, and cost functions lacking a
closed form. Zero-order information-based methods include direct search methods
(Golovin et al., 2019), 1-point methods (Li and Assaad, 2021; Flaxman et al., 2004; Bach
and Perchet, 2016; Vemula et al., 2019) where the function is evaluated at a single point
with some randomization to estimate the gradient, two- or multi-point methods (Duchi et al.,
2015; Nesterov and Spokoiny, 2017; Gorbunov et al., 2018; Bach and Perchet, 2016; Ha-
jinezhad et al., 2019; Kumar Sahu et al., 2018; Agarwal et al., 2010; Chen et al., 2019; Liu
et al., 2019; Vemula et al., 2019), where functional differences at several points are employed
for estimation, and other methods, such as those using the sign information of gradient
estimates (Liu et al., 2019).
Another area of great interest is distributed multi-agent optimization, where agents
try to cooperatively solve a problem with information exchange only limited to immediate
neighbors in the network. Distributed computing and data storing are particularly essential
in fields such as vehicular communications and coordination, data processing and distributed
control in sensor networks (Shi and Eryilmaz, 2020), big-data analytics (Daneshmand et al.,
2015), and federated learning (McMahan et al., 2017). More specifically, one direction
of research integrates (sub)gradient-based methods with a consensus/averaging strategy;
each agent incorporates one or multiple consensus steps alongside evaluating its local
gradient during optimization. Hence, these algorithms can tackle a fundamental challenge:
overcoming differences between agents’ local data distributions.
1.1 Problem Description
Consider a set of agents $\mathcal{N} = \{1, 2, \ldots, n\}$ connected by a communication network. Each
agent $i$ is associated with a local objective function $f_i : \mathbb{R}^d \to \mathbb{R}$. The global goal of the
agents is to collaboratively locate the decision variable $x \in \mathbb{R}^d$ that solves the stochastic
optimization problem:
$$\min_{x \in \mathbb{R}^d} F(x) = \frac{1}{n} \sum_{i=1}^{n} F_i(x) \tag{1}$$
where
$$F_i(x) = \mathbb{E}_S\left[f_i(x, S)\right],$$
with $S \in \mathcal{S}$ denoting an i.i.d. ergodic stochastic process describing uncertainties in the
communication system.
We assume that at each time step, agent $i$ can only query the function values of $f_i$ at
exactly one point, and can only communicate with its neighbors. Further, we assume that
the function queries are noisy: $\tilde{f}_i = f_i + \zeta_i$, with $\zeta_i$ some additive noise. Agent $i$ must then
employ this single query to construct a gradient estimate of the form $g_i(x, S_i)$.
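The precise estimator behind 1P-DSGT is introduced later in the paper; purely to illustrate the single-query constraint, the following is a minimal sketch of a generic one-point estimator in the style of Flaxman et al. (2004), where the function names and the noisy quadratic example are our own assumptions:

```python
import numpy as np

def one_point_gradient_estimate(f_noisy, x, delta, rng):
    """Generic one-point ZO gradient estimate (Flaxman et al., 2004 style).

    Draws a random direction u on the unit sphere, queries the noisy
    function exactly once at x + delta * u, and scales the single observed
    value by (d / delta) * u. The result is an unbiased gradient estimate
    of a delta-smoothed surrogate of f, but a biased, high-variance
    estimate of the true gradient of f.
    """
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)          # uniform direction on the unit sphere
    y = f_noisy(x + delta * u)      # the single permitted function query
    return (d / delta) * y * u

# Hypothetical usage: a noisy quadratic queried through additive noise.
rng = np.random.default_rng(0)
f_noisy = lambda x: float(x @ x) + 0.01 * rng.standard_normal()
g = one_point_gradient_estimate(f_noisy, np.ones(5), delta=0.1, rng=rng)
```

Note that the scaling by $d/\delta$ is exactly what inflates the variance: a single noisy function value must carry all $d$ coordinates of the gradient information.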
One efficient algorithm with a straightforward averaging scheme to solve this problem
is the gradient-tracking (GT) technique, which has been proven to achieve rates that compete
with those of its centralized counterparts. For example, the error bound obtained under a distributed
stochastic variant was found to decrease with the network size $n$ (Pu and Nedić, 2018). In
most works, this technique has been proved to converge linearly to the optimal solution with a constant
step size (Qu and Li, 2018; Nedić et al., 2017; Pu, 2020), which is also a unique attribute
among distributed stochastic gradient algorithms. It has been extended to time-
varying (undirected or directed) graphs (Nedić et al., 2017), a gossip-like method that is
communication-efficient (Pu and Nedić, 2018), and nonconvex settings (Tang et al., 2021;
Lorenzo and Scutari, 2016; Jiang et al., 2022; Lu et al., 2019). All references mentioned
above consider the case where an accurate gradient computation or an unbiased gradient
estimate with bounded variance (BV) is available.
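For concreteness, the gradient-tracking update that the above references build on can be sketched as follows. In the DSGT scheme of Pu and Nedić (2018) (notation lightly adapted here), each agent $i$ maintains an iterate $x_{i,k}$ and an auxiliary variable $y_{i,k}$ tracking the network-average gradient; with a doubly stochastic mixing matrix $W = [w_{ij}]$ and step size $\alpha > 0$,
$$x_{i,k+1} = \sum_{j=1}^{n} w_{ij}\,(x_{j,k} - \alpha\, y_{j,k}), \qquad y_{i,k+1} = \sum_{j=1}^{n} w_{ij}\, y_{j,k} + g_i(x_{i,k+1}, \xi_{i,k+1}) - g_i(x_{i,k}, \xi_{i,k}),$$
initialized with $y_{i,0} = g_i(x_{i,0}, \xi_{i,0})$. Summing the second recursion over $i$ shows that $\frac{1}{n}\sum_{i} y_{i,k} = \frac{1}{n}\sum_{i} g_i(x_{i,k}, \xi_{i,k})$ for every $k$, which is the tracking property the method is named for.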
1.2 Function Classes
Consider the following five classes of functions:

- The convex class $\mathcal{C}_{\mathrm{cvx}}$, containing all functions $f : \mathbb{R}^d \to \mathbb{R}$ that are convex.

- The strongly convex class $\mathcal{C}_{\mathrm{sc}}$, containing all functions $f : \mathbb{R}^d \to \mathbb{R}$ that are continuously differentiable and admit a constant $\mu_f$ such that
$$\langle \nabla f(x) - \nabla f(y),\, x - y \rangle \geq \mu_f \|x - y\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

- The Lipschitz continuous class $\mathcal{C}_{\mathrm{lip}}$, containing all functions $f : \mathbb{R}^d \to \mathbb{R}$ that admit a constant $L_f$ such that
$$|f(x) - f(y)| \leq L_f \|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$

- The smooth class $\mathcal{C}_{\mathrm{smo}}$, containing all functions $f : \mathbb{R}^d \to \mathbb{R}$ that are continuously differentiable and admit a constant $G_f$ such that
$$\|\nabla f(x) - \nabla f(y)\| \leq G_f \|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$

- The gradient dominated class $\mathcal{C}_{\mathrm{gd}}$, containing all functions $f : \mathbb{R}^d \to \mathbb{R}$ that are differentiable, have a global minimizer $x^*$, and admit a constant $\nu_f$ such that
$$2\nu_f \left(f(x) - f(x^*)\right) \leq \|\nabla f(x)\|^2, \quad \forall x \in \mathbb{R}^d.$$
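As a concrete illustration (our example, not from the paper): the quadratic $f(x) = \frac{1}{2}\|x\|^2$ has $\nabla f(x) = x$ and global minimizer $x^* = 0$, so it belongs to $\mathcal{C}_{\mathrm{cvx}} \cap \mathcal{C}_{\mathrm{sc}} \cap \mathcal{C}_{\mathrm{smo}} \cap \mathcal{C}_{\mathrm{gd}}$ with $\mu_f = G_f = \nu_f = 1$, while it belongs to $\mathcal{C}_{\mathrm{lip}}$ only when restricted to a bounded set.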
1.3 Related Work
An adversarial convex bandit problem in the zero-order framework is studied in Flaxman
et al. (2004); Agarwal et al. (2010); Bubeck et al. (2021). In Flaxman et al. (2004), a
one-point estimator is proposed, and a regret bound of $O(k^{3/4})$ is achieved for Lipschitz
continuous functions. In Agarwal et al. (2010), the problem is extended to the multi-point
setting, where the player queries each loss function at several points; hence, the estimates
are unbiased with respect to the true gradient and have bounded variance. When the
number of points is two, they prove regret bounds of $\tilde{O}(\sqrt{k})$ with high probability and of
$O(\log k)$ in expectation for strongly convex loss functions. When the number of points is $d + 1$,
they prove regret bounds of $O(\sqrt{k})$ and of $O(\log k)$ with strong convexity. As
mentioned in Agarwal et al. (2010) and the references therein, these bounds are
optimal even for the full-information setting. In Bubeck et al. (2021), a kernelized loss
estimator is proposed in which a generalization of Bernoulli convolutions is adopted, and an
annealing schedule for exponential weights is used to control the estimator's variance in a
focus region for dimensions higher than one; a regret bound of $O(\sqrt{k})$ is achieved.
However, all these references solve the bandit problem in a centralized framework, or in a
setting where a single agent holds all the data and is solely responsible for the decision-making.
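As a reminder of how the regret bounds quoted above and in Table 1 relate to convergence rates (our phrasing; the paper's formal definition appears in the analysis): for a decision sequence $x_1, \ldots, x_k$ in a stochastic setting such as ours, the regret takes the form
$$\mathcal{R}(k) = \sum_{t=1}^{k} \mathbb{E}\left[F(x_t)\right] - k\, F(x^*), \qquad x^* \in \arg\min_{x \in \mathbb{R}^d} F(x),$$
so a regret bound of $O(\sqrt{k})$ means that the average excess cost $\mathcal{R}(k)/k$ vanishes at rate $O(1/\sqrt{k})$.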
| Gradient estimate | OP | Function class | Consensus | Regret bound | Convergence rate | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| ZO, one-point | Cent. | $\mathcal{C}_{\mathrm{cvx}} \cap \mathcal{C}_{\mathrm{lip}}$ | – | $O(k^{3/4})$ | $O(1/k^{1/4})$ | Flaxman et al. (2004) |
| ZO, one-point | Dist. | $\mathcal{C}_{\mathrm{sc}} \cap \mathcal{C}_{\mathrm{lip}} \cap \mathcal{C}_{\mathrm{smo}}$ | None | – | $O(1/\sqrt{k})$ | Li and Assaad (2021) |
| ZO, one-point | Dist. | $\mathcal{C}_{\mathrm{sc}} \cap \mathcal{C}_{\mathrm{lip}} \cap \mathcal{C}_{\mathrm{smo}}$ | GT | $O(\sqrt{k})$ | $O(1/\sqrt{k})$ | 1P-DSGT (this work) |
| ZO, two-point | Cent. | $\mathcal{C}_{\mathrm{cvx}} \cap \mathcal{C}_{\mathrm{lip}}$ | – | $O(\sqrt{k})$ | $O(1/\sqrt{k})$ | Agarwal et al. (2010) |
| ZO, two-point | Cent. | $\mathcal{C}_{\mathrm{sc}} \cap \mathcal{C}_{\mathrm{lip}}$ | – | $O(\log k)$ | $O(\log k / k)$ | Agarwal et al. (2010) |
| ZO, two-point | Dist. | $\mathcal{C}_{\mathrm{lip}} \cap \mathcal{C}_{\mathrm{smo}}$ | None | – | $O(1/\sqrt{k \log k})$ | Tang et al. (2021) |
| ZO, two-point | Dist. | $\mathcal{C}_{\mathrm{smo}} \cap \mathcal{C}_{\mathrm{gd}}$ | None | – | $O(1/k)$ | Tang et al. (2021) |
| ZO, $(d{+}1)$-point | Cent. | $\mathcal{C}_{\mathrm{sc}} \cap \mathcal{C}_{\mathrm{lip}} \cap \mathcal{C}_{\mathrm{smo}}$ | – | $O(\log k)$ | $O(\log k / k)$ | Agarwal et al. (2010) |
| ZO, $2d$-point | Dist. | $\mathcal{C}_{\mathrm{smo}}$ | GT | – | $O(1/k)$ | Tang et al. (2021) |
| ZO, $2d$-point | Dist. | $\mathcal{C}_{\mathrm{smo}} \cap \mathcal{C}_{\mathrm{gd}}$ | GT | – | $O(\lambda^k)$ | Tang et al. (2021) |
| ZO, kernel-based | Cent. | $\mathcal{C}_{\mathrm{cvx}} \cap \mathcal{C}_{\mathrm{lip}}$ | – | $O(\sqrt{k})$ | $O(1/\sqrt{k})$ | Bubeck et al. (2021) |
| FO, unbiased/BV | Dist. | $\mathcal{C}_{\mathrm{sc}} \cap \mathcal{C}_{\mathrm{smo}}$ | GT | – | $O(\lambda^k)$ | Pu and Nedić (2018); Xin et al. (2019); Pu (2020) |
| FO, unbiased/BV | Dist. | $\mathcal{C}_{\mathrm{smo}}$ | GT | – | $O(1/\sqrt{k})$ | Lu et al. (2019) |

Table 1: Convergence rates for various algorithms related to our work, classified according
to the nature of the gradient estimate, whether the optimization problem (OP) is centralized
(Cent.) or distributed (Dist.), the assumptions on the objective function, whether consensus is
aimed at or not, and the achieved regret bound and convergence rate ($\lambda \in (0,1)$ denotes
a linear rate).
Li and Assaad (2021) consider a distributed stochastic optimization problem in networks
under the assumptions of smoothness and concavity. They solve it using a stochastic gradient
descent method, where the exchange between nodes is limited to a partial trade of the
observations of their local utilities, which each agent then uses to estimate the global utility.
With this estimated global utility, each agent can construct a one-point gradient estimator
based on a stochastic perturbation. The convergence rate is proved to scale as $O(1/\sqrt{k})$, under
the further assumption of a strongly concave objective function. We must remark that
although Li and Assaad (2021) provide a distributed solution, the optimization variable there is
a scalar, not a vector, and each agent can only update its own scalar; hence, there is no consensus
to be aimed at, as in our case. In the context of derivative-free stochastic centralized
optimization, Duchi et al. (2015); Jamieson et al. (2012); Shamir (2013) have all derived
several lower bounds confirming that the convergence rate cannot be better than $O(1/\sqrt{k})$
after $k$ iterations for strongly convex and smooth objective functions.
All of Qu and Li (2018); Lorenzo and Scutari (2016); Nedić et al. (2017); Shi et al. (2015);
Li et al. (2022); Jiang et al. (2022) present a distributed gradient-tracking method that
employs local auxiliary variables to track the average of all agents' gradients, assuming
the availability of accurate gradient information. In both Li et al. (2022) and Jiang et al. (2022),
each local objective function is an average of finitely many instantaneous functions. Thus, they
combine the gradient-tracking algorithm with the stochastic averaging gradient technique
(Li et al., 2022) (smooth convex optimization) or with variance-reduction techniques (Jiang
et al., 2022) (smooth nonconvex optimization). At each iteration, they randomly select
only one gradient of an instantaneous function to approximate the local batch gradient. In
Li et al. (2022), this is an unbiased estimate of the local gradient, whereas, in Jiang et al.
(2022), it is biased. Nevertheless, both references assume access to an exact gradient oracle.
In Tang et al. (2021), two algorithms are developed for nonconvex multi-agent optimization.
One is a gradient-tracking algorithm based on a noise-free $2d$-point estimator of the
gradient, achieving a rate of $O(1/k)$ under smoothness assumptions and a linear rate under an
extra $\nu$-gradient-dominated objective assumption. The other is based on a two-point unbiased
estimator without global gradient tracking and achieves a rate of $O(1/\sqrt{k \log k})$ under
Lipschitz continuity and smoothness conditions and $O(1/k)$ under an extra gradient-dominated
function structure. Both of these gradient estimators are unbiased with respect to the exact
gradient and even have vanishing variance.
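For contrast with the one-point setting studied in this paper, the following is a minimal sketch of a symmetric two-point estimator of the kind analyzed in Agarwal et al. (2010) and Tang et al. (2021); the helper name and scaling convention are our own assumptions, and the exact constants vary across references:

```python
import numpy as np

def two_point_gradient_estimate(f, x, delta, rng):
    """Symmetric two-point ZO gradient estimate.

    Queries f at two symmetric perturbations of x along a random unit
    direction u. The finite difference cancels the zeroth-order term,
    which is why two-point estimates enjoy much lower variance than
    one-point ones -- at the cost of a second query per iteration,
    which is precisely what our single-query setting rules out.
    """
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # uniform direction on the unit sphere
    return (d / (2.0 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u
```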
All of Pu and Nedić (2018); Xin et al. (2019); Pu (2020); Lu et al. (2019) assume access
to local stochastic first-order oracles where the gradient is unbiased and has bounded
variance. The first three additionally assume smooth and strongly convex local
objectives, and they all accomplish a linear convergence rate under a constant step size.
Pu and Nedić (2018) propose a distributed stochastic gradient-tracking method (DSGT)
and a gossip-like stochastic gradient-tracking method (GSGT) where, at each round, each
agent wakes up with a certain probability. Further, in Pu and Nedić (2018), when the
step size is diminishing, the convergence rate is $O(1/k)$. Xin et al. (2019) employ
a gradient-tracking algorithm for agents communicating over a strongly connected graph.
Pu (2020) introduces a robust gradient-tracking method (R-Push-Pull) in the context of
noisy information exchange between agents and with a directed network topology. In Lu
et al. (2019), a gradient-tracking-based nonconvex stochastic decentralized (GNSD)
algorithm is proposed for nonconvex optimization problems in machine learning, and it
fulfills a convergence rate of $O(1/\sqrt{k})$ under a constant step size.
1.4 Contributions
In this paper, we consider smooth and convex local objectives, and we extend the gradient-
tracking algorithm to the case where no direct estimate of the gradient is available. Under
the realistic assumption that each agent only has access to a single noisy function value at
each time step, without necessarily knowing the form of this function, we propose a one-point
estimator in a stochastic framework. Naturally, one-point estimators are biased with respect
to the true gradient and suffer from high variance (Liu et al., 2020); hence, they do not match
the assumptions for convergence presented in Tang et al. (2021); Pu and Nedić (2018); Xin
et al. (2019).