Zero-Order One-Point Estimate with Distributed Stochastic
Gradient-Tracking Technique
Elissa Mhanna elissa.mhanna@centralesupelec.fr
Mohamad Assaad mohamad.assaad@centralesupelec.fr
Laboratoire des Signaux & Systèmes
CentraleSupélec
3 rue Joliot Curie
91190 Gif-sur-Yvette, France
Abstract
In this work, we consider a distributed multi-agent stochastic optimization problem, where
each agent holds a local objective function that is smooth and convex, and that is subject
to a stochastic process. The goal is for all agents to collaborate to find a common solution
that optimizes the sum of these local functions. With the practical assumption that agents
can only obtain noisy numerical function queries at exactly one point at a time, we extend
the distributed stochastic gradient-tracking method to the bandit setting, where we do not
have an estimate of the gradient, and we introduce a zero-order (ZO) one-point estimate
(1P-DSGT). We analyze the convergence of this novel technique for smooth and convex
objectives using stochastic approximation tools, and we prove that it converges almost
surely to the optimum. We then study the convergence rate when the objectives are
additionally strongly convex. We obtain a rate of $O(1/\sqrt{k})$ after a sufficient number of
iterations $k > K_2$, which is typically optimal for techniques utilizing one-point estimators.
We also provide a regret bound of $O(\sqrt{k})$, which compares favorably to the
aforementioned techniques. We further illustrate the usefulness of the proposed technique
using numerical experiments.
Keywords: Gradient-free optimization, stochastic framework, distributed algorithms,
consensus, gradient tracking, convergence analysis
1. Introduction
Gradient-free optimization is a long-standing topic in the research community; however,
interest in it has grown recently, especially in machine learning applications, where
optimization problems are typically solved with gradient descent algorithms. Successful
applications of gradient-free methods in machine learning include competing with an adversary
in bandit problems (Flaxman et al., 2004; Agarwal et al., 2010), generating adversarial
attacks for deep learning models (Chen et al., 2019; Liu et al., 2019) and reinforcement
learning (Vemula et al., 2019). Gradient-free optimization aims to solve optimization prob-
lems with only functional (zero-order) information rather than first-order (FO) gradient
information. These techniques are essential in settings where explicit gradient computation
may be impractical, expensive, or impossible. Such settings include high-dimensional data,
function differentiation that strains time or resource budgets, and cost functions lacking a
closed form. Zero-order information-based methods include direct search methods
(Golovin et al., 2019), 1-point methods (Li and Assaad, 2021; Flaxman et al., 2004; Bach
and Perchet, 2016; Vemula et al., 2019) where the function is evaluated at a single point
with some randomization to estimate the gradient, two- or multi-point methods (Duchi et al.,
2015; Nesterov and Spokoiny, 2017; Gorbunov et al., 2018; Bach and Perchet, 2016; Ha-
jinezhad et al., 2019; Kumar Sahu et al., 2018; Agarwal et al., 2010; Chen et al., 2019; Liu
et al., 2019; Vemula et al., 2019), where functional differences at several points are employed
for estimation, and other methods, such as those using the sign information of gradient
estimates (Liu et al., 2019).
Another area of great interest is distributed multi-agent optimization, where agents
try to cooperatively solve a problem with information exchange only limited to immediate
neighbors in the network. Distributed computing and data storing are particularly essential
in fields such as vehicular communications and coordination, data processing and distributed
control in sensor networks (Shi and Eryilmaz, 2020), big-data analytics (Daneshmand et al.,
2015), and federated learning (McMahan et al., 2017). More specifically, one direction
of research integrates (sub)gradient-based methods with a consensus/averaging strategy;
each agent incorporates one or multiple consensus steps alongside evaluating its local
gradient during optimization. Hence, these algorithms can tackle a fundamental challenge:
overcoming differences between agents’ local data distributions.
1.1 Problem Description
Consider a set of agents $\mathcal{N} = \{1, 2, \ldots, n\}$ connected by a communication network. Each
agent $i$ is associated with a local objective function $f_i : \mathbb{R}^d \to \mathbb{R}$. The global goal of the
agents is to collaboratively locate the decision variable $x \in \mathbb{R}^d$ that solves the stochastic
optimization problem:
$$\min_{x \in \mathbb{R}^d} F(x) = \frac{1}{n} \sum_{i=1}^{n} F_i(x) \tag{1}$$
where
$$F_i(x) = \mathbb{E}_S\left[f_i(x, S)\right],$$
with $S \in \mathcal{S}$ denoting an i.i.d. ergodic stochastic process describing uncertainties in the
communication system.
We assume that at each time step, agent $i$ can only query the function values of $f_i$ at
exactly one point, and can only communicate with its neighbors. Further, we assume that
the function queries are noisy: $\tilde{f}_i = f_i + \zeta_i$, with $\zeta_i$ some additive noise. Agent $i$ must then
employ this single query to construct a gradient estimate of the form $g_i(x, S_i)$.
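The precise estimator behind 1P-DSGT is introduced later in the paper; purely to illustrate the single-query constraint, the following is a minimal sketch of a generic one-point estimator in the style of Flaxman et al. (2004), where the function names and the noisy quadratic example are our own assumptions:

```python
import numpy as np

def one_point_gradient_estimate(f_noisy, x, delta, rng):
    """Generic one-point ZO gradient estimate (Flaxman et al., 2004 style).

    Draws a random direction u on the unit sphere, queries the noisy
    function exactly once at x + delta * u, and scales the single observed
    value by (d / delta) * u. The result is an unbiased gradient estimate
    of a delta-smoothed surrogate of f, but a biased, high-variance
    estimate of the true gradient of f.
    """
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)          # uniform direction on the unit sphere
    y = f_noisy(x + delta * u)      # the single permitted function query
    return (d / delta) * y * u

# Hypothetical usage: a noisy quadratic queried through additive noise.
rng = np.random.default_rng(0)
f_noisy = lambda x: float(x @ x) + 0.01 * rng.standard_normal()
g = one_point_gradient_estimate(f_noisy, np.ones(5), delta=0.1, rng=rng)
```

Note that the scaling by $d/\delta$ is exactly what inflates the variance: a single noisy function value must carry all $d$ coordinates of the gradient information.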
One efficient algorithm with a straightforward averaging scheme to solve this problem
is the gradient-tracking (GT) technique, which has been proven to achieve rates that compete
with those of its centralized counterparts. For example, the error bound obtained under a distributed
stochastic variant was found to decrease with the network size $n$ (Pu and Nedić, 2018). In
most works, this technique has been proved to converge linearly to the optimal solution with a constant
step size (Qu and Li, 2018; Nedić et al., 2017; Pu, 2020), which is also a unique attribute
among distributed stochastic gradient algorithms. It has been extended to time-
varying (undirected or directed) graphs (Nedić et al., 2017), a gossip-like method that is
communication-efficient (Pu and Nedić, 2018), and nonconvex settings (Tang et al., 2021;
Lorenzo and Scutari, 2016; Jiang et al., 2022; Lu et al., 2019). All references mentioned
above consider the case where an accurate gradient computation or an unbiased gradient
estimate with bounded variance (BV) is available.
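For concreteness, the gradient-tracking update that the above references build on can be sketched as follows. In the DSGT scheme of Pu and Nedić (2018) (notation lightly adapted here), each agent $i$ maintains an iterate $x_{i,k}$ and an auxiliary variable $y_{i,k}$ tracking the network-average gradient; with a doubly stochastic mixing matrix $W = [w_{ij}]$ and step size $\alpha > 0$,
$$x_{i,k+1} = \sum_{j=1}^{n} w_{ij}\,(x_{j,k} - \alpha\, y_{j,k}), \qquad y_{i,k+1} = \sum_{j=1}^{n} w_{ij}\, y_{j,k} + g_i(x_{i,k+1}, \xi_{i,k+1}) - g_i(x_{i,k}, \xi_{i,k}),$$
initialized with $y_{i,0} = g_i(x_{i,0}, \xi_{i,0})$. Summing the second recursion over $i$ shows that $\frac{1}{n}\sum_{i} y_{i,k} = \frac{1}{n}\sum_{i} g_i(x_{i,k}, \xi_{i,k})$ for every $k$, which is the tracking property the method is named for.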
1.2 Function Classes
Consider the following five classes of functions:

- The convex class $\mathcal{C}_{\mathrm{cvx}}$, containing all functions $f : \mathbb{R}^d \to \mathbb{R}$ that are convex.

- The strongly convex class $\mathcal{C}_{\mathrm{sc}}$, containing all functions $f : \mathbb{R}^d \to \mathbb{R}$ that are continuously differentiable and admit a constant $\mu_f$ such that
$$\langle \nabla f(x) - \nabla f(y),\, x - y \rangle \geq \mu_f \|x - y\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

- The Lipschitz continuous class $\mathcal{C}_{\mathrm{lip}}$, containing all functions $f : \mathbb{R}^d \to \mathbb{R}$ that admit a constant $L_f$ such that
$$|f(x) - f(y)| \leq L_f \|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$

- The smooth class $\mathcal{C}_{\mathrm{smo}}$, containing all functions $f : \mathbb{R}^d \to \mathbb{R}$ that are continuously differentiable and admit a constant $G_f$ such that
$$\|\nabla f(x) - \nabla f(y)\| \leq G_f \|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$

- The gradient dominated class $\mathcal{C}_{\mathrm{gd}}$, containing all functions $f : \mathbb{R}^d \to \mathbb{R}$ that are differentiable, have a global minimizer $x^*$, and admit a constant $\nu_f$ such that
$$2\nu_f \left(f(x) - f(x^*)\right) \leq \|\nabla f(x)\|^2, \quad \forall x \in \mathbb{R}^d.$$
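As a concrete illustration (our example, not from the paper): the quadratic $f(x) = \frac{1}{2}\|x\|^2$ has $\nabla f(x) = x$ and global minimizer $x^* = 0$, so it belongs to $\mathcal{C}_{\mathrm{cvx}} \cap \mathcal{C}_{\mathrm{sc}} \cap \mathcal{C}_{\mathrm{smo}} \cap \mathcal{C}_{\mathrm{gd}}$ with $\mu_f = G_f = \nu_f = 1$, while it belongs to $\mathcal{C}_{\mathrm{lip}}$ only when restricted to a bounded set.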
1.3 Related Work
An adversarial convex bandit problem in the zero-order framework is studied in Flaxman
et al. (2004); Agarwal et al. (2010); Bubeck et al. (2021). In Flaxman et al. (2004), a
one-point estimator is proposed, and a regret bound of $O(k^{3/4})$ is achieved for Lipschitz
continuous functions. In Agarwal et al. (2010), the problem is extended to the multi-point
setting, where the player queries each loss function at several points; hence, the estimates
are unbiased with respect to the true gradient and have bounded variance. When the
number of points is two, they prove regret bounds of $\tilde{O}(\sqrt{k})$ with high probability and of
$O(\log k)$ in expectation for strongly convex loss functions. When the number of points is $d + 1$,
they prove regret bounds of $O(\sqrt{k})$ and of $O(\log k)$ with strong convexity. As
mentioned in Agarwal et al. (2010) and the references therein, these bounds are
optimal even for the full-information setting. In Bubeck et al. (2021), a kernelized loss
estimator is proposed in which a generalization of Bernoulli convolutions is adopted, and an
annealing schedule for exponential weights is used to control the estimator's variance in a
focus region for dimensions higher than one; a regret bound of $O(\sqrt{k})$ is achieved.
However, all these references solve the bandit problem in a centralized framework, or in a
setting where a single agent holds all the data and is solely responsible for the decision-making.
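As a reminder of how the regret bounds quoted above and in Table 1 relate to convergence rates (our phrasing; the paper's formal definition appears in the analysis): for a decision sequence $x_1, \ldots, x_k$ in a stochastic setting such as ours, the regret takes the form
$$\mathcal{R}(k) = \sum_{t=1}^{k} \mathbb{E}\left[F(x_t)\right] - k\, F(x^*), \qquad x^* \in \arg\min_{x \in \mathbb{R}^d} F(x),$$
so a regret bound of $O(\sqrt{k})$ means that the average excess cost $\mathcal{R}(k)/k$ vanishes at rate $O(1/\sqrt{k})$.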
| Gradient estimate | OP | Function class | Consensus | Regret bound | Convergence rate | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| ZO, one-point | Cent. | $\mathcal{C}_{\mathrm{cvx}} \cap \mathcal{C}_{\mathrm{lip}}$ | – | $O(k^{3/4})$ | $O(1/k^{1/4})$ | Flaxman et al. (2004) |
| ZO, one-point | Dist. | $\mathcal{C}_{\mathrm{sc}} \cap \mathcal{C}_{\mathrm{lip}} \cap \mathcal{C}_{\mathrm{smo}}$ | None | – | $O(1/\sqrt{k})$ | Li and Assaad (2021) |
| ZO, one-point | Dist. | $\mathcal{C}_{\mathrm{sc}} \cap \mathcal{C}_{\mathrm{lip}} \cap \mathcal{C}_{\mathrm{smo}}$ | GT | $O(\sqrt{k})$ | $O(1/\sqrt{k})$ | 1P-DSGT (this work) |
| ZO, two-point | Cent. | $\mathcal{C}_{\mathrm{cvx}} \cap \mathcal{C}_{\mathrm{lip}}$ | – | $O(\sqrt{k})$ | $O(1/\sqrt{k})$ | Agarwal et al. (2010) |
| ZO, two-point | Cent. | $\mathcal{C}_{\mathrm{sc}} \cap \mathcal{C}_{\mathrm{lip}}$ | – | $O(\log k)$ | $O(\log k / k)$ | Agarwal et al. (2010) |
| ZO, two-point | Dist. | $\mathcal{C}_{\mathrm{lip}} \cap \mathcal{C}_{\mathrm{smo}}$ | None | – | $O(1/\sqrt{k \log k})$ | Tang et al. (2021) |
| ZO, two-point | Dist. | $\mathcal{C}_{\mathrm{smo}} \cap \mathcal{C}_{\mathrm{gd}}$ | None | – | $O(1/k)$ | Tang et al. (2021) |
| ZO, $(d{+}1)$-point | Cent. | $\mathcal{C}_{\mathrm{sc}} \cap \mathcal{C}_{\mathrm{lip}} \cap \mathcal{C}_{\mathrm{smo}}$ | – | $O(\log k)$ | $O(\log k / k)$ | Agarwal et al. (2010) |
| ZO, $2d$-point | Dist. | $\mathcal{C}_{\mathrm{smo}}$ | GT | – | $O(1/k)$ | Tang et al. (2021) |
| ZO, $2d$-point | Dist. | $\mathcal{C}_{\mathrm{smo}} \cap \mathcal{C}_{\mathrm{gd}}$ | GT | – | $O(\lambda^k)$ | Tang et al. (2021) |
| ZO, kernel-based | Cent. | $\mathcal{C}_{\mathrm{cvx}} \cap \mathcal{C}_{\mathrm{lip}}$ | – | $O(\sqrt{k})$ | $O(1/\sqrt{k})$ | Bubeck et al. (2021) |
| FO, unbiased/BV | Dist. | $\mathcal{C}_{\mathrm{sc}} \cap \mathcal{C}_{\mathrm{smo}}$ | GT | – | $O(\lambda^k)$ | Pu and Nedić (2018); Xin et al. (2019); Pu (2020) |
| FO, unbiased/BV | Dist. | $\mathcal{C}_{\mathrm{smo}}$ | GT | – | $O(1/\sqrt{k})$ | Lu et al. (2019) |

Table 1: Convergence rates for various algorithms related to our work, classified according
to the nature of the gradient estimate, whether the optimization problem (OP) is centralized
(Cent.) or distributed (Dist.), the assumptions on the objective function, whether consensus is
aimed at or not, and the achieved regret bound and convergence rate ($\lambda \in (0,1)$ denotes
a linear rate).
Li and Assaad (2021) consider a distributed stochastic optimization problem in networks
under the assumptions of smoothness and concavity. They solve it using a stochastic gradient
descent method, where the exchange between nodes is limited to a partial trade of the
observations of their local utilities, which each agent then uses to estimate the global utility.
With this estimated global utility, each agent can construct a one-point gradient estimator
based on a stochastic perturbation. The convergence rate is proved to scale as $O(1/\sqrt{k})$, under
the further assumption of a strongly concave objective function. We must remark that
although Li and Assaad (2021) provide a distributed solution, the optimization variable there is
a scalar, not a vector, and each agent can only update its own scalar; hence, there is no consensus
to be aimed at, as in our case. In the context of derivative-free stochastic centralized
optimization, Duchi et al. (2015); Jamieson et al. (2012); Shamir (2013) have all derived
several lower bounds confirming that the convergence rate cannot be better than $O(1/\sqrt{k})$
after $k$ iterations for strongly convex and smooth objective functions.
All of Qu and Li (2018); Lorenzo and Scutari (2016); Nedić et al. (2017); Shi et al. (2015);
Li et al. (2022); Jiang et al. (2022) present a distributed gradient-tracking method that
employs local auxiliary variables to track the average of all agents' gradients, assuming
the availability of accurate gradient information. In both Li et al. (2022) and Jiang et al. (2022),
each local objective function is an average of finitely many instantaneous functions. Thus, they
combine the gradient-tracking algorithm with the stochastic averaging gradient technique
(Li et al., 2022) (smooth convex optimization) or with variance-reduction techniques (Jiang
et al., 2022) (smooth nonconvex optimization). At each iteration, they randomly select
only one gradient of an instantaneous function to approximate the local batch gradient. In
Li et al. (2022), this is an unbiased estimate of the local gradient, whereas, in Jiang et al.
(2022), it is biased. Nevertheless, both references assume access to an exact gradient oracle.
In Tang et al. (2021), two algorithms are developed for nonconvex multi-agent optimization.
One is a gradient-tracking algorithm based on a noise-free $2d$-point estimator of the
gradient, achieving a rate of $O(1/k)$ under smoothness assumptions and a linear rate under an
extra $\nu$-gradient-dominated objective assumption. The other is based on a two-point unbiased
estimator without global gradient tracking and achieves a rate of $O(1/\sqrt{k \log k})$ under
Lipschitz continuity and smoothness conditions and $O(1/k)$ under an extra gradient-dominated
function structure. Both of these gradient estimators are unbiased with respect to the exact
gradient and even have vanishing variance.
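For contrast with the one-point setting studied in this paper, the following is a minimal sketch of a symmetric two-point estimator of the kind analyzed in Agarwal et al. (2010) and Tang et al. (2021); the helper name and scaling convention are our own assumptions, and the exact constants vary across references:

```python
import numpy as np

def two_point_gradient_estimate(f, x, delta, rng):
    """Symmetric two-point ZO gradient estimate.

    Queries f at two symmetric perturbations of x along a random unit
    direction u. The finite difference cancels the zeroth-order term,
    which is why two-point estimates enjoy much lower variance than
    one-point ones -- at the cost of a second query per iteration,
    which is precisely what our single-query setting rules out.
    """
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # uniform direction on the unit sphere
    return (d / (2.0 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u
```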
All of Pu and Nedić (2018); Xin et al. (2019); Pu (2020); Lu et al. (2019) assume access
to local stochastic first-order oracles where the gradient is unbiased and has bounded
variance. The first three additionally assume smooth and strongly convex local
objectives, and they all accomplish a linear convergence rate under a constant step size.
Pu and Nedić (2018) propose a distributed stochastic gradient-tracking method (DSGT)
and a gossip-like stochastic gradient-tracking method (GSGT) where, at each round, each
agent wakes up with a certain probability. Further, in Pu and Nedić (2018), when the
step size is diminishing, the convergence rate is $O(1/k)$. Xin et al. (2019) employ
a gradient-tracking algorithm for agents communicating over a strongly connected graph.
Pu (2020) introduces a robust gradient-tracking method (R-Push-Pull) in the context of
noisy information exchange between agents and with a directed network topology. In Lu
et al. (2019), a gradient-tracking-based nonconvex stochastic decentralized (GNSD)
algorithm is proposed for nonconvex optimization problems in machine learning, and it
fulfills a convergence rate of $O(1/\sqrt{k})$ under a constant step size.
1.4 Contributions
In this paper, we consider smooth and convex local objectives, and we extend the gradient-
tracking algorithm to the case where no direct estimate of the gradient is available. Under
the realistic assumption that each agent only has access to a single noisy function value at
each time step, without necessarily knowing the form of this function, we propose a one-point
estimator in a stochastic framework. Naturally, one-point estimators are biased with respect
to the true gradient and suffer from high variance (Liu et al., 2020); hence, they do not match
the assumptions for convergence presented in Tang et al. (2021); Pu and Nedić (2018); Xin
et al. (2019).