Double Averaging and Gradient Projection:
Convergence Guarantees for Decentralized
Constrained Optimization
Firooz Shahriari-Mehr, and Ashkan Panahi
Abstract
We consider a generic decentralized constrained optimization problem over static, directed communication
networks, where each agent has exclusive access to only one convex, differentiable, local objective term and one
convex constraint set. For this setup, we propose a novel decentralized algorithm, called DAGP (Double Averaging
and Gradient Projection), based on local gradients, projection onto local constraints, and local averaging. We
achieve global optimality through a novel distributed tracking technique we call distributed null projection. Further,
we show that DAGP can be used to solve unconstrained problems with non-differentiable objective terms via a problem reduction scheme. Assuming only smoothness of the objective terms, we study the convergence of
DAGP and establish sub-linear rates of convergence in terms of feasibility, consensus, and optimality, with no extra
assumption (e.g. strong convexity). For the analysis, we forego the difficulties of selecting Lyapunov functions by
proposing a new methodology of convergence analysis in optimization problems, which we refer to as aggregate
lower-bounding. To demonstrate the generality of this method, we also provide an alternative convergence proof for
the standard gradient descent algorithm with smooth functions. Finally, we present numerical results demonstrating
the effectiveness of our proposed method in both constrained and unconstrained problems. In particular, we propose a DAGP-based distributed scheme for the optimal transport problem with superior performance and speed.
Index Terms
Constrained optimization, convergence analysis, convex optimization, distributed optimization, decentralized
optimal transport, multi-agent systems.
I. INTRODUCTION
A wide variety of applications involving multi-agent systems concern distributed optimization problems.
In these applications, the system consists of $M$ interacting agents designed to efficiently minimize a global objective function $f(\mathbf{x})$ of a common set of $m$ optimization variables $\mathbf{x} \in \mathbb{R}^m$ in the presence of a global constraint set $S$. In many cases, the global function and constraint are only partially available to each agent. A fairly general framework in such problems is an optimization problem of the following form
$$\min_{\mathbf{x} \in \mathbb{R}^m} \; \frac{1}{M} \sum_{v=1}^{M} f_v(\mathbf{x}) \quad \text{s.t.} \quad \mathbf{x} \in \bigcap_{v=1}^{M} S_v, \tag{1}$$
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version
may no longer be accessible. This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP)
funded by the Knut and Alice Wallenberg Foundation. A short version of this work was presented at the 60th IEEE Conference on Decision
and Control, Austin, Texas, USA, December 2021 [1].
The authors are with the Department of Computer Science and Engineering, Chalmers University of Technology, Göteborg, Sweden (e-mails: {Firooz, Ashkan.panahi}@chalmers.se).
where the local objective functions $f_v(\mathbf{x})$ and the local constraint sets $S_v$ are exclusively available to their corresponding agents. There are multiple motivating applications for this setup. Parameter estimation and resource allocation in sensor networks [2], [3], [4, Chapter 10], fitting models to big datasets [5], [6], smart grid control [7], and optimal transport [8] are but a few examples.
The standard single-machine optimization algorithms such as projected gradient descent can solve many
instances of the problem in (1). However, in large-scale or privacy-sensitive applications, only distributed
optimization methods are applicable. In a distributed setup, two different architectures can be considered:
centralized, where a central node (agent) coordinates all other agents (workers), and the decentralized
architecture, where there is no central coordinator. To avoid master failure and bottleneck problems,
decentralized architectures have attracted more attention in the past decade [9]–[12].
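For illustration, the following minimal Python sketch sets up a toy instance of (1) with quadratic local objectives and box constraints and solves it with single-machine projected gradient descent; the instance and helper names (e.g., `project_intersection`) are hypothetical and only meant to make the centralized baseline concrete, not to describe DAGP.

```python
import numpy as np

# Toy instance of (1): M agents, each with f_v(x) = 0.5*||x - c_v||^2
# and a box constraint S_v. The intersection of boxes is itself a box,
# so the centralized projection is easy in this illustrative case.
rng = np.random.default_rng(0)
M, m = 5, 3
centers = rng.normal(size=(M, m))
lows = np.full((M, m), -0.5) + 0.1 * rng.random((M, m))
highs = np.full((M, m), 0.5) - 0.1 * rng.random((M, m))

def grad_global(x):
    # Gradient of (1/M) * sum_v 0.5*||x - c_v||^2.
    return np.mean(x - centers, axis=0)

def project_intersection(x):
    # Projection onto the intersection of all boxes (again a box).
    return np.clip(x, lows.max(axis=0), highs.min(axis=0))

x = np.zeros(m)
step = 1.0          # the global objective is 1-smooth here
for _ in range(200):
    x = project_intersection(x - step * grad_global(x))

print("centralized solution:", x)
```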
In many decentralized optimization schemes, each node $v$ operates on its own realization $\mathbf{x}^v$ of the optimization variables. The nodes are to find a common minimizer by communicating their information
through a communication network and computing weighted averages among the neighbors, represented
by so-called gossip matrices [13]. Depending on the nature of communication, the network is represented
by either an undirected or directed graph where the edges denote available communication links between
the agents. In this paper, we consider gossip matrices with directed graphs.
During the past decade, there has been prolific research on decentralized, unconstrained optimization
algorithms [12], [14]. Many well-known algorithms are developed with provable convergence rates under
different assumptions on the objective functions, communication graphs, and the step-size. A state-of-the-art example is the Push-Pull method [15], [16], utilizing two gossip matrices and a fixed step-size¹ for handling directed communication graphs. The optimality gap in Push-Pull is proven to decrease at a linear rate when the local objective functions are smooth and strongly convex. In the absence of strong convexity and smoothness, no achievable rate of convergence is known for directed graphs.
Compared to the unconstrained setup, the constrained problem, especially with individual constraints as
in (1), is less studied. Several papers consider a simplification by a shared (non-distributed) constraint and
utilize its orthogonal projection operator to find a feasible solution [17]–[19]. Recently, [20] attempted
to address (1) by adapting the Push-Pull algorithm. They demonstrate that the main underlying strategy
(known as gradient tracking) of Push-Pull may not be applied with a fixed step size. However, with a
diminishing step-size they show that the feasibility gap decreases at a sub-linear rate, under the assumption
of smooth and strongly convex functions, but do not discuss the optimality gap. Alternative strategies with
a fixed step size are yet to be discovered. More generally, algorithms with generic convergence properties
in the distributed constrained framework of (1) are still lacking. One main reason is that in the constrained
case, the standard method of analysis based on Lyapunov (potential) functions becomes complicated. A
decaying step-size can simplify the analysis, but it often results in a dramatic reduction in the convergence
speed [17]. In this paper, we address the above limitations in the distributed convex optimization literature.
Our main contributions are summarized as follows.
A. Contributions
• We propose a novel algorithm, called Double Averaging and Gradient Projection (DAGP), that solves (1). Our method considers a directed communication graph with individual constraints at each node. DAGP leverages two gossip matrices and a fixed step-size, ensuring fast convergence to the optimal solution.

¹ In general, algorithms may use a fixed or a diminishing step size. In practice, fixed step sizes are often superior, as diminishing step sizes require a careful design of a step-size schedule, which substantially slows down the underlying algorithm.
• We introduce a novel general convergence analysis framework that we refer to as aggregate lower-bounding (ALB). It foregoes the need for Lyapunov functions and decaying step-sizes. We showcase the power of ALB by presenting an alternative analysis of Gradient Descent (GD).
• Using ALB, we prove that under smoothness of the objective terms, the feasibility gap of DAGP vanishes with $O(1/K)$, and the optimality gap decays with $O(1/K)$, where $K$ denotes the total number of iterations. ALB lets us set aside restrictive assumptions such as identical local constraints, a decaying step-size, and strong convexity.
• We present a reduction of decentralized unconstrained optimization problems with a non-smooth objective to an equivalent constrained problem as in (1), with a linear (hence smooth) objective. Using this reduction, DAGP can also solve generic non-smooth decentralized optimization problems. In this way, we provide the first convergence guarantees for a decentralized setup without smoothness and strong convexity assumptions.
• We explore the performance of DAGP in practical applications, particularly in the context of the optimal transport (OT) problem. We present the first decentralized OT formulation, efficiently computing exact sparse solutions for large-scale instances.
B. Literature Review
Various decentralized optimization methods have been proposed in the literature for different scenarios
and underlying assumptions. For a comprehensive survey, see [9]–[12]. In this section, we review first-
order decentralized optimization algorithms for both constrained and unconstrained optimization problems.
We ignore other possible extensions, such as consideration of time delay [21], local functions with
finite-sum structure [11], [22], [23], message compression [24], [25], time-varying graphs [26], [27], or
continuous-time methods [28], as they fall outside the scope of this paper. We also ignore decentralized
dual-based methods as they require computationally costly oracles, such as the gradient of conjugate
functions, which may not be feasible in generic applications.
1) Decentralized unconstrained optimization methods: Earlier studies on decentralized optimization
have considered undirected communication graphs [29], [30]. In this setup, doubly-stochastic gossip
matrices compatible with the graph structure can be constructed easily. These methods are restricted
to a decaying step-size for convergence. As a result, the provable convergence rate is $O(1/\sqrt{K})$ for the setup with convex and smooth objective functions, and $O(1/K)$ when the objective functions become strongly convex. The next group of algorithms tracks the gradient of the global function over iterations
for convergence to the optimal solution using fixed step-sizes. Gradient tracking can be implemented
based on the dynamic average consensus protocol [31]. Pioneered by [32], the optimization methods that
employ this approach are DIGing [27], EXTRA [33], and NEXT [34]. These methods use fixed step-sizes
and achieve linear convergence, that is, $O(\mu^K)$ for a $\mu \in (0,1)$, in a strongly-convex and smooth setting.
The construction of doubly-stochastic gossip matrices compatible with directed graphs is not straight-
forward [35]. Therefore, practical optimization algorithms over directed graphs use row-stochastic or
column-stochastic gossip matrices. This modification makes it harder to achieve consensus and an optimal solution. In response, the push-sum protocol [36] was introduced. The methods [26], [37] based on the push-sum protocol still require a decaying step-size for convergence. [38], [27], [39], and [40] combine the push-sum protocol with gradient tracking techniques, respectively proposing the DEXTRA, Push-DIGing, SONATA, and ADD-OPT algorithms, which work with fixed step-sizes. These algorithms
achieve a linear rate of convergence in a smooth and strongly-convex setting. All these methods use
only column-stochastic gossip matrices. Recently, the Push-Pull algorithm [15], [16] has been proposed, which utilizes both row-stochastic and column-stochastic matrices. It employs a fixed step-size and has a linear convergence rate in the strongly-convex and smooth setting. An accelerated version of the Push-Pull algorithm is proposed in [41]. Our algorithm has a similar requirement on the underlying gossip matrices to the Push-Pull algorithm.
2) Decentralized constrained optimization methods: We focus on methods that consider orthogonal
projection operators onto the constraint sets. These methods can be divided into two groups based on their
constraint structure. The first group [17]–[19], [42] considers an identical constraint $S$ available to every agent. These methods differ in the graph connectivity assumption [17], the optimization approach [19], the global objective function [43], or the characteristics of the functions [18].
The second group corresponds to the setup where each agent knows its own local constraint $S_v$. [44] and
[45] proposed projection onto local constraint sets in each iteration. They considered undirected graphs
and proved a sub-linear rate of convergence for the optimality gap using decaying step-sizes, only in
two special cases: when the constraints are identical, or when the graph is fully connected [45]. No rate
for the feasibility gap is established. [46] extends [44] and [45] to the setup with noisy communication
links and noisy (sub)gradients. DDPS [17] uses two row-stochastic and column-stochastic matrices, but
it requires a decaying step-size and assumes identical constraints. [20] modifies the Push-Pull algorithm
for this problem. They show that a fixed step-size prevents their approach from reaching a fixed point;
thus, they employ a decaying step-size. To our knowledge, our algorithm is the first one for constrained
optimization with different local constraints on directed graphs using a fixed step-size.
C. Paper organization
In Section II, we present our problem setup, proposed method, and the reduction technique to analyze
non-smooth decentralized optimization problems. Section III introduces our new analytical framework for
general optimization algorithms with convex and smooth objectives, based on which we analyze DAGP
and GD. Finally, Section IV demonstrates the efficiency of DAGP in several experiments.
D. Notation
Bold lowercase and uppercase letters denote vectors and matrices, respectively, with the elements of a matrix $\mathbf{W}$ represented as $w_{vu}$. $\mathbf{1}_n$ and $\mathbf{0}_n$ respectively denote the $n$-dimensional vectors of all ones and all zeros, while $\mathbf{O}_{m\times n}$ denotes an $m\times n$ matrix of zeros. The indices $m$ and $n$ may be dropped if there is no risk of confusion. $\langle\cdot,\cdot\rangle$ is the Euclidean inner product, and $\langle\mathbf{A},\mathbf{C}\rangle=\mathrm{Tr}(\mathbf{A}\mathbf{C}^T)$ denotes the matrix inner product, where $\mathrm{Tr}(\cdot)$ is the trace. $\delta_{k,l}$ is the Kronecker delta. Subscripts and superscripts typically indicate iteration numbers and node numbers, respectively; e.g., $\nabla f_v(\mathbf{x}^v_k)$ is the gradient at node $v$ and iteration $k$. Matrix representations use vector variables as rows; e.g., the matrix $\mathbf{G}\in\mathbb{R}^{M\times m}$ contains $\mathbf{g}^v\in\mathbb{R}^m$ as rows. In this context, $\mathbf{G}\in\ker(\mathbf{W})$ means that each column of $\mathbf{G}$ is an element of the null space of $\mathbf{W}$.
E. Preliminaries
Definition 1 (L-Smooth function). A function $f$ is $L$-smooth if it is differentiable and its derivative is $L$-Lipschitz. For a convex function $f$, this is equivalent to
$$f(\mathbf{y}) \le f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x} \rangle + \frac{L}{2}\|\mathbf{x}-\mathbf{y}\|_2^2, \qquad \forall \mathbf{x},\mathbf{y}. \tag{2}$$
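For illustration, the following minimal sketch numerically checks inequality (2) for a least-squares objective $f(\mathbf{x})=\frac{1}{2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2$, whose smoothness constant is the largest eigenvalue of $\mathbf{A}^T\mathbf{A}$; the random instance is our own hypothetical example.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)      # convex quadratic
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A).max()             # smoothness constant

# Check the upper bound (2) at random pairs of points.
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    upper = f(x) + grad(x) @ (y - x) + 0.5 * L * np.sum((x - y) ** 2)
    assert f(y) <= upper + 1e-9
print("inequality (2) holds at all sampled points, L =", round(L, 3))
```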
Definition 2 (Normal cone and Projection operator). For a closed convex set $S\subseteq\mathbb{R}^n$, the normal cone of $S$ is given by
$$\partial I_S(\mathbf{x}) = \begin{cases} \emptyset & \mathbf{x}\notin S, \\ \{\mathbf{g}\in\mathbb{R}^n \mid \forall \mathbf{z}\in S,\ \mathbf{g}^T(\mathbf{z}-\mathbf{x})\le 0\} & \mathbf{x}\in S. \end{cases}$$
Moreover, the projection of a vector $\mathbf{x}\in\mathbb{R}^n$ onto $S$ is computed by $P_S(\mathbf{x}) = \arg\min_{\mathbf{y}\in S}\|\mathbf{y}-\mathbf{x}\|_2^2$, and the distance $\|\mathbf{x}-P_S(\mathbf{x})\|_2$ between $\mathbf{x}$ and $S$ is denoted by $\mathrm{dist}(\mathbf{x}, S)$.
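For simple sets, the projection admits a closed form. The following sketch (an illustrative example with hypothetical helper names, not part of the DAGP implementation) evaluates $P_S$ and $\mathrm{dist}(\cdot,S)$ for a Euclidean ball and a box.

```python
import numpy as np

def project_ball(x, center, radius):
    # P_S for S = {y : ||y - center|| <= radius}: rescale toward the center.
    d = x - center
    n = np.linalg.norm(d)
    return x if n <= radius else center + radius * d / n

def project_box(x, lo, hi):
    # P_S for S = [lo, hi]^n: clip each coordinate independently.
    return np.clip(x, lo, hi)

def dist(x, proj, *args):
    # dist(x, S) = ||x - P_S(x)||_2, for any projection routine proj.
    return np.linalg.norm(x - proj(x, *args))

x = np.array([2.0, 0.0])
print(project_ball(x, np.zeros(2), 1.0))        # [1., 0.]
print(dist(x, project_ball, np.zeros(2), 1.0))  # 1.0
print(project_box(x, -1.0, 1.0))                # [1., 0.]
```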
Definition 3 (Graph Theory). A directed graph is a pair $\mathcal{G}=(\mathcal{V},\mathcal{E})$ of $\mathcal{V}=\{1,\ldots,M\}$ as the node set and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ as the (directed) edge set. The asymmetric adjacency matrix $\mathbf{A}=[a_{ij}]$ is computed as $a_{ij}=+1$ if $(i,j)\in\mathcal{E}$, and $0$ otherwise. In our study, edges represent communication links between nodes: $j$ can send to $i$ if $(i,j)\in\mathcal{E}$. Node $i$'s incoming and outgoing neighbors are $\mathcal{N}^i_{\text{in}}=\{j\mid(i,j)\in\mathcal{E}\}$ and $\mathcal{N}^i_{\text{out}}=\{j\mid(j,i)\in\mathcal{E}\}$, respectively. The in-degree and out-degree are the cardinalities of $\mathcal{N}^i_{\text{in}}$ and $\mathcal{N}^i_{\text{out}}$. For graphs with self-loops, node $i$ is included in $\mathcal{N}^i_{\text{in/out}}$. We define two Laplacian matrices: $\mathbf{L}_{\text{in}}=\mathbf{D}_{\text{in}}-\mathbf{A}$ and $\mathbf{L}_{\text{out}}=\mathbf{D}_{\text{out}}-\mathbf{A}$, where $\mathbf{D}_{\text{in}}$ and $\mathbf{D}_{\text{out}}$ are diagonal matrices of the nodes' in-degrees and out-degrees. These matrices respectively have zero row and zero column sums, and their scaled forms serve as gossip matrices.
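The following sketch (an illustrative example, not taken from our implementation) builds $\mathbf{A}$, the degree matrices, and the two Laplacians for a small directed ring with self-loops, and verifies the zero row-sum and column-sum properties; the scaling used at the end to obtain a row-stochastic matrix is only one possible choice of gossip matrix.

```python
import numpy as np

M = 4
# Directed ring with self-loops. Following the convention above,
# a_ij = 1 for (i, j) in E means node j can send to node i.
A = np.zeros((M, M))
for i in range(M):
    A[i, (i + 1) % M] = 1.0   # i receives from its successor
    A[i, i] = 1.0             # self-loop

D_in = np.diag(A.sum(axis=1))    # in-degrees (incoming neighbors of i)
D_out = np.diag(A.sum(axis=0))   # out-degrees (nodes i sends to)
L_in = D_in - A                  # zero row sums
L_out = D_out - A                # zero column sums

assert np.allclose(L_in.sum(axis=1), 0.0)
assert np.allclose(L_out.sum(axis=0), 0.0)

# One scaled form: W = I - eps * L_in is row-stochastic and nonnegative
# for a small enough eps > 0, hence usable as a gossip matrix.
eps = 0.3
W = np.eye(M) - eps * L_in
assert np.allclose(W.sum(axis=1), 1.0) and (W >= 0).all()
```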
Definition 4 (Sufficient optimality condition). $\mathbf{x}^*$ is a regular optimal point for problem (1) if it satisfies the following optimality condition:
$$\mathbf{0} \in \sum_{v=1}^{M} \left( \partial I_{S_v}(\mathbf{x}^*) + \nabla f_v(\mathbf{x}^*) \right). \tag{3}$$
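As a one-dimensional sanity check (our own hypothetical toy example), the sketch below verifies condition (3) for two agents with quadratic objectives and interval constraints; at a point in the interior of both intervals the normal cones contribute only $\mathbf{0}$, so (3) reduces to the local gradients summing to zero.

```python
# Two agents: f_1(x) = (x - 2)^2 with S_1 = (-inf, 1],
#             f_2(x) = (x + 1)^2 with S_2 = [0, inf).
# The intersection is [0, 1]; the unconstrained minimizer of the average
# objective is x = 0.5, which is feasible, so x* = 0.5.
x_star = 0.5
grads = [2 * (x_star - 2), 2 * (x_star + 1)]   # [-3, 3]

# For an interval, the normal cone at an interior point is {0}; at the
# upper endpoint it is [0, inf), and at the lower endpoint (-inf, 0].
# Here x* lies in the interior of both S_1 and S_2, so each normal cone
# contributes only 0, and (3) asks that the gradients sum to zero.
assert abs(sum(grads)) < 1e-12
print("x* = 0.5 satisfies the regular optimality condition (3)")
```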
II. PROBLEM SETUP AND PROPOSED METHOD
In this section, we first clarify our problem setup by stating underlying assumptions. Next, we present
the DAGP algorithm and its constructive blocks. Finally, we introduce a technique to address and analyze
non-smooth decentralized optimization problems using our optimization framework.
A. Problem setup
In our setup, $M$ agents interact over a directed network, each having a unique objective function and constraint set. We proceed by presenting the underlying assumptions:
Assumption 1 (Objectives and Constraints). The local objective functions $f_v$ are convex, differentiable, and $L$-smooth for a constant $L > 0$. The constraints $S_v$ are closed and convex.
Assumption 2 (Problem Feasibility). The optimization problem is feasible and achieves a finite optimal value $f^*$ at a regular optimal feasible solution $\mathbf{x}^*$, that is, $f^* = f(\mathbf{x}^*)$.