Double Averaging and Gradient Projection:
Convergence Guarantees for Decentralized
Constrained Optimization
Firooz Shahriari-Mehr, and Ashkan Panahi
Abstract
We consider a generic decentralized constrained optimization problem over static, directed communication
networks, where each agent has exclusive access to only one convex, differentiable, local objective term and one
convex constraint set. For this setup, we propose a novel decentralized algorithm, called DAGP (Double Averaging
and Gradient Projection), based on local gradients, projection onto local constraints, and local averaging. We
achieve global optimality through a novel distributed tracking technique we call distributed null projection. Further,
we show that DAGP can be used to solve unconstrained problems with non-differentiable objective terms via a problem reduction scheme. Assuming only smoothness of the objective terms, we study the convergence of
DAGP and establish sub-linear rates of convergence in terms of feasibility, consensus, and optimality, with no extra
assumption (e.g. strong convexity). For the analysis, we forego the difficulties of selecting Lyapunov functions by
proposing a new methodology of convergence analysis in optimization problems, which we refer to as aggregate
lower-bounding. To demonstrate the generality of this method, we also provide an alternative convergence proof for
the standard gradient descent algorithm with smooth functions. Finally, we present numerical results demonstrating
the effectiveness of our proposed method in both constrained and unconstrained problems. In particular, we propose a DAGP-based distributed scheme for the optimal transport problem with superior performance and speed.
Index Terms
Constrained optimization, convergence analysis, convex optimization, distributed optimization, decentralized
optimal transport, multi-agent systems.
I. INTRODUCTION
A wide variety of applications involving multi-agent systems concern distributed optimization problems.
In these applications, the system consists of $M$ interacting agents designed to efficiently minimize a global objective function $f(\mathbf{x})$ of a common set of $m$ optimization variables $\mathbf{x} \in \mathbb{R}^m$ in the presence of a global constraint set $S$. In many cases, the global function and constraint are only partially available to each agent. A fairly general framework in such problems is an optimization problem of the following form
$$\min_{\mathbf{x} \in \mathbb{R}^m} \; \frac{1}{M} \sum_{v=1}^{M} f_v(\mathbf{x}) \quad \text{s.t.} \quad \mathbf{x} \in \bigcap_{v=1}^{M} S_v, \tag{1}$$
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version
may no longer be accessible. This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP)
funded by the Knut and Alice Wallenberg Foundation. A short version of this work was presented at the 60th IEEE Conference on Decision
and Control, Austin, Texas, USA, December 2021 [1].
The authors are with the Department of Computer Science and Engineering, Chalmers University of Technology, Göteborg, Sweden (e-mails: {Firooz, Ashkan.panahi}@chalmers.se).
where the local objective functions $f_v(\mathbf{x})$ and the local constraint sets $S_v$ are exclusively available to their corresponding agents. There are multiple motivating applications for this setup. Parameter estimation and resource allocation in sensor networks [2], [3], [4, Chapter 10], fitting models to big datasets [5], [6], smart grid control [7], and optimal transport [8] are but a few examples.
The standard single-machine optimization algorithms such as projected gradient descent can solve many
instances of the problem in (1). However, in large-scale or privacy-sensitive applications, only distributed
optimization methods are applicable. In a distributed setup, two different architectures can be considered:
centralized, where a central node (agent) coordinates all other agents (workers), and the decentralized
architecture, where there is no central coordinator. To avoid master failure and bottleneck problems,
decentralized architectures have attracted more attention in the past decade [9]–[12].
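For illustration, the following minimal Python sketch sets up a toy instance of (1) with quadratic local objectives and box constraints and solves it with single-machine projected gradient descent; the instance and helper names (e.g., `project_intersection`) are hypothetical and only meant to make the centralized baseline concrete, not to describe DAGP.

```python
import numpy as np

# Toy instance of (1): M agents, each with f_v(x) = 0.5*||x - c_v||^2
# and a box constraint S_v. The intersection of boxes is itself a box,
# so the centralized projection is easy in this illustrative case.
rng = np.random.default_rng(0)
M, m = 5, 3
centers = rng.normal(size=(M, m))
lows = np.full((M, m), -0.5) + 0.1 * rng.random((M, m))
highs = np.full((M, m), 0.5) - 0.1 * rng.random((M, m))

def grad_global(x):
    # Gradient of (1/M) * sum_v 0.5*||x - c_v||^2.
    return np.mean(x - centers, axis=0)

def project_intersection(x):
    # Projection onto the intersection of all boxes (again a box).
    return np.clip(x, lows.max(axis=0), highs.min(axis=0))

x = np.zeros(m)
step = 1.0          # the global objective is 1-smooth here
for _ in range(200):
    x = project_intersection(x - step * grad_global(x))

print("centralized solution:", x)
```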
In many decentralized optimization schemes, each node $v$ operates on its own realization $\mathbf{x}^v$ of the optimization variables. The nodes are to find a common minimizer by communicating their information
through a communication network and computing weighted averages among the neighbors, represented
by so-called gossip matrices [13]. Depending on the nature of communication, the network is represented
by either an undirected or directed graph where the edges denote available communication links between
the agents. In this paper, we consider gossip matrices with directed graphs.
During the past decade, there has been prolific research on decentralized, unconstrained optimization
algorithms [12], [14]. Many well-known algorithms are developed with provable convergence rates under
different assumptions on the objective functions, communication graphs, and the step-size. A state-of-the-art example is the Push-Pull method [15], [16], utilizing two gossip matrices and a fixed step-size¹ for handling directed communication graphs. The optimality gap in Push-Pull is proven to decrease at a linear rate when the local objective functions are smooth and strongly convex. In the absence of strong convexity and smoothness, no achievable rate of convergence is known for directed graphs.
Compared to the unconstrained setup, the constrained problem, especially with individual constraints as
in (1), is less studied. Several papers consider a simplification by a shared (non-distributed) constraint and
utilize its orthogonal projection operator to find a feasible solution [17]–[19]. Recently, [20] attempted
to address (1) by adapting the Push-Pull algorithm. They demonstrate that the main underlying strategy
(known as gradient tracking) of Push-Pull may not be applied with a fixed step size. However, with a
diminishing step-size they show that the feasibility gap decreases at a sub-linear rate, under the assumption
of smooth and strongly convex functions, but do not discuss the optimality gap. Alternative strategies with
a fixed step size are yet to be discovered. More generally, algorithms with generic convergence properties
in the distributed constrained framework of (1) are still lacking. One main reason is that in the constrained
case, the standard method of analysis based on Lyapunov (potential) functions becomes complicated. A
decaying step-size can simplify the analysis, but it often results in a dramatic reduction in the convergence
speed [17]. In this paper, we address the above limitations in the distributed convex optimization literature.
Our main contributions are summarized as follows.
A. Contributions
• We propose a novel algorithm, called Double Averaging and Gradient Projection (DAGP), that solves (1). Our method considers a directed communication graph with individual constraints at each node. DAGP leverages two gossip matrices and a fixed step-size, ensuring fast convergence to the optimal solution.

¹ In general, algorithms may use a fixed or a diminishing step size. In practice, fixed step sizes are often superior, as diminishing step sizes require a careful design of a step-size schedule, which substantially slows down the underlying algorithm.
• We introduce a novel general convergence analysis framework that we refer to as aggregate lower-bounding (ALB). It foregoes the need for Lyapunov functions and decaying step-sizes. We showcase the power of ALB by presenting an alternative analysis of Gradient Descent (GD).
• Using ALB, we prove that under smoothness of the objective terms, the feasibility gap of DAGP vanishes with $O(1/K)$, and the optimality gap decays with $O(1/K)$, where $K$ denotes the total number of iterations. ALB lets us set aside restrictive assumptions such as identical local constraints, a decaying step-size, and strong convexity.
• We present a reduction of decentralized unconstrained optimization problems with a non-smooth objective to an equivalent constrained problem as in (1), with a linear (hence smooth) objective. Using this reduction, DAGP can also solve generic non-smooth decentralized optimization problems. In this way, we provide the first convergence guarantees for a decentralized setup without smoothness and strong convexity assumptions.
• We explore the performance of DAGP in practical applications, particularly in the context of the optimal transport (OT) problem. We present the first decentralized OT formulation, efficiently computing exact sparse solutions for large-scale instances.
B. Literature Review
Various decentralized optimization methods have been proposed in the literature for different scenarios
and underlying assumptions. For a comprehensive survey, see [9]–[12]. In this section, we review first-
order decentralized optimization algorithms for both constrained and unconstrained optimization problems.
We ignore other possible extensions, such as consideration of time delay [21], local functions with
finite-sum structure [11], [22], [23], message compression [24], [25], time-varying graphs [26], [27], or
continuous-time methods [28], as they fall outside the scope of this paper. We also ignore decentralized
dual-based methods as they require computationally costly oracles, such as the gradient of conjugate
functions, which may not be feasible in generic applications.
1) Decentralized unconstrained optimization methods: Earlier studies on decentralized optimization
have considered undirected communication graphs [29], [30]. In this setup, doubly-stochastic gossip
matrices compatible with the graph structure can be constructed easily. These methods are restricted
to a decaying step-size for convergence. As a result, the provable convergence rate is $O(1/\sqrt{K})$ for the setup with convex and smooth objective functions, and $O(1/K)$ when the objective functions become strongly convex. The next group of algorithms tracks the gradient of the global function over iterations
for convergence to the optimal solution using fixed step-sizes. Gradient tracking can be implemented
based on the dynamic average consensus protocol [31]. Pioneered by [32], the optimization methods that
employ this approach are DIGing [27], EXTRA [33], and NEXT [34]. These methods use fixed step-sizes
and achieve linear convergence, that is, $O(\mu^K)$ for a $\mu \in (0,1)$, in a strongly-convex and smooth setting.
The construction of doubly-stochastic gossip matrices compatible with directed graphs is not straight-
forward [35]. Therefore, practical optimization algorithms over directed graphs use row-stochastic or
column-stochastic gossip matrices. This modification makes it harder to achieve consensus and an optimal solution. In response, the push-sum protocol [36] was introduced. The methods [26], [37] based on the push-sum protocol still require a decaying step-size for convergence. [38], [27], [39], and [40] combine the push-sum protocol with gradient tracking techniques, respectively proposing the DEXTRA, Push-DIGing, SONATA, and ADD-OPT algorithms, which work with fixed step-sizes. These algorithms
achieve a linear rate of convergence in a smooth and strongly-convex setting. All these methods use
only column-stochastic gossip matrices. Recently, the Push-Pull algorithm [15], [16] has been proposed, which utilizes both row-stochastic and column-stochastic matrices. It employs a fixed step-size and has a linear convergence rate in the strongly-convex and smooth setting. An accelerated version of the Push-Pull algorithm is proposed in [41]. Our algorithm has a similar requirement on the underlying gossip matrices to the Push-Pull algorithm.
2) Decentralized constrained optimization methods: We focus on methods that consider orthogonal
projection operators onto the constraint sets. These methods can be divided into two groups based on their
constraint structure. The first group [17]–[19], [42] considers an identical constraint $S$ available to every agent. These methods differ in the graph connectivity assumption [17], the optimization approach [19], the global objective function [43], or the characteristics of the functions [18].
The second group corresponds to the setup where each agent knows its own local constraint $S_v$. [44] and
[45] proposed projection onto local constraint sets in each iteration. They considered undirected graphs
and proved a sub-linear rate of convergence for the optimality gap using decaying step-sizes, only in
two special cases: when the constraints are identical, or when the graph is fully connected [45]. No rate
for the feasibility gap is established. [46] extends [44] and [45] to the setup with noisy communication
links and noisy (sub)gradients. DDPS [17] uses two row-stochastic and column-stochastic matrices, but
it requires a decaying step-size and assumes identical constraints. [20] modifies the Push-Pull algorithm
for this problem. They show that a fixed step-size prevents their approach from reaching a fixed point;
thus, they employ a decaying step-size. To our knowledge, our algorithm is the first one for constrained
optimization with different local constraints on directed graphs using a fixed step-size.
C. Paper organization
In Section II, we present our problem setup, proposed method, and the reduction technique to analyze
non-smooth decentralized optimization problems. Section III introduces our new analytical framework for
general optimization algorithms with convex and smooth objectives, based on which we analyze DAGP
and GD. Finally, Section IV demonstrates the efficiency of DAGP in several experiments.
D. Notation
Bold lowercase and uppercase letters denote vectors and matrices, respectively, with the elements of a matrix $\mathbf{W}$ represented as $w_{vu}$. $\mathbf{1}_n$ and $\mathbf{0}_n$ respectively denote the $n$-dimensional vectors of all ones and all zeros, while $\mathbf{O}_{m\times n}$ denotes an $m\times n$ matrix of zeros. The indices $m$ and $n$ may be dropped if there is no risk of confusion. $\langle\cdot,\cdot\rangle$ is the Euclidean inner product, and $\langle\mathbf{A},\mathbf{C}\rangle=\mathrm{Tr}(\mathbf{A}\mathbf{C}^T)$ denotes the matrix inner product, where $\mathrm{Tr}(\cdot)$ is the trace. $\delta_{k,l}$ is the Kronecker delta. Subscripts and superscripts typically indicate iteration numbers and node numbers, respectively; e.g., $\nabla f_v(\mathbf{x}^v_k)$ is the gradient at node $v$ and iteration $k$. Matrix representations use vector variables as rows; e.g., the matrix $\mathbf{G}\in\mathbb{R}^{M\times m}$ contains $\mathbf{g}^v\in\mathbb{R}^m$ as rows. In this context, $\mathbf{G}\in\ker(\mathbf{W})$ means that each column of $\mathbf{G}$ is an element of the null space of $\mathbf{W}$.
E. Preliminaries
Definition 1 (L-Smooth function). A function $f$ is $L$-smooth if it is differentiable and its derivative is $L$-Lipschitz. For a convex function $f$, this is equivalent to
$$f(\mathbf{y}) \le f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x} \rangle + \frac{L}{2}\|\mathbf{x}-\mathbf{y}\|_2^2, \qquad \forall \mathbf{x},\mathbf{y}. \tag{2}$$
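For illustration, the following minimal sketch numerically checks inequality (2) for a least-squares objective $f(\mathbf{x})=\frac{1}{2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2$, whose smoothness constant is the largest eigenvalue of $\mathbf{A}^T\mathbf{A}$; the random instance is our own hypothetical example.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)      # convex quadratic
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A).max()             # smoothness constant

# Check the upper bound (2) at random pairs of points.
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    upper = f(x) + grad(x) @ (y - x) + 0.5 * L * np.sum((x - y) ** 2)
    assert f(y) <= upper + 1e-9
print("inequality (2) holds at all sampled points, L =", round(L, 3))
```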
Definition 2 (Normal cone and Projection operator). For a closed convex set $S\subseteq\mathbb{R}^n$, the normal cone of $S$ is given by
$$\partial I_S(\mathbf{x}) = \begin{cases} \emptyset & \mathbf{x}\notin S, \\ \{\mathbf{g}\in\mathbb{R}^n \mid \forall \mathbf{z}\in S,\ \mathbf{g}^T(\mathbf{z}-\mathbf{x})\le 0\} & \mathbf{x}\in S. \end{cases}$$
Moreover, the projection of a vector $\mathbf{x}\in\mathbb{R}^n$ onto $S$ is computed by $P_S(\mathbf{x}) = \arg\min_{\mathbf{y}\in S}\|\mathbf{y}-\mathbf{x}\|_2^2$, and the distance $\|\mathbf{x}-P_S(\mathbf{x})\|_2$ between $\mathbf{x}$ and $S$ is denoted by $\mathrm{dist}(\mathbf{x}, S)$.
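For simple sets, the projection admits a closed form. The following sketch (an illustrative example with hypothetical helper names, not part of the DAGP implementation) evaluates $P_S$ and $\mathrm{dist}(\cdot,S)$ for a Euclidean ball and a box.

```python
import numpy as np

def project_ball(x, center, radius):
    # P_S for S = {y : ||y - center|| <= radius}: rescale toward the center.
    d = x - center
    n = np.linalg.norm(d)
    return x if n <= radius else center + radius * d / n

def project_box(x, lo, hi):
    # P_S for S = [lo, hi]^n: clip each coordinate independently.
    return np.clip(x, lo, hi)

def dist(x, proj, *args):
    # dist(x, S) = ||x - P_S(x)||_2, for any projection routine proj.
    return np.linalg.norm(x - proj(x, *args))

x = np.array([2.0, 0.0])
print(project_ball(x, np.zeros(2), 1.0))        # [1., 0.]
print(dist(x, project_ball, np.zeros(2), 1.0))  # 1.0
print(project_box(x, -1.0, 1.0))                # [1., 0.]
```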
Definition 3 (Graph Theory). A directed graph is a pair $\mathcal{G}=(\mathcal{V},\mathcal{E})$ of $\mathcal{V}=\{1,\ldots,M\}$ as the node set and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ as the (directed) edge set. The asymmetric adjacency matrix $\mathbf{A}=[a_{ij}]$ is computed as $a_{ij}=+1$ if $(i,j)\in\mathcal{E}$, and $0$ otherwise. In our study, edges represent communication links between nodes: $j$ can send to $i$ if $(i,j)\in\mathcal{E}$. Node $i$'s incoming and outgoing neighbors are $\mathcal{N}^i_{\text{in}}=\{j\mid(i,j)\in\mathcal{E}\}$ and $\mathcal{N}^i_{\text{out}}=\{j\mid(j,i)\in\mathcal{E}\}$, respectively. The in-degree and out-degree are the cardinalities of $\mathcal{N}^i_{\text{in}}$ and $\mathcal{N}^i_{\text{out}}$. For graphs with self-loops, node $i$ is included in $\mathcal{N}^i_{\text{in/out}}$. We define two Laplacian matrices: $\mathbf{L}_{\text{in}}=\mathbf{D}_{\text{in}}-\mathbf{A}$ and $\mathbf{L}_{\text{out}}=\mathbf{D}_{\text{out}}-\mathbf{A}$, where $\mathbf{D}_{\text{in}}$ and $\mathbf{D}_{\text{out}}$ are diagonal matrices of the nodes' in-degrees and out-degrees. These matrices respectively have zero row and zero column sums, and their scaled forms serve as gossip matrices.
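The following sketch (an illustrative example, not taken from our implementation) builds $\mathbf{A}$, the degree matrices, and the two Laplacians for a small directed ring with self-loops, and verifies the zero row-sum and column-sum properties; the scaling used at the end to obtain a row-stochastic matrix is only one possible choice of gossip matrix.

```python
import numpy as np

M = 4
# Directed ring with self-loops. Following the convention above,
# a_ij = 1 for (i, j) in E means node j can send to node i.
A = np.zeros((M, M))
for i in range(M):
    A[i, (i + 1) % M] = 1.0   # i receives from its successor
    A[i, i] = 1.0             # self-loop

D_in = np.diag(A.sum(axis=1))    # in-degrees (incoming neighbors of i)
D_out = np.diag(A.sum(axis=0))   # out-degrees (nodes i sends to)
L_in = D_in - A                  # zero row sums
L_out = D_out - A                # zero column sums

assert np.allclose(L_in.sum(axis=1), 0.0)
assert np.allclose(L_out.sum(axis=0), 0.0)

# One scaled form: W = I - eps * L_in is row-stochastic and nonnegative
# for a small enough eps > 0, hence usable as a gossip matrix.
eps = 0.3
W = np.eye(M) - eps * L_in
assert np.allclose(W.sum(axis=1), 1.0) and (W >= 0).all()
```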
Definition 4 (Sufficient optimality condition). $\mathbf{x}^*$ is a regular optimal point for problem (1) if it satisfies the following optimality condition:
$$\mathbf{0} \in \sum_{v=1}^{M} \left( \partial I_{S_v}(\mathbf{x}^*) + \nabla f_v(\mathbf{x}^*) \right). \tag{3}$$
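As a one-dimensional sanity check (our own hypothetical toy example), the sketch below verifies condition (3) for two agents with quadratic objectives and interval constraints; at a point in the interior of both intervals the normal cones contribute only $\mathbf{0}$, so (3) reduces to the local gradients summing to zero.

```python
# Two agents: f_1(x) = (x - 2)^2 with S_1 = (-inf, 1],
#             f_2(x) = (x + 1)^2 with S_2 = [0, inf).
# The intersection is [0, 1]; the unconstrained minimizer of the average
# objective is x = 0.5, which is feasible, so x* = 0.5.
x_star = 0.5
grads = [2 * (x_star - 2), 2 * (x_star + 1)]   # [-3, 3]

# For an interval, the normal cone at an interior point is {0}; at the
# upper endpoint it is [0, inf), and at the lower endpoint (-inf, 0].
# Here x* lies in the interior of both S_1 and S_2, so each normal cone
# contributes only 0, and (3) asks that the gradients sum to zero.
assert abs(sum(grads)) < 1e-12
print("x* = 0.5 satisfies the regular optimality condition (3)")
```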
II. PROBLEM SETUP AND PROPOSED METHOD
In this section, we first clarify our problem setup by stating underlying assumptions. Next, we present
the DAGP algorithm and its constructive blocks. Finally, we introduce a technique to address and analyze
non-smooth decentralized optimization problems using our optimization framework.
A. Problem setup
In our setup, $M$ agents interact over a directed network, each having a unique objective function and constraint set. We proceed by presenting the underlying assumptions:
Assumption 1 (Objectives and Constraints). The local objective functions $f_v$ are convex, differentiable, and $L$-smooth for a constant $L > 0$. The constraints $S_v$ are closed and convex.
Assumption 2 (Problem Feasibility). The optimization problem is feasible and achieves a finite optimal value $f^*$ at a regular optimal feasible solution $\mathbf{x}^*$, that is, $f^* = f(\mathbf{x}^*)$.