ARTICLE TEMPLATE
Gradient-Type Methods For Decentralized Optimization Problems
With Polyak-Łojasiewicz Condition Over Time-Varying Networks
Ilya Kuruzov^a, Mohammad Alkousa^{a,b}, Fedor Stonyakin^{a,c}, and Alexander Gasnikov^{a,b,d}
^a Moscow Institute of Physics and Technology, Dolgoprudny, Russia; ^b National Research University Higher School of Economics, Moscow, Russia; ^c V. Vernadsky Crimean Federal University, Simferopol, Republic of Crimea; ^d ISP RAS Research Center for Trusted Artificial Intelligence
ARTICLE HISTORY
Compiled May 14, 2024
ABSTRACT
This paper focuses on decentralized optimization (minimization and saddle point) problems with objective functions that satisfy the Polyak-Łojasiewicz condition (PL-condition). The first part of the paper is devoted to the minimization of sum-type cost functions. To solve such a class of problems, we propose a gradient descent type method with a consensus projection procedure and an inexact gradient of the objectives. In the second part, we study the saddle-point problem (SPP) with a sum structure, with objectives satisfying the two-sided PL-condition. To solve such an SPP, we propose a generalization of the Multi-step Gradient Descent Ascent method with a consensus procedure and inexact gradients of the objective function with respect to both variables. Finally, we present numerical experiments that show the efficiency of the proposed algorithms for the robust least squares problem.
KEYWORDS
Distributed optimization; non-convex optimization; saddle-point problem;
PL-condition
1. Introduction
In this paper, we first study a sum-type minimization problem
\[
\min_{x \in \mathbb{R}^d} \Big\{ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \Big\}, \tag{1}
\]
where the functions $f_i$ are generally non-convex and stored separately by nodes of a communication network, represented by an undirected graph $G = (E, V)$. The graph $G$ may have a time-varying structure. This problem, where the
This work was supported by a grant for research centers in the field of artificial intelligence, provided by the
Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement
(agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and
Technology dated November 1, 2021 No. 70-2021-00138.
CONTACT Ilya Kuruzov. Email: kuruzov.ia@phystech.edu
arXiv:2210.03810v2 [math.OC] 12 May 2024
objective function depends on distributed data, is called a decentralized optimization problem. We assume that $f$ satisfies the well-known Polyak-Łojasiewicz condition (for brevity, PL-condition). This condition was originally introduced by Polyak [28], who proved that it suffices to guarantee a global linear convergence rate for gradient descent without assuming convexity. The PL-condition has been studied in many different settings of optimization problems and has been theoretically verified for objective functions arising in many practical problems. For example, it has been proven to hold for the objectives of over-parameterized deep networks [6], learning LQR models [9], phase retrieval [37], and generative adversarial imitation learning of linear quadratic regulators (see Example 3.1 in [3]). Further discussion of the PL-condition and more examples can be found in [15].
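To make this concrete, here is a small numerical sketch (our illustration, not from the paper): the function $f(x) = x^2 + 3\sin^2 x$ is non-convex but is known to satisfy the PL-condition (with $\mu = 1/32$ according to [15]), and plain gradient descent on it converges linearly to the global minimum $f^* = 0$:

```python
import numpy as np

# Hypothetical example: f(x) = x^2 + 3*sin(x)^2 is non-convex but
# satisfies the PL-condition, so gradient descent with step 1/L
# (here L = 8, since |f''(x)| = |2 + 6*cos(2x)| <= 8) converges
# linearly to the global minimum f* = 0 at x* = 0.
f = lambda x: x**2 + 3.0 * np.sin(x)**2
grad = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)

x, L = 3.0, 8.0
gaps = []                      # record f(x_k) - f* along the trajectory
for _ in range(300):
    x -= grad(x) / L
    gaps.append(f(x))
# gaps decreases monotonically and geometrically (linear convergence)
```

Despite the non-convexity of $f$, the iterates are not trapped away from the global minimizer, which is exactly what the PL-condition guarantees.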
Problems of this type arise in different areas: distributed machine learning [17], resource allocation [13], and power system control [31].
The problem (1) can be reformulated as a problem with linear constraints. For this, let us assign each agent in the network a personal copy of the parameter vector $x_i \in \mathbb{R}^d$ (a column vector) and introduce
\[
\mathbf{X} := \big( x_1 \; \ldots \; x_n \big)^\top \in \mathbb{R}^{n \times d}, \qquad F(\mathbf{X}) = \sum_{i=1}^{n} f_i(x_i). \tag{2}
\]
Now we equivalently rewrite problem (1) as
\[
\min_{\mathbf{X} \in \mathbb{R}^{n \times d}} F(\mathbf{X}) = \sum_{i=1}^{n} f_i(x_i), \quad \text{s.t. } x_1 = \dots = x_n, \tag{3}
\]
which has the same optimal value as problem (1). This reformulation increases the number of variables but at the same time induces additional constraints.
Let us denote the set of consensus constraints $\mathcal{C} = \{\mathbf{X} \,|\, x_1 = \dots = x_n\}$. Also, for each $\mathbf{X} \in \mathbb{R}^{n \times d}$, denote the average of the agents' vectors $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \in \mathbb{R}^d$ and introduce the projection of $\mathbf{X}$ onto the constraint set
\[
\overline{\mathbf{X}} = \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top \mathbf{X} = \Pi_{\mathcal{C}}(\mathbf{X}) = \big( \bar{x} \; \ldots \; \bar{x} \big)^\top \in \mathbb{R}^{n \times d}.
\]
Note that $\mathcal{C}$ is a linear subspace of $\mathbb{R}^{n \times d}$, and therefore the projection operator $\Pi_{\mathcal{C}}$ is linear.
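Both the matrix form $\frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top \mathbf{X}$ and the linearity of $\Pi_{\mathcal{C}}$ can be checked directly; the following sketch (our illustration, with arbitrary dimensions) does so in NumPy:

```python
import numpy as np

n, d = 4, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))   # row i holds agent i's copy x_i^T

# Projection onto the consensus set C = {X : x_1 = ... = x_n}:
# every row is replaced by the average of all rows.
P = np.ones((n, n)) / n           # the matrix (1/n) * 1_n 1_n^T

X_bar = P @ X
assert np.allclose(X_bar, np.tile(X.mean(axis=0), (n, 1)))

# Pi_C is linear: Pi_C(a*X + b*Y) = a*Pi_C(X) + b*Pi_C(Y)
Y = rng.standard_normal((n, d))
assert np.allclose(P @ (2.0 * X - 3.0 * Y), 2.0 * (P @ X) - 3.0 * (P @ Y))
```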
Secondly, we study the saddle-point problem with a sum structure
\[
\min_{x \in \mathbb{R}^{d_x}} \max_{y \in \mathbb{R}^{d_y}} \Big\{ \varphi(x, y) = \frac{1}{n} \sum_{i=1}^{n} \varphi_i(x, y) \Big\}, \tag{4}
\]
where the functions $\varphi_i(\cdot, \cdot)$ are non-convex (with respect to the variable $x$ for each $y$) and smooth (i.e., with Lipschitz-continuous gradient). These functions can be computed only separately in different nodes of the communication network, which is represented by an undirected graph $G$.
Problems of type (4) arise in many applications, such as generative adversarial networks [10], adversarial training [18], and fair training [3]. In our work with problem (4), we assume that $\varphi(\cdot, y)$ and $\varphi(x, \cdot)$ satisfy the PL-condition (see Assumption 2.2 below).
Let $y^*(x) := \arg\max_{y \in \mathbb{R}^{d_y}} \varphi(x, y)$, and set $f_i(x) := \varphi_i(x, y^*(x))$; then problem (4) can be rewritten in the form of the minimization problem (1), i.e.,
\[
\min_{x \in \mathbb{R}^{d_x}} \Big( \max_{y \in \mathbb{R}^{d_y}} \Big\{ \varphi(x, y) = \frac{1}{n} \sum_{i=1}^{n} \varphi_i(x, y) \Big\} \Big) = \min_{x \in \mathbb{R}^{d_x}} \Big\{ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \Big\}. \tag{5}
\]
Let $\mathcal{C}_x := \{\mathbf{X} \in \mathbb{R}^{n \times d_x} \,|\, x_1 = \dots = x_n\}$, $\mathcal{C}_y := \{\mathbf{Y} \in \mathbb{R}^{n \times d_y} \,|\, y_1 = \dots = y_n\}$, and $\Phi(\mathbf{X}, \mathbf{Y}) := \sum_{i=1}^{n} \varphi_i(x_i, y_i)$. So, in a similar way to [32, 33], we can rewrite (4) in the following form:
\[
\min_{\mathbf{X} \in \mathcal{C}_x} \max_{\mathbf{Y} \in \mathcal{C}_y} \Phi(\mathbf{X}, \mathbf{Y}). \tag{6}
\]
Similarly to the first proposed minimization problem (1), note that problem (4) can also be rewritten in the same form
\[
\min_{\mathbf{X} \in \mathcal{C}_x} F(\mathbf{X}) = \sum_{i=1}^{n} f_i(x_i). \tag{7}
\]
So, for solving problem (5) (which is equivalent to problem (6)), we can use gradient-type algorithms; at each iteration of such an algorithm, we calculate an inexact gradient of $\varphi(x, \cdot)$ for each $x \in \mathbb{R}^{d_x}$ in order to solve the (inner) maximization problem in (5).
In practice, the Multi-step Gradient Descent Ascent (MGDA) algorithm and its modifications are widely used to solve problem (4) (see, e.g., [10, 11, 23]), as important algorithms for the considered type of saddle-point problems. In [25], the authors demonstrate the effectiveness of MGDA on a class of games in which one of the players satisfies the PL-condition and the other player has a general non-convex structure. At the same time, [41] shows that even a single step of gradient descent ascent demonstrates good performance and has theoretical guarantees of convergence in two-sided PL-games.
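The multi-step structure can be illustrated by a toy sketch (ours, not from the paper; the quadratic saddle function below is an assumption): MGDA runs several ascent steps in $y$ for every descent step in $x$:

```python
import numpy as np

def mgda(grad_x, grad_y, x0, y0, outer=50, inner=20, eta_x=0.1, eta_y=0.2):
    """Multi-step Gradient Descent Ascent (sketch): per outer iteration,
    run `inner` ascent steps in y, then one descent step in x."""
    x, y = x0, y0
    for _ in range(outer):
        for _ in range(inner):               # inner loop: approximately solve max_y
            y = y + eta_y * grad_y(x, y)
        x = x - eta_x * grad_x(x, y)         # outer step: descend in x
    return x, y

# toy saddle: phi(x, y) = 0.5*x**2 + x*y - 0.5*y**2, saddle point at (0, 0)
gx = lambda x, y: x + y                      # d phi / dx
gy = lambda x, y: x - y                      # d phi / dy
x_star, y_star = mgda(gx, gy, 2.0, -1.0)     # both approach 0
```

With a sufficiently accurate inner loop, the outer step behaves like gradient descent on $f(x) = \max_y \varphi(x, y)$, which is the viewpoint taken in (5).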
Finally, we mention that the main difference between distributed problems and usual optimization problems is the need to keep every agent's vector close to the agents' average. This is achieved through communication steps, where the communication can be performed in different scenarios. In our work, we consider a time-varying network with a changing edge set and use the standard consensus procedure (see Subsec. 2.2), in which the agents' vectors are averaged by multiplication with a weight matrix of the graph at the current moment.
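A minimal sketch of such a consensus procedure (our illustration; the fixed 3-node gossip matrix below is an assumption, whereas in a time-varying network the matrix $W_t$ changes from round to round):

```python
import numpy as np

def consensus(X, W_seq):
    """Standard consensus procedure (sketch): at round t the agents'
    vectors (rows of X) are averaged by multiplying with the doubly
    stochastic weight matrix W_t of the current communication graph."""
    for W in W_seq:
        X = W @ X
    return X

# hypothetical weight matrix of a 3-node graph (doubly stochastic;
# its second-largest eigenvalue modulus is 0.5, so the consensus
# error shrinks by a factor of 2 per communication round)
W = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 2))
X_out = consensus(X, [W] * 40)   # rows converge to the average row
```

Because each $W_t$ is doubly stochastic, the average of the rows is preserved at every round while the rows themselves contract toward it.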
1.1. Related works
The decentralized algorithm makes two types of steps: local updates and information
exchange. Local steps may use gradient [19, 26, 29, 30, 35, 42] or sub-gradient [27]
computations. In primal-only methods, the agents compute gradients of their local
functions and alternate taking gradient steps and communication procedures. Under
cheap communication costs, it may be beneficial to replace a single consensus iteration
with a series of information exchange rounds. Such methods as MSDA [34], D-NC [14],
and Mudag [42] employ multi-step gossip procedures. In order to achieve acceptable
complexity bounds, one may distribute accelerated methods directly [7, 14, 19, 30, 42]
or use a Catalyst framework [21]. Accelerated methods meet the lower complexity
bounds for decentralized optimization [12, 22, 34]. As usual, by complexity we mean the number of iterations of the algorithm sufficient to guarantee a solution of the problem with a given accuracy.
Consensus restrictions x1=. . . =xnmay be treated as linear constraints, thus
allowing for a dual reformulation of problem (1). Dual-based methods include dual
ascent and its accelerated variants [34, 38, 40, 44]. Primal-dual approaches like ADMM
[2, 39] are also implementable in decentralized scenarios. In [35], the authors developed
algorithms for non-smooth convex objectives and provided lower complexity bounds
for this case, as well.
Changing topology for time-varying networks requires new approaches to decen-
tralized methods and a more complicated theoretical analysis. The first method with
provable geometric convergence was proposed in [26]. Such primal algorithms as the
Push-Pull Gradient Method [29] and DIGing [26] are robust to network changes and
have theoretical guarantees of convergence over time-varying graphs. Recently, a dual
method for time-varying architectures was introduced in [24].
In [33], the problem of decentralized optimization with strongly convex smooth objective functions was studied. The authors investigated accelerated deterministic algorithms under time-varying network constraints with a consensus projection procedure. Distributed stochastic optimization over time-varying graphs with a consensus projection procedure was studied in [32]. The consensus projection procedure made it possible to use more acceptable smoothness and strong convexity parameters in the complexity estimates. The main goal of this paper is to investigate a similar consensus projection approach for problems with the PL-condition. Note that the approach of [32, 33] is based on the well-known concept of the inexact $(\delta, L, \mu)$-oracle. In the PL-case, however, we will use inexact gradients with additive noise to describe the consensus projection procedure.
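The additive-noise model can be sketched as follows (our illustration; the quadratic test function and the bound `delta` are assumptions): the method only sees $\nabla f(x) + e$ with $\|e\| \le \delta$, and gradient descent then reaches a neighborhood of the solution whose size is controlled by $\delta$:

```python
import numpy as np

def gd_inexact(grad, x0, L, delta, steps, rng):
    """Gradient descent with an additively perturbed gradient (sketch):
    each step uses grad(x) + e, where ||e|| <= delta models the error
    introduced by an inexact consensus projection."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        e = rng.standard_normal(x.shape)
        e *= delta / max(np.linalg.norm(e), 1e-12)   # noise of norm delta
        x = x - (1.0 / L) * (grad(x) + e)
    return x

# quadratic test: f(x) = ||x||^2, grad(x) = 2x; we take a conservative
# step 1/L with L = 4 (an upper bound on the true smoothness constant 2)
rng = np.random.default_rng(2)
x_final = gd_inexact(lambda x: 2.0 * x, 5.0 * np.ones(3), L=4.0,
                     delta=1e-3, steps=100, rng=rng)
# ||x_final|| is O(delta): the perturbation keeps the iterate in a
# small ball around the solution rather than letting it converge exactly
```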
1.2. Our contributions
Summing up, the contribution of this paper is as follows.
• We study the sum-type minimization problem when the objective function satisfies the PL-condition. To solve such a class of problems, we propose a gradient descent type method with a consensus projection procedure (see Algorithm 2) and access only to an inexact gradient. We consider two cases of gradient inexactness: bounded deterministic inexactness, and the sum of random noise and a deterministic bias. For both cases, we estimate the number of communication steps and iterations sufficient to reach the required quality with respect to the function value and the distance between the agents' vectors and their average.
• We study the decentralized saddle-point problem (with the structure of a sum) when the objective function satisfies the two-sided PL-condition. To solve such a generalized class of problems, we propose a generalization of the MGDA method (see Algorithm 3) with a consensus procedure. We provide an estimate of the sufficient number of inner- and outer-loop iterations to reach acceptable quality with respect to the function value. We also estimate the communication complexity required to make the distance between the agents' vectors and their average small enough. Additionally, we study the influence of gradient inexactness on the convergence of the proposed Algorithm 3. We suppose that the inexactness in the gradient can be separated into deterministic noise with bounded norm and zero-mean noise with finite second moment.
• We present numerical experiments that demonstrate the effectiveness of the proposed algorithms for the Least Squares and Robust Least Squares problems.
2. Fundamentals and Assumptions for the problems under consideration
Throughout the paper, ⟨·,·⟩ denotes the inner product of vectors or matrices. Cor-
respondingly, by ∥·∥, we denote the 2-norm for vectors or the Frobenius norm for
matrices.
First, for the minimization problem (1) (or its equivalent (3)), we assume the following.
Assumption 2.1 (Lipschitz smoothness). For every $i = 1, \ldots, n$, the function $f_i$ is $L_i$-smooth, for some $L_i > 0$.
Under this assumption, we find that the function $F(\mathbf{X})$ (see (2)) is $L_l$-smooth on $\mathbb{R}^{n \times d}$, where $L_l = \max_{1 \le i \le n} L_i$, and $L_g$-smooth on $\mathcal{C}$, where $L_g = \frac{1}{n} \sum_{i=1}^{n} L_i$. The constant $L_l$ is called a local constant and $L_g$ a global constant.
Assumption 2.2 (PL-condition). The function $f$ satisfies the PL-condition, i.e., the following inequality holds:
\[
f(x) - f^* \le \frac{1}{2\mu} \|\nabla f(x)\|^2, \quad \forall x \in \mathbb{R}^d, \tag{8}
\]
for some $\mu > 0$, where $f^*$ is the optimal value of the function $f$.
Also, under this assumption, we find that $f$ satisfies the quadratic growth condition (QG-condition) (see [15]):
\[
\|x - x^*\|^2 \le \frac{2}{\mu} \big( f(x) - f^* \big), \quad \forall x \in \mathbb{R}^d, \tag{9}
\]
where $x^*$ is the point nearest to $x$ in the solution set of the minimization problem under consideration.
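Inequalities (8) and (9) can be verified numerically for a concrete function; the sketch below (our illustration, with an arbitrary random instance) uses a strongly convex quadratic $f(x) = \frac{1}{2} x^\top A x$ with $A \succ 0$, which satisfies the PL-condition with $\mu = \lambda_{\min}(A)$, $f^* = 0$, and $x^* = 0$:

```python
import numpy as np

# Numerical check (sketch) of the PL-inequality (8) and QG-condition (9)
# for f(x) = 0.5 * x^T A x with A symmetric positive definite:
# ||grad f(x)||^2 = x^T A^2 x >= lambda_min(A) * x^T A x = 2*mu*f(x).
rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)                 # symmetric positive definite
mu = np.linalg.eigvalsh(A).min()        # PL constant for this f

for _ in range(100):
    x = rng.standard_normal(4)
    f = 0.5 * x @ A @ x
    g = A @ x                           # grad f(x)
    assert f - 0.0 <= g @ g / (2.0 * mu)        # PL-inequality (8)
    assert x @ x <= (2.0 / mu) * (f - 0.0)      # QG-condition (9)
```

For this quadratic both inequalities follow from the eigenvalue bound $x^\top A^2 x \ge \lambda_{\min}(A)\, x^\top A x$, so the random checks pass with margin.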
Second, for the saddle-point problem (4), let us introduce the following assumption.
Assumption 2.3 (Lipschitz smoothness). For every $i = 1, \ldots, n$, the function $\varphi_i(\cdot, \cdot)$ is differentiable with respect to both of its variables and smooth, i.e., the following inequalities hold:
\[
\|\nabla_x \varphi_i(x_1, y_1) - \nabla_x \varphi_i(x_2, y_2)\| \le L_{xx,i} \|x_1 - x_2\| + L_{xy,i} \|y_1 - y_2\|,
\]
\[
\|\nabla_y \varphi_i(x_1, y_1) - \nabla_y \varphi_i(x_2, y_2)\| \le L_{yx,i} \|x_1 - x_2\| + L_{yy,i} \|y_1 - y_2\|,
\]
for all $x_1, x_2 \in \mathbb{R}^{d_x}$, $y_1, y_2 \in \mathbb{R}^{d_y}$ and $L_{xx,i}, L_{yy,i}, L_{xy,i}, L_{yx,i} \in \mathbb{R}_+$.
Note that if Assumption 2.3 is satisfied, then it will also be satisfied for the function