ARTICLE TEMPLATE
Gradient-Type Methods For Decentralized Optimization Problems
With Polyak-Łojasiewicz Condition Over Time-Varying Networks
Ilya Kuruzov^a, Mohammad Alkousa^{a,b}, Fedor Stonyakin^{a,c}, and Alexander Gasnikov^{a,b,d}
^a Moscow Institute of Physics and Technology, Dolgoprudny, Russia; ^b National Research University Higher School of Economics, Moscow, Russia; ^c V. Vernadsky Crimean Federal University, Simferopol, Republic of Crimea; ^d ISP RAS Research Center for Trusted Artificial Intelligence
ARTICLE HISTORY
Compiled May 14, 2024
ABSTRACT
This paper focuses on decentralized optimization (minimization and saddle point) problems with objective functions that satisfy the Polyak-Łojasiewicz condition (PL-condition). The first part of the paper is devoted to the minimization of sum-type cost functions. To solve such a class of problems, we propose a gradient descent type method with a consensus projection procedure and an inexact gradient of the objectives. In the second part, we study the saddle-point problem (SPP) with a sum structure, with objectives satisfying the two-sided PL-condition. To solve such an SPP, we propose a generalization of the Multi-step Gradient Descent Ascent method with a consensus procedure and inexact gradients of the objective function with respect to both variables. Finally, we present numerical experiments that show the efficiency of the proposed algorithms for the robust least squares problem.
KEYWORDS
Distributed optimization; non-convex optimization; saddle-point problem;
PL-condition
1. Introduction
In this paper, we first study a sum-type minimization problem
\[
\min_{x \in \mathbb{R}^d} \Big\{ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \Big\}, \tag{1}
\]
where the functions $f_i$ are generally non-convex and stored separately by nodes of a communication network, represented by an undirected graph $G = (E, V)$. The graph $G$ may have a time-varying structure. This problem, where the
This work was supported by a grant for research centers in the field of artificial intelligence, provided by the
Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement
(agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and
Technology dated November 1, 2021 No. 70-2021-00138.
CONTACT Ilya Kuruzov. Email: kuruzov.ia@phystech.edu
arXiv:2210.03810v2 [math.OC] 12 May 2024
objective function depends on distributed data, is called a decentralized optimization problem. We assume that $f$ satisfies the well-known Polyak-Łojasiewicz condition (for brevity, PL-condition). This condition was originally introduced by Polyak [28], who proved that it suffices to guarantee a global linear convergence rate for gradient descent without assuming convexity. The PL-condition has been studied in many different settings of optimization problems and has been theoretically verified for objective functions arising in many practical problems. For example, it has been proven to hold for the objectives of over-parameterized deep networks [6], learning LQR models [9], phase retrieval [37], and generative adversarial imitation learning of linear quadratic regulators (see Example 3.1 in [3]). Further discussion of the PL-condition and more examples can be found in [15].
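To make this concrete, here is a small numerical sketch (our illustration, not from the paper): the function $f(x) = x^2 + 3\sin^2 x$ is non-convex but is known to satisfy the PL-condition (with $\mu = 1/32$ according to [15]), and plain gradient descent on it converges linearly to the global minimum $f^* = 0$:

```python
import numpy as np

# Hypothetical example: f(x) = x^2 + 3*sin(x)^2 is non-convex but
# satisfies the PL-condition, so gradient descent with step 1/L
# (here L = 8, since |f''(x)| = |2 + 6*cos(2x)| <= 8) converges
# linearly to the global minimum f* = 0 at x* = 0.
f = lambda x: x**2 + 3.0 * np.sin(x)**2
grad = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)

x, L = 3.0, 8.0
gaps = []                      # record f(x_k) - f* along the trajectory
for _ in range(300):
    x -= grad(x) / L
    gaps.append(f(x))
# gaps decreases monotonically and geometrically (linear convergence)
```

Despite the non-convexity of $f$, the iterates are not trapped away from the global minimizer, which is exactly what the PL-condition guarantees.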
Problems of this type arise in different areas: distributed machine learning [17], resource allocation [13], and power system control [31].
The problem (1) can be reformulated as a problem with linear constraints. For this, let us assign each agent in the network a personal copy of the parameter vector $x_i \in \mathbb{R}^d$ (a column vector) and introduce
\[
\mathbf{X} := \big( x_1 \; \ldots \; x_n \big)^\top \in \mathbb{R}^{n \times d}, \qquad F(\mathbf{X}) = \sum_{i=1}^{n} f_i(x_i). \tag{2}
\]
Now we equivalently rewrite problem (1) as
\[
\min_{\mathbf{X} \in \mathbb{R}^{n \times d}} F(\mathbf{X}) = \sum_{i=1}^{n} f_i(x_i), \quad \text{s.t. } x_1 = \dots = x_n, \tag{3}
\]
which has the same optimal value as problem (1). This reformulation increases the number of variables but at the same time induces additional constraints.
Let us denote the set of consensus constraints $\mathcal{C} = \{\mathbf{X} \,|\, x_1 = \dots = x_n\}$. Also, for each $\mathbf{X} \in \mathbb{R}^{n \times d}$, denote the average of the agents' vectors $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \in \mathbb{R}^d$ and introduce the projection of $\mathbf{X}$ onto the constraint set
\[
\overline{\mathbf{X}} = \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top \mathbf{X} = \Pi_{\mathcal{C}}(\mathbf{X}) = \big( \bar{x} \; \ldots \; \bar{x} \big)^\top \in \mathbb{R}^{n \times d}.
\]
Note that $\mathcal{C}$ is a linear subspace of $\mathbb{R}^{n \times d}$, and therefore the projection operator $\Pi_{\mathcal{C}}$ is linear.
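Both the matrix form $\frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top \mathbf{X}$ and the linearity of $\Pi_{\mathcal{C}}$ can be checked directly; the following sketch (our illustration, with arbitrary dimensions) does so in NumPy:

```python
import numpy as np

n, d = 4, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))   # row i holds agent i's copy x_i^T

# Projection onto the consensus set C = {X : x_1 = ... = x_n}:
# every row is replaced by the average of all rows.
P = np.ones((n, n)) / n           # the matrix (1/n) * 1_n 1_n^T

X_bar = P @ X
assert np.allclose(X_bar, np.tile(X.mean(axis=0), (n, 1)))

# Pi_C is linear: Pi_C(a*X + b*Y) = a*Pi_C(X) + b*Pi_C(Y)
Y = rng.standard_normal((n, d))
assert np.allclose(P @ (2.0 * X - 3.0 * Y), 2.0 * (P @ X) - 3.0 * (P @ Y))
```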
Secondly, we study the saddle-point problem with a sum structure
\[
\min_{x \in \mathbb{R}^{d_x}} \max_{y \in \mathbb{R}^{d_y}} \Big\{ \varphi(x, y) = \frac{1}{n} \sum_{i=1}^{n} \varphi_i(x, y) \Big\}, \tag{4}
\]
where the functions $\varphi_i(\cdot, \cdot)$ are non-convex (with respect to the variable $x$ for each $y$) and smooth (i.e., with Lipschitz-continuous gradient). These functions can be computed only separately in different nodes of the communication network, which is represented by an undirected graph $G$.
Problems of type (4) arise in many applications, such as generative adversarial networks [10], adversarial training [18], and fair training [3]. In our work with problem (4), we assume that $\varphi(\cdot, y)$ and $\varphi(x, \cdot)$ satisfy the PL-condition (see Assumption 2.2 below).
Let $y^*(x) := \arg\max_{y \in \mathbb{R}^{d_y}} \varphi(x, y)$, and set $f_i(x) := \varphi_i(x, y^*(x))$; then problem (4) can be rewritten in the form of the minimization problem (1), i.e.,
\[
\min_{x \in \mathbb{R}^{d_x}} \Big( \max_{y \in \mathbb{R}^{d_y}} \Big\{ \varphi(x, y) = \frac{1}{n} \sum_{i=1}^{n} \varphi_i(x, y) \Big\} \Big) = \min_{x \in \mathbb{R}^{d_x}} \Big\{ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \Big\}. \tag{5}
\]
Let $\mathcal{C}_x := \{\mathbf{X} \in \mathbb{R}^{n \times d_x} \,|\, x_1 = \dots = x_n\}$, $\mathcal{C}_y := \{\mathbf{Y} \in \mathbb{R}^{n \times d_y} \,|\, y_1 = \dots = y_n\}$, and $\Phi(\mathbf{X}, \mathbf{Y}) := \sum_{i=1}^{n} \varphi_i(x_i, y_i)$. So, in a similar way to [32, 33], we can rewrite (4) in the following form:
\[
\min_{\mathbf{X} \in \mathcal{C}_x} \max_{\mathbf{Y} \in \mathcal{C}_y} \Phi(\mathbf{X}, \mathbf{Y}). \tag{6}
\]
Similarly to the first proposed minimization problem (1), note that problem (4) can also be rewritten in the same form
\[
\min_{\mathbf{X} \in \mathcal{C}_x} F(\mathbf{X}) = \sum_{i=1}^{n} f_i(x_i). \tag{7}
\]
So, for solving problem (5) (which is equivalent to problem (6)), we can use gradient-type algorithms; at each iteration of such an algorithm, we calculate an inexact gradient of $\varphi(x, \cdot)$ for each $x \in \mathbb{R}^{d_x}$ in order to solve the (inner) maximization problem in (5).
In practice, the Multi-step Gradient Descent Ascent (MGDA) algorithm and its modifications are widely used to solve problem (4) (see, e.g., [10, 11, 23]), as important algorithms for the considered type of saddle-point problems. In [25], the authors demonstrate the effectiveness of MGDA on a class of games in which one of the players satisfies the PL-condition and the other player has a general non-convex structure. At the same time, [41] shows that even a single step of gradient descent ascent demonstrates good performance and has theoretical guarantees of convergence in two-sided PL-games.
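The multi-step structure can be illustrated by a toy sketch (ours, not from the paper; the quadratic saddle function below is an assumption): MGDA runs several ascent steps in $y$ for every descent step in $x$:

```python
import numpy as np

def mgda(grad_x, grad_y, x0, y0, outer=50, inner=20, eta_x=0.1, eta_y=0.2):
    """Multi-step Gradient Descent Ascent (sketch): per outer iteration,
    run `inner` ascent steps in y, then one descent step in x."""
    x, y = x0, y0
    for _ in range(outer):
        for _ in range(inner):               # inner loop: approximately solve max_y
            y = y + eta_y * grad_y(x, y)
        x = x - eta_x * grad_x(x, y)         # outer step: descend in x
    return x, y

# toy saddle: phi(x, y) = 0.5*x**2 + x*y - 0.5*y**2, saddle point at (0, 0)
gx = lambda x, y: x + y                      # d phi / dx
gy = lambda x, y: x - y                      # d phi / dy
x_star, y_star = mgda(gx, gy, 2.0, -1.0)     # both approach 0
```

With a sufficiently accurate inner loop, the outer step behaves like gradient descent on $f(x) = \max_y \varphi(x, y)$, which is the viewpoint taken in (5).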
Finally, we mention that the main difference between distributed problems and usual optimization problems is the need to keep every agent's vector close to the agents' average. This is achieved through communication steps, where the communication can be performed in different scenarios. In our work, we consider a time-varying network with a changing edge set and use the standard consensus procedure (see Subsec. 2.2), in which the agents' vectors are averaged by multiplication with a weight matrix of the graph at the current moment.
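A minimal sketch of such a consensus procedure (our illustration; the fixed 3-node gossip matrix below is an assumption, whereas in a time-varying network the matrix $W_t$ changes from round to round):

```python
import numpy as np

def consensus(X, W_seq):
    """Standard consensus procedure (sketch): at round t the agents'
    vectors (rows of X) are averaged by multiplying with the doubly
    stochastic weight matrix W_t of the current communication graph."""
    for W in W_seq:
        X = W @ X
    return X

# hypothetical weight matrix of a 3-node graph (doubly stochastic;
# its second-largest eigenvalue modulus is 0.5, so the consensus
# error shrinks by a factor of 2 per communication round)
W = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 2))
X_out = consensus(X, [W] * 40)   # rows converge to the average row
```

Because each $W_t$ is doubly stochastic, the average of the rows is preserved at every round while the rows themselves contract toward it.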
1.1. Related works
The decentralized algorithm makes two types of steps: local updates and information
exchange. Local steps may use gradient [19, 26, 29, 30, 35, 42] or sub-gradient [27]
computations. In primal-only methods, the agents compute gradients of their local
functions and alternate taking gradient steps and communication procedures. Under
cheap communication costs, it may be beneficial to replace a single consensus iteration
with a series of information exchange rounds. Such methods as MSDA [34], D-NC [14],
and Mudag [42] employ multi-step gossip procedures. In order to achieve acceptable
complexity bounds, one may distribute accelerated methods directly [7, 14, 19, 30, 42]
or use a Catalyst framework [21]. Accelerated methods meet the lower complexity
bounds for decentralized optimization [12, 22, 34]. As usual, by complexity we mean the number of iterations of the algorithm sufficient to guarantee a solution of the problem with a given accuracy.
Consensus restrictions x1=. . . =xnmay be treated as linear constraints, thus
allowing for a dual reformulation of problem (1). Dual-based methods include dual
ascent and its accelerated variants [34, 38, 40, 44]. Primal-dual approaches like ADMM
[2, 39] are also implementable in decentralized scenarios. In [35], the authors developed
algorithms for non-smooth convex objectives and provided lower complexity bounds
for this case, as well.
Changing topology for time-varying networks requires new approaches to decen-
tralized methods and a more complicated theoretical analysis. The first method with
provable geometric convergence was proposed in [26]. Such primal algorithms as the
Push-Pull Gradient Method [29] and DIGing [26] are robust to network changes and
have theoretical guarantees of convergence over time-varying graphs. Recently, a dual
method for time-varying architectures was introduced in [24].
In [33], the problem of decentralized optimization with strongly convex smooth objective functions was studied. The authors investigated accelerated deterministic algorithms under time-varying network constraints with a consensus projection procedure. Distributed stochastic optimization over time-varying graphs with a consensus projection procedure was studied in [32]. The consensus projection procedure made it possible to use more acceptable smoothness and strong convexity parameters in the complexity estimates. The main goal of this paper is to investigate a similar consensus projection approach for problems with the PL-condition. Note that the approach of [32, 33] is based on the well-known concept of the inexact $(\delta, L, \mu)$-oracle. In the PL-case, however, we will use inexact gradients with additive noise to describe the consensus projection procedure.
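The additive-noise model can be sketched as follows (our illustration; the quadratic test function and the bound `delta` are assumptions): the method only sees $\nabla f(x) + e$ with $\|e\| \le \delta$, and gradient descent then reaches a neighborhood of the solution whose size is controlled by $\delta$:

```python
import numpy as np

def gd_inexact(grad, x0, L, delta, steps, rng):
    """Gradient descent with an additively perturbed gradient (sketch):
    each step uses grad(x) + e, where ||e|| <= delta models the error
    introduced by an inexact consensus projection."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        e = rng.standard_normal(x.shape)
        e *= delta / max(np.linalg.norm(e), 1e-12)   # noise of norm delta
        x = x - (1.0 / L) * (grad(x) + e)
    return x

# quadratic test: f(x) = ||x||^2, grad(x) = 2x; we take a conservative
# step 1/L with L = 4 (an upper bound on the true smoothness constant 2)
rng = np.random.default_rng(2)
x_final = gd_inexact(lambda x: 2.0 * x, 5.0 * np.ones(3), L=4.0,
                     delta=1e-3, steps=100, rng=rng)
# ||x_final|| is O(delta): the perturbation keeps the iterate in a
# small ball around the solution rather than letting it converge exactly
```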
1.2. Our contributions
Summing up, the contribution of this paper is as follows.
• We study the sum-type minimization problem when the objective function satisfies the PL-condition. To solve such a class of problems, we propose a gradient descent type method with a consensus projection procedure (see Algorithm 2) and access only to an inexact gradient. We consider two cases of gradient inexactness: bounded deterministic inexactness, and the sum of random noise and a deterministic bias. For both cases, we estimate the number of communication steps and iterations sufficient to reach the required quality with respect to the function value and the distance between the agents' vectors and their average.
• We study the decentralized saddle-point problem (with the structure of a sum) when the objective function satisfies the two-sided PL-condition. To solve such a generalized class of problems, we propose a generalization of the MGDA method (see Algorithm 3) with a consensus procedure. We provide an estimate of the sufficient number of inner- and outer-loop iterations to reach acceptable quality with respect to the function value. We also estimate the communication complexity required to make the distance between the agents' vectors and their average small enough. Additionally, we study the influence of gradient inexactness on the convergence of the proposed Algorithm 3. We suppose that the inexactness in the gradient can be separated into deterministic noise with bounded norm and zero-mean noise with finite second moment.
• We present numerical experiments that demonstrate the effectiveness of the proposed algorithms for the Least Squares and Robust Least Squares problems.
2. Fundamentals and Assumptions for the problems under consideration
Throughout the paper, ⟨·,·⟩ denotes the inner product of vectors or matrices. Cor-
respondingly, by ∥·∥, we denote the 2-norm for vectors or the Frobenius norm for
matrices.
First, for the minimization problem (1) (or its equivalent (3)), we assume the following.
Assumption 2.1 (Lipschitz smoothness). For every $i = 1, \ldots, n$, the function $f_i$ is $L_i$-smooth, for some $L_i > 0$.
Under this assumption, we find that the function $F(\mathbf{X})$ (see (2)) is $L_l$-smooth on $\mathbb{R}^{n \times d}$, where $L_l = \max_{1 \le i \le n} L_i$, and $L_g$-smooth on $\mathcal{C}$, where $L_g = \frac{1}{n} \sum_{i=1}^{n} L_i$. The constant $L_l$ is called a local constant and $L_g$ a global constant.
Assumption 2.2 (PL-condition). The function $f$ satisfies the PL-condition, i.e., the following inequality holds:
\[
f(x) - f^* \le \frac{1}{2\mu} \|\nabla f(x)\|^2, \quad \forall x \in \mathbb{R}^d, \tag{8}
\]
for some $\mu > 0$, where $f^*$ is the optimal value of the function $f$.
Also, under this assumption, we find that $f$ satisfies the quadratic growth condition (QG-condition) (see [15]):
\[
\|x - x^*\|^2 \le \frac{2}{\mu} \big( f(x) - f^* \big), \quad \forall x \in \mathbb{R}^d, \tag{9}
\]
where $x^*$ is the point nearest to $x$ in the solution set of the minimization problem under consideration.
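Inequalities (8) and (9) can be verified numerically for a concrete function; the sketch below (our illustration, with an arbitrary random instance) uses a strongly convex quadratic $f(x) = \frac{1}{2} x^\top A x$ with $A \succ 0$, which satisfies the PL-condition with $\mu = \lambda_{\min}(A)$, $f^* = 0$, and $x^* = 0$:

```python
import numpy as np

# Numerical check (sketch) of the PL-inequality (8) and QG-condition (9)
# for f(x) = 0.5 * x^T A x with A symmetric positive definite:
# ||grad f(x)||^2 = x^T A^2 x >= lambda_min(A) * x^T A x = 2*mu*f(x).
rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)                 # symmetric positive definite
mu = np.linalg.eigvalsh(A).min()        # PL constant for this f

for _ in range(100):
    x = rng.standard_normal(4)
    f = 0.5 * x @ A @ x
    g = A @ x                           # grad f(x)
    assert f - 0.0 <= g @ g / (2.0 * mu)        # PL-inequality (8)
    assert x @ x <= (2.0 / mu) * (f - 0.0)      # QG-condition (9)
```

For this quadratic both inequalities follow from the eigenvalue bound $x^\top A^2 x \ge \lambda_{\min}(A)\, x^\top A x$, so the random checks pass with margin.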
Second, for the saddle-point problem (4), let us introduce the following assumption.
Assumption 2.3 (Lipschitz smoothness). For every $i = 1, \ldots, n$, the function $\varphi_i(\cdot, \cdot)$ is differentiable with respect to both of its variables and smooth, i.e., the following inequalities hold:
\[
\|\nabla_x \varphi_i(x_1, y_1) - \nabla_x \varphi_i(x_2, y_2)\| \le L_{xx,i} \|x_1 - x_2\| + L_{xy,i} \|y_1 - y_2\|,
\]
\[
\|\nabla_y \varphi_i(x_1, y_1) - \nabla_y \varphi_i(x_2, y_2)\| \le L_{yx,i} \|x_1 - x_2\| + L_{yy,i} \|y_1 - y_2\|,
\]
for all $x_1, x_2 \in \mathbb{R}^{d_x}$, $y_1, y_2 \in \mathbb{R}^{d_y}$ and $L_{xx,i}, L_{yy,i}, L_{xy,i}, L_{yx,i} \in \mathbb{R}_+$.
Note that if Assumption 2.3 is satisfied, then it will also be satisfied for the function