Almost Sure Convergence of Distributed Optimization with
Imperfect Information Sharing
Hadi Reisizadeh, Anand Gokhale, Behrouz Touri, and Soheil Mohajer
Abstract
To design algorithms that reduce communication cost or meet rate constraints and are robust to
communication noise, we study convex distributed optimization problems where a set of agents are
interested in solving a separable optimization problem collaboratively with imperfect information sharing
over time-varying networks. We study the almost sure convergence of a two-time-scale decentralized
gradient descent algorithm that reaches consensus on an optimizer of the objective loss function. One
time-scale fades out the imperfect incoming information from neighboring agents, and the other controls
the step-size applied to the local loss functions' gradients. We show that, under certain conditions on the
connectivity of the underlying time-varying network and on the time-scale sequences, the dynamics
converge almost surely to an optimal point in the optimizer set of the loss function.
1 Introduction
In recent years, with the rapid growth of areas such as big data and machine learning, there has been a
surge of interest in studying multi-agent networks. In many machine learning applications, it is impractical
to implement the learning task in a centralized fashion due to the decentralized nature of datasets. Multi-
agent networked systems provide scalability to larger datasets and systems, data locality, ownership, and
privacy, especially for modern computing services. These systems arise in various applications such as sensor
networks [1], multi-agent control [2], large-scale machine learning [3], and power networks [4]. The task of
learning a common objective function over a multi-agent network can be reduced to a distributed optimization
problem.
In distributed optimization problems, a set of agents is interested in finding a minimizer of a separable
function $f(x) := \sum_{i=1}^{n} f_i(x)$, where each agent $i$ has access to the local and private loss function
$f_i(\cdot)$ of this decomposable cost function. The goal, therefore, is to identify system dynamics that
guarantee the asymptotic convergence of all agent states to a common state that minimizes the objective cost
function $f(\cdot)$. Various methods have been proposed to solve distributed optimization problems in strongly
convex [5, 6], convex [7, 8], and non-convex settings [9-12]. Under the perfect information sharing assumption,
a subgradient-push algorithm is proposed for (strongly) convex loss functions in [13]. The almost sure and
$L_2$-convergence of this method are shown under certain conditions on the connectivity, the loss functions,
and the step-size sequence.
Most of the works mentioned above on distributed optimization assume ideal communication channels and
perfect information sharing among agents. However, most communication channels are noisy, and exchanging
real-valued vectors introduces a massive communication overhead. To mitigate this challenge, various
compression approaches have been introduced [14, 15], in which vectors are represented with a finite number
of bits, or only a certain number of the most significant coordinates of each vector are selected. The
quantized/sparsified vectors are then communicated over the network, so each agent receives an imperfect
estimate of the intended messages from its neighboring agents. Various gradient descent algorithms with
imperfect information sharing have been proposed [16-18], with convergence rates shown in the $L_2$ sense
for a diverse set of problem setups.
H. Reisizadeh (email: hadir@umn.edu) and S. Mohajer (email: soheil@umn.edu) are with the University of Minnesota,
A. Gokhale (email: anand_gokhale@ucsb.edu) is with the University of California Santa Barbara, and B. Touri (email:
btouri@ucsd.edu) is with the University of California San Diego.
arXiv:2210.05897v2 [math.OC] 30 Apr 2023
A related work [19] considers distributed constrained optimization problems with noisy communication
links. That work differs from our current study in that it assumes i.i.d. weight matrices with a symmetric
expected weight matrix. In a recent work [20], a two-time-scale gradient descent algorithm is presented
for (strongly) convex cost functions over a convex compact set. Assuming a fixed topology for the underlying
network, uniform weighting of the local cost functions, and a specific scheme for the lossy sharing of
information, it is shown that under certain conditions the agents' states converge almost surely to the
optimal solution of the problem. For averaging-based distributed optimization over random networks with
possible dependence on the past, and under certain conditions, almost sure convergence to an optimal point
is established in [21].
In this paper, we study distributed convex optimization problems over time-varying networks with
imperfect information sharing. We consider the two-time-scale gradient descent method studied in [18, 22]
to solve the optimization problem. One time-scale adjusts the (imperfect) incoming information from the
neighboring agents, and the other controls the step-size applied to the local cost functions' gradients. It is
shown there that, with a proper choice of parameters, the algorithm reaches the global optimal point for
strongly convex loss functions at a rate of $O(T^{-1/2})$, and achieves a convergence rate of $O(T^{-1/3+\epsilon})$,
for any $\epsilon > 0$, for non-convex cost functions in the $L_2$ sense. Here, we identify sufficient conditions
on the step-size sequences for the almost sure convergence of the agents' states to an optimal solution for the
class of convex cost functions.
The paper is organized as follows. We conclude this section by discussing the notations used in the paper.
We formulate the main problem and state the relevant underlying assumptions in Section 2. In Section 3, we
present our main results. To corroborate our theoretical analysis, we present simulation results in Section 4.
We discuss some preliminary results required to prove the main results in Section 5; the proofs of the
corresponding lemmas are presented in the Appendix. We then present the proofs of the main results in
Section 6 and Section 7. Finally, we conclude the paper and discuss possible directions for future work
in Section 8.
Notation. We denote the set of integers $\{1, \ldots, n\}$ by $[n]$. In this paper, we consider $n$ agents that
are minimizing a function on $\mathbb{R}^d$. We assume that all local objective functions act on row vectors
in $\mathbb{R}^{1\times d} = \mathbb{R}^d$, and thus we view vectors in $\mathbb{R}^d$ as row vectors. The remaining vectors, i.e., the
vectors in $\mathbb{R}^{n\times 1} = \mathbb{R}^n$, are assumed to be column vectors. We denote the $L_2$ norm of a vector $x \in \mathbb{R}^d$ by
$\|x\|$. For a matrix $A \in \mathbb{R}^{n\times d}$ and a strictly positive stochastic vector $r \in \mathbb{R}^n$, we define the $r$-norm
of $A$ by $\|A\|_r^2 = \sum_{i=1}^{n} r_i \|A_i\|_2^2$, where $A_i$ denotes the $i$-th row of $A$. We denote the Frobenius norm of a
matrix $A \in \mathbb{R}^{n\times d}$ by $\|A\|^2 = \sum_{i=1}^{n}\sum_{j=1}^{d} |A_{ij}|^2$. The difference operator on a real-valued sequence $\{a(t)\}$
is defined as $\Delta a(t) := a(t+1) - a(t)$ for every $t \geq 1$.
2 Problem Formulation
In this section, we first formulate the problem of interest. Then, we present the underlying assumptions on
the information exchange, the network connectivity, the properties of the cost functions, and the time-scales
sequences.
2.1 Problem Statement
Consider a set of $n \geq 2$ agents that are connected through a time-varying network. Each agent $i \in [n]$ has
access to a local cost function $f_i: \mathbb{R}^d \to \mathbb{R}$. The goal of this work is to solve the optimization problem
\[
\min_{x_1, x_2, \ldots, x_n \in \mathbb{R}^d} \sum_{i=1}^{n} r_i f_i(x_i) \quad \text{subject to} \quad x_1 = \cdots = x_n,
\]
where $r = (r_1, r_2, \ldots, r_n)^T$ is a stochastic vector, i.e., $r_i \geq 0$ and $\sum_{i=1}^{n} r_i = 1$.
At each iteration $t \geq 1$, we represent the topology of the network connecting the agents by a directed
graph $\mathcal{G}(t) = ([n], \mathcal{E}(t))$, where the vertex set $[n]$ represents the agents, and the edge set $\mathcal{E}(t) \subseteq [n] \times [n]$
represents the connectivity pattern of the agents at time $t$, i.e., the edge $(i,j) \in \mathcal{E}(t)$ denotes a directed edge
from agent $i$ to agent $j$. We assume that at each time $t \geq 1$, each agent $i \in [n]$ can only send a message to its
out-neighbours, i.e., the set of all agents $j$ such that $(i,j) \in \mathcal{E}(t)$. We also assume that $\mathcal{G}(t)$ satisfies certain
connectivity conditions, which are discussed in Assumption 2.
In this work, the communication between the agents is assumed to be imperfect. We adopt the general
framework of noisy information sharing introduced in [18], as described below. Given the states $x_i(t)$
of agents $i \in [n]$ at time $t$, we assume that each agent has access to an imperfect weighted average of its
in-neighbours' states, denoted by $\hat{x}_i(t)$ and given by
\[
\hat{x}_i(t) = \sum_{j=1}^{n} W_{ij}(t)\, x_j(t) + e_i(t),
\]
where $W(t) = [W_{ij}(t)]$ is a row-stochastic matrix and $e_i(t)$ is a random noise vector in $\mathbb{R}^d$. Note that the
matrix $W(t)$ is consistent with the underlying graph $\mathcal{G}(t)$. More precisely, we have $W_{ij}(t) > 0$ if and only if
$(j,i) \in \mathcal{E}(t)$, where $\mathcal{E}(t)$ is the edge set of the graph $\mathcal{G}(t)$.
Regarding the local cost functions, we assume that agent $i \in [n]$ has access to a subgradient $g_i(x_i(t))$ of
the local cost function $f_i(\cdot)$ at its local decision variable $x_i(t)$ at time $t$. Inspired by [18], the
update rule in this work is
\[
x_i(t+1) = (1 - \beta(t))\, x_i(t) + \beta(t)\, \hat{x}_i(t) - \alpha(t)\, g_i(x_i(t)), \tag{1}
\]
where $\{\beta(t)\}$ and $\{\alpha(t)\}$ are the step-size sequences of the algorithm. This work identifies sufficient
conditions on the step-size sequences for the almost sure convergence of the dynamics (1) to an optimal point
$x^\star \in \mathcal{X}^\star = \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} r_i f_i(x)$. For simplicity of notation, let
\[
X(t) := \begin{bmatrix} x_1(t) \\ \vdots \\ x_n(t) \end{bmatrix}, \quad
E(t) := \begin{bmatrix} e_1(t) \\ \vdots \\ e_n(t) \end{bmatrix}, \quad
G(t) := \begin{bmatrix} g_1(x_1(t)) \\ \vdots \\ g_n(x_n(t)) \end{bmatrix}.
\]
Using these matrices, we can write the update rule (1) as a linear time-varying system:
\[
X(t+1) = A(t) X(t) + U(t), \tag{2}
\]
where $A(t) := (1 - \beta(t)) I + \beta(t) W(t)$ and $U(t) := \beta(t) E(t) - \alpha(t) G(t)$.
We also define $\Phi(t{:}s) := A(t-1) \cdots A(s+1)$ for $t > s$, with $\Phi(t{:}t-1) = I$.
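To make the two-time-scale recursion concrete, the following is a minimal simulation sketch of the stacked dynamics (2) (equivalently, the per-agent update (1)). It is not the authors' experimental setup: the 4-cycle network, the fixed doubly stochastic weight matrix (so that $r$ is uniform), the quadratic local costs $f_i(x) = \frac{1}{2}\|x - c_i\|^2$, and all numerical constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 4, 2                      # agents, dimension
c = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
# Local costs f_i(x) = 0.5*||x - c_i||^2, so g_i(x) = x - c_i and the
# (uniformly weighted) optimizer is the mean of the c_i's, here 0.

# Fixed doubly stochastic weights on a 4-cycle (r is then uniform).
W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])

alpha0, nu = 1.0, 1.0            # alpha(t) = alpha0 / t**nu
beta0, mu = 1.0, 0.75            # beta(t)  = beta0 / t**mu
noise_std = 0.05                 # per-coordinate communication-noise std

X = rng.normal(size=(n, d))      # stacked states X(1)
for t in range(1, 50001):
    alpha, beta = alpha0 / t**nu, beta0 / t**mu
    E = noise_std * rng.normal(size=(n, d))    # noise matrix E(t)
    G = X - c                                  # stacked subgradients G(t)
    A = (1.0 - beta) * np.eye(n) + beta * W    # A(t)
    X = A @ X + beta * E - alpha * G           # X(t+1) = A(t)X(t) + U(t)

consensus_gap = np.max(np.linalg.norm(X - X.mean(axis=0), axis=1))
opt_gap = np.max(np.linalg.norm(X - c.mean(axis=0), axis=1))
print(consensus_gap, opt_gap)    # both should be small
```

With these (illustrative) parameters, $(\mu, \nu) = (0.75, 1)$ lies in the almost-sure-convergence region of Proposition 1, and all four agents end up close to the common minimizer $\frac{1}{n}\sum_i c_i = 0$.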
2.2 Assumptions
To proceed with our main results, we need certain assumptions regarding the noise vectors $\{e_i(t)\}$,
the weight matrices $\{W(t)\}$, the local cost functions, and the step-size sequences.

Assumption 1 (Noise Sequence Assumptions) We assume that the noise sequence $\{e_i(t)\}$ satisfies
\[
\mathbb{E}[e_i(t) \mid \mathcal{F}_t] = 0, \quad \text{and} \quad \mathbb{E}\!\left[\|e_i(t)\|^2 \mid \mathcal{F}_t\right] \leq \gamma,
\]
for some $\gamma > 0$, all $i \in [n]$, and all $t \geq 1$. Here, $\{\mathcal{F}_t\}$ is the natural filtration of the random process $\{X(t)\}$.
Assumption 2 (Connectivity Assumptions) We assume that the sequence $\{\mathcal{G}(t)\}$ of underlying graphs
and its associated weight matrix sequence $\{W(t)\}$ satisfy the following properties.

(a) $W(t)$ is non-negative, $W(t)\mathbf{1} = \mathbf{1}$, and $r^T W(t) = r^T$ for all $t \geq 1$, where $\mathbf{1}$ is the all-one vector and
$r > 0$ is the given stochastic weight vector.

(b) Each nonzero element of $W(t)$ is bounded away from zero, i.e., there exists some $\eta > 0$ such that if
$W_{ij}(t) > 0$ for $i, j \in [n]$ and $t \geq 1$, then $W_{ij}(t) \geq \eta$.

(c) There exists an integer $B \geq 1$ such that the graph $\big([n], \bigcup_{k=t+1}^{t+B} \mathcal{E}(k)\big)$ is strongly connected for all $t \geq 1$,
where $\mathcal{E}(k)$ is the edge set of $\mathcal{G}(k)$.
Assumption 3 (Objective Function Assumptions) We assume that the objective functions $f_i$ satisfy the
following properties.

(a) $f_i$ is convex for all $i \in [n]$.

(b) The optimizer set $\mathcal{X}^\star := \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} r_i f_i(x)$ is non-empty.

(c) Each $f_i$ has bounded subgradients, i.e., there exists $L > 0$ such that $\|g_i\| < L$ for all subgradients $g_i$ of
$f_i(x)$ at every $x \in \mathbb{R}^d$. This also implies that each $f_i(\cdot)$ is $L$-Lipschitz continuous, i.e.,
\[
|f_i(x) - f_i(y)| < L \|x - y\|,
\]
for all $x, y \in \mathbb{R}^d$.
Assumption 4 (Step-size Sequences Assumptions) For the non-increasing step-size sequences $\{\alpha(t)\}$
and $\{\beta(t)\}$, where $\{\beta(t)\}$ takes values in $[0,1]$, we assume that

(a) $\sum_{t=1}^{\infty} \alpha(t) = \infty$,

(b) $\sum_{t=1}^{\infty} \alpha^2(t) < \infty$, $\sum_{t=1}^{\infty} \beta^2(t) < \infty$, and

(c) $\sum_{t=1}^{\infty} \frac{\alpha^2(t)}{\beta(t)} < \infty$.

Also, there exists some $t_0 \geq 1$ such that for every $t \geq t_0$,

(d) $-\Delta\beta(t) \leq c_1 \beta^2(t)$,

(e) $-\Delta\alpha(t) \leq c_2\, \alpha(t)\beta(t)$,

for some positive constants $c_1 < \frac{\lambda}{2}$ and $c_2 < \frac{\lambda}{4}$, where $\lambda := \frac{\eta\, r_{\min}}{2Bn^2} < 1$ and $r_{\min} := \min_{i \in [n]} \{r_i\} \leq 1$.
Remark 1 Note that Assumptions 4-(a), (b), and (c) imply $\sum_{t=1}^{\infty} \beta(t) = \infty$: if $\sum_{t=1}^{\infty} \beta(t) < \infty$, then using the
Cauchy-Schwarz inequality we get
\[
\sum_{t=1}^{\infty} \alpha(t) \leq \left( \sum_{t=1}^{\infty} \frac{\alpha^2(t)}{\beta(t)} \right)^{\!\frac{1}{2}} \left( \sum_{t=1}^{\infty} \beta(t) \right)^{\!\frac{1}{2}} < \infty,
\]
which contradicts Assumption 4-(a). Similarly, we can write
\[
\sum_{t=1}^{\infty} \alpha(t)\, \beta^{\frac{1}{2}}(t) \leq \left( \sum_{t=1}^{\infty} \frac{\alpha^2(t)}{\beta(t)} \right)^{\!\frac{1}{2}} \left( \sum_{t=1}^{\infty} \beta^2(t) \right)^{\!\frac{1}{2}} < \infty, \tag{3}
\]
where the second inequality follows from Assumptions 4-(b) and (c).
Remark 2 Note that unlike Assumptions 4-(a)-(c), which do not depend on the parameters of the dynamics,
Assumptions 4-(d)-(e) do depend on those parameters. However, we will show in Section 7 that Assumptions
4-(d)-(e) are satisfied for all sufficiently large $t$, regardless of the dynamics parameters.
3 Main Results
In this section, we provide the main results of the paper. First, we present sufficient conditions on the
sequences $\{\beta(t)\}$ and $\{\alpha(t)\}$ for the almost sure convergence to an optimal point of the agents acting under
the dynamics (1). Then, for step-sizes of the form $\alpha(t) = \frac{\alpha_0}{t^\nu}$ and $\beta(t) = \frac{\beta_0}{t^\mu}$, we provide the region of $(\mu, \nu)$
for which almost sure convergence is guaranteed.
Theorem 1 If Assumptions 1-4 are satisfied, then, for the dynamics (1), for all $i \in [n]$ we have
$\lim_{t\to\infty} x_i(t) = \tilde{x}$ almost surely, where $\tilde{x}$ is an optimal point in the set of optimal solutions $\mathcal{X}^\star$.

The proof of Theorem 1 is provided in Section 6. The implication of the above result for the practical
step-sizes $\alpha(t) = \frac{\alpha_0}{t^\nu}$ and $\beta(t) = \frac{\beta_0}{t^\mu}$ is as follows.
Proposition 1 Let Assumptions 1-3 hold. Then, for every $i \in [n]$, the dynamics (1) with step-sizes $\alpha(t) = \frac{\alpha_0}{t^\nu}$
and $\beta(t) = \frac{\beta_0}{t^\mu}$ converge almost surely, i.e., we have $\lim_{t\to\infty} x_i(t) = \tilde{x}$ almost surely for some optimal point
$\tilde{x} \in \mathcal{X}^\star$, provided that $\beta_0 \leq 1$, $\frac{1}{2} < \mu \leq 1$, and $\frac{1}{2}(1+\mu) < \nu \leq 1$.

The proof of Proposition 1 is provided in Section 7.
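The admissible region of Proposition 1 is easy to check programmatically. The helper below is our own illustrative sketch, not part of the paper; it simply tests whether a given triple $(\beta_0, \mu, \nu)$ satisfies the sufficient conditions $\beta_0 \leq 1$, $\frac{1}{2} < \mu \leq 1$, and $\frac{1}{2}(1+\mu) < \nu \leq 1$.

```python
def converges_almost_surely(beta0: float, mu: float, nu: float) -> bool:
    """Check the sufficient conditions of Proposition 1 for step-sizes of
    the form alpha(t) = alpha0 / t**nu and beta(t) = beta0 / t**mu."""
    return beta0 <= 1 and 0.5 < mu <= 1 and 0.5 * (1 + mu) < nu <= 1

# The pair (mu, nu) = (0.75, 1) discussed in Remark 4 lies in the region:
print(converges_almost_surely(1.0, 0.75, 1.0))   # True
# ...whereas mu = 1/2 violates the strict bound mu > 1/2:
print(converges_almost_surely(1.0, 0.5, 1.0))    # False
```

Note that the second condition on $\nu$ couples the two exponents: a slower-decaying consensus step-size (smaller $\mu$) permits a wider range of gradient step-size exponents $\nu$.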
Remark 3 Proposition 1 identifies sufficient conditions on $(\mu, \nu)$ such that the dynamics (1) converge
almost surely to the optimal set of a convex objective function over time-varying networks when utilizing
step-sizes of the form $\alpha(t) = \frac{\alpha_0}{t^\nu}$ and $\beta(t) = \frac{\beta_0}{t^\mu}$. The same dynamics are studied for constrained optimization
problems with i.i.d. weight matrices with symmetric expected weight and independent noisy communication
links in [19], where, interestingly, the same $(\mu, \nu)$-region for convergence is obtained.
Remark 4 In an earlier work [18], the $L_2$-convergence of similar dynamics is studied for strongly convex
objective functions. Figure 1 compares the region of $(\mu, \nu)$ that guarantees $L_2$-convergence [18], which is
$\mathcal{R}_1 \cup \mathcal{R}_2 \cup \mathcal{R}_3$, with the region for almost sure convergence (this work), i.e., $\mathcal{R}_1$. Interestingly, the optimal
parameters $\mu^\star = 0.75$ and $\nu^\star = 1$ that lead to the fastest convergence in the $L_2$ sense for strongly convex loss
functions also guarantee almost sure convergence.

Note that both regions only characterize sufficient conditions for two types of convergence criteria under
different function properties, and hence are not comparable. For example, if for $L_2$ convergence we relax the
class of strongly convex functions to general convex functions, we can show that it is necessary to have
$\mu > \frac{1}{2}$. In other words, we cannot have $L_2$ convergence in region $\mathcal{R}_3$ for the general class of (not necessarily
strongly) convex functions. To see this, consider the convex functions $f_i(x) = 0$ for $i \in [n]$, with the set of
optimizers $\mathcal{X}^\star = \mathbb{R}^d$, and zero-mean i.i.d. noise sequences with variance $\gamma$, which satisfy Assumption 1. Then,
multiplying both sides of (2) by $r^T$, we get $\bar{x}(k+1) = \bar{x}(k) + \beta(k)\, r^T E(k)$. Therefore, we can write
\[
\begin{aligned}
\mathbb{E}\|\bar{x}(k+1)\|^2 &= \mathbb{E}\!\left[\mathbb{E}\!\left[\|\bar{x}(k+1)\|^2 \mid \mathcal{F}_k\right]\right] \\
&= \mathbb{E}\|\bar{x}(k)\|^2 + 2\beta(k)\, \mathbb{E}\!\left\langle r^T \mathbb{E}[E(k) \mid \mathcal{F}_k],\, \bar{x}(k) \right\rangle + \beta^2(k)\, \mathbb{E}\!\left[\left\| r^T E(k) \right\|^2\right] \\
&= \mathbb{E}\|\bar{x}(k)\|^2 + \gamma \beta^2(k), \tag{4}
\end{aligned}
\]
where the last equality is due to Assumption 1 (see (34) for more details). Summing up (4) over $k$, we arrive at
\[
\lim_{t\to\infty} \mathbb{E}\|\bar{x}(t)\|^2 = \|\bar{x}(1)\|^2 + \gamma \sum_{k=1}^{\infty} \beta^2(k).
\]
If $x_i(t)$ converges to some $\tilde{x}$ in $L_2$ for all $i \in [n]$, then we have $\lim_{t\to\infty} \mathbb{E}\|\bar{x}(t)\|^2 = \mathbb{E}\|\tilde{x}\|^2 < \infty$. This
implies that $\sum_{k=1}^{\infty} \beta^2(k) < \infty$, which means that we need $\mu > \frac{1}{2}$. In other words, the condition $\mu > \frac{1}{2}$
is necessary for the class of convex functions if $L_2$-convergence is desired. Further investigation is required to
determine whether region $\mathcal{R}_2$ leads to convergence.
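The necessity argument above is straightforward to confirm numerically. The sketch below (ours, for illustration) simulates the scalar analogue of the Remark 4 walk, $\bar{x}(k+1) = \bar{x}(k) + \beta(k)\,\xi(k)$ with $\xi(k) \sim \mathcal{N}(0,1)$ standing in for the aggregate noise $r^T E(k)$, and estimates $\mathbb{E}\|\bar{x}(T)\|^2$ by Monte Carlo. For $\mu = 0.75 > \frac{1}{2}$ the second moment stays near the finite sum $\sum_k \beta^2(k)$, while for $\mu = 0.4 \leq \frac{1}{2}$ it keeps growing with $T$, as predicted by (4).

```python
import numpy as np

def second_moment(mu: float, T: int, trials: int = 500, seed: int = 1):
    """Monte Carlo estimate of E|x(T)|^2 for x(k+1) = x(k) + k**(-mu) * xi(k),
    xi(k) ~ N(0, 1), x(1) = 0 -- the scalar analogue of the Remark 4 walk.
    Returns (estimate, exact value sum_{k=1}^{T-1} k**(-2*mu))."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, T, dtype=float)
    steps = k**(-mu) * rng.normal(size=(trials, T - 1))  # beta(k) * xi(k)
    x_T = steps.sum(axis=1)                              # x(T) per trial
    return (x_T**2).mean(), (k**(-2 * mu)).sum()

est_conv, th_conv = second_moment(mu=0.75, T=10_000)   # mu > 1/2: bounded
est_div, th_div = second_moment(mu=0.4, T=10_000)      # mu <= 1/2: growing
print(est_conv, th_conv)
print(est_div, th_div)
```

With $\mu = 0.75$ the exact value is close to $\sum_{k\ge 1} k^{-3/2} \approx 2.6$ for any horizon, whereas with $\mu = 0.4$ the partial sum grows like $T^{1-2\mu}$, so the second moment cannot converge, matching the necessity of $\mu > \frac{1}{2}$.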