Almost Sure Convergence of Distributed Optimization with
Imperfect Information Sharing
Hadi Reisizadeh, Anand Gokhale, Behrouz Touri, and Soheil Mohajer
Abstract
To design algorithms that reduce communication cost or meet rate constraints and are robust to
communication noise, we study convex distributed optimization problems where a set of agents are
interested in solving a separable optimization problem collaboratively with imperfect information sharing
over time-varying networks. We study the almost sure convergence of a two-time-scale decentralized
gradient descent algorithm that reaches consensus on an optimizer of the objective loss function. One
time-scale fades out the imperfect incoming information from neighboring agents, and the other controls
the step-size applied to the local loss functions' gradients. We show that, under certain conditions on the
connectivity of the underlying time-varying network and on the time-scale sequences, the dynamics
converge almost surely to an optimal point in the optimizer set of the loss function.
1 Introduction
In recent years, with the rapid growth of areas such as big data and machine learning, there has been a
surge of interest in studying multi-agent networks. In many machine learning applications, it is impractical
to implement the learning task in a centralized fashion due to the decentralized nature of datasets. Multi-
agent networked systems provide scalability to larger datasets and systems, data locality, ownership, and
privacy, especially for modern computing services. These systems arise in various applications such as sensor
networks [1], multi-agent control [2], large-scale machine learning [3], and power networks [4]. The task of
learning a common objective function over a multi-agent network can be reduced to a distributed optimization
problem.
In distributed optimization problems, a set of agents is interested in finding a minimizer of a separable
function $f(x) := \sum_{i=1}^{n} f_i(x)$, where each agent $i$ has access to the local and private loss function
$f_i(\cdot)$ of this decomposable cost function. The goal, therefore, is to identify system dynamics that
guarantee the asymptotic convergence of all agent states to a common state that minimizes the objective cost
function $f(\cdot)$. Various methods have been proposed to solve distributed optimization problems in strongly
convex [5, 6], convex [7, 8], and non-convex settings [9-12]. Under the perfect information sharing assumption,
a subgradient-push algorithm is proposed for (strongly) convex loss functions in [13]. The almost sure and
$L_2$-convergence of this method are shown under certain conditions on the connectivity, the loss functions,
and the step-size sequence.
Most of the works mentioned above on distributed optimization assume ideal communication channels and
perfect information sharing among agents. However, most communication channels are noisy, and exchanging
real-valued vectors introduces a massive communication overhead. To mitigate this challenge, various
compression approaches have been introduced [14, 15], in which vectors are represented with a finite number
of bits, or only a certain number of the most significant coordinates of each vector are selected. The
quantized/sparsified vectors are then communicated over the network, so each agent receives an imperfect
estimate of the intended messages from its neighboring agents. Various gradient descent algorithms with
imperfect information sharing have been proposed [16-18], with convergence rates shown in the $L_2$ sense
for a diverse set of problem setups.
H. Reisizadeh (email: hadir@umn.edu) and S. Mohajer (email: soheil@umn.edu) are with the University of Minnesota,
A. Gokhale (email: anand_gokhale@ucsb.edu) is with the University of California Santa Barbara, and B. Touri (email:
btouri@ucsd.edu) is with the University of California San Diego.
arXiv:2210.05897v2 [math.OC] 30 Apr 2023
A related work [19] considers distributed constrained optimization problems with noisy communication
links. That work differs from our current study in that it assumes i.i.d. weight matrices with a symmetric
expected weight matrix. In a recent work [20], a two-time-scale gradient descent algorithm is presented
for (strongly) convex cost functions over a convex compact set. Assuming a fixed topology for the underlying
network, uniform weighting of the local cost functions, and a specific scheme for the lossy sharing of
information, it is shown that under certain conditions the agents' states converge almost surely to the
optimal solution of the problem. For averaging-based distributed optimization over random networks with
possible dependence on the past, and under certain conditions, almost sure convergence to an optimal point
is established in [21].
In this paper, we study distributed convex optimization problems over time-varying networks with
imperfect information sharing. We consider the two-time-scale gradient descent method studied in [18, 22]
to solve the optimization problem. One time-scale adjusts the (imperfect) incoming information from the
neighboring agents, and the other controls the step-size applied to the local cost functions' gradients. It is
shown there that, with a proper choice of parameters, the algorithm reaches the global optimal point for
strongly convex loss functions at a rate of $O(T^{-1/2})$, and achieves a convergence rate of $O(T^{-1/3+\epsilon})$,
for any $\epsilon > 0$, for non-convex cost functions in the $L_2$ sense. Here, we identify sufficient conditions
on the step-size sequences for the almost sure convergence of the agents' states to an optimal solution for the
class of convex cost functions.
The paper is organized as follows. We conclude this section by discussing the notations used in the paper.
We formulate the main problem and state the relevant underlying assumptions in Section 2. In Section 3, we
present our main results. To corroborate our theoretical analysis, we present simulation results in Section 4.
We discuss some preliminary results required to prove the main results in Section 5; the proofs of the
corresponding lemmas are presented in the Appendix. We then present the proofs of the main results in
Section 6 and Section 7. Finally, we conclude the paper and discuss possible directions for future work
in Section 8.
Notation. We denote the set of integers $\{1, \ldots, n\}$ by $[n]$. In this paper, we consider $n$ agents that
are minimizing a function on $\mathbb{R}^d$. We assume that all local objective functions act on row vectors
in $\mathbb{R}^{1\times d} = \mathbb{R}^d$, and thus we view vectors in $\mathbb{R}^d$ as row vectors. The remaining vectors, i.e., the
vectors in $\mathbb{R}^{n\times 1} = \mathbb{R}^n$, are assumed to be column vectors. We denote the $L_2$ norm of a vector $x \in \mathbb{R}^d$ by
$\|x\|$. For a matrix $A \in \mathbb{R}^{n\times d}$ and a strictly positive stochastic vector $r \in \mathbb{R}^n$, we define the $r$-norm
of $A$ by $\|A\|_r^2 = \sum_{i=1}^{n} r_i \|A_i\|_2^2$, where $A_i$ denotes the $i$-th row of $A$. We denote the Frobenius norm of a
matrix $A \in \mathbb{R}^{n\times d}$ by $\|A\|^2 = \sum_{i=1}^{n}\sum_{j=1}^{d} |A_{ij}|^2$. The difference operator on a real-valued sequence $\{a(t)\}$
is defined as $\Delta a(t) := a(t+1) - a(t)$ for every $t \geq 1$.
2 Problem Formulation
In this section, we first formulate the problem of interest. Then, we present the underlying assumptions on
the information exchange, the network connectivity, the properties of the cost functions, and the time-scales
sequences.
2.1 Problem Statement
Consider a set of $n \geq 2$ agents that are connected through a time-varying network. Each agent $i \in [n]$ has
access to a local cost function $f_i: \mathbb{R}^d \to \mathbb{R}$. The goal of this work is to solve the optimization problem
\[
\min_{x_1, x_2, \ldots, x_n \in \mathbb{R}^d} \sum_{i=1}^{n} r_i f_i(x_i) \quad \text{subject to} \quad x_1 = \cdots = x_n,
\]
where $r = (r_1, r_2, \ldots, r_n)^T$ is a stochastic vector, i.e., $r_i \geq 0$ and $\sum_{i=1}^{n} r_i = 1$.
At each iteration $t \geq 1$, we represent the topology of the network connecting the agents by a directed
graph $\mathcal{G}(t) = ([n], \mathcal{E}(t))$, where the vertex set $[n]$ represents the agents, and the edge set $\mathcal{E}(t) \subseteq [n] \times [n]$
represents the connectivity pattern of the agents at time $t$, i.e., the edge $(i,j) \in \mathcal{E}(t)$ denotes a directed edge
from agent $i$ to agent $j$. We assume that at each time $t \geq 1$, each agent $i \in [n]$ can only send a message to its
out-neighbours, i.e., the set of all agents $j$ such that $(i,j) \in \mathcal{E}(t)$. We also assume that $\mathcal{G}(t)$ satisfies certain
connectivity conditions, which are discussed in Assumption 2.
In this work, the communication between the agents is assumed to be imperfect. We adopt the general
framework of noisy information sharing introduced in [18], as described below. Given the states $x_i(t)$
of agents $i \in [n]$ at time $t$, we assume that each agent has access to an imperfect weighted average of its
in-neighbours' states, denoted by $\hat{x}_i(t)$ and given by
\[
\hat{x}_i(t) = \sum_{j=1}^{n} W_{ij}(t)\, x_j(t) + e_i(t),
\]
where $W(t) = [W_{ij}(t)]$ is a row-stochastic matrix and $e_i(t)$ is a random noise vector in $\mathbb{R}^d$. Note that the
matrix $W(t)$ is consistent with the underlying graph $\mathcal{G}(t)$. More precisely, we have $W_{ij}(t) > 0$ if and only if
$(j,i) \in \mathcal{E}(t)$, where $\mathcal{E}(t)$ is the edge set of the graph $\mathcal{G}(t)$.
Regarding the local cost functions, we assume that agent $i \in [n]$ has access to a subgradient $g_i(x_i(t))$ of
the local cost function $f_i(\cdot)$ at its local decision variable $x_i(t)$ at time $t$. Inspired by [18], the
update rule in this work is
\[
x_i(t+1) = (1 - \beta(t))\, x_i(t) + \beta(t)\, \hat{x}_i(t) - \alpha(t)\, g_i(x_i(t)), \tag{1}
\]
where $\{\beta(t)\}$ and $\{\alpha(t)\}$ are the step-size sequences of the algorithm. This work identifies sufficient
conditions on the step-size sequences for the almost sure convergence of the dynamics (1) to an optimal point
$x^\star \in \mathcal{X}^\star = \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} r_i f_i(x)$. For simplicity of notation, let
\[
X(t) := \begin{bmatrix} x_1(t) \\ \vdots \\ x_n(t) \end{bmatrix}, \quad
E(t) := \begin{bmatrix} e_1(t) \\ \vdots \\ e_n(t) \end{bmatrix}, \quad
G(t) := \begin{bmatrix} g_1(x_1(t)) \\ \vdots \\ g_n(x_n(t)) \end{bmatrix}.
\]
Using these matrices, we can write the update rule (1) as a linear time-varying system:
\[
X(t+1) = A(t) X(t) + U(t), \tag{2}
\]
where $A(t) := (1 - \beta(t)) I + \beta(t) W(t)$ and $U(t) := \beta(t) E(t) - \alpha(t) G(t)$.
We also define $\Phi(t{:}s) := A(t-1) \cdots A(s+1)$ for $t > s$, with $\Phi(t{:}t-1) = I$.
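To make the two-time-scale recursion concrete, the following is a minimal simulation sketch of the stacked dynamics (2) (equivalently, the per-agent update (1)). It is not the authors' experimental setup: the 4-cycle network, the fixed doubly stochastic weight matrix (so that $r$ is uniform), the quadratic local costs $f_i(x) = \frac{1}{2}\|x - c_i\|^2$, and all numerical constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 4, 2                      # agents, dimension
c = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
# Local costs f_i(x) = 0.5*||x - c_i||^2, so g_i(x) = x - c_i and the
# (uniformly weighted) optimizer is the mean of the c_i's, here 0.

# Fixed doubly stochastic weights on a 4-cycle (r is then uniform).
W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])

alpha0, nu = 1.0, 1.0            # alpha(t) = alpha0 / t**nu
beta0, mu = 1.0, 0.75            # beta(t)  = beta0 / t**mu
noise_std = 0.05                 # per-coordinate communication-noise std

X = rng.normal(size=(n, d))      # stacked states X(1)
for t in range(1, 50001):
    alpha, beta = alpha0 / t**nu, beta0 / t**mu
    E = noise_std * rng.normal(size=(n, d))    # noise matrix E(t)
    G = X - c                                  # stacked subgradients G(t)
    A = (1.0 - beta) * np.eye(n) + beta * W    # A(t)
    X = A @ X + beta * E - alpha * G           # X(t+1) = A(t)X(t) + U(t)

consensus_gap = np.max(np.linalg.norm(X - X.mean(axis=0), axis=1))
opt_gap = np.max(np.linalg.norm(X - c.mean(axis=0), axis=1))
print(consensus_gap, opt_gap)    # both should be small
```

With these (illustrative) parameters, $(\mu, \nu) = (0.75, 1)$ lies in the almost-sure-convergence region of Proposition 1, and all four agents end up close to the common minimizer $\frac{1}{n}\sum_i c_i = 0$.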
2.2 Assumptions
To proceed with our main results, we need certain assumptions regarding the noise vectors $\{e_i(t)\}$,
the weight matrices $\{W(t)\}$, the local cost functions, and the step-size sequences.

Assumption 1 (Noise Sequence Assumptions) We assume that the noise sequence $\{e_i(t)\}$ satisfies
\[
\mathbb{E}[e_i(t) \mid \mathcal{F}_t] = 0, \quad \text{and} \quad \mathbb{E}\!\left[\|e_i(t)\|^2 \mid \mathcal{F}_t\right] \leq \gamma,
\]
for some $\gamma > 0$, all $i \in [n]$, and all $t \geq 1$. Here, $\{\mathcal{F}_t\}$ is the natural filtration of the random process $\{X(t)\}$.
Assumption 2 (Connectivity Assumptions) We assume that the sequence $\{\mathcal{G}(t)\}$ of underlying graphs
and its associated weight matrix sequence $\{W(t)\}$ satisfy the following properties.

(a) $W(t)$ is non-negative, $W(t)\mathbf{1} = \mathbf{1}$, and $r^T W(t) = r^T$ for all $t \geq 1$, where $\mathbf{1}$ is the all-one vector and
$r > 0$ is the given stochastic weight vector.

(b) Each nonzero element of $W(t)$ is bounded away from zero, i.e., there exists some $\eta > 0$ such that if
$W_{ij}(t) > 0$ for $i, j \in [n]$ and $t \geq 1$, then $W_{ij}(t) \geq \eta$.

(c) There exists an integer $B \geq 1$ such that the graph $\big([n], \bigcup_{k=t+1}^{t+B} \mathcal{E}(k)\big)$ is strongly connected for all $t \geq 1$,
where $\mathcal{E}(k)$ is the edge set of $\mathcal{G}(k)$.
Assumption 3 (Objective Function Assumptions) We assume that the objective functions $f_i$ satisfy the
following properties.

(a) $f_i$ is convex for all $i \in [n]$.

(b) The optimizer set $\mathcal{X}^\star := \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} r_i f_i(x)$ is non-empty.

(c) Each $f_i$ has bounded subgradients, i.e., there exists $L > 0$ such that $\|g_i\| < L$ for all subgradients $g_i$ of
$f_i(x)$ at every $x \in \mathbb{R}^d$. This also implies that each $f_i(\cdot)$ is $L$-Lipschitz continuous, i.e.,
\[
|f_i(x) - f_i(y)| < L \|x - y\|,
\]
for all $x, y \in \mathbb{R}^d$.
Assumption 4 (Step-size Sequences Assumptions) For the non-increasing step-size sequences $\{\alpha(t)\}$
and $\{\beta(t)\}$, where $\{\beta(t)\}$ takes values in $[0,1]$, we assume that

(a) $\sum_{t=1}^{\infty} \alpha(t) = \infty$,

(b) $\sum_{t=1}^{\infty} \alpha^2(t) < \infty$, $\sum_{t=1}^{\infty} \beta^2(t) < \infty$, and

(c) $\sum_{t=1}^{\infty} \frac{\alpha^2(t)}{\beta(t)} < \infty$.

Also, there exists some $t_0 \geq 1$ such that for every $t \geq t_0$,

(d) $-\Delta\beta(t) \leq c_1 \beta^2(t)$,

(e) $-\Delta\alpha(t) \leq c_2\, \alpha(t)\beta(t)$,

for some positive constants $c_1 < \frac{\lambda}{2}$ and $c_2 < \frac{\lambda}{4}$, where $\lambda := \frac{\eta\, r_{\min}}{2Bn^2} < 1$ and $r_{\min} := \min_{i \in [n]} \{r_i\} \leq 1$.
Remark 1 Note that Assumptions 4-(a), (b), and (c) imply $\sum_{t=1}^{\infty} \beta(t) = \infty$: if $\sum_{t=1}^{\infty} \beta(t) < \infty$, then using the
Cauchy-Schwarz inequality we get
\[
\sum_{t=1}^{\infty} \alpha(t) \leq \left( \sum_{t=1}^{\infty} \frac{\alpha^2(t)}{\beta(t)} \right)^{\!\frac{1}{2}} \left( \sum_{t=1}^{\infty} \beta(t) \right)^{\!\frac{1}{2}} < \infty,
\]
which contradicts Assumption 4-(a). Similarly, we can write
\[
\sum_{t=1}^{\infty} \alpha(t)\, \beta^{\frac{1}{2}}(t) \leq \left( \sum_{t=1}^{\infty} \frac{\alpha^2(t)}{\beta(t)} \right)^{\!\frac{1}{2}} \left( \sum_{t=1}^{\infty} \beta^2(t) \right)^{\!\frac{1}{2}} < \infty, \tag{3}
\]
where the second inequality follows from Assumptions 4-(b) and (c).
Remark 2 Note that unlike Assumptions 4-(a)-(c), which do not depend on the parameters of the dynamics,
Assumptions 4-(d)-(e) do depend on those parameters. However, we will show in Section 7 that Assumptions
4-(d)-(e) are satisfied for all sufficiently large $t$, regardless of the dynamics parameters.
3 Main Results
In this section, we provide the main results of the paper. First, we present sufficient conditions on the
sequences $\{\beta(t)\}$ and $\{\alpha(t)\}$ for the almost sure convergence to an optimal point of the agents acting under
the dynamics (1). Then, for step-sizes of the form $\alpha(t) = \frac{\alpha_0}{t^\nu}$ and $\beta(t) = \frac{\beta_0}{t^\mu}$, we provide the region of $(\mu, \nu)$
for which almost sure convergence is guaranteed.
Theorem 1 If Assumptions 1-4 are satisfied, then, for the dynamics (1), for all $i \in [n]$ we have
$\lim_{t\to\infty} x_i(t) = \tilde{x}$ almost surely, where $\tilde{x}$ is an optimal point in the set of optimal solutions $\mathcal{X}^\star$.

The proof of Theorem 1 is provided in Section 6. The implication of the above result for the practical
step-sizes $\alpha(t) = \frac{\alpha_0}{t^\nu}$ and $\beta(t) = \frac{\beta_0}{t^\mu}$ is as follows.
Proposition 1 Let Assumptions 1-3 hold. Then, for every $i \in [n]$, the dynamics (1) with step-sizes $\alpha(t) = \frac{\alpha_0}{t^\nu}$
and $\beta(t) = \frac{\beta_0}{t^\mu}$ converge almost surely, i.e., we have $\lim_{t\to\infty} x_i(t) = \tilde{x}$ almost surely for some optimal point
$\tilde{x} \in \mathcal{X}^\star$, provided that $\beta_0 \leq 1$, $\frac{1}{2} < \mu \leq 1$, and $\frac{1}{2}(1+\mu) < \nu \leq 1$.

The proof of Proposition 1 is provided in Section 7.
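The admissible region of Proposition 1 is easy to check programmatically. The helper below is our own illustrative sketch, not part of the paper; it simply tests whether a given triple $(\beta_0, \mu, \nu)$ satisfies the sufficient conditions $\beta_0 \leq 1$, $\frac{1}{2} < \mu \leq 1$, and $\frac{1}{2}(1+\mu) < \nu \leq 1$.

```python
def converges_almost_surely(beta0: float, mu: float, nu: float) -> bool:
    """Check the sufficient conditions of Proposition 1 for step-sizes of
    the form alpha(t) = alpha0 / t**nu and beta(t) = beta0 / t**mu."""
    return beta0 <= 1 and 0.5 < mu <= 1 and 0.5 * (1 + mu) < nu <= 1

# The pair (mu, nu) = (0.75, 1) discussed in Remark 4 lies in the region:
print(converges_almost_surely(1.0, 0.75, 1.0))   # True
# ...whereas mu = 1/2 violates the strict bound mu > 1/2:
print(converges_almost_surely(1.0, 0.5, 1.0))    # False
```

Note that the second condition on $\nu$ couples the two exponents: a slower-decaying consensus step-size (smaller $\mu$) permits a wider range of gradient step-size exponents $\nu$.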
Remark 3 Proposition 1 identifies sufficient conditions on $(\mu, \nu)$ such that the dynamics (1) converge
almost surely to the optimal set of a convex objective function over time-varying networks when utilizing
step-sizes of the form $\alpha(t) = \frac{\alpha_0}{t^\nu}$ and $\beta(t) = \frac{\beta_0}{t^\mu}$. The same dynamics are studied for constrained optimization
problems with i.i.d. weight matrices with symmetric expected weight and independent noisy communication
links in [19], where, interestingly, the same $(\mu, \nu)$-region for convergence is obtained.
Remark 4 In an earlier work [18], the $L_2$-convergence of similar dynamics is studied for strongly convex
objective functions. Figure 1 compares the region of $(\mu, \nu)$ that guarantees $L_2$-convergence [18], which is
$\mathcal{R}_1 \cup \mathcal{R}_2 \cup \mathcal{R}_3$, with the region for almost sure convergence (this work), i.e., $\mathcal{R}_1$. Interestingly, the optimal
parameters $\mu^\star = 0.75$ and $\nu^\star = 1$ that lead to the fastest convergence in the $L_2$ sense for strongly convex loss
functions also guarantee almost sure convergence.

Note that both regions only characterize sufficient conditions for two types of convergence criteria under
different function properties, and hence are not comparable. For example, if for $L_2$ convergence we relax the
class of strongly convex functions to general convex functions, we can show that it is necessary to have
$\mu > \frac{1}{2}$. In other words, we cannot have $L_2$ convergence in region $\mathcal{R}_3$ for the general class of (not necessarily
strongly) convex functions. To see this, consider the convex functions $f_i(x) = 0$ for $i \in [n]$, with the set of
optimizers $\mathcal{X}^\star = \mathbb{R}^d$, and zero-mean i.i.d. noise sequences with variance $\gamma$, which satisfy Assumption 1. Then,
multiplying both sides of (2) by $r^T$, we get $\bar{x}(k+1) = \bar{x}(k) + \beta(k)\, r^T E(k)$. Therefore, we can write
\[
\begin{aligned}
\mathbb{E}\|\bar{x}(k+1)\|^2 &= \mathbb{E}\!\left[\mathbb{E}\!\left[\|\bar{x}(k+1)\|^2 \mid \mathcal{F}_k\right]\right] \\
&= \mathbb{E}\|\bar{x}(k)\|^2 + 2\beta(k)\, \mathbb{E}\!\left\langle r^T \mathbb{E}[E(k) \mid \mathcal{F}_k],\, \bar{x}(k) \right\rangle + \beta^2(k)\, \mathbb{E}\!\left[\left\| r^T E(k) \right\|^2\right] \\
&= \mathbb{E}\|\bar{x}(k)\|^2 + \gamma \beta^2(k), \tag{4}
\end{aligned}
\]
where the last equality is due to Assumption 1 (see (34) for more details). Summing up (4) over $k$, we arrive at
\[
\lim_{t\to\infty} \mathbb{E}\|\bar{x}(t)\|^2 = \|\bar{x}(1)\|^2 + \gamma \sum_{k=1}^{\infty} \beta^2(k).
\]
If $x_i(t)$ converges to some $\tilde{x}$ in $L_2$ for all $i \in [n]$, then we have $\lim_{t\to\infty} \mathbb{E}\|\bar{x}(t)\|^2 = \mathbb{E}\|\tilde{x}\|^2 < \infty$. This
implies that $\sum_{k=1}^{\infty} \beta^2(k) < \infty$, which means that we need $\mu > \frac{1}{2}$. In other words, the condition $\mu > \frac{1}{2}$
is necessary for the class of convex functions if $L_2$-convergence is desired. Further investigation is required to
determine whether region $\mathcal{R}_2$ leads to convergence.
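The necessity argument above is straightforward to confirm numerically. The sketch below (ours, for illustration) simulates the scalar analogue of the Remark 4 walk, $\bar{x}(k+1) = \bar{x}(k) + \beta(k)\,\xi(k)$ with $\xi(k) \sim \mathcal{N}(0,1)$ standing in for the aggregate noise $r^T E(k)$, and estimates $\mathbb{E}\|\bar{x}(T)\|^2$ by Monte Carlo. For $\mu = 0.75 > \frac{1}{2}$ the second moment stays near the finite sum $\sum_k \beta^2(k)$, while for $\mu = 0.4 \leq \frac{1}{2}$ it keeps growing with $T$, as predicted by (4).

```python
import numpy as np

def second_moment(mu: float, T: int, trials: int = 500, seed: int = 1):
    """Monte Carlo estimate of E|x(T)|^2 for x(k+1) = x(k) + k**(-mu) * xi(k),
    xi(k) ~ N(0, 1), x(1) = 0 -- the scalar analogue of the Remark 4 walk.
    Returns (estimate, exact value sum_{k=1}^{T-1} k**(-2*mu))."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, T, dtype=float)
    steps = k**(-mu) * rng.normal(size=(trials, T - 1))  # beta(k) * xi(k)
    x_T = steps.sum(axis=1)                              # x(T) per trial
    return (x_T**2).mean(), (k**(-2 * mu)).sum()

est_conv, th_conv = second_moment(mu=0.75, T=10_000)   # mu > 1/2: bounded
est_div, th_div = second_moment(mu=0.4, T=10_000)      # mu <= 1/2: growing
print(est_conv, th_conv)
print(est_div, th_div)
```

With $\mu = 0.75$ the exact value is close to $\sum_{k\ge 1} k^{-3/2} \approx 2.6$ for any horizon, whereas with $\mu = 0.4$ the partial sum grows like $T^{1-2\mu}$, so the second moment cannot converge, matching the necessity of $\mu > \frac{1}{2}$.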