Hierarchical Federated Learning with Momentum
Acceleration in Multi-Tier Networks
Zhengjie Yang, Sen Fu, Wei Bao, Dong Yuan, and Albert Y. Zomaya
Abstract—In this paper, we propose Hierarchical Federated
Learning with Momentum Acceleration (HierMo), a three-tier
worker-edge-cloud federated learning algorithm that applies mo-
mentum for training acceleration. Momentum is calculated and
aggregated in the three tiers. We provide convergence analysis for
HierMo, showing a convergence rate of O(1/T). In the analysis,
we develop a new approach to characterize model aggregation,
momentum aggregation, and their interactions. Based on this
result, we prove that HierMo achieves a tighter convergence
upper bound compared with HierFAVG without momentum.
We also propose HierOPT, which optimizes the aggregation
periods (worker-edge and edge-cloud aggregation periods) to
minimize the loss given a limited training time. By conducting
the experiment, we verify that HierMo outperforms existing
mainstream benchmarks under a wide range of settings. In
addition, HierOPT can achieve a near-optimal performance when
we test HierMo under different aggregation periods.
Index Terms—Federated learning; momentum; convergence
analysis; edge computing
I. INTRODUCTION
With the advancement of Industry 4.0, the Internet of Things
(IoT), and Artificial Intelligence, machine learning applications
such as image classification [1], automatic driving [2],
and automatic speech recognition [3] are developing rapidly.
Since machine learning datasets are distributed across individual
users, who in many situations are unwilling to share their
sensitive raw data, Federated Learning (FL) has emerged [4]. It
allows workers to participate in the model training without
sharing their raw data. Typically, FL is implemented in two
tiers, where multiple devices (workers) are distributed and
connected to a remote aggregator (usually located in the
cloud). A potential issue of the two-tier FL setting is its
scalability. The communication overhead between workers and
the cloud is proportional to the number of workers, which
causes problems when there are a large number of geo-
distributed workers connecting to the remote cloud via the
public Internet.
With the development of edge computing [5], a more
effective solution is adding the edge tier between local workers
and the remote cloud to address the scalability issue. Dif-
ferent from the typical two-tier architecture, in the three-tier
hierarchical architecture as shown in Fig. 1, workers can first
communicate with the edge node for edge-level aggregation,
and then the edge nodes communicate with the remote cloud
for cloud-level aggregation. Each edge node is closer to the
workers and is usually connected with them in the same
local/edge network, so that the communication cost is much
cheaper compared with the two-tier case when the workers
directly communicate with the cloud.

[Fig. 1. Two-tier architecture vs. three-tier architecture. Six connections pass
through the public Internet in the two-tier architecture, but only two connections
pass through the public Internet in the three-tier architecture; communication
burdens are restrained in the local/edge networks.]

In Fig. 1, we can see that much of the traffic through the
public Internet (left subfigure) is restrained in the local/edge
networks (right subfigure) due to the existence of the edge
nodes. Therefore, the three-tier architecture is a good fit for
larger-scale FL and has attracted attention from researchers
in recent years [6]–[8].
Although the three-tier FL can improve the communication
efficiency in one training iteration by replacing worker-cloud
communication with worker-edge communication, there is also
a need to accelerate its convergence performance to reduce
the number of iterations. One obstacle in the three-tier FL
is that each edge node can only aggregate the updates of
its local workers, and there is a discrepancy among edge
nodes. The edge nodes are synchronized only at the cloud-level
aggregation. This two-level aggregation causes delayed
synchronization, reducing training efficiency. This strongly
motivates us to develop a more efficient algorithm that
accelerates convergence, reducing the number of training
iterations in the three-tier hierarchical architecture and
ultimately improving the overall training efficiency (considering
both per-iteration cost and the number of iterations).
Momentum has proved to be an effective mechanism to
accelerate model training. Many studies have demonstrated
its advantage in both centralized machine learning environments
[9]–[12] and two-tier FL environments [13]–[16]. Apart
from the conventional gradient descent step, the momentum
method conducts additional momentum steps [17] to accelerate
convergence. In this paper, we propose Hierarchical Feder-
ated Learning with Momentum Acceleration (HierMo), which
leverages momentum to accelerate three-tier FL. HierMo is
operated as follows: (1) In each iteration, each worker locally
updates its own model and worker momentum; (2) In every
τ iterations (τ is called the worker-edge aggregation period),
each edge node receives, averages, and sends back the models
and momentum values with its connected workers.¹ (3) In
every τ·π iterations (π is called the edge-cloud aggregation
period), the cloud receives, averages, and sends back the
models and momentum values with the edge nodes. The edge
nodes then distribute them to their connected workers. The
above steps (1)–(3) are repeated for multiple rounds until the
loss is sufficiently small.
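To make the schedule concrete, the following minimal sketch (our illustration with hypothetical values τ = 5 and π = 2, not code from the paper) prints which tiers act at each iteration t:

```python
# Sketch of HierMo's three-tier schedule. tau and pi are the worker-edge and
# edge-cloud aggregation periods; T is a multiple of tau * pi.
tau, pi, T = 5, 2, 40  # hypothetical values

for t in range(1, T + 1):
    # Step (1): every worker performs a local model + momentum update each iteration.
    # Step (2): every tau iterations, each edge node aggregates its workers.
    # Step (3): every tau * pi iterations, the cloud aggregates across edge nodes.
    if t % (tau * pi) == 0:
        print(f"t={t:2d}: worker update + edge aggregation + cloud aggregation")
    elif t % tau == 0:
        print(f"t={t:2d}: worker update + edge aggregation")
```

With these values, edge aggregation fires at t = 5, 10, 15, . . ., and cloud aggregation at t = 10, 20, 30, 40.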
Theoretically, we prove that HierMo is convergent and has
an O(1/T) convergence rate for smooth non-convex problems
for a given T iterations. In this step, we need to address
substantial new challenges, compared with two-tier FL. In
particular, we develop a new method to characterize the multi-
time cross-two-tier momentum interaction and cross-three-tier
momentum interaction, which do not exist in the two-tier FL.
After we theoretically prove the convergence, we observe that
the worker-edge and edge-cloud aggregation periods τ and
π are key design variables we aim to optimize. Based on
the result of the convergence analysis, we propose the HierOPT
algorithm, which finds a locally optimal (τ, π) pair.
In the experiment, we demonstrate the performance of
HierMo compared with various mainstream hierarchical FL
and momentum-based FL algorithms, including hierarchical
FL without momentum (HierFAVG [18] and CFL [19]), two-
tier FL with momentum (FedMom [20], SlowMo [21], Fed-
NAG [22], Mime [23], DOMO [24], and FedADC [25]), and
two-tier FL without momentum (FedAvg [4]). The experiment
is implemented on different kinds of models (linear regression,
logistic regression, CNN [26], VGG16 [27], and ResNet18 [28])
based on various real-world datasets (MNIST [29], CIFAR-
10 [30], ImageNet [28], [31] for image classification, and UCI-
HAR [32] for human activity recognition). The experimental
results illustrate that HierMo drastically outperforms bench-
marks under a wide range of settings. We also verify that
HierOPT outputs a near-optimal (τ, π) in real-world settings.
All these results match the expectations from our theoretical analysis.
The contributions of this paper are summarized as follows.
• We have proved that HierMo is convergent and has an O(1/T) convergence rate for smooth non-convex problems for a given T iterations under non-i.i.d. data.
• We have proved that as long as the learning step size η is sufficiently small, HierMo (with momentum acceleration) achieves a tighter convergence upper bound than HierFAVG (without momentum acceleration).
• We have proposed the new HierOPT algorithm, which finds a locally optimal pair (τ, π) when the total training time is constrained.
• HierMo is efficient and decreases the total training time by 21–70% compared with the mainstream two-tier momentum-based algorithms and three-tier algorithms.
• HierOPT generates a near-optimal pair (τ, π) when the total training time is constrained, achieving near-optimal accuracy with only a 0.23–0.29% (CNN on MNIST) and 0.04–0.16% (CNN on CIFAR-10) gap from the real-world optimum.

¹Each edge node also calculates another momentum for its own use to further accelerate convergence. See Section III for the detailed algorithm.
The rest of the paper is organized as follows. In Section II,
we introduce related works. The HierMo algorithm design is
described in Section III. In Section IV, we provide theoretical
results including the convergence analysis of HierMo and the
performance gain of momentum. The algorithm to optimize
the aggregation periods, i.e., HierOPT, is proposed in Sec-
tion V. Section VI provides our experimental results, and
Section VII concludes the paper.
II. RELATED WORK
A. Momentum in Machine Learning and Federated Learning
Momentum [33] is a method that helps accelerate gradient
descent in the relevant direction by adding a fraction γ of
the difference between the past and current model vectors. In the
classical centralized setting, the update rule of momentum
(Polyak's momentum) is as follows:

$$m(t) = \gamma\, m(t-1) - \eta \nabla F(w(t-1)), \qquad (1)$$
$$w(t) = w(t-1) + m(t), \qquad (2)$$

with γ ∈ [0, 1), t = 1, 2, . . ., and m(0) = 0, where γ is the
momentum factor (weight of momentum), t is the update iteration,
m(t) is the momentum term at iteration t, and w(t) is the model
parameter at iteration t. Through this method, the momentum
term increases for dimensions whose gradients point in the
same direction and reduces updates for dimensions whose
gradients change direction. As a result, momentum achieves
faster convergence and reduces oscillation [17], [34].
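As a concrete illustration, here is a minimal NumPy sketch of the update (1)–(2); the function name and the toy objective are ours, not from the paper:

```python
import numpy as np

def polyak_momentum(grad, w0, eta=0.1, gamma=0.9, iters=100):
    """Polyak's momentum per (1)-(2), with m(0) = 0."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    for _ in range(iters):
        m = gamma * m - eta * grad(w)  # (1): accumulate along consistent directions
        w = w + m                      # (2): the model step adds the momentum term
    return w

# Toy usage: F(w) = 0.5 * ||w||^2 gives grad F(w) = w; the iterates approach 0.
w_final = polyak_momentum(lambda w: w, w0=[5.0, -3.0])
```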
Momentum has been investigated in both centralized machine
learning and FL. In the centralized environment, another
form of momentum, called Nesterov Accelerated Gradient
(NAG) [17], [35], has been proposed. NAG² calculates the
gradient based on an approximation of the next position of
the parameters, i.e., $\nabla F(w(t-1) + \gamma m(t-1))$, instead
of $\nabla F(w(t-1))$ as in Polyak's momentum, leading to better
convergence performance. In [11], the authors study the utilization
of momentum in over-parameterized models. [9] provides a
unified convergence analysis for both Polyak's momentum and
NAG. [12] studies NAG in stochastic settings.

²There are two mainstream equivalent representations of NAG. In this paper,
we employ the representation in [9], [36].
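For contrast with the Polyak sketch above, here is a minimal sketch of the y/x representation of NAG adopted in this paper (the same form reappears in Lines 5–6 of Algorithm 1; variable names are ours):

```python
import numpy as np

def nag(grad, x0, eta=0.1, gamma=0.9, iters=100):
    """NAG in the representation of [9], [36]:
    y(t) = x(t-1) - eta*grad(x(t-1)); x(t) = y(t) + gamma*(y(t) - y(t-1)).
    Since x is the lookahead iterate, the gradient is effectively evaluated at
    w(t-1) + gamma*m(t-1) rather than at w(t-1)."""
    x = np.asarray(x0, dtype=float)
    y = x.copy()  # y(0) initialized to x(0)
    for _ in range(iters):
        y_new = x - eta * grad(x)        # gradient step from the lookahead point
        x = y_new + gamma * (y_new - y)  # momentum extrapolation
        y = y_new
    return y
```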
All the above works show the advantage of momentum in
accelerating centralized training, which has attracted researchers'
attention to applying momentum in the FL environment. Depending
on where the momentum is adopted, we can categorize these
approaches into worker momentum, aggregator momentum, and
combination momentum. For worker momentum (e.g., FedNAG
[22] and Mime [23]), momentum acceleration is adopted
at workers in each local iteration. However, it is vulnerable
to data heterogeneity among workers, which may harm the
long-run performance. For aggregator momentum (e.g.,
FedMom [20] and SlowMo [21]), momentum acceleration
is adopted only at the aggregator based on the global model;
it shares the same acceleration property as in the centralized
setting and dampens oscillations [17]. Nevertheless, it is
conducted less frequently (every τ iterations³) compared with
worker momentum (every iteration), and the performance gain
may not be obvious, especially when τ is large. To address the
above limitations, the works in [24], [25], [37] combine the worker
and aggregator momenta and show better convergence
performance than using either the worker or the aggregator
momentum alone. The above forms of momentum have only been
adopted and analyzed in two-tier FL; in this paper, we focus
on the three-tier scenario.

³τ is the aggregation period.
B. Three-Tier Hierarchical Federated Learning
Three-tier FL has attracted more attention in recent years.
Without considering momentum, studies have demonstrated
the convergence performance in three-tier FL [18], [19], [38],
[39]. The communication overhead is further optimized
in [40]. Extending the convergence analysis from two-tier to
three-tier FL is not straightforward. Different from two-tier
FL, where the global aggregation is executed every τ local
iterations, in three-tier FL each worker's local model is
first aggregated by the connected edge node every τ local
iterations, and is then aggregated by the cloud at another
level, once every π edge aggregations. Existing two-tier methods
can only bound the two-tier effects, but not the three-tier
effects. Substantial new challenges are encountered in this
paper. When momentum is leveraged in the three-tier scenario,
it additionally introduces multi-time cross-two-tier momentum
interaction and cross-three-tier momentum interaction. This is
completely different from the two-tier scenario. Existing two-
tier analyses cannot deal with the above two new terms. They
can only characterize multi-time inner-tier momentum acceler-
ation and one-time cross-two-tier momentum interaction. We
devise a two-level virtual update (edge and cloud) method,
which is able to bound the aforementioned new terms so that
the convergence of HierMo still holds.
III. HIERMO PROBLEM FORMULATION
A. Overview
We consider a three-tier hierarchical FL system consisting
of a cloud server, L edge nodes, and N workers. Each edge
node ℓ serves C_ℓ workers, and the total number of workers is
$N = \sum_{\ell=1}^{L} C_\ell$. Worker {i, ℓ} denotes the ith worker served
by edge node ℓ, where i = 1, 2, . . . , C_ℓ; it holds a local
dataset with D_{i,ℓ} data samples. The total training dataset
in the cluster of workers served by edge node ℓ is
$D_\ell \triangleq \sum_{i=1}^{C_\ell} D_{i,\ell}$, and the total training dataset is
$D \triangleq \sum_{\ell=1}^{L} D_\ell = \sum_{\ell=1}^{L} \sum_{i=1}^{C_\ell} D_{i,\ell}$. The target of three-tier
hierarchical FL is to find the stationary point x that minimizes
the global loss function F(x), which is the weighted average of
all workers' loss functions. The problem can be formulated as
follows:
follows:
min
xRdF(x),1
D
L
X
`=1
C`
X
i=1
Di,`Fi,`(x)(3)
=
L
X
`=1
D`
D
C`
X
i=1
Di,`
D`
Fi,`(x)(4)
,
L
X
`=1
D`
DF`(x),(5)
where d is the dimension of x, F(x) is the global loss function
at the cloud server, and F_{i,ℓ}(x) is the local loss function at
worker {i, ℓ}. (4) is the mathematical transformation of (3)
obtained by introducing D_ℓ. We also define the edge loss function
at edge node ℓ as $F_\ell(x) \triangleq \sum_{i=1}^{C_\ell} \frac{D_{i,\ell}}{D_\ell} F_{i,\ell}(x)$, which is the
weighted average of the local loss functions F_{i,ℓ}(x) of edge
node ℓ's connected workers. Therefore, by replacing
$\sum_{i=1}^{C_\ell} \frac{D_{i,\ell}}{D_\ell} F_{i,\ell}(x)$ with F_ℓ(x) in (4), we directly derive (5),
demonstrating that the global loss function is the weighted average
of all edge loss functions, i.e., $F(x) \triangleq \sum_{\ell=1}^{L} \frac{D_\ell}{D} F_\ell(x)$. We
assume the problem is within the scope of cross-silo federated
learning [41], where all workers are required to participate in
the training with siloed data. Each worker represents a repository
of data, and data are sensitive and non-i.i.d. The key notations are
summarized in Table I.

TABLE I
KEY NOTATIONS

η           worker model learning rate
τ           worker-edge aggregation period
π           edge-cloud aggregation period
γ           worker momentum factor
γa          edge momentum factor
T           number of total local (worker) iterations, indexed by t
K           number of total edge aggregations, indexed by k
P           number of total global (cloud) aggregations, indexed by p
L           number of edge nodes, indexed by ℓ
C_ℓ         number of workers under edge node ℓ
N           number of workers in the system, indexed by {i, ℓ}
x^t_{i,ℓ}   worker model parameter at worker {i, ℓ} at iteration t
y^t_{i,ℓ}   worker momentum parameter at worker {i, ℓ} at iteration t
y^t_{ℓ−}    aggregated worker momentum at edge node ℓ at iteration t
x^t_{ℓ−}    aggregated worker model at edge node ℓ at iteration t
y^t_{ℓ+}    updated edge momentum at edge node ℓ at iteration t
x^t_{ℓ+}    updated edge model at edge node ℓ at iteration t
y^t         worker momentum cloud aggregation at iteration t
x^t         cloud model at iteration t
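As a quick numerical check of the weighted-average structure in (3)–(5), the following sketch uses made-up sizes and loss values (two edge nodes, two workers each):

```python
# D_il[l][i] is worker {i,l}'s sample count; F_il[l][i] its local loss at a
# common model x. All numbers are illustrative.
D_il = [[100, 300], [200, 400]]
F_il = [[0.9, 0.7], [0.6, 0.8]]

D_l = [sum(row) for row in D_il]   # D_l = sum_i D_{i,l}
D = sum(D_l)                       # total dataset size

# Edge loss F_l = sum_i (D_{i,l}/D_l) * F_{i,l}, as in (5).
F_l = [sum(d * f for d, f in zip(D_il[l], F_il[l])) / D_l[l]
       for l in range(len(D_il))]

# Global loss two equivalent ways: directly via (3), hierarchically via (5).
F_direct = sum(d * f for rd, rf in zip(D_il, F_il)
               for d, f in zip(rd, rf)) / D
F_hier = sum(D_l[l] / D * F_l[l] for l in range(len(D_l)))
assert abs(F_direct - F_hier) < 1e-9  # (3) and (5) agree
```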
B. Worker Momentum and Edge Momentum
We note that there are two types of momentum in two-tier
FL: one type (i.e., worker momentum) is calculated at
each worker and then aggregated; the other type (i.e., aggregator
momentum) is calculated at the aggregator. Since both types
can accelerate convergence, we adopt both in our work.
In the three-tier case considered in this paper, the worker momentum
is individually computed at each worker and aggregated at
the edge node (worker momentum edge aggregation) and the
cloud (worker momentum cloud aggregation); we still call it
worker momentum throughout the paper. The aggregator
momentum is applied at each edge node: each edge node
computes its own momentum, which is not shared with the
workers or the cloud. We call it edge momentum throughout
this paper.
Algorithm 1 HierMo algorithm.
Input: τ, π, T = Kτ = Pτπ, η, γ, γa
Output: Final cloud (global) model parameter x^T_{(a)}
1:  For each worker, initialize x^0_{i,ℓ} to the same value for all i, ℓ, and y^0_{i,ℓ} = x^0_{i,ℓ}
2:  For each edge node, initialize x^0_{ℓ(a)} = x^0_{i,ℓ} and y^0_{ℓ(a)} = x^0_{ℓ(a)}
3:  for t = 1, 2, . . . , T do
4:    For each worker i = 1, 2, . . . , N in parallel,
5:      y^t_{i,ℓ} ← x^{t−1}_{i,ℓ} − η∇F_{i,ℓ}(x^{t−1}_{i,ℓ})    // Worker momentum update
6:      x^t_{i,ℓ} ← y^t_{i,ℓ} + γ(y^t_{i,ℓ} − y^{t−1}_{i,ℓ})    // Worker model update
7:    if t == kτ where k = 1, . . . , K then
8:      For each edge node ℓ = 1, 2, . . . , L in parallel,
9:        y^{kτ}_{ℓ−} ← Σ_{i=1}^{C_ℓ} (D_{i,ℓ}/D_ℓ) y^{kτ}_{i,ℓ}    // Worker momentum edge aggregation
10:       y^{kτ}_{ℓ+} ← x^{(k−1)τ}_{ℓ+} − Σ_{i=1}^{C_ℓ} (D_{i,ℓ}/D_ℓ)(x^{(k−1)τ}_{ℓ+} − x^{kτ}_{i,ℓ})    // Edge momentum update
11:       x^{kτ}_{ℓ+} ← y^{kτ}_{ℓ+} + γa(y^{kτ}_{ℓ+} − y^{(k−1)τ}_{ℓ+})    // Edge model update
12:       Set y^{kτ}_{i,ℓ} ← y^{kτ}_{ℓ−} for all workers i ∈ C_ℓ    // Edge-aggregated worker momentum re-distribution to workers
13:       Set x^{kτ}_{i,ℓ} ← x^{kτ}_{ℓ+} for all workers i ∈ C_ℓ    // Edge model re-distribution to workers
14:    end if
15:    if t == pτπ where p = 1, 2, . . . , P then
16:      Aggregate y^{pτπ} ← Σ_{ℓ=1}^{L} (D_ℓ/D) y^{pτπ}_{ℓ−}    // Worker momentum cloud aggregation
17:      Aggregate x^{pτπ} ← Σ_{ℓ=1}^{L} (D_ℓ/D) x^{pτπ}_{ℓ+}    // Edge model cloud aggregation
18:      Set y^{pτπ}_{ℓ−} ← y^{pτπ} for all edge nodes ℓ ∈ L    // Cloud-aggregated worker momentum re-distribution to edge nodes
19:      Set x^{pτπ}_{ℓ+} ← x^{pτπ} for all edge nodes ℓ ∈ L    // Cloud model re-distribution to edge nodes
20:      Set y^{pτπ}_{i,ℓ} ← y^{pτπ}_{ℓ−} for all workers i ∈ C_ℓ, ℓ ∈ L    // Cloud-aggregated worker momentum re-distribution from edge nodes to workers
21:      Set x^{pτπ}_{i,ℓ} ← x^{pτπ}_{ℓ+} for all workers i ∈ C_ℓ, ℓ ∈ L    // Cloud model re-distribution from edge nodes to workers
22:    end if
23:  end for
C. HierMo Algorithm

In Algorithm 1, we propose a momentum-based three-tier
hierarchical FL algorithm, named HierMo, which applies
both worker momentum and edge momentum. HierMo aims
to find the final cloud model x^T_{(a)} that solves problem (3).
It conducts T local iterations, K edge aggregations, and P
cloud aggregations, where T = Kτ = Pτπ, τ is the worker-edge
aggregation period, and π is the edge-cloud aggregation
period.
1) Worker update: In each local iteration t, each
worker {i, ℓ} computes its worker update, which consists of two
steps: (1) the worker momentum update y^t_{i,ℓ} (Line 5) and (2)
the worker model update x^t_{i,ℓ} (Line 6). Steps (1) and (2) follow
the Nesterov Accelerated Gradient (NAG) [35] momentum update
and are conducted every iteration. In this way, each
worker utilizes its own worker momentum acceleration.
2) Edge update: When t = kτ, k = 1, 2, . . . , K, each edge
node ℓ receives the momenta and models of the workers in C_ℓ and
performs the edge update, which consists of two operations: (1) Worker
momentum edge aggregation y^{kτ}_{ℓ−} (Line 9) with re-distribution
(Line 12). In this way, straggler workers with
high data heterogeneity, whose local momenta y^{kτ}_{i,ℓ} point
in an inappropriate direction, can be corrected by y^{kτ}_{ℓ−}. (2)
Edge momentum y^{kτ}_{ℓ+} and model x^{kτ}_{ℓ+} updates (Lines 10–11)
with model re-distribution (Line 13). Since the edge momentum
and model update is computed from the edge
model, it is equivalent to performing it at the edge level over
all workers' datasets under edge node ℓ ($D_\ell = \sum_{i=1}^{C_\ell} D_{i,\ell}$).
Doing so dampens oscillations [17] within the edge
node. Note that (1) and (2) are two operations on the
same edge node, so we use the subscripts "−" and "+" to
label the momentum/model right after operations (1) and (2),
respectively. Both (1) and (2) are conducted at each
edge node every τ iterations.
3) Cloud update: When t = pτπ, p = 1, 2, . . . , P, the
cloud receives the edge-aggregated worker momentum y^{pτπ}_{ℓ−} and
the edge model x^{pτπ}_{ℓ+} from all ℓ ∈ L and performs the cloud update,
which consists of two operations: (1) Worker momentum cloud
aggregation y^{pτπ} (Line 16) with re-distribution (Lines 18 and 20).
In this way, all edge nodes and workers receive the cloud-aggregated
worker momentum, mitigating the disadvantage
caused by non-i.i.d. data heterogeneity. (2) Edge model cloud
aggregation x^{pτπ} (Line 17) with cloud model re-distribution
(Lines 19 and 21). Note that the cloud re-distributes
the momentum and model to all edge nodes, and all edge nodes
then distribute them to all workers, whenever t is a multiple
of τπ.
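To summarize the control flow, here is a minimal single-process NumPy sketch of Algorithm 1; it is our illustration under simplifying assumptions (full participation, deterministic gradient oracles), and all names are ours, not the authors' code:

```python
import numpy as np

def hiermo(grads, D, tau, pi, P, eta=0.01, gamma=0.9, gamma_a=0.5, dim=10):
    """grads[l][i]: gradient oracle of worker {i,l}; D[l][i]: its sample count.
    Runs T = P * tau * pi iterations and returns the final cloud model."""
    L = len(D)
    D_l = [sum(row) for row in D]
    D_tot = sum(D_l)
    x = [[np.zeros(dim) for _ in row] for row in D]   # worker models x_{i,l}
    y = [[np.zeros(dim) for _ in row] for row in D]   # worker momenta y_{i,l}
    x_edge = [np.zeros(dim) for _ in range(L)]        # edge models x_{l+}
    y_edge = [np.zeros(dim) for _ in range(L)]        # edge momenta y_{l+}
    x_cloud = np.zeros(dim)
    for t in range(1, P * tau * pi + 1):
        for l in range(L):                            # worker update (Lines 5-6)
            for i in range(len(D[l])):
                y_new = x[l][i] - eta * grads[l][i](x[l][i])
                x[l][i] = y_new + gamma * (y_new - y[l][i])
                y[l][i] = y_new
        if t % tau == 0:                              # edge update (Lines 9-13)
            for l in range(L):
                w = [d / D_l[l] for d in D[l]]
                y_minus = sum(wi * yi for wi, yi in zip(w, y[l]))   # Line 9
                y_plus = x_edge[l] - sum(                           # Line 10
                    wi * (x_edge[l] - xi) for wi, xi in zip(w, x[l]))
                x_plus = y_plus + gamma_a * (y_plus - y_edge[l])    # Line 11
                y_edge[l], x_edge[l] = y_plus, x_plus
                for i in range(len(D[l])):                          # Lines 12-13
                    y[l][i], x[l][i] = y_minus.copy(), x_plus.copy()
        if t % (tau * pi) == 0:                       # cloud update (Lines 16-21)
            # After the edge step above, y[l][0] holds y_{l-} for each edge node.
            y_cloud = sum(D_l[l] / D_tot * y[l][0] for l in range(L))    # Line 16
            x_cloud = sum(D_l[l] / D_tot * x_edge[l] for l in range(L))  # Line 17
            for l in range(L):                        # Lines 18-21: re-distribution
                x_edge[l] = x_cloud.copy()
                for i in range(len(D[l])):
                    y[l][i], x[l][i] = y_cloud.copy(), x_cloud.copy()
    return x_cloud

# Hypothetical usage: 2 edge nodes, 2 workers each, on toy quadratics
# F_{i,l}(x) = 0.5 * ||x - c_{i,l}||^2, so grad F_{i,l}(x) = x - c_{i,l}.
rng = np.random.default_rng(0)
centers = [[rng.normal(size=10) for _ in range(2)] for _ in range(2)]
grads = [[(lambda x, c=c: x - c) for c in row] for row in centers]
x_T = hiermo(grads, D=[[100, 300], [200, 400]], tau=5, pi=2, P=10)
```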
IV. CONVERGENCE ANALYSIS OF HIERMO
In this section, we present the theoretical analysis of HierMo.
We first provide preliminaries. Then, we introduce the
concept of a virtual update, which is a significant intermediate
step in conducting the convergence analysis. Afterward, we show the
convergence guarantee of HierMo. Finally, we compare the
convergence upper bounds of HierMo and HierFAVG to analyze
the performance gain of momentum.
A. Preliminaries
We assume F_{i,ℓ}(·) satisfies the following standard conditions,
which are commonly adopted in the literature [13], [22],
[42].
Assumption 1. $F_{i,\ell}(x)$ is ρ-Lipschitz, i.e., $\|F_{i,\ell}(x_1) - F_{i,\ell}(x_2)\| \le \rho \|x_1 - x_2\|$ for any $x_1, x_2, i, \ell$.

Assumption 2. $F_{i,\ell}(x)$ is β-smooth, i.e., $\|\nabla F_{i,\ell}(x_1) - \nabla F_{i,\ell}(x_2)\| \le \beta \|x_1 - x_2\|$ for any $x_1, x_2, i, \ell$.

Assumption 3. (Bounded diversity) The divergence between each local gradient and its edge gradient is bounded, i.e., $\|\nabla F_{i,\ell}(x) - \nabla F_\ell(x)\| \le \delta_{i,\ell}$ for all $i, \ell$, and $x$. We also define $\delta_\ell$ as the weighted average of $\delta_{i,\ell}$ and $\delta$ as the weighted average of $\delta_\ell$, i.e., $\delta_\ell \triangleq \sum_{i \in C_\ell} \frac{D_{i,\ell}}{D_\ell} \delta_{i,\ell}$ and $\delta \triangleq \sum_{\ell \in L} \frac{D_\ell}{D} \delta_\ell$.
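For instance, the weighted aggregates δ_ℓ and δ in Assumption 3 can be computed as follows (illustrative numbers, not from the paper):

```python
# delta_il[l][i] bounds the gradient diversity of worker {i,l}; D_il[l][i] is
# its sample count. Two edge nodes with two workers each, values made up.
delta_il = [[0.4, 0.2], [0.5, 0.1]]
D_il = [[100, 300], [200, 400]]

D_l = [sum(row) for row in D_il]
D = sum(D_l)
delta_l = [sum(d * dv for d, dv in zip(D_il[l], delta_il[l])) / D_l[l]
           for l in range(len(D_il))]   # delta_l = sum_i (D_{i,l}/D_l) delta_{i,l}
delta = sum(D_l[l] / D * delta_l[l]
            for l in range(len(D_l)))   # delta = sum_l (D_l/D) delta_l
```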