
2) In every τ iterations (τ is called the worker-edge aggregation period), each edge node receives, averages, and sends back the models and momentum values with its connected workers.¹ 3) In every τ·π iterations (π is called the edge-cloud aggregation period), the cloud receives, averages, and sends back the models and momentum values with the edge nodes. The edge nodes then distribute them to the connected workers. Steps 1)–3) above are repeated for multiple rounds until the loss is sufficiently small.
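For intuition, the following sketch (hypothetical Python pseudocode, not the authors' implementation) illustrates how the worker-edge aggregation every τ iterations nests inside the edge-cloud aggregation every τ·π iterations; the helpers local_step and average, the flat-vector model representation, and the hyperparameter values are assumptions introduced purely for illustration.

```python
import numpy as np

# Hypothetical sketch of HierMo's two-level aggregation schedule (not the authors' code).
# Models and momenta are flat numpy vectors; grad_fn(w) returns a stochastic gradient.

def local_step(w, m, grad_fn, gamma=0.9, eta=0.01):
    """One local iteration of momentum SGD at a worker (step 1)."""
    m = gamma * m - eta * grad_fn(w)
    return w + m, m

def average(pairs):
    """Element-wise average of (model, momentum) pairs."""
    ws, ms = zip(*pairs)
    return np.mean(ws, axis=0), np.mean(ms, axis=0)

def hiermo_round(workers, edges, tau, pi, grad_fn):
    """One edge-cloud round; workers[e] is the list of (model, momentum) pairs at edge e."""
    for _ in range(pi):
        for e in range(len(edges)):
            # Step 1: each worker runs tau local momentum-SGD iterations.
            for k, (w, m) in enumerate(workers[e]):
                for _ in range(tau):
                    w, m = local_step(w, m, grad_fn)
                workers[e][k] = (w, m)
            # Step 2: every tau iterations, the edge averages and sends the result back.
            edges[e] = average(workers[e])
            workers[e] = [edges[e]] * len(workers[e])
    # Step 3: every tau*pi iterations, the cloud averages over edges and redistributes.
    cloud = average(edges)
    for e in range(len(edges)):
        edges[e] = cloud
        workers[e] = [cloud] * len(workers[e])
    return cloud
```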
Theoretically, we prove that HierMo is convergent and achieves an O(1/T) convergence rate for smooth non-convex problems, for a given number of iterations T. In this step, we need to address substantial new challenges compared with two-tier FL. In particular, we develop a new method to characterize the multi-time cross-two-tier and cross-three-tier momentum interactions, which do not exist in two-tier FL.
After we theoretically prove the convergence, we observe that the worker-edge and edge-cloud aggregation periods τ and π are key design variables we aim to optimize. Based on the result of the convergence analysis, we propose the HierOPT algorithm, which finds a locally optimal (τ, π) pair.
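Purely as a rough illustration of the kind of constrained search involved (the actual HierOPT procedure is specified in Section V), the sketch below performs a local search over integer (τ, π). The callables bound(tau, pi), standing in for a convergence-bound surrogate, cost(tau, pi), standing in for the training time of a configuration, and the budget time_budget are hypothetical names introduced here and do not come from the paper.

```python
# Hypothetical illustration only (not the HierOPT algorithm itself): a simple local
# search over integer (tau, pi). `bound(tau, pi)` stands in for a convergence-bound
# surrogate and `cost(tau, pi)` for the training time under that configuration.

def local_search_tau_pi(bound, cost, time_budget, tau0=1, pi0=1, max_steps=100):
    tau, pi = tau0, pi0
    best = bound(tau, pi)
    for _ in range(max_steps):
        improved = False
        # Examine the four axis-aligned neighbors of the current (tau, pi).
        for dt, dp in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            t, p = tau + dt, pi + dp
            if t < 1 or p < 1 or cost(t, p) > time_budget:
                continue  # infeasible: outside the domain or over the time budget
            if bound(t, p) < best:
                tau, pi, best = t, p, bound(t, p)
                improved = True
        if not improved:
            break  # no feasible neighbor improves the bound: a local optimum
    return tau, pi
```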
In the experiments, we demonstrate the performance of HierMo compared with various mainstream hierarchical FL and momentum-based FL algorithms, including hierarchical FL without momentum (HierFAVG [18] and CFL [19]), two-tier FL with momentum (FedMom [20], SlowMo [21], FedNAG [22], Mime [23], DOMO [24], and FedADC [25]), and two-tier FL without momentum (FedAvg [4]). The experiments are implemented on different kinds of models (linear regression, logistic regression, CNN [26], VGG16 [27], and ResNet18 [28]) based on various real-world datasets (MNIST [29], CIFAR-10 [30], and ImageNet [28], [31] for image classification, and UCI-HAR [32] for human activity recognition). The experimental results illustrate that HierMo drastically outperforms the benchmarks under a wide range of settings. We also verify that HierOPT outputs a near-optimal (τ, π) in real-world settings. All these results match our theoretical analysis.
The contributions of this paper are summarized as follows.
• We have proved that HierMo is convergent and achieves an O(1/T) convergence rate for smooth non-convex problems, for a given number of iterations T, under non-i.i.d. data.
• We have proved that as long as the learning step size η is sufficiently small, HierMo (with momentum acceleration) achieves a tighter convergence upper bound than HierFAVG (without momentum acceleration).
• We have proposed the new HierOPT algorithm, which can find a locally optimal pair (τ∗, π∗) when the total training time is constrained.
• HierMo is efficient and decreases the total training time by 21–70% compared with the mainstream two-tier momentum-based algorithms and three-tier algorithms.
• HierOPT generates a near-optimal pair (τ∗, π∗) when the total training time is constrained. HierOPT achieves near-optimal accuracy with only a 0.23–0.29% (CNN on MNIST) and 0.04–0.16% (CNN on CIFAR-10) gap from the real-world optimum.

¹Each edge node also calculates another momentum for its own usage to further accelerate convergence. See Section III for the detailed algorithm.
The rest of the paper is organized as follows. In Section II, we introduce related work. The HierMo algorithm design is described in Section III. In Section IV, we provide theoretical results, including the convergence analysis of HierMo and the performance gain of momentum. The algorithm to optimize the aggregation periods, i.e., HierOPT, is proposed in Section V. Section VI provides our experimental results, and Section VII concludes the paper.
II. RELATED WORK
A. Momentum in Machine Learning and Federated Learning
Momentum [33] is a method that helps accelerate gradient descent in the relevant direction by adding a fraction γ of the difference between the past and current model vectors. In the classical centralized setting, the update rule of momentum (Polyak's momentum) is as follows:

m(t) = γ m(t−1) − η ∇F(w(t−1)),   (1)
w(t) = w(t−1) + m(t),   (2)

with γ ∈ [0, 1), t = 1, 2, ..., and m(0) = 0, where γ is the momentum factor (weight of momentum), t is the update iteration, m(t) is the momentum term at iteration t, and w(t) is the model parameter at iteration t. Through this method, the momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, momentum achieves faster convergence and reduces oscillation [17], [34].
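As a concrete illustration, the following minimal sketch applies the update rule in Eqs. (1)–(2) to a toy quadratic objective F(w) = 0.5‖w‖²; the objective, step size, and momentum factor are illustrative choices, not values from the paper.

```python
import numpy as np

# Minimal sketch of Polyak's momentum, Eqs. (1)-(2), on a toy quadratic
# F(w) = 0.5 * ||w||^2 (objective and hyperparameters are illustrative only).
def grad_F(w):
    return w  # gradient of 0.5 * ||w||^2

gamma, eta = 0.9, 0.1      # momentum factor and learning step size
w = np.array([5.0, -3.0])  # initial model parameters w(0)
m = np.zeros_like(w)       # m(0) = 0
for t in range(1, 101):
    m = gamma * m - eta * grad_F(w)  # Eq. (1): update the momentum term
    w = w + m                        # Eq. (2): apply the momentum step
print(w)  # approaches the minimizer at the origin
```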
Momentum has been investigated in both centralized machine learning and FL. In the centralized environment, another form of momentum called Nesterov Accelerated Gradient (NAG) [17], [35] has been proposed. NAG² calculates the gradient based on an approximation of the next position of the parameters, i.e., ∇F(w(t−1) + γ m(t−1)), instead of ∇F(w(t−1)) as in Polyak's momentum, leading to better convergence performance. In [11], the authors study the utilization of momentum in over-parameterized models. [9] provides a unified convergence analysis for both Polyak's momentum and NAG. [12] studies NAG in stochastic settings.
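The only change from the sketch above is where the gradient is evaluated; a minimal sketch of this NAG representation on the same toy quadratic (again with illustrative hyperparameters, not values from the paper) is:

```python
import numpy as np

# Minimal sketch of NAG in the representation described above: the gradient is
# evaluated at the look-ahead point w(t-1) + gamma * m(t-1) rather than at w(t-1).
# (Toy quadratic and hyperparameters are illustrative only.)
def grad_F(w):
    return w  # gradient of 0.5 * ||w||^2

gamma, eta = 0.9, 0.1
w = np.array([5.0, -3.0])
m = np.zeros_like(w)
for t in range(1, 101):
    lookahead = w + gamma * m                # approximate next position of the parameters
    m = gamma * m - eta * grad_F(lookahead)  # gradient at the look-ahead point
    w = w + m
print(w)  # converges toward the origin, typically with less oscillation
```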
All the above works show the advantage of momentum in accelerating centralized training, which has attracted researchers' attention to applying momentum in the FL environment. Depending on where the momentum is adopted, we can categorize these methods into worker momentum, aggregator momentum, and combination momentum. For the worker momentum (e.g., FedNAG [22] and Mime [23]), momentum acceleration is adopted at the workers in each local iteration. However, it is vulnerable to data heterogeneity among workers, which may harm the long-run performance. For the aggregator momentum (e.g., FedMom [20] and SlowMo [21]), momentum acceleration is adopted only at the aggregator based on the global model; it shares the same acceleration property as in the centralized setting and dampens oscillations [17]. Nevertheless, it is
²There are two mainstream equivalent representations of NAG. In this paper, we employ the representation in [9], [36].