
2) In every τ iterations (τ is called the worker-edge aggregation period), each edge node receives, averages, and sends back the models and momentum values with its connected workers.¹ 3) In every τ·π iterations (π is called the edge-cloud aggregation period), the cloud receives, averages, and sends back the models and momentum values with the edge nodes. The edge nodes then distribute them to the connected workers. Steps 1)–3) above are repeated for multiple rounds until the loss is sufficiently small.
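For intuition, the following sketch (hypothetical Python pseudocode, not the authors' implementation) illustrates how the worker-edge aggregation every τ iterations nests inside the edge-cloud aggregation every τ·π iterations; the helpers local_step and average, the flat-vector model representation, and the hyperparameter values are assumptions introduced purely for illustration.

```python
import numpy as np

# Hypothetical sketch of HierMo's two-level aggregation schedule (not the authors' code).
# Models and momenta are flat numpy vectors; grad_fn(w) returns a stochastic gradient.

def local_step(w, m, grad_fn, gamma=0.9, eta=0.01):
    """One local iteration of momentum SGD at a worker (step 1)."""
    m = gamma * m - eta * grad_fn(w)
    return w + m, m

def average(pairs):
    """Element-wise average of (model, momentum) pairs."""
    ws, ms = zip(*pairs)
    return np.mean(ws, axis=0), np.mean(ms, axis=0)

def hiermo_round(workers, edges, tau, pi, grad_fn):
    """One edge-cloud round; workers[e] is the list of (model, momentum) pairs at edge e."""
    for _ in range(pi):
        for e in range(len(edges)):
            # Step 1: each worker runs tau local momentum-SGD iterations.
            for k, (w, m) in enumerate(workers[e]):
                for _ in range(tau):
                    w, m = local_step(w, m, grad_fn)
                workers[e][k] = (w, m)
            # Step 2: every tau iterations, the edge averages and sends the result back.
            edges[e] = average(workers[e])
            workers[e] = [edges[e]] * len(workers[e])
    # Step 3: every tau*pi iterations, the cloud averages over edges and redistributes.
    cloud = average(edges)
    for e in range(len(edges)):
        edges[e] = cloud
        workers[e] = [cloud] * len(workers[e])
    return cloud
```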
Theoretically, we prove that HierMo is convergent and achieves an O(1/T) convergence rate for smooth non-convex problems, for a given number of iterations T. In this step, we need to address substantial new challenges compared with two-tier FL. In particular, we develop a new method to characterize the multi-time cross-two-tier and cross-three-tier momentum interactions, which do not exist in two-tier FL.
After we theoretically prove the convergence, we observe that the worker-edge and edge-cloud aggregation periods τ and π are key design variables we aim to optimize. Based on the result of the convergence analysis, we propose the HierOPT algorithm, which finds a locally optimal (τ, π) pair.
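Purely as a rough illustration of the kind of constrained search involved (the actual HierOPT procedure is specified in Section V), the sketch below performs a local search over integer (τ, π). The callables bound(tau, pi), standing in for a convergence-bound surrogate, cost(tau, pi), standing in for the training time of a configuration, and the budget time_budget are hypothetical names introduced here and do not come from the paper.

```python
# Hypothetical illustration only (not the HierOPT algorithm itself): a simple local
# search over integer (tau, pi). `bound(tau, pi)` stands in for a convergence-bound
# surrogate and `cost(tau, pi)` for the training time under that configuration.

def local_search_tau_pi(bound, cost, time_budget, tau0=1, pi0=1, max_steps=100):
    tau, pi = tau0, pi0
    best = bound(tau, pi)
    for _ in range(max_steps):
        improved = False
        # Examine the four axis-aligned neighbors of the current (tau, pi).
        for dt, dp in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            t, p = tau + dt, pi + dp
            if t < 1 or p < 1 or cost(t, p) > time_budget:
                continue  # infeasible: outside the domain or over the time budget
            if bound(t, p) < best:
                tau, pi, best = t, p, bound(t, p)
                improved = True
        if not improved:
            break  # no feasible neighbor improves the bound: a local optimum
    return tau, pi
```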
In the experiments, we demonstrate the performance of HierMo compared with various mainstream hierarchical FL and momentum-based FL algorithms, including hierarchical FL without momentum (HierFAVG [18] and CFL [19]), two-tier FL with momentum (FedMom [20], SlowMo [21], FedNAG [22], Mime [23], DOMO [24], and FedADC [25]), and two-tier FL without momentum (FedAvg [4]). The experiments are implemented on different kinds of models (linear regression, logistic regression, CNN [26], VGG16 [27], and ResNet18 [28]) based on various real-world datasets (MNIST [29], CIFAR-10 [30], and ImageNet [28], [31] for image classification, and UCI-HAR [32] for human activity recognition). The experimental results illustrate that HierMo drastically outperforms the benchmarks under a wide range of settings. We also verify that HierOPT outputs a near-optimal (τ, π) in real-world settings. All these results match our theoretical analysis.
The contributions of this paper are summarized as follows.
• We have proved that HierMo is convergent and achieves an O(1/T) convergence rate for smooth non-convex problems, for a given number of iterations T, under non-i.i.d. data.
• We have proved that as long as the learning step size η is sufficiently small, HierMo (with momentum acceleration) achieves a tighter convergence upper bound than HierFAVG (without momentum acceleration).
• We have proposed the new HierOPT algorithm, which can find a locally optimal pair (τ∗, π∗) when the total training time is constrained.
• HierMo is efficient and decreases the total training time by 21–70% compared with the mainstream two-tier momentum-based algorithms and three-tier algorithms.
• HierOPT generates a near-optimal pair (τ∗, π∗) when the total training time is constrained. HierOPT achieves near-optimal accuracy with only a 0.23–0.29% (CNN on MNIST) and 0.04–0.16% (CNN on CIFAR-10) gap from the real-world optimum.

¹Each edge node also calculates another momentum for its own usage to further accelerate convergence. See Section III for the detailed algorithm.
The rest of the paper is organized as follows. In Section II, we introduce related work. The HierMo algorithm design is described in Section III. In Section IV, we provide theoretical results, including the convergence analysis of HierMo and the performance gain of momentum. The algorithm to optimize the aggregation periods, i.e., HierOPT, is proposed in Section V. Section VI provides our experimental results, and Section VII concludes the paper.
II. RELATED WORK
A. Momentum in Machine Learning and Federated Learning
Momentum [33] is a method that helps accelerate gradient descent in the relevant direction by adding a fraction γ of the difference between the past and current model vectors. In the classical centralized setting, the update rule of momentum (Polyak's momentum) is as follows:

m(t) = γ m(t−1) − η ∇F(w(t−1)),   (1)
w(t) = w(t−1) + m(t),   (2)

with γ ∈ [0, 1), t = 1, 2, ..., and m(0) = 0, where γ is the momentum factor (weight of momentum), t is the update iteration, m(t) is the momentum term at iteration t, and w(t) is the model parameter at iteration t. Through this method, the momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, momentum achieves faster convergence and reduces oscillation [17], [34].
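As a concrete illustration, the following minimal sketch applies the update rule in Eqs. (1)–(2) to a toy quadratic objective F(w) = 0.5‖w‖²; the objective, step size, and momentum factor are illustrative choices, not values from the paper.

```python
import numpy as np

# Minimal sketch of Polyak's momentum, Eqs. (1)-(2), on a toy quadratic
# F(w) = 0.5 * ||w||^2 (objective and hyperparameters are illustrative only).
def grad_F(w):
    return w  # gradient of 0.5 * ||w||^2

gamma, eta = 0.9, 0.1      # momentum factor and learning step size
w = np.array([5.0, -3.0])  # initial model parameters w(0)
m = np.zeros_like(w)       # m(0) = 0
for t in range(1, 101):
    m = gamma * m - eta * grad_F(w)  # Eq. (1): update the momentum term
    w = w + m                        # Eq. (2): apply the momentum step
print(w)  # approaches the minimizer at the origin
```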
Momentum has been investigated in both centralized machine learning and FL. In the centralized environment, another form of momentum called Nesterov Accelerated Gradient (NAG) [17], [35] has been proposed. NAG² calculates the gradient based on an approximation of the next position of the parameters, i.e., ∇F(w(t−1) + γ m(t−1)), instead of ∇F(w(t−1)) as in Polyak's momentum, leading to better convergence performance. In [11], the authors study the utilization of momentum in over-parameterized models. [9] provides a unified convergence analysis for both Polyak's momentum and NAG. [12] studies NAG in stochastic settings.
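The only change from the sketch above is where the gradient is evaluated; a minimal sketch of this NAG representation on the same toy quadratic (again with illustrative hyperparameters, not values from the paper) is:

```python
import numpy as np

# Minimal sketch of NAG in the representation described above: the gradient is
# evaluated at the look-ahead point w(t-1) + gamma * m(t-1) rather than at w(t-1).
# (Toy quadratic and hyperparameters are illustrative only.)
def grad_F(w):
    return w  # gradient of 0.5 * ||w||^2

gamma, eta = 0.9, 0.1
w = np.array([5.0, -3.0])
m = np.zeros_like(w)
for t in range(1, 101):
    lookahead = w + gamma * m                # approximate next position of the parameters
    m = gamma * m - eta * grad_F(lookahead)  # gradient at the look-ahead point
    w = w + m
print(w)  # converges toward the origin, typically with less oscillation
```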
All the above works show the advantage of momentum in accelerating centralized training, which has attracted researchers' attention to applying momentum in the FL environment. Depending on where the momentum is adopted, we can categorize these methods into worker momentum, aggregator momentum, and combination momentum. For the worker momentum (e.g., FedNAG [22] and Mime [23]), momentum acceleration is adopted at the workers in each local iteration. However, it is vulnerable to data heterogeneity among workers, which may harm the long-run performance. For the aggregator momentum (e.g., FedMom [20] and SlowMo [21]), momentum acceleration is adopted only at the aggregator based on the global model; it shares the same acceleration property as in the centralized setting and dampens oscillations [17]. Nevertheless, it is
²There are two mainstream equivalent representations of NAG. In this paper, we employ the representation in [9], [36].