
Definition | Notation
Global Iteration | $t$
Active Client at $t$ | $i_t$
Client $(i, j)$ Communication Coefficient at $t$ | $w_{i,j}^t$
Local Model of Client $i$ at $t$ | $x_i^t \in \mathbb{R}^d$
Local Model of Active Client at $t$ | $x_{i_t}^t \in \mathbb{R}^d$
Mini-Batch Data from Active Client at $t$ | $\xi_{i_t}^t \in \mathbb{R}^f$
Gradient of the Active Model at $t$ | $g(x_{i_t}^t, \xi_{i_t}^t) \in \mathbb{R}^d$
Client $i$ Communication Vector at $t$ | $w_i^t \in \mathbb{R}^n$
Local Model Matrix at $t$ | $X^t \in \mathbb{R}^{d \times n}$
Zero-Padded Gradient of the Active Model at $t$ | $G(x_{i_t}^t, \xi_{i_t}^t) \in \mathbb{R}^{d \times n}$
Expected Local Model Gradients at $t$ | $\bar{G}(X^t, \xi^t) \in \mathbb{R}^{d \times n}$
Active Client Communication Matrix at $t$ | $W_{i_t}^t \in \mathbb{R}^{n \times n}$
Expected Client Communication Matrix at $t$ | $\bar{W}^t \in \mathbb{R}^{n \times n}$
Number of Clients | $n$
Parameter Dimension | $d$
Data Dimension | $f$
Step-Size | $\gamma$
Total SWIFT Iterations | $T$
Mini-Batch Size (Uniform) | $M$
Communication Set | $C_s$
Global Data Distribution | $\mathcal{D}$
Global Objective | $f(x)$
Client Influence Score | $p_i$
Client Influence Vector | $p \in \mathbb{R}^n$
One-Hot Vector | $e_i \in \mathbb{R}^n$
Identity Matrix | $I_n \in \mathbb{R}^{n \times n}$

Table 2: Notation Table
heavyweight protocols. A synchronous algorithm is developed to converge under the expectation of an unreliable communication matrix (probabilistic link reliability). SWIFT also converges under the expectation of a communication matrix, yet in a different, asynchronous setting. SWIFT is already lightweight and reliable, and our use of expectation does not concern link reliability.
Communication Efficiency. Minimizing each client $i$'s communication time per round $C_i$ is a challenge in FL, as the radius of information exchange can be large (Kairouz et al., 2021). MATCHA (Wang et al., 2019) decomposes the base network into $m$ disjoint matchings. Every epoch, a random sub-graph is generated from a combination of matchings, each having an activation probability $p_k$. Clients then exchange parameters along this sub-graph. This requires a total communication-time complexity of $O(T \sum_{k=1}^{m} p_k \max_{j \in \mathcal{N}_i} C_j)$, where $\mathcal{N}_i$ are client $i$'s neighbors. LD-SGD (Li et al., 2019) and PA-SGD (Wang & Joshi, 2018) explore how reducing the number of neighborhood parameter exchanges affects convergence. Both algorithms create a communication set $C_s$ (defined in Appendix C) that dictates when clients communicate with one another. The communication-time complexities are listed in Table 1. These methods, however, are synchronous, and their communication-time complexities depend upon the slowest neighbor $\max_{j \in \mathcal{N}_i} C_j$. SWIFT improves upon this, achieving a communication-time complexity that depends only on a client's own communication time per round. Unlike AD-PSGD (Lian et al., 2018), which achieves a similar communication-time complexity, SWIFT allows for periodic communication, uses only local memory, and does not require a bounded-delay assumption.
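To make the contrast between these bounds concrete, the following minimal sketch (ours, not taken from any of the cited works) compares the per-client cost under a slowest-neighbor bound against a wait-free bound on a toy graph; the topology and the per-round communication times $C_i$ are hypothetical values chosen for illustration.

```python
# Contrast the two communication-time terms discussed above: synchronous
# schemes pay the slowest neighbor's cost max_{j in N_i} C_j each round,
# while a wait-free scheme like SWIFT pays only the client's own cost C_i.

neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # hypothetical graph
C = {0: 1.0, 1: 5.0, 2: 1.5, 3: 0.8}  # hypothetical per-round comm. times
T = 100                               # total iterations

for i in neighbors:
    sync_cost = T * max(C[j] for j in neighbors[i])  # slowest-neighbor bound
    waitfree_cost = T * C[i]                         # own-cost bound
    print(f"client {i}: synchronous ~{sync_cost:.0f}, wait-free ~{waitfree_cost:.0f}")
```

A single slow client (client 1 above) inflates every neighbor's synchronous cost, while under the wait-free bound only client 1 itself pays for its slowness.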
3 PROBLEM FORMULATION
Decentralized FL. In the FL setting, we have $n$ clients represented as vertices of an arbitrary communication graph $\mathcal{G}$ with vertex set $\mathcal{V} = \{1, \ldots, n\}$ and edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. Each client $i$ communicates with one-hop neighboring clients $j$ such that $(i, j) \in \mathcal{E}$. We denote the neighborhood for client $i$ as $\mathcal{N}_i$, and clients work in tandem to find the global model parameters $x$ by solving:

$$\min_{x \in \mathbb{R}^d} f(x) := \sum_{i=1}^{n} p_i f_i(x), \qquad f_i(x) := \mathbb{E}_{\xi_i \sim \mathcal{D}_i}\, \ell(x, \xi_i), \qquad \sum_{i=1}^{n} p_i = 1, \; p_i \geq 0. \tag{1}$$
The global objective function $f(x)$ is the weighted average of all local objective functions $f_i(x)$. In Equation 1, $p_i, \forall i \in [n]$ denotes the client influence score. This term controls the influence of client $i$ on the global consensus model, forming the client influence vector $p = \{p_i\}_{i=1}^{n}$. These scores also reflect the sampling probability of each client. We note that each local objective function $f_i(x)$ is the expectation of the loss function $\ell$ with respect to potentially different local data $\xi_i = \{\xi_{i,j}\}_{j=1}^{M}$ from each client $i$'s distribution $\mathcal{D}_i$, i.e., $\xi_{i,j} \sim \mathcal{D}_i$. The total number of iterations is denoted as $T$.
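As a concrete reading of Equation 1, the sketch below evaluates the global objective as the $p$-weighted average of empirical local losses; the quadratic loss and the toy per-client data are hypothetical stand-ins for $\ell$ and samples from the local distributions $\mathcal{D}_i$.

```python
import numpy as np

def local_objective(x, client_data, loss):
    # f_i(x): empirical expectation of the loss over client i's M samples,
    # standing in for E_{xi_i ~ D_i} l(x, xi_i)
    return np.mean([loss(x, xi) for xi in client_data])

def global_objective(x, data_per_client, p, loss):
    # f(x) = sum_i p_i f_i(x), with sum_i p_i = 1 and p_i >= 0 (Equation 1)
    assert np.isclose(sum(p), 1.0) and all(pi >= 0 for pi in p)
    return sum(p_i * local_objective(x, D_i, loss)
               for p_i, D_i in zip(p, data_per_client))

# Toy usage: quadratic loss l(x, xi) = ||x - xi||^2, n = 2 clients with
# heterogeneous data (different means), d = 3 parameters, M = 8 samples each.
rng = np.random.default_rng(0)
data = [rng.normal(0.0, 1.0, size=(8, 3)), rng.normal(2.0, 1.0, size=(8, 3))]
sq_loss = lambda x, xi: float(np.sum((x - xi) ** 2))
print(global_objective(np.zeros(3), data, p=[0.5, 0.5], loss=sq_loss))
```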
Existing Inter-Client Communication in Decentralized FL. The core idea of decentralized FL is that each client communicates with its neighbors (connected clients) and shares local information. Each client balances its individual training against these inter-client communications in order to reach consensus while operating in a decentralized manner. This balance ensures that individual client models are well-tailored to personal data while remaining (i) robust to other clients' data, and (ii) able to converge to an optimal consensus model.
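To illustrate the information sharing described above, here is a minimal sketch of one neighbor-averaging step of the kind decentralized FL interleaves with local training; the uniform communication coefficients are an assumption for illustration, not SWIFT's actual weights $w_{i,j}^t$.

```python
import numpy as np

def mix_with_neighbors(i, X, neighbors):
    # Client i replaces x_i with a convex combination sum_j w_{i,j} x_j over
    # its neighborhood N_i (including itself). X holds one model per column,
    # matching the local model matrix X^t in Table 2.
    hood = [i] + neighbors[i]                # client i and its one-hop neighbors
    w = np.full(len(hood), 1.0 / len(hood))  # uniform communication coefficients
    return sum(w_j * X[:, j] for w_j, j in zip(w, hood))

# Toy usage: d = 2 parameters, n = 3 clients on a path graph 0 - 1 - 2.
X = np.array([[0.0, 1.0, 2.0],
              [0.0, 1.0, 2.0]])
print(mix_with_neighbors(1, X, {0: [1], 1: [0, 2], 2: [1]}))  # -> [1. 1.]
```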