
1 Introduction
Federated Learning (FL) is a novel paradigm for training supervised machine learning models.
Initiated a few years ago in several foundational papers (Konečný et al., 2016a,b; McMahan et al.,
2017; Bonawitz et al., 2017), it has become a rapidly growing interdisciplinary field. The key idea
is to exploit the wealth of information stored on edge devices, such as mobile phones, sensors and hospital workstations, to train global models collaboratively, while handling a multitude of challenges such as data privacy concerns (Kairouz et al., 2021; Li et al., 2020; Wang et al., 2021). In
contrast to centralized learning in a datacenter, in FL each parallel computing unit stores its own private data and communicates with a distant orchestrating server, which aggregates
the information and synchronizes the computations, so that the process reaches a consensus and
converges to a globally optimal model. In this framework, communication between the parallel
workers and the server, which can take place over the internet or cell phone network, can be slow,
costly, and unreliable. Thus, communication dominates the overall time cost of the process and
is the main bottleneck to be addressed by the community, before FL can be widely adopted and
applied in our daily lives.
The baseline algorithm of distributed Gradient Descent (GD) alternates between two steps:
one round of parallel computation of the local function gradients at the current model estimate,
and one round of communication of these gradient vectors to the server, which averages them and takes a gradient step to form the new model estimate for the next iteration. To decrease the communication load, two strategies
can be used: 1) communicate less frequently, or equivalently do more local computations between
successive communication rounds; or 2) compress the communicated vectors. We detail these two
strategies in Section 1.3. In this paper, we combine them, within a unified framework for randomized
communication, and derive a new algorithm named CompressedScaffnew, with local training and
communication compression. It is variance-reduced (Hanzely and Richtárik, 2019; Gorbunov et al.,
2020a; Gower et al., 2020), so that it converges to an exact solution, and provably benefits from
the two mechanisms: the convergence rate is doubly accelerated, with a better dependency on the
condition number of the functions and on the dimension of the model, in comparison with GD. In
the remainder of this section, we formulate the convex optimization problem to solve, we propose a
new model to characterize the communication complexity, and we present the state of the art.
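To fix ideas, here is a minimal Python/NumPy sketch of the baseline distributed GD described above; the quadratic local losses, the variable names and the stepsize choice are illustrative assumptions, not taken from this paper, and the comments indicate where the two communication-saving strategies would intervene.

import numpy as np

# Illustrative sketch of distributed Gradient Descent (GD); the quadratic
# local losses are placeholders for the clients' private objectives f_i.
rng = np.random.default_rng(0)
n, d = 10, 5                                  # number of clients, model dimension
A = [rng.standard_normal((20, d)) for _ in range(n)]
b = [rng.standard_normal(20) for _ in range(n)]

def local_gradient(i, x):
    # Gradient of f_i(x) = (1/(2m)) * ||A_i x - b_i||^2, computed on client i.
    m = A[i].shape[0]
    return A[i].T @ (A[i] @ x - b[i]) / m

L = max(np.linalg.eigvalsh(A[i].T @ A[i] / A[i].shape[0]).max() for i in range(n))
gamma = 1.0 / L                               # stepsize
x = np.zeros(d)                               # model estimate held by the server

for k in range(200):
    # 1) one round of parallel local computation on the clients
    grads = [local_gradient(i, x) for i in range(n)]
    # 2) one round of communication: the gradients are sent to the server,
    #    which averages them and takes a gradient step
    x = x - gamma * np.mean(grads, axis=0)
    # Strategy 1) would perform several local steps before communicating;
    # strategy 2) would compress each grads[i] before sending it.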
1.1 Formalism
We consider a distributed client-server setting, in which $n \geq 2$ clients, or compute nodes, perform
computations in parallel and communicate back and forth with a server, or master node. We study
the convex optimization problem:
\begin{equation}
\operatorname*{minimize}_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \tag{1}
\end{equation}
where each function $f_i : \mathbb{R}^d \to \mathbb{R}$ models the individual cost and underlying private data of client $i \in [n] := \{1, \dots, n\}$. The number $n$ of clients and the dimension $d \geq 1$ of the model are typically large. This problem is of key importance as it is an abstraction of empirical risk
minimization, the dominant framework in supervised machine learning.
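For instance, in the empirical risk minimization setting, each local objective typically takes the form (the notation below is introduced here only for illustration)
\[
f_i(x) = \frac{1}{m_i} \sum_{j=1}^{m_i} \ell\big(x; a_{i,j}, b_{i,j}\big),
\]
where $(a_{i,j}, b_{i,j})_{j=1}^{m_i}$ are the $m_i$ data samples stored on client $i$ and $\ell$ is a loss function, such as the squared or logistic loss.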
For every $i \in [n]$, the function $f_i$ is assumed to be $L$-smooth and $\mu$-strongly convex¹, for some
¹A function $f : \mathbb{R}^d \to \mathbb{R}$ is said to be $L$-smooth if it is differentiable and its gradient is Lipschitz continuous with constant $L$, i.e. $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$ for all $x, y \in \mathbb{R}^d$; it is $\mu$-strongly convex if $f - \frac{\mu}{2} \|\cdot\|^2$ is convex.