Provably Doubly Accelerated Federated Learning:
The First Theoretically Successful Combination of
Local Training and Communication Compression
Laurent Condat¹   Ivan Agarský²,³   Peter Richtárik¹
¹ King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
² Brno University of Technology, Brno, Czech Republic
³ Kempelen Institute of Intelligent Technologies (KInIT), Bratislava, Slovakia
January 2023
Abstract
In federated learning, a large number of users are involved in a global learning task, in
a collaborative way. They alternate local computations and two-way communication with a
distant orchestrating server. Communication, which can be slow and costly, is the main bottleneck
in this setting. To reduce the communication load and therefore accelerate distributed
gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform
several iterations of local computations between the communication rounds; and 2) communicate
compressed information instead of full-dimensional vectors. We propose the first algorithm for
distributed optimization and federated learning, which harnesses these two strategies jointly and
converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated
rate: our algorithm benefits from the two acceleration mechanisms provided by local training
and compression, namely a better dependency on the condition number of the functions and on
the dimension of the model, respectively.
arXiv:2210.13277v3 [cs.LG] 2 Feb 2023
Contents
1 Introduction
  1.1 Formalism
  1.2 Asymmetric communication regime
  1.3 Communication efficiency in FL: state of the art
    1.3.1 Local Training (LT)
    1.3.2 Communication Compression (CC)
2 Goals, challenges, contributions
3 Proposed algorithm CompressedScaffnew
  3.1 Iteration complexity
  3.2 Convergence in the convex case
4 Communication complexity
5 Experiments
6 Conclusion
A Proof of Theorem 3.1
  A.1 The random variable d_t
B Proof of Theorem 4.1
C Proof of Theorem 3.3
1 Introduction
Federated Learning (FL) is a novel paradigm for training supervised machine learning models.
Initiated a few years ago in several foundational papers (Konečný et al., 2016a,b; McMahan et al.,
2017; Bonawitz et al., 2017), it has become a rapidly growing interdisciplinary field. The key idea
is to exploit the wealth of information stored on edge devices, such as mobile phones, sensors and
hospital workstations, to train global models, in a collaborative way, while handling a multitude of
challenges, like data privacy concerns (Kairouz et al., 2021; Li et al., 2020; Wang et al., 2021). In
contrast to centralized learning in a datacenter, in FL, the parallel computing units have private
data stored on each of them and communicate with a distant orchestrating server, which aggregates
the information and synchronizes the computations, so that the process reaches a consensus and
converges to a globally optimal model. In this framework, communication between the parallel
workers and the server, which can take place over the internet or cell phone network, can be slow,
costly, and unreliable. Thus, communication dominates the overall time cost of the process and
is the main bottleneck to be addressed by the community, before FL can be widely adopted and
applied in our daily lives.
The baseline algorithm of distributed Gradient Descent (GD) alternates between two steps:
one round of parallel computation of the local function gradients at the current model estimate,
and one round of communication of these gradient vectors to the server, which averages them to
form the new estimate for the next iteration. To decrease the communication load, two strategies
can be used: 1) communicate less frequently, or equivalently do more local computations between
successive communication rounds; or 2) compress the communicated vectors. We detail these two
strategies in Section 1.3. In this paper, we combine them, within a unified framework for randomized
communication, and derive a new algorithm named CompressedScaffnew, with local training and
communication compression. It is variance-reduced (Hanzely and Richtárik, 2019; Gorbunov et al.,
2020a; Gower et al., 2020), so that it converges to an exact solution, and provably benefits from
the two mechanisms: the convergence rate is doubly accelerated, with a better dependency on the
condition number of the functions and on the dimension of the model, in comparison with GD. In
the remainder of this section, we formulate the convex optimization problem to solve, we propose a
new model to characterize the communication complexity, and we present the state of the art.
1.1 Formalism
We consider a distributed client-server setting, in which n ≥ 2 clients, or compute nodes, perform
computations in parallel and communicate back and forth with a server, or master node. We study
the convex optimization problem:

    minimize_{x ∈ R^d}  f(x) := (1/n) ∑_{i=1}^n f_i(x),    (1)

where each function f_i : R^d → R models the individual cost and underlying private data of client
i ∈ [n] := {1, . . . , n}. The number n of clients, as well as the dimension d ≥ 1 of the model,
are typically large. This problem is of key importance as it is an abstraction of empirical risk
minimization, the dominant framework in supervised machine learning.
For every i ∈ [n], the function f_i is supposed to be L-smooth and µ-strongly convex¹, for some
L ≥ µ > 0 (except in Section 3.2, where we study the merely convex case, i.e. µ = 0). Thus, the
sought solution x⋆ of (1) exists and is unique. We define κ := L/µ.

¹ A function f : R^d → R is said to be L-smooth if it is differentiable and its gradient is Lipschitz
continuous with constant L; that is, for every x ∈ R^d and y ∈ R^d, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖,
where, here and throughout the paper, the norm is the Euclidean norm. f is said to be µ-strongly
convex if f − (µ/2)‖·‖² is convex. We refer to Bauschke and Combettes (2017) for such standard
notions of convex analysis.
To solve the problem (1), the baseline algorithm of Gradient Descent (GD) consists in the simple
iteration, for t = 0, 1, . . . ,

    x^{t+1} := x^t − (γ/n) ∑_{i=1}^n ∇f_i(x^t),

for some stepsize γ ∈ (0, 2/L). That is, at iteration t ≥ 0, x^t is first broadcast by the server to
all clients, which compute the gradients ∇f_i(x^t) ∈ R^d in parallel. These vectors are then sent by
the clients to the server, which averages them and performs the gradient descent step. It is well
known that for γ = Θ(1/L), GD converges linearly, with iteration complexity O(κ log(ε⁻¹)) to reach
ε-accuracy. Since d-dimensional vectors are communicated at every iteration, the communication
complexity of GD in number of reals is O(dκ log(ε⁻¹)). Our goal is a twofold acceleration of GD,
with a better dependency on both κ and d in this communication complexity. We want to achieve
this goal by leveraging the best of the two popular mechanisms of local training and communication
compression.
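To make the baseline concrete, the following is a minimal single-process simulation of distributed GD on problem (1). The quadratic local losses, the number of clients and the dimension are illustrative placeholders, not a setup taken from the paper.

```python
# Sketch of distributed GD: broadcast x^t, compute local gradients, average, step.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 5                                          # number of clients, model dimension
A = [rng.standard_normal((20, d)) for _ in range(n)]  # private data of client i (placeholder)
b = [rng.standard_normal(20) for _ in range(n)]

def grad_fi(i, x):
    """Gradient of the local least-squares loss f_i(x) = (1/2m) ||A_i x - b_i||^2."""
    return A[i].T @ (A[i] @ x - b[i]) / A[i].shape[0]

# Smoothness constant: largest eigenvalue of A_i^T A_i / m over all clients.
L = max(np.linalg.eigvalsh(A[i].T @ A[i] / A[i].shape[0]).max() for i in range(n))
gamma = 1.0 / L                                       # stepsize gamma = Theta(1/L)

x = np.zeros(d)
for t in range(1000):
    # DownCom: the server broadcasts x^t. UpCom: each client returns grad f_i(x^t), i.e. d reals.
    grads = [grad_fi(i, x) for i in range(n)]
    x = x - gamma * np.mean(grads, axis=0)            # gradient descent step on the average
```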
1.2 Asymmetric communication regime
Uplink and downlink communication. We call uplink communication (UpCom) the parallel
transmission of data from the clients to the server and downlink communication (DownCom) the
broadcast of the same message from the server to all clients. UpCom is usually significantly slower
than DownCom, just like uploading is slower than downloading on the internet or cell phone network.
This can be due to the asymmetry of the service provider’s systems or protocols used on the
communication network, or cache memory and aggregation speed constraints of the server, which
has to decode and average the large number n of vectors received at the same time during UpCom.
Communication complexity. We measure the UpCom or DownCom complexity as the expected
number of communication rounds needed to estimate a solution with -accuracy, multiplied by the
number of real values sent during a communication round between the server and any client. Thus,
the UpCom or DownCom complexity of GD is O(dκ log(ε⁻¹)). We leave it for future work to
refine this model of counting real numbers, to take into account how sequences of real numbers are
quantized into bitstreams, achieving further compression (Horváth et al., 2022; Albasyoni et al.,
2020).
A model for the overall communication complexity. Since UpCom is usually slower than
DownCom, we propose to measure the total communication (TotalCom) complexity as a weighted
sum of the two UpCom and DownCom complexities: we assume that the UpCom cost is 1 (unit of
time per transmitted real number), whereas the DownCom cost is c ∈ [0, 1]. Therefore,

    TotalCom = UpCom + c · DownCom.    (2)

A symmetric but unrealistic communication regime corresponds to c = 1, whereas ignoring DownCom
and focusing on UpCom, which is usually the limiting factor, corresponds to c = 0. We will
provide explicit expressions of the parameters of our algorithm to minimize the TotalCom complexity
for any given c ∈ [0, 1], keeping in mind that realistic settings correspond to small values of c.
Thus, our model of communication complexity is richer than only considering c = 0, as is usually
the case.
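As a small illustration of the model (2), the sketch below evaluates the TotalCom cost of an algorithm that, like GD, exchanges d reals per client in each direction at every round; the dimension, round count and value of c are hypothetical.

```python
# Worked example of the TotalCom model (2); all quantities are counted in real numbers.
def total_com(upcom_reals: float, downcom_reals: float, c: float) -> float:
    """TotalCom = UpCom + c * DownCom."""
    return upcom_reals + c * downcom_reals

d, R, c = 10_000, 500, 0.1        # hypothetical dimension, number of rounds, DownCom weight
# d reals per client uplink and downlink at every round, as in GD:
print(total_com(upcom_reals=d * R, downcom_reals=d * R, c=c))   # = (1 + c) * d * R
```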
1.3 Communication efficiency in FL: state of the art
Two approaches come naturally to mind to decrease the communication load: Local Training (LT),
which consists in communicating less frequently than at every iteration, and Communication Compression
(CC), which consists in sending fewer than d floats during every communication round. In
this section, we review existing work related to these two strategies.
1.3.1 Local Training (LT)
LT is a conceptually simple and surprisingly powerful communication-acceleration technique. It
consists in the clients performing multiple local GD steps instead of only one, between successive
communication rounds. This intuitively results in “better” information being communicated, so that
fewer communication rounds are needed to achieve a given accuracy. As shown by ample empirical
evidence, LT is very efficient in practice. It was popularized by the FedAvg algorithm of McMahan
et al. (2017), in which LT is a core component, along with other features such as data sampling and
partial participation. However, LT was used as a heuristic and no theory was provided in their paper. LT was
analyzed in several works, in the homogeneous, or i.i.d. data, regime (Haddadpour and Mahdavi,
2019), and in the heterogeneous regime, which is more representative in FL (Khaled et al., 2019,
2020; Stich, 2019; Woodworth et al., 2020; Gorbunov et al., 2021; Glasgow et al., 2022). It stands out
that LT suffers from so-called client drift, which is the fact that the local model obtained by client i
after several local GD steps approaches the minimizer of its local cost function f_i. The discrepancy
between the exact solution x⋆ of (1) and the approximate solution obtained at convergence of LT
was characterized in Malinovsky et al. (2020). This deficiency of LT was corrected in the Scaffold
algorithm of Karimireddy et al. (2020) by introducing control variates, which correct for the client
drift, so that the algorithm converges linearly to the exact solution. S-Local-GD (Gorbunov et al.,
2021) and FedLin (Mitra et al., 2021) were later proposed, with similar convergence properties. Yet,
despite the empirical superiority of these recent algorithms relying on LT, their communication
complexity remains the same as vanilla GD, i.e. O(dκ log(ε⁻¹)).
It is only very recently that Scaffnew was proposed by Mishchenko et al. (2022), an LT algorithm
finally achieving provably accelerated communication complexity, with an O(d√κ log(ε⁻¹)) rate
for both UpCom and DownCom. In Scaffnew, communication is triggered randomly with a small
probability p at every iteration. Thus, the expected number of local GD steps between two
communication rounds is 1/p. By choosing p = Θ(1/√κ), the optimal dependency on √κ instead of
κ is obtained. Thus, the discovery of Scaffnew is an important milestone, as it provides the first
theoretical confirmation that LT is a communication acceleration mechanism. In this paper, we
propose to go even further and tackle the multiplicative factor d in the complexity of Scaffnew.
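To make the probabilistic-communication mechanism concrete, here is a minimal single-process sketch in the spirit of Scaffnew, based on our reading of Mishchenko et al. (2022): each client takes a local gradient step corrected by a control variate h_i, and with probability p a communication round averages the local models and updates the h_i. The quadratic local losses and all numerical values are illustrative placeholders, not the paper's pseudocode or experimental setup.

```python
# Sketch of LT with control variates and random communication (Scaffnew-style).
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 5
A = [rng.standard_normal((20, d)) for _ in range(n)]   # private data of client i (placeholder)
b = [rng.standard_normal(20) for _ in range(n)]

def grad_fi(i, x):
    """Gradient of the local least-squares loss f_i(x) = (1/2m) ||A_i x - b_i||^2."""
    return A[i].T @ (A[i] @ x - b[i]) / A[i].shape[0]

eigs = [np.linalg.eigvalsh(A[i].T @ A[i] / A[i].shape[0]) for i in range(n)]
L = max(e.max() for e in eigs)                          # smoothness constant
mu = min(e.min() for e in eigs)                         # strong convexity constant
gamma = 1.0 / L
p = min(1.0, np.sqrt(max(mu, 1e-12) / L))               # p = Theta(1/sqrt(kappa)): ~sqrt(kappa) local steps per round

x = [np.zeros(d) for _ in range(n)]                     # local models
h = [np.zeros(d) for _ in range(n)]                     # control variates, summing to zero

for t in range(2000):
    # Local step, corrected by the control variate to fight client drift.
    x_half = [x[i] - gamma * (grad_fi(i, x[i]) - h[i]) for i in range(n)]
    if rng.random() < p:                                # communication round, probability p
        x_bar = np.mean(x_half, axis=0)                 # server averages the local models
        h = [h[i] + (p / gamma) * (x_bar - x_half[i]) for i in range(n)]
        x = [x_bar.copy() for _ in range(n)]
    else:
        x = x_half                                      # skip communication, keep local models
```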
1.3.2 Communication Compression (CC)
To decrease the communication complexity, a widely used strategy is to make use of (lossy) compression;
that is, a possibly randomized mapping C : R^d → R^d is applied to the vector x that needs to be
communicated.
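As one standard example of such a randomized mapping (not the specific compressor used by CompressedScaffnew), a rand-k sparsifier keeps k coordinates chosen uniformly at random, zeroes out the rest, and rescales by d/k so that the compression is unbiased; only k (index, value) pairs then need to be transmitted.

```python
# Illustrative example of a lossy randomized compressor C: R^d -> R^d (rand-k sparsification).
import numpy as np

def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Keep k coordinates of x chosen uniformly at random, zero out the rest,
    and rescale by d/k so that E[C(x)] = x (unbiased compressor)."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
cx = rand_k(x, k=10, rng=rng)   # only 10 (index, value) pairs need to be sent instead of 100 reals
```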