Provably Doubly Accelerated Federated Learning:
The First Theoretically Successful Combination of
Local Training and Communication Compression
Laurent Condat¹   Ivan Agarský²,³   Peter Richtárik¹
¹ King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
² Brno University of Technology, Brno, Czech Republic
³ Kempelen Institute of Intelligent Technologies (KInIT), Bratislava, Slovakia
January 2023
Abstract
In federated learning, a large number of users are involved in a global learning task, in
a collaborative way. They alternate local computations and two-way communication with a
distant orchestrating server. Communication, which can be slow and costly, is the main bottleneck
in this setting. To reduce the communication load and therefore accelerate distributed
gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform
several iterations of local computations between the communication rounds; and 2) communicate
compressed information instead of full-dimensional vectors. We propose the first algorithm for
distributed optimization and federated learning, which harnesses these two strategies jointly and
converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated
rate: our algorithm benefits from the two acceleration mechanisms provided by local training
and compression, namely a better dependency on the condition number of the functions and on
the dimension of the model, respectively.
arXiv:2210.13277v3 [cs.LG] 2 Feb 2023
Contents
1 Introduction
  1.1 Formalism
  1.2 Asymmetric communication regime
  1.3 Communication efficiency in FL: state of the art
    1.3.1 Local Training (LT)
    1.3.2 Communication Compression (CC)
2 Goals, challenges, contributions
3 Proposed algorithm CompressedScaffnew
  3.1 Iteration complexity
  3.2 Convergence in the convex case
4 Communication complexity
5 Experiments
6 Conclusion
A Proof of Theorem 3.1
  A.1 The random variable d_t
B Proof of Theorem 4.1
C Proof of Theorem 3.3
1 Introduction
Federated Learning (FL) is a novel paradigm for training supervised machine learning models.
Initiated a few years ago in several foundational papers (Konečný et al., 2016a,b; McMahan et al.,
2017; Bonawitz et al., 2017), it has become a rapidly growing interdisciplinary field. The key idea
is to exploit the wealth of information stored on edge devices, such as mobile phones, sensors and
hospital workstations, to train global models, in a collaborative way, while handling a multitude of
challenges, like data privacy concerns (Kairouz et al., 2021; Li et al., 2020; Wang et al., 2021). In
contrast to centralized learning in a datacenter, in FL, the parallel computing units have private
data stored on each of them and communicate with a distant orchestrating server, which aggregates
the information and synchronizes the computations, so that the process reaches a consensus and
converges to a globally optimal model. In this framework, communication between the parallel
workers and the server, which can take place over the internet or cell phone network, can be slow,
costly, and unreliable. Thus, communication dominates the overall time cost of the process and
is the main bottleneck to be addressed by the community, before FL can be widely adopted and
applied in our daily lives.
The baseline algorithm of distributed Gradient Descent (GD) alternates between two steps:
one round of parallel computation of the local function gradients at the current model estimate,
and one round of communication of these gradient vectors to the server, which averages them to
form the new estimate for the next iteration. To decrease the communication load, two strategies
can be used: 1) communicate less frequently, or equivalently do more local computations between
successive communication rounds; or 2) compress the communicated vectors. We detail these two
strategies in Section 1.3. In this paper, we combine them, within a unified framework for randomized
communication, and derive a new algorithm named CompressedScaffnew, with local training and
communication compression. It is variance-reduced (Hanzely and Richtárik, 2019; Gorbunov et al.,
2020a; Gower et al., 2020), so that it converges to an exact solution, and provably benefits from
the two mechanisms: the convergence rate is doubly accelerated, with a better dependency on the
condition number of the functions and on the dimension of the model, in comparison with GD. In
the remainder of this section, we formulate the convex optimization problem to solve, we propose a
new model to characterize the communication complexity, and we present the state of the art.
1.1 Formalism
We consider a distributed client-server setting, in which n ≥ 2 clients, or compute nodes, perform
computations in parallel and communicate back and forth with a server, or master node. We study
the convex optimization problem:

    minimize_{x ∈ R^d}  f(x) := (1/n) ∑_{i=1}^n f_i(x),    (1)

where each function f_i : R^d → R models the individual cost and underlying private data of client
i ∈ [n] := {1, . . . , n}. The number n of clients, as well as the dimension d ≥ 1 of the model,
are typically large. This problem is of key importance as it is an abstraction of empirical risk
minimization, the dominant framework in supervised machine learning.
For every i ∈ [n], the function f_i is supposed to be L-smooth and µ-strongly convex¹, for some
L ≥ µ > 0 (except in Section 3.2, where we study the merely convex case, i.e. µ = 0). Thus, the
sought solution x⋆ of (1) exists and is unique. We define κ := L/µ.

¹ A function f : R^d → R is said to be L-smooth if it is differentiable and its gradient is Lipschitz
continuous with constant L; that is, for every x ∈ R^d and y ∈ R^d, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖,
where, here and throughout the paper, the norm is the Euclidean norm. f is said to be µ-strongly
convex if f − (µ/2)‖·‖² is convex. We refer to Bauschke and Combettes (2017) for such standard
notions of convex analysis.
To solve the problem (1), the baseline algorithm of Gradient Descent (GD) consists in the simple
iteration, for t = 0, 1, . . . ,

    x^{t+1} := x^t − (γ/n) ∑_{i=1}^n ∇f_i(x^t),

for some stepsize γ ∈ (0, 2/L). That is, at iteration t ≥ 0, x^t is first broadcast by the server to
all clients, which compute the gradients ∇f_i(x^t) ∈ R^d in parallel. These vectors are then sent by
the clients to the server, which averages them and performs the gradient descent step. It is well
known that for γ = Θ(1/L), GD converges linearly, with iteration complexity O(κ log(ε⁻¹)) to reach
ε-accuracy. Since d-dimensional vectors are communicated at every iteration, the communication
complexity of GD in number of reals is O(dκ log(ε⁻¹)). Our goal is a twofold acceleration of GD,
with a better dependency on both κ and d in this communication complexity. We want to achieve
this goal by leveraging the best of the two popular mechanisms of local training and communication
compression.
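To make the baseline concrete, the following is a minimal single-process simulation of distributed GD on problem (1). The quadratic local losses, the number of clients and the dimension are illustrative placeholders, not a setup taken from the paper.

```python
# Sketch of distributed GD: broadcast x^t, compute local gradients, average, step.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 5                                          # number of clients, model dimension
A = [rng.standard_normal((20, d)) for _ in range(n)]  # private data of client i (placeholder)
b = [rng.standard_normal(20) for _ in range(n)]

def grad_fi(i, x):
    """Gradient of the local least-squares loss f_i(x) = (1/2m) ||A_i x - b_i||^2."""
    return A[i].T @ (A[i] @ x - b[i]) / A[i].shape[0]

# Smoothness constant: largest eigenvalue of A_i^T A_i / m over all clients.
L = max(np.linalg.eigvalsh(A[i].T @ A[i] / A[i].shape[0]).max() for i in range(n))
gamma = 1.0 / L                                       # stepsize gamma = Theta(1/L)

x = np.zeros(d)
for t in range(1000):
    # DownCom: the server broadcasts x^t. UpCom: each client returns grad f_i(x^t), i.e. d reals.
    grads = [grad_fi(i, x) for i in range(n)]
    x = x - gamma * np.mean(grads, axis=0)            # gradient descent step on the average
```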
1.2 Asymmetric communication regime
Uplink and downlink communication. We call uplink communication (UpCom) the parallel
transmission of data from the clients to the server and downlink communication (DownCom) the
broadcast of the same message from the server to all clients. UpCom is usually significantly slower
than DownCom, just like uploading is slower than downloading on the internet or cell phone network.
This can be due to the asymmetry of the service provider’s systems or protocols used on the
communication network, or cache memory and aggregation speed constraints of the server, which
has to decode and average the large number n of vectors received at the same time during UpCom.
Communication complexity. We measure the UpCom or DownCom complexity as the expected
number of communication rounds needed to estimate a solution with -accuracy, multiplied by the
number of real values sent during a communication round between the server and any client. Thus,
the UpCom or DownCom complexity of GD is O(dκ log(ε⁻¹)). We leave it for future work to
refine this model of counting real numbers, to take into account how sequences of real numbers are
quantized into bitstreams, achieving further compression (Horváth et al., 2022; Albasyoni et al.,
2020).
A model for the overall communication complexity. Since UpCom is usually slower than
DownCom, we propose to measure the total communication (TotalCom) complexity as a weighted
sum of the two UpCom and DownCom complexities: we assume that the UpCom cost is 1 (unit of
time per transmitted real number), whereas the DownCom cost is c ∈ [0, 1]. Therefore,

    TotalCom = UpCom + c · DownCom.    (2)

A symmetric but unrealistic communication regime corresponds to c = 1, whereas ignoring DownCom
and focusing on UpCom, which is usually the limiting factor, corresponds to c = 0. We will
provide explicit expressions of the parameters of our algorithm to minimize the TotalCom complexity
for any given c ∈ [0, 1], keeping in mind that realistic settings correspond to small values of c.
Thus, our model of communication complexity is richer than only considering c = 0, as is usually
the case.
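As a small illustration of the model (2), the sketch below evaluates the TotalCom cost of an algorithm that, like GD, exchanges d reals per client in each direction at every round; the dimension, round count and value of c are hypothetical.

```python
# Worked example of the TotalCom model (2); all quantities are counted in real numbers.
def total_com(upcom_reals: float, downcom_reals: float, c: float) -> float:
    """TotalCom = UpCom + c * DownCom."""
    return upcom_reals + c * downcom_reals

d, R, c = 10_000, 500, 0.1        # hypothetical dimension, number of rounds, DownCom weight
# d reals per client uplink and downlink at every round, as in GD:
print(total_com(upcom_reals=d * R, downcom_reals=d * R, c=c))   # = (1 + c) * d * R
```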
1.3 Communication efficiency in FL: state of the art
Two approaches come naturally to mind to decrease the communication load: Local Training (LT),
which consists in communicating less frequently than at every iteration, and Communication Compression
(CC), which consists in sending fewer than d floats during every communication round. In
this section, we review existing work related to these two strategies.
1.3.1 Local Training (LT)
LT is a conceptually simple and surprisingly powerful communication-acceleration technique. It
consists in the clients performing multiple local GD steps instead of only one, between successive
communication rounds. This intuitively results in “better” information being communicated, so that
fewer communication rounds are needed to achieve a given accuracy. As shown by ample empirical
evidence, LT is very efficient in practice. It was popularized by the FedAvg algorithm of McMahan
et al. (2017), in which LT is a core component, along with other features such as data sampling and
partial participation. However, LT was used as a heuristic and no theory was provided in their paper. LT was
analyzed in several works, in the homogeneous, or i.i.d. data, regime (Haddadpour and Mahdavi,
2019), and in the heterogeneous regime, which is more representative in FL (Khaled et al., 2019,
2020; Stich, 2019; Woodworth et al., 2020; Gorbunov et al., 2021; Glasgow et al., 2022). It stands out
that LT suffers from so-called client drift, which is the fact that the local model obtained by client i
after several local GD steps approaches the minimizer of its local cost function f_i. The discrepancy
between the exact solution x⋆ of (1) and the approximate solution obtained at convergence of LT
was characterized in Malinovsky et al. (2020). This deficiency of LT was corrected in the Scaffold
algorithm of Karimireddy et al. (2020) by introducing control variates, which correct for the client
drift, so that the algorithm converges linearly to the exact solution. S-Local-GD (Gorbunov et al.,
2021) and FedLin (Mitra et al., 2021) were later proposed, with similar convergence properties. Yet,
despite the empirical superiority of these recent algorithms relying on LT, their communication
complexity remains the same as vanilla GD, i.e. O(dκ log(ε⁻¹)).
It is only very recently that Scaffnew was proposed by Mishchenko et al. (2022), an LT algorithm
finally achieving provably accelerated communication complexity, with an O(d√κ log(ε⁻¹)) rate
for both UpCom and DownCom. In Scaffnew, communication is triggered randomly with a small
probability p at every iteration. Thus, the expected number of local GD steps between two
communication rounds is 1/p. By choosing p = Θ(1/√κ), the optimal dependency on √κ instead of
κ is obtained. Thus, the discovery of Scaffnew is an important milestone, as it provides the first
theoretical confirmation that LT is a communication acceleration mechanism. In this paper, we
propose to go even further and tackle the multiplicative factor d in the complexity of Scaffnew.
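To make the probabilistic-communication mechanism concrete, here is a minimal single-process sketch in the spirit of Scaffnew, based on our reading of Mishchenko et al. (2022): each client takes a local gradient step corrected by a control variate h_i, and with probability p a communication round averages the local models and updates the h_i. The quadratic local losses and all numerical values are illustrative placeholders, not the paper's pseudocode or experimental setup.

```python
# Sketch of LT with control variates and random communication (Scaffnew-style).
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 5
A = [rng.standard_normal((20, d)) for _ in range(n)]   # private data of client i (placeholder)
b = [rng.standard_normal(20) for _ in range(n)]

def grad_fi(i, x):
    """Gradient of the local least-squares loss f_i(x) = (1/2m) ||A_i x - b_i||^2."""
    return A[i].T @ (A[i] @ x - b[i]) / A[i].shape[0]

eigs = [np.linalg.eigvalsh(A[i].T @ A[i] / A[i].shape[0]) for i in range(n)]
L = max(e.max() for e in eigs)                          # smoothness constant
mu = min(e.min() for e in eigs)                         # strong convexity constant
gamma = 1.0 / L
p = min(1.0, np.sqrt(max(mu, 1e-12) / L))               # p = Theta(1/sqrt(kappa)): ~sqrt(kappa) local steps per round

x = [np.zeros(d) for _ in range(n)]                     # local models
h = [np.zeros(d) for _ in range(n)]                     # control variates, summing to zero

for t in range(2000):
    # Local step, corrected by the control variate to fight client drift.
    x_half = [x[i] - gamma * (grad_fi(i, x[i]) - h[i]) for i in range(n)]
    if rng.random() < p:                                # communication round, probability p
        x_bar = np.mean(x_half, axis=0)                 # server averages the local models
        h = [h[i] + (p / gamma) * (x_bar - x_half[i]) for i in range(n)]
        x = [x_bar.copy() for _ in range(n)]
    else:
        x = x_half                                      # skip communication, keep local models
```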
1.3.2 Communication Compression (CC)
To decrease the communication complexity, a widely used strategy is to make use of (lossy) compression;
that is, a possibly randomized mapping C : R^d → R^d is applied to the vector x that needs to be
communicated.
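As one standard example of such a randomized mapping (not the specific compressor used by CompressedScaffnew), a rand-k sparsifier keeps k coordinates chosen uniformly at random, zeroes out the rest, and rescales by d/k so that the compression is unbiased; only k (index, value) pairs then need to be transmitted.

```python
# Illustrative example of a lossy randomized compressor C: R^d -> R^d (rand-k sparsification).
import numpy as np

def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Keep k coordinates of x chosen uniformly at random, zero out the rest,
    and rescale by d/k so that E[C(x)] = x (unbiased compressor)."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
cx = rand_k(x, k=10, rng=rng)   # only 10 (index, value) pairs need to be sent instead of 100 reals
```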