PersA-FL: Personalized Asynchronous Federated Learning
Mohammad Taha Toghani^a, Soomin Lee^b, César A. Uribe^a
^a Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
^b Yahoo! Research, Sunnyvale, California, USA
ARTICLE HISTORY
Compiled October 5, 2023
ABSTRACT
We study the personalized federated learning problem under asynchronous updates. In this problem, each client seeks to obtain a personalized model that simultaneously outperforms local and global models. We consider two optimization-based frameworks for personalization: (i) Model-Agnostic Meta-Learning (MAML) and (ii) Moreau Envelope (ME). MAML involves learning a joint model adapted for each client through fine-tuning, whereas ME requires a bi-level optimization problem with implicit gradients to enforce personalization via regularized losses. We focus on improving the scalability of personalized federated learning by removing the synchronous communication assumption. Moreover, we extend the studied function class by removing boundedness assumptions on the gradient norm. Our main technical contribution is a unified proof for asynchronous federated learning with bounded staleness that we apply to the MAML and ME personalization frameworks. For the class of smooth non-convex functions, we show convergence of our method to a first-order stationary point. We illustrate the performance of our method and its tolerance to staleness through experiments on classification tasks over heterogeneous datasets.
KEYWORDS
Federated Learning; Personalization; Asynchronous Communication;
Heterogeneous Data; Distributed Optimization; Staleness.
1. Introduction
Federated Learning (FL) is designed to facilitate distributed training of machine learning models across devices by exploiting the data and computation power available to them [38]. A major benefit of FL is its ability to allow training models on data distributed across multiple devices without centralization. This is particularly beneficial in situations with limited sensitive data [32, 33] where clients are reluctant to share their private data. At the same time, it is known that training over a larger set of data points improves the quality of the obtained model [68]. In such scenarios, FL enjoys the power of collaborative learning without relocating the data from its original source [32]. Nevertheless, FL poses challenges such as data heterogeneity (statistical diversity among clients) [11, 17, 34, 42], fairness [14, 43], privacy [24, 30, 55, 69], unreliable communication [60, 64], and staleness [3, 44, 53, 70].
This work was partly done while MTT was interning at Yahoo! Research. Part of this material is based upon work supported by the National Science Foundation under Grants #2211815 and #2213568. Corresponding Author's Email: mttoghani@rice.edu
The common underlying assumption that determines the superiority of FL over individual local training is that the data points of all clients come from the same distribution, i.e., that data is homogeneous across clients. Consequently, FL can improve the quality of empirical loss minimization when the data available on each device is limited; otherwise, each client may obtain a proper model without collaboration or communication with others. Therefore, FL¹ results in a common global model with better generalization across clients [46] compared to individual training. In heterogeneous data setups, where clients hold samples from non-identical data distributions, a common (global) model may perform poorly on the local data points of each client. For instance, consider the next-word prediction task on a smart keyboard [28], where each client has a unique writing style or emphasis on the vocabulary domain. In this example, the corresponding mobile application is supposed to suggest a set of words that will likely be selected as the next word in the sentence. This scenario describes a heterogeneous data setup with limited samples on each client's device. Thus, if each client trains a model independently, without collaboration with the other clients, the model will likely perform poorly on new data due to the limited number of samples. This raises the question of what happens when clients hold data samples from similar (but not identical) distributions.

¹ We refer to Federated Learning with no personalization as FL.
In FL with heterogeneous data, an ideal scenario is to learn a globally common model that is easily adaptable to the local data on each client, i.e., model fusion. This approach is known as Personalized Federated Learning (PFL), which strives to exploit both the shared and unshared information in the data of all clients. One solution to model fusion in PFL is to apply transfer learning [12, 73] (e.g., fine-tuning) on a model jointly trained under FL. Interestingly, the centralized version of this problem has been extensively studied in Meta-Learning [66] and Multi-Task Learning [50], where the goal is to obtain a meta (global) model that performs well on multiple tasks after (potentially) minimal adaptation. In particular, Model-Agnostic Meta-Learning (MAML) [21, 56] proposes an optimization-based formulation that aims to find an initial meta-model that performs well after one or a few steps of (stochastic) gradient descent. The key property of MAML is its ability to account for fine-tuning during the learning process. Multiple studies have analyzed the convergence and generalization of MAML [8, 16, 18, 19, 22, 31] for various problems and setups. Fallah et al. [17] suggest the MAML formulation as a potential solution for PFL and propose the Per-FedAvg algorithm for collaborative learning with a MAML-personalized cost function. Dinh et al. [13] present the pFedMe algorithm for PFL by adopting a different formulation for personalization, namely Moreau Envelopes (ME). The resulting formulation is a bi-level optimization problem with personalized parameters that are regularized to be close to the global model. We elaborate on these two formulations (MAML & ME) in Section 2. Additionally, several recent works have approached PFL mainly through optimization-based [5, 10, 20, 23, 26, 27, 29, 45, 46, 64, 72] or structure-based [9, 59, 65] techniques.
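To make the two formulations concrete before Section 2, the following minimal sketch (plain NumPy, with a hypothetical quadratic loss and illustrative names A, b, alpha, lam) evaluates the two personalized costs in the forms they are commonly written: the one-step MAML cost f_i(w − α∇f_i(w)) and the Moreau-envelope cost min_θ {f_i(θ) + (λ/2)‖θ − w‖²}; the exact objectives used in this paper are stated in Section 2.

```python
import numpy as np

# Toy local loss for one client: f_i(w) = 0.5 * ||A w - b||^2 (hypothetical data A, b).
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

def f_i(w):
    r = A @ w - b
    return 0.5 * r @ r

def grad_f_i(w):
    return A.T @ (A @ w - b)

def maml_cost(w, alpha=0.01):
    # One-step MAML personalized cost: evaluate f_i after one gradient step from w.
    w_personalized = w - alpha * grad_f_i(w)
    return f_i(w_personalized)

def moreau_cost(w, lam=1.0, steps=100, lr=0.01):
    # Moreau-envelope personalized cost: min_theta f_i(theta) + (lam/2)||theta - w||^2,
    # solved approximately by gradient descent on theta (an "inexact solver").
    theta = w.copy()
    for _ in range(steps):
        theta -= lr * (grad_f_i(theta) + lam * (theta - w))
    return f_i(theta) + 0.5 * lam * np.sum((theta - w) ** 2), theta

w = np.zeros(5)
print("local loss      :", f_i(w))
print("MAML cost       :", maml_cost(w))
print("Moreau-envelope :", moreau_cost(w)[0])
```

The MAML cost builds the personalization step into the objective itself, whereas the Moreau-envelope cost keeps a separate personalized parameter θ that is only regularized toward the shared model w.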
Scalability to large-scale setups with potentially many clients is another major challenge for FL. Most algorithms proposed in this setting require synchronous communication between the server and clients [10, 13, 17, 23, 38, 43, 47]. Such constraints impose considerable delays on the learning progress, since increasing the concurrency of synchronous updates decreases the training speed and quality. For example, limited communication bandwidth, limited computation power, and communication failures incur large delays in the training process. In cross-device FL, devices naturally update and communicate models under less restrictive rules, and clients may apply updates in an asynchronous fashion, which introduces staleness. Hogwild! [52] is one of the first efforts to model asynchrony in a distributed setup with delayed updates. Multiple works have studied asynchronous training under different setups and assumptions [1, 4, 6, 15, 44, 49, 53]. Specifically, some recent seminal works have studied the convergence of asynchronous SGD-based methods and show their convergence under certain assumptions on the maximum or average delay [2, 37, 48, 61].² In decentralized setups, Hadjicostis et al. [25] propose a consensus algorithm called running-sum, which is robust to message losses. Furthermore, Olshevsky et al. [54] present a more general framework with robustness to asynchrony, message losses, and delays for both consensus and optimization problems [60, 64]. Closer to our work, FL under stale updates has been thoroughly studied in [3, 40, 51, 57, 65, 70]. In particular, Tziotis et al. [65] study the existence of stragglers in PFL via shared representations, i.e., system and data heterogeneity in structure-based personalization.

² Mishchenko et al. [48] study the convergence of distributed optimization for homogeneous strongly convex and smooth functions with no assumptions on the maximum delay, i.e., unbounded staleness.
The main contribution of [51] is its server algorithm, which is made more secure and robust by aggregating a buffer of asynchronous updates within a secure channel prior to sending them to the server. In contrast, our work focuses on scalability and personalization via asynchronous communication and learning personalized models.
In this work, we study the PFL problem under asynchronous communication to improve training concurrency, performance, and efficiency. We propose PersA-FL, a novel personalized and asynchronous method that jointly addresses heterogeneity and staleness in FL. We develop a technique based on asynchronous updates to resolve the communication bottleneck imposed by synchronized learning in PFL, thereby improving training scalability and performance. To the best of our knowledge, this is the first study of the intersection of staleness and personalization through the lens of optimization-based techniques. We summarize our contributions as follows:
• Through the integration of two personalization formulations, MAML & ME, we propose PersA-FL, an algorithm that allows personalized federated learning under asynchronous communication between the server and clients. Our proposed method consists of two algorithms, from the perspectives of the server and the clients. We present the client algorithm under three different options for the local updates, each addressing a separate formulation: (A) FedAsync, (B) PersA-FL-MAML, and (C) PersA-FL-ME. (A schematic simulation of this asynchronous server/client interaction is sketched after this list.)
• We present a new convergence analysis for Asynchronous Federated Learning (FedAsync) under smooth non-convex cost functions by removing the boundedness assumption on the gradient norm. Our analysis assumes bounded variance of stochasticity and heterogeneity, and a bounded maximum delay. Hence, we improve the existing theory by extending the result to a broader function class, i.e., functions with unbounded gradient norm.
• We show the convergence rate of PersA-FL-MAML in terms of the maximum delay and personalization budget, under the same assumptions as Fallah et al. [17].³ We highlight the impact of the batch size on the biased stochastic estimation of the full gradients for the MAML cost. We present the communication and sample complexity to find an ε-first-order stationary point for the proposed algorithm.

³ Besides the assumptions for FedAsync, seminal works [17, 22, 56] assume second-order Lipschitzness, bounded variance, and bounded gradients in the analysis of MAML cost functions.
• We prove the convergence of PersA-FL-ME with no boundedness assumption on the gradient norm. We discuss how the convergence rate depends on the gradient estimation error and the level of personalization. Compared to [13], we show an explicit dependence of the convergence rate on the estimation error. Moreover, we relax the heterogeneity assumption in [13], allowing bounded population diversity instead of uniformly bounded heterogeneity. We determine the communication and local inexact-solver complexity to find an ε-first-order stationary point for this method.
• We present numerical experiments evaluating our proposed algorithm on heterogeneous MNIST and CIFAR10 with unbalanced distributions across the clients. We illustrate the advantages of our method in terms of performance and scalability to varying delays in setups with heterogeneity.
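As a companion to the first item above, the following self-contained simulation (plain NumPy, hypothetical quadratic local losses, and illustrative parameter choices such as the staleness-damped mixing weight eta) sketches the generic pattern of asynchronous aggregation with bounded staleness using Option A-style local updates. It is only a schematic of the server/client interaction, not the exact PersA-FL update rule, which is specified in Section 3.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(2)
d, n = 5, 4                      # model dimension, number of clients
w_server = np.zeros(d)           # global model held by the server
server_step = 0                  # number of updates applied at the server so far

# Hypothetical quadratic local losses: f_i(w) = 0.5 * ||w - target_i||^2.
targets = [rng.normal(size=d) + i for i in range(n)]

def local_grad(i, w):
    return w - targets[i]

def client_update(i, w_snapshot, local_steps=5, lr=0.1):
    # Option A-style local work: a few gradient steps from a snapshot of the global model.
    w = w_snapshot.copy()
    for _ in range(local_steps):
        w -= lr * local_grad(i, w)
    return w - w_snapshot        # model delta returned to the server

# Each pending entry is (client id, delta, server step at which its snapshot was taken).
in_flight = deque()
for t in range(100):
    i = int(rng.integers(n))
    in_flight.append((i, client_update(i, w_server), server_step))
    # Deliver the oldest pending update once the queue is full, so staleness stays bounded.
    if len(in_flight) > 3:
        _, delta, born = in_flight.popleft()
        tau = server_step - born            # staleness of the delivered update
        eta = 0.5 / (1.0 + tau)             # staleness-damped mixing weight (one common choice)
        w_server += eta * delta
        server_step += 1

print("final global model:", w_server)
print("mean client target:", np.round(np.mean(targets, axis=0), 3))
```

The key point of the schematic is that updates are applied as they arrive, computed from possibly outdated snapshots of the global model, with the staleness τ of each update kept bounded.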
Table 1 summarizes the properties of our proposed method and compares our algorithm and its underlying analysis with related seminal works. As shown in this table, building upon the results in [13, 17], we extend the capability of FL to handle staleness. Table 1 also contains the convergence results for our proposed algorithm, which we discuss in more detail in Section 4.
The main difference between our method and the works in [51, 70] lies in the client algorithm, where we consider three options (A, B, and C) for updating the parameters locally. Option A is similar to the client algorithm in [51, 70], but we improve the theoretical convergence results by removing the assumption of bounded gradients for this setup. Option B and Option C, along with the server algorithm, are novel methods for personalized asynchronous federated learning. Nguyen et al. [51] characterize the server algorithm with a secure and robust update aggregation, and [63] enhances its theoretical properties. The study of secure aggregation on the server side remains a future direction for this work.
The remainder of this paper is organized as follows. In Section 2, we introduce the
PFL setup and discuss the asynchronous communication framework between the server
and clients. In Section 3, we describe our algorithm, PersA-FL, for PFL under staleness.
In Section 4, we state the convergence results for our proposed algorithm, along with the underlying assumptions and technical lemmas. We present numerical experiments in Section 5 and close with concluding remarks in Section 6.
2. Problem Setup & Background
In this section, we first present the formal problem setup for FL [47], as well as the
personalization formulations in MAML [17] and ME [13]. Then, we discuss the underlying
communication setting under asynchronous updates.
2.1. Federated Learning Problem Setup
We consider a set of $n$ clients and one server, where each client $i \in [n]$ holds a private function $f_i : \mathbb{R}^d \to \mathbb{R}$, and the goal is to collaboratively obtain a model $w \in \mathbb{R}^d$ that minimizes the local cost functions on average.
Table 1. A comparison of related federated learning methods with convergence guarantees for smooth non-convex functions. Parameters τ, α, ν, and b respectively denote the maximum delay, the MAML personalization stepsize, the ME inexact gradient estimation error, and the batch size.

| Algorithm | Reference | Personalized Cost | Asynchronous Updates | Unbounded Gradient | Convergence Rate |
| FedAvg | McMahan et al. [47] | ✗ | ✗ | – | No Analysis |
| FedAvg | Yu et al. [71] | ✗ | ✗ | ✗ | O(1/√T) |
| FedAvg | Wang et al. [67] | ✗ | ✗ | ✓ | O(1/√T) |
| FedAsync | Xie et al. [70] | ✗ | ✓ | ✗ | O(1/√T) + O(τ²/T) |
| FedAsync | This Work | ✗ | ✓ | ✓ | O(1/√T) + O(τ²/T) |
| FedBuff | Nguyen et al. [51] | ✗ | ✓ | ✗ | O(1/√T) + O(τ²/T) |
| FedBuff | Toghani and Uribe [63] | ✗ | ✓ | ✓ | O(1/√T) + O(τ²/T) |
| Per-FedAvg | Fallah et al. [17] | ✓ | ✗ | ✗ | O(1/√T) + O(α²/b) |
| pFedMe | Dinh et al. [13] | ✓ | ✗ | ✗ | O(1/√T) + O(λ²(1/b + ν²)/(λL)²) |
| PersA-FL-MAML | This Work | ✓ | ✓ | ✗ | O(1/√T) + O(τ²/T) + O(α²/b) |
| PersA-FL-ME | This Work | ✓ | ✓ | ✓ | O(1/√T) + O(τ²/T) + O(λ²ν²/(λL)²) |
Formally, we seek to solve

$$
\min_{w \in \mathbb{R}^d} \; f(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w),
\quad \text{with} \quad f_i(w) := \mathbb{E}_{\Xi_i \sim p_i}\big[\ell_i(w, \Xi_i)\big],
\tag{1}
$$

where $\ell_i : \mathbb{R}^d \times \mathcal{S}_i \to \mathbb{R}$ is a cost function that determines the prediction error of some model $w \in \mathbb{R}^d$ over a single data point $\xi_i \in \mathcal{S}_i$ on client $i$, where $\xi_i$ is a realization of $\Xi_i \sim p_i$, i.e., $p_i$ is client $i$'s data distribution over $\mathcal{S}_i$, for $i \in [n]$. In the above definition, $f_i(\cdot)$ is the local cost function of client $i$, and $f(\cdot)$ denotes the global cost function, i.e., the average loss. For instance, in a supervised learning setup with $\mathcal{Z}_i := \mathcal{X}_i \times \mathcal{Y}_i$, we have $\ell_i(w, \xi_i)$ as the prediction cost of some learning model parameterized by $w$ for the sample $\xi_i = (x, y)$, where $x \in \mathcal{X}_i$ and $y \in \mathcal{Y}_i$. Let $D_i$ be a data batch with samples independently drawn from the distribution $p_i$. Then, the unbiased stochastic gradient of $f_i$ computed over the batch $D_i$ is $\tilde{\nabla} f_i(w, D_i) := \frac{1}{|D_i|} \sum_{\xi \in D_i} \nabla \ell_i(w, \xi)$.
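As a quick illustration of problem (1) and the mini-batch estimator just described, the sketch below (plain NumPy, with hypothetical synthetic regression data and illustrative names such as clients, local_loss, and stochastic_grad) evaluates the global objective as the average of the local losses and forms an unbiased stochastic gradient for one client.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 5                      # number of clients, model dimension

# Hypothetical heterogeneous client data: each client i holds samples (x, y) with its own w_i*.
clients = []
for i in range(n):
    X = rng.normal(size=(50, d))
    w_star = rng.normal(size=d) + i          # per-client shift -> heterogeneous distributions p_i
    y = X @ w_star + 0.1 * rng.normal(size=50)
    clients.append((X, y))

def local_loss(w, X, y):
    # f_i(w): empirical squared-error loss averaged over client i's data.
    return 0.5 * np.mean((X @ w - y) ** 2)

def global_loss(w):
    # f(w) = (1/n) * sum_i f_i(w), the objective in (1).
    return np.mean([local_loss(w, X, y) for X, y in clients])

def stochastic_grad(w, X, y, batch_size=8):
    # Unbiased mini-batch gradient of f_i at w over a batch D_i drawn independently from client data.
    idx = rng.choice(len(y), size=batch_size, replace=True)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(d)
print("global loss f(w):", global_loss(w))
print("client-0 stochastic gradient norm:", np.linalg.norm(stochastic_grad(w, *clients[0])))
```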