Decentralized Hyper-Gradient Computation
over Time-Varying Directed Networks
Naoyuki Terashita1,2,*   Satoshi Hara2
1Hitachi, Ltd.
2Osaka University
Abstract
This paper addresses the communication issues that arise when estimating hyper-gradients in decentralized federated learning (FL). Hyper-gradients in decentralized FL quantify how the performance of the globally shared optimal model is influenced by perturbations in clients' hyper-parameters. In prior work, clients trace this influence through the communication of Hessian matrices over a static undirected network, resulting in (i) excessive communication costs and (ii) the inability to make use of more efficient and robust networks, namely, time-varying directed networks. To solve these issues, we introduce an alternative optimality condition for FL using an averaging operation on model parameters and gradients. We then employ Push-Sum as the averaging operation, a consensus optimization technique for time-varying directed networks. As a result, the hyper-gradient estimator derived from our optimality condition enjoys two desirable properties: (i) it only requires Push-Sum communication of vectors and (ii) it can operate over time-varying directed networks. We confirm the convergence of our estimator to the true hyper-gradient both theoretically and empirically, and we further demonstrate that it enables two novel applications: decentralized influence estimation and personalization over time-varying networks. Code is available at https://github.com/hitachi-rd-cv/pdbo-hgp.git.
1 Introduction
1.1 Background
Hyper-gradient has gained attention for addressing various challenges in federated learning (FL) [23], such as preserving fairness among clients in the face of data heterogeneity [19, 15], tuning hyper-parameters with client cooperation [33, 9, 5], and improving the interpretability of FL training [32].
This paper primarily focuses on hyper-gradient computation in decentralized (or peer-to-peer) FL [12], under practical considerations for communication. Decentralized FL is known to offer stronger privacy protection [6], faster model training [17, 21], and robustness against slow clients [26]. However, these properties of decentralized FL also bring unique challenges to hyper-gradient estimation. This is because clients must communicate in a peer-to-peer manner to measure how perturbations on the hyper-parameters of individual clients alter the overall performance of the shared optimal model, requiring a careful arrangement of what and how clients should communicate.
Specifically, there are two essential challenges: (i) communication cost and (ii) the configuration of the communication network. We provide a brief overview of these challenges below and in Table 1.
* naoyuki.terashita.sk@hitachi.com
Preprint. Under review.
arXiv:2210.02129v3 [stat.ML] 13 Jun 2023
Table 1: Concise comparison of hyper-gradient computation methods for FL.

Method                 | Decentralized? | Communication Cost | Communication Network
Tarzanagh et al. [30]  | No             | Small              | Centralized
Chen et al. [5]        | Yes            | Large              | Static Undirected
Yang et al. [33]       | Yes            | Large              | Static Undirected
HGP (Ours)             | Yes            | Small              | Time-Varying Directed
Communication cost  In centralized FL, the central server can gather all necessary client information for hyper-gradient computation, enabling a communication-efficient algorithm as demonstrated by Tarzanagh et al. [30]. However, designing such an efficient algorithm for decentralized FL is more challenging, as clients need to perform the necessary communication and computation without central orchestration. This challenge results in less efficient algorithms [33, 5], as shown in Table 1; the current decentralized hyper-gradient computations require large communication costs for exchanging Hessian matrices.
Configuration of communication network  There are several types of communication network configurations for decentralized FL. One of the most general and efficient configurations is the time-varying directed communication network, which allows any message passing to be unidirectional. This configuration is known to be resilient to failing clients and deadlocks [31] with minimal communication overhead [2]. However, hyper-gradient computation on such a dynamic network remains unsolved, and previous approaches operate over less efficient configurations, as shown in Table 1.
1.2 Our Contributions
In this paper, we demonstrate that both problems can be solved simply by introducing an appropriate optimality condition of FL, which is then used to derive the hyper-gradient estimator. We found that the optimality condition of decentralized FL can be expressed through an averaging operation on model parameters and gradients. We then select Push-Sum [3] as the averaging operation, a consensus optimization technique that runs over time-varying directed networks. Based on these findings and the specific choice of Push-Sum as the averaging operation, we propose our decentralized algorithm for hyper-gradient computation, named Hyper-Gradient Push (HGP), and provide its theoretical error bound. Notably, the proposed HGP resolves the aforementioned two problems: (i) it communicates only vectors using Push-Sum, avoiding the exchange of Hessian matrices, and (ii) it can operate on time-varying directed networks, which are more efficient and robust than static undirected networks.
Our numerical experiments confirmed the convergence of HGP towards the true hyper-gradient. We
verified the efficacy of our HGP on two tasks: influential training instance estimation and model
personalization. The experimental results demonstrated that our HGP enabled us, for the first time,
to solve these problems over time-varying communication networks. For personalization, we also
verified the superior performance of our HGP against existing methods on centralized and static
undirected communication networks.
Our contributions are summarized as follows:
• We introduce a new formulation of the hyper-gradient for decentralized FL using an averaging operation, which can be performed by Push-Sum iterations. This enables us to design the proposed HGP, which only requires the communication of model-parameter-sized vectors over time-varying directed networks. We also provide a theoretical error bound for our estimator.
• We empirically confirmed the convergence of our estimate to the true hyper-gradient. We also demonstrated two applications that are newly enabled by our algorithm: influence estimation and personalization over time-varying communication networks.
Notation  $A_{ij}$ denotes the $i$-th row and $j$-th column block of the matrix $A$, and $a_i$ denotes the $i$-th block vector of the vector $a$. For a vector function $h:\mathbb{R}^m \to \mathbb{R}^n$, we denote its total and partial derivatives by $d_x h(x) \in \mathbb{R}^{m\times n}$ and $\partial_x h(x) \in \mathbb{R}^{m\times n}$, respectively. For a real-valued function $h:\mathbb{R}^m \times \mathbb{R}^s \to \mathbb{R}$, we denote the Jacobians of $\partial_x h(x,y)$ with respect to $x$ and $y$ by $\partial^2_x h(x,y) \in \mathbb{R}^{m\times m}$ and $\partial^2_{xy} h(x,y) \in \mathbb{R}^{s\times m}$, respectively. We also introduce the concatenating notation $[z_i]_{i=1}^n = [z_1^\top \cdots z_n^\top]^\top \in \mathbb{R}^{nd}$ for vectors $z_i \in \mathbb{R}^d$. We denote the largest and smallest singular values of a matrix $A$ by $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$, respectively.
2 Preliminaries
This section provides the background of our study. Section 2.1 introduces the model of the time-varying network and the Push-Sum algorithm. Section 2.2 provides the formulation of decentralized FL and its optimality condition, and Section 2.3 presents a typical approach for hyper-gradient estimation in the single-client setting.
2.1 Time-Varying Directed Networks and Push-Sum
Time-varying directed communication networks, in which any message passing can be unidirectional, have proven to be resilient to failing clients and deadlocks [31], and they enjoy minimal communication overhead [2]. We denote the time-varying directed graph at a time step index $s > 0$ by $\mathcal{G}(s)$, with vertices $\{1, \ldots, n\}$ and edges defined by $\mathcal{E}(s)$.
We suppose that, at step $s$, any $i$-th client sends messages to its out-neighborhood $\mathcal{N}^{\mathrm{out}}_i(s) \subseteq \{1, \ldots, n\}$ and receives messages from its in-neighborhood $\mathcal{N}^{\mathrm{in}}_i(s)$. In addition, by standard practice, every $i$-th client is always regarded as its own in-neighbor and out-neighbor, i.e., $i \in \mathcal{N}^{\mathrm{out}}_i(s)$ and $i \in \mathcal{N}^{\mathrm{in}}_i(s)$ for all $i, s$. We also introduce an assumption on the connectivity of $\mathcal{G}(s)$ following Nedić and Olshevsky [25]. Roughly speaking, Assumption 1 requires the time-varying network $\mathcal{G}(s)$ to be repeatedly connected over some sufficiently long time scale $B > 0$.

Assumption 1. The graph with edge set $\bigcup_{s=tB}^{(t+1)B-1} \mathcal{E}(s)$ is strongly connected for every $t \geq 0$.
Algorithm 1: Push-Sum
Input: $y_i^{(0)}$
$z_i^{(0)} \leftarrow y_i^{(0)}$, $\omega_i^{(0)} \leftarrow 1$
for $s = 1$ to $S$ do
    $z_i^{(s)} \leftarrow \sum_{j \in \mathcal{N}^{\mathrm{in}}_i(s)} z_j^{(s-1)} / |\mathcal{N}^{\mathrm{out}}_j(s)|$
    $\omega_i^{(s)} \leftarrow \sum_{j \in \mathcal{N}^{\mathrm{in}}_i(s)} \omega_j^{(s-1)} / |\mathcal{N}^{\mathrm{out}}_j(s)|$
    $y_i^{(s)} \leftarrow z_i^{(s)} / \omega_i^{(s)}$
end for
Output: $y_i^{(S)}$
Push-Sum [3] (Alg. 1) is an algorithm for computing an average of the values possessed by the clients through communications over time-varying directed networks $\mathcal{G}(s)$ satisfying Assumption 1. When each $i$-th client runs Alg. 1 from its initial value vector $y_i^{(0)} \in \mathbb{R}^d$, it eventually obtains the average of the initial values (or consensus) over the clients [24], i.e., $\lim_{S\to\infty} y_i^{(S)} = \frac{1}{n}\sum_k y_k^{(0)}$. From this property, we can regard Push-Sum as a linear operator $\Theta$. Namely, denoting the concatenated vectors by $y^{(0)} = [y_i^{(0)}]_{i=1}^n$ and $\bar{y} = [\frac{1}{n}\sum_k y_k^{(0)}]_{i=1}^n$, we have

$\Theta y^{(0)} = \bar{y}, \qquad \Theta_{ij} = \tfrac{1}{n} I_d, \quad \forall i, j = 1, \ldots, n,$    (1)

where $I_d$ denotes the identity matrix of size $d \times d$. Finally, we remark on a useful consequence of this section, which plays an important role in our decentralized hyper-gradient estimation:

Remark 1. When Assumption 1 is satisfied and every $j$-th client knows $y_j^{(0)}$, any $i$-th client can obtain $[\Theta y^{(0)}]_i$ by communications over time-varying directed networks.
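To make Alg. 1 concrete, the following is a minimal NumPy sketch that simulates the $S$-step Push-Sum for $n$ clients (illustrative only; the function names and the graph interface are our own assumptions, not the paper's implementation). Each row of y0 is one client's initial vector, neighbors_in(s) returns the in-neighbor lists at step s (including the self-loop), and out_degree(s) returns each client's out-degree.

import numpy as np

def push_sum(y0, neighbors_in, out_degree, S):
    # Simulated Alg. 1: y0 has shape (n, d); after S steps each row of the
    # output approximates the average (1/n) * sum_k y0[k].
    n, _ = y0.shape
    z = y0.astype(float)                 # z_i^{(0)} <- y_i^{(0)}
    w = np.ones(n)                       # omega_i^{(0)} <- 1
    for s in range(1, S + 1):
        nin, deg = neighbors_in(s), out_degree(s)
        z_new, w_new = np.zeros_like(z), np.zeros_like(w)
        for i in range(n):
            for j in nin[i]:             # messages pushed by in-neighbors (incl. self)
                z_new[i] += z[j] / deg[j]
                w_new[i] += w[j] / deg[j]
        z, w = z_new, w_new
    return z / w[:, None]                # y_i^{(S)} = z_i^{(S)} / omega_i^{(S)}

For example, on a fixed directed ring with self-loops (so deg[j] = 2 for every j), each row of the output converges to the mean of the rows of y0 as S grows, matching the limit stated above.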
2.2 Decentralized Federated Learning
Federated learning (FL) [23] with $n$ clients is formulated as

$\min_{x_1,\ldots,x_n} \sum_{k=1}^n \mathbb{E}[g_k(x_k, \lambda_k; \xi_k)], \quad \text{s.t. } x_i = x_j, \ \forall i, j,$    (2)

where $g_i : \mathbb{R}^{d_x} \times \mathbb{R}^{d_\lambda} \to \mathbb{R}$ is the cost function of the $i$-th client, and $\xi_i$ denotes a random variable that represents the instances accessible only by the $i$-th client. Note that the distribution of $\xi_i$ may differ between clients. In decentralized FL, the objective of each $i$-th client is to find $x_i$ that minimizes the total cost while maintaining the consensus constraint, i.e., $x_i = x_j, \ \forall i, j$. Stochastic gradient push [25, 2] enables us to solve (2) over time-varying directed networks, as sketched below.
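As a rough illustration of how such a decentralized loop interleaves local gradient steps with Push-Sum mixing, here is a schematic NumPy sketch using the same graph interface as the Push-Sum sketch above. It is our own simplified rendering, not the exact stochastic gradient push algorithm of [25, 2], which involves additional details such as step-size schedules.

import numpy as np

def stochastic_gradient_push(x0, stoch_grad, neighbors_in, out_degree, eta, T):
    # x0: (n, d) initial models; stoch_grad(x, i) returns a stochastic gradient
    # of E[g_i(x, lambda_i; xi_i)] evaluated by client i at its de-biased model x.
    n, _ = x0.shape
    z, w = x0.astype(float), np.ones(n)
    for t in range(1, T + 1):
        x = z / w[:, None]                               # de-biased local models
        z = z - eta * np.stack([stoch_grad(x[i], i) for i in range(n)])
        nin, deg = neighbors_in(t), out_degree(t)        # one mixing step on G(t)
        z_new, w_new = np.zeros_like(z), np.zeros_like(w)
        for i in range(n):
            for j in nin[i]:
                z_new[i] += z[j] / deg[j]
                w_new[i] += w[j] / deg[j]
        z, w = z_new, w_new
    return z / w[:, None]                                # approximate solution of (2)

Under Assumption 1 and a suitably chosen (typically decaying) step size, the rows of the output approach a common minimizer of the total cost in (2), which is the behavior relied upon for the inner problem.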
2.3 Hyper-Gradient Computation
The hyper-gradient is an effective tool for solving bilevel problems, i.e., nested problems consisting of an inner- and an outer-problem [7, 20, 27]. The hyper-gradient can also be used for influential training instance estimation, which studies how the removal of a training instance influences the performance of the optimal model [14]. Below, we explain the definition and computation of the hyper-gradient in the context of the bilevel problem.
Using differentiable functions $f$ and $g$, the bilevel problem is formulated as

$\min_{\lambda \in \mathbb{R}^b} \underbrace{f(x(\lambda), \lambda)}_{\text{outer-problem}}, \quad \text{s.t. } x(\lambda) = \underbrace{\arg\min_{x \in \mathbb{R}^a} g(x, \lambda)}_{\text{inner-problem}}.$    (3)

Suppose that the optimal solution of the inner-problem $x(\lambda) \in \mathbb{R}^a$ is expressed as the stationary point of a differentiable function $\phi : \mathbb{R}^a \times \mathbb{R}^b \to \mathbb{R}^a$:

$x(\lambda) = \phi(x(\lambda), \lambda).$    (4)

For example, if $g$ is smooth and strongly convex with respect to $x$, we can use $\phi(x, \lambda) = x - \eta \partial_x g(x, \lambda)$ with $\eta > 0$ to express the optimality condition $\partial_x g(x, \lambda) = 0$ using (4).
For the bilevel problem (3), we refer to $d_\lambda f(x(\lambda), \lambda)$ as the hyper-gradient.² One of the most common approaches for computing $d_\lambda f$ is the fixed-point method [27, 18]. When $\partial_x \phi$ is positive semi-definite with eigenvalues smaller than one, differentiating (4) and applying the Neumann approximation of the inverse yields $d_\lambda x(\lambda) = \partial_\lambda \phi (I - \partial_x \phi)^{-1} = \partial_\lambda \phi \sum_{m=0}^{\infty} (\partial_x \phi)^m$, leading to

$d_\lambda f = \partial_\lambda \phi \sum_{m=0}^{\infty} (\partial_x \phi)^m \partial_x f + \partial_\lambda f.$    (5)

The fixed-point method also provides an efficient algorithm to compute (5):

(initialization) $\quad v^{(0)} = \partial_\lambda f, \quad u^{(0)} = \partial_x f,$    (6a)
(iteration for $m = 1, \ldots, M$) $\quad v^{(m)} = \partial_\lambda \phi\, u^{(m-1)} + v^{(m-1)}, \quad u^{(m)} = \partial_x \phi\, u^{(m-1)},$    (6b)

which results in $v^{(M)} = \partial_\lambda \phi \sum_{m=0}^{M-1} (\partial_x \phi)^m \partial_x f + \partial_\lambda f \to d_\lambda f$. Here, no explicit computation of Jacobians is required in (6b); $\partial_\lambda \phi\, u^{(m-1)}$ and $\partial_x \phi\, u^{(m-1)}$ can be computed using the Jacobian-vector-product technique.

² In the remainder of the paper, we omit the arguments $(x(\lambda), \lambda)$ when they are clear from the context, e.g., $\phi = \phi(x(\lambda), \lambda)$.
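To illustrate the recursion (6a)-(6b), here is a minimal NumPy sketch on a toy quadratic bilevel problem where $\partial_x \phi$ and $\partial_\lambda \phi$ are available as explicit matrices (the problem, variable names, and step size are our own illustrative choices). In practice, the two products in (6b) would be obtained with automatic differentiation rather than by forming Jacobians.

import numpy as np

# Toy problem: g(x, lam) = 0.5 x^T A x - lam^T x  (A diagonal, positive definite),
# f(x, lam) = 0.5 ||x - c||^2.  Then x(lam) = A^{-1} lam and the closed-form
# hyper-gradient is d_lam f = A^{-1} (x(lam) - c).
rng = np.random.default_rng(0)
a = 5
A = np.diag(rng.uniform(1.0, 2.0, a))
c = rng.normal(size=a)
lam = rng.normal(size=a)
eta = 0.4

x = np.linalg.solve(A, lam)            # inner optimum x(lam)
dphi_dx = np.eye(a) - eta * A          # partial_x phi for phi(x, lam) = x - eta * grad_x g
dphi_dlam = eta * np.eye(a)            # partial_lam phi (square here since a = b)
u = x - c                              # u^{(0)} = partial_x f
v = np.zeros(a)                        # v^{(0)} = partial_lam f (f does not depend on lam)

for m in range(1, 201):                # iteration (6b)
    v = dphi_dlam @ u + v
    u = dphi_dx @ u

print(np.allclose(v, np.linalg.solve(A, x - c)))   # True: v^{(M)} matches d_lam f

The same loop applies to larger models by replacing the two matrix products with the corresponding Jacobian-vector products from an autodiff library.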
3 Estimating the Hyper-Gradient over Time-Varying Directed Networks

In this section, we first explain the main technical challenge of hyper-gradient computation in decentralized FL, namely, the large communication costs caused by the consensus constraint in the optimality condition of FL (2). We then introduce our alternative optimality condition based on the convergence of Push-Sum. Using this optimality condition, we finally propose HGP, a decentralized hyper-gradient estimation algorithm that runs with reasonable communication cost over time-varying networks.
3.1 Main Challenge
We consider the stationary point of decentralized FL (2) and the hyper-gradient derived from this stationary point. Let $\lambda = [\lambda_i]_{i=1}^n \in \mathbb{R}^{nd_\lambda}$ and $x = [x_i]_{i=1}^n \in \mathbb{R}^{nd_x}$ be the concatenated hyper-parameters and inner-parameters, respectively. We also denote the expectation of the total inner-cost by $g(x, \lambda) = \sum_{k=1}^n \mathbb{E}[g_k(x_k, \lambda_k; \xi_k)]$, together with the following assumption.

Assumption 2. For every $i = 1, \ldots, n$, $g_i$ is strongly convex with respect to its first argument.

We can then reformulate the optimality condition of (2) as the stationary point (4) with

$\phi(x, \lambda) = x - \eta \partial_x g(x, \lambda), \quad \text{s.t. } x_i = x_j \in \mathbb{R}^{d_x}, \ \forall i, j,$    (7)

where $\eta \in \mathbb{R}_+$. Here, the latter constraint corresponds to the consensus constraint in (2), and Assumption 2 ensures the existence of $(I - \partial_x \phi)^{-1}$.
Let $f(x, \lambda) = \sum_k f_k(x_k, \lambda_k)$ be the outer-cost in bilevel decentralized FL. Here, each $i$-th client is interested in the hyper-gradient of $f(x(\lambda), \lambda)$ with respect to its hyper-parameter $\lambda_i$. The technical challenge is to compute $d_\lambda f$ in a decentralized manner, especially computing (6b). Owing to the consensus constraint in (2), for any $m \geq 0$, every block vector of $u^{(m)}$ requires the evaluation of $f_k$ and $g_k$ for all $k = 1, \ldots, n$, because of the following blocks in (6b):

$\partial_x f_i = \eta \sum_k \partial_{x_k} f_k(x_k(\lambda), \lambda_k), \quad \partial_x \phi_{ij} = I - \eta \sum_k \mathbb{E}[\partial^2_{x_k} g_k(x_k(\lambda), \lambda_k; \xi_k)], \quad \forall i, j.$    (8)

A naive computation of (8) requires gathering these derivatives from all clients through communications. The communication of the Hessians $\mathbb{E}[\partial^2_{x_k} g_k(x_k(\lambda), \lambda_k; \xi_k)]$ is particularly problematic for large models such as deep neural networks.
In the next section, we show that there is an alternative yet equivalent stationary condition that does not explicitly require consensus between clients. Based on this alternative condition, we introduce the proposed HGP, a fixed-point iteration that does not require exchanging Hessian matrices and that can run even on time-varying directed networks.
3.2 Alternative yet Equivalent Stationary Condition
We first present the alternative stationary condition as follows:

Lemma 1. The stationary condition $x(\lambda) = \phi(x(\lambda), \lambda)$ with the function

$\phi(x, \lambda) = \Theta\left(x - \eta \partial_x g(x, \lambda)\right),$    (9)

holds true if and only if $x(\lambda)$ is the solution of (2).

Lemma 1 states that each $x_i$ is the optimal solution of FL only when it is identical to the clients' average and the average of the gradients is zero. While both (9) and (7) characterize the optimality condition of (2) through their stationary conditions, our (9) has the following desirable property:

Remark 2. Lemma 1 requires $x_i = x_j$ only implicitly. Thus, any block of the partial derivatives with respect to $x$ can be calculated by a single client.
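The intuition behind Lemma 1 can be seen from a short informal argument (our own sketch; the formal proof may proceed differently). Writing out the $i$-th block of the stationary condition with the averaging operator (1) gives

$x_i = \big[\Theta\big(x - \eta\,\partial_x g(x,\lambda)\big)\big]_i = \frac{1}{n}\sum_{k=1}^{n}\Big(x_k - \eta\,\partial_{x_k}\mathbb{E}[g_k(x_k,\lambda_k;\xi_k)]\Big), \qquad i = 1,\ldots,n.$

Since the right-hand side does not depend on $i$, all blocks must coincide, $x_i = x_j =: \bar{x}$, which recovers the consensus constraint of (2). Substituting $x_k = \bar{x}$ back into the display then yields $\sum_k \partial_{x}\mathbb{E}[g_k(\bar{x},\lambda_k;\xi_k)] = 0$, i.e., $\bar{x}$ is a stationary point of the total cost, hence its unique minimizer under Assumption 2; the converse direction follows by reversing the steps.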
3.3 Stochastic and Decentralized Approximation of Hyper-Gradient
Finally, we present our decentralized algorithm, named Hyper-Gradient Push (HGP).
Since Assumption 2 ensures that $(I - \partial_x \phi)^{-1}$ exists, we can derive the hyper-gradient similarly to (5):

$d_\lambda f = -\eta\, \partial^2_{\lambda x} g\, \Theta \sum_{m=0}^{\infty} \big( (I - \eta\, \partial^2_x g)\, \Theta \big)^m \partial_x f + \partial_\lambda f,$

where we used $\Theta^\top = \Theta$ from (1). Similarly to (6b), we obtain the fixed-point iteration of the form

$v^{(m)} = -\eta\, \partial^2_{\lambda x} g\, \Theta u^{(m-1)} + v^{(m-1)}, \qquad u^{(m)} = (I - \eta\, \partial^2_x g)\, \Theta u^{(m-1)}.$    (10)

Our HGP is obtained simply by letting the $i$-th client compute the $i$-th blocks of $v^{(m)}$ and $u^{(m)}$, denoted by $v_i^{(m)}$ and $u_i^{(m)}$, respectively. We also replace $\Theta$ with the $S$-step Push-Sum (Alg. 1), which we denote by $\hat{\Theta}$.
HGP, described in Alg. 2, proceeds as follows. For $m = 0$, any $i$-th client can locally compute $u_i^{(0)} = \partial_{x_i} f_i(x_i(\lambda), \lambda_i)$ and $v_i^{(0)} = \partial_{\lambda_i} f_i(x_i(\lambda), \lambda_i)$ (Remark 2). Suppose the $i$-th client knows $u_i^{(m-1)}$, which is trivially true when $m = 1$. Then, the $i$-th client can compute the average $\bar{u}_i^{(m-1)} = [\hat{\Theta} u^{(m-1)}]_i$ in (10) by running Push-Sum (Remark 1). Because $\partial^2_x g_{ii} = \mathbb{E}[\partial^2_{x_i} g_i(x_i(\lambda), \lambda_i; \xi_i)]$ (and likewise $\partial^2_{\lambda x} g_{ii}$) can be computed locally (Remark 2), the $i$-th client can compute $u_i^{(m)}$ once $\bar{u}_i^{(m-1)}$ is obtained. This can be performed for every $m \geq 0$, and similarly for $v_i^{(m)}$. Note that in Alg. 2, we replace the expectations for $g_i$ with their finite-sample estimates.
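To illustrate how the blocks of (10) decompose across clients, here is a minimal NumPy simulation of the HGP iteration on a toy problem with quadratic local costs. The problem, the variable names, and the use of exact averaging in place of the $S$-step Push-Sum $\hat{\Theta}$ are our own simplifying assumptions for illustration, not the paper's Alg. 2.

import numpy as np

# Toy setup: n clients, local inner-costs g_i(x, lam_i) = 0.5 x^T A_i x - lam_i^T x
# and local outer-costs f_i(x_i, lam_i) = 0.5 ||x_i - c_i||^2, so that the shared
# optimum is xbar = (sum_k A_k)^{-1} sum_k lam_k and the hyper-gradient of the
# total outer-cost w.r.t. lam_i is (sum_k A_k)^{-1} sum_k (xbar - c_k) for every i.
rng = np.random.default_rng(1)
n, d, eta = 4, 3, 0.3
A = [np.diag(rng.uniform(1.0, 2.0, d)) for _ in range(n)]
c = [rng.normal(size=d) for _ in range(n)]
lam = [rng.normal(size=d) for _ in range(n)]
xbar = np.linalg.solve(sum(A), sum(lam))          # FL optimum shared by all clients

u = [xbar - c[i] for i in range(n)]               # u_i^{(0)} = partial_{x_i} f_i
v = [np.zeros(d) for _ in range(n)]               # v_i^{(0)} = partial_{lam_i} f_i = 0

for m in range(300):                              # iteration (10)
    ubar = sum(u) / n                             # Theta u; HGP uses S-step Push-Sum here
    # local blocks, computed by client i alone: d^2_{lam x} g_i = -I, d^2_x g_i = A_i
    v = [-eta * (-np.eye(d)) @ ubar + v[i] for i in range(n)]
    u = [(np.eye(d) - eta * A[i]) @ ubar for i in range(n)]

truth = np.linalg.solve(sum(A), sum(xbar - c[i] for i in range(n)))
print(all(np.allclose(v[i], truth) for i in range(n)))   # True

In this toy problem the hyper-gradient happens to be identical across clients; in general, each $i$-th client recovers only its own block, i.e., the derivative of the total outer-cost with respect to $\lambda_i$.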