Decentralized Hyper-Gradient Computation
over Time-Varying Directed Networks
Naoyuki Terashita1,2,*   Satoshi Hara2
1Hitachi, Ltd.
2Osaka University
Abstract
This paper addresses the communication issues that arise when estimating hyper-gradients in decentralized federated learning (FL). Hyper-gradients in decentralized FL quantify how the performance of the globally shared optimal model is influenced by perturbations in clients' hyper-parameters. In prior work, clients trace this influence through the communication of Hessian matrices over a static undirected network, resulting in (i) excessive communication costs and (ii) the inability to make use of more efficient and robust networks, namely, time-varying directed networks. To solve these issues, we introduce an alternative optimality condition for FL using an averaging operation on model parameters and gradients. We then employ Push-Sum as the averaging operation, a consensus optimization technique for time-varying directed networks. As a result, the hyper-gradient estimator derived from our optimality condition enjoys two desirable properties: (i) it only requires Push-Sum communication of vectors and (ii) it can operate over time-varying directed networks. We confirm the convergence of our estimator to the true hyper-gradient both theoretically and empirically, and we further demonstrate that it enables two novel applications: decentralized influence estimation and personalization over time-varying networks. Code is available at https://github.com/hitachi-rd-cv/pdbo-hgp.git.
1 Introduction
1.1 Background
Hyper-gradient has gained attention for addressing various challenges in federated learning (FL) [23], such as preserving fairness among clients in the face of data heterogeneity [19, 15], tuning hyper-parameters with client cooperation [33, 9, 5], and improving the interpretability of FL training [32].
This paper primarily focuses on hyper-gradient computation in decentralized (or peer-to-peer) FL [12], under practical considerations for communication. Decentralized FL is known to offer stronger privacy protection [6], faster model training [17, 21], and robustness against slow clients [26]. However, these properties of decentralized FL also bring unique challenges to hyper-gradient estimation. This is because clients must communicate in a peer-to-peer manner to measure how perturbations on the hyper-parameters of individual clients alter the overall performance of the shared optimal model, requiring a careful arrangement of what and how clients should communicate.
Specifically, there are two essential challenges: (i) communication cost and (ii) the configuration of the communication network. We provide a brief overview of these challenges below and in Table 1.
* naoyuki.terashita.sk@hitachi.com
Preprint. Under review.
arXiv:2210.02129v3 [stat.ML] 13 Jun 2023
Table 1: Concise comparison of hyper-gradient computation methods for FL.

Method                 | Decentralized? | Communication Cost | Communication Network
Tarzanagh et al. [30]  | No             | Small              | Centralized
Chen et al. [5]        | Yes            | Large              | Static Undirected
Yang et al. [33]       | Yes            | Large              | Static Undirected
HGP (Ours)             | Yes            | Small              | Time-Varying Directed
Communication cost  In centralized FL, the central server can gather all necessary client information for hyper-gradient computation, enabling a communication-efficient algorithm as demonstrated by Tarzanagh et al. [30]. However, designing such an efficient algorithm for decentralized FL is more challenging, as clients need to perform the necessary communication and computation without central orchestration. This challenge results in less efficient algorithms [33, 5], as shown in Table 1; the current decentralized hyper-gradient computations require large communication costs for exchanging Hessian matrices.
Configuration of communication network  There are several types of communication network configurations for decentralized FL. One of the most general and efficient configurations is the time-varying directed communication network, which allows any message passing to be unidirectional. This configuration is known to be resilient to failing clients and deadlocks [31] with minimal communication overhead [2]. However, hyper-gradient computation on such a dynamic network remains unsolved, and previous approaches operate over less efficient configurations, as shown in Table 1.
1.2 Our Contributions
In this paper, we demonstrate that both problems can be solved simply by introducing an appropriate optimality condition of FL, which is then used to derive the hyper-gradient estimator. We found that the optimality condition of decentralized FL can be expressed through an averaging operation on model parameters and gradients. We then select Push-Sum [3] as the averaging operation, a consensus optimization technique that runs over time-varying directed networks. Based on these findings and the specific choice of Push-Sum as the averaging operation, we propose our decentralized algorithm for hyper-gradient computation, named Hyper-Gradient Push (HGP), and provide its theoretical error bound. Notably, the proposed HGP resolves the aforementioned two problems: (i) it communicates only vectors using Push-Sum, avoiding the exchange of Hessian matrices, and (ii) it can operate on time-varying directed networks, which are more efficient and robust than static undirected networks.
Our numerical experiments confirmed the convergence of HGP towards the true hyper-gradient. We
verified the efficacy of our HGP on two tasks: influential training instance estimation and model
personalization. The experimental results demonstrated that our HGP enabled us, for the first time,
to solve these problems over time-varying communication networks. For personalization, we also
verified the superior performance of our HGP against existing methods on centralized and static
undirected communication networks.
Our contributions are summarized as follows:
• We introduce a new formulation of the hyper-gradient for decentralized FL using an averaging operation, which can be performed by Push-Sum iterations. This enables us to design the proposed HGP, which only requires the communication of model-parameter-sized vectors over time-varying directed networks. We also provide a theoretical error bound for our estimator.
• We empirically confirmed the convergence of our estimate to the true hyper-gradient. We also demonstrated two applications that are newly enabled by our algorithm: influence estimation and personalization over time-varying communication networks.
Notation  $A_{ij}$ denotes the $i$-th row and $j$-th column block of the matrix $A$, and $a_i$ denotes the $i$-th block vector of the vector $a$. For a vector function $h:\mathbb{R}^m \to \mathbb{R}^n$, we denote its total and partial derivatives by $d_x h(x) \in \mathbb{R}^{m\times n}$ and $\partial_x h(x) \in \mathbb{R}^{m\times n}$, respectively. For a real-valued function $h:\mathbb{R}^m \times \mathbb{R}^s \to \mathbb{R}$, we denote the Jacobians of $\partial_x h(x,y)$ with respect to $x$ and $y$ by $\partial^2_x h(x,y) \in \mathbb{R}^{m\times m}$ and $\partial^2_{xy} h(x,y) \in \mathbb{R}^{s\times m}$, respectively. We also introduce the concatenating notation $[z_i]_{i=1}^n = [z_1^\top \cdots z_n^\top]^\top \in \mathbb{R}^{nd}$ for vectors $z_i \in \mathbb{R}^d$. We denote the largest and smallest singular values of a matrix $A$ by $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$, respectively.
2 Preliminaries
This section provides the background of our study. Section 2.1 introduces the model of the time-varying network and the Push-Sum algorithm. Section 2.2 provides the formulation of decentralized FL and its optimality condition, and Section 2.3 presents a typical approach for hyper-gradient estimation in the single-client setting.
2.1 Time-Varying Directed Networks and Push-Sum
Time-varying directed communication networks, in which any message passing can be unidirectional, have proven to be resilient to failing clients and deadlocks [31], and they enjoy minimal communication overhead [2]. We denote the time-varying directed graph at a time step index $s > 0$ by $\mathcal{G}(s)$, with vertices $\{1, \ldots, n\}$ and edges defined by $\mathcal{E}(s)$.
We suppose that, at step $s$, any $i$-th client sends messages to its out-neighborhood $\mathcal{N}^{\mathrm{out}}_i(s) \subseteq \{1, \ldots, n\}$ and receives messages from its in-neighborhood $\mathcal{N}^{\mathrm{in}}_i(s)$. In addition, by standard practice, every $i$-th client is always regarded as its own in-neighbor and out-neighbor, i.e., $i \in \mathcal{N}^{\mathrm{out}}_i(s)$ and $i \in \mathcal{N}^{\mathrm{in}}_i(s)$ for all $i, s$. We also introduce an assumption on the connectivity of $\mathcal{G}(s)$ following Nedić and Olshevsky [25]. Roughly speaking, Assumption 1 requires the time-varying network $\mathcal{G}(s)$ to be repeatedly connected over some sufficiently long time scale $B > 0$.

Assumption 1. The graph with edge set $\bigcup_{s=tB}^{(t+1)B-1} \mathcal{E}(s)$ is strongly connected for every $t \geq 0$.
Algorithm 1: Push-Sum
Input: $y_i^{(0)}$
$z_i^{(0)} \leftarrow y_i^{(0)}$, $\omega_i^{(0)} \leftarrow 1$
for $s = 1$ to $S$ do
    $z_i^{(s)} \leftarrow \sum_{j \in \mathcal{N}^{\mathrm{in}}_i(s)} z_j^{(s-1)} / |\mathcal{N}^{\mathrm{out}}_j(s)|$
    $\omega_i^{(s)} \leftarrow \sum_{j \in \mathcal{N}^{\mathrm{in}}_i(s)} \omega_j^{(s-1)} / |\mathcal{N}^{\mathrm{out}}_j(s)|$
    $y_i^{(s)} \leftarrow z_i^{(s)} / \omega_i^{(s)}$
end for
Output: $y_i^{(S)}$
Push-Sum [3] (Alg. 1) is an algorithm for computing an average of the values possessed by the clients through communications over time-varying directed networks $\mathcal{G}(s)$ satisfying Assumption 1. When each $i$-th client runs Alg. 1 from its initial value vector $y_i^{(0)} \in \mathbb{R}^d$, it eventually obtains the average of the initial values (or consensus) over the clients [24], i.e., $\lim_{S\to\infty} y_i^{(S)} = \frac{1}{n}\sum_k y_k^{(0)}$. From this property, we can regard Push-Sum as a linear operator $\Theta$. Namely, denoting the concatenated vectors by $y^{(0)} = [y_i^{(0)}]_{i=1}^n$ and $\bar{y} = [\frac{1}{n}\sum_k y_k^{(0)}]_{i=1}^n$, we have

$\Theta y^{(0)} = \bar{y}, \qquad \Theta_{ij} = \tfrac{1}{n} I_d, \quad \forall i, j = 1, \ldots, n,$    (1)

where $I_d$ denotes the identity matrix of size $d \times d$. Finally, we remark on a useful consequence of this section, which plays an important role in our decentralized hyper-gradient estimation:

Remark 1. When Assumption 1 is satisfied and every $j$-th client knows $y_j^{(0)}$, any $i$-th client can obtain $[\Theta y^{(0)}]_i$ by communications over time-varying directed networks.
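To make Alg. 1 concrete, the following is a minimal NumPy sketch that simulates the $S$-step Push-Sum for $n$ clients (illustrative only; the function names and the graph interface are our own assumptions, not the paper's implementation). Each row of y0 is one client's initial vector, neighbors_in(s) returns the in-neighbor lists at step s (including the self-loop), and out_degree(s) returns each client's out-degree.

import numpy as np

def push_sum(y0, neighbors_in, out_degree, S):
    # Simulated Alg. 1: y0 has shape (n, d); after S steps each row of the
    # output approximates the average (1/n) * sum_k y0[k].
    n, _ = y0.shape
    z = y0.astype(float)                 # z_i^{(0)} <- y_i^{(0)}
    w = np.ones(n)                       # omega_i^{(0)} <- 1
    for s in range(1, S + 1):
        nin, deg = neighbors_in(s), out_degree(s)
        z_new, w_new = np.zeros_like(z), np.zeros_like(w)
        for i in range(n):
            for j in nin[i]:             # messages pushed by in-neighbors (incl. self)
                z_new[i] += z[j] / deg[j]
                w_new[i] += w[j] / deg[j]
        z, w = z_new, w_new
    return z / w[:, None]                # y_i^{(S)} = z_i^{(S)} / omega_i^{(S)}

For example, on a fixed directed ring with self-loops (so deg[j] = 2 for every j), each row of the output converges to the mean of the rows of y0 as S grows, matching the limit stated above.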
2.2 Decentralized Federated Learning
Federated learning (FL) [23] with $n$ clients is formulated as

$\min_{x_1,\ldots,x_n} \sum_{k=1}^n \mathbb{E}[g_k(x_k, \lambda_k; \xi_k)], \quad \text{s.t. } x_i = x_j, \ \forall i, j,$    (2)

where $g_i : \mathbb{R}^{d_x} \times \mathbb{R}^{d_\lambda} \to \mathbb{R}$ is the cost function of the $i$-th client, and $\xi_i$ denotes a random variable that represents the instances accessible only by the $i$-th client. Note that the distribution of $\xi_i$ may differ between clients. In decentralized FL, the objective of each $i$-th client is to find $x_i$ that minimizes the total cost while maintaining the consensus constraint, i.e., $x_i = x_j, \ \forall i, j$. Stochastic gradient push [25, 2] enables us to solve (2) over time-varying directed networks, as sketched below.
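As a rough illustration of how such a decentralized loop interleaves local gradient steps with Push-Sum mixing, here is a schematic NumPy sketch using the same graph interface as the Push-Sum sketch above. It is our own simplified rendering, not the exact stochastic gradient push algorithm of [25, 2], which involves additional details such as step-size schedules.

import numpy as np

def stochastic_gradient_push(x0, stoch_grad, neighbors_in, out_degree, eta, T):
    # x0: (n, d) initial models; stoch_grad(x, i) returns a stochastic gradient
    # of E[g_i(x, lambda_i; xi_i)] evaluated by client i at its de-biased model x.
    n, _ = x0.shape
    z, w = x0.astype(float), np.ones(n)
    for t in range(1, T + 1):
        x = z / w[:, None]                               # de-biased local models
        z = z - eta * np.stack([stoch_grad(x[i], i) for i in range(n)])
        nin, deg = neighbors_in(t), out_degree(t)        # one mixing step on G(t)
        z_new, w_new = np.zeros_like(z), np.zeros_like(w)
        for i in range(n):
            for j in nin[i]:
                z_new[i] += z[j] / deg[j]
                w_new[i] += w[j] / deg[j]
        z, w = z_new, w_new
    return z / w[:, None]                                # approximate solution of (2)

Under Assumption 1 and a suitably chosen (typically decaying) step size, the rows of the output approach a common minimizer of the total cost in (2), which is the behavior relied upon for the inner problem.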
2.3 Hyper-Gradient Computation
The hyper-gradient is an effective tool for solving bilevel problems, i.e., nested problems consisting of an inner- and an outer-problem [7, 20, 27]. The hyper-gradient can also be used for influential training instance estimation, which studies how the removal of a training instance influences the performance of the optimal model [14]. Below, we explain the definition and computation of the hyper-gradient in the context of the bilevel problem.
Using differentiable functions $f$ and $g$, the bilevel problem is formulated as

$\min_{\lambda \in \mathbb{R}^b} \underbrace{f(x(\lambda), \lambda)}_{\text{outer-problem}}, \quad \text{s.t. } x(\lambda) = \underbrace{\arg\min_{x \in \mathbb{R}^a} g(x, \lambda)}_{\text{inner-problem}}.$    (3)

Suppose that the optimal solution of the inner-problem $x(\lambda) \in \mathbb{R}^a$ is expressed as the stationary point of a differentiable function $\phi : \mathbb{R}^a \times \mathbb{R}^b \to \mathbb{R}^a$:

$x(\lambda) = \phi(x(\lambda), \lambda).$    (4)

For example, if $g$ is smooth and strongly convex with respect to $x$, we can use $\phi(x, \lambda) = x - \eta \partial_x g(x, \lambda)$ with $\eta > 0$ to express the optimality condition $\partial_x g(x, \lambda) = 0$ using (4).
For the bilevel problem (3), we refer to $d_\lambda f(x(\lambda), \lambda)$ as the hyper-gradient.² One of the most common approaches for computing $d_\lambda f$ is the fixed-point method [27, 18]. When $\partial_x \phi$ is positive semi-definite with eigenvalues smaller than one, differentiating (4) and applying the Neumann approximation of the inverse yields $d_\lambda x(\lambda) = \partial_\lambda \phi (I - \partial_x \phi)^{-1} = \partial_\lambda \phi \sum_{m=0}^{\infty} (\partial_x \phi)^m$, leading to

$d_\lambda f = \partial_\lambda \phi \sum_{m=0}^{\infty} (\partial_x \phi)^m \partial_x f + \partial_\lambda f.$    (5)

The fixed-point method also provides an efficient algorithm to compute (5):

(initialization) $\quad v^{(0)} = \partial_\lambda f, \quad u^{(0)} = \partial_x f,$    (6a)
(iteration for $m = 1, \ldots, M$) $\quad v^{(m)} = \partial_\lambda \phi\, u^{(m-1)} + v^{(m-1)}, \quad u^{(m)} = \partial_x \phi\, u^{(m-1)},$    (6b)

which results in $v^{(M)} = \partial_\lambda \phi \sum_{m=0}^{M-1} (\partial_x \phi)^m \partial_x f + \partial_\lambda f \to d_\lambda f$. Here, no explicit computation of Jacobians is required in (6b); $\partial_\lambda \phi\, u^{(m-1)}$ and $\partial_x \phi\, u^{(m-1)}$ can be computed using the Jacobian-vector-product technique.

² In the remainder of the paper, we omit the arguments $(x(\lambda), \lambda)$ when they are clear from the context, e.g., $\phi = \phi(x(\lambda), \lambda)$.
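To illustrate the recursion (6a)-(6b), here is a minimal NumPy sketch on a toy quadratic bilevel problem where $\partial_x \phi$ and $\partial_\lambda \phi$ are available as explicit matrices (the problem, variable names, and step size are our own illustrative choices). In practice, the two products in (6b) would be obtained with automatic differentiation rather than by forming Jacobians.

import numpy as np

# Toy problem: g(x, lam) = 0.5 x^T A x - lam^T x  (A diagonal, positive definite),
# f(x, lam) = 0.5 ||x - c||^2.  Then x(lam) = A^{-1} lam and the closed-form
# hyper-gradient is d_lam f = A^{-1} (x(lam) - c).
rng = np.random.default_rng(0)
a = 5
A = np.diag(rng.uniform(1.0, 2.0, a))
c = rng.normal(size=a)
lam = rng.normal(size=a)
eta = 0.4

x = np.linalg.solve(A, lam)            # inner optimum x(lam)
dphi_dx = np.eye(a) - eta * A          # partial_x phi for phi(x, lam) = x - eta * grad_x g
dphi_dlam = eta * np.eye(a)            # partial_lam phi (square here since a = b)
u = x - c                              # u^{(0)} = partial_x f
v = np.zeros(a)                        # v^{(0)} = partial_lam f (f does not depend on lam)

for m in range(1, 201):                # iteration (6b)
    v = dphi_dlam @ u + v
    u = dphi_dx @ u

print(np.allclose(v, np.linalg.solve(A, x - c)))   # True: v^{(M)} matches d_lam f

The same loop applies to larger models by replacing the two matrix products with the corresponding Jacobian-vector products from an autodiff library.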
3 Estimating the Hyper-Gradient over Time-Varying Directed Networks

In this section, we first explain the main technical challenge of hyper-gradient computation in decentralized FL, namely, the large communication costs caused by the consensus constraint in the optimality condition of FL (2). We then introduce our alternative optimality condition based on the convergence of Push-Sum. Using this optimality condition, we finally propose HGP, a decentralized hyper-gradient estimation algorithm that runs with reasonable communication cost over time-varying networks.
3.1 Main Challenge
We consider the stationary point of decentralized FL (2) and the hyper-gradient derived from this stationary point. Let $\lambda = [\lambda_i]_{i=1}^n \in \mathbb{R}^{nd_\lambda}$ and $x = [x_i]_{i=1}^n \in \mathbb{R}^{nd_x}$ be the concatenated hyper-parameters and inner-parameters, respectively. We also denote the expectation of the total inner-cost by $g(x, \lambda) = \sum_{k=1}^n \mathbb{E}[g_k(x_k, \lambda_k; \xi_k)]$, together with the following assumption.

Assumption 2. For every $i = 1, \ldots, n$, $g_i$ is strongly convex with respect to its first argument.

We can then reformulate the optimality condition of (2) as the stationary point (4) with

$\phi(x, \lambda) = x - \eta \partial_x g(x, \lambda), \quad \text{s.t. } x_i = x_j \in \mathbb{R}^{d_x}, \ \forall i, j,$    (7)

where $\eta \in \mathbb{R}_+$. Here, the latter constraint corresponds to the consensus constraint in (2), and Assumption 2 ensures the existence of $(I - \partial_x \phi)^{-1}$.
Let $f(x, \lambda) = \sum_k f_k(x_k, \lambda_k)$ be the outer-cost in bilevel decentralized FL. Here, each $i$-th client is interested in the hyper-gradient of $f(x(\lambda), \lambda)$ with respect to its hyper-parameter $\lambda_i$. The technical challenge is to compute $d_\lambda f$ in a decentralized manner, especially computing (6b). Owing to the consensus constraint in (2), for any $m \geq 0$, every block vector of $u^{(m)}$ requires the evaluation of $f_k$ and $g_k$ for all $k = 1, \ldots, n$, because of the following blocks in (6b):

$\partial_x f_i = \eta \sum_k \partial_{x_k} f_k(x_k(\lambda), \lambda_k), \quad \partial_x \phi_{ij} = I - \eta \sum_k \mathbb{E}[\partial^2_{x_k} g_k(x_k(\lambda), \lambda_k; \xi_k)], \quad \forall i, j.$    (8)

A naive computation of (8) requires gathering these derivatives from all clients through communications. The communication of the Hessians $\mathbb{E}[\partial^2_{x_k} g_k(x_k(\lambda), \lambda_k; \xi_k)]$ is particularly problematic for large models such as deep neural networks.
In the next section, we show that there is an alternative yet equivalent stationary condition that does not explicitly require consensus between clients. Based on this alternative condition, we introduce the proposed HGP, a fixed-point iteration that does not require exchanging Hessian matrices and that can run even on time-varying directed networks.
3.2 Alternative yet Equivalent Stationary Condition
We first present the alternative stationary condition as follows:

Lemma 1. The stationary condition $x(\lambda) = \phi(x(\lambda), \lambda)$ with the function

$\phi(x, \lambda) = \Theta\left(x - \eta \partial_x g(x, \lambda)\right),$    (9)

holds true if and only if $x(\lambda)$ is the solution of (2).

Lemma 1 states that each $x_i$ is the optimal solution of FL only when it is identical to the clients' average and the average of the gradients is zero. While both (9) and (7) characterize the optimality condition of (2) through their stationary conditions, our (9) has the following desirable property:

Remark 2. Lemma 1 requires $x_i = x_j$ only implicitly. Thus, any block of the partial derivatives with respect to $x$ can be calculated by a single client.
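The intuition behind Lemma 1 can be seen from a short informal argument (our own sketch; the formal proof may proceed differently). Writing out the $i$-th block of the stationary condition with the averaging operator (1) gives

$x_i = \big[\Theta\big(x - \eta\,\partial_x g(x,\lambda)\big)\big]_i = \frac{1}{n}\sum_{k=1}^{n}\Big(x_k - \eta\,\partial_{x_k}\mathbb{E}[g_k(x_k,\lambda_k;\xi_k)]\Big), \qquad i = 1,\ldots,n.$

Since the right-hand side does not depend on $i$, all blocks must coincide, $x_i = x_j =: \bar{x}$, which recovers the consensus constraint of (2). Substituting $x_k = \bar{x}$ back into the display then yields $\sum_k \partial_{x}\mathbb{E}[g_k(\bar{x},\lambda_k;\xi_k)] = 0$, i.e., $\bar{x}$ is a stationary point of the total cost, hence its unique minimizer under Assumption 2; the converse direction follows by reversing the steps.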
3.3 Stochastic and Decentralized Approximation of Hyper-Gradient
Finally, we present our decentralized algorithm, named Hyper-Gradient Push (HGP).
Since Assumption 2 ensures that $(I - \partial_x \phi)^{-1}$ exists, we can derive the hyper-gradient similarly to (5):

$d_\lambda f = -\eta\, \partial^2_{\lambda x} g\, \Theta \sum_{m=0}^{\infty} \big( (I - \eta\, \partial^2_x g)\, \Theta \big)^m \partial_x f + \partial_\lambda f,$

where we used $\Theta^\top = \Theta$ from (1). Similarly to (6b), we obtain the fixed-point iteration of the form

$v^{(m)} = -\eta\, \partial^2_{\lambda x} g\, \Theta u^{(m-1)} + v^{(m-1)}, \qquad u^{(m)} = (I - \eta\, \partial^2_x g)\, \Theta u^{(m-1)}.$    (10)

Our HGP is obtained simply by letting the $i$-th client compute the $i$-th blocks of $v^{(m)}$ and $u^{(m)}$, denoted by $v_i^{(m)}$ and $u_i^{(m)}$, respectively. We also replace $\Theta$ with the $S$-step Push-Sum (Alg. 1), which we denote by $\hat{\Theta}$.
HGP, described in Alg. 2, proceeds as follows. For $m = 0$, any $i$-th client can locally compute $u_i^{(0)} = \partial_{x_i} f_i(x_i(\lambda), \lambda_i)$ and $v_i^{(0)} = \partial_{\lambda_i} f_i(x_i(\lambda), \lambda_i)$ (Remark 2). Suppose the $i$-th client knows $u_i^{(m-1)}$, which is trivially true when $m = 1$. Then, the $i$-th client can compute the average $\bar{u}_i^{(m-1)} = [\hat{\Theta} u^{(m-1)}]_i$ in (10) by running Push-Sum (Remark 1). Because $\partial^2_x g_{ii} = \mathbb{E}[\partial^2_{x_i} g_i(x_i(\lambda), \lambda_i; \xi_i)]$ (and likewise $\partial^2_{\lambda x} g_{ii}$) can be computed locally (Remark 2), the $i$-th client can compute $u_i^{(m)}$ once $\bar{u}_i^{(m-1)}$ is obtained. This can be performed for every $m \geq 0$, and similarly for $v_i^{(m)}$. Note that in Alg. 2, we replace the expectations for $g_i$ with their finite-sample estimates.
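To illustrate how the blocks of (10) decompose across clients, here is a minimal NumPy simulation of the HGP iteration on a toy problem with quadratic local costs. The problem, the variable names, and the use of exact averaging in place of the $S$-step Push-Sum $\hat{\Theta}$ are our own simplifying assumptions for illustration, not the paper's Alg. 2.

import numpy as np

# Toy setup: n clients, local inner-costs g_i(x, lam_i) = 0.5 x^T A_i x - lam_i^T x
# and local outer-costs f_i(x_i, lam_i) = 0.5 ||x_i - c_i||^2, so that the shared
# optimum is xbar = (sum_k A_k)^{-1} sum_k lam_k and the hyper-gradient of the
# total outer-cost w.r.t. lam_i is (sum_k A_k)^{-1} sum_k (xbar - c_k) for every i.
rng = np.random.default_rng(1)
n, d, eta = 4, 3, 0.3
A = [np.diag(rng.uniform(1.0, 2.0, d)) for _ in range(n)]
c = [rng.normal(size=d) for _ in range(n)]
lam = [rng.normal(size=d) for _ in range(n)]
xbar = np.linalg.solve(sum(A), sum(lam))          # FL optimum shared by all clients

u = [xbar - c[i] for i in range(n)]               # u_i^{(0)} = partial_{x_i} f_i
v = [np.zeros(d) for _ in range(n)]               # v_i^{(0)} = partial_{lam_i} f_i = 0

for m in range(300):                              # iteration (10)
    ubar = sum(u) / n                             # Theta u; HGP uses S-step Push-Sum here
    # local blocks, computed by client i alone: d^2_{lam x} g_i = -I, d^2_x g_i = A_i
    v = [-eta * (-np.eye(d)) @ ubar + v[i] for i in range(n)]
    u = [(np.eye(d) - eta * A[i]) @ ubar for i in range(n)]

truth = np.linalg.solve(sum(A), sum(xbar - c[i] for i in range(n)))
print(all(np.allclose(v[i], truth) for i in range(n)))   # True

In this toy problem the hyper-gradient happens to be identical across clients; in general, each $i$-th client recovers only its own block, i.e., the derivative of the total outer-cost with respect to $\lambda_i$.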