
Table 1: Concise comparison of the hyper-gradient for FL.

Method                  Decentralized?   Communication Cost   Communication Network
Tarzanagh et al. [30]   No               Small                Centralized
Chen et al. [5]         Yes              Large                Static Undirected
Yang et al. [33]        Yes              Large                Static Undirected
HGP (Ours)              Yes              Small                Time-Varying Directed
Communication cost   In centralized FL, the central server can gather all necessary client information for hyper-gradient computation, enabling a communication-efficient algorithm as demonstrated by Tarzanagh et al. [30]. However, designing such an efficient algorithm for decentralized FL is more challenging, as clients need to perform the necessary communication and computation without central orchestration. This challenge results in less efficient algorithms [33, 5], as shown in Table 1; the current decentralized hyper-gradient computations require large communication costs for exchanging Hessian matrices.
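To make the Hessian dependence concrete, recall the standard implicit-differentiation form of the hyper-gradient for a generic bilevel problem; this is background only, and the symbols $f$, $g$, $w$, and $\lambda$ are introduced here for illustration rather than taken from this paper. Let $g(w,\lambda)$ be the inner training objective over model parameters $w \in \mathbb{R}^m$, $f(w,\lambda)$ the outer objective over hyperparameters $\lambda \in \mathbb{R}^s$, and $w^*(\lambda) = \arg\min_w g(w,\lambda)$. Differentiating the stationarity condition $\partial_w g(w^*(\lambda),\lambda) = 0$ via the implicit function theorem gives
$$
\mathrm{d}_\lambda f\big(w^*(\lambda),\lambda\big) \;=\; \partial_\lambda f \;-\; \partial^2_{w\lambda} g \,\big(\partial^2_{w} g\big)^{-1} \partial_w f,
$$
with all derivatives evaluated at $(w^*(\lambda),\lambda)$. When $g$ is a sum of client losses, the inverse-Hessian term is the part that existing decentralized schemes handle by exchanging Hessian matrices, which is exactly the communication cost HGP is designed to avoid.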
Configuration of communication network   There are several types of communication network configurations for decentralized FL. One of the most general and efficient configurations is the time-varying directed communication network, which allows any message passing to be unidirectional. This configuration is known to be resilient to failing clients and deadlocks [31] at a minimal communication overhead [2]. However, hyper-gradient computation on such a dynamic network remains unsolved, and previous approaches operate over less efficient configurations, as shown in Table 1.
1.2 Our Contributions
In this paper, we demonstrate that both problems can be solved simply by introducing an appropriate optimality condition of FL, which is then used to derive the hyper-gradient estimator. We found that the optimality condition of decentralized FL can be expressed by an averaging operation on model parameters and gradients. We then select Push-Sum [3] as the averaging operation, a consensus protocol known to run over time-varying directed networks. Based on these findings and the specific choice of Push-Sum for the averaging operation, we propose our decentralized algorithm for hyper-gradient computation, named Hyper-Gradient Push (HGP), and provide its theoretical error bound. Notably, the proposed HGP resolves the two aforementioned problems: (i) it communicates only vectors using Push-Sum, avoiding the exchange of Hessian matrices, and (ii) it can operate on time-varying directed networks, which are more efficient and robust than static undirected networks.
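To illustrate the averaging primitive underlying this approach, the following is a minimal sketch of Push-Sum average consensus over a time-varying directed network. It is not the authors' HGP implementation; the function name `push_sum_average`, the synchronous-round model, and the uniform split over out-neighbors are assumptions made for this example.

```python
import numpy as np

def push_sum_average(initial_values, out_neighbors_per_round):
    """Sketch of Push-Sum average consensus on a time-varying directed network.

    initial_values: list of 1-D numpy arrays, one per node (e.g. local vectors).
    out_neighbors_per_round: list over rounds; element t maps each node i to the
        nodes it can send to at round t (a self-loop is added automatically).
    Returns each node's estimate of the network-wide average after the last round.
    """
    n = len(initial_values)
    x = [v.astype(float) for v in initial_values]  # value buffers
    w = [1.0] * n                                  # Push-Sum weight buffers

    for out_neighbors in out_neighbors_per_round:
        new_x = [np.zeros_like(x[0]) for _ in range(n)]
        new_w = [0.0] * n
        for i in range(n):
            targets = set(out_neighbors.get(i, [])) | {i}  # keep a share for itself
            share = 1.0 / len(targets)                     # column-stochastic split
            for j in targets:
                new_x[j] += share * x[i]
                new_w[j] += share * w[i]
        x, w = new_x, new_w

    # The ratio x_i / w_i converges to the average of the initial values
    # when the sequence of directed graphs is jointly strongly connected.
    return [x[i] / w[i] for i in range(n)]


# Toy usage: three nodes on a directed ring whose edges change every round.
values = [np.array([1.0, 0.0]), np.array([2.0, 2.0]), np.array([3.0, 4.0])]
rounds = [{0: [1], 1: [2], 2: [0]}, {0: [2], 1: [0], 2: [1]}] * 20
print(push_sum_average(values, rounds))  # each estimate close to [2.0, 2.0]
```

The weight buffer is the key design point: each node splits both its value and its weight over its current out-neighbors, and the ratio x_i / w_i removes the bias that directed, non-doubly-stochastic communication would otherwise introduce.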
Our numerical experiments confirmed the convergence of HGP towards the true hyper-gradient. We
verified the efficacy of our HGP on two tasks: influential training instance estimation and model
personalization. The experimental results demonstrated that our HGP enabled us, for the first time,
to solve these problems over time-varying communication networks. For personalization, we also
verified the superior performance of our HGP against existing methods on centralized and static
undirected communication networks.
Our contributions are summarized as follows:
• We introduce a new formulation of the hyper-gradient for decentralized FL using an averaging operation, which can be performed by Push-Sum iterations. This enables us to design the proposed HGP, which only requires the communication of model-parameter-sized vectors over time-varying directed networks. We also provide a theoretical error bound for our estimator.
• We empirically confirmed the convergence of our estimate to the true hyper-gradient through experiments. We also demonstrated two applications that are newly enabled by our algorithm: influence estimation and personalization over time-varying communication networks.
Notation   $\langle A \rangle_{ij}$ denotes the $i$-th row and $j$-th column block of the matrix $A$, and $\langle a \rangle_i$ denotes the $i$-th block vector of the vector $a$. For a vector function $h:\mathbb{R}^m \to \mathbb{R}^n$, we denote its total and partial derivatives by $\mathrm{d}_x h(x) \in \mathbb{R}^{m \times n}$ and $\partial_x h(x) \in \mathbb{R}^{m \times n}$, respectively. For a real-valued function $h:\mathbb{R}^m \times \mathbb{R}^s \to \mathbb{R}$, we denote the Jacobian of $\partial_x h(x,y)$ with respect to $x$ and $y$ by $\partial^2_{x} h(x,y) \in \mathbb{R}^{m \times m}$ and $\partial^2_{xy} h(x,y) \in \mathbb{R}^{s \times m}$, respectively. We also introduce a concatenating