
et al., 2017; Kairouz et al., 2021; Smith et al., 2018; Stich, 2018; Yu et al., 2019a;b; Khaled et al., 2020; Woodworth et al., 2020b;a; Karimireddy et al., 2020b; Haddadpour et al., 2019), which focus on the following empirical risk minimization (ERM) problem with the data set $\mathcal{S}$ distributed over different machines:
$$\min_{w\in\mathbb{R}^d} \; \frac{1}{|\mathcal{S}|}\sum_{z\in\mathcal{S}} \ell(w, z). \qquad (2)$$
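For reference (a standard fact about ERM, not a claim specific to this paper), if $\mathcal{S}$ is partitioned into local datasets $\mathcal{S}_1, \dots, \mathcal{S}_K$ held by $K$ machines, the ERM gradient decomposes into purely local gradients:
$$\nabla\Big(\frac{1}{|\mathcal{S}|}\sum_{z\in\mathcal{S}}\ell(w, z)\Big) \;=\; \sum_{k=1}^{K}\frac{|\mathcal{S}_k|}{|\mathcal{S}|}\cdot\frac{1}{|\mathcal{S}_k|}\sum_{z\in\mathcal{S}_k}\nabla\ell(w, z),$$
where each inner average is computable on machine $k$ from its own data alone. As discussed next, the DXO objective lacks exactly this property.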
The major differences between DXO and ERM are that (i) the ERM objective is decomposable over the training data, while the DXO objective is not; and (ii) the data-dependent losses in ERM are decoupled between different data points, whereas the data-dependent loss in DXO couples different training data points. These differences pose a significant challenge for DXO in the FL setting, where the training data are distributed across different machines and are prohibited from being moved to a central server. In particular, the gradient of the X-risk cannot be written as a sum of local gradients at individual machines that depend only on the local data of those machines. Instead, the gradient of DXO at each machine depends not only on the local data but also on the data in other machines. As a result, the design of communication-efficient FL algorithms for DXO is much more complicated than that for ERM. In addition, the presence of a non-linear function $f$ makes the algorithm design and analysis even more challenging than with a linear $f$. There are two levels of coupling in DXO with non-linear $f$: one level at the pairwise loss $\ell(h(w, z), h(w, z'))$ and another level at the non-linear risk $f(g(w, z, \mathcal{S}_2))$, which makes the estimation of the stochastic gradient more tricky.
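To make the cross-machine coupling concrete, the following minimal Python sketch computes the gradient of a pairwise surrogate risk; the squared surrogate loss, the linear predictor, and all names are illustrative assumptions, not the paper's formulation.

import numpy as np

# Toy illustration of why an X-risk couples data across machines.
# We use a pairwise surrogate ell(h(w,z), h(w,z')) = (h(w,z) - h(w,z') - 1)^2
# with a linear predictor h(w,z) = w @ z (both assumed for illustration).

def h(w, z):
    return w @ z

def pairwise_risk_grad(w, pos, neg):
    # Gradient of (1 / (|pos| * |neg|)) * sum over (z in pos, z' in neg) of
    # (h(w,z) - h(w,z') - 1)^2 with respect to w.
    grad = np.zeros_like(w)
    for z in pos:
        for zp in neg:
            diff = h(w, z) - h(w, zp) - 1.0
            grad += 2.0 * diff * (z - zp)   # every term mixes z and z'
    return grad / (len(pos) * len(neg))

# If the positives live on machine 1 and the negatives on machine 2, neither
# machine can form its share of the gradient from local data alone: each term
# needs a z from one machine and a z' from the other, unlike the ERM gradient.
pos_on_machine1 = np.random.randn(5, 3)
neg_on_machine2 = np.random.randn(7, 3)
print(pairwise_risk_grad(np.zeros(3), pos_on_machine1, neg_on_machine2))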
Although DXO can be solved by existing algorithms in a centralized learning setting (Hu et al., 2020; Wang & Yang, 2022), extending these algorithms to the FL setting is non-trivial. This differs from extending centralized algorithms for ERM problems to the FL setting. In the design and analysis of FL algorithms for ERM, the individual machines compute local gradients, update local models, and communicate periodically to average the models. The rationale of local FL algorithms for ERM is that, as long as the gap error between the local models and the averaged model is kept on par with the noise in the stochastic gradients by controlling the communication frequency, the convergence of local FL algorithms is not sacrificed and they enjoy the parallel speed-up of using multiple machines. However, this rationale is not sufficient for developing FL algorithms for DXO due to the challenges mentioned above.
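For concreteness, the following minimal Python sketch shows the local-update-then-average pattern described above; the quadratic toy loss and all function names are illustrative assumptions, not a specific FL implementation.

import numpy as np

def local_grad(w, z):
    # Toy decomposable loss: ell(w, z) = 0.5 * ||w - z||^2, so the gradient is w - z.
    return w - z

def local_sgd_erm(local_data, rounds=10, local_steps=5, lr=0.1, d=2):
    K = len(local_data)                       # number of machines
    w_global = np.zeros(d)
    for _ in range(rounds):
        local_models = []
        for k in range(K):                    # in FL this loop runs in parallel
            w_k = w_global.copy()             # start from the averaged model
            for _ in range(local_steps):      # local updates, no communication
                z = local_data[k][np.random.randint(len(local_data[k]))]
                w_k -= lr * local_grad(w_k, z)
            local_models.append(w_k)
        # Periodic communication: average the local models.
        w_global = np.mean(local_models, axis=0)
    return w_global

# Usage: two machines, each with its own local samples.
data = [np.random.randn(20, 2) + 1.0, np.random.randn(20, 2) - 1.0]
print(local_sgd_erm(data))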
To address these challenges, we propose two novel FL algorithms named FeDXL1 and FeDXL2 for DXO with linear and non-linear $f$, respectively. The main innovation in the algorithm design lies in an active-passive decomposition framework that decouples the gradient of the objective into two types of components, active parts and passive parts. The active parts depend on the data in local machines, and the passive parts depend on the data in other machines. We estimate the active parts using the local data and the local model, and we estimate the passive parts using information with delayed communications from other machines, computed at historical models in the previous round. In terms of analysis, the challenge is that the model used in the computation of the stochastic gradient estimator depends on the (historical) samples used for computing the passive parts at the current iteration, which is only exacerbated in the presence of a non-linear function $f$. We develop a novel analysis that allows us to transfer the error of the gradient estimator into the latency error of the passive parts and the gap error between the local models and the global model. Hence, the rationale is that as long as the latency error of the passive parts and the gap error between the local models and the global model are on par with the noise in the stochastic gradient estimator, we are able to achieve convergence and linear speed-up.
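The following minimal Python sketch illustrates the active-passive idea for the pairwise case with a linear $f$; it is an illustration under assumed names and a toy squared surrogate loss, not the authors' implementation.

import numpy as np

# Active-passive sketch for ell(h(w,z), h(w,z')) = (h(w,z) - h(w,z') - 1)^2
# with a linear predictor h(w,z) = w @ z (both assumed for illustration).

def local_gradient_step(w_local, local_pos, remote_neg_preds, lr=0.1):
    # w_local          : this machine's current local model
    # local_pos        : local samples, shape (n, d)
    # remote_neg_preds : cached predictions h(w_old, z') received from other
    #                    machines at the last communication round (passive parts)
    grad = np.zeros_like(w_local)
    for z in local_pos:
        s_active = w_local @ z                  # active part: local data, local model
        for s_passive in remote_neg_preds:      # passive part: delayed remote predictions
            grad += 2.0 * (s_active - s_passive - 1.0) * z
    grad /= len(local_pos) * len(remote_neg_preds)
    return w_local - lr * grad

# The machine holding the other half of each pair symmetrically computes its
# portion of the gradient using cached predictions of this machine's data.
w_local = np.zeros(3)
local_pos = np.random.randn(5, 3)
remote_neg_preds = np.random.randn(7)           # h(w_old, z') values from last round
print(local_gradient_step(w_local, local_pos, remote_neg_preds))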
The main contributions of this work are as follows:
• We propose two novel communication-efficient algorithms, FeDXL1 and FeDXL2, for DXO with linear and nonlinear $f$, respectively, based on federated averaging and merging. Besides communicating local models for federated averaging, the proposed algorithms only need to communicate local prediction outputs periodically for federated merging to enable the computation of the passive parts. The diagram of the proposed FeDXL algorithms is shown in Figure 1.
• We perform a novel technical analysis to prove the convergence of both algorithms. We show that both algorithms enjoy parallel speed-up in terms of the iteration complexity, and a lower-order communication complexity.
• We conduct empirical studies on two tasks, federated deep partial AUC optimization with a compositional loss and federated deep AUC optimization with a pairwise loss, and demonstrate the advantages of the proposed algorithms over several baselines.
2. Related Work
FL for ERM. The central challenge of FL is how to utilize the distributed data to learn an ML model with a light communication cost without harming data privacy (Konečný et al., 2016; McMahan et al., 2017). To reduce the communication cost, many algorithms have been proposed to skip communications (Stich, 2018; Yu et al., 2019a;b; Yang, 2013; Karimireddy et al., 2020b) or to compress the communicated statistics (Stich et al., 2018; Basu et al., 2019; Jiang & Agrawal, 2018; Wangni et al., 2018; Bernstein et al., 2018). Tight analyses have been performed in various studies (Kairouz et al., 2021; Yu et al., 2019a;b; Khaled et al.,