
of nonconvex-concave and nonconvex-PL (Polyak-Łojasiewicz) min-max problems has important
applications in, e.g., AUC (area under the ROC curve) maximization [1,2], adversarial and robust
learning [3, 4], and generative adversarial networks (GANs) [5]. The versatility of min-max optimization thus sparks intense research on developing efficient min-max algorithms. In the literature, the
family of primal-dual stochastic gradient methods is one of the most popular and efficient approaches.
For example, the stochastic gradient descent ascent (SGDA) method in this family has been shown
effective in centralized (single-machine) learning, both theoretically and empirically. However, as
over-parameterized models (e.g., deep neural networks) become more and more prevalent, learning
on a single machine becomes increasingly inefficient. To address this challenge, large-scale distributed
learning emerges as an effective mechanism to accelerate training and has achieved astonishing
successes in recent years. Moreover, as increasingly stringent data privacy requirements emerge, centralized learning becomes infeasible since collecting raw data at a central server is often prohibited.
This also motivates the need for distributed learning without sharing raw data. Consequently, there
is a growing need for distributed/federated min-max optimization, such as federated deep AUC
maximization [6, 7], federated adversarial training [8] and distributed/federated GAN [9–11].
Similar to conventional federated learning for minimization problems, federated min-max learning
enjoys benefits of parallelism and privacy, but suffers from high communication costs. One effective
approach to reduce communication costs is to utilize infrequent communications. For example, in
conventional federated learning for minimization problems, the FedAvg algorithm [12] allows each
client to perform multiple stochastic gradient descent (SGD) steps to update the local model between
two successive communication rounds. Then, local models are sent to and averaged periodically
at the server through communications. Although infrequent communication may introduce extra
noises due to data heterogeneity, FedAvg can still achieve the same convergence rate as distributed
SGD, while having a significantly lower communication complexity. Inspired by the theoretical
and empirical success of FedAvg, a natural idea to lower the communication costs of federated
min-max optimization is to utilize infrequent communication in the federated version of SGDA.
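The FedAvg pattern described above (multiple local SGD steps per client between two communication rounds, followed by server-side model averaging) can be sketched in a few lines. The toy quadratic client objectives, variable names, and hyperparameters below are our own illustration, not a reference implementation:

```python
import numpy as np

def fedavg(clients_data, rounds=50, local_steps=10, lr=0.1):
    """FedAvg sketch: each client runs `local_steps` local SGD steps
    between communication rounds; the server then averages the local
    models. Toy objective per client i (our choice for illustration):
        f_i(w) = 0.5 * (w - c_i)**2,
    whose stochastic gradient we model as (w - c_i) plus small noise."""
    rng = np.random.default_rng(0)
    w = 0.0  # global model (a scalar, purely for illustration)
    for _ in range(rounds):
        local_models = []
        for c_i in clients_data:
            w_local = w  # client starts from the current global model
            for _ in range(local_steps):
                grad = (w_local - c_i) + 0.01 * rng.standard_normal()
                w_local -= lr * grad
            local_models.append(w_local)
        w = np.mean(local_models)  # periodic server-side averaging
    return w

# With heterogeneous client optima c_i, the minimizer of the average
# objective is mean(c_i); FedAvg steers the global model toward it.
centers = [1.0, 2.0, 3.0]
w_star = fedavg(centers)
```

Despite the extra noise that infrequent communication introduces under data heterogeneity, the periodic averaging still drives the global model toward the minimizer of the average objective.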
Despite the simplicity of this idea, existing works can only show unsatisfactory convergence rates
(O(1/√(mT)) [13] and O(1/(mKT)^{1/3}) [14]) for solving nonconvex-strongly-concave or nonconvex-PL problems by federated SGDA with infrequent communication (here, m is the number of clients, K is the number of local steps, and T is the number of communication rounds). These convergence
rates do not match that of the FedAvg method. These unsatisfactory results are due to the fact
that federated min-max optimization not only needs to address the same challenges in conventional
federated learning (e.g., data heterogeneity and partial client participation), but also to handle the more
complicated inner-outer problem structure. Thus, a fundamental question in federated min-max
optimization is: Can a federated SGDA-type method with infrequent communication provably achieve
the same convergence rate and even the highly desirable linear speedup effect for federated min-max
problems?
In this paper, we answer this question affirmatively. The main contributions of this paper are
summarized as follows:
• We propose a new algorithmic framework called SAGDA (stochastic sampling averaging gradient descent ascent), which assembles stochastic gradient estimators as control variates and leverages two learning rates on the server and client sides. With these techniques, SAGDA relaxes the restrictive "bounded gradient dissimilarity" assumption, while still achieving the same convergence rate with low communication complexity. We show that SAGDA achieves the highly desirable linear speedup in terms of both the number of clients (even with partial client participation) and the number of local update steps, which yields an O(ε^{-2}) communication complexity that is orders of magnitude lower than the state of the art in the federated min-max optimization literature.
• Interestingly, by noting that standard federated stochastic gradient descent ascent (FSGDA) is in fact a "control-variate-free" special case of our SAGDA algorithm, we can conclude from our theoretical analysis of SAGDA that FSGDA achieves an O(1/√(mKT)) convergence rate for nonconvex-PL problems with full client participation, which further implies the highly desirable linear speedup effect. This improves the state-of-the-art result for FSGDA by a factor of O(1/(mKT)^{1/6}) [14] and matches the optimal convergence rate of nonconvex federated learning. Therefore, through the lens of SAGDA, we also advance the current understanding of the communication complexity of the standard FSGDA method for federated min-max learning.
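To make the two contributions above concrete, here is a rough, hypothetical sketch of the ingredients they describe: per-client control variates assembled from stochastic gradient estimators, separate client and server learning rates, and FSGDA recovered as the control-variate-free special case. The toy saddle-point objective, the SCAFFOLD-style control-variate refresh rule, and all names below are our assumptions; the paper's actual SAGDA pseudocode may differ:

```python
import numpy as np

def sagda_sketch(A, rounds=100, K=5, lr_c=0.05, lr_s=1.0,
                 control_variates=True, seed=1):
    """Hypothetical sketch (our assumptions, not the paper's pseudocode).
    Each client runs K local stochastic gradient descent (x) / ascent (y)
    steps with client learning rate lr_c; the server aggregates the model
    deltas with its own learning rate lr_s. With control_variates=False,
    this reduces to plain FSGDA. Toy client objective (for illustration):
        f_i(x, y) = 0.5*x**2 + A[i]*x*y - 0.5*y**2,
    whose unique saddle point is (0, 0) for every coupling A[i]."""
    rng = np.random.default_rng(seed)
    m = len(A)
    x, y = 1.0, -1.0                      # server model
    cx = np.zeros(m); cy = np.zeros(m)    # client control variates
    cx_s = cy_s = 0.0                     # server control variates
    for _ in range(rounds):
        dx, dy, dcx, dcy = [], [], [], []
        for i in range(m):
            xi, yi = x, y
            for _ in range(K):
                gx = xi + A[i] * yi + 0.01 * rng.standard_normal()
                gy = A[i] * xi - yi + 0.01 * rng.standard_normal()
                if control_variates:      # drift-corrected local gradients
                    gx += cx_s - cx[i]
                    gy += cy_s - cy[i]
                xi -= lr_c * gx           # descent on the min variable
                yi += lr_c * gy           # ascent on the max variable
            if control_variates:          # SCAFFOLD-style refresh (assumed)
                ncx = cx[i] - cx_s + (x - xi) / (K * lr_c)
                ncy = cy[i] - cy_s + (yi - y) / (K * lr_c)
                dcx.append(ncx - cx[i]); dcy.append(ncy - cy[i])
                cx[i], cy[i] = ncx, ncy
            dx.append(xi - x); dy.append(yi - y)
        x += lr_s * np.mean(dx)           # server step, own learning rate
        y += lr_s * np.mean(dy)
        if control_variates:
            cx_s += np.mean(dcx); cy_s += np.mean(dcy)
    return x, y

# Both variants drive the toy problem toward its saddle point (0, 0);
# the control variates correct for inter-client gradient dissimilarity.
x_cv, y_cv = sagda_sketch([0.5, 1.0, 1.5], control_variates=True)
x_np, y_np = sagda_sketch([0.5, 1.0, 1.5], control_variates=False)
```

Note how setting `control_variates=False` leaves exactly local SGDA plus periodic averaging, which is the sense in which FSGDA is a special case of the framework.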