
Problem                              | Sample Complexity                                              | Thm | Best Previous Result
Collab. Learning UB                  | $\epsilon^{-2}(\log(|\mathcal{H}|) + n\log(n/\delta))$         | 5.1 | $\epsilon^{-5}\log(1/\epsilon)\log(n/\delta)(\log(|\mathcal{H}|) + n)$ [42]
Collab. Learning LB                  | $\epsilon^{-2}(\log(|\mathcal{H}|) + n\log(n/\delta))$         | 5.3 | $\epsilon^{-1}(\log(|\mathcal{H}|) + n\log(n/\delta))$ [9]
GDRO/AFL UB                          | $\epsilon^{-2}(\log(|\mathcal{H}|) + n\log(n/\delta))$         | 5.1 | $\epsilon^{-2}(n\log(|\mathcal{H}|) + n\log(n/\delta))$ [39]
GDRO/AFL UB                          | $\epsilon^{-2}(D_{\mathcal{H}} + n\log(n/\delta))$             | 6.1 | N/A
GDRO/AFL (training error convg.)     | $\epsilon^{-2}(D_{\mathcal{H}} + n\log(n/\delta))$             | 6.2 | $\epsilon^{-2}n(\log(n) + D_{\mathcal{H}})$ (expected convergence only) [50]
Table 1: This table lists upper (UB) and lower (LB) bounds on the sample complexity of learning a model class $\mathcal{H}$ on $n$ distributions. For the collaborative learning and agnostic federated learning (AFL) settings, the sample complexity upper bounds refer to the problem of learning a (potentially randomized) model whose expected loss on each distribution is at most $\mathrm{OPT} + \epsilon$, where $\mathrm{OPT}$ is the best possible such guarantee. For the GDRO setting, sample complexity refers to learning a deterministic model whose expected losses are at most $\mathrm{OPT} + \epsilon$, from a convex and compact model space $\mathcal{H}$ with Bregman radius $D_{\mathcal{H}}$. Sample complexity bounds for collaborative and agnostic federated learning in existing works extend to VC dimension and Rademacher complexity; our results also extend to VC dimension under some assumptions.
Blum et al. [9] demonstrated the benefit of on-demand sampling in the collaborative learning setting, when all data distributions are realizable with respect to the same target classifier. This line of work established that learning $n$ distributions with on-demand sampling requires only a factor of $\widetilde{O}(\log(n))$ times the sample complexity of learning a single realizable distribution [9, 13, 42], whereas relying on batched uniform convergence takes $\widetilde{\Omega}(n)$ times more samples than learning a single distribution [9]. However, beyond the realizable setting, the best known multi-distribution learning results fall short of this promise: existing on-demand sample complexity bounds for agnostic collaborative learning have a highly suboptimal dependence on $\epsilon$, requiring $\widetilde{O}(\log(n)/\epsilon^{3})$ times the sample complexity of agnostically learning a single distribution [42]. On the other hand, agnostic fair federated learning bounds [39] have been studied only for algorithms that sample in one large batch and thus require $\widetilde{\Omega}(n)$ times the sample complexity of a single learning task. Moreover, the test-time performance of some key multi-distribution learning methods, such as group distributionally robust optimization [50], has not previously been studied with provable guarantees.
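To give a rough sense of scale for these overhead factors, the following illustrative calculation (not taken from the paper) plugs in example values, say $n = 100$ distributions and $\epsilon = 0.05$:

```python
# Illustrative back-of-the-envelope comparison of the multiplicative overheads
# discussed above, ignoring constants and lower-order log factors.
from math import log

n, eps = 100, 0.05  # example values, not from the paper

print(f"log(n) factor (on-demand, realizable):  {log(n):.1f}x")        # ~4.6x
print(f"n factor (batched uniform convergence): {n}x")                 # 100x
print(f"log(n)/eps^3 factor (agnostic, [42]):   {log(n)/eps**3:.0f}x") # ~3.7e4x
```

Even for modest $n$ and $\epsilon$, the $\epsilon^{-3}$ dependence dominates; closing that gap is what the bounds in Table 1 accomplish.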
In this paper, we give a general framework for obtaining optimal and on-demand sample complexity for three multi-distribution learning settings. Table 1 summarizes our results. All three of these settings consider a set $\mathcal{D}$ of $n$ data distributions and a model class $\mathcal{H}$, evaluating the performance of a model $h$ by its worst-case expected loss, $\max_{D \in \mathcal{D}} R_D(h)$. As a benchmark, they consider the worst-case expected loss of the best model, i.e., $\mathrm{OPT} = \min_{h^{*} \in \mathcal{H}} \max_{D \in \mathcal{D}} R_D(h^{*})$. Notably, all of our sample complexity upper bounds demonstrate only an additive increase of $\epsilon^{-2} n \log(n/\delta)$ over the sample complexity of a single learning task, compared to the multiplicative factor increase required by existing works.
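To make the objective concrete, here is a minimal sketch (with hypothetical helper names; it is not the paper's learning algorithm) of the plug-in versions of these quantities for a finite model class, evaluated on held-out samples from each of the $n$ distributions:

```python
# Illustrative sketch only: plug-in estimates of the worst-case risk
# max_{D in D} R_D(h) and of the benchmark OPT = min_{h in H} max_{D in D} R_D(h)
# for a finite model class, computed from samples of each of the n distributions.
from typing import Any, Callable, Sequence

Model = Callable[[Any], Any]
Loss = Callable[[Any, Any], float]

def empirical_risk(h: Model, samples: Sequence[tuple], loss: Loss) -> float:
    """Average loss of model h on (x, y) samples drawn from one distribution."""
    return sum(loss(h(x), y) for x, y in samples) / len(samples)

def worst_case_risk(h: Model,
                    samples_per_dist: Sequence[Sequence[tuple]],
                    loss: Loss) -> float:
    """Empirical analogue of max_{D in D} R_D(h)."""
    return max(empirical_risk(h, samples, loss) for samples in samples_per_dist)

def empirical_opt(models: Sequence[Model],
                  samples_per_dist: Sequence[Sequence[tuple]],
                  loss: Loss) -> float:
    """Empirical analogue of OPT = min_{h in H} max_{D in D} R_D(h)."""
    return min(worst_case_risk(h, samples_per_dist, loss) for h in models)
```

In the paper these quantities are defined with respect to the true distributions in $\mathcal{D}$; the sketch only computes their empirical counterparts on fixed samples, which is how one would audit a trained model's worst-distribution performance in practice.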
- Collaborative learning of Blum et al. [9]: For agnostic collaborative learning, our Theorem 5.1 gives a randomized and a deterministic model that achieve performance guarantees of $\mathrm{OPT} + \epsilon$ and $2\,\mathrm{OPT} + \epsilon$, respectively. Our algorithms have an optimal sample complexity of $O\!\left(\frac{1}{\epsilon^{2}}(\log(|\mathcal{H}|) + n\log(n/\delta))\right)$. This improves upon the work of Nguyen and Zakynthinou [42] in two ways. First, it provides risk bounds of $\mathrm{OPT} + \epsilon$ for randomized classifiers, where only $2\,\mathrm{OPT} + \epsilon$ was established previously. Second, it improves the upper bound of Nguyen and Zakynthinou [42] by a multiplicative factor of $\log(n)/\epsilon^{3}$. In Theorem 5.3, we give a matching lower bound on this sample complexity, thereby establishing the optimality of our algorithms.
- Group distributionally robust learning (group DRO) of Sagawa et al. [50]: For group DRO, we consider a convex and compact model space $\mathcal{H}$. Our Theorem 6.1 gives a model that achieves an $\mathrm{OPT} + \epsilon$ guarantee on worst-case test-time performance, with an on-demand sample complexity of $O\!\left(\frac{1}{\epsilon^{2}}(D_{\mathcal{H}} + n\log(n/\delta))\right)$. Our results also imply a high-probability bound for the convergence of group