keeping existing well-performing architectures mostly unaltered [118, 74, 21, 109]. In this context,
it is often assumed that massive training datasets can be collected and centralized in a single client
in order to maximize performance. However, in many application domains, data collection occurs
in distinct sites (further referred to as clients, e.g., mobile devices or hospitals), and the resulting
local datasets cannot be shared with a central repository or data center due to privacy or strategic
concerns [39, 15].
To enable cooperation among clients given such constraints, Federated Learning (FL) [97, 71] has
emerged as a viable alternative to train models across data providers without sharing sensitive
data. While initially developed to enable training across a large number of small clients, such as
smartphones or Internet of Things (IoT) devices, it has since been extended to the collaboration of
fewer and larger clients, such as banks or hospitals. The two settings are now respectively referred to
as cross-device FL and cross-silo FL, each associated with specific use cases and challenges [71].
On the one hand, cross-device FL leverages edge devices such as mobile phones and wearable
technologies to exploit data distributed over billions of data sources [97, 13, 11, 101]. Therefore, it
often requires solving problems related to edge computing [51, 85, 129], participant selection
[71, 131, 20, 42], system heterogeneity [71], and communication constraints such as low network
bandwidth and high latency [113, 91, 49]. On the other hand, cross-silo initiatives make it possible
to unlock the potential
of large datasets previously out of reach. This is especially true in healthcare, where the emergence
of federated networks of private and public actors [112, 115, 103], for the first time, allows scientists
to gather enough data to tackle open questions on poorly understood diseases such as triple negative
breast cancer [37] or COVID-19 [31]. In cross-silo applications, each silo has large computational
power, a relatively high bandwidth, and a stable network connection, allowing it to participate in the
whole training phase. However, cross-silo FL is typically characterized by high inter-client dataset
heterogeneity and biases of various types across the clients [103, 37].
As we show in Section 2, publicly available datasets for the cross-silo FL setting are scarce. As
a consequence, researchers usually rely on heuristics to artificially generate heterogeneous data
partitions from a single dataset and assign them to hypothetical clients. Such heuristics might
fall short of replicating the complexity of natural heterogeneity found in real-world datasets. The
example of digital histopathology [126], a crucial data type in cancer research, illustrates the potential
limitations of such synthetic partition methods. In digital histopathology, tissue samples are extracted
from patients, stained, and finally digitized. In this process, known factors of data heterogeneity
across hospitals include patient demographics, staining techniques, storage methodologies of the
physical slides, and digitization processes [69, 43, 57]. Although staining normalization [79, 32]
has seen recent progress, mitigating this source of heterogeneity, the other highlighted sources of
heterogeneity are difficult to replicate with synthetic partitioning [57] and some may be unknown,
which calls for actual cross-silo cohort experiments. This observation is also valid for many other
application domains, e.g., radiology [50], dermatology [7], retinal images [7], and more generally
computer vision [122].
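To make the notion of a synthetic partition concrete, the sketch below shows one heuristic of this kind: label-skewed splitting driven by a Dirichlet distribution. The text above does not name a specific heuristic, so this choice, the function name `dirichlet_label_partition`, and its parameters are illustrative assumptions and not part of FLamby. Smaller values of `alpha` produce more skewed label distributions across the hypothetical clients.

```python
import random


def dirichlet_label_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to hypothetical clients with label skew.

    Illustrative assumption: for each class, client shares are drawn from a
    symmetric Dirichlet(alpha) distribution (alpha > 0).
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)

        # A Dirichlet sample is obtained by normalizing independent Gamma draws.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)

        # Cut the shuffled indices of this class according to the sampled shares.
        start = 0
        cumulative = 0.0
        for client_id, gamma in enumerate(gammas):
            cumulative += gamma / total
            if client_id == num_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = int(cumulative * len(indices))
            clients[client_id].extend(indices[start:end])
            start = end

    return [sorted(client) for client in clients]


if __name__ == "__main__":
    toy_labels = [i % 4 for i in range(200)]  # 4 balanced classes
    parts = dirichlet_label_partition(toy_labels, num_clients=3, alpha=0.5)
    for client_id, part in enumerate(parts):
        counts = [sum(1 for i in part if toy_labels[i] == c) for c in range(4)]
        print(f"client {client_id}: {len(part)} samples, per-class counts {counts}")
```

Such a split only controls the label distribution per client. It does not reproduce acquisition-related factors such as staining or digitization differences, which is the limitation discussed above.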
In order to address the lack of realistic cross-silo datasets, we propose FLamby, an open-source
cross-silo federated dataset suite with natural partitions focused on healthcare, accompanied by
code examples and benchmarking guidelines. Our ambition is that FLamby becomes the reference
benchmark for cross-silo FL, as LEAF [16] is for cross-device FL. To the best of our knowledge,
apart from some promising isolated works to build realistic cross-silo FL datasets (see Section 2), our
work is the first standard benchmark that allows the systematic study of healthcare cross-silo FL on
different data modalities and tasks.
To summarize, our contributions are threefold:
1. We build an open-source federated cross-silo healthcare dataset suite including 7 datasets.
   These datasets cover different tasks (classification / segmentation / survival) in multiple
   application domains and with different data modalities and scale. Crucially, all datasets are
   partitioned using natural splits.
2. We provide guidelines to help compare FL strategies in a fair and reproducible manner, and
   provide illustrative results for this benchmark.
3. We make open-source code accessible for benchmark reproducibility and easy integration in
   different FL frameworks, but also to allow the research community to contribute to FLamby
   development, by adding more datasets, benchmarking types and FL strategies.