FLamby: Datasets and Benchmarks for Cross-Silo
Federated Learning in Realistic Healthcare Settings
Jean Ogier du Terrail1, Samy-Safwan Ayed2, Edwige Cyffers3, Felix Grimberg4,
Chaoyang He5, Regis Loeb1, Paul Mangold3, Tanguy Marchand1,
Othmane Marfoq2, Erum Mushtaq6, Boris Muzellec1, Constantin Philippenko7,
Santiago Silva2, Maria Teleńczuk1, Shadi Albarqouni8,9, Salman Avestimehr5,6,
Aurélien Bellet3, Aymeric Dieuleveut7, Martin Jaggi4,
Sai Praneeth Karimireddy10, Marco Lorenzi2, Giovanni Neglia2, Marc Tommasi3,
Mathieu Andreux1
1Owkin, Inc, 2Inria, Université Côte d’Azur, Sophia Antipolis, France
3Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France
4EPFL 5FedML, Inc. 6University of Southern California
7CMAP, UMR 7641, École Polytechnique, Institut Polytechnique de Paris
8University Hospital Bonn 9Helmholtz Munich 10University of California, Berkeley
{jean.du-terrail, regis.loeb, tanguy.marchand, boris.muzellec,
maria.telenczuk, mathieu.andreux}@owkin.com, {samy-safwan.ayed,
edwige.cyffers, paul.mangold, othmane.marfoq,
santiago-smith.silva-rincon, aurelien.bellet,
marco.lorenzi, giovanni.neglia, marc.tommasi}@inria.fr
{felix.grimberg, martin.jaggi}@epfl.ch,
ch@fedml.ai, {emushtaq, avestime}@usc.edu,
{constantin.philippenko, aymeric.dieuleveut}@polytechnique.edu,
shadi.albarqouni@ukbonn.de, sp.karimireddy@berkeley.edu
Abstract
Federated Learning (FL) is a novel approach enabling several clients holding
sensitive data to collaboratively train machine learning models, without centralizing
data. The cross-silo FL setting corresponds to the case of a few (2–50) reliable clients,
each holding medium to large datasets, and is typically found in applications such as
healthcare, finance, or industry. While previous works have proposed representative
datasets for cross-device FL, few realistic healthcare cross-silo FL datasets exist,
thereby slowing algorithmic research in this critical application. In this work, we
propose a novel cross-silo dataset suite focused on healthcare, FLamby (Federated
Learning AMple Benchmark of Your cross-silo strategies), to bridge the gap
between theory and practice of cross-silo FL. FLamby encompasses 7 healthcare
datasets with natural splits, covering multiple tasks, modalities, and data volumes,
each accompanied by baseline training code. As an illustration, we additionally
benchmark standard FL algorithms on all datasets. Our flexible and modular suite
allows researchers to easily download datasets, reproduce results and re-use the
different components for their research. FLamby is available at www.github.com/owkin/flamby.
Preprint. Under review.
arXiv:2210.04620v3 [cs.LG] 5 May 2023

1 Introduction

Recently it has become clear that, in many application fields, impressive machine learning (ML) task performance can be reached by scaling the size of both ML models and their training data while keeping existing well-performing architectures mostly unaltered [118, 74, 21, 109]. In this context,
it is often assumed that massive training datasets can be collected and centralized in a single client
in order to maximize performance. However, in many application domains, data collection occurs
in distinct sites (further referred to as clients, e.g., mobile devices or hospitals), and the resulting
local datasets cannot be shared with a central repository or data center due to privacy or strategic
concerns [39, 15].
To enable cooperation among clients given such constraints, Federated Learning (FL) [97, 71] has emerged as a viable alternative for training models across data providers without sharing sensitive data. While initially developed to enable training across a large number of small clients, such as smartphones or Internet of Things (IoT) devices, it has since been extended to collaborations among fewer and larger clients, such as banks or hospitals. The two settings are now respectively referred to as cross-device FL and cross-silo FL, each associated with specific use cases and challenges [71].
On the one hand, cross-device FL leverages edge devices such as mobile phones and wearable technologies to exploit data distributed over billions of data sources [97, 13, 11, 101]. It therefore often requires solving problems related to edge computing [51, 85, 129], participant selection [71, 131, 20, 42], system heterogeneity [71], and communication constraints such as low network bandwidth and high latency [113, 91, 49]. On the other hand, cross-silo initiatives unlock the potential of large datasets previously out of reach. This is especially true in healthcare, where the emergence of federated networks of private and public actors [112, 115, 103] allows scientists, for the first time, to gather enough data to tackle open questions on poorly understood diseases such as triple-negative breast cancer [37] or COVID-19 [31]. In cross-silo applications, each silo has large computational power, relatively high bandwidth, and a stable network connection, allowing it to participate in the whole training phase. However, cross-silo FL is typically characterized by high inter-client dataset heterogeneity and biases of various types across clients [103, 37].
As we show in Section 2, publicly available datasets for the cross-silo FL setting are scarce. As a consequence, researchers usually rely on heuristics to artificially generate heterogeneous data partitions from a single dataset and assign them to hypothetical clients. Such heuristics might fall short of replicating the complexity of natural heterogeneity found in real-world datasets. The example of digital histopathology [126], a crucial data type in cancer research, illustrates the potential limitations of such synthetic partition methods. In digital histopathology, tissue samples are extracted from patients, stained, and finally digitized. In this process, known factors of data heterogeneity across hospitals include patient demographics, staining techniques, storage methodologies of the physical slides, and digitization processes [69, 43, 57]. Although staining normalization [79, 32] has seen recent progress, mitigating this source of heterogeneity, the other highlighted sources of heterogeneity are difficult to replicate with synthetic partitioning [57] and some may be unknown, which calls for actual cross-silo cohort experiments. This observation also holds for many other application domains, e.g., radiology [50], dermatology [7], retinal images [7] and, more generally, computer vision [122].
In order to address the lack of realistic cross-silo datasets, we propose FLamby, an open-source cross-silo federated dataset suite with natural partitions focused on healthcare, accompanied by code examples and benchmarking guidelines. Our ambition is that FLamby becomes the reference benchmark for cross-silo FL, as LEAF [16] is for cross-device FL. To the best of our knowledge, apart from some promising isolated works to build realistic cross-silo FL datasets (see Section 2), our work is the first standard benchmark allowing the systematic study of healthcare cross-silo FL on different data modalities and tasks.
To summarize, our contributions are threefold:
1. We build an open-source federated cross-silo healthcare dataset suite including 7 datasets. These datasets cover different tasks (classification / segmentation / survival) in multiple application domains, with different data modalities and scales. Crucially, all datasets are partitioned using natural splits.
2. We provide guidelines to help compare FL strategies in a fair and reproducible manner, and provide illustrative results for this benchmark.
3. We make open-source code available for benchmark reproducibility and easy integration in different FL frameworks, and to allow the research community to contribute to FLamby's development by adding more datasets, benchmark types, and FL strategies.
Table 1: Overview of the datasets, tasks, metrics and baseline models in FLamby. For Fed-Camelyon16, the two different sizes refer to the size of the dataset before and after tiling.

Fed-Camelyon16. Input (x): slides. Preprocessing: matter extraction + tiling. Task type: binary classification. Prediction (y): tumor on slide. Center extraction: hospital. Original paper: Litjens et al. 2018. # clients: 2. # examples: 399 (per center: 239, 150). Model: DeepMIL [64]. Metric: AUC. Size: 50G (850G total). Image resolution: 0.5 µm / pixel. Input dimension: 10,000 × 2048.

Fed-LIDC-IDRI. Input (x): CT-scans. Preprocessing: patch sampling. Task type: 3D segmentation. Prediction (y): lung nodule mask. Center extraction: scanner manufacturer. Original paper: Armato et al. 2011. # clients: 4. # examples: 1,018 (per center: 670, 205, 69, 74). Model: VNet [98, 100]. Metric: DICE. Size: 115G. Image resolution: 1.0 × 1.0 × 1.0 mm / voxel. Input dimension: 128 × 128 × 128.

Fed-IXI. Input (x): T1WI (T1-weighted MRI). Preprocessing: registration. Task type: 3D segmentation. Prediction (y): brain mask. Center extraction: hospital. Original paper: Perez et al. 2021. # clients: 3. # examples: 566 (per center: 311, 181, 74). Model: 3D U-net [22]. Metric: DICE. Size: 444M. Image resolution: 1.0 × 1.0 × 1.0 mm / voxel. Input dimension: 48 × 60 × 48.

Fed-TCGA-BRCA. Input (x): patient info. Preprocessing: none. Task type: survival. Prediction (y): risk of death. Center extraction: group of hospitals. Original paper: Liu et al. 2018. # clients: 6. # examples: 1,088 (per center: 311, 196, 206, 162, 162, 51). Model: Cox model [30]. Metric: C-index. Size: 115K. Image resolution: NA. Input dimension: 39.

Fed-KITS2019. Input (x): CT-scans. Preprocessing: patch sampling. Task type: 3D segmentation. Prediction (y): kidney and tumor masks. Center extraction: group of hospitals. Original paper: Heller et al. 2019. # clients: 6. # examples: 96 (per center: 12, 14, 12, 12, 16, 30). Model: nnU-Net [67]. Metric: DICE. Size: 54G. Image resolution: 1.0 × 1.0 × 1.0 mm / voxel. Input dimension: 64 × 192 × 192.

Fed-ISIC2019. Input (x): dermoscopy. Preprocessing: various image transforms. Task type: multi-class classification. Prediction (y): melanoma class. Center extraction: hospital. Original papers: Tschandl et al. 2018 / Codella et al. 2017 / Combalia et al. 2019. # clients: 6. # examples: 23,247 (per center: 12413, 3954, 3363, 2259, 819, 439). Model: efficientnet [119] + linear layer. Metric: balanced accuracy. Size: 9G. Image resolution: 0.02 mm / pixel. Input dimension: 200 × 200 × 3.

Fed-Heart-Disease. Input (x): patient info. Preprocessing: removing missing data. Task type: binary classification. Prediction (y): heart disease. Center extraction: hospital. Original paper: Janosi et al. 1988. # clients: 4. # examples: 740 (per center: 303, 261, 46, 130). Model: logistic regression. Metric: accuracy. Size: 40K. Image resolution: NA. Input dimension: 13.
This paper is organized as follows. Section 2 reviews existing FL datasets and benchmarks, as well as
client partition methods used to artificially introduce data heterogeneity. In Section 3, we describe our
dataset suite in detail, notably its structure and the intrinsic heterogeneity of each federated dataset.
Finally, we define a benchmark of several FL strategies on all datasets and provide results thereof in
Section 4.
2 Related Work
In FL, data is collected locally in clients, under different conditions and without coordination. As a consequence, clients' datasets differ both in size (unbalanced) and in distribution (non-IID) [97]. The resulting statistical heterogeneity is a fundamental challenge in FL [82, 71], and it is necessary to take it into consideration when evaluating FL algorithms. Most FL papers simulate statistical heterogeneity by artificially partitioning classic datasets, e.g., CIFAR-10/100 [78], MNIST [81] or ImageNet [34], over a given number of clients. Common approaches to produce synthetic partitions of classification datasets include associating samples from a limited number of classes to each client [97], Dirichlet sampling on the class labels [59, 133], and using the Pachinko Allocation Method (PAM) [84, 110] (which is only possible when the labels have a hierarchical structure). In the case of regression tasks, [105] partitions the superconduct dataset [17] across 20 clients using Gaussian Mixture clustering based on t-SNE representations [124] of the features. Such synthetic partition approaches may fall short of modelling the complex statistical heterogeneity of real federated datasets. Evaluating FL strategies on datasets with natural client splits is a safer approach to ensuring that new strategies address real-world issues.
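For concreteness, the Dirichlet-based label partitioning mentioned above can be sketched in a few lines. This is a generic illustration (the function name and defaults are our own, not taken from any of the cited works): for each class, client proportions are drawn from a Dirichlet distribution, so that a small concentration parameter yields highly skewed, non-IID partitions.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients by drawing, for each class,
    client proportions from a Dirichlet(alpha) distribution.
    Smaller alpha -> more heterogeneous (non-IID) partitions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Proportions of class-c samples assigned to each client
        props = rng.dirichlet(alpha * np.ones(n_clients))
        # Convert cumulative proportions to split points within this class
        cuts = (np.cumsum(props) * len(idx)).astype(int)[:-1]
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return [np.array(ix) for ix in client_indices]
```

Note that, unlike a natural split, every statistical property of such a partition is controlled by the single parameter alpha, which is precisely why it may miss real-world heterogeneity.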
For cross-device FL, the LEAF dataset suite [16] includes five datasets with natural partitions, spanning a wide range of machine learning tasks: natural language modeling (Reddit [127]), next-character prediction (Shakespeare [97]), sentiment analysis (Sent140 [45]), image classification (CelebA [88]) and handwritten-character recognition (FEMNIST [25]). TensorFlow Federated [12] complements LEAF and provides three additional naturally split federated benchmarks, i.e., StackOverflow [120], Google Landmark v2 [60] and iNaturalist [125]. Further, FLSim [111] provides cross-device examples based on LEAF and CIFAR-10 [78] with a synthetic split, and FedScale [80] introduces a large FL benchmark focused on mobile applications. Apart from iNaturalist, the aforementioned datasets target the cross-device setting.
To the best of our knowledge, no extensive benchmark with natural splits is available for cross-silo FL. However, some standalone works built cross-silo datasets with real partitions. [46] and [95] partition Cityscapes [27] and iNaturalist [125], respectively, exploiting the geolocation of the picture acquisition site. [58] releases a real-world, geo-tagged dataset of common mammals on Flickr. [92] gathers a federated cross-silo benchmark for object detection created using street cameras. [28] partitions the Vehicle Sensor dataset [38] and the Human Activity Recognition dataset [4] by sensor and by individual, respectively. [93] builds an iris recognition federated dataset across five clients using multiple iris datasets [128, 135, 136, 106]. While FedML [53] introduces several cross-silo benchmarks [54, 132, 52], the related client splits are synthetically obtained with Dirichlet sampling and are not based on a natural split. Similarly, FATE [41] provides several cross-silo examples but, to the best of our knowledge, none of them stems from a natural split.
In the medical domain, several works use natural splits replicating the data collection process in different hospitals: the works [2, 18, 8, 72, 130, 19] respectively use the Camelyon datasets [87, 10, 9], the CheXpert dataset [65], the LIDC dataset [5], the chest X-ray dataset [76], the IXI dataset [130], and the Kaggle diabetic retinopathy detection dataset [47]. Finally, the works [3, 48, 89] use the TCGA dataset [121] by extracting the Tissue Source Site metadata.
Our work aims to give more visibility to such isolated cross-silo initiatives by regrouping seven medical datasets, some of which are listed above, into a single benchmark suite. We also provide reproducible code alongside precise benchmarking guidelines, in order to connect past and subsequent works and better monitor progress in cross-silo FL.
3 The FLamby Dataset Suite
3.1 Structure Overview
The FLamby dataset suite is a Python library organized in two main parts: datasets with corresponding baseline models, and FL strategies with associated benchmarking code. The suite is modular, with a standardized, simple application programming interface (API) for each component, enabling easy re-use and extension of the different components. Further, the suite is compatible with existing FL software libraries, such as FedML [53], Fed-BioMed [117], or Substra [44]. Listing 1 provides a code example showing how the structure of FLamby allows testing new datasets and strategies in a few lines of code, and Table 1 provides an overview of the FLamby datasets.
Dataset and baseline model. The FLamby suite contains datasets with a natural notion of client split, as well as a predefined task and associated metric. A train/test split is predefined for each client to enable reproducible comparisons. We further provide a baseline model for each task, with a reference implementation for training on pooled data. For each dataset, the suite provides documentation, metadata and helper functions to: 1. download the original pooled dataset; 2. apply preprocessing if required, making it suitable for ML training; 3. split each original pooled dataset between its natural clients; and 4. easily iterate over the preprocessed dataset. The dataset API relies on PyTorch [102], which makes it easy to iterate over the dataset with natural splits, as well as to modify these splits if needed.
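To illustrate the per-client dataset pattern described above, here is a self-contained mock following PyTorch's map-style dataset convention (`__len__`/`__getitem__`). The class name, arguments, and the 80/20 train/test split are invented for illustration only; this is not FLamby's actual API.

```python
import numpy as np

class FedDatasetMock:
    """Map-style dataset exposing one client's split of a pooled dataset,
    mimicking per-center datasets with a predefined train/test split."""

    def __init__(self, X, y, centers, center=0, train=True, train_frac=0.8, seed=0):
        # Keep only the samples naturally belonging to this center
        idx = np.where(np.asarray(centers) == center)[0]
        rng = np.random.default_rng(seed)
        idx = rng.permutation(idx)
        # Deterministic train/test split within the center
        cut = int(train_frac * len(idx))
        self.idx = idx[:cut] if train else idx[cut:]
        self.X, self.y = X, y

    def __len__(self):
        return len(self.idx)

    def __getitem__(self, i):
        j = self.idx[i]
        return self.X[j], self.y[j]
```

Because such objects behave like any map-style dataset, they can be wrapped directly in a PyTorch `DataLoader`, one per client.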
FL strategies and benchmark. FL training algorithms, called strategies in the FLamby suite, are provided for simulation purposes. In order to remain agnostic to existing FL libraries, these strategies are provided in plain Python code. The API of these strategies is standardized and compatible with the dataset API, making it easy to benchmark each strategy on each dataset. We further provide a script performing such a benchmark for illustration purposes. We stress that implementations from existing FL libraries can easily be used instead.
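To make the notion of a plain-Python strategy concrete, here is a schematic FedAvg round under simplifying assumptions (the function name and signature are our own, not FLamby's strategy API, and models are represented as flat weight vectors):

```python
import numpy as np

def fed_avg_round(global_weights, client_datasets, local_update, n_local_steps=10):
    """One FedAvg round: each client refines the global weights locally,
    then the server averages the results weighted by client dataset size.
    `local_update(weights, dataset, n_steps)` is any local training routine."""
    sizes = np.array([len(d) for d in client_datasets], dtype=float)
    local_weights = [
        local_update(global_weights.copy(), d, n_local_steps)
        for d in client_datasets
    ]
    # Size-weighted average of the locally trained models
    return np.average(local_weights, axis=0, weights=sizes)
```

Keeping the strategy decoupled from any FL framework in this way is what allows the same benchmark loop to run against every dataset in the suite.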
3.2 Datasets, Metrics and Baseline Models
We provide a brief description of each dataset in the FLamby dataset suite, which is summarized in
Table 1. In Section 3.4, we further explore the heterogeneity of each dataset, as displayed in Figure 1.
Fed-Camelyon16. Camelyon16 [87] is a histopathology dataset of 399 digitized breast biopsy slides, with or without tumor, collected from two hospitals: Radboud University Medical Center (RUMC) and University Medical Center Utrecht (UMCU). By recovering the original split information, we build a federated version of Camelyon16 with 2 clients. The task consists in the binary classification of each slide, which is challenging due to the large size of each image (10^5 × 10^5 pixels at 20X magnification), and is measured by the Area Under the ROC Curve (AUC).
As a baseline, we follow a weakly-supervised learning approach. Slides are first converted to bags of local features, which are one order of magnitude smaller in terms of memory requirements, and a model is then trained on top of this representation. For each slide, we detect tissue regions with a matter-detection network and then extract features from each tile with an ImageNet-pretrained ResNet50, following state-of-the-art practice [29, 90]. Note that, due to the imbalanced distribution of tissue across slides, a different number of features is produced for each slide: we cap the total number of tiles to 10,000 and use zero-padding for consistency. We then train a DeepMIL architecture [63], using its reference implementation [64] and hyperparameters from [33]. We refer to Appendix C for more details.
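The tile capping and zero-padding step can be sketched as follows. This is a minimal illustration; the helper name and default sizes are our own, not the benchmark's code.

```python
import numpy as np

def pad_or_cap_bag(features, max_tiles=10_000):
    """Bring a (n_tiles, n_features) bag of tile features to a fixed
    (max_tiles, n_features) shape: truncate slides with too many tiles,
    zero-pad the rest so all bags share the same dimensions."""
    n_tiles, n_features = features.shape
    if n_tiles >= max_tiles:
        return features[:max_tiles]
    pad = np.zeros((max_tiles - n_tiles, n_features), dtype=features.dtype)
    return np.concatenate([features, pad], axis=0)
```

Fixing the bag size this way is what allows slides with wildly different tissue areas to be batched together for training.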
Fed-LIDC-IDRI. LIDC-IDRI [5, 62, 23] is an image database study [23] with 1,018 CT-scans (3D images) from The Cancer Imaging Archive (TCIA), proposed in the LUNA16 competition [114]. The task consists in automatically segmenting lung nodules in CT-scans, as measured by the DICE score [36]. It is challenging because lung nodules are small, blurry, and hard to detect. By parsing the metadata of the CT-scans from the provided annotations, we recover the manufacturer of each scanning machine, which we use as a proxy for a client. We thus build a 4-client federated version of this dataset, split by manufacturer. Figure 1b displays the distribution of voxel intensities in each client.
As a baseline model, we use a VNet [98], following the implementation from [100]. This model is trained by sampling 3D volumes into 3D patches fitting in GPU memory. Details of the sampling procedure are available in Appendix D.
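The patch-based training idea can be illustrated with a minimal sketch that draws a uniformly random sub-volume. This is only a schematic; the actual sampling procedure (described in Appendix D) may weight patches, e.g., toward annotated nodules.

```python
import numpy as np

def sample_patch(volume, patch_size=(128, 128, 128), rng=None):
    """Draw a random 3D patch that fits inside `volume` (assumes the
    volume is at least patch-sized along each axis)."""
    rng = rng if rng is not None else np.random.default_rng()
    # Uniformly random corner such that the patch stays inside the volume
    corner = [rng.integers(0, s - p + 1) for s, p in zip(volume.shape, patch_size)]
    sl = tuple(slice(c, c + p) for c, p in zip(corner, patch_size))
    return volume[sl]
```

Sampling fixed-size patches like this is what lets a full-resolution CT-scan, too large for GPU memory, be consumed incrementally during training.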
Fed-IXI. This dataset is extracted from the Information eXtraction from Images (IXI) database [35], and has been previously released by Perez et al. [108, 104] under the name IXITiny. IXITiny provides a database of brain T1-weighted magnetic resonance images (MRIs) from 3 hospitals (Guys, HH, and IOP). This dataset has been adapted to a brain segmentation task by obtaining spatial brain masks using a state-of-the-art unsupervised brain segmentation tool [61]. The quality of the resulting supervised segmentation task is measured by the DICE score [36].
The image pre-processing pipeline includes volume resizing to 48 × 60 × 48 voxels and sample-wise intensity normalization. Figure 1c highlights the heterogeneity of the raw MRI intensity distributions between clients. As a baseline, we use a 3D U-net [22], following the implementation of [107]. Appendix E provides more detailed information about this dataset, including demographic information, and about the baseline.
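Sample-wise intensity normalization can be sketched as follows, assuming the common zero-mean, unit-variance convention (the pipeline's exact normalization may differ):

```python
import numpy as np

def normalize_volume(vol, eps=1e-8):
    """Sample-wise intensity normalization: standardize each volume
    to zero mean and unit variance, using its own statistics."""
    vol = np.asarray(vol, dtype=np.float64)
    return (vol - vol.mean()) / (vol.std() + eps)
```

Normalizing per sample, rather than with dataset-wide statistics, partly compensates for the inter-client intensity shifts visible in Figure 1c.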
Fed-TCGA-BRCA. The Cancer Genome Atlas (TCGA)'s Genomics Data Commons (GDC) portal [99] contains multi-modal data (tabular, 2D and 3D images) on a variety of cancers collected in many different hospitals. Here, we focus on clinical data from the BReast CAncer study (BRCA), which includes features gathered from 1,066 patients. We use the Tissue Source Site metadata to split the data based on extraction site, grouped into geographic regions to obtain large enough clients. We end up with 6 clients: USA (Northeast, South, Midwest, West), Canada and Europe, with patient counts varying from 51 to 311. The task consists in predicting survival outcomes [70] based on the patients' tabular data (39 features overall), with the event to predict being death. This survival task is akin to a ranking problem in which the score of each sample is known either directly or only by a lower bound (right censoring). The ranking is evaluated using the concordance index (C-index), which measures the percentage of correctly ranked pairs while taking censoring into account.
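The C-index just described can be computed naively as follows. This is an O(n²) sketch handling only right censoring, with tied risk scores counted as half-concordant; production code typically uses an optimized implementation.

```python
def concordance_index(times, events, scores):
    """Naive concordance index for right-censored survival data.
    `times`: observed times; `events`: 1 if death observed, 0 if censored;
    `scores`: predicted risk (higher = earlier predicted event).
    A pair (i, j) is comparable only when the earlier time is an
    observed event (otherwise their true ordering is unknown)."""
    num, den = 0.0, 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:  # i's event precedes j's time
                den += 1
                if scores[i] > scores[j]:
                    num += 1          # concordant: higher risk died earlier
                elif scores[i] == scores[j]:
                    num += 0.5        # tie counts as half
    return num / den
```

A value of 1.0 indicates perfect ranking and 0.5 is chance level, which makes the metric comparable across clients with different censoring rates.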
As a baseline, we use a linear Cox proportional hazards model [30] to predict time-to-death for patients. Figure 1e highlights the survival distribution heterogeneity between the different clients. Appendix F provides more details on this dataset.
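For reference, the objective such a linear Cox model minimizes is the negative partial log-likelihood. Below is a minimal Breslow-style sketch that ignores tied event times; the baseline's actual implementation may differ.

```python
import numpy as np

def cox_neg_partial_loglik(beta, X, times, events):
    """Negative Cox partial log-likelihood (Breslow approximation, no
    tie handling). The risk set of patient i contains all patients whose
    observed time is >= times[i]."""
    scores = X @ beta
    order = np.argsort(-np.asarray(times))   # sort by decreasing time
    scores = scores[order]
    # Cumulative log-sum-exp over scores = log of risk-set denominators
    log_risk = np.logaddexp.accumulate(scores)
    ev = np.asarray(events)[order].astype(bool)
    # Only observed events (ev=True) contribute terms
    return -(scores[ev] - log_risk[ev]).sum()
```

In a federated setting, the risk sets couple patients within a client, which is one reason survival tasks are an interesting stress test for FL strategies.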
Fed-KITS2019. The KiTS19 dataset [55, 56] stems from the Kidney Tumor Segmentation Challenge 2019 and contains CT scans of 210 patients from 79 hospitals, along with segmentation masks. We recover the hospital metadata and extract a 6-client federated version of this dataset by removing hospitals with fewer than 10 training samples. The task consists of both kidney and tumor segmentation, labeled 1 and 2 respectively, and we use the average of the kidney and tumor DICE scores [36] as our evaluation metric.
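This evaluation metric can be sketched directly from its definition: a per-class Sørensen–Dice coefficient, averaged over the kidney and tumor labels (the epsilon smoothing constant and function names are our choices):

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Sørensen–Dice coefficient between two binary masks:
    2 * |pred ∩ target| / (|pred| + |target|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def kidney_tumor_dice(pred_labels, true_labels):
    """Average of kidney (label 1) and tumor (label 2) DICE scores,
    mirroring the evaluation described above."""
    return float(np.mean([dice(pred_labels == c, true_labels == c) for c in (1, 2)]))
```

Averaging the two per-class scores prevents the much larger kidney masks from dominating the metric over the small tumor regions.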