FLamby: Datasets and Benchmarks for Cross-Silo
Federated Learning in Realistic Healthcare Settings
Jean Ogier du Terrail1, Samy-Safwan Ayed2, Edwige Cyffers3, Felix Grimberg4,
Chaoyang He5, Regis Loeb1, Paul Mangold3, Tanguy Marchand1,
Othmane Marfoq2, Erum Mushtaq6, Boris Muzellec1, Constantin Philippenko7,
Santiago Silva2, Maria Teleńczuk1, Shadi Albarqouni8,9, Salman Avestimehr5,6,
Aurélien Bellet3, Aymeric Dieuleveut7, Martin Jaggi4,
Sai Praneeth Karimireddy10, Marco Lorenzi2, Giovanni Neglia2, Marc Tommasi3,
Mathieu Andreux1
1Owkin, Inc, 2Inria, Université Côte d’Azur, Sophia Antipolis, France
3Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France
4EPFL 5FedML, Inc. 6University of Southern California
7CMAP, UMR 7641, École Polytechnique, Institut Polytechnique de Paris
8University Hospital Bonn 9Helmholtz Munich 10University of California, Berkeley
{jean.du-terrail, regis.loeb, tanguy.marchand, boris.muzellec,
maria.telenczuk, mathieu.andreux}@owkin.com, {samy-safwan.ayed,
edwige.cyffers, paul.mangold, othmane.marfoq,
santiago-smith.silva-rincon, aurelien.bellet,
marco.lorenzi, giovanni.neglia, marc.tommasi}@inria.fr
{felix.grimberg, martin.jaggi}@epfl.ch,
ch@fedml.ai, {emushtaq, avestime}@usc.edu,
{constantin.philippenko, aymeric.dieuleveut}@polytechnique.edu,
shadi.albarqouni@ukbonn.de, sp.karimireddy@berkeley.edu
Abstract
Federated Learning (FL) is a novel approach enabling several clients holding
sensitive data to collaboratively train machine learning models, without centralizing
data. The cross-silo FL setting corresponds to the case of a few (2–50) reliable clients,
each holding medium to large datasets, and is typically found in applications such as
healthcare, finance, or industry. While previous works have proposed representative
datasets for cross-device FL, few realistic healthcare cross-silo FL datasets exist,
thereby slowing algorithmic research in this critical application. In this work, we
propose a novel cross-silo dataset suite focused on healthcare, FLamby (Federated
Learning AMple Benchmark of Your cross-silo strategies), to bridge the gap
between theory and practice of cross-silo FL. FLamby encompasses 7 healthcare
datasets with natural splits, covering multiple tasks, modalities, and data volumes,
each accompanied by baseline training code. As an illustration, we additionally
benchmark standard FL algorithms on all datasets. Our flexible and modular suite
allows researchers to easily download datasets, reproduce results and re-use the
different components for their research. FLamby is available at www.github.com/owkin/flamby.
Preprint. Under review.
arXiv:2210.04620v3 [cs.LG] 5 May 2023

1 Introduction

Recently it has become clear that, in many application fields, impressive machine learning (ML) task performance can be reached by scaling the size of both ML models and their training data while keeping existing well-performing architectures mostly unaltered [118, 74, 21, 109]. In this context,
it is often assumed that massive training datasets can be collected and centralized in a single client
in order to maximize performance. However, in many application domains, data collection occurs
in distinct sites (further referred to as clients, e.g., mobile devices or hospitals), and the resulting
local datasets cannot be shared with a central repository or data center due to privacy or strategic
concerns [39, 15].
To enable cooperation among clients given such constraints, Federated Learning (FL) [97, 71] has emerged as a viable alternative for training models across data providers without sharing sensitive data. While initially developed to enable training across a large number of small clients, such as smartphones or Internet of Things (IoT) devices, it has since been extended to collaborations among fewer and larger clients, such as banks or hospitals. The two settings are now respectively referred to as cross-device FL and cross-silo FL, each associated with specific use cases and challenges [71].
On the one hand, cross-device FL leverages edge devices such as mobile phones and wearable technologies to exploit data distributed over billions of data sources [97, 13, 11, 101]. It therefore often requires solving problems related to edge computing [51, 85, 129], participant selection [71, 131, 20, 42], system heterogeneity [71], and communication constraints such as low network bandwidth and high latency [113, 91, 49]. On the other hand, cross-silo initiatives unlock the potential of large datasets previously out of reach. This is especially true in healthcare, where the emergence of federated networks of private and public actors [112, 115, 103] allows scientists, for the first time, to gather enough data to tackle open questions on poorly understood diseases such as triple-negative breast cancer [37] or COVID-19 [31]. In cross-silo applications, each silo has large computational power, relatively high bandwidth, and a stable network connection, allowing it to participate in the whole training phase. However, cross-silo FL is typically characterized by high inter-client dataset heterogeneity and biases of various types across clients [103, 37].
As we show in Section 2, publicly available datasets for the cross-silo FL setting are scarce. As a consequence, researchers usually rely on heuristics to artificially generate heterogeneous data partitions from a single dataset and assign them to hypothetical clients. Such heuristics might fall short of replicating the complexity of natural heterogeneity found in real-world datasets. The example of digital histopathology [126], a crucial data type in cancer research, illustrates the potential limitations of such synthetic partition methods. In digital histopathology, tissue samples are extracted from patients, stained, and finally digitized. In this process, known factors of data heterogeneity across hospitals include patient demographics, staining techniques, storage methodologies of the physical slides, and digitization processes [69, 43, 57]. Although staining normalization [79, 32] has seen recent progress, mitigating this source of heterogeneity, the other highlighted sources of heterogeneity are difficult to replicate with synthetic partitioning [57] and some may be unknown, which calls for actual cross-silo cohort experiments. This observation also holds for many other application domains, e.g., radiology [50], dermatology [7], retinal images [7] and, more generally, computer vision [122].
In order to address the lack of realistic cross-silo datasets, we propose FLamby, an open-source cross-silo federated dataset suite with natural partitions focused on healthcare, accompanied by code examples and benchmarking guidelines. Our ambition is that FLamby becomes the reference benchmark for cross-silo FL, as LEAF [16] is for cross-device FL. To the best of our knowledge, apart from some promising isolated works to build realistic cross-silo FL datasets (see Section 2), our work is the first standard benchmark allowing the systematic study of healthcare cross-silo FL on different data modalities and tasks.
To summarize, our contributions are threefold:
1. We build an open-source federated cross-silo healthcare dataset suite including 7 datasets. These datasets cover different tasks (classification / segmentation / survival) in multiple application domains, with different data modalities and scales. Crucially, all datasets are partitioned using natural splits.
2. We provide guidelines to help compare FL strategies in a fair and reproducible manner, and provide illustrative results for this benchmark.
3. We make open-source code available for benchmark reproducibility and easy integration in different FL frameworks, and to allow the research community to contribute to FLamby's development by adding more datasets, benchmark types, and FL strategies.
Table 1: Overview of the datasets, tasks, metrics and baseline models in FLamby. For Fed-Camelyon16, the two different sizes refer to the size of the dataset before and after tiling.

Fed-Camelyon16. Input (x): slides. Preprocessing: matter extraction + tiling. Task type: binary classification. Prediction (y): tumor on slide. Center extraction: hospital. Original paper: Litjens et al. 2018. # clients: 2. # examples: 399 (per center: 239, 150). Model: DeepMIL [64]. Metric: AUC. Size: 50G (850G total). Image resolution: 0.5 µm / pixel. Input dimension: 10,000 × 2048.

Fed-LIDC-IDRI. Input (x): CT-scans. Preprocessing: patch sampling. Task type: 3D segmentation. Prediction (y): lung nodule mask. Center extraction: scanner manufacturer. Original paper: Armato et al. 2011. # clients: 4. # examples: 1,018 (per center: 670, 205, 69, 74). Model: VNet [98, 100]. Metric: DICE. Size: 115G. Image resolution: 1.0 × 1.0 × 1.0 mm / voxel. Input dimension: 128 × 128 × 128.

Fed-IXI. Input (x): T1WI (T1-weighted MRI). Preprocessing: registration. Task type: 3D segmentation. Prediction (y): brain mask. Center extraction: hospital. Original paper: Perez et al. 2021. # clients: 3. # examples: 566 (per center: 311, 181, 74). Model: 3D U-net [22]. Metric: DICE. Size: 444M. Image resolution: 1.0 × 1.0 × 1.0 mm / voxel. Input dimension: 48 × 60 × 48.

Fed-TCGA-BRCA. Input (x): patient info. Preprocessing: none. Task type: survival. Prediction (y): risk of death. Center extraction: group of hospitals. Original paper: Liu et al. 2018. # clients: 6. # examples: 1,088 (per center: 311, 196, 206, 162, 162, 51). Model: Cox model [30]. Metric: C-index. Size: 115K. Image resolution: NA. Input dimension: 39.

Fed-KITS2019. Input (x): CT-scans. Preprocessing: patch sampling. Task type: 3D segmentation. Prediction (y): kidney and tumor masks. Center extraction: group of hospitals. Original paper: Heller et al. 2019. # clients: 6. # examples: 96 (per center: 12, 14, 12, 12, 16, 30). Model: nnU-Net [67]. Metric: DICE. Size: 54G. Image resolution: 1.0 × 1.0 × 1.0 mm / voxel. Input dimension: 64 × 192 × 192.

Fed-ISIC2019. Input (x): dermoscopy. Preprocessing: various image transforms. Task type: multi-class classification. Prediction (y): melanoma class. Center extraction: hospital. Original papers: Tschandl et al. 2018 / Codella et al. 2017 / Combalia et al. 2019. # clients: 6. # examples: 23,247 (per center: 12413, 3954, 3363, 2259, 819, 439). Model: efficientnet [119] + linear layer. Metric: balanced accuracy. Size: 9G. Image resolution: 0.02 mm / pixel. Input dimension: 200 × 200 × 3.

Fed-Heart-Disease. Input (x): patient info. Preprocessing: removing missing data. Task type: binary classification. Prediction (y): heart disease. Center extraction: hospital. Original paper: Janosi et al. 1988. # clients: 4. # examples: 740 (per center: 303, 261, 46, 130). Model: logistic regression. Metric: accuracy. Size: 40K. Image resolution: NA. Input dimension: 13.
This paper is organized as follows. Section 2 reviews existing FL datasets and benchmarks, as well as
client partition methods used to artificially introduce data heterogeneity. In Section 3, we describe our
dataset suite in detail, notably its structure and the intrinsic heterogeneity of each federated dataset.
Finally, we define a benchmark of several FL strategies on all datasets and provide results thereof in
Section 4.
2 Related Work
In FL, data is collected locally in clients, under different conditions and without coordination. As a consequence, clients' datasets differ both in size (unbalanced) and in distribution (non-IID) [97]. The resulting statistical heterogeneity is a fundamental challenge in FL [82, 71], and it is necessary to take it into consideration when evaluating FL algorithms. Most FL papers simulate statistical heterogeneity by artificially partitioning classic datasets, e.g., CIFAR-10/100 [78], MNIST [81] or ImageNet [34], over a given number of clients. Common approaches to produce synthetic partitions of classification datasets include associating samples from a limited number of classes to each client [97], Dirichlet sampling on the class labels [59, 133], and using the Pachinko Allocation Method (PAM) [84, 110] (which is only possible when the labels have a hierarchical structure). In the case of regression tasks, [105] partitions the superconduct dataset [17] across 20 clients using Gaussian Mixture clustering based on t-SNE representations [124] of the features. Such synthetic partition approaches may fall short of modelling the complex statistical heterogeneity of real federated datasets. Evaluating FL strategies on datasets with natural client splits is a safer approach to ensuring that new strategies address real-world issues.
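For concreteness, the Dirichlet-based label partitioning mentioned above can be sketched in a few lines. This is a generic illustration (the function name and defaults are our own, not taken from any of the cited works): for each class, client proportions are drawn from a Dirichlet distribution, so that a small concentration parameter yields highly skewed, non-IID partitions.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients by drawing, for each class,
    client proportions from a Dirichlet(alpha) distribution.
    Smaller alpha -> more heterogeneous (non-IID) partitions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Proportions of class-c samples assigned to each client
        props = rng.dirichlet(alpha * np.ones(n_clients))
        # Convert cumulative proportions to split points within this class
        cuts = (np.cumsum(props) * len(idx)).astype(int)[:-1]
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return [np.array(ix) for ix in client_indices]
```

Note that, unlike a natural split, every statistical property of such a partition is controlled by the single parameter alpha, which is precisely why it may miss real-world heterogeneity.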
For cross-device FL, the LEAF dataset suite [16] includes five datasets with natural partitions, spanning a wide range of machine learning tasks: natural language modeling (Reddit [127]), next-character prediction (Shakespeare [97]), sentiment analysis (Sent140 [45]), image classification (CelebA [88]) and handwritten-character recognition (FEMNIST [25]). TensorFlow Federated [12] complements LEAF and provides three additional naturally split federated benchmarks, i.e., StackOverflow [120], Google Landmark v2 [60] and iNaturalist [125]. Further, FLSim [111] provides cross-device examples based on LEAF and CIFAR-10 [78] with a synthetic split, and FedScale [80] introduces a large FL benchmark focused on mobile applications. Apart from iNaturalist, the aforementioned datasets target the cross-device setting.
To the best of our knowledge, no extensive benchmark with natural splits is available for cross-silo FL. However, some standalone works built cross-silo datasets with real partitions. [46] and [95] partition Cityscapes [27] and iNaturalist [125], respectively, exploiting the geolocation of the picture acquisition site. [58] releases a real-world, geo-tagged dataset of common mammals on Flickr. [92] gathers a federated cross-silo benchmark for object detection created using street cameras. [28] partitions the Vehicle Sensor dataset [38] and the Human Activity Recognition dataset [4] by sensor and by individual, respectively. [93] builds an iris recognition federated dataset across five clients using multiple iris datasets [128, 135, 136, 106]. While FedML [53] introduces several cross-silo benchmarks [54, 132, 52], the related client splits are synthetically obtained with Dirichlet sampling and are not based on a natural split. Similarly, FATE [41] provides several cross-silo examples but, to the best of our knowledge, none of them stems from a natural split.
In the medical domain, several works use natural splits replicating the data collection process in different hospitals: the works [2, 18, 8, 72, 130, 19] respectively use the Camelyon datasets [87, 10, 9], the CheXpert dataset [65], the LIDC dataset [5], the chest X-ray dataset [76], the IXI dataset [130], and the Kaggle diabetic retinopathy detection dataset [47]. Finally, the works [3, 48, 89] use the TCGA dataset [121] by extracting the Tissue Source Site metadata.
Our work aims to give more visibility to such isolated cross-silo initiatives by regrouping seven medical datasets, some of which are listed above, into a single benchmark suite. We also provide reproducible code alongside precise benchmarking guidelines, in order to connect past and subsequent works and better monitor progress in cross-silo FL.
3 The FLamby Dataset Suite
3.1 Structure Overview
The FLamby dataset suite is a Python library organized in two main parts: datasets with corresponding baseline models, and FL strategies with associated benchmarking code. The suite is modular, with a standardized, simple application programming interface (API) for each component, enabling easy re-use and extension of the different components. Further, the suite is compatible with existing FL software libraries, such as FedML [53], Fed-BioMed [117], or Substra [44]. Listing 1 provides a code example showing how the structure of FLamby allows testing new datasets and strategies in a few lines of code, and Table 1 provides an overview of the FLamby datasets.
Dataset and baseline model. The FLamby suite contains datasets with a natural notion of client split, as well as a predefined task and associated metric. A train/test split is predefined for each client to enable reproducible comparisons. We further provide a baseline model for each task, with a reference implementation for training on pooled data. For each dataset, the suite provides documentation, metadata and helper functions to: 1. download the original pooled dataset; 2. apply preprocessing if required, making it suitable for ML training; 3. split each original pooled dataset between its natural clients; and 4. easily iterate over the preprocessed dataset. The dataset API relies on PyTorch [102], which makes it easy to iterate over the dataset with natural splits, as well as to modify these splits if needed.
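To illustrate the per-client dataset pattern described above, here is a self-contained mock following PyTorch's map-style dataset convention (`__len__`/`__getitem__`). The class name, arguments, and the 80/20 train/test split are invented for illustration only; this is not FLamby's actual API.

```python
import numpy as np

class FedDatasetMock:
    """Map-style dataset exposing one client's split of a pooled dataset,
    mimicking per-center datasets with a predefined train/test split."""

    def __init__(self, X, y, centers, center=0, train=True, train_frac=0.8, seed=0):
        # Keep only the samples naturally belonging to this center
        idx = np.where(np.asarray(centers) == center)[0]
        rng = np.random.default_rng(seed)
        idx = rng.permutation(idx)
        # Deterministic train/test split within the center
        cut = int(train_frac * len(idx))
        self.idx = idx[:cut] if train else idx[cut:]
        self.X, self.y = X, y

    def __len__(self):
        return len(self.idx)

    def __getitem__(self, i):
        j = self.idx[i]
        return self.X[j], self.y[j]
```

Because such objects behave like any map-style dataset, they can be wrapped directly in a PyTorch `DataLoader`, one per client.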
FL strategies and benchmark. FL training algorithms, called strategies in the FLamby suite, are provided for simulation purposes. In order to remain agnostic to existing FL libraries, these strategies are provided in plain Python code. The API of these strategies is standardized and compatible with the dataset API, making it easy to benchmark each strategy on each dataset. We further provide a script performing such a benchmark for illustration purposes. We stress that implementations from existing FL libraries can easily be used instead.
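To make the notion of a plain-Python strategy concrete, here is a schematic FedAvg round under simplifying assumptions (the function name and signature are our own, not FLamby's strategy API, and models are represented as flat weight vectors):

```python
import numpy as np

def fed_avg_round(global_weights, client_datasets, local_update, n_local_steps=10):
    """One FedAvg round: each client refines the global weights locally,
    then the server averages the results weighted by client dataset size.
    `local_update(weights, dataset, n_steps)` is any local training routine."""
    sizes = np.array([len(d) for d in client_datasets], dtype=float)
    local_weights = [
        local_update(global_weights.copy(), d, n_local_steps)
        for d in client_datasets
    ]
    # Size-weighted average of the locally trained models
    return np.average(local_weights, axis=0, weights=sizes)
```

Keeping the strategy decoupled from any FL framework in this way is what allows the same benchmark loop to run against every dataset in the suite.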
3.2 Datasets, Metrics and Baseline Models
We provide a brief description of each dataset in the FLamby dataset suite, which is summarized in
Table 1. In Section 3.4, we further explore the heterogeneity of each dataset, as displayed in Figure 1.
Fed-Camelyon16. Camelyon16 [87] is a histopathology dataset of 399 digitized breast biopsy slides, with or without tumor, collected from two hospitals: Radboud University Medical Center (RUMC) and University Medical Center Utrecht (UMCU). By recovering the original split information, we build a federated version of Camelyon16 with 2 clients. The task consists in the binary classification of each slide, which is challenging due to the large size of each image (10^5 × 10^5 pixels at 20X magnification), and is measured by the Area Under the ROC Curve (AUC).
As a baseline, we follow a weakly-supervised learning approach. Slides are first converted to bags of local features, which are one order of magnitude smaller in terms of memory requirements, and a model is then trained on top of this representation. For each slide, we detect tissue regions with a matter-detection network and then extract features from each tile with an ImageNet-pretrained ResNet50, following state-of-the-art practice [29, 90]. Note that, due to the imbalanced distribution of tissue across slides, a different number of features is produced for each slide: we cap the total number of tiles to 10,000 and use zero-padding for consistency. We then train a DeepMIL architecture [63], using its reference implementation [64] and hyperparameters from [33]. We refer to Appendix C for more details.
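The tile capping and zero-padding step can be sketched as follows. This is a minimal illustration; the helper name and default sizes are our own, not the benchmark's code.

```python
import numpy as np

def pad_or_cap_bag(features, max_tiles=10_000):
    """Bring a (n_tiles, n_features) bag of tile features to a fixed
    (max_tiles, n_features) shape: truncate slides with too many tiles,
    zero-pad the rest so all bags share the same dimensions."""
    n_tiles, n_features = features.shape
    if n_tiles >= max_tiles:
        return features[:max_tiles]
    pad = np.zeros((max_tiles - n_tiles, n_features), dtype=features.dtype)
    return np.concatenate([features, pad], axis=0)
```

Fixing the bag size this way is what allows slides with wildly different tissue areas to be batched together for training.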
Fed-LIDC-IDRI. LIDC-IDRI [5, 62, 23] is an image database study [23] with 1,018 CT-scans (3D images) from The Cancer Imaging Archive (TCIA), proposed in the LUNA16 competition [114]. The task consists in automatically segmenting lung nodules in CT-scans, as measured by the DICE score [36]. It is challenging because lung nodules are small, blurry, and hard to detect. By parsing the metadata of the CT-scans from the provided annotations, we recover the manufacturer of each scanning machine, which we use as a proxy for a client. We thus build a 4-client federated version of this dataset, split by manufacturer. Figure 1b displays the distribution of voxel intensities in each client.
As a baseline model, we use a VNet [98], following the implementation from [100]. This model is trained by sampling 3D volumes into 3D patches fitting in GPU memory. Details of the sampling procedure are available in Appendix D.
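The patch-based training idea can be illustrated with a minimal sketch that draws a uniformly random sub-volume. This is only a schematic; the actual sampling procedure (described in Appendix D) may weight patches, e.g., toward annotated nodules.

```python
import numpy as np

def sample_patch(volume, patch_size=(128, 128, 128), rng=None):
    """Draw a random 3D patch that fits inside `volume` (assumes the
    volume is at least patch-sized along each axis)."""
    rng = rng if rng is not None else np.random.default_rng()
    # Uniformly random corner such that the patch stays inside the volume
    corner = [rng.integers(0, s - p + 1) for s, p in zip(volume.shape, patch_size)]
    sl = tuple(slice(c, c + p) for c, p in zip(corner, patch_size))
    return volume[sl]
```

Sampling fixed-size patches like this is what lets a full-resolution CT-scan, too large for GPU memory, be consumed incrementally during training.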
Fed-IXI. This dataset is extracted from the Information eXtraction from Images (IXI) database [35], and has been previously released by Perez et al. [108, 104] under the name IXITiny. IXITiny provides a database of brain T1-weighted magnetic resonance images (MRIs) from 3 hospitals (Guys, HH, and IOP). This dataset has been adapted to a brain segmentation task by obtaining spatial brain masks using a state-of-the-art unsupervised brain segmentation tool [61]. The quality of the resulting supervised segmentation task is measured by the DICE score [36].
The image pre-processing pipeline includes volume resizing to 48 × 60 × 48 voxels and sample-wise intensity normalization. Figure 1c highlights the heterogeneity of the raw MRI intensity distributions between clients. As a baseline, we use a 3D U-net [22], following the implementation of [107]. Appendix E provides more detailed information about this dataset, including demographic information, and about the baseline.
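Sample-wise intensity normalization can be sketched as follows, assuming the common zero-mean, unit-variance convention (the pipeline's exact normalization may differ):

```python
import numpy as np

def normalize_volume(vol, eps=1e-8):
    """Sample-wise intensity normalization: standardize each volume
    to zero mean and unit variance, using its own statistics."""
    vol = np.asarray(vol, dtype=np.float64)
    return (vol - vol.mean()) / (vol.std() + eps)
```

Normalizing per sample, rather than with dataset-wide statistics, partly compensates for the inter-client intensity shifts visible in Figure 1c.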
Fed-TCGA-BRCA. The Cancer Genome Atlas (TCGA)'s Genomics Data Commons (GDC) portal [99] contains multi-modal data (tabular, 2D and 3D images) on a variety of cancers collected in many different hospitals. Here, we focus on clinical data from the BReast CAncer study (BRCA), which includes features gathered from 1,066 patients. We use the Tissue Source Site metadata to split the data based on extraction site, grouped into geographic regions to obtain large enough clients. We end up with 6 clients: USA (Northeast, South, Midwest, West), Canada and Europe, with patient counts varying from 51 to 311. The task consists in predicting survival outcomes [70] based on the patients' tabular data (39 features overall), with the event to predict being death. This survival task is akin to a ranking problem in which the score of each sample is known either directly or only by a lower bound (right censoring). The ranking is evaluated using the concordance index (C-index), which measures the percentage of correctly ranked pairs while taking censoring into account.
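The C-index just described can be computed naively as follows. This is an O(n²) sketch handling only right censoring, with tied risk scores counted as half-concordant; production code typically uses an optimized implementation.

```python
def concordance_index(times, events, scores):
    """Naive concordance index for right-censored survival data.
    `times`: observed times; `events`: 1 if death observed, 0 if censored;
    `scores`: predicted risk (higher = earlier predicted event).
    A pair (i, j) is comparable only when the earlier time is an
    observed event (otherwise their true ordering is unknown)."""
    num, den = 0.0, 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:  # i's event precedes j's time
                den += 1
                if scores[i] > scores[j]:
                    num += 1          # concordant: higher risk died earlier
                elif scores[i] == scores[j]:
                    num += 0.5        # tie counts as half
    return num / den
```

A value of 1.0 indicates perfect ranking and 0.5 is chance level, which makes the metric comparable across clients with different censoring rates.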
As a baseline, we use a linear Cox proportional hazards model [30] to predict time-to-death for patients. Figure 1e highlights the survival distribution heterogeneity between the different clients. Appendix F provides more details on this dataset.
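For reference, the objective such a linear Cox model minimizes is the negative partial log-likelihood. Below is a minimal Breslow-style sketch that ignores tied event times; the baseline's actual implementation may differ.

```python
import numpy as np

def cox_neg_partial_loglik(beta, X, times, events):
    """Negative Cox partial log-likelihood (Breslow approximation, no
    tie handling). The risk set of patient i contains all patients whose
    observed time is >= times[i]."""
    scores = X @ beta
    order = np.argsort(-np.asarray(times))   # sort by decreasing time
    scores = scores[order]
    # Cumulative log-sum-exp over scores = log of risk-set denominators
    log_risk = np.logaddexp.accumulate(scores)
    ev = np.asarray(events)[order].astype(bool)
    # Only observed events (ev=True) contribute terms
    return -(scores[ev] - log_risk[ev]).sum()
```

In a federated setting, the risk sets couple patients within a client, which is one reason survival tasks are an interesting stress test for FL strategies.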
Fed-KITS2019. The KiTS19 dataset [55, 56] stems from the Kidney Tumor Segmentation Challenge 2019 and contains CT scans of 210 patients from 79 hospitals, along with segmentation masks. We recover the hospital metadata and extract a 6-client federated version of this dataset by removing hospitals with fewer than 10 training samples. The task consists of both kidney and tumor segmentation, labeled 1 and 2 respectively, and we use the average of the kidney and tumor DICE scores [36] as our evaluation metric.
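This evaluation metric can be sketched directly from its definition: a per-class Sørensen–Dice coefficient, averaged over the kidney and tumor labels (the epsilon smoothing constant and function names are our choices):

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Sørensen–Dice coefficient between two binary masks:
    2 * |pred ∩ target| / (|pred| + |target|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def kidney_tumor_dice(pred_labels, true_labels):
    """Average of kidney (label 1) and tumor (label 2) DICE scores,
    mirroring the evaluation described above."""
    return float(np.mean([dice(pred_labels == c, true_labels == c) for c in (1, 2)]))
```

Averaging the two per-class scores prevents the much larger kidney masks from dominating the metric over the small tumor regions.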