decrease data variability and introduce correlation between samples [5, 6, 7]. As such, alternative approaches based on generative adversarial networks (GAN) are gaining ground [8, 9, 10, 11, 12].
However, the generation of high-dimensional time-series data remains a challenging task [13, 14, 15]. In this work we propose a new generative architecture, Conditional Augmentation GAN (CA-GAN), based on the Wasserstein GAN with Gradient Penalty [16, 17] as presented in Health Gym [18] (referred to in this paper as WGAN-GP*), but with a different objective. Instead of generating new synthetic datasets, we focus on data augmentation, specifically augmenting the minority class to mitigate data poverty. We compare the performance of our CA-GAN with WGAN-GP* and SMOTE in augmenting the data of patients of an underrepresented ethnicity (Black patients in our case), using a critical care dataset of 3343 hypotensive patients derived from the MIMIC-III database [19, 20].
Contributions.
(1) We propose CA-GAN, a new architecture for data augmentation that addresses some of the shortcomings of traditional and recent approaches to high-dimensional, time-series synthetic data generation. (2) We compare CA-GAN qualitatively and quantitatively with a state-of-the-art architecture in the synthesis of multivariate clinical time series. (3) We also compare CA-GAN with SMOTE, a naive but effective and popular resampling method, demonstrating the superior performance of generative models in generalisation and in the synthesis of authentic data. (4) We show that CA-GAN can synthesise realistic data that augments the real data when used in a downstream predictive task.
2 Methods
2.1 Problem Formulation
Let $A$ be a vector space of features and let $a \in A$. Let $l$ be a binary mask, extracted from $L = \{0, 1\}$, acting as a distribution modifier. Consider the dataset $D_0 = \{a_n\}_{n=1}^{N}$ with $l = 0$, with individual samples indexed by $n \in \{1, \ldots, N\}$, and $D_1 = \{a_m\}_{m=N+1}^{N+M}$ with $l = 1$, with individual samples indexed by $m \in \{N+1, \ldots, N+M\}$, where $N > M$. Then, consider the dataset $D = D_0 \cup D_1$ as our training dataset. Notation inspired by [21].
Our goals. We want to learn a density $\hat{d}\{A\}$ that best approximates $d\{A\}$, the true distribution of $D$. We define $\hat{d}_1\{A\}$ as $\hat{d}\{A\}$ with $l = 1$ applied. From the modified distribution $\hat{d}_1\{A\}$ we draw random variables $X$ and add these to $D_1$ until $N = M$.
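As a concrete illustration of this formulation, the sketch below (our own, not from the paper; the array shapes are arbitrary, and a Gaussian fit stands in for the learned density $\hat{d}_1\{A\}$, which in CA-GAN would be the trained generator) builds $D = D_0 \cup D_1$ with the binary mask $l$ and oversamples the minority set until the classes are balanced:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature space A: vectors in R^4 (dimension chosen for illustration).
N, M, d = 100, 30, 4                      # majority size N > minority size M
D0 = rng.normal(0.0, 1.0, size=(N, d))    # samples with mask l = 0
D1 = rng.normal(2.0, 1.0, size=(M, d))    # samples with mask l = 1 (minority)

# Training set D = D0 ∪ D1, with the binary mask l kept per sample.
D = np.vstack([D0, D1])
l = np.concatenate([np.zeros(N, dtype=int), np.ones(M, dtype=int)])

# Stand-in for the learned density d̂₁{A}: a Gaussian fitted to D1.
# In CA-GAN this sampler would be replaced by the trained generator.
mu, sigma = D1.mean(axis=0), D1.std(axis=0)
def sample_d1_hat(k):
    return rng.normal(mu, sigma, size=(k, d))

# Draw synthetic minority samples X and add them to D1 until N = M.
X = sample_d1_hat(N - M)
D1_aug = np.vstack([D1, X])
assert len(D1_aug) == len(D0)             # classes are now balanced
```

The balancing step only ever grows the minority set; the majority set $D_0$ is left untouched, matching the augmentation (rather than full-synthesis) objective stated above.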
2.2 CGAN vs GAN
The Generative Adversarial Network (GAN) [22] comprises two components, a generator and a discriminator. The generator $G$ is fed a noise vector $z$ drawn from a latent distribution $p_z$ and outputs a sample of synthetic data. The discriminator $D$ takes as input either fake samples created by the generator or real samples $x$ drawn from the true data distribution $p_{data}$. Hence, the GAN can be represented by the following minimax loss function:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
The goal of the discriminator is to maximise the probability of discerning fake from real data, whilst the goal of the generator is to make samples realistic enough to fool the discriminator, i.e. to minimise $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$. As a result of this reciprocal competition, both the generator and the discriminator improve during training.
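To make the minimax objective concrete, the toy sketch below (ours, not the paper's; the fixed closed-form "generator" and "discriminator" are illustrative stand-ins for trainable networks) estimates the two expectation terms of $V(D, G)$ by Monte Carlo sampling:

```python
import numpy as np

rng = np.random.default_rng(1)

# True data distribution p_data: standard normal; latent prior p_z: uniform.
def sample_pdata(n): return rng.normal(0.0, 1.0, size=n)
def sample_pz(n):    return rng.uniform(-1.0, 1.0, size=n)

# Illustrative fixed generator and discriminator (stand-ins for networks).
def G(z): return 2.0 * z                     # maps latent noise to a sample
def D(x): return 1.0 / (1.0 + np.exp(-x))    # sigmoid "realness" score in (0, 1)

def V(n=100_000):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    x, z = sample_pdata(n), sample_pz(n)
    return np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))

v_hat = V()  # the discriminator ascends V; the generator descends the 2nd term
```

Because both log terms are strictly negative for a sigmoid discriminator, the estimate is always below zero; training moves $D$ to push it up and $G$ to pull the second term down.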
The limitations of vanilla GAN models become evident when working with highly imbalanced datasets, where there may not be sufficient minority-class samples to train the models to generate them. A modified version of the GAN, the Conditional GAN [23], solves this problem by conditioning both the generator and the discriminator on labels $y$. The additional information $y$ partitions the generation and the discrimination into distinct classes. The model can thus be trained on the whole dataset and then be used to generate only minority-class samples. Hence, the loss function is modified as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))]$$
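In practice, the conditioning $D(x \mid y)$ and $G(z \mid y)$ is commonly implemented by concatenating a one-hot label vector to the discriminator's data input and the generator's noise input. A minimal sketch of that input construction (our illustration under assumed one-hot encoding and feed-forward shapes, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(2)

n_classes, z_dim, x_dim, batch = 3, 8, 5, 4

def one_hot(y, n):
    out = np.zeros((len(y), n))
    out[np.arange(len(y)), y] = 1.0
    return out

y = rng.integers(0, n_classes, size=batch)       # class labels (e.g. minority id)
z = rng.normal(size=(batch, z_dim))              # latent noise for G
x = rng.normal(size=(batch, x_dim))              # real samples for D

# Conditioning by concatenation: G sees [z | y], D sees [x | y].
g_input = np.hstack([z, one_hot(y, n_classes)])  # shape (batch, z_dim + n_classes)
d_input = np.hstack([x, one_hot(y, n_classes)])  # shape (batch, x_dim + n_classes)
```

At sampling time, fixing $y$ to the minority-class label steers the trained generator to produce only minority-class samples, which is the mechanism CA-GAN's augmentation objective relies on.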
Overall, GAN and CGAN share the same major weaknesses during training, namely mode collapse and vanishing gradients [24]. In addition, as GANs were initially designed to generate images, they have been shown to be unsuitable for generating time series [21] and discrete data samples [25].