Mitigating Health Data Poverty: Generative
Approaches versus Resampling for Time-series
Clinical Data
Raffaele Marchesi†,1,2, Nicolo Micheletti†,1,3, Giuseppe Jurman1, Venet Osmani1
raffaele.marchesi@studenti.unitn.it
nicolo.micheletti@student.manchester.ac.uk
jurman@fbk.eu, vosmani@fbk.eu
† Equal contribution
1 Fondazione Bruno Kessler Research Institute, Trento, Italy
2 University of Trento, 3 University of Manchester
Abstract
Several approaches have been developed to mitigate algorithmic bias stemming
from health data poverty, where minority groups are underrepresented in training
datasets. Augmenting the minority class using resampling (such as SMOTE)
is a widely used approach due to the simplicity of the algorithms. However,
these algorithms decrease data variability and may introduce correlations between
samples, giving rise to the use of generative approaches based on GANs. Generation
of high-dimensional, time-series, authentic data that provides wide distribution
coverage of the real data remains a challenging task for both resampling and
GAN-based approaches. In this work we propose the CA-GAN architecture, which addresses
some of the shortcomings of current approaches, and we provide a detailed
comparison with both SMOTE and WGAN-GP*, using a high-dimensional, time-series,
real dataset of 3343 hypotensive Caucasian and Black patients. We show
that our approach is better at both generating authentic data of the minority class
and remaining within the original distribution of the real data.
1 Introduction
As machine learning methods increasingly weave themselves into societal decision making, critical
issues related to decision fairness and algorithmic bias are coming to light. These issues are especially
prominent in health and clinical decision making, where underprivileged and minority groups are
underrepresented, resulting in unfair decisions. Algorithmic bias can originate from diverse sources,
including health data poverty [1], where particular groups might be underrepresented in the training
sets, but it may also originate from procedural care practices, wider socioeconomic issues or the data
itself [2]. There have been several attempts to address bias and improve fairness stemming from health data
poverty. One approach is data augmentation, where synthetic data are generated from unbalanced
datasets, mitigating the underrepresentation of the minority class.
NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research. arXiv:2210.13958v2 [cs.LG], 26 Oct 2022.
The machine learning community has developed various approaches to generate synthetic data [3].
One of the widely used methods is data resampling, where the data from the minority class are
typically oversampled to generate additional synthetic data, with the Synthetic Minority Over-sampling
TEchnique (SMOTE) [4] being a representative example. Each synthetic sample lies between a randomly
selected sample and one of its nearest neighbours, chosen at random (using k-NN), resulting in plausible samples
close in feature space to the existing samples. SMOTE and related approaches are widely used due
to their simplicity and computational efficiency. However, in high-dimensional data SMOTE may
decrease data variability and introduce correlation between samples [5, 6, 7]. As such, alternative
approaches based on generative adversarial networks (GAN) are gaining ground [8, 9, 10, 11, 12].
However, generation of high-dimensional time-series data remains a challenging task [13, 14, 15]. In
this work we propose a new generative architecture, Conditional Augmentation GAN (CA-GAN),
based on the Wasserstein GAN with Gradient Penalty [16, 17] as presented in Health Gym [18]
(referred to in this paper as WGAN-GP*), but with a different objective. Instead of generating
new synthetic datasets, we focus on data augmentation, specifically augmenting the minority class to
mitigate data poverty. We compare the performance of our CA-GAN with WGAN-GP* and SMOTE
in augmenting data of patients of an underrepresented ethnicity (Black patients in our case), using a
critical care dataset of 3343 hypotensive patients, derived from the MIMIC-III database [19, 20].
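The SMOTE interpolation step described in this section can be sketched in a few lines. This is a minimal illustration, not the implementation used in the paper: the function name `smote_sample`, the toy two-dimensional data and the choice of `k` are assumptions made for the example.

```python
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point by interpolating towards a random k-NN neighbour."""
    rng = rng or random.Random()
    # Pick a random minority sample as the starting point.
    base = rng.choice(minority)
    # Rank the other minority samples by squared Euclidean distance to it.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    # Choose one of the k nearest neighbours at random.
    neighbour = rng.choice(others[:k])
    # Interpolate: the new point lies on the segment between base and neighbour.
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Toy minority class: the four corners of the unit square (hypothetical data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = [smote_sample(minority, k=2, rng=random.Random(seed)) for seed in range(100)]
```

Because every synthetic point sits on a segment joining two existing minority samples, none of them can leave the region already spanned by the real data, which gives some intuition for the reduced variability mentioned above.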
Contributions.
(1) We propose a new architecture CA-GAN for data augmentation, to address
some of the shortcomings of the traditional and recent approaches in high-dimensional, time-series
synthetic data generation. (2) We compare CA-GAN qualitatively and quantitatively with a
state-of-the-art architecture in the synthesis of multivariate clinical time series. (3) We also compare
CA-GAN with SMOTE, a naive but effective and popular resampling method, demonstrating superior
performance of generative models in generalisation and synthesis of authentic data. (4) We show
that CA-GAN is able to synthesise realistic data that can augment the real data, when used in a
downstream predictive task.
2 Methods
2.1 Problem Formulation
Let $A$ be a vector space of features and let $a \in A$. Let $l$ be a binary mask, extracted from
$L = \{0, 1\}$, a distribution modifier. Consider the following data set $D_0 = \{a_n\}_{n=1}^{N}$ with
$l = 0$, with individual samples indexed by $n \in \{1, ..., N\}$, and $D_1 = \{a_m\}_{m=N+1}^{N+M}$ with
$l = 1$, with individual samples indexed by $m \in \{N+1, ..., N+M\}$, where $N > M$. Then, consider
the dataset $D = D_0 \cup D_1$ as our training dataset. Notations inspired by [21].
Our goals. We want to learn a density $\hat{d}\{A\}$ that best approximates $d\{A\}$, the true distribution of
$D$. We define $\hat{d}_1\{A\}$ as $\hat{d}\{A\}$ with $l = 1$ applied. From the modified distribution
$\hat{d}_1\{A\}$ we draw random variables $X$ and add these to $D_1$ until $N = M$.
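The augmentation goal above amounts to drawing minority-class samples until both classes have the same size. The sketch below illustrates this under stated assumptions: `augment_until_balanced` and `sample_minority` are hypothetical names, the one-dimensional Gaussian data are placeholders, and only the class sizes (2948 and 395) are taken from the paper's cohort.

```python
import random


def augment_until_balanced(majority, minority, sample_minority):
    """Append synthetic minority samples until both classes have the same size."""
    augmented = list(minority)
    while len(augmented) < len(majority):
        augmented.append(sample_minority())
    return augmented


# Toy stand-ins: sizes mirror the paper's cohort (2948 majority, 395 minority patients).
rng = random.Random(0)
majority = [[rng.gauss(0.0, 1.0)] for _ in range(2948)]
minority = [[rng.gauss(1.0, 1.0)] for _ in range(395)]

# Hypothetical sampler; in the paper this role is played by the trained generator.
balanced = augment_until_balanced(majority, minority, lambda: [rng.gauss(1.0, 1.0)])
n_synthetic = len(balanced) - len(minority)
```

With these class sizes, balancing requires 2948 - 395 = 2553 synthetic minority samples.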
2.2 CGAN vs GAN
The Generative Adversarial Network (GAN) [22] comprises two components, a generator and a discriminator.
The generator $G$ is fed a noise vector $z$ taken from a latent distribution $p_z$ and outputs a sample
of synthetic data. The discriminator $D$ receives either fake samples created by the generator or real
samples $x$ taken from the true data distribution $p_{data}$. Hence, the GAN can be represented by the
following minimax loss function:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
The goal of the discriminator is to maximise the probability of discerning fake from real data, whilst
the goal of the generator is to make samples realistic enough to fool the discriminator, i.e. to
minimise $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$. As a result of this reciprocal competition both the generator
and discriminator improve during training.
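To make the value function concrete, the following sketch estimates it from a handful of discriminator outputs. It assumes the standard form of the GAN objective, in which the fake-sample term is log(1 - D(G(z))); the function name `gan_value` and the example probabilities are illustrative only.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1)."""
    # Average log-probability assigned to real samples being real.
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    # Average log-probability assigned to fake samples being fake.
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator scores real samples high and fake samples low.
confident = gan_value([0.9, 0.95], [0.1, 0.05])
```

The discriminator pushes this value up (towards 0), while a generator that fools it pulls the value back down towards the confused case, which equals -2 log 2.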
The limitations of vanilla GAN models become evident when working with highly imbalanced
datasets, where there might not be sufficient samples to train the models in order to generate minority
class samples. A modified version of GAN, the Conditional GAN [23], solves this problem by using
labels $y$ in both the generator and the discriminator. The additional information $y$ divides the generation
and the discrimination into different classes. Hence, the model can now be trained on the whole dataset,
and then used to generate only minority class samples. The loss function is modified as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)))]$$
GAN and CGAN, overall, share the same major weaknesses during training, namely mode collapse
and vanishing gradient [24]. In addition, as GANs were initially designed to generate images,
they have been shown to be unsuitable for generating time-series [21] and discrete data samples [25].
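One common way to feed the label $y$ to both networks is to concatenate a one-hot encoding of it to their inputs. The sketch below assumes this concatenation scheme, which the text here does not specify; the helper names, the noise dimension and the 0/1 class encoding are all hypothetical.

```python
import random


def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def generator_input(noise_dim, label, num_classes, rng):
    """Build the conditioned generator input: noise z concatenated with label y."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + one_hot(label, num_classes)


def discriminator_input(sample, label, num_classes):
    """Build the conditioned discriminator input: sample x concatenated with label y."""
    return list(sample) + one_hot(label, num_classes)


rng = random.Random(0)
MINORITY = 1  # assumed encoding: 0 = majority class, 1 = minority class
g_in = generator_input(noise_dim=8, label=MINORITY, num_classes=2, rng=rng)
d_in = discriminator_input([0.3, 0.7, 0.1], MINORITY, num_classes=2)
```

At generation time, fixing the label to the minority class while varying only the noise is what allows a model trained on the whole dataset to emit minority samples alone.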
2.3 CA-GAN vs WGAN-GP*
The WGAN-GP* introduced by Kuo et al. [18] solved many of the limitations posed by vanilla
GANs. The model was a modified version of a WGAN-GP [16, 17], thus it applied the Earth Mover
distance (EM) [26] to the distributions, which had been shown to solve both vanishing gradient and
mode collapse [27]. In addition, the model applied Gradient Penalty during training, which helped
to enforce the Lipschitz constraint on the discriminator more efficiently. More information on the
WGAN-GP* architecture can be found in Appendix A.
We built our CA-GAN on the WGAN-GP* of Kuo et al. by conditioning the generator and the
discriminator on static labels $y$. Hence, the updated loss functions used by our model are as follows:
$$L_D = \mathbb{E}_{z \sim p_z(z)}[D(G(z|y))] - \mathbb{E}_{x \sim p_{data}(x)}[D(x|y)] + \lambda_{GP} \, \mathbb{E}_{z \sim p_z(z)}[(\|\nabla D(G(z|y))\|_2 - 1)^2]$$
$$L_G = -\mathbb{E}_{z \sim p_z(z)}[D(G(z|y))] + \underbrace{\lambda_{corr} \sum_{i=1}^{n} \sum_{j=1}^{i-1} \| r_{syn}^{(i,j)} - r_{real}^{(i,j)} \|_{L_1}}_{\text{Alignment loss}}$$
where $y$ can be any type of categorical label. During training the labels $y$ were used to differentiate
the minority from the majority class and during generation they were used to create fake samples of
the minority class.
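The alignment term in the generator loss compares pairwise correlations between variables in the real and synthetic data over the lower triangle ($j < i$). The sketch below shows that double sum without the $\lambda_{corr}$ weight, assuming that $r^{(i,j)}$ denotes the Pearson correlation between variables $i$ and $j$ (an assumption, as the correlation type is not stated here); `pearson`, `alignment_loss` and the toy columns are illustrative names and data.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    mu_u, mu_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def alignment_loss(real_cols, syn_cols):
    """Sum of |r_syn(i,j) - r_real(i,j)| over the lower triangle j < i."""
    n = len(real_cols)
    total = 0.0
    for i in range(n):
        for j in range(i):
            r_real = pearson(real_cols[i], real_cols[j])
            r_syn = pearson(syn_cols[i], syn_cols[j])
            total += abs(r_syn - r_real)
    return total


# Each inner list is one variable (column) observed over five samples.
real = [[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0], [5.0, 3.0, 4.0, 1.0, 2.0]]
same = alignment_loss(real, real)
# Reversing one variable flips the sign of its correlations with the others.
flipped = [real[0], real[1][::-1], real[2]]
changed = alignment_loss(real, flipped)
```

The loss is zero when the synthetic data reproduce every pairwise correlation and grows as the correlation structure drifts, which is what penalises a generator that models each variable well in isolation but not jointly.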
In comparison with WGAN-GP*, we also increased the number of biLSTMs from 1 to 3 both in the
generator and the discriminator, as stacked biLSTMs have been shown to better capture complex
time-series [28]. In addition, we decreased the learning rate and batch size during training. An overview
of the CA-GAN architecture is shown in Figure 3.
3 Evaluation
Our dataset comprises 3343 hypotensive patients [29] admitted to critical care; the patients were
either of Black (395) or Caucasian (2948) ethnicity. Each patient is represented by 48 data points,
corresponding to the first 48 hours after admission, in addition to 9 numeric, 4 categorical and 7
binary variables (20 in total), as shown in Table 3.
3.1 Evaluation Metrics
Evaluating the quality of the data produced by a generative model is anything but trivial. Several
evaluation metrics have been proposed, but there is still no standardised evaluation method. In this work,
given the complexity of the multivariate time series that we wanted to synthesise, we have chosen
to adopt both a qualitative and quantitative evaluation of generated data. First, we used Maximum
Mean Discrepancy (MMD) and Kullback–Leibler divergence to measure the difference between real
and synthetic data for the underlying distribution of each variable. Second, we used the Kendall rank
correlation coefficient to evaluate the ability of the generative model to capture correlations between
variables. Then, we verified that our model was generating authentic data (and not simply copying
real data) by measuring the Euclidean Distance between real data and synthetic data. In this respect
we also visualised real and synthetic data in a two dimensional latent space. Finally, we verified that
our CA-GAN was able to generate useful new time series and able to capture the temporal correlation
of the observations, by evaluating the predictive ability of an LSTM trained with synthetic data and
evaluated on test data. Furthermore, several of our evaluation metrics were qualitatively analysed
using plots of distributions, correlations, and two-dimensional representations of the datasets.
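Two of the metrics above are short enough to sketch directly. The code below is an illustration rather than the evaluation code used in the paper: it assumes a Gaussian (RBF) kernel with a hand-picked `gamma` and the biased estimator for MMD, and the tau-a variant of the Kendall coefficient (no tie correction). All function names and data are hypothetical.

```python
import math


def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two points given as lists of floats."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples under the RBF kernel."""
    def mean_kernel(us, vs):
        return sum(rbf(u, v, gamma) for u in us for v in vs) / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs, ignoring ties."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


real = [[0.0], [0.5], [1.0], [1.5]]
shifted = [[3.0], [3.5], [4.0], [4.5]]
mmd_same = mmd_squared(real, real)
mmd_far = mmd_squared(real, shifted)
tau_up = kendall_tau([1, 2, 3, 4], [10, 20, 30, 40])
tau_down = kendall_tau([1, 2, 3, 4], [40, 30, 20, 10])
```

An MMD close to zero indicates that the two samples are hard to distinguish under the chosen kernel, while comparing Kendall coefficients computed on real and synthetic data shows whether monotonic relationships between variables are preserved.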
4 Results
In this section we present the comparison between the synthetic data generated by our CA-GAN
and the data generated by WGAN-GP* and SMOTE (with 5-NN). We used each method to generate
sufficient data to augment the minority class (Black patients) and balance the original dataset.