decrease data variability and introduce correlation between samples [5, 6, 7]. As such, alternative approaches based on generative adversarial networks (GAN) are gaining ground [8, 9, 10, 11, 12].
However, the generation of high-dimensional time-series data remains a challenging task [13, 14, 15]. In this work we propose a new generative architecture, Conditional Augmentation GAN (CA-GAN), based on the Wasserstein GAN with Gradient Penalty [16, 17] as presented in Health Gym [18] (referred to in this paper as WGAN-GP*), but with a different objective. Instead of generating new synthetic datasets, we focus on data augmentation, specifically augmenting the minority class to mitigate data poverty. We compare the performance of our CA-GAN with WGAN-GP* and SMOTE in augmenting the data of patients of an underrepresented ethnicity (Black patients in our case), using a critical care dataset of 3343 hypotensive patients derived from the MIMIC-III database [19, 20].
Contributions.
(1) We propose CA-GAN, a new architecture for data augmentation that addresses some of the shortcomings of traditional and recent approaches to high-dimensional, time-series synthetic data generation. (2) We compare CA-GAN qualitatively and quantitatively with a state-of-the-art architecture in the synthesis of multivariate clinical time series. (3) We also compare CA-GAN with SMOTE, a naive but effective and popular resampling method, demonstrating the superior performance of generative models in generalisation and in the synthesis of authentic data. (4) We show that CA-GAN can synthesise realistic data that augments the real data when used in a downstream predictive task.
2 Methods
2.1 Problem Formulation
Let $A$ be a vector space of features and let $a \in A$. Let $l$ be a binary mask, extracted from $L = \{0, 1\}$, acting as a distribution modifier. Consider the dataset $D_0 = \{a_n\}_{n=1}^{N}$ with $l = 0$, with individual samples indexed by $n \in \{1, \ldots, N\}$, and $D_1 = \{a_m\}_{m=N+1}^{N+M}$ with $l = 1$, with individual samples indexed by $m \in \{N+1, \ldots, N+M\}$, where $N > M$. Then, consider the dataset $D = D_0 \cup D_1$ as our training dataset. Notation inspired by [21].
Our goals. We want to learn a density $\hat{d}\{A\}$ that best approximates $d\{A\}$, the true distribution of $D$. We define $\hat{d}_1\{A\}$ as $\hat{d}\{A\}$ with $l = 1$ applied. From the modified distribution $\hat{d}_1\{A\}$ we draw random variables $X$ and add these to $D_1$ until $N = M$.
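As a concrete illustration of this formulation, the sketch below (our own, not from the paper; the array shapes are arbitrary, and a Gaussian fit stands in for the learned density $\hat{d}_1\{A\}$, which in CA-GAN would be the trained generator) builds $D = D_0 \cup D_1$ with the binary mask $l$ and oversamples the minority set until the classes are balanced:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature space A: vectors in R^4 (dimension chosen for illustration).
N, M, d = 100, 30, 4                      # majority size N > minority size M
D0 = rng.normal(0.0, 1.0, size=(N, d))    # samples with mask l = 0
D1 = rng.normal(2.0, 1.0, size=(M, d))    # samples with mask l = 1 (minority)

# Training set D = D0 ∪ D1, with the binary mask l kept per sample.
D = np.vstack([D0, D1])
l = np.concatenate([np.zeros(N, dtype=int), np.ones(M, dtype=int)])

# Stand-in for the learned density d̂₁{A}: a Gaussian fitted to D1.
# In CA-GAN this sampler would be replaced by the trained generator.
mu, sigma = D1.mean(axis=0), D1.std(axis=0)
def sample_d1_hat(k):
    return rng.normal(mu, sigma, size=(k, d))

# Draw synthetic minority samples X and add them to D1 until N = M.
X = sample_d1_hat(N - M)
D1_aug = np.vstack([D1, X])
assert len(D1_aug) == len(D0)             # classes are now balanced
```

The balancing step only ever grows the minority set; the majority set $D_0$ is left untouched, matching the augmentation (rather than full-synthesis) objective stated above.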
2.2 CGAN vs GAN
The Generative Adversarial Network (GAN) [22] comprises two components, a generator and a discriminator. The generator $G$ is fed a noise vector $z$ drawn from a latent distribution $p_z$ and outputs a sample of synthetic data. The discriminator $D$ takes as input either fake samples created by the generator or real samples $x$ drawn from the true data distribution $p_{data}$. Hence, the GAN can be represented by the following minimax loss function:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
The goal of the discriminator is to maximise the probability of discerning fake from real data, whilst the goal of the generator is to make samples realistic enough to fool the discriminator, i.e. to minimise $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$. As a result of this reciprocal competition, both the generator and the discriminator improve during training.
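To make the minimax objective concrete, the toy sketch below (ours, not the paper's; the fixed closed-form "generator" and "discriminator" are illustrative stand-ins for trainable networks) estimates the two expectation terms of $V(D, G)$ by Monte Carlo sampling:

```python
import numpy as np

rng = np.random.default_rng(1)

# True data distribution p_data: standard normal; latent prior p_z: uniform.
def sample_pdata(n): return rng.normal(0.0, 1.0, size=n)
def sample_pz(n):    return rng.uniform(-1.0, 1.0, size=n)

# Illustrative fixed generator and discriminator (stand-ins for networks).
def G(z): return 2.0 * z                     # maps latent noise to a sample
def D(x): return 1.0 / (1.0 + np.exp(-x))    # sigmoid "realness" score in (0, 1)

def V(n=100_000):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    x, z = sample_pdata(n), sample_pz(n)
    return np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))

v_hat = V()  # the discriminator ascends V; the generator descends the 2nd term
```

Because both log terms are strictly negative for a sigmoid discriminator, the estimate is always below zero; training moves $D$ to push it up and $G$ to pull the second term down.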
The limitations of vanilla GAN models become evident when working with highly imbalanced datasets, where there may not be sufficient minority-class samples to train the models to generate them. A modified version of the GAN, the Conditional GAN [23], solves this problem by conditioning both the generator and the discriminator on labels $y$. The additional information $y$ partitions the generation and the discrimination into distinct classes. The model can thus be trained on the whole dataset and then be used to generate only minority-class samples. Hence, the loss function is modified as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))]$$
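In practice, the conditioning $D(x \mid y)$ and $G(z \mid y)$ is commonly implemented by concatenating a one-hot label vector to the discriminator's data input and the generator's noise input. A minimal sketch of that input construction (our illustration under assumed one-hot encoding and feed-forward shapes, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(2)

n_classes, z_dim, x_dim, batch = 3, 8, 5, 4

def one_hot(y, n):
    out = np.zeros((len(y), n))
    out[np.arange(len(y)), y] = 1.0
    return out

y = rng.integers(0, n_classes, size=batch)       # class labels (e.g. minority id)
z = rng.normal(size=(batch, z_dim))              # latent noise for G
x = rng.normal(size=(batch, x_dim))              # real samples for D

# Conditioning by concatenation: G sees [z | y], D sees [x | y].
g_input = np.hstack([z, one_hot(y, n_classes)])  # shape (batch, z_dim + n_classes)
d_input = np.hstack([x, one_hot(y, n_classes)])  # shape (batch, x_dim + n_classes)
```

At sampling time, fixing $y$ to the minority-class label steers the trained generator to produce only minority-class samples, which is the mechanism CA-GAN's augmentation objective relies on.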
Overall, GAN and CGAN share the same major weaknesses during training, namely mode collapse and vanishing gradients [24]. In addition, as GANs were initially designed to generate images, they have been shown to be unsuitable for generating time series [21] and discrete data samples [25].