
FairGen: Fair Synthetic Data Generation
Himanshu Chaudhary * 1  Bhushan Chaudhari * 1  Aakash Agarwal 2  Kamna Meena 1  Tanmoy Bhowmik 3
Abstract
With the rising adoption of machine learning across domains such as banking, pharmaceuticals, and ed-tech, it has become of utmost importance to adopt responsible AI methods that ensure models do not unfairly discriminate against any group. Given the lack of clean training data, generative adversarial techniques are preferred for generating synthetic data, with several state-of-the-art architectures readily available across domains, from unstructured data such as text and images to structured datasets modelling fraud detection and more. These techniques overcome challenges such as class imbalance, limited training data, and restricted access to data due to privacy issues. Existing work on generating fair data either applies to a specific GAN architecture or is very difficult to tune across GANs. In this paper, we propose a pipeline to generate fairer synthetic data independent of the GAN architecture. The proposed approach uses a pre-processing algorithm to identify and remove bias-inducing samples. In particular, we claim that most GANs amplify the bias present in the training data while generating synthetic data, but that removing these bias-inducing samples makes GANs focus more on the truly informative samples. Our experimental evaluation on two open-source datasets demonstrates that the proposed pipeline generates fairer data, with improved performance in some cases.
*Equal contribution 1Mastercard, India 2Credgenics, India 3GoTo Group, India. Correspondence to: Himanshu Chaudhary <himanshu.chaudhary@mastercard.com>, Bhushan Chaudhari <bhushan.chaudhari@mastercard.com>, Aakash Agarwal <aakash.agarwal1307@gmail.com>, Kamna Meena <mkamna14@gmail.com>, Tanmoy Bhowmik <tantanmoy@gmail.com>.
Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).
1. Introduction
Machine learning algorithms are used ubiquitously today, ranging from recommendation systems, fraud detection, and facial recognition to autonomous driving. These algorithms use historical data to automate decision making. However, such models can exhibit unintended bias towards protected groups, as demonstrated by (Calders & Žliobaitė, 2013). Bias enters a model because the training data itself is biased. Training data can be unfair with respect to protected personal attributes such as gender, race, and religion, and this unfairness is not limited to real-world data: it also propagates to any model built on such biased data.
Deep learning models require large amounts of training data to be accurate. Because of limited training data and class-imbalance issues, synthetic data has gained prominence as a way to obtain additional high-quality training data. Generative Adversarial Networks (GANs) are among the most popular methods for generating synthetic data (Goodfellow et al., 2014). A GAN is a generative model with two components, a generator and a discriminator, trained adversarially so that the generator learns to fool the discriminator and produces synthetic data nearly indistinguishable from real data. Such high-quality synthetic data can be used for predictive analysis where real-world data is limited (Choi et al., 2017). Health and financial datasets are particularly ripe for a synthetic approach because these fields are highly restricted by privacy laws; synthetic data can give researchers the data they need without violating anyone's privacy. Institutions in the health and financial sectors cannot share their customers' data directly. However, if they wish to open-source their data while adhering to all regulatory privacy rules, they can generate synthetic data from real customer data: the real data is never shared, yet the synthetic data preserves the distributional properties of the real data and can be used by researchers to conduct groundbreaking research.
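The adversarial setup described above can be illustrated on a toy one-dimensional problem. The sketch below is a hand-rolled NumPy example written for this exposition (not the paper's pipeline or any particular tabular GAN): a linear generator G(z) = a·z + b is trained against a logistic discriminator D(x) = σ(w·x + c) so that its samples match the mean of a target Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_toy_gan(steps=3000, batch=128, lr=0.05, real_mu=3.0):
    """Alternating gradient updates for G(z) = a*z + b vs. D(x) = sigmoid(w*x + c)."""
    a, b = 1.0, 0.0   # generator parameters
    w, c = 0.1, 0.0   # discriminator parameters
    for _ in range(steps):
        x_real = rng.normal(real_mu, 1.0, batch)   # "real" data
        z = rng.normal(0.0, 1.0, batch)            # latent noise
        x_fake = a * z + b                         # generated samples

        # Discriminator step: minimize -log D(real) - log(1 - D(fake))
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w -= lr * np.mean(-(1 - d_real) * x_real + d_fake * x_fake)
        c -= lr * np.mean(-(1 - d_real) + d_fake)

        # Generator step (non-saturating loss): minimize -log D(fake)
        d_fake = sigmoid(w * x_fake + c)
        a -= lr * np.mean(-(1 - d_fake) * w * z)
        b -= lr * np.mean(-(1 - d_fake) * w)
    return a, b

a, b = train_toy_gan()
samples = a * rng.normal(0.0, 1.0, 10000) + b  # draw from the trained generator
```

Because the discriminator here is linear in x, it can only detect a mean mismatch, so the generator learns to shift its samples toward the real mean; practical GANs replace both components with deep networks so richer distributional differences are penalized.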
Existing GAN techniques generate biased data by amplifying the bias present in the training data. (Gupta et al., 2021) showed bias amplification across various GAN architectures, including differentially private generation schemes. This finding motivated us to look for ways of reducing bias in the training data itself.