FCT-GAN: Enhancing Table Synthesis via Fourier
Transform
Zilong Zhao, Robert Birke, Lydia Y. Chen
TU Delft, Netherlands z.zhao-8@tudelft.nl
ABB Research, Switzerland birke@ieee.org
TU Delft, Netherlands lydiaychen@ieee.org
Abstract—Synthetic tabular data emerges as an alternative
for sharing knowledge while adhering to restrictive data access
regulations, e.g., European General Data Protection Regulation
(GDPR). Mainstream state-of-the-art tabular data synthesiz-
ers draw methodologies from Generative Adversarial Networks
(GANs), which are composed of a generator and a discriminator.
While convolutional neural networks are shown to be a better
architecture than fully connected networks for tabular data
synthesis, two key properties of tabular data are overlooked:
(i) the global correlation across columns, and (ii) synthesis that
is invariant to column permutations of the input data. To address the
above problems, we propose a Fourier conditional tabular gen-
erative adversarial network (FCT-GAN). We introduce feature
tokenization and Fourier networks to construct a transformer-
style generator and discriminator, and capture both local and
global dependencies across columns. The tokenizer captures
local spatial features and transforms original data into tokens.
Fourier networks transform tokens to the frequency domain and
multiply them element-wise by a learnable filter. Extensive evaluation
on benchmarks and real-world data shows that FCT-GAN can
synthesize tabular data with high machine learning utility (up to
27.8% better than state-of-the-art baselines) and high statistical
similarity to the original data (up to 26.5% better), while
maintaining the global correlation across columns, especially on
high-dimensional datasets.
I. INTRODUCTION
While data sharing is crucial for knowledge development,
privacy concerns and strict regulations (e.g., European General
Data Protection Regulation (GDPR)) limit its full effective-
ness. An emerging solution is to leverage synthetic data
generated by machine learning models. Synthetic data has been
powered by generative adversarial networks (GANs) [6] for
various types of data, e.g., image [9], text to image [19] and
table [24].
Synthetic tabular data emerges as a prominent research
direction because of its ample application scenarios in areas
such as medicine [4] and finance [1]. Compared to image
data, one key difference of tabular data is that it is composed
of different types of columns such as continuous, categorical
or mixed variables. Therefore, GANs designed for image
synthesis cannot be directly applied to tabular data. Previous
works [24], [26], [27] propose feature engineering solutions
for different types of data such as using one-hot encoding for
categorical variables. One-hot encoding is shown [24] to better
recover the categorical variable distribution for tabular GANs
and capture inter-dependency across all the columns. However,
one-hot encoding inevitably increases the data dimensions.
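The growth in dimensionality caused by one-hot encoding can be illustrated with a small sketch. The toy table, its column names, and the helper `one_hot_encode` are hypothetical and only serve the illustration; they are not taken from the paper's code.

```python
import numpy as np

def one_hot_encode(column):
    """Return (matrix, categories) for a 1-D sequence of categorical values."""
    categories = sorted(set(column))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(column), len(categories)), dtype=np.float32)
    for row, value in enumerate(column):
        encoded[row, index[value]] = 1.0
    return encoded, categories

# Hypothetical toy table: one continuous and two categorical columns.
age = np.array([[23.0], [41.0], [35.0], [52.0]], dtype=np.float32)
job = ["nurse", "teacher", "nurse", "engineer"]
city = ["Delft", "Zurich", "Delft", "Baden"]

job_1h, job_cats = one_hot_encode(job)
city_1h, city_cats = one_hot_encode(city)

# The encoded row is the concatenation of all per-column encodings.
encoded_table = np.hstack([age, job_1h, city_1h])

original_width = 3                      # age, job, city
encoded_width = encoded_table.shape[1]  # 1 + len(job_cats) + len(city_cats)
print(f"columns before encoding: {original_width}, after: {encoded_width}")
```

Each categorical column contributes one encoded column per distinct category, so the encoded width grows with the total number of categories rather than with the number of original columns.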
High-dimensional data¹ makes it challenging for tabular GANs to
learn global relations. Prior studies [24], [26], [27] show that
tabular GAN algorithms that adopt CNNs as generator
and discriminator achieve better synthesis quality than those using
purely fully-connected neural networks. This is because
CNNs can extract local spatial features well. The first
limitation of directly adopting CNNs to model tabular data is that they
may overlook global relations between columns due to the size
of the convolution filter. This limitation is exacerbated when
one-hot encoding is applied to categorical variables. Secondly,
while permuting columns, e.g., reordering the columns by their
types, does not have any semantic meaning, it distorts the local feature
representation extracted by convolution layers. When
using CNN for tabular GANs, one table row is transformed
into one fixed-size image by mapping each column value to a
pixel. The relationship between highly distant pixels, e.g. the
pixel in the upper left corner and the pixel in the lower right
corner, in a real image may not influence image classification.
But for the tabular data wrapped as an image, these two pixels
can represent highly correlated columns.
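The row-to-image mapping described above can be sketched as follows. This is a minimal illustration under assumptions not stated in the text: the row is zero-padded to the next square size and wrapped in row-major order, and the helper names and the 3 x 3 kernel size are hypothetical.

```python
import math
import numpy as np

def row_to_image(row):
    """Zero-pad a 1-D encoded row and reshape it into a square 'image'."""
    side = math.isqrt(len(row) - 1) + 1  # smallest side with side*side >= len(row)
    padded = np.zeros(side * side, dtype=np.float32)
    padded[: len(row)] = row
    return padded.reshape(side, side)

def pixel_position(column_index, side):
    """(row, col) pixel position of a table column under row-major wrapping."""
    return divmod(column_index, side)

def fits_in_one_kernel(col_a, col_b, side, kernel=3):
    """True if both pixels can fall inside a single kernel x kernel window."""
    ra, ca = pixel_position(col_a, side)
    rb, cb = pixel_position(col_b, side)
    return abs(ra - rb) < kernel and abs(ca - cb) < kernel

# Hypothetical encoded row with 30 values (e.g., after one-hot encoding).
row = np.arange(1, 31, dtype=np.float32)
image = row_to_image(row)
side = image.shape[0]

print("image shape:", image.shape)
print("first and last column share a kernel window:",
      fits_in_one_kernel(0, len(row) - 1, side))
```

Under these assumptions, the first and last table columns land in opposite corners of the image, so a single small convolution window never sees both, even if the two columns are strongly correlated.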
To address the above two limitations, we propose a condi-
tional tabular GAN with Fourier Network blocks (FNBs). The
objective of the FNBs is to learn the interactions among spatial
locations in the frequency domain. We use FNBs for both
the discriminator and generator with different designs. The
Fourier layer, which is the key part of an FNB, contains three
operations: (i) 2D discrete Fourier transform, (ii) element-wise
multiplication between frequency-domain features and learn-
able weights and (iii) 2D inverse discrete Fourier transform.
Furthermore, we process input data using transformer-style
tokenization. A CNN-based filter is applied to the original
data to capture local spatial features and transform them into
feature tokens. Fourier layers transform tokens into the frequency
domain, then the learnable weights are applied to all the
frequencies to learn the global relations. Our results show that
FCT-GAN outperforms the state-of-the-art (SOTA) by up to 27.8%
in machine learning utility and 26.5% in statistical similarity
on 7 datasets. Thanks to the Fourier blocks' ability to capture local
and global relations, our results also show that across three
different column orders, FCT-GAN has the least variation in
synthesis quality among all compared methods. The experiment with
one high-dimensional dataset, on which 3 SOTA algorithms fail
to train due to the data dimension issue, shows that FCT-GAN
still maintains its performance and stability above all
comparisons.
¹ In this paper, dimension refers to the number of columns.
arXiv:2210.06239v1 [cs.LG] 12 Oct 2022
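The three operations of the Fourier layer (2D discrete Fourier transform, element-wise multiplication with learnable weights, 2D inverse transform) can be sketched in NumPy. This is an illustrative sketch, not the authors' implementation: the token grid shape `(H, W, D)`, the use of the real-input FFT (`rfft2`/`irfft2`) to keep the output real-valued, and the random weights standing in for learned parameters are all assumptions.

```python
import numpy as np

def fourier_layer(tokens, weights):
    """Apply a global frequency-domain filter to an (H, W, D) grid of real tokens.

    weights must have shape (H, W // 2 + 1, D), the shape of the real-input
    2D FFT of the token grid; in a network it would be a learnable parameter.
    """
    h, w, _ = tokens.shape
    freq = np.fft.rfft2(tokens, axes=(0, 1))           # (i) 2D DFT over the grid
    freq = freq * weights                              # (ii) element-wise product
    return np.fft.irfft2(freq, s=(h, w), axes=(0, 1))  # (iii) 2D inverse DFT

rng = np.random.default_rng(0)
H, W, D = 4, 4, 8                       # hypothetical token grid and embedding size
tokens = rng.normal(size=(H, W, D))

# An all-ones filter leaves every frequency untouched, so the layer is the identity.
identity = np.ones((H, W // 2 + 1, D), dtype=np.complex128)
unchanged = fourier_layer(tokens, identity)

# Random complex weights stand in for a learned filter.
weights = (rng.normal(size=(H, W // 2 + 1, D))
           + 1j * rng.normal(size=(H, W // 2 + 1, D)))

# Perturb a single token and observe how far the change spreads.
perturbed = tokens.copy()
perturbed[0, 0, :] += 1.0
diff = fourier_layer(perturbed, weights) - fourier_layer(tokens, weights)
print("max change at the opposite corner:", np.abs(diff[-1, -1]).max())
```

Because every frequency component depends on all token positions, changing one token affects the output at every position, including the opposite corner of the grid. This is the sense in which the filter captures global relations, in contrast to a small convolution window.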
The main contributions of this study can be summarized
as follows: (1) We introduce the Fourier transform into tab-
ular GAN training and design a generator and discriminator
architecture. (2) Combining with a transformer-style input tok-
enizer, the novel architecture can capture both local and global
relations of tabular data, leading to the desirable property of
column permutation invariance. (3) We extensively evaluate
FCT-GAN on 8 datasets against 5 state-of-the-art synthesizers,
with a special focus on high dimensional real-world data.
II. RELATED WORK
We introduce various tabular data synthesizing methods and
Fourier networks.
A. Tabular Data Synthesizers
There are various approaches for synthesizing tabular
data. Probabilistic models such as copulas [17] use a copula
function to model multivariate distributions, but categorical
data cannot be modeled by a Gaussian copula. Synthpop [15]
works on a variable-by-variable basis by fitting a sequence
of regression models and drawing synthetic values from the
corresponding predictive distributions. Since it proceeds variable
by variable, the training process is computationally intensive.
Bayesian networks [2], [25] are used to synthesize categorical
variables, but they lack the ability to generate continuous
variables.
The current state of the art includes several tabular GAN
algorithms. Table-GAN [16] introduces an auxiliary classification
model along with discriminator training to enhance
column dependency in the synthetic data. CT-GAN [24] and
CTAB-GAN [26] improve data synthesis by introducing several
preprocessing steps for categorical, continuous or mixed
data types, which encode data columns into a suitable form
for GAN training. The conditional vector designed by CT-GAN
and later improved by CTAB-GAN also helps GAN training
reduce mode collapse on minority categories.
CTAB-GAN+ [27], PATE-GAN [8], and IT-GAN [10] generate
tabular data without risking the privacy of the original data by
either adopting differential privacy or controlling the negative
log-density of real records during GAN training. However,
as far as we know, no previous work on tabular GANs
focuses on countering the influence of column permutations
in the training data on the final synthetic data
quality.
B. Fourier Networks
The Fourier transform has played an important role in
image processing for decades, e.g., JPEG compression [23].
Incorporating the Fourier transform into neural network
architecture design has been studied in many vision works [3],
[14], [21]. Recent work also leverages the Fourier transform
to design deep neural networks to solve partial differential
equations (PDEs) [12] and NLP tasks [11]. Our Fourier network
block architecture design is mainly inspired by the Global
Filter Network (GFNet) [20]. We take the design of the input
tokenization and global filter layer from GFNet and use it
in our design of the generator and discriminator for tabular
GAN.
Fig. 1: Normalized maximal absolute variation (MAV) of the
difference in statistical similarity between original and synthetic
data among different column orders: (1) Original, (2) Order
by data type, and (3) Order by column correlation.
III. ANALYSIS ON PERMUTATION INVARIANCE
TABLE I: Average statistical similarity difference between
original and synthetic data among three different column
orders.
Method      Avg-JSD  Avg-WD  Diff. Corr.
CTAB-GAN+   0.040    0.012   2.21
CTAB-GAN    0.040    0.013   2.21
Table-GAN   0.20     0.017   3.57
TVAE        0.10     0.231   2.79
CT-GAN      0.067    0.038   3.17
We empirically demonstrate the instability of prior SOTA
methods under permutations of the columns in the training data.
Details on the datasets and evaluation metrics are provided in
the Experiment section.
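To make the column permutations concrete, the sketch below builds the column orders considered in this section (original, by data type, by correlation) for a toy table. It is an illustrative reading rather than the authors' code: the toy data, the function names, and the exact tie-breaking are assumptions. In particular, the correlation order is interpreted here as walking column pairs from highest to lowest absolute Pearson correlation and appending each column the first time it appears.

```python
import numpy as np

def order_by_type(is_continuous):
    """Continuous columns first, categorical columns last (stable within groups)."""
    continuous = [i for i, flag in enumerate(is_continuous) if flag]
    categorical = [i for i, flag in enumerate(is_continuous) if not flag]
    return continuous + categorical

def order_by_correlation(data):
    """Visit column pairs by decreasing |correlation|, skipping placed columns."""
    corr = np.abs(np.corrcoef(data, rowvar=False))
    n = corr.shape[0]
    pairs = [(corr[i, j], i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(key=lambda p: -p[0])
    order = []
    for _, i, j in pairs:
        for col in (i, j):
            if col not in order:  # a column already placed is skipped
                order.append(col)
    return order

# Hypothetical toy table with numerically coded categorical columns.
# Columns: 0 = x (continuous), 1 = alt (categorical),
#          2 = y (continuous), 3 = flag (categorical).
data = np.array([
    [1.0, 1.0, 3.0, 0.0],
    [2.0, 0.0, 1.0, 0.0],
    [3.0, 1.0, 4.0, 0.0],
    [4.0, 0.0, 1.0, 1.0],
    [5.0, 1.0, 5.0, 1.0],
    [6.0, 0.0, 9.0, 1.0],
])
is_continuous = [True, False, True, False]

original_order = list(range(data.shape[1]))
type_order = order_by_type(is_continuous)
corr_order = order_by_correlation(data)

# Reordering the training table is a plain column permutation.
data_by_type = data[:, type_order]
data_by_corr = data[:, corr_order]
print(original_order, type_order, corr_order)
```

A synthesizer that is invariant to column permutations would produce data of the same quality when trained on `data`, `data_by_type`, or `data_by_corr`.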
We consider three column orders: (i) Original: as the name
suggests, keeps the column order of the data as downloaded from the
dataset source. (ii) Order by data type: puts all the continuous
columns at the beginning and all categorical columns at the
back. (iii) Order by data correlation: first calculates the pairwise
correlations between all columns, then sorts the columns
by absolute correlation value, with highly correlated
pairs in front and less correlated pairs later. Duplicate columns
are skipped. We evaluate the statistical dissimilarity between
real and synthetic tables using the average Jensen–Shannon
divergence (Avg-JSD) over all categorical columns and the average
Wasserstein distance (Avg-WD) over all continuous columns.
Finally, Diff. Corr. denotes the averaged distance between the
correlation matrices of the real and the synthetic data. The lower
these metrics, the better the synthetic data quality. For
metric $M$ and dataset $D$, $L = \{V^N_{DM}, V^N_{DM}, \ldots, V^N_{DM}\}$ denotes