FCT-GAN: Enhancing Table Synthesis via Fourier Transform

Zilong Zhao∗, Robert Birke†, Lydia Y. Chen¶

∗TU Delft, Netherlands z.zhao-8@tudelft.nl

†ABB Research, Switzerland birke@ieee.org

¶TU Delft, Netherlands lydiaychen@ieee.org

Abstract—Synthetic tabular data emerges as an alternative

for sharing knowledge while adhering to restrictive data access

regulations, e.g., European General Data Protection Regulation

(GDPR). Mainstream state-of-the-art tabular data synthesiz-

ers draw methodologies from Generative Adversarial Networks

(GANs), which are composed of a generator and a discriminator.

While convolutional neural networks have been shown to be a better

architecture than fully connected networks for tabular data

synthesis, two key properties of tabular data are overlooked:

(i) the global correlation across columns, and (ii) synthesis that is

invariant to column permutations of the input data. To address the

above problems, we propose a Fourier conditional tabular gen-

erative adversarial network (FCT-GAN). We introduce feature

tokenization and Fourier networks to construct a transformer-

style generator and discriminator, and capture both local and

global dependencies across columns. The tokenizer captures

local spatial features and transforms original data into tokens.

Fourier networks transform tokens to the frequency domain and

multiply them element-wise by a learnable filter. Extensive evaluation

on benchmarks and real-world data shows that FCT-GAN can

synthesize tabular data with high machine learning utility (up to

27.8% better than state-of-the-art baselines) and high statistical

similarity to the original data (up to 26.5% better), while

maintaining the global correlation across columns, especially on

high-dimensional datasets.

I. INTRODUCTION

While data sharing is crucial for knowledge development,

privacy concerns and strict regulations (e.g., European General

Data Protection Regulation (GDPR)) limit its full effective-

ness. An emerging solution is to leverage synthetic data

generated by machine learning models. Synthetic data has been

explored for various types of data, e.g., image [9], text-to-image [19] and

table [24].

Synthetic tabular data emerges as a prominent research

direction because of its ample application scenarios in areas

such as medicine [4] and finance [1]. Compared to image

data, one key difference of tabular data is that it is composed

of different types of columns such as continuous, categorical

or mixed variables. Therefore, GANs designed for image

synthesis cannot be directly applied to tabular data. Previous

works [24], [26], [27] propose feature engineering solutions

for different types of data, such as using one-hot encoding for

categorical variables. One-hot encoding is shown [24] to better

recover the categorical variable distribution for tabular GANs

and capture inter-dependency across all the columns. However,

one-hot encoding inevitably increases the data dimensions.
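As a rough illustration of how one-hot encoding widens a table, the sketch below encodes a toy table with one continuous and two categorical columns. The column names, values, and the `one_hot_encode` helper are hypothetical and are not taken from the paper's preprocessing pipeline.

```python
def one_hot_encode(rows, categorical):
    """One-hot encode the categorical columns of a list-of-dicts table."""
    # Collect the sorted set of observed values for each categorical column.
    categories = {c: sorted({r[c] for r in rows}) for c in categorical}
    encoded = []
    for r in rows:
        out = []
        for col, value in r.items():
            if col in categories:
                # One slot per category, 1.0 for the observed value.
                out.extend(1.0 if value == cat else 0.0 for cat in categories[col])
            else:
                out.append(float(value))
        encoded.append(out)
    return encoded


# Hypothetical toy table: 1 continuous column, 2 categorical columns.
rows = [
    {"age": 39, "job": "clerk", "country": "NL"},
    {"age": 50, "job": "manager", "country": "CH"},
    {"age": 28, "job": "engineer", "country": "NL"},
    {"age": 41, "job": "nurse", "country": "FR"},
]
encoded = one_hot_encode(rows, categorical=["job", "country"])
original_width = len(rows[0])
encoded_width = len(encoded[0])
print(original_width, "->", encoded_width)  # prints: 3 -> 8
```

Here three columns become eight (1 continuous + 4 job categories + 3 countries); with real categorical columns that have many distinct values, the growth is far larger.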

High-dimensional data¹ is challenging for tabular GANs to

learn global relations. Prior studies [24], [26], [27] show that

the tabular GAN algorithms, which adopt CNNs as generator

and discriminator, achieve better synthesis quality than using

purely fully-connected neural networks. This is due to the fact

that CNNs can extract local spatial features well. The first limitation
of directly adopting a CNN to model tabular data is that it may overlook
global relations between columns due to the limited size of the
convolution filter. This limitation is exacerbated when one-hot encoding
is applied to categorical variables. Second, although permuting columns,
e.g., reordering the columns by their types, carries no semantic meaning,
it distorts the local feature representation extracted by the convolution
layers. When using a CNN for tabular GANs, one table row is transformed
into one fixed-size image by mapping each column value to a pixel. The
relationship between highly distant pixels, e.g., the pixel in the upper
left corner and the pixel in the lower right corner, may not influence
image classification in a real image. But for tabular data wrapped as an
image, these two pixels can represent highly correlated columns.
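The row-to-image wrapping described above can be sketched as follows. This is a minimal illustration that assumes the row is laid out row-major into the smallest square grid and zero-padded at the end; the helper names and the 3x3 kernel size are assumptions, not details confirmed by this section.

```python
import math


def row_to_image(row):
    """Wrap a 1-D encoded row into the smallest square grid, zero-padding the tail."""
    side = math.ceil(math.sqrt(len(row)))
    padded = list(row) + [0.0] * (side * side - len(row))
    return [padded[i * side:(i + 1) * side] for i in range(side)]


def pixel_position(column_index, side):
    """(row, col) of the pixel that holds a given table column."""
    return divmod(column_index, side)


def within_kernel(pos_a, pos_b, kernel=3):
    """True if both pixels can fall inside a single kernel x kernel window."""
    return abs(pos_a[0] - pos_b[0]) < kernel and abs(pos_a[1] - pos_b[1]) < kernel


row = [float(i) for i in range(24)]   # 24 encoded column values (hypothetical)
image = row_to_image(row)             # 5 x 5 grid, last cell is padding
side = len(image)
first = pixel_position(0, side)       # first table column
last = pixel_position(23, side)       # last table column
print(first, last, within_kernel(first, last))
```

In this toy layout the first and last columns sit at opposite corners of the grid, so a single 3x3 convolution never sees them together, even if the two columns are strongly correlated. Reordering the columns changes which pairs share a window.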

¹ In this paper, dimension refers to the number of columns.

arXiv:2210.06239v1 [cs.LG] 12 Oct 2022

To address the above two limitations, we propose a conditional tabular
GAN with Fourier Network blocks (FNBs). The objective of the FNBs is to
learn the interactions among spatial locations in the frequency domain.
We use FNBs for both the discriminator and the generator, with different
designs. The Fourier layer, which is the key part of an FNB, contains
three operations: (i) a 2D discrete Fourier transform, (ii) element-wise
multiplication between frequency-domain features and learnable weights,
and (iii) a 2D inverse discrete Fourier transform. Furthermore, we
process input data with transformer-style tokenization. A CNN-based
filter is applied to the original data to capture local spatial features
and transform them into feature tokens. Fourier layers transform the
tokens into the frequency domain, and the learnable weights are then
applied to all frequencies to learn the global relations. Our results
show that FCT-GAN outperforms the state of the art (SOTA) by up to 27.8%
in machine learning utility and 26.5% in statistical similarity on 7
datasets. Thanks to the Fourier blocks' ability to capture local and
global relations, our results also show that, across three different
column orders, FCT-GAN has the least variation in synthesis quality among
all compared methods. The experiment with one high-dimensional dataset,
on which 3 SOTA algorithms fail to train due to the data dimension issue,
shows that FCT-GAN still maintains its performance and stability above
all comparisons.
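The three Fourier-layer operations listed above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: it uses a naive pure-Python DFT on a single-channel grid (a library routine such as `numpy.fft.fft2` would normally replace it), the function names are hypothetical, and taking the real part of the output is an assumption made to keep the result real-valued.

```python
import cmath


def dft2(x, inverse=False):
    """Naive 2D discrete Fourier transform of an H x W grid (list of lists)."""
    h, w = len(x), len(x[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for m in range(h):
                for n in range(w):
                    angle = 2 * cmath.pi * (u * m / h + v * n / w)
                    total += x[m][n] * cmath.exp(sign * 1j * angle)
            # The inverse transform carries the 1 / (H * W) normalization.
            out[u][v] = total / (h * w) if inverse else total
    return out


def fourier_layer(tokens, weights):
    """(i) 2D DFT, (ii) element-wise product with a filter, (iii) inverse 2D DFT."""
    freq = dft2(tokens)
    filtered = [[f * k for f, k in zip(f_row, k_row)]
                for f_row, k_row in zip(freq, weights)]
    spatial = dft2(filtered, inverse=True)
    # Assumption: keep only the real part of the result.
    return [[value.real for value in row] for row in spatial]


tokens = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
identity = [[1.0] * 3 for _ in range(2)]   # all-ones filter: layer is a no-op
dc_only = [[0.0] * 3 for _ in range(2)]
dc_only[0][0] = 1.0                        # keep only the zero-frequency term

same = fourier_layer(tokens, identity)
smoothed = fourier_layer(tokens, dc_only)
print(same)
print(smoothed)
```

With the zero-frequency filter, every output position becomes the mean of all inputs, which shows why one element-wise product in the frequency domain mixes information across every position at once. In a trained network the filter entries would be learned complex weights rather than fixed constants.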

The main contributions of this study can be summarized

as follows: (1) We introduce the Fourier transform into tab-

ular GAN training and design a generator and discriminator

architecture. (2) Combined with a transformer-style input tokenizer,

the novel architecture can capture both local and global

relations of tabular data, leading to the desirable property of

column permutation invariance. (3) We extensively evaluate

FCT-GAN on 8 datasets against 5 state-of-the-art synthesizers,

with a special focus on high dimensional real-world data.

II. RELATED WORK

We review tabular data synthesis methods and

Fourier networks.

A. Tabular Data Synthesizers

There are various approaches for synthesizing tabular data.
Probabilistic models such as Copulas [17] use Copula functions to model
multivariate distributions, but categorical data cannot be modeled by a
Gaussian Copula. Synthpop [15] works on a variable-by-variable basis by
fitting a sequence of regression models and drawing synthetic values from
the corresponding predictive distributions. Because it proceeds variable
by variable, the training process is computationally intensive. Bayesian
networks [2], [25] are used to synthesize categorical variables, but they
lack the ability to generate continuous variables.

The current state of the art comprises several tabular GAN algorithms.
Table-GAN [16] introduces an auxiliary classification model along with
discriminator training to enhance column dependency in the synthetic
data. CT-GAN [24] and CTAB-GAN [26] improve data synthesis by introducing
several preprocessing steps for categorical, continuous or mixed data
types, which encode data columns into a form suitable for GAN training.
The conditional vector designed by CT-GAN and later improved by CTAB-GAN
also helps GAN training reduce mode collapse on minority categories.
CTAB-GAN+ [27], PATE-GAN [8], and IT-GAN [10] generate tabular data
without risking the privacy of the original data by either adopting
differential privacy or controlling the negative log-density of real
records during GAN training. However, as far as we know, no previous work
on tabular GAN algorithms focuses on countering the influence of column
permutations in the training data on the final synthetic data quality.

B. Fourier Networks

The Fourier transform has played an important role in image processing
for decades, e.g., JPEG compression [23]. Incorporating the Fourier
transform into neural network architecture design has been studied in
many vision works [3], [14], [21]. Recent work also leverages the Fourier
transform to design deep neural networks that solve partial differential
equations (PDE) [12] and NLP tasks [11]. Our Fourier network block
architecture design is mainly inspired by the Global Filter Network
(GFNet) [20]. We take the design of the input tokenization and global
filter layer from GFNet and use it in our design of the generator and
discriminator for tabular GAN.

Fig. 1: Normalized maximal absolute variation (MAV) of the difference in
statistical similarity between original and synthetic data among
different column orders: (1) Original, (2) Order by data type, and
(3) Order by column correlation.

III. ANALYSIS ON PERMUTATION INVARIANCE

TABLE I: Average statistical similarity difference between original and
synthetic data among three different column orders.

Method       Avg-JSD   Avg-WD   Diff. Corr.
CTAB-GAN+    0.040     0.012    2.21
CTAB-GAN     0.040     0.013    2.21
Table-GAN    0.20      0.017    3.57
TVAE         0.10      0.231    2.79
CT-GAN       0.067     0.038    3.17

We empirically demonstrate the instability of prior SOTA

methods under permutations of the columns in the training data.

Details on the datasets and evaluation metrics are provided in

the Experiment section.
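As a rough guide to two of the columns in Table I, the sketch below computes a Jensen-Shannon divergence averaged over categorical columns and a Wasserstein distance averaged over continuous columns. It is a simplified illustration under stated assumptions: the toy tables and helper names are hypothetical, the logarithm base is 2, the continuous values are assumed to be already scaled to [0, 1], and both tables have the same number of rows. The paper's exact normalization may differ.

```python
import math
from collections import Counter


def jsd(real, synth):
    """Jensen-Shannon divergence (base 2) between two categorical samples."""
    p_counts, q_counts = Counter(real), Counter(synth)
    total = 0.0
    for category in set(p_counts) | set(q_counts):
        p = p_counts[category] / len(real)
        q = q_counts[category] / len(synth)
        m = 0.5 * (p + q)
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total


def wasserstein_1d(real, synth):
    """1-Wasserstein distance between two equal-size 1-D samples."""
    if len(real) != len(synth):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(real), sorted(synth))
    return sum(abs(a - b) for a, b in pairs) / len(real)


# Hypothetical toy tables stored column-wise.
real = {"job": ["clerk", "clerk", "nurse", "nurse"], "age": [0.1, 0.4, 0.5, 0.9]}
synth = {"job": ["clerk", "clerk", "clerk", "nurse"], "age": [0.2, 0.4, 0.6, 0.9]}
categorical, continuous = ["job"], ["age"]

avg_jsd = sum(jsd(real[c], synth[c]) for c in categorical) / len(categorical)
avg_wd = sum(wasserstein_1d(real[c], synth[c]) for c in continuous) / len(continuous)
print(avg_jsd, avg_wd)
```

Both quantities are zero when the synthetic column reproduces the real column's distribution exactly, and grow as the two distributions drift apart.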

We consider three column orders: (i) Original: as the name suggests,
keeps the column order of the data as downloaded from the dataset source.
(ii) Order by data type: puts all the continuous columns at the beginning
and all categorical columns at the back. (iii) Order by data correlation:
first calculates the pairwise correlations between all columns, then
sorts the columns by absolute correlation value, with highly correlated
pairs in front and less correlated pairs later. Duplicate columns are
skipped. We evaluate the statistical dissimilarity between real and
synthetic tables using the average Jensen–Shannon divergence (Avg-JSD)
over all categorical columns and the average Wasserstein distance
(Avg-WD) over all continuous columns. Finally, Diff. Corr. denotes the
averaged distance between the correlation matrices of the real and the
synthetic data. The lower these metrics, the better the synthetic data
quality. For metric $M$ and dataset $D$,
$L = \{V^{N}_{DM}, V^{N}_{DM}, \ldots, V^{N}_{DM}\}$ denotes
