Synthetic Text Generation with Differential Privacy:
A Simple and Practical Recipe
Xiang Yue1,∗, Huseyin A. Inan2, Xuechen Li3, Girish Kumar5, Julia McAnallen4, Hoda Shajari4, Huan Sun1, David Levitan4, and Robert Sim2
1The Ohio State University, 2Microsoft Research, 3Stanford University, 4Microsoft, 5UC Davis
{yue.149,sun.397}@osu.edu
lxuechen@cs.stanford.edu gkum@ucdavis.edu
{Huseyin.Inan,Julia.McAnallen,hodashajari,David.Levitan,rsim}@microsoft.com
Abstract
Privacy concerns have attracted increasing attention in data-driven products due to the tendency of machine learning models to memorize sensitive training data. Generating synthetic versions of such data with a formal privacy guarantee, such as differential privacy (DP), provides a promising path to mitigating these privacy concerns, but previous approaches in this direction have typically failed to produce synthetic data of high quality. In this work, we show that a simple and practical recipe in the text domain is effective: simply fine-tuning a pre-trained generative language model with DP enables the model to generate useful synthetic text with strong privacy protection. Through extensive empirical analyses on both benchmark and private customer data, we demonstrate that our method produces synthetic text that is competitive in terms of utility with its non-private counterpart, while providing strong protection against potential privacy leakage.1
1 Introduction
The issue of privacy has gained increasing attention in natural language processing (NLP). Privacy attacks against common NLP pipelines have demonstrated that models trained without formal privacy guarantees can reveal membership information and enable training data reconstruction (Shokri et al., 2017; Carlini et al., 2021). Privacy concerns manifested through tightening legislation (e.g., GDPR (Art. 29 WP, 2014)) and growing discussions on policy and ethics call for improved approaches for privacy-preserving machine learning.
∗Most of the work was done when Xiang, Xuechen, and Girish interned at Microsoft (Research).
1Our code is available at https://github.com/microsoft/dp-transformers
Among different approaches for learning with private data, learning with differential privacy (DP) (Dwork et al., 2006) has become the gold standard, as its formal guarantee enables reasoning about the privacy loss in a principled manner and makes the approach resilient to strong privacy attacks (Carlini et al., 2019). Recent developments have substantially improved the computational efficiency and privacy-utility trade-off of DP machine learning (Subramani et al., 2021; Li et al., 2022b; Yu et al., 2022; De et al., 2022; Bu et al., 2022; Li et al., 2022a; Mehta et al., 2022, inter alia), demonstrating gains for learning models that perform specific downstream tasks.
In contrast to the above works, we study synthetic text generation by building generative text models with DP training algorithms (Figure 1). The goal of this approach is to learn a generative model that faithfully captures distributional properties of the training data (and the underlying distribution), as opposed to learning task-oriented models with specific functions. Compared to directly learning models for target tasks, this paradigm has several advantages: (1) DP-trained generative models can be used to draw synthetic data for learning an expanding set of task models without incurring any additional privacy loss (due to the post-processing property of DP); (2) dataset debugging is made easy, as synthetic text generated from DP-trained models can be shared more freely, and inspecting its samples poses less of a privacy concern compared to examining the original private data (Augenstein et al., 2020); (3) synthetic data generated from DP-trained models can be retained for a longer time under certain existing policies (e.g., the right to be forgotten), thanks to the fact that DP implies some degree of approximate machine unlearning (Bourtoule et al., 2021; Sekhari et al., 2021).
[Figure 1 depicts the pipeline: private customer data (e.g., reviews and feedback) → a generative language model trained with DP-SGD → synthetic texts generated by prompting the model with control codes (e.g., "Business Type: Restaurants | Review Stars: 5.0:\n") → synthetic data release for downstream applications such as ML/NLP tasks, statistical analysis, and feedback analysis.]

Figure 1: Illustration of our problem and methodology. We propose to generate synthetic text with a formal privacy guarantee: we fine-tune a generative language model with DP and then leverage it for synthetic text generation using control codes. The privacy loss of the overall procedure is controlled by the data generation stage as, by the robustness-to-post-processing property of DP, the downstream task stage does not incur any additional privacy loss.
In this work, we initiate a systematic empirical study of the problem and show that DP language model (LM) fine-tuning can be an effective solution to synthetic text generation with privacy. In particular, we show that simply fine-tuning progressively larger autoregressively pre-trained language models on (private) data leads to models that generate increasingly useful synthetic text. For instance, we fine-tune a GPT-2 Large model (Radford et al., 2019) on a review dataset with DP at ϵ = 4 and then use it to generate synthetic text to build downstream classifiers. The classification models achieve performance comparable to classifiers trained on the original dataset (only 2-4% lower in accuracy).
Furthermore, we demonstrate that generating a small amount of synthetic data with DP is sufficient to create classification models that are on par with those trained directly on the entire original dataset with DP. One advantage of the synthetic data approach is that the privacy loss is fixed: an unlimited number of downstream models can be built without incurring additional leakage. In contrast, training additional downstream models on the original data with DP accumulates privacy loss.
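To make this contrast concrete, under basic sequential composition (a loose but simple bound; tighter accountants exist), training k task models directly on the private data D, each with an (ϵ, δ)-DP algorithm, costs (kϵ, kδ) in total, whereas any number of models trained on the released synthetic data stays within the original budget by post-processing:

```latex
\underbrace{(\epsilon,\delta) + \cdots + (\epsilon,\delta)}_{k \text{ DP training runs on } D}
  = (k\epsilon,\; k\delta)
\qquad \text{vs.} \qquad
\text{post-processing of } M(D):\; (\epsilon,\delta).
```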
Distributional similarity evaluation additionally confirms that the synthetic text distribution resembles the original data distribution. We also uncover a novel phenomenon in DP-trained LMs that is of independent interest. Specifically, we observe a length truncation effect in text generation with DP-trained models, resulting in completions that are generally shorter than their non-DP counterparts and than instances in the original dataset.
We further extensively study learning dynamics under DP by injecting specially crafted canaries (Carlini et al., 2019) into the training data. This allows for (i) stress-testing the extent to which DP fine-tuning limits the leakage of private information and (ii) understanding the conditions under which a subject of interest would appear in synthetic generations.
Finally, we conclude our studies on an industrial-level private customer feedback dataset to show the feasibility of our approach in real-world scenarios.
2 Background
2.1 Differential Privacy
Definition 2.1 (Differential Privacy (DP) (Dwork et al., 2006)). A randomized algorithm $M: \mathcal{D} \to \mathcal{S}$ is $(\epsilon, \delta)$-differentially private if for any two neighboring datasets $D, D' \in \mathcal{D}$ that differ in exactly one data sample, and for all sets $S \subseteq \mathcal{S}$:

$$\Pr[M(D) \in S] \le e^{\epsilon} \Pr[M(D') \in S] + \delta.$$
This definition provides a rigorous privacy guarantee by theoretically bounding the effect of a single data sample in the dataset. For a differentially private algorithm, the output distribution is statistically similar whether any individual data sample appears in the input dataset or not. The privacy parameter ϵ quantifies the maximum allowable impact of a single individual's data on the outcome; δ specifies the maximum probability that the privacy guarantee may fail. An algorithm can typically be made (ϵ, δ)-DP by bounding the contribution of a single data sample and adding controlled noise from a predetermined distribution (e.g., Gaussian) (Dwork and Roth, 2014). Setting ϵ and δ in practice often requires careful consideration of the specific use case and the acceptable trade-off between privacy and utility. We discuss our choice of ϵ and δ in Section 4.1.
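For concreteness, at the ϵ = 4 used later in this paper, the definition caps how much more likely any observable outcome can become when one individual's data is added or removed:

```latex
\Pr[M(D) \in S] \;\le\; e^{4}\,\Pr[M(D') \in S] + \delta,
\qquad e^{4} \approx 54.6,
```

with δ kept negligibly small (on the order of the inverse dataset size).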
An appealing property of DP crucial to this work is robustness to post-processing. This property ensures that if the algorithm $M$ satisfies $(\epsilon, \delta)$-DP, then so does $F \circ M$ for any deterministic or randomized function $F$ (which is independent of $M$). Namely, one can perform arbitrary post-processing without incurring additional privacy loss.
2.2 DP Stochastic Gradient Descent
Deep learning models can be trained with DP via a modification of the stochastic gradient descent (SGD) algorithm (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016). The modified algorithm clips per-sample gradients to bound the contribution of individual examples. Noise sampled from a Gaussian distribution is then added to the sum of the clipped gradients in a batch to obfuscate the gradient update. The resulting algorithm, called Differentially Private Stochastic Gradient Descent (DP-SGD), can be shown to be DP for some (ϵ, δ) for each update of the model. Privacy parameters at the end of training can be computed via privacy composition algorithms (Abadi et al., 2016; Gopi et al., 2021a). In the next section, we will utilize DP-SGD to train a language model with privacy for synthetic text generation.
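To make the mechanics concrete, below is a minimal PyTorch sketch of a single DP-SGD update. It is illustrative only, not the implementation used in this paper: the function name and hyperparameter values are placeholders, and production code (e.g., Opacus) vectorizes the per-sample gradient computation instead of looping.

```python
import torch

def dp_sgd_step(model, loss_fn, inputs, targets,
                clip_norm=1.0, noise_multiplier=0.6, lr=1e-3):
    """One illustrative DP-SGD update: clip each per-sample gradient to
    L2 norm <= clip_norm, sum, add Gaussian noise, then apply the step."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(inputs, targets):  # loop = per-sample gradients (slow but clear)
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-6)).clamp(max=1.0)  # clipping factor
        for s, g in zip(summed, grads):
            s.add_(g * scale)

    with torch.no_grad():
        for p, s in zip(params, summed):
            # Gaussian noise scaled to the clipping bound obfuscates any
            # single example's contribution to the batch gradient.
            noise = torch.normal(0.0, noise_multiplier * clip_norm, size=p.shape)
            p.add_(-(lr / len(inputs)) * (s + noise))
```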
3 Method
In this section, we formally state the problem and present our method (see Figure 1 for an illustration), which produces a synthetic version of private text data with differential privacy.
3.1 Problem Statement
Let $\mathcal{D}$ be a database representing the collection of token sequences from a fixed dictionary $V$. We define a (randomized) mapping $M: \mathcal{D} \to \mathcal{D}$ such that for a given dataset $D \in \mathcal{D}$, the goal is to generate a synthetic version $M(D) = \tilde{D}$ subject to privacy constraints and utility desiderata.
Regarding privacy constraints, we require that $M$ be $(\epsilon, \delta)$-DP with domain $\mathcal{D}$. This requirement provides strong protection for the participants in the input dataset, as their participation is statistically indistinguishable, to a degree governed by $(\epsilon, \delta)$, to any adversary accessing the model or the synthetic version of the dataset.
For the case of utility, the synthetic version $\tilde{D}$ should ideally be able to replace $D$ as a training resource for models on relevant downstream applications. In other words, on target downstream tasks, models trained on the synthetic dataset $\tilde{D}$ are expected to perform similarly to models trained on the original dataset $D$. More generally, distributional properties of the dataset $D$ should be captured as much as possible in the synthetic version $\tilde{D}$ without violating the aforementioned privacy requirement. These aspects are extensively explored in Section 4.
3.2 Synthetic Text Generation with DP
Conventionally, to generate synthetic text, an autoregressive language model (e.g., GPT-2 (Radford et al., 2019)) is trained on the original dataset and subsequently sampled using a decoding mechanism (e.g., beam search, top-k sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020), etc.) to produce synthetic sequences.
To make this operation differentially private, we adopt DP-SGD to fine-tune a pre-trained generative LM. The post-processing property of DP ensures that once the LM has been fine-tuned with DP, sampling from the model incurs no extra privacy loss.
It would be desirable to synthesize examples with labels. We achieve this by building a conditional generator, as introduced in Keskar et al. (2019), to provide more explicit control over text generation. Using so-called control codes (Keskar et al., 2019), the probability distribution of a text sequence $x = (x_1, x_2, \ldots, x_n)$ is conditioned on a control code $c$ and decomposed as:

$$P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid x_1, x_2, \ldots, x_{i-1}, c).$$
A neural network $p_\theta(\cdot)$ is then trained to model each conditional distribution. The model can later be used to generate new samples conditioned on a control code $c$ by sequentially sampling $p_\theta(x_1 \mid c),\, p_\theta(x_2 \mid \tilde{x}_1, c),\, \ldots,\, p_\theta(x_m \mid \tilde{x}_1, \ldots, \tilde{x}_{m-1}, c)$.
The advantage of this approach is that it provides flexibility in the model's text generation by allowing the conditional control codes to specify a particular style, domain, sentiment, or category. For example, feedback data collected from users on a set of products may contain product types and review scores associated with each data sample. Control codes can then be constructed as $c_{p,r}$ = "Product type: $p$ | Review score: $r$" for different product type ($p$) and review score ($r$) pairs. In our method, we prepend each sample with the control code corresponding to its categories as a simple preprocessing step; a sketch of this step is given below. During text generation, this allows us to use the control codes to generate as many samples as needed while preserving the original categorical distribution.
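As a simple illustration of this preprocessing (the field names and helper functions here are our own, chosen to match the example format above):

```python
def control_code(product_type: str, review_score: float) -> str:
    # Follows the c_{p,r} format described above.
    return f"Product type: {product_type} | Review score: {review_score}"

def preprocess(sample: dict) -> str:
    # Prepend the control code so the LM learns P(x | c) during DP fine-tuning.
    return control_code(sample["product_type"], sample["review_score"]) + "\n" + sample["text"]

sample = {"product_type": "Restaurants", "review_score": 5.0,
          "text": "The food came right on time and it was delicious!"}
print(preprocess(sample))
# Product type: Restaurants | Review score: 5.0
# The food came right on time and it was delicious!
```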
We point out that the categorical distribution of the original dataset may itself be a piece of private information. However, its estimation can easily be privatized (Dwork and Roth, 2014), and for simplicity we ignore the low-cost privacy loss of this step and use the exact categorical distribution of the original dataset in this paper.
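For completeness, privatizing the categorical distribution could look like the following Laplace-mechanism sketch. This is our illustration, not part of the paper's pipeline; it assumes each individual contributes a single sample, so the L1 sensitivity of the count vector under add/remove neighbors is 1.

```python
import numpy as np

def dp_category_distribution(counts: np.ndarray, epsilon: float) -> np.ndarray:
    """Laplace mechanism for a histogram (Dwork and Roth, 2014).
    Sensitivity 1 assumes one sample per individual (add/remove neighbors)."""
    noisy = counts + np.random.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0.0, None)   # counts cannot be negative
    return noisy / noisy.sum()          # renormalize into a distribution

counts = np.array([52000.0, 31000.0, 12000.0, 4000.0, 1000.0])  # toy category counts
print(dp_category_distribution(counts, epsilon=0.5))
```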
4 Analyses on a Public Review Dataset
In this section, we extensively analyze our method with experiments on a public benchmark dataset, the Yelp Open Dataset,2 which has been widely adopted for language modeling and text classification tasks. We then apply our method to an internal private customer feedback dataset in Section 5.
4.1 Experimental Setup
Dataset. The Yelp dataset contains review text data on businesses and can be studied for academic purposes. We select two attributes for conditional generation as well as the downstream task applications: review stars (1-5) and business category. We sample 10 frequent business categories and remove the reviews that do not have ratings (details can be found in Appendix A.1). This results in a dataset with 1.9M reviews for training, 5,000 for validation, and 5,000 for testing.
Implementation Details. We utilize the public repository of Inan et al. (2022), which is based on Huggingface (Wolf et al., 2019) and Opacus (Yousefpour et al., 2021), for fine-tuning language models with DP. Specifically, we fine-tune three language models for synthetic text generation: GPT2 (Radford et al., 2019), GPT2-Medium, and GPT2-Large. Additionally, we fine-tune the RoBERTa-base model (Liu et al., 2019) for downstream text classification tasks.
Control codes are constructed based on attributes such as "Business Type: Bar | Review Stars: 5.0" and are prepended to each sample. Hyperparameters are specified in Appendix A. For both synthetic text generation and classification, we set the maximum sequence length to 128, unless otherwise specified. During training, we evaluate the models on the dev set and select the checkpoint that achieves the best validation performance for the final evaluation on the test set.

2https://www.yelp.com/dataset

Data Type | Data Generator | ϵ | Rating | Category
--- | --- | --- | --- | ---
Original | - | - | 0.7334 | 0.7752
Synthetic | GPT2 | ∞ | 0.6892 | 0.7584
Synthetic | GPT2 | 4 | 0.6656 | 0.7478
Synthetic | GPT2-Medium | ∞ | 0.6878 | 0.7550
Synthetic | GPT2-Medium | 4 | 0.6756 | 0.7486
Synthetic | GPT2-Large | ∞ | 0.7090 | 0.7576
Synthetic | GPT2-Large | 4 | 0.6936 | 0.7568

Table 1: Synthetic text generation with DP (ϵ = 4) yields models that exhibit downstream-task accuracy (review rating and business category classification) comparable to models trained on synthetic text generated without privacy protection (ϵ = ∞).
We set the privacy parameter ϵ to 4, a choice supported by prior work (Yu et al., 2021a; Li et al., 2022b; Yu et al., 2022; De et al., 2022; Mehta et al., 2022) and real-world applications. For instance, the release of US population data uses ϵ = 13.64 (Bureau, 2020), and the development of a next-word prediction model uses ϵ = 6.92 (Google, 2022). Our ϵ = 4 is smaller and provides stronger privacy protection. As recommended by Hsu et al. (2014) and De et al. (2022), δ should be smaller than the inverse of the dataset size N, and we set δ = 1/(N · log N). The additive noise scale is calculated using the numerical composition algorithm (Gopi et al., 2021b), given the batch size and number of epochs for each setting specified in Appendix A for DP training.
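A hedged sketch of how such a DP fine-tuning run can be configured with Opacus, in the spirit of (but not identical to) the dp-transformers repository: hyperparameters are placeholders, the data loader is a stand-in for tokenized, control-code-prefixed reviews, and depending on library versions GPT-2's Conv1D layers and tied embeddings may need the extra per-sample-gradient support that dp-transformers provides.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2LMHeadModel
from opacus import PrivacyEngine

N = 1_900_000                         # Yelp training-set size (Section 4.1)
delta = 1.0 / (N * math.log(N))       # the delta rule above, ~3.6e-8

model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Placeholder data; in practice: tokenized reviews prefixed with control codes.
token_ids = torch.randint(0, 50257, (256, 128))
loader = DataLoader(TensorDataset(token_ids), batch_size=32)

engine = PrivacyEngine()
model, optimizer, loader = engine.make_private_with_epsilon(
    module=model, optimizer=optimizer, data_loader=loader,
    epochs=3,                 # placeholder
    target_epsilon=4.0,       # the paper's epsilon
    target_delta=delta,
    max_grad_norm=1.0,        # per-sample gradient clipping bound
)
```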
To generate synthetic text samples, we employ top-k sampling (Fan et al., 2018) and nucleus sampling (top-p) (Holtzman et al., 2020), with k = 50 and p = 0.9. To produce synthetic datasets that preserve categorical distributions (e.g., business category), we generate 100K samples from the fine-tuned models using the appropriate control codes.
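A sketch of this generation step (the checkpoint path is a placeholder; the prompt follows the control-code format from Figure 1). Because sampling is post-processing of the DP-trained model, it incurs no additional privacy loss:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Placeholder path: the GPT-2 checkpoint fine-tuned with DP-SGD as above.
model = GPT2LMHeadModel.from_pretrained("path/to/dp-finetuned-gpt2")

prompt = "Business Type: Restaurants | Review Stars: 5.0:\n"  # control code
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

outputs = model.generate(
    input_ids,
    do_sample=True, top_k=50, top_p=0.9,   # the paper's decoding settings
    max_length=128, num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```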
4.2 Downstream Tasks on Synthetic Data
One way to evaluate the quality of the synthetic dataset is to examine the performance of downstream task models trained on it. We fine-tune RoBERTa-base models for classifying review ratings and business categories using the synthetic data; a sketch of this evaluation setup follows.
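A minimal sketch of the rating classifier, assuming synthetic reviews are used for training and the original (real) test split for evaluation; the tiny in-line dataset and the `ReviewDataset` wrapper are illustrative placeholders:

```python
import torch
from transformers import (RobertaForSequenceClassification, RobertaTokenizer,
                          Trainer, TrainingArguments)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base",
                                                         num_labels=5)

class ReviewDataset(torch.utils.data.Dataset):
    """Wraps (text, star-rating) pairs; labels are 0-indexed star classes."""
    def __init__(self, texts, stars):
        self.enc = tokenizer(texts, truncation=True, max_length=128,
                             padding="max_length")
        self.labels = [int(s) - 1 for s in stars]
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Toy stand-ins: train on synthetic reviews, evaluate on the real test split.
train_ds = ReviewDataset(["The food came right on time!"], [5.0])
test_ds = ReviewDataset(["The food took 6 hours to arrive!"], [1.0])

args = TrainingArguments(output_dir="rating-clf", num_train_epochs=3,
                         per_device_train_batch_size=32)
Trainer(model=model, args=args, train_dataset=train_ds,
        eval_dataset=test_ds).train()
```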