
on a set of products may contain product types
and review scores associated with each data
sample. Control codes can be constructed as
c_{p,r} = "Product type: p | Review score: r" for different product type (p) and review score (r) pairs. In our method, we utilize control codes
to prepend each sample with its corresponding
categories as a simple preprocessing step. During text generation, this allows us to use the control codes to generate as many samples as needed while the original categorical distribution is preserved.
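As a concrete illustration, the following minimal sketch (with hypothetical function and field names of our own choosing, not from any released code) shows how such control codes can be prepended to raw samples during preprocessing:

```python
def build_control_code(product_type: str, review_score: int) -> str:
    # c_{p,r} = "Product type: p | Review score: r"
    return f"Product type: {product_type} | Review score: {review_score}"

def prepend_control_codes(samples):
    # Prepend each sample's control code so the fine-tuned language model
    # learns to generate text conditioned on the categorical attributes.
    return [
        build_control_code(s["product_type"], s["review_score"]) + "\n" + s["text"]
        for s in samples
    ]

samples = [{"product_type": "Headphones", "review_score": 5, "text": "Great sound."}]
print(prepend_control_codes(samples)[0])
```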
We point out that the categorical distribution in
the original dataset may also be a piece of private
information itself. However, its estimation could
easily be privatized (Dwork and Roth, 2014), and for simplicity, we ignore the low-cost privacy loss of this step and use the exact categorical distribution of the original dataset in this paper.
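For illustration, one standard way to privatize such a histogram is the Laplace mechanism (Dwork and Roth, 2014); the sketch below is a generic example of that mechanism rather than part of our pipeline:

```python
import numpy as np

def dp_categorical_distribution(counts, epsilon):
    # Laplace mechanism: each record contributes to exactly one category,
    # so the histogram has L1 sensitivity 1 (add/remove adjacency) and the
    # noise scale is 1/epsilon.
    noisy = np.asarray(counts, dtype=float)
    noisy += np.random.laplace(scale=1.0 / epsilon, size=noisy.shape)
    noisy = np.clip(noisy, 0.0, None)  # counts cannot be negative
    return noisy / noisy.sum()         # renormalize into a distribution

print(dp_categorical_distribution([1200, 300, 500], epsilon=1.0))
```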
4 Analyses on a Public Review Dataset
In this section, we extensively analyze our method with experiments on a public benchmark dataset: the Yelp Open Dataset,² which has been widely adopted for language modeling and text classification tasks. We then apply our method to an internal private customer feedback dataset in Section 5.
4.1 Experimental Setup
Dataset. The Yelp dataset contains review text data on businesses that can be studied for academic purposes. We select two attributes for conditional generation as well as for the downstream task applications: review stars (1-5) and business category. We sample 10 frequent business categories and remove the reviews that do not have ratings (details in Appendix A.1). This results in a dataset of 1.9M reviews for training, 5,000 for validation, and 5,000 for testing.
Implementation Details. We utilize the public repository (Inan et al., 2022), which is based on Huggingface (Wolf et al., 2019) and Opacus (Yousefpour et al., 2021), for fine-tuning language models with DP. Specifically, we fine-tune three language models, GPT2 (Radford et al., 2019), GPT2-Medium, and GPT2-Large, for synthetic text generation. Additionally, we fine-tune the RoBERTa-base model (Liu et al., 2019) for downstream text classification tasks.
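For intuition, here is a minimal, self-contained sketch of DP training with Opacus' public API on a toy model; the actual training loop, GPT2-specific gradient instrumentation, and hyperparameters come from the repository of Inan et al. (2022), and the toy data and values below are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-in for the language model; Opacus clips per-sample gradients
# and adds calibrated Gaussian noise during optimizer steps.
model = torch.nn.Linear(768, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = TensorDataset(torch.randn(512, 768), torch.randint(0, 2, (512,)))
loader = DataLoader(data, batch_size=64)

engine = PrivacyEngine(accountant="prv")  # PRV accountant (numerical composition)
model, optimizer, loader = engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    target_epsilon=4.0,   # the privacy budget used in this paper
    target_delta=1e-6,    # illustrative; we use delta = 1/(N log N)
    epochs=3,
    max_grad_norm=1.0,    # per-sample gradient clipping bound
)

criterion = torch.nn.CrossEntropyLoss()
for x, y in loader:
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()
```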
²https://www.yelp.com/dataset

Data Type   Data Generator   ϵ    Rating   Category
Original    -                -    0.7334   0.7752
Synthetic   GPT2             ∞    0.6892   0.7584
Synthetic   GPT2             4    0.6656   0.7478
Synthetic   GPT2-Medium      ∞    0.6878   0.7550
Synthetic   GPT2-Medium      4    0.6756   0.7486
Synthetic   GPT2-Large       ∞    0.7090   0.7576
Synthetic   GPT2-Large       4    0.6936   0.7568

Table 1: Synthetic text generation with DP yields models that exhibit comparable accuracy in downstream tasks (review rating and business category classification) when compared to models trained on the synthetic text generated without privacy protection.

Control codes are constructed based on attributes, such as “Business Type: Bar | Review Stars: 5.0”, and are prepended to each sample. Hyperparameters are specified in Appendix A. For both synthetic text generation and classification, we set the maximum sequence length to 128, unless otherwise specified. During training, we evaluate the models on the dev dataset and select the checkpoint that achieves the best validation performance for the final evaluation on the test set.
We set the privacy parameter ϵ to 4, which is supported by prior work (Yu et al., 2021a; Li et al., 2022b; Yu et al., 2022; De et al., 2022; Mehta et al., 2022) and real-world applications. For instance, the release of US population data uses ϵ = 13.64 (Bureau, 2020), and the development of a next-word prediction model uses ϵ = 6.92 (Google, 2022). Our ϵ = 4 is smaller and provides stronger privacy protection. As recommended by Hsu et al. (2014) and De et al. (2022), δ should be smaller than the inverse of the dataset size N, and we set δ = 1/(N · log N). The additive noise scale is calculated using the numerical composition algorithm (Gopi et al., 2021b), given the batch size and epochs for each setting mentioned in Appendix A for DP training.
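As an illustration of this calibration step, Opacus exposes a noise-multiplier search driven by its PRV accountant, an implementation of numerical composition (Gopi et al., 2021b); the batch size and epoch count below are placeholders, not the values in Appendix A:

```python
import math
from opacus.accountants.utils import get_noise_multiplier

N = 1_900_000                    # number of training reviews
delta = 1.0 / (N * math.log(N))  # delta = 1/(N log N)

sigma = get_noise_multiplier(
    target_epsilon=4.0,
    target_delta=delta,
    sample_rate=1024 / N,  # batch size / dataset size (placeholder batch size)
    epochs=10,             # placeholder; actual values are in Appendix A
    accountant="prv",      # numerical composition (Gopi et al., 2021b)
)
print(f"noise multiplier: {sigma:.3f}")
```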
To generate synthetic text samples, we employ top-k sampling (Fan et al., 2018) and nucleus sampling (top-p) (Holtzman et al., 2020), with k = 50 and p = 0.9. To produce synthetic datasets that preserve categorical distributions (e.g., business category), we generate 100K samples from the fine-tuned models using the appropriate control codes.
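A minimal generation sketch with the Huggingface API follows; the base GPT2 checkpoint below stands in for our DP fine-tuned models:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # stand-in for a DP fine-tuned model

# Condition generation on a control code, sampled in proportion to the
# categorical distribution of the original dataset.
prompt = "Business Type: Bar | Review Stars: 5.0"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,        # top-k sampling (Fan et al., 2018)
    top_p=0.9,       # nucleus sampling (Holtzman et al., 2020)
    max_length=128,  # maximum sequence length used throughout
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```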
4.2 Downstream Tasks on Synthetic Data
One way to evaluate the quality of the synthetic dataset is by examining the performance of downstream task models trained on it. We fine-tune RoBERTa-base models for classifying review ratings and business categories using the synthetic