
on a set of products may contain product types
and review scores associated with each data
sample. Control codes can be constructed as
c_{p,r} = "Product type: p | Review score: r" for different product type (p) and review score (r) pairs. In our method, we utilize control codes
to prepend each sample with its corresponding
categories as a simple preprocessing step. During text generation, this allows us to use the control codes to generate as many samples as needed while the original categorical distribution is preserved.
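As a concrete illustration, the following minimal sketch (with hypothetical function and field names of our own choosing, not from any released code) shows how such control codes can be prepended to raw samples during preprocessing:

```python
def build_control_code(product_type: str, review_score: int) -> str:
    # c_{p,r} = "Product type: p | Review score: r"
    return f"Product type: {product_type} | Review score: {review_score}"

def prepend_control_codes(samples):
    # Prepend each sample's control code so the fine-tuned language model
    # learns to generate text conditioned on the categorical attributes.
    return [
        build_control_code(s["product_type"], s["review_score"]) + "\n" + s["text"]
        for s in samples
    ]

samples = [{"product_type": "Headphones", "review_score": 5, "text": "Great sound."}]
print(prepend_control_codes(samples)[0])
```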
We point out that the categorical distribution in
the original dataset may also be a piece of private
information itself. However, its estimation could
easily be privatized (Dwork and Roth, 2014), and for simplicity, we ignore the low-cost privacy loss of this step and use the exact categorical distribution of the original dataset in this paper.
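For illustration, one standard way to privatize such a histogram is the Laplace mechanism (Dwork and Roth, 2014); the sketch below is a generic example of that mechanism rather than part of our pipeline:

```python
import numpy as np

def dp_categorical_distribution(counts, epsilon):
    # Laplace mechanism: each record contributes to exactly one category,
    # so the histogram has L1 sensitivity 1 (add/remove adjacency) and the
    # noise scale is 1/epsilon.
    noisy = np.asarray(counts, dtype=float)
    noisy += np.random.laplace(scale=1.0 / epsilon, size=noisy.shape)
    noisy = np.clip(noisy, 0.0, None)  # counts cannot be negative
    return noisy / noisy.sum()         # renormalize into a distribution

print(dp_categorical_distribution([1200, 300, 500], epsilon=1.0))
```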
4 Analyses on a Public Review Dataset
In this section, we extensively analyze our method with experiments on a public benchmark dataset: the Yelp Open Dataset,² which has been widely adopted for language modeling and text classification tasks. We then apply our method to an internal private customer feedback dataset in Section 5.
4.1 Experimental Setup
Dataset. The Yelp dataset contains review text data on businesses that can be studied for academic purposes. We select two attributes for conditional generation as well as for the downstream task applications: review stars (1-5) and business category. We sample 10 frequent business categories and remove the reviews that do not have ratings (details in Appendix A.1). This results in a dataset of 1.9M reviews for training, 5,000 for validation, and 5,000 for testing.
Implementation Details. We utilize the public repository (Inan et al., 2022), which is based on Huggingface (Wolf et al., 2019) and Opacus (Yousefpour et al., 2021), for fine-tuning language models with DP. Specifically, we fine-tune three language models, GPT2 (Radford et al., 2019), GPT2-Medium, and GPT2-Large, for synthetic text generation. Additionally, we fine-tune the RoBERTa-base model (Liu et al., 2019) for downstream text classification tasks.
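For intuition, here is a minimal, self-contained sketch of DP training with Opacus' public API on a toy model; the actual training loop, GPT2-specific gradient instrumentation, and hyperparameters come from the repository of Inan et al. (2022), and the toy data and values below are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-in for the language model; Opacus clips per-sample gradients
# and adds calibrated Gaussian noise during optimizer steps.
model = torch.nn.Linear(768, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = TensorDataset(torch.randn(512, 768), torch.randint(0, 2, (512,)))
loader = DataLoader(data, batch_size=64)

engine = PrivacyEngine(accountant="prv")  # PRV accountant (numerical composition)
model, optimizer, loader = engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    target_epsilon=4.0,   # the privacy budget used in this paper
    target_delta=1e-6,    # illustrative; we use delta = 1/(N log N)
    epochs=3,
    max_grad_norm=1.0,    # per-sample gradient clipping bound
)

criterion = torch.nn.CrossEntropyLoss()
for x, y in loader:
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()
```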
²https://www.yelp.com/dataset

Data Type   Data Generator   ϵ    Rating   Category
Original    -                -    0.7334   0.7752
Synthetic   GPT2             ∞    0.6892   0.7584
Synthetic   GPT2             4    0.6656   0.7478
Synthetic   GPT2-Medium      ∞    0.6878   0.7550
Synthetic   GPT2-Medium      4    0.6756   0.7486
Synthetic   GPT2-Large       ∞    0.7090   0.7576
Synthetic   GPT2-Large       4    0.6936   0.7568

Table 1: Synthetic text generation with DP yields models that exhibit comparable accuracy in downstream tasks (review rating and business category classification) when compared to models trained on the synthetic text generated without privacy protection.

Control codes are constructed based on attributes, such as “Business Type: Bar | Review Stars: 5.0”, and are prepended to each sample. Hyperparameters are specified in Appendix A. For both synthetic text generation and classification, we set the maximum sequence length to 128, unless otherwise specified. During training, we evaluate the models on the dev dataset and select the checkpoint that achieves the best validation performance for the final evaluation on the test set.
We set the privacy parameter ϵ to 4, which is supported by prior work (Yu et al., 2021a; Li et al., 2022b; Yu et al., 2022; De et al., 2022; Mehta et al., 2022) and real-world applications. For instance, the release of US population data uses ϵ = 13.64 (Bureau, 2020), and the development of a next-word prediction model uses ϵ = 6.92 (Google, 2022). Our ϵ = 4 is smaller and provides stronger privacy protection. As recommended by Hsu et al. (2014) and De et al. (2022), δ should be smaller than the inverse of the dataset size N, and we set δ = 1/(N · log N). The additive noise scale is calculated using the numerical composition algorithm (Gopi et al., 2021b), given the batch size and epochs for each setting mentioned in Appendix A for DP training.
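As an illustration of this calibration step, Opacus exposes a noise-multiplier search driven by its PRV accountant, an implementation of numerical composition (Gopi et al., 2021b); the batch size and epoch count below are placeholders, not the values in Appendix A:

```python
import math
from opacus.accountants.utils import get_noise_multiplier

N = 1_900_000                    # number of training reviews
delta = 1.0 / (N * math.log(N))  # delta = 1/(N log N)

sigma = get_noise_multiplier(
    target_epsilon=4.0,
    target_delta=delta,
    sample_rate=1024 / N,  # batch size / dataset size (placeholder batch size)
    epochs=10,             # placeholder; actual values are in Appendix A
    accountant="prv",      # numerical composition (Gopi et al., 2021b)
)
print(f"noise multiplier: {sigma:.3f}")
```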
To generate synthetic text samples, we employ top-k sampling (Fan et al., 2018) and nucleus sampling (top-p) (Holtzman et al., 2020), with k = 50 and p = 0.9. To produce synthetic datasets that preserve categorical distributions (e.g., business category), we generate 100K samples from the fine-tuned models using the appropriate control codes.
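A minimal generation sketch with the Huggingface API follows; the base GPT2 checkpoint below stands in for our DP fine-tuned models:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # stand-in for a DP fine-tuned model

# Condition generation on a control code, sampled in proportion to the
# categorical distribution of the original dataset.
prompt = "Business Type: Bar | Review Stars: 5.0"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,        # top-k sampling (Fan et al., 2018)
    top_p=0.9,       # nucleus sampling (Holtzman et al., 2020)
    max_length=128,  # maximum sequence length used throughout
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```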
4.2 Downstream Tasks on Synthetic Data
One way to evaluate the quality of the synthetic dataset is by examining the performance of downstream task models trained on it. We fine-tune RoBERTa-base models for classifying review ratings and business categories using the synthetic