
To summarize, our main contributions are fourfold:
• We compare knowledge unlearning with two approaches from the literature known to mitigate privacy risks: a data preprocessing approach and a Differential Privacy (DP) Decoding approach. We show that our approach results in little to no degradation of general LM capabilities (sometimes even improving them) while providing strong privacy protection in situations where individuals practice their RTBF. In contrast, the data preprocessing approach provides weaker privacy protection while being orders of magnitude more computationally demanding, and the DP Decoding approach results in severe degradation of modeling performance.
• We perform additional experiments to determine which factors contribute to the difficulty of knowledge unlearning and find that (1) trying to forget many samples at once results in substantial LM performance degradation, which can be mitigated by sequentially forgetting chunks of data, and that (2) the domain of the target data (Code, License, Wikipedia, etc.) plays a critical role in determining how hard it is to forget.
• We provide a novel metric and a general guideline for quantifying the privacy risks of LMs and for determining when they should be considered to have “forgotten” a given target sequence.
• Knowledge unlearning surprisingly seems to make LMs stronger, with the extreme cases bringing +8.0% (37.6% → 45.6%), +10.1% (57.4% → 67.5%), and +7.9% (62.2% → 70.1%) improvements on Lambada for GPT-NEO 125M, 1.3B, and 2.7B, respectively.
2 RELATED WORK
2.1 PRIVACY METHODS FOR LANGUAGE MODELS
Prior work on mitigating privacy risks for LMs can mainly be divided into data pre/post-processing methods and differential privacy methods.
(Data) Pre/Post-Processing Data preprocessing aims to sanitize the training data, that is, to remove from the training corpus, prior to training, all data that might violate any kind of privacy. These methods mostly rely on measures such as parsers and classification models that try to identify and predict patterns constituting private information. This is effective at identifying well-formatted private information such as social security numbers or special forms of medical notes (Aura et al., 2006; Dernoncourt et al., 2017; Lison et al., 2021; Kandpal et al., 2022). However, as pointed out by Brown et al. (2022), since private information is mostly context-dependent and often not in a specific format, data preprocessing methods cannot fully claim to provide privacy guarantees, especially guarantees that match each individual’s standards. Methods that apply post-processing, such as censoring the LM outputs, face the same limitations.
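To give a concrete sense of such pattern-based sanitization, the sketch below scrubs one well-formatted type of private information (U.S. social security numbers) with a regular expression. The pattern and the scrub_ssn helper are illustrative assumptions of ours, not the actual pipelines of the cited works, which combine many such patterns with trained classifiers.

```python
import re

# Illustrative pattern for one kind of well-formatted private information
# (U.S. social security numbers written as "123-45-6789").
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub_ssn(document: str, placeholder: str = "[SSN]") -> str:
    """Replace anything matching the SSN pattern with a placeholder token."""
    return SSN_PATTERN.sub(placeholder, document)

print(scrub_ssn("Patient John Doe, SSN 123-45-6789, discharged 2021-03-04."))
# -> Patient John Doe, SSN [SSN], discharged 2021-03-04.
```

Note that the name “John Doe” passes through untouched: context-dependent private information with no fixed format is exactly what such filters miss, which is the limitation raised by Brown et al. (2022).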
In this work, we compare our proposed method with the data preprocessing approach of Kandpal et al. (2022), which shows that deduplicating the training corpora before pretraining yields LMs that are more robust against extraction attacks than an LM pretrained under the same circumstances without deduplication. However, we highlight that this approach, while it may still be effective at mitigating overall privacy risks, is not the most suitable one for the realistic scenario of individuals requesting the removal of their information from the implicit parameters of the LMs.
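For intuition, the following is a minimal sketch of exact, hash-based corpus deduplication. It is only a toy stand-in for the method of Kandpal et al. (2022), which also handles near-duplicate sequences rather than just exact copies.

```python
import hashlib

def deduplicate_exact(documents: list[str]) -> list[str]:
    """Keep only the first occurrence of each exact-duplicate document.

    A toy approximation of corpus deduplication: removing repeated text
    before pretraining reduces how strongly the LM memorizes it.
    """
    seen: set[str] = set()
    unique_docs: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs
```

Crucially, this must happen before pretraining; once a sequence has been memorized by a trained LM, removing it from the corpus does not remove it from the model’s parameters without retraining, which is what motivates unlearning instead.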
Differential Privacy Differential Privacy (DP) aims to guarantee that the effect of any individual input on the output of a specific function is bounded (Dwork, 2008; Dwork et al., 2006). In the context of deep neural networks, DP must be applied during the training phase and aims to construct models that provide general guarantees that individual information within the training data cannot be inferred (Abadi et al., 2016). While DP has been shown to be surprisingly effective for fine-tuning LMs (Li et al., 2022; Yu et al., 2022), pretraining LMs with DP still suffers from a substantial performance gap, expensive computation, and slow convergence (Anil et al., 2021).
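The core mechanism behind these costs can be sketched as the per-example step of DP-SGD in the spirit of Abadi et al. (2016): clip each example’s gradient and add calibrated Gaussian noise before averaging. The function below is our own simplified illustration (the clip_norm and noise_multiplier names and the flattened gradient shape are assumptions), not a production implementation.

```python
import numpy as np

def dp_sgd_aggregate(per_example_grads: np.ndarray,
                     clip_norm: float = 1.0,
                     noise_multiplier: float = 1.1,
                     seed: int = 0) -> np.ndarray:
    """Aggregate a batch of per-example gradients in DP-SGD style.

    per_example_grads has shape (batch_size, num_params). Each row is
    clipped to L2 norm `clip_norm` so that no single example dominates,
    then Gaussian noise scaled by `noise_multiplier * clip_norm` is added
    to the sum before averaging.
    """
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_example_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / per_example_grads.shape[0]
```

Materializing and clipping gradients per example, rather than per batch, is a major source of the computational overhead noted above.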
Furthermore, as pointed out by Brown et al. (2022), DP can only provide limited guarantees for LMs because DP requires a unified definition of privacy boundaries, which is inherently impossible for natural language data. Most importantly, in a realistic scenario where individuals may practice their