KNOWLEDGE UNLEARNING FOR MITIGATING
PRIVACY RISKS IN LANGUAGE MODELS
Joel Jang1  Dongkeun Yoon3  Sohee Yang1  Sungmin Cha4
Moontae Lee2,5  Lajanugen Logeswaran2  Minjoon Seo1
1KAIST 2LG AI Research 3Konkuk University 4Seoul National University
5University of Illinois Chicago
{joeljang,sohee.yang,minjoon}@kaist.ac.kr, ramses2687@konkuk.ac.kr
sungmin.cha@snu.ac.kr, {moontae.lee,llajan}@lgresearch.ai
ABSTRACT
Pretrained Language Models (LMs) memorize a vast amount of knowledge during
initial pretraining, including information that may violate the privacy of personal
lives and identities. Previous work addressing privacy issues for language models
has mostly focused on data preprocessing and differential privacy methods, both
requiring re-training the underlying LM. We propose knowledge unlearning as an
alternative method to reduce privacy risks for LMs post hoc. We show that simply
performing gradient ascent on target token sequences is effective at forgetting
them with little to no degradation of general language modeling performances
for larger LMs; it sometimes even substantially improves the underlying LM with
just a few iterations. We also find that sequential unlearning is better than trying
to unlearn all the data at once and that unlearning is highly dependent on which
kind of data (domain) is forgotten. By showing comparisons with a previous data
preprocessing method and a decoding method known to mitigate privacy risks for
LMs, we show that unlearning can give a stronger empirical privacy guarantee in
scenarios where the data vulnerable to extraction attacks are known a priori while
being much more efficient and robust. We release the code and dataset needed to
replicate our results at https://github.com/joeljang/knowledge-unlearning.
1 INTRODUCTION
Recent work has shown that an adversary can extract training data from Pretrained Language Mod-
els (LMs) including Personally Identifiable Information (PII) such as names, phone numbers, and
email addresses, and other information such as licensed code, private clinical notes, and 128-bit
UUIDs (Carlini et al., 2021; Lee et al., 2022; Huang et al., 2022; Lehman et al., 2021). In 2021, an
AI chatbot Iruda became the first AI system to be sued for violating the Personal Information Protec-
tion Act after generating the exact home addresses and bank account numbers of actual individuals
unintentionally (Park, 2021). Heikkilä (2022) has also shown that GPT-3 (Brown et al., 2020), one
of the most well-known LMs currently in commercial use, offered detailed private information about
the Editor-in-Chief of MIT Technology Review including his family members, work address, and
phone number. Considering findings that show extracting training data gets easier as LMs scale to
larger sizes (Carlini et al., 2022a) and that it is common practice for practitioners to release billion-parameter
pretrained LMs for public use (Gao et al., 2020; Black et al., 2021; Zhang et al., 2022),
it has become important to provide privacy guarantees for large LMs.
Practitioners are required to delete personal information from LMs upon individuals’ requests because
each individual has the “Right To Be Forgotten (RTBF)” (Mantelero, 2013; Graves et al., 2021)
and can limit the direct and indirect commercial use of their personal information (Villaronga
et al., 2018). Previous methods addressing privacy risks for language models attempt to remove all
private information from the training data (data preprocessing) (Aura et al., 2006; Dernoncourt et al.,
2017; Lison et al., 2021; Kandpal et al., 2022) or attempt to design algorithms that ensure differen-
tial privacy (DP) (Dwork, 2008; Dwork et al., 2006; Abadi et al., 2016; Anil et al., 2021; Li et al.,
*Work done during an internship at LG AI Research.
[Figure 1 shows an individual (“Bob”) whose sensitive personal information (name, age, marital status, SSN, divorce details, net worth) is contained in the pretraining corpora and who practices his Right To Be Forgotten. Data Preprocessing must find and remove the information and re-train the LM after sanitization (~900 A100 GPU days); Differential Privacy must re-train the LM with a DP algorithm (~1800 A100 GPU days); Knowledge Unlearning, our proposed approach, performs only a few token updates (~0.001 A100 GPU days).]
Figure 1: Comparison of previous approaches and knowledge unlearning when an individual practices his/her Right-To-Be-Forgotten (RTBF).
2022; Yu et al., 2022). Both approaches require retraining the underlying LM every time individuals
want to practice their RTBF, which makes them inadequate for large LMs that are extremely costly
to retrain. Furthermore, as pointed out by Brown et al. (2022), data preprocessing methods assume
private information to be easily identified, specified, and removed, while DP algorithms can only
guarantee protection for information that has clear privacy borders, which makes both inadequate
in real-world scenarios where the standard of privacy might differ for each individual.
To this end, we propose knowledge unlearning (Figure 1) as an efficient solution that can be applied
with just a few parameter updates instead of pretraining the underlying LM again. We perform ex-
periments on GPT-Neo LMs (125M, 1.3B, 2.7B) (Black et al., 2021) and show that simply reversing
the direction of gradient descent during language modeling (which can also be seen as
maximizing instead of minimizing the loss function) is effective at protecting target sequences from
extraction attacks with little to no performance degradation on the initial LM capabilities measured
via 9 common NLP classification benchmarks (Hellaswag (Zellers et al., 2019), Lambada (Paperno
et al., 2016), Winogrande (Sakaguchi et al., 2021), COPA (Gordon et al., 2012), ARC-Easy (Clark
et al., 2018), ARC-Challenge (Clark et al., 2018), Piqa (Bisk et al., 2020), MathQA (Amini et al.,
2019), and PubmedQA (Jin et al., 2019)) and 4 dialogue tasks (Wizard of Wikipedia (Dinan et al.,
2019), Empathetic Dialogues (Rashkin et al., 2019), Blended Skill Talk (Smith et al., 2020), and
Wizard of Internet (Komeili et al., 2022)). In some cases, knowledge unlearning unexpectedly
yields significant improvements in LM performance on some of the benchmarks.
We compare our approach with a data deduplication method (Kandpal et al., 2022) and a differential
privacy decoding method (Majmudar et al., 2022), both known to mitigate privacy risks,
and show the effectiveness of knowledge unlearning, which provides strong privacy protection while
being much more efficient and robust. We also provide a general guideline that can be used to
quantify the memorization and extraction likelihood of target token sequences and suggest when we
can empirically consider them to have been “forgotten”. Specifically, we introduce a novel metric
that measures the extraction likelihood by varying the prefix length of the target token sequence and
quantifying how much of the suffix is actually extracted from the LM.
Surprisingly, for knowledge unlearning, we find that it is easier to forget a chunk of instances se-
quentially rather than trying to forget them all at once. We provide further analysis and show that
the difficulty of knowledge unlearning depends heavily on the target data being forgotten, especially
the domain of the target data. We also provide empirical examples of performing extraction attacks
and how exactly knowledge unlearning provides privacy protection for the LM.
To summarize, our main contributions are fourfold:
• We compare knowledge unlearning with two approaches from the literature known to mitigate
privacy risks: a data preprocessing approach and a Differential Privacy (DP) Decoding
approach. We show that our approach results in little to no performance degradation of
general capabilities (sometimes resulting in improvement) while providing strong privacy
protection in situations where individuals practice their RTBF, whereas the data preprocessing
approach provides weaker privacy protection while being orders of magnitude more computationally
demanding, and the DP Decoding approach results in severe degradation of modeling
performance.
• We perform additional experiments to determine which factors contribute to the difficulty
of knowledge unlearning and find that (1) trying to forget many samples at once results in
substantial LM performance degradation which can be mitigated by sequentially forgetting
chunks of data and that (2) the domain of the target data (Code, License, Wikipedia, etc.)
plays a critical role in determining how hard they are to forget.
• We provide a novel metric and a general guideline for quantifying the privacy risks of LMs
and for determining when they should be considered to have “forgotten” a given target sequence.
• Knowledge unlearning surprisingly seems to make LMs stronger, where the extreme cases
bring +8.0% (37.6% → 45.6%), +10.1% (57.4% → 67.5%), and +7.9% (62.2% → 70.1%)
improvements on Lambada for GPT-Neo 125M, 1.3B, and 2.7B, respectively.
2 RELATED WORK
2.1 PRIVACY METHODS FOR LANGUAGE MODELS
Prior work that tries to mitigate privacy risks for LMs can be divided mainly into data pre/post-
processing methods and differential privacy methods.
(Data) Pre/Post-Processing Data preprocessing aims to sanitize the training data: it gets
rid of all data that might violate any kind of privacy prior to training. These
methods mostly utilize measures such as parsers and classification models that try to identify and
predict patterns that constitute private information. This is effective at identifying well-formatted
private information such as social security numbers or special forms of medical notes (Aura et al.,
2006; Dernoncourt et al., 2017; Lison et al., 2021; Kandpal et al., 2022). However, as pointed out by
Brown et al. (2022), considering that private information is mostly context-dependent and sometimes
in a non-specific format, data preprocessing methods cannot fully claim that they provide privacy
guarantees, especially guarantees that match each individual’s standards. Methods that attempt to
utilize post-processing methods such as applying censorship to the LM outputs still face the same
limitations.
In this work, we compare our proposed method with a data preprocessing approach proposed by
Kandpal et al. (2022) which shows that deduplicating the training corpora before pretraining helps
pretrain LMs that show stronger robustness against extraction attacks than an LM pretrained under
the same circumstances without deduplicating the pretraining corpora. However, we highlight that
this approach, which may still be effective at mitigating the overall privacy risks, is not the most
suitable approach when considering a realistic scenario of individuals requesting the removal of
their information from the implicit parameters of the LMs.
Differential Privacy Differential Privacy (DP) aims to guarantee that the effect of an individual
input on the output of a specific function is bounded (Dwork, 2008; Dwork et al., 2006). In the
context of deep neural networks, DP, which needs to be applied during the training phase, aims
to construct models that can provide general guarantees that the individual information within the
training data cannot be inferred (Abadi et al., 2016). While DP has been shown to be surprisingly
effective for fine-tuning LMs (Li et al., 2022; Yu et al., 2022), pretraining LMs with DP still suffers
from a substantial performance gap, expensive computation, and slow convergence (Anil et al., 2021).
Furthermore, as pointed out by Brown et al. (2022), DP can only provide limited guarantees for LMs
because DP requires a unified definition for privacy boundaries, which is inherently impossible for
natural language data. Most importantly, in a realistic scenario where individuals may practice their
Right-To-Be-Forgotten (RTBF) dynamically after model deployment, it is nontrivial to apply
existing descent-based DP algorithms such as DP-SGD to provide protection against only the
targeted extraction attacks.
2.2 MACHINE UNLEARNING
Machine unlearning has received attention as an alternative approach to overcome data privacy issues
in machine learning (Cao & Yang, 2015; Ginart et al., 2019; Bourtoule et al., 2021; Graves et al.,
2021). Several studies attempt to explore machine unlearning for deep neural networks (Golatkar
et al., 2020; Mehta et al., 2022). However, they mostly focus on proposing algorithms for image
classification models where they aim to forget a whole class; that is, achieve random performance
for specific image classes such as “cats” or “ships”. We are the first, to the best of our knowledge,
to explore unlearning a specific sequence of tokens for LMs, which is quite a different setup from
traditional image classification models (tens of image classes vs. a sequence of tokens that can
each be classified into one of roughly 50,000 vocabulary classes). In this work, we coin this approach as knowledge unlearning
since we are more focused on forgetting specific knowledge represented by sequences of tokens.
Zhou et al. (2022) focus on how forgetting can be leveraged to improve the performance of the un-
derlying model. They propose “forget-and-relearn” that unifies existing iterative training algorithms
by selectively removing undesirable information and re-learning good features, helping boost per-
formance for the tasks of image classification and multi-agent emergent communication. The un-
derlying assumption is that it is often easier to define and stop unwanted behavior than to teach good
behavior. We also observe this phenomenon in Section 4, where we unintentionally find that unlearning
just a few sequences of tokens sometimes boosts general LM capabilities.
2.3 MEMORIZATION IN LANGUAGE MODELS
Previous work that explores the extent to which LMs have memorized their training data approaches the
phenomenon from two different viewpoints. One line of work views memorization by LMs simply as a
threat to individual privacy (Carlini et al., 2021; 2022a; Jagielski et al., 2022) and utilizes metrics
that quantify how much the LMs are susceptible to adversarial attacks. These metrics are mostly
dependent on the specific types of attacks such as the membership inference attack (Shokri et al.,
2017) and measure the privacy risks of LMs by quantifying the success rate of these attacks. In our
work, we instead focus on more targeted extraction attacks.
Another line of work simply quantifies how much knowledge is accumulated and forgotten during
pretraining by extracting relational knowledge about the world (Petroni et al., 2019; Lazaridou et al.,
2021; Jang et al., 2022b;a). This line of work does not view memorization as a negative trait, but as
a positive one that can be leveraged to extract world knowledge from its implicit parameters and per-
form knowledge-intensive tasks such as question answering or training knowledgeable conversation
agents.
Our work is highly related to Jagielski et al. (2022)’s work where they also assert that forgetting
can be a relaxed version of differential privacy. However, there are two main differences between
our work and theirs. First, they only analyze forgetting as a passive form of mitigating privacy,
asserting that data seen early in large-scale training obtain privacy benefits, whereas we suggest a
more active form of forgetting. Second, they only show analysis results with image classification
and audio generation models while we specifically focus on large LMs.
3 KNOWLEDGE UNLEARNING FOR LANGUAGE MODELS
3.1 METHODOLOGY
We propose simply negating the original training objective of minimizing the negative log-likelihood
of the token sequences as our main method of knowledge unlearning in LMs. Specifically, given
a sequence of tokens $\mathbf{x} = (x_1, \ldots, x_T)$, our unlearning training objective is simply maximizing the
following loss function:
$$\mathcal{L}_{UL}(f_\theta, \mathbf{x}) = -\sum_{t=1}^{T} \log\big(p_\theta(x_t \mid x_{<t})\big) \qquad (1)$$
where $x_{<t}$ denotes the token sequence $(x_1, \ldots, x_{t-1})$ and $p_\theta(x_t \mid x_{<t})$ denotes the conditional
probability of predicting the next token to be $x_t$ when $x_{<t}$ is given to an LM $f$ with parameters $\theta$.
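To make the objective concrete, the following is a minimal PyTorch sketch of a single unlearning update on one target sequence, assuming a Hugging Face causal LM (GPT-Neo 125M is used here only as an example); the optimizer, learning rate, and number of updates are illustrative assumptions rather than the paper's exact training configuration.

```python
# Minimal sketch of one knowledge-unlearning step: gradient *ascent* on the NLL of a
# target sequence, i.e., minimizing the negated LM loss of Eq. (1).
# The model choice, optimizer, and learning rate below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def unlearning_step(target_text: str) -> float:
    """One parameter update that pushes the LM *away* from the target sequence."""
    batch = tokenizer(target_text, return_tensors="pt")
    # With labels=input_ids, outputs.loss is the token-averaged negative log-likelihood
    # -(1/T) * sum_t log p_theta(x_t | x_<t), i.e., Eq. (1) up to the 1/T factor.
    outputs = model(**batch, labels=batch["input_ids"])
    (-outputs.loss).backward()  # negate so that a descent step maximizes the original loss
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

# Example: a few unlearning updates on a fictitious target sequence.
for _ in range(3):
    nll = unlearning_step("Bob's SSN is 123-4567-8910.")
```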
3.2 QUANTIFYING PRIVACY RISKS OF LANGUAGE MODELS
In this subsection, we introduce two metrics we use to quantify the privacy risks given a specific
token sequence and how we empirically define the token sequence to be forgotten. In this work, we
do not utilize metrics such as membership inference attack recall (Shokri et al., 2017) since we are
not interested in quantifying the general privacy risks of LMs, but instead the privacy risks on the
specific target token sequences.
Extraction Likelihood (EL) We first introduce a new metric, EL. Given a sequence of tokens
$\mathbf{x} = (x_1, \ldots, x_T)$ and an LM $f$ with pre-trained parameters $\theta$, we define EL as follows:

$$EL_n(\mathbf{x}) = \frac{\sum_{t=1}^{T-n} \text{OVERLAP}_n\big(f_\theta(x_{<t}),\, x_{\geq t}\big)}{T-n} \qquad (2)$$

$$\text{OVERLAP}_n(a, b) = \frac{\sum_{c \,\in\, n\text{-grams}(a)} \mathbb{1}\{c \in n\text{-grams}(b)\}}{|n\text{-grams}(a)|} \qquad (3)$$

where $n\text{-grams}(\cdot)$ denotes the list of n-grams in the given token sequence and $f_\theta(x_{<t})$ denotes the
output token sequence generated by the LM $f_\theta$ when given $x_{<t}$ as input, which can have a maximum length of $|x_{\geq t}|$
but may be shorter when the EOS (end-of-sequence) token is generated beforehand.
The process of varying the prefix length $|x_{<t}|$ can be seen as varying the strength of adversarial
attacks. This is based on the assumption that the more prior information is provided about the
target token sequence, the easier it is for the LM to extract it. Overall, EL can be seen as
estimating the general extraction likelihood since we are measuring the average success rate of
extraction attacks of varying strength, quantified via the n-gram overlap of generated and target token
sequences. While previous metrics quantifying the privacy risks of LMs are dependent on specific
adversarial attacks, this characteristic of EL allows it to quantify the general likelihood of extraction
without any dependency on specific extraction attacks.
We regard $n$ as a hyperparameter that can be varied depending on the stringency of privacy
standards. The higher $n$ is set, the stricter the standard for a successful extraction attack.
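As a concrete illustration, the Python sketch below computes $EL_n$ from a list of token IDs following Eqs. (2)–(3); the `generate_suffix` callable (greedy decoding with $f_\theta$, capped at the length of the remaining suffix) is a hypothetical helper of ours, not part of the authors' released code.

```python
# Illustrative computation of the Extraction Likelihood EL_n (Eqs. 2-3).
# `generate_suffix(prefix, max_new_tokens)` is assumed to return the token IDs decoded
# by the LM f_theta given the prefix; it is a placeholder, not the authors' implementation.
from typing import Callable, List, Tuple

def ngrams(tokens: List[int], n: int) -> List[Tuple[int, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_n(generated: List[int], reference: List[int], n: int) -> float:
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    ref = set(ngrams(reference, n))
    return sum(g in ref for g in gen) / len(gen)  # Eq. (3)

def extraction_likelihood(x: List[int], n: int,
                          generate_suffix: Callable[[List[int], int], List[int]]) -> float:
    """EL_n(x): average n-gram overlap between generated continuations and the true
    suffix x_{>=t}, as the prefix x_{<t} grows from empty to length T-n-1 (Eq. 2)."""
    T = len(x)
    assert T > n, "sequence must be longer than n"
    total = 0.0
    for t in range(1, T - n + 1):                 # t = 1, ..., T - n as in Eq. (2)
        prefix, suffix = x[:t - 1], x[t - 1:]     # x_{<t} and x_{>=t} (0-indexed slices)
        generated = generate_suffix(prefix, len(suffix))
        total += overlap_n(generated, suffix, n)
    return total / (T - n)
```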
Memorization Accuracy (MA) We define Memorization Accuracy (MA) as follows:
$$MA(\mathbf{x}) = \frac{\sum_{t=1}^{T-1} \mathbb{1}\{\operatorname{argmax}(p_\theta(\cdot \mid x_{<t})) = x_t\}}{T-1} \qquad (4)$$

MA quantifies how much $f_\theta$ has memorized the given token sequences and was proposed by
Tirumala et al. (2022) to analyze the training dynamics of large LMs.
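A corresponding sketch for MA is given below; it performs a single teacher-forced forward pass through a Hugging Face-style causal LM and checks, at each next-token position, whether the greedy prediction matches the ground-truth token. This is a sketch of Eq. (4), not the authors' implementation.

```python
# Illustrative computation of Memorization Accuracy MA (Eq. 4) with teacher forcing:
# the fraction of next-token positions whose greedy prediction equals the true token.
import torch

@torch.no_grad()
def memorization_accuracy(model, input_ids: torch.Tensor) -> float:
    """input_ids: shape (1, T). Returns MA(x) for the sequence x."""
    logits = model(input_ids).logits           # (1, T, vocab_size)
    preds = logits[:, :-1].argmax(dim=-1)      # greedy prediction for each next token
    targets = input_ids[:, 1:]                 # the actual next tokens
    return (preds == targets).float().mean().item()
```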
Empirical Definition of Forgetting By utilizing both $EL_n$ and MA, we empirically define a specific
token sequence $\mathbf{x}$ to be forgotten, and no longer susceptible to extraction attacks, when the
following conditions are met:

$$EL_n(\mathbf{x}) \le \frac{1}{|D'|}\sum_{\mathbf{x}' \in D'} EL_n(\mathbf{x}') \quad \text{and} \quad MA(\mathbf{x}) \le \frac{1}{|D'|}\sum_{\mathbf{x}' \in D'} MA(\mathbf{x}') \qquad (5)$$

where $D'$ represents a validation corpus not seen during training. In other words, we define $\mathbf{x}$ to be
forgotten when $EL_n(\mathbf{x})$ and $MA(\mathbf{x})$ reach values that are lower than the average $EL_n$ and MA
on token sequences that were not seen during training.
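Finally, the forgetting criterion of Eq. (5) can be checked as in the short sketch below; `el_fn` and `ma_fn` stand in for implementations of $EL_n$ and MA (e.g., the sketches above), and the function names are our own.

```python
# Sketch of the empirical forgetting check (Eq. 5): x counts as forgotten once both
# EL_n(x) and MA(x) fall to or below their averages over a held-out validation corpus D'.
from typing import Callable, List, Sequence

def is_forgotten(x: Sequence[int], valid_corpus: List[Sequence[int]], n: int,
                 el_fn: Callable[[Sequence[int], int], float],
                 ma_fn: Callable[[Sequence[int]], float]) -> bool:
    avg_el = sum(el_fn(v, n) for v in valid_corpus) / len(valid_corpus)
    avg_ma = sum(ma_fn(v) for v in valid_corpus) / len(valid_corpus)
    return el_fn(x, n) <= avg_el and ma_fn(x) <= avg_ma
```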