KNOWLEDGE UNLEARNING FOR MITIGATING
PRIVACY RISKS IN LANGUAGE MODELS
Joel Jang1  Dongkeun Yoon3  Sohee Yang1  Sungmin Cha4
Moontae Lee2,5  Lajanugen Logeswaran2  Minjoon Seo1
1KAIST 2LG AI Research 3Konkuk University 4Seoul National University
5University of Illinois Chicago
{joeljang,sohee.yang,minjoon}@kaist.ac.kr, ramses2687@konkuk.ac.kr
sungmin.cha@snu.ac.kr, {moontae.lee,llajan}@lgresearch.ai
ABSTRACT
Pretrained Language Models (LMs) memorize a vast amount of knowledge during
initial pretraining, including information that may violate the privacy of personal
lives and identities. Previous work addressing privacy issues for language models
has mostly focused on data preprocessing and differential privacy methods, both
requiring re-training the underlying LM. We propose knowledge unlearning as an
alternative method to reduce privacy risks for LMs post hoc. We show that simply
performing gradient ascent on target token sequences is effective at forgetting
them with little to no degradation of general language modeling performances
for larger LMs; it sometimes even substantially improves the underlying LM with
just a few iterations. We also find that sequential unlearning is better than trying
to unlearn all the data at once and that unlearning is highly dependent on which
kind of data (domain) is forgotten. By showing comparisons with a previous data
preprocessing method and a decoding method known to mitigate privacy risks for
LMs, we show that unlearning can give a stronger empirical privacy guarantee in
scenarios where the data vulnerable to extraction attacks are known a priori while
being much more efficient and robust. We release the code and dataset needed to
replicate our results at https://github.com/joeljang/knowledge-unlearning.
1 INTRODUCTION
Recent work has shown that an adversary can extract training data from Pretrained Language Mod-
els (LMs) including Personally Identifiable Information (PII) such as names, phone numbers, and
email addresses, and other information such as licensed code, private clinical notes, and 128-bit
UUIDs (Carlini et al., 2021; Lee et al., 2022; Huang et al., 2022; Lehman et al., 2021). In 2021, an
AI chatbot Iruda became the first AI system to be sued for violating the Personal Information Protec-
tion Act after generating the exact home addresses and bank account numbers of actual individuals
unintentionally (Park, 2021). Heikkilä (2022) has also shown that GPT-3 (Brown et al., 2020), one
of the most well-known LMs currently in commercial use, offered detailed private information about
the Editor-in-Chief of MIT Technology Review including his family members, work address, and
phone number. Considering findings that show extracting training data gets easier as LMs scale to
larger sizes (Carlini et al., 2022a) and that it is common practice for practitioners to release billion-parameter
pretrained LMs for public use (Gao et al., 2020; Black et al., 2021; Zhang et al., 2022),
it has become important to provide privacy guarantees for large LMs.
Practitioners are required to delete personal information from LMs upon individuals’ requests because
each individual has the “Right To Be Forgotten (RTBF)” (Mantelero, 2013; Graves et al., 2021)
and can limit the direct and indirect commercial use of their personal information (Villaronga
et al., 2018). Previous methods addressing privacy risks for language models attempt to remove all
private information from the training data (data preprocessing) (Aura et al., 2006; Dernoncourt et al.,
2017; Lison et al., 2021; Kandpal et al., 2022) or attempt to design algorithms that ensure differen-
tial privacy (DP) (Dwork, 2008; Dwork et al., 2006; Abadi et al., 2016; Anil et al., 2021; Li et al.,
*Work done during an internship at LG AI Research.
[Figure 1 shows an individual (“Bob”) whose sensitive personal information (name, age, marital status, SSN, divorce details, net worth) is contained in the pretraining corpora and who practices his Right To Be Forgotten. Data Preprocessing must find and remove the information and re-train the LM after sanitization (~900 A100 GPU days); Differential Privacy must re-train the LM with a DP algorithm (~1800 A100 GPU days); Knowledge Unlearning, our proposed approach, performs only a few token updates (~0.001 A100 GPU days).]
Figure 1: Comparison of previous approaches and knowledge unlearning when an individual practices his/her Right-To-Be-Forgotten (RTBF).
2022; Yu et al., 2022). Both approaches require retraining the underlying LM every time individuals
want to practice their RTBF, which makes them inadequate for large LMs that are extremely costly
to retrain. Furthermore, as pointed out by Brown et al. (2022), data preprocessing methods assume
private information to be easily identified, specified, and removed, while DP algorithms can only
guarantee protection for information that has clear privacy borders, which makes both inadequate
in real-world scenarios where the standard of privacy might differ for each individual.
To this end, we propose knowledge unlearning (Figure 1) as an efficient solution that can be applied
with just a few parameter updates instead of pretraining the underlying LM again. We perform ex-
periments on GPT-Neo LMs (125M, 1.3B, 2.7B) (Black et al., 2021) and show that simply reversing
the direction of gradient descent during language modeling (which can also be seen as
maximizing instead of minimizing the loss function) is effective at protecting target sequences from
extraction attacks with little to no performance degradation on the initial LM capabilities measured
via 9 common NLP classification benchmarks (Hellaswag (Zellers et al., 2019), Lambada (Paperno
et al., 2016), Winogrande (Sakaguchi et al., 2021), COPA (Gordon et al., 2012), ARC-Easy (Clark
et al., 2018), ARC-Challenge (Clark et al., 2018), Piqa (Bisk et al., 2020), MathQA (Amini et al.,
2019), and PubmedQA (Jin et al., 2019)) and 4 dialogue tasks (Wizard of Wikipedia (Dinan et al.,
2019), Empathetic Dialogues (Rashkin et al., 2019), Blended Skill Talk (Smith et al., 2020), and
Wizard of Internet (Komeili et al., 2022)). In some cases, knowledge unlearning unexpectedly
yields significant improvements in LM performance on some of the benchmarks.
We compare our approach with a data deduplication method (Kandpal et al., 2022) and a differential
privacy decoding method (Majmudar et al., 2022), both known to mitigate privacy risks,
and show the effectiveness of knowledge unlearning, which provides strong privacy protection while
being much more efficient and robust. We also provide a general guideline that can be used to
quantify the memorization and extraction likelihood of target token sequences and suggest when we
can empirically consider them to have been “forgotten”. Specifically, we introduce a novel metric
that measures the extraction likelihood by varying the prefix length of the target token sequence and
quantifying how much of the suffix is actually extracted from the LM.
Surprisingly, for knowledge unlearning, we find that it is easier to forget a chunk of instances se-
quentially rather than trying to forget them all at once. We provide further analysis and show that
the difficulty of knowledge unlearning depends heavily on the target data being forgotten, especially
the domain of the target data. We also provide empirical examples of performing extraction attacks
and how exactly knowledge unlearning provides privacy protection for the LM.
To summarize, our main contributions are fourfold:
• We compare knowledge unlearning with two approaches from the literature known to mitigate
privacy risks: a data preprocessing approach and a Differential Privacy (DP) Decoding
approach. We show that our approach results in little to no performance degradation of
general capabilities (sometimes resulting in improvement) while providing strong privacy
protection in situations where individuals practice their RTBF, whereas the data preprocessing
approach provides weaker privacy protection while being orders of magnitude more computationally
demanding, and the DP Decoding approach results in severe degradation of modeling
performance.
• We perform additional experiments to determine which factors contribute to the difficulty
of knowledge unlearning and find that (1) trying to forget many samples at once results in
substantial LM performance degradation which can be mitigated by sequentially forgetting
chunks of data and that (2) the domain of the target data (Code, License, Wikipedia, etc.)
plays a critical role in determining how hard they are to forget.
• We provide a novel metric and a general guideline for quantifying the privacy risks of LMs
and for determining when they should be considered to have “forgotten” a given target sequence.
• Knowledge unlearning surprisingly seems to make LMs stronger, where the extreme cases
bring +8.0% (37.6% → 45.6%), +10.1% (57.4% → 67.5%), and +7.9% (62.2% → 70.1%)
improvements on Lambada for GPT-Neo 125M, 1.3B, and 2.7B, respectively.
2 RELATED WORK
2.1 PRIVACY METHODS FOR LANGUAGE MODELS
Prior work that tries to mitigate privacy risks for LMs can be divided mainly into data pre/post-
processing methods and differential privacy methods.
(Data) Pre/Post-Processing Data preprocessing aims to sanitize the training data: it gets
rid of all data that might violate any kind of privacy prior to training. These
methods mostly utilize measures such as parsers and classification models that try to identify and
predict patterns that constitute private information. This is effective at identifying well-formatted
private information such as social security numbers or special forms of medical notes (Aura et al.,
2006; Dernoncourt et al., 2017; Lison et al., 2021; Kandpal et al., 2022). However, as pointed out by
Brown et al. (2022), considering that private information is mostly context-dependent and sometimes
in a non-specific format, data preprocessing methods cannot fully claim that they provide privacy
guarantees, especially guarantees that match each individual’s standards. Methods that attempt to
utilize post-processing methods such as applying censorship to the LM outputs still face the same
limitations.
In this work, we compare our proposed method with a data preprocessing approach proposed by
Kandpal et al. (2022) which shows that deduplicating the training corpora before pretraining helps
pretrain LMs that show stronger robustness against extraction attacks than an LM pretrained under
the same circumstances without deduplicating the pretraining corpora. However, we highlight that
this approach, which may still be effective at mitigating the overall privacy risks, is not the most
suitable approach when considering a realistic scenario of individuals requesting the removal of
their information from the implicit parameters of the LMs.
Differential Privacy Differential Privacy (DP) aims to guarantee that the effect of an individual
input on the output of a specific function is bounded (Dwork, 2008; Dwork et al., 2006). In the
context of deep neural networks, DP, which needs to be applied during the training phase, aims
to construct models that can provide general guarantees that the individual information within the
training data cannot be inferred (Abadi et al., 2016). While DP has been shown to be surprisingly
effective for fine-tuning LMs (Li et al., 2022; Yu et al., 2022), pretraining LMs with DP still suffers
from a substantial performance gap, expensive computation, and slow convergence (Anil et al., 2021).
Furthermore, as pointed out by Brown et al. (2022), DP can only provide limited guarantees for LMs
because DP requires a unified definition for privacy boundaries, which is inherently impossible for
natural language data. Most importantly, in a realistic scenario where individuals may practice their
Right-To-Be-Forgotten (RTBF) dynamically after model deployment, it is nontrivial to apply
existing descent-based DP algorithms such as DP-SGD to provide protection against only the
targeted extraction attacks.
2.2 MACHINE UNLEARNING
Machine unlearning has received attention as an alternative approach to overcome data privacy issues
in machine learning (Cao & Yang, 2015; Ginart et al., 2019; Bourtoule et al., 2021; Graves et al.,
2021). Several studies attempt to explore machine unlearning for deep neural networks (Golatkar
et al., 2020; Mehta et al., 2022). However, they mostly focus on proposing algorithms for image
classification models where they aim to forget a whole class; that is, achieve random performance
for specific image classes such as “cats” or “ships”. We are the first, to the best of our knowledge,
to explore unlearning a specific sequence of tokens for LMs, which is quite a different setup from
traditional image classification models (tens of image classes vs. a sequence of tokens that can
each be classified into one of roughly 50,000 vocabulary classes). In this work, we coin this approach as knowledge unlearning
since we are more focused on forgetting specific knowledge represented by sequences of tokens.
Zhou et al. (2022) focus on how forgetting can be leveraged to improve the performance of the un-
derlying model. They propose “forget-and-relearn” that unifies existing iterative training algorithms
by selectively removing undesirable information and re-learning good features, helping boost per-
formance for the tasks of image classification and multi-agent emergent communication. The un-
derlying assumption is that it is often easier to define and stop unwanted behavior than to teach good
behavior. We also observe this phenomenon in Section 4, where we unintentionally find that unlearning
just a few sequences of tokens sometimes boosts general LM capabilities.
2.3 MEMORIZATION IN LANGUAGE MODELS
Previous work that explores the extent to which LMs have memorized their training data approaches the
phenomenon from two different viewpoints. One line of work views memorization by LMs simply as a
threat to individual privacy (Carlini et al., 2021; 2022a; Jagielski et al., 2022) and utilizes metrics
that quantify how much the LMs are susceptible to adversarial attacks. These metrics are mostly
dependent on the specific types of attacks such as the membership inference attack (Shokri et al.,
2017) and measure the privacy risks of LMs by quantifying the success rate of these attacks. In our
work, we instead focus on more targeted extraction attacks.
Another line of work simply quantifies how much knowledge is accumulated and forgotten during
pretraining by extracting relational knowledge about the world (Petroni et al., 2019; Lazaridou et al.,
2021; Jang et al., 2022b;a). This line of work does not view memorization as a negative trait, but as
a positive one that can be leveraged to extract world knowledge from its implicit parameters and per-
form knowledge-intensive tasks such as question answering or training knowledgeable conversation
agents.
Our work is highly related to Jagielski et al. (2022)’s work where they also assert that forgetting
can be a relaxed version of differential privacy. However, there are two main differences between
our work and theirs. First, they only analyze forgetting as a passive form of mitigating privacy,
asserting that data seen early in large-scale training obtain privacy benefits, whereas we suggest a
more active form of forgetting. Second, they only show analysis results with image classification
and audio generation models while we specifically focus on large LMs.
3 KNOWLEDGE UNLEARNING FOR LANGUAGE MODELS
3.1 METHODOLOGY
We propose simply negating the original training objective of minimizing the negative log-likelihood
of the token sequences as our main method of knowledge unlearning in LMs. Specifically, given
a sequence of tokens $\mathbf{x} = (x_1, \ldots, x_T)$, our unlearning training objective is simply maximizing the
following loss function:
$$\mathcal{L}_{UL}(f_\theta, \mathbf{x}) = -\sum_{t=1}^{T} \log\big(p_\theta(x_t \mid x_{<t})\big) \qquad (1)$$
where $x_{<t}$ denotes the token sequence $(x_1, \ldots, x_{t-1})$ and $p_\theta(x_t \mid x_{<t})$ denotes the conditional
probability of predicting the next token to be $x_t$ when $x_{<t}$ is given to an LM $f$ with parameters $\theta$.
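To make the objective concrete, the following is a minimal PyTorch sketch of a single unlearning update on one target sequence, assuming a Hugging Face causal LM (GPT-Neo 125M is used here only as an example); the optimizer, learning rate, and number of updates are illustrative assumptions rather than the paper's exact training configuration.

```python
# Minimal sketch of one knowledge-unlearning step: gradient *ascent* on the NLL of a
# target sequence, i.e., minimizing the negated LM loss of Eq. (1).
# The model choice, optimizer, and learning rate below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def unlearning_step(target_text: str) -> float:
    """One parameter update that pushes the LM *away* from the target sequence."""
    batch = tokenizer(target_text, return_tensors="pt")
    # With labels=input_ids, outputs.loss is the token-averaged negative log-likelihood
    # -(1/T) * sum_t log p_theta(x_t | x_<t), i.e., Eq. (1) up to the 1/T factor.
    outputs = model(**batch, labels=batch["input_ids"])
    (-outputs.loss).backward()  # negate so that a descent step maximizes the original loss
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

# Example: a few unlearning updates on a fictitious target sequence.
for _ in range(3):
    nll = unlearning_step("Bob's SSN is 123-4567-8910.")
```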
3.2 QUANTIFYING PRIVACY RISKS OF LANGUAGE MODELS
In this subsection, we introduce two metrics we use to quantify the privacy risks given a specific
token sequence and how we empirically define the token sequence to be forgotten. In this work, we
do not utilize metrics such as membership inference attack recall (Shokri et al., 2017) since we are
not interested in quantifying the general privacy risks of LMs, but instead the privacy risks on the
specific target token sequences.
Extraction Likelihood (EL) We first introduce a new metric, EL. Given a sequence of tokens
$\mathbf{x} = (x_1, \ldots, x_T)$ and an LM $f$ with pre-trained parameters $\theta$, we define EL as follows:

$$EL_n(\mathbf{x}) = \frac{\sum_{t=1}^{T-n} \text{OVERLAP}_n\big(f_\theta(x_{<t}),\, x_{\geq t}\big)}{T-n} \qquad (2)$$

$$\text{OVERLAP}_n(a, b) = \frac{\sum_{c \,\in\, n\text{-grams}(a)} \mathbb{1}\{c \in n\text{-grams}(b)\}}{|n\text{-grams}(a)|} \qquad (3)$$

where $n\text{-grams}(\cdot)$ denotes the list of n-grams in the given token sequence and $f_\theta(x_{<t})$ denotes the
output token sequence generated by the LM $f_\theta$ when given $x_{<t}$ as input, which can have a maximum length of $|x_{\geq t}|$
but may be shorter when the EOS (end-of-sequence) token is generated beforehand.
The process of varying the prefix length $|x_{<t}|$ can be seen as varying the strength of adversarial
attacks. This is based on the assumption that the more prior information is provided about the
target token sequence, the easier it is for the LM to extract it. Overall, EL can be seen as
estimating the general extraction likelihood since we are measuring the average success rate of
extraction attacks of varying strength, quantified via the n-gram overlap of generated and target token
sequences. While previous metrics quantifying the privacy risks of LMs are dependent on specific
adversarial attacks, this characteristic of EL allows it to quantify the general likelihood of extraction
without any dependency on specific extraction attacks.
We regard $n$ as a hyperparameter that can be varied depending on the stringency of privacy
standards. The higher $n$ is set, the stricter the standard for a successful extraction attack.
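As a concrete illustration, the Python sketch below computes $EL_n$ from a list of token IDs following Eqs. (2)–(3); the `generate_suffix` callable (greedy decoding with $f_\theta$, capped at the length of the remaining suffix) is a hypothetical helper of ours, not part of the authors' released code.

```python
# Illustrative computation of the Extraction Likelihood EL_n (Eqs. 2-3).
# `generate_suffix(prefix, max_new_tokens)` is assumed to return the token IDs decoded
# by the LM f_theta given the prefix; it is a placeholder, not the authors' implementation.
from typing import Callable, List, Tuple

def ngrams(tokens: List[int], n: int) -> List[Tuple[int, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_n(generated: List[int], reference: List[int], n: int) -> float:
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    ref = set(ngrams(reference, n))
    return sum(g in ref for g in gen) / len(gen)  # Eq. (3)

def extraction_likelihood(x: List[int], n: int,
                          generate_suffix: Callable[[List[int], int], List[int]]) -> float:
    """EL_n(x): average n-gram overlap between generated continuations and the true
    suffix x_{>=t}, as the prefix x_{<t} grows from empty to length T-n-1 (Eq. 2)."""
    T = len(x)
    assert T > n, "sequence must be longer than n"
    total = 0.0
    for t in range(1, T - n + 1):                 # t = 1, ..., T - n as in Eq. (2)
        prefix, suffix = x[:t - 1], x[t - 1:]     # x_{<t} and x_{>=t} (0-indexed slices)
        generated = generate_suffix(prefix, len(suffix))
        total += overlap_n(generated, suffix, n)
    return total / (T - n)
```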
Memorization Accuracy (MA) We define Memorization Accuracy (MA) as follows:
$$MA(\mathbf{x}) = \frac{\sum_{t=1}^{T-1} \mathbb{1}\{\operatorname{argmax}(p_\theta(\cdot \mid x_{<t})) = x_t\}}{T-1} \qquad (4)$$

MA quantifies how much $f_\theta$ has memorized the given token sequences and was proposed by
Tirumala et al. (2022) to analyze the training dynamics of large LMs.
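A corresponding sketch for MA is given below; it performs a single teacher-forced forward pass through a Hugging Face-style causal LM and checks, at each next-token position, whether the greedy prediction matches the ground-truth token. This is a sketch of Eq. (4), not the authors' implementation.

```python
# Illustrative computation of Memorization Accuracy MA (Eq. 4) with teacher forcing:
# the fraction of next-token positions whose greedy prediction equals the true token.
import torch

@torch.no_grad()
def memorization_accuracy(model, input_ids: torch.Tensor) -> float:
    """input_ids: shape (1, T). Returns MA(x) for the sequence x."""
    logits = model(input_ids).logits           # (1, T, vocab_size)
    preds = logits[:, :-1].argmax(dim=-1)      # greedy prediction for each next token
    targets = input_ids[:, 1:]                 # the actual next tokens
    return (preds == targets).float().mean().item()
```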
Empirical Definition of Forgetting By utilizing both $EL_n$ and MA, we empirically define a specific
token sequence $\mathbf{x}$ to be forgotten, and no longer susceptible to extraction attacks, when the
following conditions are met:

$$EL_n(\mathbf{x}) \le \frac{1}{|D'|}\sum_{\mathbf{x}' \in D'} EL_n(\mathbf{x}') \quad \text{and} \quad MA(\mathbf{x}) \le \frac{1}{|D'|}\sum_{\mathbf{x}' \in D'} MA(\mathbf{x}') \qquad (5)$$

where $D'$ represents a validation corpus not seen during training. In other words, we define $\mathbf{x}$ to be
forgotten when $EL_n(\mathbf{x})$ and $MA(\mathbf{x})$ reach values that are lower than the average $EL_n$ and MA
on token sequences that were not seen during training.
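Finally, the forgetting criterion of Eq. (5) can be checked as in the short sketch below; `el_fn` and `ma_fn` stand in for implementations of $EL_n$ and MA (e.g., the sketches above), and the function names are our own.

```python
# Sketch of the empirical forgetting check (Eq. 5): x counts as forgotten once both
# EL_n(x) and MA(x) fall to or below their averages over a held-out validation corpus D'.
from typing import Callable, List, Sequence

def is_forgotten(x: Sequence[int], valid_corpus: List[Sequence[int]], n: int,
                 el_fn: Callable[[Sequence[int], int], float],
                 ma_fn: Callable[[Sequence[int]], float]) -> bool:
    avg_el = sum(el_fn(v, n) for v in valid_corpus) / len(valid_corpus)
    avg_ma = sum(ma_fn(v) for v in valid_corpus) / len(valid_corpus)
    return el_fn(x, n) <= avg_el and ma_fn(x) <= avg_ma
```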