Unsupervised Text Deidentification
John X. Morris Justin T. Chiu Ramin Zabih Alexander M. Rush
Cornell University
{jxm3}@cornell.edu
Abstract
Deidentification seeks to anonymize textual data prior to distribution. Automatic deidentification primarily uses supervised named entity recognition from human-labeled data points. We propose an unsupervised deidentification method that masks words that leak personally-identifying information. The approach utilizes a specially trained reidentification model to identify individuals from redacted personal documents. Motivated by K-anonymity based privacy, we generate redactions that ensure a minimum reidentification rank for the correct profile of the document. To evaluate this approach, we consider the task of deidentifying Wikipedia Biographies, and evaluate using an adversarial reidentification metric. Compared to a set of unsupervised baselines, our approach deidentifies documents more completely while removing fewer words. Qualitatively, we see that the approach eliminates many identifying aspects that would fall outside of the common named entity based approach.¹

¹ Our code and deidentified datasets are available on GitHub.
1 Introduction
In domains such as law, medicine, and government, it can be difficult to release textual data because it contains sensitive personal information (Johnson et al., 2016; Jana and Biemann, 2021; Pilán et al., 2022). Privacy laws and regulations vary by domain and impact the requirements for deidentification. Most prior work on automatic deidentification (Neamatullah et al., 2008; Meystre et al., 2010; Sánchez et al., 2014; Liu et al., 2017; Norgeot et al., 2020; Sberbank and Emelyanov, 2021) deidentifies data to the requirements of the HIPAA Safe Harbor method (Centers for Medicare & Medicaid Services, 1996). Annotations for these systems are based on a list of 18 identifiers like age, phone number, and zip code. These systems treat deidentification as a named entity recognition problem within this space. Upon the removal of these pre-defined entities, the text is no longer considered sensitive.
However, one of the 18 categories defined by HIPAA Safe Harbor includes “any unique identifying number, characteristic, or code [that could be used to reidentify an individual]”. Prior work ignores this nebulous 18th category. One reason the category is ill-defined is the existence of quasi-identifiers: pieces of personally identifiable information (PII) that do not fall under any single category and therefore are difficult to identify and label in the general case (Phillips and Knoppers, 2016). Even data that has all of the categories from Safe Harbor removed may still be reidentified through quasi-identifiers (Angiuli et al., 2015). Supervised approaches cannot naturally detect quasi-identifiers, since these words are not inherently labeled as PII (Uzuner et al., 2007).
In this work, we propose an unsupervised deidentification method that targets this more general definition of PII. Instead of relying on specific rule lists of named entities, we directly remove words that could lead to reidentification. Motivated by the goal of K-anonymity (Lison et al., 2021), our approach utilizes a learned probabilistic reidentification model to predict the true identity of a given text. We perform combinatorial inference in this model to find a set of words that, when masked, achieves K-anonymity. The system does not require any annotations of specific PII, but instead learns from a dataset of aligned descriptive text and profile information. Using this information, we can train an identification process using a dense encoder model.
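To make the setup concrete, the following is a minimal sketch of how such a reidentification model might be trained as a dense bi-encoder with an in-batch contrastive objective. The base model, temperature, and pooling choices here are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal bi-encoder reidentification sketch (illustrative assumptions,
# not our exact architecture or hyperparameters).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
doc_encoder = AutoModel.from_pretrained("roberta-base")
profile_encoder = AutoModel.from_pretrained("roberta-base")

def encode(encoder, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Use the first (<s>) token embedding as a dense representation.
    return encoder(**batch).last_hidden_state[:, 0]

def contrastive_loss(documents, profiles):
    """In-batch contrastive loss: each document should score highest
    against its own aligned profile."""
    doc_emb = F.normalize(encode(doc_encoder, documents), dim=-1)
    prof_emb = F.normalize(encode(profile_encoder, profiles), dim=-1)
    logits = doc_emb @ prof_emb.T / 0.05   # temperature-scaled similarities
    labels = torch.arange(len(documents))  # i-th document matches i-th profile
    return F.cross_entropy(logits, labels)
```

At inference time, the same similarity scores can be computed between a (redacted) document and every candidate profile, yielding the distribution over profiles used for reidentification.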
Experiments test the ability of the system to deidentify documents from a large-scale database. We use a dataset of Wikipedia Biographies aligned with info-boxes (Lebret et al., 2016). The system is fit on a subset of the data and then asked to deidentify unseen individuals. Results show that even when all words from the profile are masked, the system is able to reidentify 32% of individuals. When we use our system to deidentify documents, it is able to fully anonymize them while retaining over 50% of words. When we compare our deidentification method to a set of unsupervised baselines, our method deidentifies documents more completely while removing fewer words. We qualitatively and quantitatively analyze the redactions produced by our system, including examples of successfully redacted quasi-identifiers.

Figure 1: Method overview. A document (x, top-left) paired with a profile (ŷ, top-right) is given to the system. A trained neural reidentification model (p(y | x, z), blue circle) produces a distribution over all possible profiles based on densely encoded representations. At each stage of inference, masks are added to the source document, changing the relative rank assigned by the reidentification model. The method is run until k-anonymity under the reidentification model is achieved. Note that in this example, it is not necessary to remove all information, such as the month and day of birth, since the player is already deidentified.
2 Related Work
Automated deidentification. There is much prior work on deidentifying text datasets, both with rule-based systems (Neamatullah et al., 2008; Meystre et al., 2010; Sánchez et al., 2014; Norgeot et al., 2020; Sberbank and Emelyanov, 2021) and deep learning methods (Liu et al., 2017; Yue and Zhou, 2020; Johnson et al., 2020). Each of these methods is supervised, relies on datasets with human-labeled PII, and focuses on removing some subset of the 18 identifying categories from HIPAA Safe Harbor. Other approaches include generating entire new fake datasets using Generative Adversarial Networks (GANs) (Chin-Cheong et al., 2019). Friedrich et al. (2019) train an LSTM on an EMR-based NLP task using an adversarial loss to prevent the model from learning to reconstruct the input. Finally, differential privacy is a technique for ensuring provably private distributions (Dwork et al., 2006). It has mostly been used for training anonymized models on data containing PII, but requires access to the un-anonymized datasets for training (Li et al., 2021). Our deidentification approach does not provide the formal guarantees of differential privacy, but aims to provide a practical solution for anonymizing datasets in real-world scenarios.
Deidentification by reidentification. The NeurIPS 2020 Hide-and-Seek Privacy Challenge benchmarked both deidentification and reidentification techniques for clinical time series data (Jordon et al., 2021). In computer vision, researchers have proposed learning to mask faces in images to preserve the privacy of individuals using reidentification (Hukkelås et al., 2019; Maximov et al., 2020; Gupta et al., 2021). In NLP, some work has been done on evaluating the reidentification risk of deidentified text (Scaiano et al., 2016). El Emam et al. (2009) propose a method for deidentification of tabular datasets based on the concept of K-anonymity. Gardner and Xiong (2009) deidentify unstructured text by performing named entity extraction and redacting entities until k-anonymity is reached. Mansour et al. (2021) propose an algorithm for deidentification of tabular datasets by quantifying reidentification risk using a metric related to K-anonymity. In our work, we train a reidentification model in an adversarial setting and use the model to deidentify documents directly.
Learning in the presence of masks. Various works have shown how to improve NLP models by masking some of the input during training. Chen and Ji (2020) show that learning in the presence of masks can improve classifier interpretability and accuracy. Li et al. (2016) train a model to search for the minimum subset of words that, when removed, changes the output of a classifier; they apply their method to neural network interpretability using reinforcement learning. Liao et al. (2020) pre-train a BERT-style language model to do masked-word prediction by sampling a masking ratio from U(0, 1) and masking that many words. While their method was originally proposed for text generation, we apply the same masking approach to train language models for redaction.
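The masking scheme itself is simple. A minimal sketch, operating at the word level for clarity (subword handling and the choice of mask token are simplifications):

```python
import random

MASK = "<mask>"

def random_redaction_mask(words):
    """Mask a uniformly sampled fraction of words, following the
    U(0, 1) masking-ratio scheme described above."""
    ratio = random.random()                  # masking ratio ~ U(0, 1)
    n_mask = int(ratio * len(words))
    idx = set(random.sample(range(len(words)), n_mask))
    return [MASK if i in idx else w for i, w in enumerate(words)]
```

Training on inputs masked this way exposes the model to every redaction level, from nearly intact documents to nearly fully masked ones.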
3 Motivating Experiment: Quasi-Identifiers
In order to study the problem of deidentifying personal information from documents, we set up a model dataset utilizing personal profiles from Wikipedia. We use the WikiBio dataset (Lebret et al., 2016). Each entry in the dataset contains a document, the introductory text of the Wikipedia article, and a profile, the infobox of key-value pairs containing personal information. We train on the training dataset of 582,659 documents and profiles. At test time, we evaluate only test documents, but consider all 728,321 profiles from the concatenation of the train, validation, and test sets. This dataset represents a natural baseline by providing a range of factual profile information for a large collection of individuals, making it challenging to deidentify. In addition, it provides an openly available collection for comparing models.
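For reference, the splits can be assembled as follows. This sketch assumes the Hugging Face datasets copy of WikiBio ("wiki_bio"); the exact data-loading path is an assumption, not part of our released code.

```python
from datasets import load_dataset

# WikiBio: each entry pairs introductory article text with an infobox.
wikibio = load_dataset("wiki_bio")
train_pairs = wikibio["train"]  # 582,659 document/profile pairs

# At test time, documents come from the test split, but the candidate
# profile set is the concatenation of all three splits (728,321 profiles).
all_profiles = (
    list(wikibio["train"]) + list(wikibio["validation"]) + list(wikibio["test"])
)
```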
DeID method:   None   Named entity   Lexical
Words masked:  (0%)   (24%)          (28%)
IR ReID        74.9   4.3            0.0
NN ReID        99.6   79.7           31.9

Table 1: Percentage of documents reidentified (ReID) for different deidentification methods, with the percentage of words masked in parentheses.
Is it difficult to deidentify individuals in this dataset? Wikipedia presents no domain challenges, and so finding entities is trivial. In addition, many of the terms in the documents overlap directly with the terms in the profile table. Simple techniques should therefore provide robust deidentification.
We test this with two deidentification techniques: (1) Named entity removes all words in documents that are tagged as named entities. (2) Lexical removes all words in the document that also overlap with the profile. To reidentify, we use an information retrieval model (BM25) and a dense neural network approach (described in Section 5).
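A rough sketch of the two baselines and the BM25 reidentifier is below; it assumes the rank_bm25 package and a spaCy English model for entity tagging, which are illustrative tooling choices rather than the exact implementations we used.

```python
import spacy
from rank_bm25 import BM25Okapi

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded

def named_entity_redact(text, mask="<mask>"):
    """Baseline (1): mask every token inside a named entity span."""
    doc = nlp(text)
    ent_tokens = {tok.i for ent in doc.ents for tok in ent}
    return " ".join(mask if tok.i in ent_tokens else tok.text for tok in doc)

def lexical_redact(text, profile_text, mask="<mask>"):
    """Baseline (2): mask document words that also appear in the profile."""
    profile_vocab = {w.lower() for w in profile_text.split()}
    return " ".join(
        mask if w.lower() in profile_vocab else w for w in text.split()
    )

def bm25_reidentify(redacted_doc, profile_texts):
    """Rank candidate profiles against a redacted document with BM25
    and return the index of the top-scoring profile."""
    bm25 = BM25Okapi([p.lower().split() for p in profile_texts])
    scores = bm25.get_scores(redacted_doc.lower().split())
    return max(range(len(profile_texts)), key=scores.__getitem__)
```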
Table 1 shows the results. While IR-based ReID is able to reidentify most of the original documents, without named entities or lexical matches, documents appear to be no longer reidentifiable. However, our model is able to reidentify 80% of documents, even with all entities removed. With all lexical matches with the profile removed (32% of total words), NN ReID is still able to reidentify a non-trivial number of documents.
This experiment indicates that even in the WikiBio domain, there are a significant number of quasi-identifiers that allow the system to identify documents even when almost all known matching information is removed. In this work we study methods for discovering and quantifying these identifiers.
4 Deidentification by Inference

An overview of our data and system is shown in Figure 1. Given a document x_1 … x_N, we consider the problem of uniquely identifying the corresponding person y from a set of possible options Y. The system works in the presence of redactions defined by a latent binary mask z_1 … z_N on each position, where setting z_n = 1 masks word x_n.

We define a reidentification model as a model of p(y | x, z) that gives a probability to each profile in Y.
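Although the full inference procedure is developed in the rest of this section, the core search can be sketched as a greedy loop over the mask z: repeatedly mask whichever word most lowers the true profile's rank until the profile falls outside the top K. This is a simplified sketch of the combinatorial inference, not our exact algorithm, and reid_rank is a hypothetical helper returning the rank of the true profile under p(y | x, z).

```python
def greedy_deidentify(words, reid_rank, k):
    """Greedily add masks until the true profile's rank under the
    reidentification model is at least k (a K-anonymity-style stopping
    rule). reid_rank(words, mask) is a hypothetical scorer returning
    the rank of the true profile under p(y | x, z)."""
    mask = [0] * len(words)            # z_n = 1 means word n is masked
    while reid_rank(words, mask) < k:
        best_n, best_rank = None, -1
        for n in range(len(words)):
            if mask[n]:
                continue
            mask[n] = 1                # tentatively mask word n
            rank = reid_rank(words, mask)
            mask[n] = 0
            if rank > best_rank:
                best_n, best_rank = n, rank
        if best_n is None:             # every word is already masked
            break
        mask[best_n] = 1               # commit the single best mask
    return mask
```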