Unsupervised Text Deidentification
John X. Morris Justin T. Chiu Ramin Zabih Alexander M. Rush
Cornell University
{jxm3}@cornell.edu
Abstract
Deidentification seeks to anonymize textual data prior to distribution. Automatic deidentification primarily uses supervised named entity recognition from human-labeled data points. We propose an unsupervised deidentification method that masks words that leak personally-identifying information. The approach utilizes a specially trained reidentification model to identify individuals from redacted personal documents. Motivated by K-anonymity based privacy, we generate redactions that ensure a minimum reidentification rank for the correct profile of the document. To evaluate this approach, we consider the task of deidentifying Wikipedia Biographies, and evaluate using an adversarial reidentification metric. Compared to a set of unsupervised baselines, our approach deidentifies documents more completely while removing fewer words. Qualitatively, we see that the approach eliminates many identifying aspects that would fall outside of the common named entity based approach.¹

¹ Our code and deidentified datasets are available on GitHub.
1 Introduction
In domains such as law, medicine, and government, it can be difficult to release textual data because it contains sensitive personal information (Johnson et al., 2016; Jana and Biemann, 2021; Pilán et al., 2022). Privacy laws and regulations vary by domain and impact the requirements for deidentification. Most prior work on automatic deidentification (Neamatullah et al., 2008; Meystre et al., 2010; Sánchez et al., 2014; Liu et al., 2017; Norgeot et al., 2020; Sberbank and Emelyanov, 2021) deidentifies data to the requirements of the HIPAA Safe Harbor method (Centers for Medicare & Medicaid Services, 1996). Annotations for these systems are based on a list of 18 identifiers like age, phone number, and zip code. These systems treat deidentification as a named entity recognition problem within this space. Upon the removal of these pre-defined entities, the text is no longer considered sensitive.
However, one of the 18 categories defined by HIPAA Safe Harbor includes “any unique identifying number, characteristic, or code [that could be used to reidentify an individual]”. Prior work ignores this nebulous 18th category. One reason the category is ill-defined is the existence of quasi-identifiers: pieces of personally identifiable information (PII) that do not fall under any single category and therefore are difficult to identify and label in the general case (Phillips and Knoppers, 2016). Even data that has all of the categories from Safe Harbor removed may still be reidentified through quasi-identifiers (Angiuli et al., 2015). Supervised approaches cannot naturally detect quasi-identifiers, since these words are not inherently labeled as PII (Uzuner et al., 2007).
In this work, we propose an unsupervised deidentification method that targets this more general definition of PII. Instead of relying on specific rule lists of named entities, we directly remove words that could lead to reidentification. Motivated by the goal of K-anonymity (Lison et al., 2021), our approach utilizes a learned probabilistic reidentification model to predict the true identity of a given text. We perform combinatorial inference in this model to find a set of words that, when masked, achieves K-anonymity. The system does not require any annotations of specific PII, but instead learns from a dataset of aligned descriptive text and profile information. Using this information, we can train an identification process using a dense encoder model.
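To make the setup concrete, the following is a minimal sketch of how such a reidentification model might be trained as a dense bi-encoder with an in-batch contrastive objective. The base model, temperature, and pooling choices here are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal bi-encoder reidentification sketch (illustrative assumptions,
# not our exact architecture or hyperparameters).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
doc_encoder = AutoModel.from_pretrained("roberta-base")
profile_encoder = AutoModel.from_pretrained("roberta-base")

def encode(encoder, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Use the first (<s>) token embedding as a dense representation.
    return encoder(**batch).last_hidden_state[:, 0]

def contrastive_loss(documents, profiles):
    """In-batch contrastive loss: each document should score highest
    against its own aligned profile."""
    doc_emb = F.normalize(encode(doc_encoder, documents), dim=-1)
    prof_emb = F.normalize(encode(profile_encoder, profiles), dim=-1)
    logits = doc_emb @ prof_emb.T / 0.05   # temperature-scaled similarities
    labels = torch.arange(len(documents))  # i-th document matches i-th profile
    return F.cross_entropy(logits, labels)
```

At inference time, the same similarity scores can be computed between a (redacted) document and every candidate profile, yielding the distribution over profiles used for reidentification.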
Experiments test the ability of the system to deidentify documents from a large-scale database. We use a dataset of Wikipedia Biographies aligned with info-boxes (Lebret et al., 2016). The system is fit on a subset of the data and then asked to deidentify unseen individuals. Results show that even when all words from the profile are masked, the system is able to reidentify 32% of individuals. When we use our system to deidentify documents, it is able to fully anonymize them while retaining over 50% of words. When we compare our deidentification method to a set of unsupervised baselines, our method deidentifies documents more completely while removing fewer words. We qualitatively and quantitatively analyze the redactions produced by our system, including examples of successfully redacted quasi-identifiers.

Figure 1: Method overview. A document (x, top-left) paired with a profile (ŷ, top-right) is given to the system. A trained neural reidentification model (p(y | x, z), blue circle) produces a distribution over all possible profiles based on densely encoded representations. At each stage of inference, masks are added to the source document, changing the relative rank assigned by the reidentification model. The method is run until k-anonymity under the reidentification model is achieved. Note that in this example, it is not necessary to remove all information, such as the month and day of birth, since the player is already deidentified.
2 Related Work
Automated deidentification. There is much prior work on deidentifying text datasets, both with rule-based systems (Neamatullah et al., 2008; Meystre et al., 2010; Sánchez et al., 2014; Norgeot et al., 2020; Sberbank and Emelyanov, 2021) and deep learning methods (Liu et al., 2017; Yue and Zhou, 2020; Johnson et al., 2020). Each of these methods is supervised, relies on datasets with human-labeled PII, and focuses on removing some subset of the 18 identifying categories from HIPAA Safe Harbor. Other approaches include generating entire new fake datasets using Generative Adversarial Networks (GANs) (Chin-Cheong et al., 2019). Friedrich et al. (2019) train an LSTM on an EMR-based NLP task using an adversarial loss to prevent the model from learning to reconstruct the input. Finally, differential privacy is a technique for ensuring provably private distributions (Dwork et al., 2006). It has mostly been used for training anonymized models on data containing PII, but requires access to the un-anonymized datasets for training (Li et al., 2021). Our deidentification approach does not provide the formal guarantees of differential privacy, but aims to provide a practical solution for anonymizing datasets in real-world scenarios.
Deidentification by reidentification. The NeurIPS 2020 Hide-and-Seek Privacy Challenge benchmarked both deidentification and reidentification techniques for clinical time series data (Jordon et al., 2021). In computer vision, researchers have proposed learning to mask faces in images to preserve the privacy of individuals using reidentification (Hukkelås et al., 2019; Maximov et al., 2020; Gupta et al., 2021). In NLP, some work has been done on evaluating the reidentification risk of deidentified text (Scaiano et al., 2016). El Emam et al. (2009) propose a method for deidentification of tabular datasets based on the concept of K-anonymity. Gardner and Xiong (2009) deidentify unstructured text by performing named entity extraction and redacting entities until k-anonymity is reached. Mansour et al. (2021) propose an algorithm for deidentification of tabular datasets by quantifying reidentification risk using a metric related to K-anonymity. In our work, we train a reidentification model in an adversarial setting and use the model to deidentify documents directly.
Learning in the presence of masks. Various works have shown how to improve NLP models by masking some of the input during training. Chen and Ji (2020) show that learning in the presence of masks can improve classifier interpretability and accuracy. Li et al. (2016) train a model to search for the minimum subset of words that, when removed, changes the output of a classifier; they apply their method to neural network interpretability using reinforcement learning. Liao et al. (2020) pre-train a BERT-style language model to do masked-word prediction by sampling a masking ratio from U(0, 1) and masking that many words. While their method was originally proposed for text generation, we apply the same masking approach to train language models for redaction.
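The masking scheme itself is simple. A minimal sketch, operating at the word level for clarity (subword handling and the choice of mask token are simplifications):

```python
import random

MASK = "<mask>"

def random_redaction_mask(words):
    """Mask a uniformly sampled fraction of words, following the
    U(0, 1) masking-ratio scheme described above."""
    ratio = random.random()                  # masking ratio ~ U(0, 1)
    n_mask = int(ratio * len(words))
    idx = set(random.sample(range(len(words)), n_mask))
    return [MASK if i in idx else w for i, w in enumerate(words)]
```

Training on inputs masked this way exposes the model to every redaction level, from nearly intact documents to nearly fully masked ones.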
3 Motivating Experiment: Quasi-Identifiers
In order to study the problem of deidentifying personal information from documents, we set up a model dataset utilizing personal profiles from Wikipedia. We use the WikiBio dataset (Lebret et al., 2016). Each entry in the dataset contains a document, the introductory text of the Wikipedia article, and a profile, the infobox of key-value pairs containing personal information. We train on the training dataset of 582,659 documents and profiles. At test time, we evaluate only test documents, but consider all 728,321 profiles from the concatenation of the train, validation, and test sets. This dataset represents a natural baseline by providing a range of factual profile information for a large collection of individuals, making it challenging to deidentify. In addition, it provides an openly available collection for comparing models.
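For reference, the splits can be assembled as follows. This sketch assumes the Hugging Face datasets copy of WikiBio ("wiki_bio"); the exact data-loading path is an assumption, not part of our released code.

```python
from datasets import load_dataset

# WikiBio: each entry pairs introductory article text with an infobox.
wikibio = load_dataset("wiki_bio")
train_pairs = wikibio["train"]  # 582,659 document/profile pairs

# At test time, documents come from the test split, but the candidate
# profile set is the concatenation of all three splits (728,321 profiles).
all_profiles = (
    list(wikibio["train"]) + list(wikibio["validation"]) + list(wikibio["test"])
)
```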
DeID method:   None   Named entity   Lexical
Words masked:  (0%)   (24%)          (28%)
IR ReID        74.9   4.3            0.0
NN ReID        99.6   79.7           31.9

Table 1: Percentage of documents reidentified (ReID) for different deidentification methods, with the percentage of words masked in parentheses.
Is it difficult to deidentify individuals in this dataset? Wikipedia presents no domain challenges, and so finding entities is trivial. In addition, many of the terms in the documents overlap directly with the terms in the profile table. Simple techniques should therefore provide robust deidentification.
We test this with two deidentification techniques: (1) Named entity removes all words in documents that are tagged as named entities. (2) Lexical removes all words in the document that also overlap with the profile. To reidentify, we use an information retrieval model (BM25) and a dense neural network approach (described in Section 5).
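A rough sketch of the two baselines and the BM25 reidentifier is below; it assumes the rank_bm25 package and a spaCy English model for entity tagging, which are illustrative tooling choices rather than the exact implementations we used.

```python
import spacy
from rank_bm25 import BM25Okapi

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded

def named_entity_redact(text, mask="<mask>"):
    """Baseline (1): mask every token inside a named entity span."""
    doc = nlp(text)
    ent_tokens = {tok.i for ent in doc.ents for tok in ent}
    return " ".join(mask if tok.i in ent_tokens else tok.text for tok in doc)

def lexical_redact(text, profile_text, mask="<mask>"):
    """Baseline (2): mask document words that also appear in the profile."""
    profile_vocab = {w.lower() for w in profile_text.split()}
    return " ".join(
        mask if w.lower() in profile_vocab else w for w in text.split()
    )

def bm25_reidentify(redacted_doc, profile_texts):
    """Rank candidate profiles against a redacted document with BM25
    and return the index of the top-scoring profile."""
    bm25 = BM25Okapi([p.lower().split() for p in profile_texts])
    scores = bm25.get_scores(redacted_doc.lower().split())
    return max(range(len(profile_texts)), key=scores.__getitem__)
```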
Table 1 shows the results. While IR-based ReID is able to reidentify most of the original documents, without named entities or lexical matches, documents appear to be no longer reidentifiable. However, our model is able to reidentify 80% of documents, even with all entities removed. With all lexical matches with the profile removed (32% of total words), NN ReID is still able to reidentify a non-trivial number of documents.
This experiment indicates that even in the WikiBio domain, there are a significant number of quasi-identifiers that allow the system to identify documents even when almost all known matching information is removed. In this work we study methods for discovering and quantifying these identifiers.
4 Deidentification by Inference

An overview of our data and system is shown in Figure 1. Given a document x_1 … x_N, we consider the problem of uniquely identifying the corresponding person y from a set of possible options Y. The system works in the presence of redactions defined by a latent binary mask z_1 … z_N on each position, where setting z_n = 1 masks word x_n.

We define a reidentification model as a model of p(y | x, z) that gives a probability to each profile in Y.
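Although the full inference procedure is developed in the rest of this section, the core search can be sketched as a greedy loop over the mask z: repeatedly mask whichever word most lowers the true profile's rank until the profile falls outside the top K. This is a simplified sketch of the combinatorial inference, not our exact algorithm, and reid_rank is a hypothetical helper returning the rank of the true profile under p(y | x, z).

```python
def greedy_deidentify(words, reid_rank, k):
    """Greedily add masks until the true profile's rank under the
    reidentification model is at least k (a K-anonymity-style stopping
    rule). reid_rank(words, mask) is a hypothetical scorer returning
    the rank of the true profile under p(y | x, z)."""
    mask = [0] * len(words)            # z_n = 1 means word n is masked
    while reid_rank(words, mask) < k:
        best_n, best_rank = None, -1
        for n in range(len(words)):
            if mask[n]:
                continue
            mask[n] = 1                # tentatively mask word n
            rank = reid_rank(words, mask)
            mask[n] = 0
            if rank > best_rank:
                best_n, best_rank = n, rank
        if best_n is None:             # every word is already masked
            break
        mask[best_n] = 1               # commit the single best mask
    return mask
```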