
Maximov et al., 2020; Gupta et al., 2021). In NLP, some work has been done on evaluating the reidentification risk of deidentified text (Scaiano et al., 2016). El Emam et al. (2009) propose a method for deidentification of tabular datasets based on the concept of k-anonymity. Gardner and Xiong (2009) deidentify unstructured text by performing named entity extraction and redacting entities until k-anonymity is achieved. Mansour et al. (2021) propose an algorithm for deidentification of tabular datasets that quantifies reidentification risk using a metric related to k-anonymity. In our work, we train a reidentification model in an adversarial setting and use the model to deidentify documents directly.
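To make the shared criterion behind these tabular methods concrete, the sketch below checks k-anonymity over tuples of quasi-identifier values; the function name and toy records are illustrative assumptions and are not taken from the cited papers.

```python
from collections import Counter

def satisfies_k_anonymity(records, k):
    """A record set is k-anonymous if every combination of
    quasi-identifier values is shared by at least k records."""
    counts = Counter(tuple(r) for r in records)
    return all(c >= k for c in counts.values())

# Toy example: with exact birth dates redacted, each remaining
# (occupation, country) pair occurs at least twice, so k=2 holds.
records = [("writer", "france"), ("writer", "france"),
           ("singer", "spain"), ("singer", "spain")]
print(satisfies_k_anonymity(records, k=2))  # True
```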
Learning in the presence of masks. Various works have shown how to improve NLP models by masking some of the input during training. Chen and Ji (2020) show that learning in the presence of masks can improve classifier interpretability and accuracy. Li et al. (2016) use reinforcement learning to train a model that searches for the minimum subset of words that, when removed, changes the output of a classifier, and apply their method to neural network interpretability. Liao et al.
(2020) pre-train a BERT-style language model for masked-word prediction by sampling a masking ratio from $\mathcal{U}(0, 1)$ and masking that fraction of words. While their method was originally proposed for text generation, we apply the same masking approach to train language models for redaction.
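As a rough sketch of this masking scheme (the whitespace tokenization and mask symbol are our own simplifications rather than details from Liao et al.), one can draw a ratio from $\mathcal{U}(0, 1)$ and mask that fraction of words:

```python
import random

def random_ratio_mask(tokens, mask_token="[MASK]"):
    """Draw a masking ratio uniformly from [0, 1] and mask that
    fraction of the words, chosen at random positions."""
    ratio = random.uniform(0.0, 1.0)               # masking ratio ~ U(0, 1)
    n_mask = round(ratio * len(tokens))            # number of words to mask
    masked_positions = set(random.sample(range(len(tokens)), n_mask))
    return [mask_token if i in masked_positions else tok
            for i, tok in enumerate(tokens)]

print(random_ratio_mask("george mikell is a lithuanian-born australian actor".split()))
```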
3 Motivating Experiment: Quasi-Identifiers
In order to study the problem of deidentifying personal information in documents, we set up a model dataset of personal profiles from Wikipedia, using the WikiBio dataset (Lebret et al., 2016). Each entry in the dataset contains a document, the introductory text of the Wikipedia article, and a profile, the infobox of key-value pairs containing personal information. We train on the training set of 582,659 documents and profiles. At test time, we evaluate only on test documents, but consider all 728,321 profiles from the concatenation of the train, validation, and test sets. This dataset represents a natural baseline: it provides a range of factual profile information for a large collection of individuals, making it challenging to deidentify, and it is an openly available collection for comparing models.
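As a concrete sketch of this setup, the snippet below assembles the training documents and the full candidate profile set from the Hugging Face copy of WikiBio; the dataset id, split names, and field layout are our assumptions about that release and should be checked against its actual schema.

```python
from datasets import load_dataset

# Load WikiBio (Lebret et al., 2016); "wiki_bio" is the Hugging Face id of
# one public copy of the dataset (field names below are assumptions).
wikibio = load_dataset("wiki_bio")

def to_profile(example):
    """Turn the infobox table of an entry into a dict of key-value pairs."""
    table = example["input_text"]["table"]
    return dict(zip(table["column_header"], table["content"]))

train_docs = [ex["target_text"] for ex in wikibio["train"]]   # introductory text
all_profiles = [to_profile(ex)                                 # candidate set at test time
                for split in wikibio                           # train + validation + test
                for ex in wikibio[split]]
print(len(train_docs), len(all_profiles))
```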
                  DeID:  None   Named entity   Lexical
ReID (% words masked):   (0%)   (24%)          (28%)
IR ReID                  74.9    4.3            0.0
NN ReID                  99.6   79.7           31.9

Table 1: Percentage of documents reidentified (ReID) for different deidentification methods; the percentage of words masked by each method is shown in parentheses.
Is it difficult to deidentify individuals in this dataset? Wikipedia text poses no domain-specific challenges, so finding entities is trivial. In addition, many of the terms in the documents overlap directly with the terms in the profile table. Simple techniques should therefore provide robust deidentification.
We test this with two deidentification techniques: (1) Named entity removes all words in documents that are tagged as named entities. (2) Lexical removes all words in the document that also overlap with the profile. To reidentify, we use an information retrieval model (BM25) and a dense neural network approach (described in Section 5).
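A minimal sketch of these two baselines and the BM25 reidentifier is given below, assuming spaCy for entity tagging and the rank_bm25 package for retrieval; both tooling choices and the profile serialization are ours, not prescribed by the paper.

```python
import spacy
from rank_bm25 import BM25Okapi

nlp = spacy.load("en_core_web_sm")
MASK = "[MASK]"

def named_entity_deid(text):
    """(1) Named entity: mask every token tagged as part of a named entity."""
    return " ".join(MASK if t.ent_type_ else t.text for t in nlp(text))

def lexical_deid(text, profile_values):
    """(2) Lexical: mask every word that also appears in the profile."""
    profile_vocab = {w.lower() for v in profile_values for w in v.split()}
    return " ".join(MASK if w.lower() in profile_vocab else w for w in text.split())

def bm25_reidentify(deid_doc, profiles):
    """IR ReID: score every profile against the redacted document with BM25
    and return the index of the best-matching profile."""
    corpus = [" ".join(f"{k} {v}" for k, v in p.items()).split() for p in profiles]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(deid_doc.split())
    return max(range(len(profiles)), key=lambda i: scores[i])
```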
Table 1 shows the results. While IR-based ReID is able to reidentify most of the original documents, once named entities or lexical matches are removed, documents appear to no longer be reidentifiable. However, our model is able to reidentify 80% of documents even with all entities removed. With all lexical matches with the profile removed (32% of total words), NN ReID is still able to reidentify a non-trivial number of documents.
This experiment indicates that, even in the WikiBio domain, there are a significant number of quasi-identifiers that allow the system to identify documents even when almost all known matching information is removed. In this work we study methods for discovering and quantifying these identifiers.
4 Deidentification by Inference
An overview of our data and system is shown in Figure 1. Given a document $x_1 \ldots x_N$, we consider the problem of uniquely identifying the corresponding person $y$ from a set of possible options $\mathcal{Y}$. The system works in the presence of redactions defined by a latent binary mask $z_1 \ldots z_N$ on each position, where setting $z_n = 1$ masks word $x_n$.
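As a small illustration of this notation (the example sentence and mask token are ours), setting $z_n = 1$ replaces word $x_n$ with a redaction symbol:

```python
def apply_mask(x, z, mask_token="[MASK]"):
    """Redact document x1..xN under the binary mask z1..zN (z_n = 1 masks x_n)."""
    return [mask_token if z_n == 1 else x_n for x_n, z_n in zip(x, z)]

x = "george mikell is a lithuanian-born australian actor".split()
z = [1, 1, 0, 0, 1, 1, 0]   # mask the name and nationality terms
print(apply_mask(x, z))
# ['[MASK]', '[MASK]', 'is', 'a', '[MASK]', '[MASK]', 'actor']
```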
We define a reidentification model as a model of $p(y \mid x, z)$ that gives a probability to each profile in