
[Figure 1 appears here: (a) policy excerpt, (b) extracted tuples, (c) PoliGraph knowledge graph.]
Fig. 1. Example of a privacy policy and analysis approaches. (a) The excerpt is from the policy of KAYAK [36]. It contains sections and lists, regarding: what is collected (data type), how it is used (purpose), who receives the information (entity), and references across sentences (e.g., “personal information” relates to other data types; “this information” refers to “location”). (b) Prior work extracts elements found in each sentence, mainly data types and entities, as disconnected tuples. Purposes can also be extracted to extend the tuple [13, 63]. (c) PoliGraph is a knowledge graph that encodes data types, entities, and purposes; and two types of relations between them (collection and subsumption), possibly specified across different sentences. A COLLECT edge represents that a data type is collected by an entity, while edge attributes represent the purposes of that collection. SUBSUME edges represent the subsumption relations between generic and specific terms.
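To make the graph in Figure 1(c) concrete, the sketch below rebuilds it as a small directed graph. This is only an illustration under assumptions: the networkx library, the `rel` and `purposes` attribute names, and the edge directions are choices made here for readability, not details prescribed by PoliGraph.

```python
# Illustrative sketch of the Figure 1(c) graph; attribute names and edge
# directions are assumptions, not PoliGraph's actual implementation.
import networkx as nx

g = nx.DiGraph()

# COLLECT edges (entity -> data type), with the purposes of that collection
# kept as an edge attribute.
g.add_edge("we", "personal information", rel="COLLECT",
           purposes=["provide services", "auth account"])
g.add_edge("we", "location", rel="COLLECT", purposes=["provide features"])
g.add_edge("travel partners", "personal information", rel="COLLECT", purposes=[])
g.add_edge("social networking services", "personal information", rel="COLLECT",
           purposes=[])

# SUBSUME edges (generic term -> specific term), possibly stated in sentences
# other than the ones that disclose collection.
g.add_edge("personal information", "device information", rel="SUBSUME")
g.add_edge("personal information", "location", rel="SUBSUME")
g.add_edge("device information", "IP address", rel="SUBSUME")

# The cross-sentence links are now explicit: all data types that
# "personal information" subsumes, directly or transitively.
subsume = nx.DiGraph((u, v) for u, v, d in g.edges(data=True)
                     if d["rel"] == "SUBSUME")
print(nx.descendants(subsume, "personal information"))
# e.g., {'device information', 'location', 'IP address'}
```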
with privacy laws, and law enforcement agencies who want to audit organizations’ data collection practices and hold
them accountable. Unfortunately, privacy policies are typically lengthy and complicated, making them hard not only for the average user to understand, but also for experts to analyze in depth and at scale [34].
NLP Analysis and Limitations. To address this challenge, as well as to facilitate expert analysis [67] and crowdsourced annotation [68], the research community has recently applied natural language processing (NLP) to automate the analysis of privacy policies. State-of-the-art examples include the following: PolicyLint [3] extracts data types and entities that collect them, and analyzes potential contradictions within a privacy policy; PoliCheck [4] builds on PolicyLint and further compares the privacy policy statements with the data collection practices observed in the network traffic; Polisis [26] and PurPliance [13] extract data collection purposes; and OVRseen [63] leverages PoliCheck and Polisis to associate data types, entities, and purposes. Despite promising results, this body of work also has certain limitations.
First, existing privacy policy analyzers extract statements (about what is collected, i.e., data type; who collects it, i.e., entity; and for what purpose) as disconnected labels [26] or tuples [3, 13], ignoring the links between information disclosed across sentences, paragraphs or sections. However, today’s privacy policies typically have a structure that discloses data types being collected, third-party sharing and usage purposes in separate sections¹, as shown in the example in Figure 1(a). Polisis [26] uses separate text classifiers to label data types, third-party entities and purposes disclosed in each paragraph. Without connecting these labels, it is unclear which data type is collected by which entity, and what purpose applies. PolicyLint [3] and PurPliance [13] adopt tuple representations that put together entities, data types and purposes disclosed in each sentence, as shown in Figure 1(b). However, the tuples still miss context from other sentences. For example, it cannot be inferred from the tuples that the purpose “provide features” applies to the collection of “location”; or that the usage purposes and third-party entities in later sections are related to the specific types of “personal information” (e.g., “device information”) listed in the first section.
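The missing link can be seen by querying the per-sentence tuples of Figure 1(b) directly. The snippet below is a minimal sketch under assumptions (the tuple fields and their normalization are invented for illustration and are not the output format of PolicyLint or PurPliance): because the purpose is stated for the unresolved phrase “this information”, a lookup for the purposes of collecting “location” returns nothing.

```python
# Minimal sketch of per-sentence tuples in the spirit of Figure 1(b); the
# field names are illustrative, not PolicyLint's or PurPliance's format.
from collections import namedtuple

Statement = namedtuple("Statement", ["entity", "action", "data_type", "purpose"])

statements = [
    Statement("we", "collect", "personal information", None),
    Statement("we", "collect", "device information", None),
    Statement("we", "collect", "IP address", None),
    Statement("we", "collect", "location", None),
    # "this information" refers to "location", but per-sentence extraction keeps
    # the unresolved phrase, so the purpose attaches to the wrong term.
    Statement("we", "collect", "this information", "provide features"),
    Statement("travel partners", "collect", "personal information", None),
    Statement("social networking services", "collect", "personal information", None),
]

# The purpose lookup for "location" comes back empty: the link between the
# purpose and the data type spans two sentences and is lost in the tuples.
purposes = [s.purpose for s in statements
            if s.data_type == "location" and s.purpose is not None]
print(purposes)  # []
```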
Second, because of this incomplete context, prior work needs to map and relate the semantics of the terms across different sentences by introducing ontologies that encode subsumption relations between data types or entities. So far,
¹We read through 200 privacy policies in our test set (see Section 4.2). Among them, 135 discuss definitions and practices concerning the same data types in different sections, requiring the information to be pieced together to get the full context about collection, use and sharing of these data types. In particular, 104 divide content into sections addressing collection, use, and sharing of “personal information”, resembling the structure shown in Figure 1(a).