PoliGraph: Automated Privacy Policy Analysis using Knowledge Graphs
HAO CUI, University of California, Irvine, USA
RAHMADI TRIMANANDA, University of California, Irvine, USA
SCOTT JORDAN, University of California, Irvine, USA
ATHINA MARKOPOULOU, University of California, Irvine, USA
Privacy policies disclose how an organization collects and handles personal information. Recent work has made progress in leveraging
natural language processing (NLP) to automate privacy policy analysis and extract data collection statements from different sentences,
considered in isolation from each other. In this paper, we view and analyze, for the first time, the entire text of a privacy policy in an
integrated way. In terms of methodology: (1) we define PoliGraph, a type of knowledge graph that captures statements in a privacy
policy as relations between different parts of the text; and (2) we revisit the notion of ontologies, previously defined in heuristic ways,
to capture subsumption relations between terms. We make a clear distinction between local and global ontologies to capture the
context of individual privacy policies, application domains, and privacy laws. We develop PoliGrapher, an NLP tool to automatically
extract PoliGraph from the text using linguistic analysis. Using a public dataset for evaluation, we show that PoliGrapher identifies
40% more collection statements than the prior state of the art, with 97% precision. In terms of applications, PoliGraph enables automated
analysis of a corpus of privacy policies and allows us to: (1) reveal common patterns in the texts across different privacy policies, and
(2) assess the correctness of the terms as defined within a privacy policy. We also apply PoliGraph to: (3) detect contradictions in a
privacy policy, where we show false alarms by prior work, and (4) analyze the consistency of privacy policies and network traffic,
where we identify significantly more clear disclosures than prior work. Finally, leveraging the capabilities of the emerging large
language models (LLMs), we also present PoliGrapher-LM, a tool that uses LLM prompting instead of NLP linguistic analysis to
extract PoliGraph from the privacy policy text, and we show that it further improves coverage.
CCS Concepts: • Security and privacy → Human and societal aspects of security and privacy.
Additional Key Words and Phrases: Privacy, Privacy Policies, Linguistic Analysis, Large Language Models (LLMs).
ACM Reference Format:
Hao Cui, Rahmadi Trimananda, Scott Jordan, and Athina Markopoulou. 2018. PoliGraph: Automated Privacy Policy Analysis using
Knowledge Graphs. 1, 1 (March 2018), 46 pages. https://doi.org/XXXXXXX.XXXXXXX
1 Introduction
Privacy Policies. Privacy laws, such as the General Data Protection Regulation (GDPR) [57], the California Consumer
Privacy Act (CCPA) [62], and other data protection laws, require organizations to disclose the personal information
they collect, as well as how and why they use and share it. Privacy policies are the primary legally-binding way for
organizations to disclose their data collection practices to the users of their products. They receive much attention from
many stakeholders, such as users who want to exercise their rights, developers who want their systems to be compliant
Authors’ Contact Information: Hao Cui, cuih7@uci.edu, University of California, Irvine, Irvine, California, USA; Rahmadi Trimananda, rtrimana@uci.edu,
University of California, Irvine, Irvine, California, USA; Scott Jordan, sjordan@uci.edu, University of California, Irvine, Irvine, California, USA; Athina
Markopoulou, cuih7@uci.edu, University of California, Irvine, Irvine, California, USA.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on
servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Manuscript submitted to ACM 1
arXiv:2210.06746v3 [cs.CR] 6 Mar 2025
[Figure 1, panel (a), policy excerpt:]
  We collect the following categories of personal information:
    Device information... such as IP address...
    Location. We use this information to provide features...
  We use your personal information... to:
    Provide the Services...
    Authenticate your account...
  We disclose the personal information... as follows:
    With our travel partners...
    With social networking services...
[Figure 1, panel (b), tuples extracted sentence by sentence: (we, collect, personal information), (we, collect, device information), (we, collect, IP address), (we, collect, location), (we, collect, this information), (travel partners, collect, personal information), (social networking services, collect, personal information); extracted purposes: [provide features], [provide the Services], [authenticate your account].]
[Figure 1, panel (c), knowledge graph: entity nodes "we", "travel partners", "social networking services"; data type nodes "personal information", "device information", "location", "IP address"; COLLECT and SUBSUME edges; purposes [provide features], [provide services], [auth account] as edge attributes. Legend: Data type, Entity, Purpose.]
Fig. 1. Example of a privacy policy and analysis approaches. (a) The excerpt is from the policy of KAYAK [36]. It contains sections
and lists, regarding: what is collected (data type), how it is used (purpose), who receives the information (entity), and references across
sentences (e.g., “personal information” relates to other data types; “this information” refers to “location”). (b) Prior work extracts
elements found in each sentence, mainly data types and entities, as disconnected tuples. Purposes can also be extracted to extend
the tuple [13, 63]. (c) PoliGraph is a knowledge graph that encodes data types, entities, and purposes; and two types of relations
between them (collection and subsumption), possibly specified across different sentences. A COLLECT edge represents that a data type
is collected by an entity, while edge attributes represent the purposes of that collection. SUBSUME edges represent the subsumption
relations between generic and specific terms.
with privacy laws, and law enforcement agencies who want to audit organizations’ data collection practices and hold
them accountable. Unfortunately, privacy policies are typically lengthy and complicated, making it hard not only for
the average user to understand, but also for experts to analyze in depth and at scale [34].
NLP Analysis and Limitations. To address this challenge, as well as to facilitate expert analysis [67] and crowdsourced
annotation [68], the research community has recently applied natural language processing (NLP) to automate the
analysis of privacy policies. State-of-the-art examples include the following: PolicyLint [3] extracts data types and
entities that collect them, and analyzes potential contradictions within a privacy policy; PoliCheck [4] builds on
PolicyLint and further compares the privacy policy statements with the data collection practices observed in the
network traffic; Polisis [26] and PurPliance [13] extract data collection purposes; and OVRseen [63] leverages PoliCheck
and Polisis to associate data types, entities, and purposes. Despite promising results, this body of work also has certain
limitations.
First, existing privacy policy analyzers extract statements (about what is collected, i.e., data type; who collects it,
i.e., entity; and for what purpose) as disconnected labels [26] or tuples [3, 13], ignoring the links between information
disclosed across sentences, paragraphs or sections. However, today’s privacy policies typically have a structure that
discloses data types being collected, third-party sharing and usage purposes in separate sections¹, as shown in the
example in Figure 1(a). Polisis [26] uses separate text classifiers to label data types, third-party entities and purposes
disclosed in each paragraph. Without connecting these labels, it is unclear which data type is collected by which entity,
and what purpose applies. PolicyLint [3] and PurPliance [13] adopt tuple representations that put together entities,
data types and purposes disclosed in each sentence, as shown in Figure 1(b). However, the tuples still miss context from
other sentences. For example, it cannot be inferred from the tuples that the purpose “provide features” applies to the
collection of “location”; or that the usage purposes and third-party entities in later sections are related to the specific
types of “personal information” (e.g., “device information”) listed in the first section.
Second, because of this incomplete context, prior work needs to map and relate the semantics of the terms across
different sentences by introducing ontologies that encode subsumption relations between data types or entities. So far,
¹ We read through 200 privacy policies in our test set (see Section 4.2). Among them, 135 discuss definitions and practices concerning the same data types
in different sections, so the information must be put together to get the full context about collection, use and sharing of these data types. In particular,
104 divide content into sections addressing collection, use, and sharing of “personal information”, resembling the structure shown in Figure 1(a).
these ontologies have been built in a manual or semi-automated fashion by domain experts, who define lists of terms
commonly found in privacy policy text and other sources (e.g., network traffic), and subsumption relations between
them (e.g., the term “device information” subsumes “IP address”). The resulting ontologies are not universal: they do not
necessarily agree with all privacy policies and need to be adapted to different application domains, e.g., mobile [3, 4, 13],
smart speakers [32, 38], and VR [63]. As a result, they often generate ambiguous or wrong results that require further
validation by experts. Manandhar et al. [42] recently reported that state-of-the-art analyzers [3, 4, 26] incorrectly reason
about more than half of the privacy policies they analyzed.
The PoliGraph Framework. Our key observation is that a policy² should be treated in its entirety, leveraging terms
in different sentences that are related. To that end, we make the following methodological contributions.
First, we propose to extract and encode statements in a policy (i.e., what data types are collected, with what entities
they are shared, and for what purposes) into a knowledge graph [37, 45], which we refer to as PoliGraph; Figure 1(c)
shows an example³. Nodes represent data types or entities. Edges represent relations between nodes, e.g., an entity may
collect a particular data type, and a more generic data type may subsume a more specific data type. An edge representing
data collection may have an attribute indicating the purposes. The graph in Figure 1(c) naturally links the extracted
information by merging the same data types and entities and establishing edges between them. It allows inferences such
as “IP address” being collected for the purpose “provide services”, and “location” being collected by “travel partners”.
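To make the inference step concrete, the following is a minimal sketch of such a graph in Python. It is not the authors' implementation: the class name `PoliGraphSketch` and its methods are hypothetical, and the edges are transcribed by hand from the Figure 1 example under the assumption that "personal information" subsumes both "device information" and "location".

```python
from collections import defaultdict


class PoliGraphSketch:
    """Toy knowledge graph with COLLECT and SUBSUME edges (illustrative only)."""

    def __init__(self):
        # SUBSUME edges: generic term -> set of more specific terms.
        self.subsume = defaultdict(set)
        # COLLECT edges: (entity, data type) -> purposes (the edge attribute).
        self.collect = defaultdict(set)

    def add_subsume(self, generic, specific):
        self.subsume[generic].add(specific)

    def add_collect(self, entity, data_type, purposes=()):
        self.collect[(entity, data_type)].update(purposes)

    def descendants(self, term):
        """All terms reachable from `term` by following SUBSUME edges."""
        seen, stack = set(), [term]
        while stack:
            for child in self.subsume.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    def collectors(self, data_type):
        """Map each entity that collects `data_type`, directly or through a
        subsuming term, to the purposes attached to those COLLECT edges."""
        result = defaultdict(set)
        for (entity, collected), purposes in self.collect.items():
            if collected == data_type or data_type in self.descendants(collected):
                result[entity].update(purposes)
        return dict(result)


# Edges transcribed by hand from the Figure 1 example (an assumption).
g = PoliGraphSketch()
g.add_subsume("personal information", "device information")
g.add_subsume("device information", "IP address")
g.add_subsume("personal information", "location")
g.add_collect("we", "personal information",
              {"provide services", "authenticate account"})
g.add_collect("we", "location", {"provide features"})
g.add_collect("travel partners", "personal information")
g.add_collect("social networking services", "personal information")

ip_collectors = g.collectors("IP address")
location_collectors = g.collectors("location")
```

In this sketch, `ip_collectors["we"]` contains "provide services" and `location_collectors` contains "travel partners", mirroring the two inferences mentioned above. Neither fact is available from any single sentence-level tuple.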
Second, for policies that are not well written, the extracted PoliGraph may be missing subsumption relations
between terms that are not fully defined in the policies. To supplement the missing relations, we use ontologies, as
in prior work [3, 4, 13]; however, we redefine and use them as follows. First, we consider the subsumption relations
extracted from each individual policy as the local ontology defined by it. Next, we also define additional subsumption
relations that encode external knowledge, beyond what is stated in the text of an individual policy; we refer to these as
global ontologies. They can be defined by domain experts, using information from multiple policies, or from privacy
laws; for example, in Section 3.2, we define a data ontology based on the CCPA [62].
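The distinction between local and global ontologies can be sketched as two separate sets of subsumption edges that are combined only when a query needs external knowledge. The edge sets and the helper `ancestors` below are hypothetical illustrations, not the CCPA-based ontology of Section 3.2.

```python
def ancestors(term, edges):
    """Terms that subsume `term`, following (generic, specific) edges upward."""
    found, frontier = set(), {term}
    while frontier:
        # Move one level up, skipping terms that were already visited.
        frontier = {generic for (generic, specific) in edges
                    if specific in frontier} - found
        found |= frontier
    return found


# Local ontology: subsumption relations stated in one policy (assumed here).
local_ontology = {("device information", "IP address")}

# Global ontology: external knowledge, e.g. defined by experts (assumed here).
global_ontology = {
    ("personal information", "device information"),
    ("personal information", "location"),
}

local_only = ancestors("IP address", local_ontology)
supplemented = ancestors("IP address", local_ontology | global_ontology)
```

With the local ontology alone, "IP address" is only known to be "device information"; after supplementing with the global edges, it is also recognized as "personal information". Keeping the two sets apart also makes it possible to compare a policy's own definitions against the external ones.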
PoliGrapher: Generating PoliGraph using Linguistic Analysis. We present PoliGrapher, a methodology and
implementation that applies NLP linguistic analysis to automatically extract and build a PoliGraph from the policy text.
To that end, we address several challenges, including coreference resolution, list parsing, phrase normalization, and
purpose phrase classification, to extract and link more information than prior work. We evaluate PoliGrapher on a
public dataset from PoliCheck [4], consisting of over 6K policies from over 13K mobile apps. Our manual validation
shows that PoliGraph improves the recall of collection statements from 27% to 66%, compared to prior work [3], with
over 97% precision. The improvement is enabled by both the improved NLP techniques and the knowledge graph
representation, which can analyze statements spanning multiple sentences and sections in the policy document.
Applications. PoliGraph enables two new types of automated analyses, which were not previously possible. First,
PoliGraph is used to summarize policies in our dataset and reveal common patterns across them. This is possible
because PoliGraph, by representing each policy as a whole, allows inferences about more collection statements. We find
that 64% of policies disclose the collection of software identifiers and, in particular, cookies. Advertisers and analytics
providers are major entities that collect such data. This is further reinforced by the finding that more than half of
the policies disclose data usage for non-core purposes, namely for advertising and analytics. We also find that the
use of generic terms for data types (e.g., “personal information”), often without more precise definitions, reduces the
² In the rest of the paper, we refer to a privacy policy simply as “policy”.
transparency and leaves the specific data types being collected unknown. Second, different policies may have different
definitions of the same terms. By clearly separating local ontologies from global ones, PoliGraph allows us to assess the
correctness of the term definitions. For example, we find that many policies declare the collected data as “non-personal
information”, which contradicts common knowledge and our CCPA-based global data ontology (see Sections 3.2 and 5.2).
We also find that non-standard terms are widely used, with varied definitions across policies.
We also apply PoliGraph to revisit two known applications of policy analysis. First, to identify contradictions within
a policy, we extend PoliGraph to analyze negative statements and take into account additional contexts that are crucial
for interpreting contradictions, such as (1) fine-grained actions (e.g., “sell” for profit vs. “sharing”), and (2) data subjects
(e.g., children vs. general users). We show that the majority of contradictions found by prior work are false alarms due to
language nuances and missing contexts (e.g., data subjects). Second, we apply PoliGraph to analyze data flow-to-policy
consistency. As a result of the improved recall of our approach, we show that prior work [4] has underestimated the
number of policies that clearly disclose some sensitive data flows.
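The role of context in contradiction detection can be illustrated with a small sketch. It compares a context-free check against one that also requires the action and the data subject to match. The `Statement` fields and both functions are hypothetical simplifications and do not reproduce the analysis in Section 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    entity: str
    action: str                      # e.g. "collect", "share", "sell"
    data_type: str
    subject: str = "general users"   # whose data the statement is about
    negated: bool = False            # True for "we do not ..." statements


def naive_conflict(a, b):
    """Flag any positive/negative pair on the same entity and data type."""
    return (a.entity == b.entity and a.data_type == b.data_type
            and a.negated != b.negated)


def contextual_conflict(a, b):
    """Additionally require the same fine-grained action and data subject."""
    return (naive_conflict(a, b) and a.action == b.action
            and a.subject == b.subject)


share = Statement("we", "share", "personal information")
no_sell = Statement("we", "sell", "personal information", negated=True)
collect = Statement("we", "collect", "personal information")
no_collect_children = Statement("we", "collect", "personal information",
                                subject="children", negated=True)

naive_flags = [naive_conflict(share, no_sell),
               naive_conflict(collect, no_collect_children)]
contextual_flags = [contextual_conflict(share, no_sell),
                    contextual_conflict(collect, no_collect_children)]
```

Both pairs are flagged by the context-free check, yet neither is a contradiction once the action ("sell" vs. "share") and the data subject (children vs. general users) are taken into account.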
PoliGrapher-LM: Generating PoliGraph using LLMs. The recent developments in Large Language Models (LLMs)
have greatly advanced natural language processing. To take advantage of and evaluate the capabilities of LLMs for
privacy policy analysis, we further develop PoliGrapher-LM, an alternative implementation of PoliGrapher that
extracts PoliGraph by prompting an LLM. We address LLMs’ limitations, particularly hallucination and coverage
errors, by programmatically constraining the output and prompting the LLM to reflect on its output. Our evaluation
shows that PoliGrapher-LM extracts PoliGraphs with high precision and further improves the recall of collection
statements to 83%, a significant improvement over the linguistic analysis in PoliGrapher. However, the high cost of
LLMs is a major barrier to deploying PoliGrapher-LM at scale.
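One plausible way to constrain LLM output programmatically is to accept only extracted phrases that literally occur in the policy text. The sketch below assumes the model returns a JSON list of records with `entity` and `data_type` fields; this format and the function `filter_grounded` are assumptions for illustration, not the mechanism described in Section 6.

```python
import json


def filter_grounded(llm_output, policy_text):
    """Keep only records whose entity and data type appear in the policy text.

    Malformed output is rejected outright instead of being repaired.
    """
    try:
        records = json.loads(llm_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(records, list):
        return []
    text = policy_text.lower()
    kept = []
    for record in records:
        if not isinstance(record, dict):
            continue
        phrases = [record.get("entity"), record.get("data_type")]
        # A phrase that is absent from the text is treated as hallucinated.
        if all(isinstance(p, str) and p.lower() in text for p in phrases):
            kept.append(record)
    return kept


policy = "We collect your IP address and location."
raw = json.dumps([
    {"entity": "we", "data_type": "IP address"},
    {"entity": "we", "data_type": "credit card number"},  # not in the text
])
grounded = filter_grounded(raw, policy)
```

A substring check of this kind only guards against invented phrases. It does nothing for coverage errors, which is where a second, reflective prompt would come in.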
Overview. The rest of the paper is structured as follows. Section 2 discusses related work. Section 3 defines the
proposed PoliGraph framework and the ontologies used with it. Section 4 describes the implementation of PoliGrapher
that uses NLP linguistic analysis to build PoliGraph from the text of a policy, and its evaluation. Section 5 presents
applications of PoliGraph to policy analysis. Section 6 presents the implementation of PoliGrapher-LM that uses
LLMs to build PoliGraph, and its evaluation. Finally, Section 7 concludes the paper and discusses future directions.
The appendices, uploaded as supplemental materials, provide additional implementation details and evaluation results.
2 Related Work
Formalizing Policies. A body of related work focuses on standardizing or formalizing policies. The W3C P3P standard [69]
proposed an XML schema to describe policies. The Contextual Integrity (CI) [47] framework expresses policies as
information flows with parameters including the senders, recipients and subjects of information, data types, and
transmission principles that describe the contexts of data collection. Neither of them replaces text-format policies, but
they give insights into defining policies and serve as analysis frameworks. PoliGraph builds on the CI framework by
extracting entities, data types, and part of the transmission principle (i.e., purposes) from the policy text.
Policy Analysis. Another body of work analyzes policy text. OPP-115 [67] is a policy dataset with manual annotations
for fine-grained data practices labeled by experts. Shvartzshnaider et al. [61], with the help of crowdsourced workers,
analyze CI information flows extracted from policies to identify writing issues, such as incomplete context and vagueness.
This manual approach is difficult to scale up to hundreds or thousands of policies because of the significant human effort required.
Automated Policy Analysis. The progress in NLP has made it possible to automate the analysis of unstructured text,
such as policy text. Privee [71] uses binary text classifiers to answer whether a policy specifies certain privacy practices,
such as data collection, encryption and ad tracking. Polisis [26], trained on the OPP-115 dataset, uses 10 multi-label text
classifiers to identify data practices, such as the category of data types being discussed and purposes. Classifier-based
methods use pre-defined labels which cannot capture the finer-grained semantics in the text. PolicyLint [3] first uses NLP
linguistic analysis to extract data types and entities in collection statements. PurPliance [13], built on top of PolicyLint,
further extracts purposes. Conceptually, both works focus on analyzing one sentence at a time, and extracting a tuple
(entity, collect, data type), as well as purpose in PurPliance, albeit in a separate, nested tuple (data type, for / not_for,
entity, purpose). Unlike PoliGraph, these works view extracted tuples individually and do not infer data practices
disclosed across multiple sentences.
Knowledge Graphs. Graphs are routinely used to integrate knowledge bases as relationships between terms [45].
Google has used a knowledge graph built from crawled data to show suggestions in search results [24]. OpenIE [5]
and T2KG [37] use NLP to build knowledge graphs from a large corpus of unstructured text. In PoliGraph, we use
knowledge graphs, for the first time, to represent policies.
The PoliGraph framework first appeared in Cui et al. [16]. Compared to that paper, this journal submission provides
additional results and new materials. In particular, the design and evaluation of PoliGrapher-LM in Section 6 is new to
this paper, and was motivated by the breakthroughs of large language models (LLMs) that happened after the acceptance
of the original paper to the USENIX Security Symposium.
3 The PoliGraph Framework
In this section, we introduce PoliGraph, our proposed representation of the entire text of a policy as a knowledge
graph. We also revisit the related notion of ontologies, and we propose a new definition and use it with PoliGraph.
3.1 Defining PoliGraph
We define PoliGraph as a knowledge graph that captures statements in a policy considered as a whole. Throughout
this section, we will use Figure 1 as our running example to illustrate the terminology and definitions.
Privacy laws, such as the GDPR [57] and the CCPA [62], require that organizations disclose their practices regarding
data collection, sharing and use in their policies. To capture these three aspects of disclosures in the policy, we represent
the corresponding three kinds of terms in PoliGraph: what data types are collected, with what entities they are shared,
and for what purposes they are used.
Data type: This kind of term refers to the type of data being collected. In Figure 1(a), “location” is a specific collected
data type. Generic terms can be used as well, e.g., “personal information” and “device information”.
Entity: This kind of term refers to the organization that receives the collected data. It can be the first party if it is
the developer of the product (e.g., website, mobile app, etc.) that writes the policy, namely “we” in Figure 1(a); or,
otherwise, a third party such as “travel partners” in Figure 1(a).
Purpose: Policies may also specify purposes.⁴ In Figure 1(a), purposes include “provide services”, “authenticate your
account”, and “provide features”.
⁴ In this paper, we refer to purposes of processing of personal data as specified in the GDPR, namely the purposes of collection, use, and sharing. US laws
often distinguish among the three, e.g., the CCPA appears to require a policy to separately disclose the purposes of collection / use and the purposes of
sharing personal information.
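The three kinds of terms can be written down as a small typed vocabulary. The sketch below is only an illustration of the definitions above, using hypothetical names (`TermKind`, `Term`) and the phrases from Figure 1(a).

```python
from dataclasses import dataclass
from enum import Enum


class TermKind(Enum):
    DATA_TYPE = "data type"   # what is collected
    ENTITY = "entity"         # who receives the data
    PURPOSE = "purpose"       # why the data is collected, used, or shared


@dataclass(frozen=True)
class Term:
    text: str
    kind: TermKind


# Terms taken from the running example in Figure 1(a).
terms = [
    Term("personal information", TermKind.DATA_TYPE),
    Term("device information", TermKind.DATA_TYPE),
    Term("location", TermKind.DATA_TYPE),
    Term("we", TermKind.ENTITY),
    Term("travel partners", TermKind.ENTITY),
    Term("provide services", TermKind.PURPOSE),
    Term("authenticate your account", TermKind.PURPOSE),
    Term("provide features", TermKind.PURPOSE),
]

# Group the phrases by kind, as a first step before adding relations.
by_kind = {kind: [t.text for t in terms if t.kind is kind] for kind in TermKind}
```

Here "we" is the first-party entity and "travel partners" is a third party; a fuller model would presumably record that distinction as well.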