
[Figure 1 appears here: (a) policy excerpt, (b) extracted tuples, (c) PoliGraph knowledge graph.]
Fig. 1. Example of a privacy policy and analysis approaches. (a) The excerpt is from the policy of KAYAK [36]. It contains sections and lists, regarding: what is collected (data type), how it is used (purpose), who receives the information (entity), and references across sentences (e.g., “personal information” relates to other data types; “this information” refers to “location”). (b) Prior work extracts elements found in each sentence, mainly data types and entities, as disconnected tuples. Purposes can also be extracted to extend the tuple [13, 63]. (c) PoliGraph is a knowledge graph that encodes data types, entities, and purposes; and two types of relations between them (collection and subsumption), possibly specified across different sentences. A COLLECT edge represents that a data type is collected by an entity, while edge attributes represent the purposes of that collection. SUBSUME edges represent the subsumption relations between generic and specific terms.
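To make the graph in Figure 1(c) concrete, the sketch below rebuilds it as a small directed graph. This is only an illustration under assumptions: the networkx library, the `rel` and `purposes` attribute names, and the edge directions are choices made here for readability, not details prescribed by PoliGraph.

```python
# Illustrative sketch of the Figure 1(c) graph; attribute names and edge
# directions are assumptions, not PoliGraph's actual implementation.
import networkx as nx

g = nx.DiGraph()

# COLLECT edges (entity -> data type), with the purposes of that collection
# kept as an edge attribute.
g.add_edge("we", "personal information", rel="COLLECT",
           purposes=["provide services", "auth account"])
g.add_edge("we", "location", rel="COLLECT", purposes=["provide features"])
g.add_edge("travel partners", "personal information", rel="COLLECT", purposes=[])
g.add_edge("social networking services", "personal information", rel="COLLECT",
           purposes=[])

# SUBSUME edges (generic term -> specific term), possibly stated in sentences
# other than the ones that disclose collection.
g.add_edge("personal information", "device information", rel="SUBSUME")
g.add_edge("personal information", "location", rel="SUBSUME")
g.add_edge("device information", "IP address", rel="SUBSUME")

# The cross-sentence links are now explicit: all data types that
# "personal information" subsumes, directly or transitively.
subsume = nx.DiGraph((u, v) for u, v, d in g.edges(data=True)
                     if d["rel"] == "SUBSUME")
print(nx.descendants(subsume, "personal information"))
# e.g., {'device information', 'location', 'IP address'}
```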
with privacy laws, and law enforcement agencies who want to audit organizations’ data collection practices and hold
them accountable. Unfortunately, privacy policies are typically lengthy and complicated, making them hard not only for the average user to understand, but also for experts to analyze in depth and at scale [34].
NLP Analysis and Limitations. To address this challenge, as well as to facilitate expert analysis [67] and crowdsourced annotation [68], the research community has recently applied natural language processing (NLP) to automate the analysis of privacy policies. State-of-the-art examples include the following: PolicyLint [3] extracts data types and entities that collect them, and analyzes potential contradictions within a privacy policy; PoliCheck [4] builds on PolicyLint and further compares the privacy policy statements with the data collection practices observed in the network traffic; Polisis [26] and PurPliance [13] extract data collection purposes; and OVRseen [63] leverages PoliCheck and Polisis to associate data types, entities, and purposes. Despite promising results, this body of work also has certain limitations.
First, existing privacy policy analyzers extract statements (about what is collected, i.e., data type; who collects it, i.e., entity; and for what purpose) as disconnected labels [26] or tuples [3, 13], ignoring the links between information disclosed across sentences, paragraphs or sections. However, today’s privacy policies typically have a structure that discloses data types being collected, third-party sharing and usage purposes in separate sections¹, as shown in the example in Figure 1(a). Polisis [26] uses separate text classifiers to label data types, third-party entities and purposes disclosed in each paragraph. Without connecting these labels, it is unclear which data type is collected by which entity, and what purpose applies. PolicyLint [3] and PurPliance [13] adopt tuple representations that put together entities, data types and purposes disclosed in each sentence, as shown in Figure 1(b). However, the tuples still miss context from other sentences. For example, it cannot be inferred from the tuples that the purpose “provide features” applies to the collection of “location”; or that the usage purposes and third-party entities in later sections are related to the specific types of “personal information” (e.g., “device information”) listed in the first section.
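The missing link can be seen by querying the per-sentence tuples of Figure 1(b) directly. The snippet below is a minimal sketch under assumptions (the tuple fields and their normalization are invented for illustration and are not the output format of PolicyLint or PurPliance): because the purpose is stated for the unresolved phrase “this information”, a lookup for the purposes of collecting “location” returns nothing.

```python
# Minimal sketch of per-sentence tuples in the spirit of Figure 1(b); the
# field names are illustrative, not PolicyLint's or PurPliance's format.
from collections import namedtuple

Statement = namedtuple("Statement", ["entity", "action", "data_type", "purpose"])

statements = [
    Statement("we", "collect", "personal information", None),
    Statement("we", "collect", "device information", None),
    Statement("we", "collect", "IP address", None),
    Statement("we", "collect", "location", None),
    # "this information" refers to "location", but per-sentence extraction keeps
    # the unresolved phrase, so the purpose attaches to the wrong term.
    Statement("we", "collect", "this information", "provide features"),
    Statement("travel partners", "collect", "personal information", None),
    Statement("social networking services", "collect", "personal information", None),
]

# The purpose lookup for "location" comes back empty: the link between the
# purpose and the data type spans two sentences and is lost in the tuples.
purposes = [s.purpose for s in statements
            if s.data_type == "location" and s.purpose is not None]
print(purposes)  # []
```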
Second, because of this incomplete context, prior work needs to map and relate the semantics of the terms across different sentences by introducing ontologies that encode subsumption relations between data types or entities. So far,
¹We read through 200 privacy policies in our test set (see Section 4.2). Among them, 135 discuss definitions and practices concerning the same data types in different sections, requiring the information to be pieced together to get the full context about collection, use and sharing of these data types. In particular, 104 divide content into sections addressing collection, use, and sharing of “personal information”, resembling the structure shown in Figure 1(a).