TAP Transparent and Privacy-Preserving Data Services

2025-05-02 0 0 998.55KB 17 页 10玖币

侵权投诉

TAP: Transparent and Privacy-Preserving Data Services

Daniël Reijsbergen,†Aung Maw,†Zheng Yang,‡Tien Tuan Anh Dinh,†and Jianying Zhou†

†Singapore University of Technology and Design, Singapore, Singapore

‡Southwest University, Chongqing, China

ABSTRACT

Users today expect more security from services that handle their

data. In addition to traditional data privacy and integrity require-

ments, they expect transparency, i.e., that the service’s processing

of the data is veriable by users and trusted auditors. Our goal is

to build a multi-user system that provides data privacy, integrity,

and transparency for a large number of operations, while achieving

practical performance.

To this end, we rst identify the limitations of existing approaches

that use authenticated data structures. We nd that they fall into

two categories: 1) those that hide each user’s data from other users,

but have a limited range of veriable operations (e.g., CONIKS,

Merkle

, and Proofs of Liabilities), and 2) those that support a wide

range of veriable operations, but make all data publicly visible

(e.g., IntegriDB and FalconDB). We then present TAP to address

the above limitations. The key component of TAP is a novel tree

data structure that supports ecient result verication, and relies

on independent audits that use zero-knowledge range proofs to

show that the tree is constructed correctly without revealing user

data. TAP supports a broad range of veriable operations, including

quantiles and sample standard deviations. We conduct a comprehen-

sive evaluation of TAP, and compare it against two state-of-the-art

baselines, namely IntegriDB and Merkle

, showing that the system

is practical at scale.

1 INTRODUCTION

Many of today’s applications collect large amounts of user data

and make decisions that have a direct impact on the user. One

example is a utility company that collects power usage data from

users and charges them dierent rates based on peak/o-peak pe-

riods per region. Another example is a road pricing system that

determines real-time trac conditions based on the number of cars

on the road, and charges motorists appropriate congestion fees.

The third example is an advertising service that monitors clicks

and pays the publisher for displaying the advertisement based on

the click-through rate. One desirable property of the applications

above is transparency [

], which allows users to verify that

the computation done on their data has been executed correctly.

This property is stronger than simply ensuring data integrity, as it

protects users from malicious or compromised service providers

who ignore data or tamper with the computation.

One approach toward transparent data services is for the provider

to make the raw data available to users. For example, the provider

can use a public bulletin board to store the data. Users can then

perform computations directly on the data, but they have no data

privacy. This is unacceptable in applications such as dynamic road

pricing, as it would allow any user to track the movements of other

users using the raw data. Alternatively, the service provider could

< < < < < < < < < <

0 1

00 01 10 11

chronological

preﬁx tree

sorted sum tree

for preﬁx 00 sorted sum tree

for preﬁx 11

Figure 1: TAP’s authenticated data structure.

store the data and return only query results. For example, an ad-

vertising service may have an API that returns the total number

of clicks on an advertisement over a given period. However, this

approach cannot guarantee transparency, which is troublesome

when the provider has a nancial incentive to falsify query results.

Our goal is to build a data service system with the following

properties. First, it supports a wide range of queries that compute

aggregates over the data of many independent users. Second, it

protects data privacy from adversarial users. Third, it protects data

integrity and transparency from an adversarial provider. Finally, it

has a reasonable performance overhead, and scales well when the

number of users increases.

One important primitive for realizing our goal is the Authen-

ticated Data Structure (ADS) [

], which allows users to detect

incorrect results returned by the provider. However, all ADS pro-

posals in the literature fall into two broad categories, which both

fall short in meeting our requirements. The rst category consists

of approaches that guarantee privacy, but support a limited set of

queries. This category includes transparency logs such as certicate

transparency [

] and its variants such as revocation transparency

[

], extended certicate transparency [

], Trillian [

], CONIKS

[

], and Merkle

[

]. Transparency logs store data in the form

of key-value tuples and allow for public auditing. However, trans-

parency logs only support data insertion, removal, and look-up

operations, and thus fail to meet our rst requirement. Also in-

cluded in the rst category are Proofs of Liabilities (PoLs) [

which allow users to prove that certain sums of user values are

non-negative without revealing the underlying values. The second

category consists of SQL-based authenticated databases such as

IntegriDB [

] and FalconDB [

]. These databases allow users

to verify the results of a wide range of SQL queries – i.e., sum,

count, average, min, and max. However, they are either designed

for a single user [

], thus failing our rst requirement, or make all

insert queries public [

], thus violating the second requirement.

Other approaches exist that only support transparent range queries

[12, 27, 29] and hence fail the rst requirement.

In this work, we present a Transparent and Privacy-Preserving

ADS named TAP, a data service system that meets our four design

arXiv:2210.11702v1 [cs.CR] 21 Oct 2022

Daniël Reijsbergen,†Aung Maw,†Zheng Yang,‡Tien Tuan Anh Dinh,†and Jianying Zhou†

goals. It addresses the limitations of existing systems with a novel

tree data structure, depicted in Figure 1, that combines the features

of existing approaches that are best suited to our context. The data

structure consists of a chronological prex tree (as in Merkle

and each leaf in the prex tree is the root of a Merkle sum tree

(as in PoLs) that is sorted (as in IntegriDB). TAP adopts the same

system model as CONIKS and Certicate Transparency, in which

individual users monitor the inclusion of their data, and there exist

some auditors that verify the relevant properties – e.g., ordered

and append-only – of the data structure. The prex trees enable

ecient monitoring and auditing, similar to the data structures of

CONIKS and Merkle

. Meanwhile, the sorted sum trees enable fast

verication for a wide range of operations, including sum, count,

average, min, max, and sample standard deviation queries. TAP also

supports the quantile query [

] that allows for the computation

of ne-grained statistics, e.g., the median or the 5th percentile on

sliding windows.

TAP is designed to store data from multiple users, thus meet-

ing the rst requirement. It protects data privacy – the second

requirement – by storing cryptographic commitments instead of

raw data, and by publishing zero-knowledge proofs. TAP’s Merkle

tree structure ensures data integrity and allows users to verify the

correctness of a broad range of queries – the third requirement

– by generating Merkle proofs and zero-knowledge range proofs

for the commitments’ underlying values. In our setting – i.e., a

single data table in which aggregates are computed over sliding

windows – TAP has better performance than previous systems,

because it maintains one single tree, as opposed to the many trees

in IntegriDB, and the tree is smaller than the tree in Merkle

. The

computation and bandwidth overheads of TAP are linear in the size

of the sliding windows, but logarithmic in the size of the entire tree.

To reason about the security of TAP, we require the following:

that each user adds one data entry per time slot, that the set of

users (but not their data) at each time slot is known to a super-

auditor (e.g., a regulator or watchdog), and that the fraction of

adversarial users is bounded. We present a detailed analysis of the

properties of TAP in this setting, and nd an explicit tradeo be-

tween transparency and privacy. We focus on guaranteeing perfect

transparency at the cost of revealing query results, which cannot

be trivially linked to user identities. In Appendix C, we discuss a

method that guarantees

(𝜖, 𝛿)

-dierential privacy [

]. Our nal

contribution is a full, publicly accessible implementation of TAP,

and we conduct a broad range of experiments to evaluate its perfor-

mance. We compare TAP against two relevant baselines – Merkle

and IntegriDB. The results show that the system has reasonable

overhead, and that it outperforms the baselines in many cases.

Contributions. We make the following contributions:

(1)

We present a survey of existing ADS approaches, and discuss

their limitations in today’s emerging applications.

(2)

We present TAP, a multi-user data service whose ADS com-

bines elements from CONIKS, Merkle

, IntegriDB, and PoLs

to protect data privacy and integrity, while providing trans-

parency to a wide range of operations.

(3)

We formally analyze TAP and prove that it only reveals the

results of queries.

(4)

We present a full implementation of TAP and evaluate its

performance. We compare it against two baselines, namely

IntegriDB and Merkle

, and show that TAP outperforms the

baselines in many cases.

Outline

. The remainder of this work is organized as follows.

Section 2 describes the system model, use case examples, threat

model, and our design goals. Section 3 discusses related systems

built on top of ADSs. Section 4 presents TAP. Section 5 provides

security and performance analysis of TAP, and discusses its current

limitations. Section 6 contains the detailed performance evaluation

and comparison against two state-of-the-art baselines. We discuss

the practical aspects of TAP in Section 7, and Section 8 concludes

the paper.

2 MODEL & REQUIREMENTS

In this section, we rst discuss the general system and data models.

Next, we present three use cases for transparent data services and

discuss how they t into our models. Finally, we present the threat

model and system requirements.

2.1 Model Entities

Our system model consists of the following types of entities:

Users

. Users send data to the server and issue queries on the

aggregate data through a client. Each user monitors the data struc-

ture by verifying that her data is properly stored by the server

and veries that query results are computed correctly. In practice,

monitoring can be automated, e.g., the app on the user’s mobile

phone that shows bills or payments can verify the displayed values

by querying the server’s ADS.

Server

. The server stores the data provided by the users in a

database, and maintains an ADS on top of the data. It computes

responses to user queries, and generates proofs for the responses

using the ADS.

Auditors

. Auditors validate the server’s ADS. In particular, they

verify that it is append-only, i.e., data is never modied or deleted,

and that certain data has been sorted correctly. We will discuss

these checks in more detail in later sections.

Bulletin board

. The server periodically publishes the digest of

its ADS to an immutable bulletin board, e.g., a public blockchain.

Users and auditors download the latest digests during monitoring,

auditing, and query verication. The only goal of the bulletin board

is to prevent equivocation – i.e., the server presenting dierent

versions of the ADS to dierent entities – and can therefore be

replaced by a gossip protocol that would serve the same purpose.

Figure 2a displays a naïve design which assumes that the server

is fully trusted, hence allowing an unscrupulous operator to return

incorrect query results. Figure 2b displays our system model, in

which the server is untrusted.

2.2 Data Model

We consider a simple relational data model. The schema consists of

the following attributes.

Time

. Time is modeled as a sequence of epochs, i.e., time slots

such as hours or days, and is represented as an integer.

Value

. This attribute contains the privacy-sensitive data. For

simplicity, we assume that they are non-negative integers.

TAP: Transparent and Privacy-Preserving Data Services

server

user 1 · · ·

user 2

queries

responses

queries

(a)

auditor

server

bulletin

board

user 1 · · ·

user 2

queries responses,

proofs

queries

digests digests

audits

digests

(b)

Figure 2: Left: system model with a trusted server. Right:

TAP’s system model with an untrusted server.

ID. This represents the user ID associated with the value.

Type

. These are string attributes that capture metadata about

users.

We denote the number of Type attributes by

𝑚

, and the total

number of attributes in our schema by

𝑚′=𝑚+3

. We assume

that there is one data entry per ID per epoch. This assumption

prevents a single adversarial user from arbitrarily skewing query

results, as we will discuss in Section 5. However, a single entity

may control multiple IDs, e.g., a single person owning multiple cars

for congestion pricing.

2.3 Use Case Examples

In the following, we present three illustrative use cases for a trans-

parent data service. In each case, the users receive bills or rewards

that are dependent on the recorded data, including (potentially) the

data of others. In all cases, the Time attribute is the smallest time

interval in which the billing or reward rate remains the same. In the

following, we discuss the key features of each use case, including

the ID, Value and Type attributes, the typical scale of the system,

and types of queries.

Smart Grids.

Our rst example is a smart grid that enables

peak/o-peak pricing, i.e., customers are billed at a higher rate for

power usage when system-level demand is high. In this setting,

each ID corresponds to a smart meter’s serial number, and the

Value to the customer’s power usage during the epoch. Possible

Type attributes include the customer’s geographic location, and

the customer’s type (industrial, commercial, or residential). The

service is maintained by the power retailer. To illustrate the size

of a typical smart grid, we consider SP Group, which is the largest

power retailer in Singapore with 1.6 million customers in 2022 [

For peak/o-peak pricing, we record power usage once per hour.

For the Type attributes, we consider the 28 postal code regions of

Singapore, and the 3 customer types mentioned above.

In this setting, the main design goal for a transparent data service

is to allow customers to verify their electricity bills. Each customer’s

bill for a given period

𝑇

can be expressed as the sum, over each

epoch

𝑡

𝑇

, of the electricity price in

𝑡

multiplied by the customer’s

power usage in

𝑡

. The rst advantage of our data service is that

it allows customers to verify their usage and bills. However, a

major advantage of TAP is that it also enables more advanced

pricing methods, such as making the price dependent on the system-

level power usage [30]. The data service allows the users to verify

claims about the system-level through sum queries. Finally, the data

service allows for the computation of aggregate statistics, e.g., 1)

the average and standard deviation of power consumption of all

users within the same region, 2) the maximum and minimum usage

in a region during a given period, and 3) the top 5% consumption

across all residential users.

Congestion Pricing.

Our second example is a system in which

vehicles are charged when they cross designated road sections

that are heavily congested during peak periods. To detect vehicles,

gantries with cameras are placed alongside the designated road

sections. Each ID corresponds to a pair of vehicle and gantry IDs,

and the Value to the number of times the vehicle crossed the gantry

during the epoch. Possible Type attributes include the gantry ID and

the type of vehicle, e.g., car, truck, or motorcycle. As an illustrative

example, we consider the Electronic Road Pricing (ERP) system in

Singapore. As of 2022, the ERP system consists of 77 gantries of

which around 20 are located at the entries and exits of the highly

congested Central Business District (CBD) [2].

As in smart grids, the data service allows users to verify their

bills even for advanced pricing schemes. For example, the price can

be made dependent on the total number of registered vehicles that

have crossed a gantry in an epoch, which would act as a proxy for

the real congestion in the system. An even more advanced pricing

scheme would make the price of entering the CBD dependent on

the sum of vehicle entries into the CBD minus the sum of vehicle

exits.

Digital Advertising.

Our nal example is a system that rewards

websites who display digital advertisements [

]. In particular, a web-

site owner receives a reward whenever an advertisement displayed

on the website is clicked.

In this setting, ID corresponds to a pair

of website and ad IDs, and the Value to the number of times the

advertisement was clicked on via the website (i.e., the click-through

rate). Possible Type attributes include the category and size of the

website, or characteristics of the advertisement. The advertisement

platform operates the server, while the website owners and adver-

tisers monitor their data entries. The biggest online advertisement

platform is Google Ads with millions of registered websites and

advertisers. However, there are also digital ad platforms such as

sixads [5] that serve around 100 000 websites.

The data service allows both the advertisers and the websites to

monitor the click-through data provided by the advertising platform.

This makes it easier to detect misbehavior, e.g., the platform under-

reporting (or overreporting) the number of clicks to the website

owner (or advertiser) for prot. Finally, it enables more advanced

reward schemes, e.g., in which the total reward paid by any single

advertiser is limited.

2.4 Threat Model

We consider two types of threats. The rst consists of honest-but-

curious users who follow the protocol but try to learn the privacy-

sensitive data of other users. The second is an adversarial server who

tries to falsify query results by tampering with the data and/or query

execution. This assumption captures unscrupulous server owners,

insider threats, and external attacks via software vulnerabilities. We

do not consider threats to privacy that stem from collusion between

the server and users, because the server is always free to send raw

This is a simplied version of online advertising, as advertisers and publishers are

often interested in more detail than just click-through rates.

Daniël Reijsbergen,†Aung Maw,†Zheng Yang,‡Tien Tuan Anh Dinh,†and Jianying Zhou†

data to adversarial users out-of-protocol. However, a limited set

of adversarial users may collude with the server to falsify query

results – this includes fake users created by the server. We discuss

how users disseminate information about server misbehavior in

Section 7.

We assume that all communication between the server and hon-

est users is done via secure channels. The auditors are only trusted

to validate the server’s structural properties of the server’s ADS, but

not individual data entries. In fact, the trust assumption for auditors

is the same as for users, i.e., any user with large computation and

network resources can also act as an auditor. In practice, we also

assume that one “super” auditor (e.g., a regulator) has the ability to

verify user identities, to prevent service providers from creating an

unlimited number of fake users. Finally, we assume that the public

bulletin board – or, alternatively, the gossip protocol for users – is

trusted and able to detect equivocation.

2.5 Requirements

Our goal is to design a system that meets the following require-

ments. Due to space constraints, we only present informal deni-

tions of the security requirements, and leave the formal denitions

to Section 5 and the appendix.

(R1) Rich operations for multiple users

. The system sup-

ports a wide range of operations (or queries) on the aggregate data

generated by multiple independent users.

(R2) Data privacy

. A user can only learn a limited number of

other users’ values by performing queries.

(R3a) Data integrity

. The server cannot change the data with-

out being detected.

(R3b) Transparency

. For each supported query, the server

cannot convince the user to accept incorrect results computed from

incorrect, incomplete, or articial data.

(R4) Eciency

. The computation, storage, and network costs

at the server and the user’s client are small. Query overheads grow

sublinearly with the number of users and epochs.

3 EXISTING SOLUTIONS

In this section, we discuss existing approaches that partially meet

our requirements. We divide them into three categories: trans-

parency logs, PoLs, and SQL-based authenticated databases. Visual

representations of various system models, and several ADS designs

whose features are incorporated in TAP, can be found in Figure 4.

We conclude the section by discussing which requirements are met

by these approaches.

3.1 Transparency Logs

Transparency logs are append-only data structures whose integrity

is protected by a Merkle tree. They provide ecient cryptographic

proofs that show, e.g., that an entry is included in the log, or that

one log is a prex of another log.

Certicate Transparency (CT)

. CT [

] addresses the prob-

lem of compromised certicate authorities by publishing the cer-

ticates on a transparency log. The system model of CT consists of

some certicate authorities who issue certicates and insert them

into the log, and users who search for specic certicates in the

log. CT relies on a monitor to ensure the consistency of the log.

The core data structure is a Merkle tree in which the certicates

are hashed and stored at the leaves, and the server signs the root of

the tree. CT has been extended to support ecient verication of

certicate revocations [

]. It has also been generalized into an

abstraction called a veriable log which is implemented as Google’s

Trillian [16].

CONIKS

. CONIKS [

] extends CT to support transparent

name-to-key bindings. It allows for ecient proofs of non-inclusion

so that users can easily check for unauthorized name-to-key bind-

ings in their namespaces. CONIKS’ system model, depicted in Fig-

ure 3a for a system with

𝑑

users, is similar to that of CT, but users

are more active in monitoring their key bindings. CONIKS uses

prex trees, depicted in Figure 4a, for ecient non-inclusion proofs.

It hides bindings by storing only their commitments at the leaves.

It also hides the total number of users by adding dummy nodes.

Merkle2

. Merkle

[

] extends CONIKS through a data struc-

ture that enables ecient auditing. The data structure combines a

chronological Merkle tree with prex trees. The leaves of the chrono-

logical tree only extend to the right (append-only). Each internal

node protects a set of leaves in the chronological tree, and stores

the root of a prex tree that has the same set of leaves.

3.2 Proofs of Liabilities (PoLs)

PoLs [

] are designed to prove solvency – i.e., assets being greater

than liabilities – in a setting where users are fully anonymous and

their individual assets and liabilities are privacy-sensitive. The main

data structure in PoLs is a Merkle sum tree, as depicted in Figure 4b,

that stores additively homomorphic commitments to the values of

assets and liabilities in the leaves. Intermediate nodes store the

sums of the commitments in their children. To show that the sum

of assets and liabilities in the leaves is non-negative, PoLs use zero-

knowledge range proofs. PoLs were generalized in [

] to use cases

beyond proving solvency.

3.3 SQL-Based Authenticated Databases

IntegriDB

. The system model of IntegriDB [

] is designed for out-

sourced databases. In particular, the user uploads data and metadata

to an untrusted server, depicted in Figure 3b. The server executes

user queries and generates proofs based on the metadata to show

that the results are correct. IntegriDB supports data insertions, join

queries on multiple tables, multidimensional range queries, and

sum, count, average, min, and max queries. IntegriDB creates a

sorted tree for each column pair, resulting in

2(𝑚2−𝑚)

trees for a

table with

𝑚

columns, as depicted in Figure 4c. In each internal node

of the tree, IntegriDB stores a polynomial over the values in the

leaves of the internal node’s subtree. The polynomials enable proofs

that sum or range queries have been performed over the correct set

of leaves. Meanwhile, the sorted nature of the trees allows users

to verify min and max queries. Another SQL-based ADS, vSQL

[

], supports generic SQL queries and has similar performance as

IntegriDB.

FalconDB

. FalconDB [

] combines IntegriDB with blockchains.

In FalconDB’s system model, depicted in Figure 3c, a smart contract

maintains the ADS and ensures that it is globally consistent. Queries

are performed directly by the servers, which run IntegriDB, without

going through consensus (except for insertions and removals). Users

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

TAP:TransparentandPrivacy-PreservingDataServicesDaniëlReijsbergen,†AungMaw,†ZhengYang,‡TienTuanAnhDinh,†andJianyingZhou††SingaporeUniversityofTechnologyandDesign,Singapore,Singapore‡SouthwestUniversity,Chongqing,ChinaABSTRACTUserstodayexpectmoresecurityfromservicesthathandletheirdata.Inadditiontotra...

展开>> 收起<<

TAP Transparent and Privacy-Preserving Data Services.pdf

共17页,预览4页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

TAP Transparent and Privacy-Preserving Data Services

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: