TAP Transparent and Privacy-Preserving Data Services

2025-05-02 0 0 998.55KB 17 页 10玖币
侵权投诉
TAP: Transparent and Privacy-Preserving Data Services
Daniël Reijsbergen,Aung Maw,Zheng Yang,Tien Tuan Anh Dinh,and Jianying Zhou
Singapore University of Technology and Design, Singapore, Singapore
Southwest University, Chongqing, China
ABSTRACT
Users today expect more security from services that handle their
data. In addition to traditional data privacy and integrity require-
ments, they expect transparency, i.e., that the service’s processing
of the data is veriable by users and trusted auditors. Our goal is
to build a multi-user system that provides data privacy, integrity,
and transparency for a large number of operations, while achieving
practical performance.
To this end, we rst identify the limitations of existing approaches
that use authenticated data structures. We nd that they fall into
two categories: 1) those that hide each user’s data from other users,
but have a limited range of veriable operations (e.g., CONIKS,
Merkle
2
, and Proofs of Liabilities), and 2) those that support a wide
range of veriable operations, but make all data publicly visible
(e.g., IntegriDB and FalconDB). We then present TAP to address
the above limitations. The key component of TAP is a novel tree
data structure that supports ecient result verication, and relies
on independent audits that use zero-knowledge range proofs to
show that the tree is constructed correctly without revealing user
data. TAP supports a broad range of veriable operations, including
quantiles and sample standard deviations. We conduct a comprehen-
sive evaluation of TAP, and compare it against two state-of-the-art
baselines, namely IntegriDB and Merkle
2
, showing that the system
is practical at scale.
1 INTRODUCTION
Many of today’s applications collect large amounts of user data
and make decisions that have a direct impact on the user. One
example is a utility company that collects power usage data from
users and charges them dierent rates based on peak/o-peak pe-
riods per region. Another example is a road pricing system that
determines real-time trac conditions based on the number of cars
on the road, and charges motorists appropriate congestion fees.
The third example is an advertising service that monitors clicks
and pays the publisher for displaying the advertisement based on
the click-through rate. One desirable property of the applications
above is transparency [
12
,
27
], which allows users to verify that
the computation done on their data has been executed correctly.
This property is stronger than simply ensuring data integrity, as it
protects users from malicious or compromised service providers
who ignore data or tamper with the computation.
One approach toward transparent data services is for the provider
to make the raw data available to users. For example, the provider
can use a public bulletin board to store the data. Users can then
perform computations directly on the data, but they have no data
privacy. This is unacceptable in applications such as dynamic road
pricing, as it would allow any user to track the movements of other
users using the raw data. Alternatively, the service provider could
< < < < < < < < < <
0 1
00 01 10 11
chronological
prefix tree
sorted sum tree
for prefix 00 sorted sum tree
for prefix 11
Figure 1: TAP’s authenticated data structure.
store the data and return only query results. For example, an ad-
vertising service may have an API that returns the total number
of clicks on an advertisement over a given period. However, this
approach cannot guarantee transparency, which is troublesome
when the provider has a nancial incentive to falsify query results.
Our goal is to build a data service system with the following
properties. First, it supports a wide range of queries that compute
aggregates over the data of many independent users. Second, it
protects data privacy from adversarial users. Third, it protects data
integrity and transparency from an adversarial provider. Finally, it
has a reasonable performance overhead, and scales well when the
number of users increases.
One important primitive for realizing our goal is the Authen-
ticated Data Structure (ADS) [
33
], which allows users to detect
incorrect results returned by the provider. However, all ADS pro-
posals in the literature fall into two broad categories, which both
fall short in meeting our requirements. The rst category consists
of approaches that guarantee privacy, but support a limited set of
queries. This category includes transparency logs such as certicate
transparency [
20
] and its variants such as revocation transparency
[
21
], extended certicate transparency [
31
], Trillian [
16
], CONIKS
[
25
], and Merkle
2
[
18
]. Transparency logs store data in the form
of key-value tuples and allow for public auditing. However, trans-
parency logs only support data insertion, removal, and look-up
operations, and thus fail to meet our rst requirement. Also in-
cluded in the rst category are Proofs of Liabilities (PoLs) [
13
,
19
],
which allow users to prove that certain sums of user values are
non-negative without revealing the underlying values. The second
category consists of SQL-based authenticated databases such as
IntegriDB [
35
] and FalconDB [
28
]. These databases allow users
to verify the results of a wide range of SQL queries – i.e., sum,
count, average, min, and max. However, they are either designed
for a single user [
35
], thus failing our rst requirement, or make all
insert queries public [
28
], thus violating the second requirement.
Other approaches exist that only support transparent range queries
[12, 27, 29] and hence fail the rst requirement.
In this work, we present a Transparent and Privacy-Preserving
ADS named TAP, a data service system that meets our four design
arXiv:2210.11702v1 [cs.CR] 21 Oct 2022
Daniël Reijsbergen,Aung Maw,Zheng Yang,Tien Tuan Anh Dinh,and Jianying Zhou
goals. It addresses the limitations of existing systems with a novel
tree data structure, depicted in Figure 1, that combines the features
of existing approaches that are best suited to our context. The data
structure consists of a chronological prex tree (as in Merkle
2
),
and each leaf in the prex tree is the root of a Merkle sum tree
(as in PoLs) that is sorted (as in IntegriDB). TAP adopts the same
system model as CONIKS and Certicate Transparency, in which
individual users monitor the inclusion of their data, and there exist
some auditors that verify the relevant properties – e.g., ordered
and append-only – of the data structure. The prex trees enable
ecient monitoring and auditing, similar to the data structures of
CONIKS and Merkle
2
. Meanwhile, the sorted sum trees enable fast
verication for a wide range of operations, including sum, count,
average, min, max, and sample standard deviation queries. TAP also
supports the quantile query [
22
] that allows for the computation
of ne-grained statistics, e.g., the median or the 5th percentile on
sliding windows.
TAP is designed to store data from multiple users, thus meet-
ing the rst requirement. It protects data privacy – the second
requirement – by storing cryptographic commitments instead of
raw data, and by publishing zero-knowledge proofs. TAP’s Merkle
tree structure ensures data integrity and allows users to verify the
correctness of a broad range of queries – the third requirement
– by generating Merkle proofs and zero-knowledge range proofs
for the commitments’ underlying values. In our setting – i.e., a
single data table in which aggregates are computed over sliding
windows – TAP has better performance than previous systems,
because it maintains one single tree, as opposed to the many trees
in IntegriDB, and the tree is smaller than the tree in Merkle
2
. The
computation and bandwidth overheads of TAP are linear in the size
of the sliding windows, but logarithmic in the size of the entire tree.
To reason about the security of TAP, we require the following:
that each user adds one data entry per time slot, that the set of
users (but not their data) at each time slot is known to a super-
auditor (e.g., a regulator or watchdog), and that the fraction of
adversarial users is bounded. We present a detailed analysis of the
properties of TAP in this setting, and nd an explicit tradeo be-
tween transparency and privacy. We focus on guaranteeing perfect
transparency at the cost of revealing query results, which cannot
be trivially linked to user identities. In Appendix C, we discuss a
method that guarantees
(𝜖, 𝛿)
-dierential privacy [
14
]. Our nal
contribution is a full, publicly accessible implementation of TAP,
and we conduct a broad range of experiments to evaluate its perfor-
mance. We compare TAP against two relevant baselines – Merkle
2
and IntegriDB. The results show that the system has reasonable
overhead, and that it outperforms the baselines in many cases.
Contributions. We make the following contributions:
(1)
We present a survey of existing ADS approaches, and discuss
their limitations in today’s emerging applications.
(2)
We present TAP, a multi-user data service whose ADS com-
bines elements from CONIKS, Merkle
2
, IntegriDB, and PoLs
to protect data privacy and integrity, while providing trans-
parency to a wide range of operations.
(3)
We formally analyze TAP and prove that it only reveals the
results of queries.
(4)
We present a full implementation of TAP and evaluate its
performance. We compare it against two baselines, namely
IntegriDB and Merkle
2
, and show that TAP outperforms the
baselines in many cases.
Outline
. The remainder of this work is organized as follows.
Section 2 describes the system model, use case examples, threat
model, and our design goals. Section 3 discusses related systems
built on top of ADSs. Section 4 presents TAP. Section 5 provides
security and performance analysis of TAP, and discusses its current
limitations. Section 6 contains the detailed performance evaluation
and comparison against two state-of-the-art baselines. We discuss
the practical aspects of TAP in Section 7, and Section 8 concludes
the paper.
2 MODEL & REQUIREMENTS
In this section, we rst discuss the general system and data models.
Next, we present three use cases for transparent data services and
discuss how they t into our models. Finally, we present the threat
model and system requirements.
2.1 Model Entities
Our system model consists of the following types of entities:
Users
. Users send data to the server and issue queries on the
aggregate data through a client. Each user monitors the data struc-
ture by verifying that her data is properly stored by the server
and veries that query results are computed correctly. In practice,
monitoring can be automated, e.g., the app on the user’s mobile
phone that shows bills or payments can verify the displayed values
by querying the server’s ADS.
Server
. The server stores the data provided by the users in a
database, and maintains an ADS on top of the data. It computes
responses to user queries, and generates proofs for the responses
using the ADS.
Auditors
. Auditors validate the server’s ADS. In particular, they
verify that it is append-only, i.e., data is never modied or deleted,
and that certain data has been sorted correctly. We will discuss
these checks in more detail in later sections.
Bulletin board
. The server periodically publishes the digest of
its ADS to an immutable bulletin board, e.g., a public blockchain.
Users and auditors download the latest digests during monitoring,
auditing, and query verication. The only goal of the bulletin board
is to prevent equivocation – i.e., the server presenting dierent
versions of the ADS to dierent entities – and can therefore be
replaced by a gossip protocol that would serve the same purpose.
Figure 2a displays a naïve design which assumes that the server
is fully trusted, hence allowing an unscrupulous operator to return
incorrect query results. Figure 2b displays our system model, in
which the server is untrusted.
2.2 Data Model
We consider a simple relational data model. The schema consists of
the following attributes.
Time
. Time is modeled as a sequence of epochs, i.e., time slots
such as hours or days, and is represented as an integer.
Value
. This attribute contains the privacy-sensitive data. For
simplicity, we assume that they are non-negative integers.
TAP: Transparent and Privacy-Preserving Data Services
server
user 1 · · ·
user 2
queries
responses
queries
(a)
auditor
server
bulletin
board
user 1 · · ·
user 2
queries responses,
proofs
queries
digests digests
audits
digests
(b)
Figure 2: Left: system model with a trusted server. Right:
TAP’s system model with an untrusted server.
ID. This represents the user ID associated with the value.
Type
. These are string attributes that capture metadata about
users.
We denote the number of Type attributes by
𝑚
, and the total
number of attributes in our schema by
𝑚=𝑚+3
. We assume
that there is one data entry per ID per epoch. This assumption
prevents a single adversarial user from arbitrarily skewing query
results, as we will discuss in Section 5. However, a single entity
may control multiple IDs, e.g., a single person owning multiple cars
for congestion pricing.
2.3 Use Case Examples
In the following, we present three illustrative use cases for a trans-
parent data service. In each case, the users receive bills or rewards
that are dependent on the recorded data, including (potentially) the
data of others. In all cases, the Time attribute is the smallest time
interval in which the billing or reward rate remains the same. In the
following, we discuss the key features of each use case, including
the ID, Value and Type attributes, the typical scale of the system,
and types of queries.
Smart Grids.
Our rst example is a smart grid that enables
peak/o-peak pricing, i.e., customers are billed at a higher rate for
power usage when system-level demand is high. In this setting,
each ID corresponds to a smart meter’s serial number, and the
Value to the customer’s power usage during the epoch. Possible
Type attributes include the customer’s geographic location, and
the customer’s type (industrial, commercial, or residential). The
service is maintained by the power retailer. To illustrate the size
of a typical smart grid, we consider SP Group, which is the largest
power retailer in Singapore with 1.6 million customers in 2022 [
6
].
For peak/o-peak pricing, we record power usage once per hour.
For the Type attributes, we consider the 28 postal code regions of
Singapore, and the 3 customer types mentioned above.
In this setting, the main design goal for a transparent data service
is to allow customers to verify their electricity bills. Each customer’s
bill for a given period
𝑇
can be expressed as the sum, over each
epoch
𝑡
in
𝑇
, of the electricity price in
𝑡
multiplied by the customer’s
power usage in
𝑡
. The rst advantage of our data service is that
it allows customers to verify their usage and bills. However, a
major advantage of TAP is that it also enables more advanced
pricing methods, such as making the price dependent on the system-
level power usage [30]. The data service allows the users to verify
claims about the system-level through sum queries. Finally, the data
service allows for the computation of aggregate statistics, e.g., 1)
the average and standard deviation of power consumption of all
users within the same region, 2) the maximum and minimum usage
in a region during a given period, and 3) the top 5% consumption
across all residential users.
Congestion Pricing.
Our second example is a system in which
vehicles are charged when they cross designated road sections
that are heavily congested during peak periods. To detect vehicles,
gantries with cameras are placed alongside the designated road
sections. Each ID corresponds to a pair of vehicle and gantry IDs,
and the Value to the number of times the vehicle crossed the gantry
during the epoch. Possible Type attributes include the gantry ID and
the type of vehicle, e.g., car, truck, or motorcycle. As an illustrative
example, we consider the Electronic Road Pricing (ERP) system in
Singapore. As of 2022, the ERP system consists of 77 gantries of
which around 20 are located at the entries and exits of the highly
congested Central Business District (CBD) [2].
As in smart grids, the data service allows users to verify their
bills even for advanced pricing schemes. For example, the price can
be made dependent on the total number of registered vehicles that
have crossed a gantry in an epoch, which would act as a proxy for
the real congestion in the system. An even more advanced pricing
scheme would make the price of entering the CBD dependent on
the sum of vehicle entries into the CBD minus the sum of vehicle
exits.
Digital Advertising.
Our nal example is a system that rewards
websites who display digital advertisements [
8
]. In particular, a web-
site owner receives a reward whenever an advertisement displayed
on the website is clicked.
1
In this setting, ID corresponds to a pair
of website and ad IDs, and the Value to the number of times the
advertisement was clicked on via the website (i.e., the click-through
rate). Possible Type attributes include the category and size of the
website, or characteristics of the advertisement. The advertisement
platform operates the server, while the website owners and adver-
tisers monitor their data entries. The biggest online advertisement
platform is Google Ads with millions of registered websites and
advertisers. However, there are also digital ad platforms such as
sixads [5] that serve around 100 000 websites.
The data service allows both the advertisers and the websites to
monitor the click-through data provided by the advertising platform.
This makes it easier to detect misbehavior, e.g., the platform under-
reporting (or overreporting) the number of clicks to the website
owner (or advertiser) for prot. Finally, it enables more advanced
reward schemes, e.g., in which the total reward paid by any single
advertiser is limited.
2.4 Threat Model
We consider two types of threats. The rst consists of honest-but-
curious users who follow the protocol but try to learn the privacy-
sensitive data of other users. The second is an adversarial server who
tries to falsify query results by tampering with the data and/or query
execution. This assumption captures unscrupulous server owners,
insider threats, and external attacks via software vulnerabilities. We
do not consider threats to privacy that stem from collusion between
the server and users, because the server is always free to send raw
1
This is a simplied version of online advertising, as advertisers and publishers are
often interested in more detail than just click-through rates.
Daniël Reijsbergen,Aung Maw,Zheng Yang,Tien Tuan Anh Dinh,and Jianying Zhou
data to adversarial users out-of-protocol. However, a limited set
of adversarial users may collude with the server to falsify query
results – this includes fake users created by the server. We discuss
how users disseminate information about server misbehavior in
Section 7.
We assume that all communication between the server and hon-
est users is done via secure channels. The auditors are only trusted
to validate the server’s structural properties of the server’s ADS, but
not individual data entries. In fact, the trust assumption for auditors
is the same as for users, i.e., any user with large computation and
network resources can also act as an auditor. In practice, we also
assume that one “super” auditor (e.g., a regulator) has the ability to
verify user identities, to prevent service providers from creating an
unlimited number of fake users. Finally, we assume that the public
bulletin board – or, alternatively, the gossip protocol for users – is
trusted and able to detect equivocation.
2.5 Requirements
Our goal is to design a system that meets the following require-
ments. Due to space constraints, we only present informal deni-
tions of the security requirements, and leave the formal denitions
to Section 5 and the appendix.
(R1) Rich operations for multiple users
. The system sup-
ports a wide range of operations (or queries) on the aggregate data
generated by multiple independent users.
(R2) Data privacy
. A user can only learn a limited number of
other users’ values by performing queries.
(R3a) Data integrity
. The server cannot change the data with-
out being detected.
(R3b) Transparency
. For each supported query, the server
cannot convince the user to accept incorrect results computed from
incorrect, incomplete, or articial data.
(R4) Eciency
. The computation, storage, and network costs
at the server and the user’s client are small. Query overheads grow
sublinearly with the number of users and epochs.
3 EXISTING SOLUTIONS
In this section, we discuss existing approaches that partially meet
our requirements. We divide them into three categories: trans-
parency logs, PoLs, and SQL-based authenticated databases. Visual
representations of various system models, and several ADS designs
whose features are incorporated in TAP, can be found in Figure 4.
We conclude the section by discussing which requirements are met
by these approaches.
3.1 Transparency Logs
Transparency logs are append-only data structures whose integrity
is protected by a Merkle tree. They provide ecient cryptographic
proofs that show, e.g., that an entry is included in the log, or that
one log is a prex of another log.
Certicate Transparency (CT)
. CT [
20
] addresses the prob-
lem of compromised certicate authorities by publishing the cer-
ticates on a transparency log. The system model of CT consists of
some certicate authorities who issue certicates and insert them
into the log, and users who search for specic certicates in the
log. CT relies on a monitor to ensure the consistency of the log.
The core data structure is a Merkle tree in which the certicates
are hashed and stored at the leaves, and the server signs the root of
the tree. CT has been extended to support ecient verication of
certicate revocations [
21
,
31
]. It has also been generalized into an
abstraction called a veriable log which is implemented as Google’s
Trillian [16].
CONIKS
. CONIKS [
25
] extends CT to support transparent
name-to-key bindings. It allows for ecient proofs of non-inclusion
so that users can easily check for unauthorized name-to-key bind-
ings in their namespaces. CONIKS’ system model, depicted in Fig-
ure 3a for a system with
𝑑
users, is similar to that of CT, but users
are more active in monitoring their key bindings. CONIKS uses
prex trees, depicted in Figure 4a, for ecient non-inclusion proofs.
It hides bindings by storing only their commitments at the leaves.
It also hides the total number of users by adding dummy nodes.
Merkle2
. Merkle
2
[
18
] extends CONIKS through a data struc-
ture that enables ecient auditing. The data structure combines a
chronological Merkle tree with prex trees. The leaves of the chrono-
logical tree only extend to the right (append-only). Each internal
node protects a set of leaves in the chronological tree, and stores
the root of a prex tree that has the same set of leaves.
3.2 Proofs of Liabilities (PoLs)
PoLs [
13
] are designed to prove solvency – i.e., assets being greater
than liabilities – in a setting where users are fully anonymous and
their individual assets and liabilities are privacy-sensitive. The main
data structure in PoLs is a Merkle sum tree, as depicted in Figure 4b,
that stores additively homomorphic commitments to the values of
assets and liabilities in the leaves. Intermediate nodes store the
sums of the commitments in their children. To show that the sum
of assets and liabilities in the leaves is non-negative, PoLs use zero-
knowledge range proofs. PoLs were generalized in [
19
] to use cases
beyond proving solvency.
3.3 SQL-Based Authenticated Databases
IntegriDB
. The system model of IntegriDB [
35
] is designed for out-
sourced databases. In particular, the user uploads data and metadata
to an untrusted server, depicted in Figure 3b. The server executes
user queries and generates proofs based on the metadata to show
that the results are correct. IntegriDB supports data insertions, join
queries on multiple tables, multidimensional range queries, and
sum, count, average, min, and max queries. IntegriDB creates a
sorted tree for each column pair, resulting in
1
2(𝑚2𝑚)
trees for a
table with
𝑚
columns, as depicted in Figure 4c. In each internal node
of the tree, IntegriDB stores a polynomial over the values in the
leaves of the internal node’s subtree. The polynomials enable proofs
that sum or range queries have been performed over the correct set
of leaves. Meanwhile, the sorted nature of the trees allows users
to verify min and max queries. Another SQL-based ADS, vSQL
[
34
], supports generic SQL queries and has similar performance as
IntegriDB.
FalconDB
. FalconDB [
28
] combines IntegriDB with blockchains.
In FalconDB’s system model, depicted in Figure 3c, a smart contract
maintains the ADS and ensures that it is globally consistent. Queries
are performed directly by the servers, which run IntegriDB, without
going through consensus (except for insertions and removals). Users
摘要:

TAP:TransparentandPrivacy-PreservingDataServicesDaniëlReijsbergen,†AungMaw,†ZhengYang,‡TienTuanAnhDinh,†andJianyingZhou††SingaporeUniversityofTechnologyandDesign,Singapore,Singapore‡SouthwestUniversity,Chongqing,ChinaABSTRACTUserstodayexpectmoresecurityfromservicesthathandletheirdata.Inadditiontotra...

展开>> 收起<<
TAP Transparent and Privacy-Preserving Data Services.pdf

共17页,预览4页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:17 页 大小:998.55KB 格式:PDF 时间:2025-05-02

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 17
客服
关注