Daniël Reijsbergen,†Aung Maw,†Zheng Yang,‡Tien Tuan Anh Dinh,†and Jianying Zhou†
goals. It addresses the limitations of existing systems with a novel
tree data structure, depicted in Figure 1, that combines the features
of existing approaches that are best suited to our context. The data
structure consists of a chronological prex tree (as in Merkle
2
),
and each leaf in the prex tree is the root of a Merkle sum tree
(as in PoLs) that is sorted (as in IntegriDB). TAP adopts the same
system model as CONIKS and Certicate Transparency, in which
individual users monitor the inclusion of their data, and there exist
some auditors that verify the relevant properties – e.g., ordered
and append-only – of the data structure. The prex trees enable
ecient monitoring and auditing, similar to the data structures of
CONIKS and Merkle
2
. Meanwhile, the sorted sum trees enable fast
verication for a wide range of operations, including sum, count,
average, min, max, and sample standard deviation queries. TAP also
supports the quantile query [
22
] that allows for the computation
of ne-grained statistics, e.g., the median or the 5th percentile on
sliding windows.
TAP is designed to store data from multiple users, thus meet-
ing the rst requirement. It protects data privacy – the second
requirement – by storing cryptographic commitments instead of
raw data, and by publishing zero-knowledge proofs. TAP’s Merkle
tree structure ensures data integrity and allows users to verify the
correctness of a broad range of queries – the third requirement
– by generating Merkle proofs and zero-knowledge range proofs
for the commitments’ underlying values. In our setting – i.e., a
single data table in which aggregates are computed over sliding
windows – TAP has better performance than previous systems,
because it maintains one single tree, as opposed to the many trees
in IntegriDB, and the tree is smaller than the tree in Merkle
2
. The
computation and bandwidth overheads of TAP are linear in the size
of the sliding windows, but logarithmic in the size of the entire tree.
To reason about the security of TAP, we require the following:
that each user adds one data entry per time slot, that the set of
users (but not their data) at each time slot is known to a super-
auditor (e.g., a regulator or watchdog), and that the fraction of
adversarial users is bounded. We present a detailed analysis of the
properties of TAP in this setting, and nd an explicit tradeo be-
tween transparency and privacy. We focus on guaranteeing perfect
transparency at the cost of revealing query results, which cannot
be trivially linked to user identities. In Appendix C, we discuss a
method that guarantees
(𝜖, 𝛿)
-dierential privacy [
14
]. Our nal
contribution is a full, publicly accessible implementation of TAP,
and we conduct a broad range of experiments to evaluate its perfor-
mance. We compare TAP against two relevant baselines – Merkle
2
and IntegriDB. The results show that the system has reasonable
overhead, and that it outperforms the baselines in many cases.
Contributions. We make the following contributions:
(1)
We present a survey of existing ADS approaches, and discuss
their limitations in today’s emerging applications.
(2)
We present TAP, a multi-user data service whose ADS com-
bines elements from CONIKS, Merkle
2
, IntegriDB, and PoLs
to protect data privacy and integrity, while providing trans-
parency to a wide range of operations.
(3)
We formally analyze TAP and prove that it only reveals the
results of queries.
(4)
We present a full implementation of TAP and evaluate its
performance. We compare it against two baselines, namely
IntegriDB and Merkle
2
, and show that TAP outperforms the
baselines in many cases.
Outline
. The remainder of this work is organized as follows.
Section 2 describes the system model, use case examples, threat
model, and our design goals. Section 3 discusses related systems
built on top of ADSs. Section 4 presents TAP. Section 5 provides
security and performance analysis of TAP, and discusses its current
limitations. Section 6 contains the detailed performance evaluation
and comparison against two state-of-the-art baselines. We discuss
the practical aspects of TAP in Section 7, and Section 8 concludes
the paper.
2 MODEL & REQUIREMENTS
In this section, we rst discuss the general system and data models.
Next, we present three use cases for transparent data services and
discuss how they t into our models. Finally, we present the threat
model and system requirements.
2.1 Model Entities
Our system model consists of the following types of entities:
Users
. Users send data to the server and issue queries on the
aggregate data through a client. Each user monitors the data struc-
ture by verifying that her data is properly stored by the server
and veries that query results are computed correctly. In practice,
monitoring can be automated, e.g., the app on the user’s mobile
phone that shows bills or payments can verify the displayed values
by querying the server’s ADS.
Server
. The server stores the data provided by the users in a
database, and maintains an ADS on top of the data. It computes
responses to user queries, and generates proofs for the responses
using the ADS.
Auditors
. Auditors validate the server’s ADS. In particular, they
verify that it is append-only, i.e., data is never modied or deleted,
and that certain data has been sorted correctly. We will discuss
these checks in more detail in later sections.
Bulletin board
. The server periodically publishes the digest of
its ADS to an immutable bulletin board, e.g., a public blockchain.
Users and auditors download the latest digests during monitoring,
auditing, and query verication. The only goal of the bulletin board
is to prevent equivocation – i.e., the server presenting dierent
versions of the ADS to dierent entities – and can therefore be
replaced by a gossip protocol that would serve the same purpose.
Figure 2a displays a naïve design which assumes that the server
is fully trusted, hence allowing an unscrupulous operator to return
incorrect query results. Figure 2b displays our system model, in
which the server is untrusted.
2.2 Data Model
We consider a simple relational data model. The schema consists of
the following attributes.
Time
. Time is modeled as a sequence of epochs, i.e., time slots
such as hours or days, and is represented as an integer.
Value
. This attribute contains the privacy-sensitive data. For
simplicity, we assume that they are non-negative integers.