This concept has been extended to handle arbitrary join graph
structures [
68
]; identify good message passing orders [
77
]; and
support complex aggregations over semi-ring structures such as
factorized learning [
21
,
78
], where the aggregation function trains
a ML model. These properties make factorized execution promising
for wide-table analytics.
However, common wide-table analytics starts with an initial
(pivot) query, and then applies deltas to the query structure (delta
queries) —these may modify selection or grouping clauses, update
or remove tables, or join new tables. Further, these use cases de-
mand interactive response times [
29
,
81
]. For instance, users may
incrementally slice, dice and drill down along dimensions [
34
]; ap-
ply predicate-based deletion interventions on the input tables to
understand their eects [
73
,
88
]; or join with new tables as part of
ML augmentation [
17
] or data enrichment [
24
]. Although factorized
execution reduces individual query latencies, it does not leverage
materialization during query processing to exploit work-sharing
across them (either from the pivot query, or between pre-computed
data structures).
Our experiments show that Wide-table Delta
Analytics can be >105×slower than need be.
Our core question is:
what intermediates should be materi-
alized for Wide-table Delta Analytics?
First, which messages
can be shared between the pivot query and delta queries? Delta
queries only dier from the pivot query in a subset of selection,
projection, join or aggregation clauses, and most messages from the
pivot query could be re-used. To this end, we introduce an ecient
method to check the equivalence of intermediate messages.
Second, given the pivot query, which of its messages are sucient
to support re-use for Wide-table Delta Analytics? The key challenge
is that simply caching the messages emitted when executing the
pivot query is insucient because message re-usability is sensitive
to the message-passing order. This would force future delta queries
to either use the same ordering or forgo message re-use. To address
this, we observe that query execution passes messages across all
edges in the join graph along a single direction, and show that
sending and materializing messages in reverse for the pivot query
are sucient to support arbitrary orderings in future delta queries.
We relate this to calibration in probabilistic graphical models [
79
],
which similarly shares computation between interactive queries
over posterior distributions.
Third, we bring these ideas together in the
Calibrated Junction
HyperTree
(
CJT
). Given a pivot query, the novel data structure
manages message materialization and re-use. Building the data
structure only takes up to twice the time as executing the pivot
query, but supports re-use for arbitrary message passing orders.
Finally, we apply
CJT
to three important classes of wide-table
applications.
CJT
accelerates
Data Cube
construction by re-using
messages from low-dimensional cuboids to answer higher dimen-
sional OLAP queries, and avoiding the exponential cost of con-
structing high dimensional cuboids directly.
CJT
enables interactive
Data Augmentation for ML
by reducing the time to add a new
relation (features) to the join graph and update an ML model by
> 10
2×
compared to prior approaches.
CJT
can directly use Factor-
ized IVM [
7
,
67
] to accelerate
Data Explanation and Streaming
applications, and we further reduce IVM maintenance overheads
by >92×by lazily maintaining messages.
To summarize, our contributions are as follows:
•
Conceptually, we expand the connection between factorized
queries and PGM by drawing on the idea of calibration.
•
Practically, we design the novel
CJT
data structure, which uses
calibration to enable work-sharing for Wide-table Delta Analytics.
The cost of materializing the data structure is within a constant
factor of the pivot query execution, but accelerates future queries
by multiple orders of magnitude.
•
We apply
CJT
to data cube, data augmentation for ML, streaming
and explanation applications, and describe additional application-
specic optimizations.
•
To illustrate the algorithmic benets of our ideas, we implement
and evaluate three versions of
CJT
: a custom single threaded
query engine and middleware compilers to SQL and Pandas data
frame operations. Our custom engine out-performs the state-
of-the-art LMFAO factorized engine by
∼
30
×
on OLAP queries;
compared to factorized execution algorithms, our SQL experi-
ments on AWS Redshift reduce execution by up to 10
3×
on TPC-H
queries, while Pandas experiments accelerate data augmentation
for ML by >100×.
2 BACKGROUND
This section provides a brief overview of annotated relations, early
marginalization and variable elimination to accelerate join-aggregation
queries, and the junction hypertree join representation. Our goal in
this paper is to keep the content accessible. To this end, we avoid
technical concepts (e.g., hypergraphs) that are needed for deriv-
ing bounds but not needed for developing intuition, and limit the
discussion to COUNT queries. However, our work generalizes to
any commutative semi-ring aggregation query [
35
], and the full
technical details can be found in the technical report [39].
Data Model.
Let uppercase symbol
A
be an attribute,
dom
(
A
) is its
domain, and lowercase symbol
a∈dom
(
A
) be a valid attribute value.
By default, we assume categorical attributes. Numerical attributes
are usually part of the semi-ring annotation discussed below. How-
ever, we can easily support numerical attributes by introducing a
domain with innite size. Given relation R, its schema
SR
is a set of
attributes, and its domain
dom
(
R
) =
×A∈Sdom
(
A
) is the Cartesian
product of its attribute domains. An attribute is incident of R if
A∈SR. Given tuple t, let t[A] be its value of attribute A.
Annotated Relations.
Since relational algebra (rst-order logic)
does not support aggregation, it has been extended with the use of
commutative structures to support aggregation. The main idea is
that tuples are annotated with values from a semi-ring, and when
relational operators (e.g., join, project, group-by) concatenate or
combine tuples, they also multiply or add the tuple annotations,
such that the nal annotations correspond to the desired aggrega-
tion results.
A commutative semi-ring (
D
, +,
×
, 0, 1) denes a set
D
, binary
operators + and
×
closed over
D
where both are commutative,
and the zero 0 and unit 1 elements. For simplicity, the text will
be based on COUNT queries and the natural numbers semi-ring
(
N
, +,
×
, 0, 1), which operates as in grade school math. However,
our work extends to arbitrary commutative semi-ring structures
that support aggregation queries containing common statistical
2