procedure [
76
]. To provide such analysis capabilities, a provenance
extractor should be able to aggregate provenance information.
Disconnect with consumer applications (L4):
Finally, customers
also highlighted that existing extractors fail to provide mechanisms
to tailor the resulting provenance information to the needs of the
consumer application. For instance, they treat each query as equally
important. However, queries issued by system administrators or
backup processes are unlikely to be of interest to business users.
For such applications, the resulting provenance graphs are often
considered noisy and redundant. Similarly, existing extractors lack
a rich query runtime metadata model to provide necessary context
for upstream applications [
53
,
61
] (e.g., who executed a query, from
which application, or what was the CPU and IO costs).
To this end, we introduce ONEPROVENANCE, a novel dynamic
provenance extraction system that addresses the limitations of ex-
isting extractors. In particular, our system uses a novel extraction
design that collects dynamic provenance from low-volume query
logs, without relying on plans (
L1
). As such, ONEPROVENANCE
avoids excessive overheads on database execution and provenance
extraction to the extent that ONEPROVENANCE extracts provenance
from even sizeable transactional workloads—extending coverage
beyond the traditional focus on analytical and ETL workloads.
ONEPROVENANCE tackles complex scenarios by identifying
query dependencies through efficient query log analysis (
L2
) and
providing drill-down/rollup capabilities over these query dependen-
cies (
L3
). More specifically, query dependencies are encoded in a
novel tree data structure, that we refer to as QQTree. Provenance is
then aggregated based on parent-child relationships of QQTrees.
To better accommodate application-specific requirements (
L4
),
we introduced filtering techniques to eliminate redundant provenance
information in various points of the provenance extraction work-
flow. Moreover, ONEPROVENANCE employs an extensible query
runtime metadata model that allows capturing application-specific
metadata in the provenance graph. Importantly, our data model is in
compliance with open standards (APACHE ATLAS [
11
]) for better
interoperability with existing metadata management systems.
To summarize, our key contributions in this paper include:
•
A data model encoding dynamic provenance, metadata, and query
dependencies—while complying with open standards (Section 3).
•
An efficient and extensible dynamic provenance extractor that
overcomes several limitations of existing extractors (Section 4).
•
Filtering techniques to optimize the extraction process depending
on application requirements (Section 5).
•The integration of ONEPROVENANCE with Purview (Section 6).
•
A thorough experimental analysis across workflow types (from
transactional to analytical ones) highlighting the performance of
ONEPROVENANCE (end-to-end, in individual components, and in
comparison with state-of-the-art extraction techniques), and the
benefits from our proposed optimizations (Section 7).
2 RUNNING EXAMPLE
To better highlight key points in our discussion, we use the fol-
lowing running example throughout the rest of the paper:
Consider the T-SQL script in Figure 1. The stored procedure Sync-
NewSales (lines 18-28) is populating the table
StagedSales
(delet-
ing its previous contents, if it already exists) using an external CSV
1CREATE PROCEDURE CleanAndAppendSalesHistory
2@trackingSystemVersion int
3AS
4BEGIN
5IF @trackingSystemVersion = 1
6BEGIN
7INSERT SalesHistory
8SELECT c.CustomerId, c.Region,
9r.Rate *c.Amount AS Amount
10 FROM StagedSales c JOIN
11 ConversionRate r ON c.Region = r.Region
12 END
13 ELSE
14 BEGIN
15 INSERT SalesHistory SELECT *FROM StagedSales
16 END
17 END
18 CREATE PROCEDURE SyncNewSales
19 @trackingSystemVersion int
20 AS
21 BEGIN
22 IF EXISTS(SELECT *FROM INFORMATION_SCHEMA.TABLES
23 WHERE TABLE_NAME='StagedSales')
24 DELETE FROM TABLE StagedSales;
25 BULK INSERT StagedSales FROM 'newSales.csv';
26 EXECUTE CleanAndAppendSalesHistory
27 @trackingSystemVersion;
28 END
29 EXECUTE SyncNewSales 2;
Figure 1: Workflow of our running example.
file. Then, it calls the stored procedure CleanAndAppendSalesHis-
tory (lines 1-17) that is responsible for appending new data to the
SalesHistory
table. Note that the query that populates
SalesHis-
tory
varies depending on the value of the parameter
@trackingSys-
temVersion
. Hence, provenance information and runtime metadata
for these two stored procedures may change across their runs.
Using this example, we will show how we can extract dynamic
provenance, identify dependencies (e.g., across executions of a
stored procedure or queries that are part of a stored procedure),
attach metadata to nodes of the provenance graph (e.g., who and
from what server and application executed the stored procedure or
how much CPU time and I/O was taken during execution of a stored
procedure), and aggregate provenance across queries.
3 DATA MODEL AND PROBLEM
STATEMENT
Our overall goal is to extract semantically-rich, coarse-grained prove-
nance information from executed queries. In this section, we discuss
our provenance model and associated problem statement.
3.1 Provenance Model
At a high-level, we model coarse-grained provenance as a hyper-
graph that captures the relationships between datasets (e.g., tables or
columns) and processes (e.g., queries). We now define the building
blocks for modeling such a graph (i.e., our provenance model).
Since we aim to comply with open standards (recall Section 1),
our provenance model is built on top on the, heavily extensible,
APACHE ATLAS type system. More specifically, APACHE ATLAS
introduces the generic entity types
Process
and
Dataset
. (Both
types are derived from the generic
Asset
type of APACHE ATLAS,
as shown in Figure 2.) Metadata on processes and datasets can
be introduced as attributes (e.g., a dataset may have a name, size,
and id) or through relationships (e.g., a client connection invokes
a process). APACHE ATLAS supports inheritance and containment
relationships that are relevant to our work as we discuss below.
Provenance is encoded as special relationships between
Process
and
Dataset
entities denoting the input/output datasets of a process.
(Finally, note that APACHE ATLAS is only one target type system for
2