ONEPROVENANCE: Efficient Extraction of Dynamic
Coarse-Grained Provenance From Database Query Event
Logs [Technical Report]
Fotis Psallidas, Ashvin Agrawal, Chandru Sugunan1, Khaled Ibrahim, Konstantinos Karanasos2,
Jesús Camacho-Rodríguez, Avrilia Floratou, Carlo Curino, Raghu Ramakrishnan
Microsoft, Snowflake1, Meta2
first.last@microsoft.com, chandru.sugunan@snowflake.com1, kkaranasos@meta.com2
ABSTRACT
Provenance encodes information that connects datasets, their gener-
ation workflows, and associated metadata (e.g., who executed a query and when). As such, it is instrumental for a wide range of critical
governance applications (e.g., observability and auditing). Unfortu-
nately, in the context of database systems, extracting coarse-grained
provenance is a long-standing problem due to the complexity and
sheer volume of database workflows. Provenance extraction from
query event logs has recently been proposed as favorable because, in principle, it can result in meaningful provenance graphs for provenance applications. Current approaches, however, (a) add substantial
overhead to the database and provenance extraction workflows and
(b) extract provenance that is noisy, omits query execution dependen-
cies, and is not rich enough for upstream applications. To address
these problems, we introduce ONEPROVENANCE: an efficient prove-
nance extraction system from query event logs. ONEPROVENANCE
addresses the unique challenges of log-based extraction by (a) iden-
tifying query execution dependencies through efficient log analysis,
(b) extracting provenance through novel event transformations that
account for query dependencies, and (c) introducing effective filter-
ing optimizations. Our thorough experimental analysis shows that
ONEPROVENANCE can improve extraction by up to ~18X compared
to state-of-the-art baselines; our optimizations reduce the extraction
noise and optimize performance even further. ONEPROVENANCE
is deployed at scale by Microsoft Purview and actively supports
customer provenance extraction needs (https://bit.ly/3N2JVGF).
1 INTRODUCTION
Data governance platforms aim to enable organizations to govern
(e.g., catalog, overview, secure, analyze, and audit) their data es-
tates. In this direction, Microsoft’s governance platform, namely,
Purview [72], makes a range of governance functionalities readily accessible to customers. (Other such platforms include Collibra [24], Alation [6], IBM Infosphere [58], or Informatica [57]—further highlighting the importance of data governance.) According to several recent business [41, 46] and academic [1] reports, and in-line with
our customer feedback, central to data governance is the ability to
capture and use provenance (i.e., a graph encoding connections be-
tween inputs and outputs across workflows) and associated metadata
(e.g., who executed a workflow and when) from across data systems.
1,2 Work done while Chandru and Konstantinos were at Microsoft.
Unsurprisingly, since databases play a critical role in data manage-
ment, capturing provenance from database systems has been one of
the most requested features by Microsoft Purview customers.
Extracting provenance information from database systems is chal-
lenging, however, due to the complexity and size of database work-
flows. Provenance extractors, such as the ones supported by the
major governance platforms above, can be classified into static and
dynamic ones. Static provenance extractors access information from
database catalogs (e.g., tables, views, and stored procedures), and
use static analysis to extract provenance information from them. The
main advantage of these extractors is that they can be easily deployed.
Unfortunately, such extractors can lead to incomplete or incorrect provenance graphs due to their inability to monitor the execu-
tion of queries (e.g., branches, triggers, or dynamic SQL). Hence,
more recently, dynamic provenance extractors have been introduced
to address these limitations. Dynamic provenance extractors operate
by listening on events generated by database systems as a side effect
of query execution, and extracting provenance from these events.
Dynamic provenance has also been one of the most highly re-
quested features in Microsoft Purview from customers spanning a
wide spectrum of industries (e.g., finance, retail, healthcare, and pub-
lic services). Designing an efficient and robust dynamic provenance
extractor is not straightforward. In particular, based on customer in-
terviews, we find that existing solutions have four main limitations:
Design overheads (L1): They extract provenance based on logical or physical plans carried over events generated during query execution (e.g., SAC [80], Spline [74], or OpenLineage for Spark [36]). As we show in our experiments, this design adds significant overheads (up to ~18×) to both query execution and provenance extraction. In
real-world scenarios, as highlighted in our customer interviews, these
overheads can be prohibitive—both performance- and cost-wise.
Limited support for complex scenarios (L2): They focus on each query in isolation, thus failing to capture important dependencies among queries (e.g., queries executed by a stored procedure). Such dependencies are ubiquitous and can be complex in practice [48].
Hence, resulting provenance graphs are largely incomplete (e.g., no
provenance of stored procedures or context on which query triggered
another query) and, as such, hard to explore and reason upon.
Absence of drill-down/rollup capabilities (L3):
Because existing
extractors fail to capture dependencies among queries, customers
also pointed out that such extractors do not allow reasoning at various
levels of an application. For instance, a user might want to first
explore provenance at the stored procedure execution level and then,
if needed, drill-down to the provenance of the queries of this stored
procedure [76]. To provide such analysis capabilities, a provenance
extractor should be able to aggregate provenance information.
Disconnect with consumer applications (L4):
Finally, customers
also highlighted that existing extractors fail to provide mechanisms
to tailor the resulting provenance information to the needs of the
consumer application. For instance, they treat each query as equally
important. However, queries issued by system administrators or
backup processes are unlikely to be of interest to business users.
For such applications, the resulting provenance graphs are often
considered noisy and redundant. Similarly, existing extractors lack
a rich query runtime metadata model to provide necessary context
for upstream applications [53, 61] (e.g., who executed a query, from which application, or what were the CPU and I/O costs).
To this end, we introduce ONEPROVENANCE, a novel dynamic
provenance extraction system that addresses the limitations of ex-
isting extractors. In particular, our system uses a novel extraction
design that collects dynamic provenance from low-volume query
logs, without relying on plans (L1). As such, ONEPROVENANCE
avoids excessive overheads on database execution and provenance
extraction to the extent that ONEPROVENANCE extracts provenance
from even sizeable transactional workloads—extending coverage
beyond the traditional focus on analytical and ETL workloads.
ONEPROVENANCE tackles complex scenarios by identifying
query dependencies through efficient query log analysis (L2) and providing drill-down/rollup capabilities over these query dependencies (L3). More specifically, query dependencies are encoded in a novel tree data structure that we refer to as a QQTree. Provenance is then aggregated based on parent-child relationships of QQTrees.
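For illustration, a minimal sketch of how such a tree could be represented is shown below; the class, field, and method names (QQTreeNode, spawn) are hypothetical and do not reflect the actual implementation described later in the paper.

from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch only: a QQTree node capturing "query run spawned by
# parent query run" dependencies; names are assumptions, not the actual code.
@dataclass
class QQTreeNode:
    query_run_id: str                       # identifier of this query run
    query_text: str                         # statement or procedure text
    parent: Optional["QQTreeNode"] = None   # the run that triggered this one
    children: List["QQTreeNode"] = field(default_factory=list)

    def spawn(self, query_run_id: str, query_text: str) -> "QQTreeNode":
        """Record a query run triggered while this run was executing."""
        child = QQTreeNode(query_run_id, query_text, parent=self)
        self.children.append(child)
        return child

# Example (hypothetical run ids and text): a stored procedure run spawning
# the runs of the statements it executes.
proc_run = QQTreeNode("r1", "EXECUTE SomeProcedure")
proc_run.spawn("r2", "INSERT INTO T SELECT * FROM S")

Aggregating provenance bottom-up over such parent-child links is what enables the drill-down/rollup behavior discussed above.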
To better accommodate application-specific requirements (L4), we introduce filtering techniques to eliminate redundant provenance information at various points of the provenance extraction workflow. Moreover, ONEPROVENANCE employs an extensible query
runtime metadata model that allows capturing application-specific
metadata in the provenance graph. Importantly, our data model is in
compliance with open standards (APACHE ATLAS [11]) for better
interoperability with existing metadata management systems.
To summarize, our key contributions in this paper include:
• A data model encoding dynamic provenance, metadata, and query dependencies—while complying with open standards (Section 3).
• An efficient and extensible dynamic provenance extractor that overcomes several limitations of existing extractors (Section 4).
• Filtering techniques to optimize the extraction process depending on application requirements (Section 5).
• The integration of ONEPROVENANCE with Purview (Section 6).
• A thorough experimental analysis across workflow types (from transactional to analytical ones) highlighting the performance of ONEPROVENANCE (end-to-end, in individual components, and in comparison with state-of-the-art extraction techniques), and the benefits from our proposed optimizations (Section 7).
2 RUNNING EXAMPLE
To better highlight key points in our discussion, we use the fol-
lowing running example throughout the rest of the paper:
Consider the T-SQL script in Figure 1. The stored procedure SyncNewSales (lines 18-28) populates the table StagedSales (deleting its previous contents, if it already exists) using an external CSV file.
 1  CREATE PROCEDURE CleanAndAppendSalesHistory
 2      @trackingSystemVersion int
 3  AS
 4  BEGIN
 5      IF @trackingSystemVersion = 1
 6      BEGIN
 7          INSERT SalesHistory
 8          SELECT c.CustomerId, c.Region,
 9                 r.Rate * c.Amount AS Amount
10          FROM StagedSales c JOIN
11               ConversionRate r ON c.Region = r.Region
12      END
13      ELSE
14      BEGIN
15          INSERT SalesHistory SELECT * FROM StagedSales
16      END
17  END
18  CREATE PROCEDURE SyncNewSales
19      @trackingSystemVersion int
20  AS
21  BEGIN
22      IF EXISTS(SELECT * FROM INFORMATION_SCHEMA.TABLES
23                WHERE TABLE_NAME = 'StagedSales')
24          DELETE FROM StagedSales;
25      BULK INSERT StagedSales FROM 'newSales.csv';
26      EXECUTE CleanAndAppendSalesHistory
27          @trackingSystemVersion;
28  END
29  EXECUTE SyncNewSales 2;
Figure 1: Workflow of our running example.
Then, it calls the stored procedure CleanAndAppendSalesHistory (lines 1-17) that is responsible for appending new data to the SalesHistory table. Note that the query that populates SalesHistory varies depending on the value of the parameter @trackingSystemVersion. Hence, provenance information and runtime metadata
for these two stored procedures may change across their runs.
Using this example, we will show how we can extract dynamic
provenance, identify dependencies (e.g., across executions of a
stored procedure or queries that are part of a stored procedure),
attach metadata to nodes of the provenance graph (e.g., who and
from what server and application executed the stored procedure or
how much CPU time and I/O was taken during execution of a stored
procedure), and aggregate provenance across queries.
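As a preview, and following the provenance model introduced in the next section, the sketch below lists the kind of input/output edges one could expect for the run EXECUTE SyncNewSales 2; the dataset and process names shown are illustrative assumptions rather than the exact output of ONEPROVENANCE.

# Hedged sketch: coarse-grained provenance edges one would expect for
# EXECUTE SyncNewSales 2; names and granularity are illustrative only.
# Each edge reads as (input dataset, process/query run, output dataset).
expected_edges = [
    ("newSales.csv", "BULK INSERT run (spawned by SyncNewSales)", "StagedSales"),
    ("StagedSales", "INSERT SalesHistory run (spawned by CleanAndAppendSalesHistory)", "SalesHistory"),
]
# Rolled up to the SyncNewSales run, the inputs/outputs are the set union over
# its spawned runs: inputs {newSales.csv, StagedSales},
# outputs {StagedSales, SalesHistory}.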
3 DATA MODEL AND PROBLEM STATEMENT
Our overall goal is to extract semantically-rich, coarse-grained prove-
nance information from executed queries. In this section, we discuss
our provenance model and associated problem statement.
3.1 Provenance Model
At a high-level, we model coarse-grained provenance as a hyper-
graph that captures the relationships between datasets (e.g., tables or
columns) and processes (e.g., queries). We now define the building
blocks for modeling such a graph (i.e., our provenance model).
Since we aim to comply with open standards (recall Section 1),
our provenance model is built on top of the heavily extensible APACHE ATLAS type system. More specifically, APACHE ATLAS introduces the generic entity types Process and Dataset. (Both types are derived from the generic Asset type of APACHE ATLAS,
as shown in Figure 2.) Metadata on processes and datasets can
be introduced as attributes (e.g., a dataset may have a name, size,
and id) or through relationships (e.g., a client connection invokes
a process). APACHE ATLAS supports inheritance and containment
relationships that are relevant to our work as we discuss below.
Provenance is encoded as special relationships between Process and Dataset entities denoting the input/output datasets of a process.
(Finally, note that APACHE ATLAS is only one target type system for provenance information. Our model can also be compiled to other standards, including OpenLineage [36] or W3C PROV-DM [64].)

Figure 2: ONEPROVENANCE Data Model. Blue lines with arrows correspond to inheritance relationships, orange lines with diamonds correspond to containment relationships, and purple lines with circles highlight provenance relationships. (We omit attributes per type for brevity.)
With the background of APACHE ATLAS in place, we can now dive into our model, which is depicted in Figure 2.
Datasets.
In line with prior work on coarse-grained provenance [74, 80], datasets in our model include relations (such as tables, views,
external tables, and query outputs) and their associated columns.
Processes.
In our model, processes can be either queries or query
runs. The former is traditionally the target of static provenance
extractors, and the latter the target of dynamic ones. Queries and their
runs are then sub-typed through inheritance relationships. As shown
in Figure 2, in our current model, we encode ad-hoc statements,
batches (encoded as a series of statements), queries that are part of
stored procedures, and stored procedures. For every such static query
type, we introduce its dynamic type by subtyping on query run.
In addition, we use attributes to encode metadata for processes. In
particular, for queries we track the query text, while for query runs
we track the user that executed the query; CPU time; duration; rows
inserted, updated, deleted, and returned. Finally, a client connection
is attached to each query run through a containment relationship to
encode from what application and server the query run was invoked.
Query dependencies.
To support complex scenarios and enable
drill-down and rollup capabilities (addressing L2-3 from Section 1),
we introduce two types of dependencies: (1) query runs spawned
by parent query runs and (2) runs of a query. The former allows
us to encode what query runs have been triggered by other queries
(e.g., queries executed as part of executing a stored procedure) and
vice versa (e.g., what stored procedures a query has been part of).
The latter allows us to track the runs of a particular query (e.g., the
runs of a stored procedure). Both types of dependencies are encoded
through containment relationships in our model (see also Figure 2).
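To make these types and relationships concrete, the sketch below shows one possible in-memory representation prior to serialization into APACHE ATLAS entities; every class and attribute name here is an illustrative assumption, not the actual ONEPROVENANCE or ATLAS type definition.

from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch of the model in Figure 2; names are assumptions.
@dataclass
class Dataset:
    qualified_name: str            # e.g., a table, view, or column identifier

@dataclass
class ClientConnection:
    user: str
    application: str
    server: str

@dataclass
class Query:                       # static side: the text of a statement/procedure
    query_text: str
    runs: List["QueryRun"] = field(default_factory=list)    # "runs of a query"

@dataclass
class QueryRun:                    # dynamic side: one execution of a Query
    query: Query
    connection: ClientConnection   # containment: who/where the run was invoked from
    cpu_time_ms: int = 0
    duration_ms: int = 0
    rows_inserted: int = 0
    rows_updated: int = 0
    rows_deleted: int = 0
    rows_returned: int = 0
    parent: Optional["QueryRun"] = None                      # spawned-by dependency
    children: List["QueryRun"] = field(default_factory=list)
    inputs: List[Dataset] = field(default_factory=list)      # provenance (next paragraph)
    outputs: List[Dataset] = field(default_factory=list)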
Provenance.
Following our conceptual model, we model provenance as a hypergraph $P$ connecting inputs with outputs across a workflow. Logically, each edge $i \xrightarrow{p} o$ maps input $i$ to output $o$, derived from $i$, through process $p$. As we discussed above, in our
while input and output datasets are relations and columns.
As discussed in Section 1, our main contribution to address L3 is to aggregate provenance through the query dependencies of our model. We define the provenance of a query run as the set union of the provenance of the query runs that were executed as a result of executing that query (i.e., for query run $Q_r$ we define its output $O(Q_r)$ as $O(Q_r) = \bigcup_{Q_r \rightarrow Q'_r} O(Q'_r)$, where $Q_r \rightarrow Q'_r$ denotes the set of query runs $Q'_r$ triggered by $Q_r$; similarly for inputs). Furthermore, we define the provenance of a query as the set union of the provenance of its query runs (i.e., for query $Q_s$ we define its output $O(Q_s)$ as $O(Q_s) = \bigcup_{Q_s \rightarrow Q_r} O(Q_r)$, where $Q_s \rightarrow Q_r$ denotes the set of query runs $Q_r$ for the query $Q_s$; similarly for inputs). Note that provenance
is defined recursively for both definitions. The base case is the
provenance of individual statements (e.g., individual DML or DDL
statements). Similar to prior work, what constitutes an input or
output of each individual statement is based on the SQL semantics
of the statement type (e.g., a CREATE TABLE X statement has X as output); we discuss more on this in Section 4.4.
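A minimal sketch of this recursive aggregation is shown below, assuming the parent-child run structure sketched earlier; using sets also yields the deduplication discussed next.

from typing import Set

def aggregated_outputs(run) -> Set[str]:
    """Sketch of O(Q_r): if the run triggered other runs, take the set union of
    their outputs recursively; otherwise (an individual DML/DDL statement) use
    the statement's own outputs."""
    if not run.children:
        return {d.qualified_name for d in run.outputs}
    result: Set[str] = set()
    for child in run.children:      # runs triggered by this run
        result |= aggregated_outputs(child)
    return result

def query_outputs(query) -> Set[str]:
    """Sketch of O(Q_s): the set union over all runs of the query Q_s."""
    out: Set[str] = set()
    for run in query.runs:
        out |= aggregated_outputs(run)
    return out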
Note that by using the set union operator to aggregate provenance
information, we end up deduplicating multiple instances of an input
or output dataset into a single dataset instance (e.g., in our example,
if we execute SyncNewSales multiple times, each resulting in exe-
cuting a query INSERT SalesHistory..., our set union semantics will lead to having SalesHistory as output of SyncNewSales only
once). This design enables providing an overview of provenance in-
formation with reduced noise (e.g., provenance of a stored procedure
shows only single instances of inputs and outputs). At the same time,
we still allow drilling down to capture all the details with respect to
the alternative instances (e.g., on the provenance of the queries of
the stored procedure). We discuss more on this in Section 4.5.
Finally, note that although input/output relationships in APACHE
ATLAS can encode the input/output datasets of a process,
they cannot encode which input contributes to which output. For this,
we require a ternary relationship $(i, p, o)$ encoding that input $i$ contributes to output $o$ through process $p$. Since APACHE ATLAS only
supports binary relationships, a common workaround is to introduce
this ternary relationship as an attribute on the process (serialized as
a dictionary). Importantly, input/output relationships still need to be
extracted despite being redundant because other APACHE ATLAS
features (e.g., provenance visualization or label propagation) rely
on them. While not ideal, we opt for this workaround in favor of
complying with the APACHE ATLAS open standard.
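For illustration, the snippet below shows what such a serialized attribute could look like; the attribute name columnMapping and the overall shape of the payload are assumptions made for this example, not the actual APACHE ATLAS or Microsoft Purview schema.

import json

# Hypothetical serialization of the ternary (input, process, output) mapping as
# an attribute on the process; the "columnMapping" name and the shape of the
# payload are assumptions, not the actual ATLAS or ONEPROVENANCE schema.
process_attributes = {
    "columnMapping": json.dumps([
        {"input": "dbo.StagedSales.Amount",  "output": "dbo.SalesHistory.Amount"},
        {"input": "dbo.ConversionRate.Rate", "output": "dbo.SalesHistory.Amount"},
    ])
}
# The binary input/output relationships (e.g., StagedSales -> process and
# process -> SalesHistory) are still emitted separately, since ATLAS features
# such as lineage visualization and label propagation rely on them.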
So far, we have discussed our provenance data model, which essentially forms the output of our extraction system. Next, we discuss the query logs that are the input to our system.
3.2 Query Log
We assume that a database creates a query log as a side effect of
query execution. We model the query log as a totally ordered set of
events—with ordering based on the time each event was spawned.
Next, we discuss the semantics we require from query logs. To
simplify our discussion, Figure 3 shows a query log created by Azure
SQL DB for our running example in Figure 1.
Event types.
Each event is associated with a query run, and can
have one of the following two types: 1) started and 2) completed, in-
dicating the start or completion of the corresponding query run,
respectively. Furthermore, we assume that each event is drawn
from a collection of event types that encode whether the query
is an individual or batch statement. For instance, in our example
in Figure 3, the entries in the log from Azure SQL DB contain