ONEPROVENANCE: Efficient Extraction of Dynamic
Coarse-Grained Provenance From Database Query Event
Logs [Technical Report]
Fotis Psallidas, Ashvin Agrawal, Chandru Sugunan1, Khaled Ibrahim, Konstantinos Karanasos2,
Jesús Camacho-Rodríguez, Avrilia Floratou, Carlo Curino, Raghu Ramakrishnan
Microsoft, Snowflake1, Meta2
first.last@microsoft.com, chandru.sugunan@snowflake.com1, kkaranasos@meta.com2
ABSTRACT
Provenance encodes information that connects datasets, their gener-
ation workflows, and associated metadata (e.g., who executed a query and when). As such, it is instrumental for a wide range of critical
governance applications (e.g., observability and auditing). Unfortu-
nately, in the context of database systems, extracting coarse-grained
provenance is a long-standing problem due to the complexity and
sheer volume of database workflows. Provenance extraction from
query event logs has recently been proposed as favorable because, in principle, it can result in meaningful provenance graphs for provenance applications. Current approaches, however, (a) add substantial
overhead to the database and provenance extraction workflows and
(b) extract provenance that is noisy, omits query execution dependen-
cies, and is not rich enough for upstream applications. To address
these problems, we introduce ONEPROVENANCE: an efficient prove-
nance extraction system from query event logs. ONEPROVENANCE
addresses the unique challenges of log-based extraction by (a) iden-
tifying query execution dependencies through efficient log analysis,
(b) extracting provenance through novel event transformations that
account for query dependencies, and (c) introducing effective filter-
ing optimizations. Our thorough experimental analysis shows that
ONEPROVENANCE can improve extraction by up to ~18X compared
to state-of-the-art baselines; our optimizations reduce the extraction
noise and optimize performance even further. ONEPROVENANCE
is deployed at scale by Microsoft Purview and actively supports
customer provenance extraction needs (https://bit.ly/3N2JVGF).
1 INTRODUCTION
Data governance platforms aim to enable organizations to govern
(e.g., catalog, overview, secure, analyze, and audit) their data es-
tates. In this direction, Microsoft’s governance platform, namely,
Purview [72], makes a range of governance functionalities readily accessible to customers. (Other such platforms include Collibra [24], Alation [6], IBM Infosphere [58], or Informatica [57]—further highlighting the importance of data governance.) According to several recent business [41, 46] and academic [1] reports, and in-line with
our customer feedback, central to data governance is the ability to
capture and use provenance (i.e., a graph encoding connections be-
tween inputs and outputs across workflows) and associated metadata
(e.g., who executed a workflow and when) from across data systems.
1,2 Work done while Chandru and Konstantinos were at Microsoft.
Unsurprisingly, since databases play a critical role in data manage-
ment, capturing provenance from database systems has been one of
the most requested features by Microsoft Purview customers.
Extracting provenance information from database systems is chal-
lenging, however, due to the complexity and size of database work-
flows. Provenance extractors, such as the ones supported by the
major governance platforms above, can be classified into static and
dynamic ones. Static provenance extractors access information from
database catalogs (e.g., tables, views, and stored procedures), and
use static analysis to extract provenance information from them. The
main advantage of these extractors is that they can be easily deployed.
Unfortunately, such extractors can lead to incomplete or incorrect provenance graphs due to their inability to monitor the execu-
tion of queries (e.g., branches, triggers, or dynamic SQL). Hence,
more recently, dynamic provenance extractors have been introduced
to address these limitations. Dynamic provenance extractors operate
by listening on events generated by database systems as a side effect
of query execution, and extracting provenance from these events.
Dynamic provenance has also been one of the most highly re-
quested features in Microsoft Purview from customers spanning a
wide spectrum of industries (e.g., finance, retail, healthcare, and pub-
lic services). Designing an efficient and robust dynamic provenance
extractor is not straightforward. In particular, based on customer in-
terviews, we find that existing solutions have four main limitations:
Design overheads (L1): They extract provenance based on logical or physical plans carried over events generated during query execution (e.g., SAC [80], Spline [74], or OpenLineage for Spark [36]). As we show in our experiments, this design adds significant overheads (up to ~18×) to both query execution and provenance extraction. In
real-world scenarios, as highlighted in our customer interviews, these
overheads can be prohibitive—both performance- and cost-wise.
Limited support for complex scenarios (L2): They focus on each query in isolation, thus failing to capture important dependencies among queries (e.g., queries executed by a stored procedure). Such dependencies are ubiquitous and can be complex in practice [48].
Hence, resulting provenance graphs are largely incomplete (e.g., no
provenance of stored procedures or context on which query triggered
another query) and, as such, hard to explore and reason upon.
Absence of drill-down/rollup capabilities (L3):
Because existing
extractors fail to capture dependencies among queries, customers
also pointed out that such extractors do not allow reasoning at various
levels of an application. For instance, a user might want to first
explore provenance at the stored procedure execution level and then,
if needed, drill-down to the provenance of the queries of this stored
procedure [76]. To provide such analysis capabilities, a provenance
extractor should be able to aggregate provenance information.
Disconnect with consumer applications (L4):
Finally, customers
also highlighted that existing extractors fail to provide mechanisms
to tailor the resulting provenance information to the needs of the
consumer application. For instance, they treat each query as equally
important. However, queries issued by system administrators or
backup processes are unlikely to be of interest to business users.
For such applications, the resulting provenance graphs are often
considered noisy and redundant. Similarly, existing extractors lack
a rich query runtime metadata model to provide necessary context
for upstream applications [53, 61] (e.g., who executed a query, from which application, or what were the CPU and I/O costs).
To this end, we introduce ONEPROVENANCE, a novel dynamic
provenance extraction system that addresses the limitations of ex-
isting extractors. In particular, our system uses a novel extraction
design that collects dynamic provenance from low-volume query
logs, without relying on plans (L1). As such, ONEPROVENANCE
avoids excessive overheads on database execution and provenance
extraction to the extent that ONEPROVENANCE extracts provenance
from even sizeable transactional workloads—extending coverage
beyond the traditional focus on analytical and ETL workloads.
ONEPROVENANCE tackles complex scenarios by identifying
query dependencies through efficient query log analysis (L2) and providing drill-down/rollup capabilities over these query dependencies (L3). More specifically, query dependencies are encoded in a novel tree data structure that we refer to as a QQTree. Provenance is then aggregated based on parent-child relationships of QQTrees.
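For illustration, a minimal sketch of how such a tree could be represented is shown below; the class, field, and method names (QQTreeNode, spawn) are hypothetical and do not reflect the actual implementation described later in the paper.

from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch only: a QQTree node capturing "query run spawned by
# parent query run" dependencies; names are assumptions, not the actual code.
@dataclass
class QQTreeNode:
    query_run_id: str                       # identifier of this query run
    query_text: str                         # statement or procedure text
    parent: Optional["QQTreeNode"] = None   # the run that triggered this one
    children: List["QQTreeNode"] = field(default_factory=list)

    def spawn(self, query_run_id: str, query_text: str) -> "QQTreeNode":
        """Record a query run triggered while this run was executing."""
        child = QQTreeNode(query_run_id, query_text, parent=self)
        self.children.append(child)
        return child

# Example (hypothetical run ids and text): a stored procedure run spawning
# the runs of the statements it executes.
proc_run = QQTreeNode("r1", "EXECUTE SomeProcedure")
proc_run.spawn("r2", "INSERT INTO T SELECT * FROM S")

Aggregating provenance bottom-up over such parent-child links is what enables the drill-down/rollup behavior discussed above.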
To better accommodate application-specific requirements (L4), we introduce filtering techniques to eliminate redundant provenance information at various points of the provenance extraction workflow. Moreover, ONEPROVENANCE employs an extensible query
runtime metadata model that allows capturing application-specific
metadata in the provenance graph. Importantly, our data model is in
compliance with open standards (APACHE ATLAS [11]) for better
interoperability with existing metadata management systems.
To summarize, our key contributions in this paper include:
• A data model encoding dynamic provenance, metadata, and query dependencies—while complying with open standards (Section 3).
• An efficient and extensible dynamic provenance extractor that overcomes several limitations of existing extractors (Section 4).
• Filtering techniques to optimize the extraction process depending on application requirements (Section 5).
• The integration of ONEPROVENANCE with Purview (Section 6).
• A thorough experimental analysis across workflow types (from transactional to analytical ones) highlighting the performance of ONEPROVENANCE (end-to-end, in individual components, and in comparison with state-of-the-art extraction techniques), and the benefits from our proposed optimizations (Section 7).
2 RUNNING EXAMPLE
To better highlight key points in our discussion, we use the fol-
lowing running example throughout the rest of the paper:
Consider the T-SQL script in Figure 1. The stored procedure SyncNewSales (lines 18-28) populates the table StagedSales (deleting its previous contents, if it already exists) using an external CSV file.
 1  CREATE PROCEDURE CleanAndAppendSalesHistory
 2      @trackingSystemVersion int
 3  AS
 4  BEGIN
 5      IF @trackingSystemVersion = 1
 6      BEGIN
 7          INSERT SalesHistory
 8          SELECT c.CustomerId, c.Region,
 9                 r.Rate * c.Amount AS Amount
10          FROM StagedSales c JOIN
11               ConversionRate r ON c.Region = r.Region
12      END
13      ELSE
14      BEGIN
15          INSERT SalesHistory SELECT * FROM StagedSales
16      END
17  END
18  CREATE PROCEDURE SyncNewSales
19      @trackingSystemVersion int
20  AS
21  BEGIN
22      IF EXISTS(SELECT * FROM INFORMATION_SCHEMA.TABLES
23                WHERE TABLE_NAME = 'StagedSales')
24          DELETE FROM StagedSales;
25      BULK INSERT StagedSales FROM 'newSales.csv';
26      EXECUTE CleanAndAppendSalesHistory
27          @trackingSystemVersion;
28  END
29  EXECUTE SyncNewSales 2;
Figure 1: Workflow of our running example.
Then, it calls the stored procedure CleanAndAppendSalesHistory (lines 1-17) that is responsible for appending new data to the SalesHistory table. Note that the query that populates SalesHistory varies depending on the value of the parameter @trackingSystemVersion. Hence, provenance information and runtime metadata
for these two stored procedures may change across their runs.
Using this example, we will show how we can extract dynamic
provenance, identify dependencies (e.g., across executions of a
stored procedure or queries that are part of a stored procedure),
attach metadata to nodes of the provenance graph (e.g., who and
from what server and application executed the stored procedure or
how much CPU time and I/O was taken during execution of a stored
procedure), and aggregate provenance across queries.
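As a preview, and following the provenance model introduced in the next section, the sketch below lists the kind of input/output edges one could expect for the run EXECUTE SyncNewSales 2; the dataset and process names shown are illustrative assumptions rather than the exact output of ONEPROVENANCE.

# Hedged sketch: coarse-grained provenance edges one would expect for
# EXECUTE SyncNewSales 2; names and granularity are illustrative only.
# Each edge reads as (input dataset, process/query run, output dataset).
expected_edges = [
    ("newSales.csv", "BULK INSERT run (spawned by SyncNewSales)", "StagedSales"),
    ("StagedSales", "INSERT SalesHistory run (spawned by CleanAndAppendSalesHistory)", "SalesHistory"),
]
# Rolled up to the SyncNewSales run, the inputs/outputs are the set union over
# its spawned runs: inputs {newSales.csv, StagedSales},
# outputs {StagedSales, SalesHistory}.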
3 DATA MODEL AND PROBLEM STATEMENT
Our overall goal is to extract semantically-rich, coarse-grained prove-
nance information from executed queries. In this section, we discuss
our provenance model and associated problem statement.
3.1 Provenance Model
At a high-level, we model coarse-grained provenance as a hyper-
graph that captures the relationships between datasets (e.g., tables or
columns) and processes (e.g., queries). We now define the building
blocks for modeling such a graph (i.e., our provenance model).
Since we aim to comply with open standards (recall Section 1),
our provenance model is built on top of the heavily extensible APACHE ATLAS type system. More specifically, APACHE ATLAS introduces the generic entity types Process and Dataset. (Both types are derived from the generic Asset type of APACHE ATLAS,
as shown in Figure 2.) Metadata on processes and datasets can
be introduced as attributes (e.g., a dataset may have a name, size,
and id) or through relationships (e.g., a client connection invokes
a process). APACHE ATLAS supports inheritance and containment
relationships that are relevant to our work as we discuss below.
Provenance is encoded as special relationships between Process and Dataset entities denoting the input/output datasets of a process.
(Finally, note that APACHE ATLAS is only one target type system for provenance information. Our model can also be compiled to other standards, including OpenLineage [36] or W3C PROV-DM [64].)

Figure 2: ONEPROVENANCE Data Model. Blue lines with arrows correspond to inheritance relationships, orange lines with diamonds correspond to containment relationships, and purple lines with circles highlight provenance relationships. (We omit attributes per type for brevity.)
With the background of APACHE ATLAS in place, we can now dive into our model, which is depicted in Figure 2.
Datasets.
In line with prior work on coarse-grained provenance [74, 80], datasets in our model include relations (such as tables, views,
external tables, and query outputs) and their associated columns.
Processes.
In our model, processes can be either queries or query
runs. The former is traditionally the target of static provenance
extractors, and the latter the target of dynamic ones. Queries and their
runs are then sub-typed through inheritance relationships. As shown
in Figure 2, in our current model, we encode ad-hoc statements,
batches (encoded as a series of statements), queries that are part of
stored procedures, and stored procedures. For every such static query
type, we introduce its dynamic type by subtyping on query run.
In addition, we use attributes to encode metadata for processes. In
particular, for queries we track the query text, while for query runs
we track the user that executed the query; CPU time; duration; rows
inserted, updated, deleted, and returned. Finally, a client connection
is attached to each query run through a containment relationship to
encode from what application and server the query run was invoked.
Query dependencies.
To support complex scenarios and enable
drill-down and rollup capabilities (addressing L2-3 from Section 1),
we introduce two types of dependencies: (1) query runs spawned
by parent query runs and (2) runs of a query. The former allows
us to encode what query runs have been triggered by other queries
(e.g., queries executed as part of executing a stored procedure) and
vice versa (e.g., what stored procedures a query has been part of).
The latter allows us to track the runs of a particular query (e.g., the
runs of a stored procedure). Both types of dependencies are encoded
through containment relationships in our model (see also Figure 2).
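To make these types and relationships concrete, the sketch below shows one possible in-memory representation prior to serialization into APACHE ATLAS entities; every class and attribute name here is an illustrative assumption, not the actual ONEPROVENANCE or ATLAS type definition.

from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch of the model in Figure 2; names are assumptions.
@dataclass
class Dataset:
    qualified_name: str            # e.g., a table, view, or column identifier

@dataclass
class ClientConnection:
    user: str
    application: str
    server: str

@dataclass
class Query:                       # static side: the text of a statement/procedure
    query_text: str
    runs: List["QueryRun"] = field(default_factory=list)    # "runs of a query"

@dataclass
class QueryRun:                    # dynamic side: one execution of a Query
    query: Query
    connection: ClientConnection   # containment: who/where the run was invoked from
    cpu_time_ms: int = 0
    duration_ms: int = 0
    rows_inserted: int = 0
    rows_updated: int = 0
    rows_deleted: int = 0
    rows_returned: int = 0
    parent: Optional["QueryRun"] = None                      # spawned-by dependency
    children: List["QueryRun"] = field(default_factory=list)
    inputs: List[Dataset] = field(default_factory=list)      # provenance (next paragraph)
    outputs: List[Dataset] = field(default_factory=list)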
Provenance.
Following our conceptual model, we model provenance as a hypergraph $P$ connecting inputs with outputs across a workflow. Logically, each edge $i \xrightarrow{p} o$ maps input $i$ to output $o$, derived from $i$, through process $p$. As we discussed above, in our
while input and output datasets are relations and columns.
As discussed in Section 1, our main contribution to address L3 is to aggregate provenance through the query dependencies of our model. We define the provenance of a query run as the set union of the provenance of the query runs that were executed as a result of executing that query (i.e., for query run $Q_r$ we define its output $O(Q_r)$ as $O(Q_r) = \bigcup_{Q_r \rightarrow Q'_r} O(Q'_r)$, where $Q_r \rightarrow Q'_r$ denotes the set of query runs $Q'_r$ triggered by $Q_r$; similarly for inputs). Furthermore, we define the provenance of a query as the set union of the provenance of its query runs (i.e., for query $Q_s$ we define its output $O(Q_s)$ as $O(Q_s) = \bigcup_{Q_s \rightarrow Q_r} O(Q_r)$, where $Q_s \rightarrow Q_r$ denotes the set of query runs $Q_r$ for the query $Q_s$; similarly for inputs). Note that provenance
is defined recursively for both definitions. The base case is the
provenance of individual statements (e.g., individual DML or DDL
statements). Similar to prior work, what constitutes an input or
output of each individual statement is based on the SQL semantics
of the statement type (e.g., a CREATE TABLE X statement has X as output); we discuss more on this in Section 4.4.
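A minimal sketch of this recursive aggregation is shown below, assuming the parent-child run structure sketched earlier; using sets also yields the deduplication discussed next.

from typing import Set

def aggregated_outputs(run) -> Set[str]:
    """Sketch of O(Q_r): if the run triggered other runs, take the set union of
    their outputs recursively; otherwise (an individual DML/DDL statement) use
    the statement's own outputs."""
    if not run.children:
        return {d.qualified_name for d in run.outputs}
    result: Set[str] = set()
    for child in run.children:      # runs triggered by this run
        result |= aggregated_outputs(child)
    return result

def query_outputs(query) -> Set[str]:
    """Sketch of O(Q_s): the set union over all runs of the query Q_s."""
    out: Set[str] = set()
    for run in query.runs:
        out |= aggregated_outputs(run)
    return out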
Note that by using the set union operator to aggregate provenance
information, we end up deduplicating multiple instances of an input
or output dataset into a single dataset instance (e.g., in our example,
if we execute SyncNewSales multiple times, each resulting in exe-
cuting a query INSERT SalesHistory..., our set union semantics will lead to having SalesHistory as output of SyncNewSales only
once). This design enables providing an overview of provenance in-
formation with reduced noise (e.g., provenance of a stored procedure
shows only single instances of inputs and outputs). At the same time,
we still allow drilling down to capture all the details with respect to
the alternative instances (e.g., on the provenance of the queries of
the stored procedure). We discuss more on this in Section 4.5.
Finally, note that although input/output relationships in APACHE
ATLAS can encode the input/output datasets of a process,
they cannot encode which input contributes to which output. For this,
we require a ternary relationship $(i, p, o)$ encoding that input $i$ contributes to output $o$ through process $p$. Since APACHE ATLAS only
supports binary relationships, a common workaround is to introduce
this ternary relationship as an attribute on the process (serialized as
a dictionary). Importantly, input/output relationships still need to be
extracted despite being redundant because other APACHE ATLAS
features (e.g., provenance visualization or label propagation) rely
on them. While not ideal, we opt for this workaround in favor of
complying with the APACHE ATLAS open standard.
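For illustration, the snippet below shows what such a serialized attribute could look like; the attribute name columnMapping and the overall shape of the payload are assumptions made for this example, not the actual APACHE ATLAS or Microsoft Purview schema.

import json

# Hypothetical serialization of the ternary (input, process, output) mapping as
# an attribute on the process; the "columnMapping" name and the shape of the
# payload are assumptions, not the actual ATLAS or ONEPROVENANCE schema.
process_attributes = {
    "columnMapping": json.dumps([
        {"input": "dbo.StagedSales.Amount",  "output": "dbo.SalesHistory.Amount"},
        {"input": "dbo.ConversionRate.Rate", "output": "dbo.SalesHistory.Amount"},
    ])
}
# The binary input/output relationships (e.g., StagedSales -> process and
# process -> SalesHistory) are still emitted separately, since ATLAS features
# such as lineage visualization and label propagation rely on them.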
So far, we have discussed our provenance data model, which essentially forms the output of our extraction system. Next, we discuss the query logs that are the input to our system.
3.2 Query Log
We assume that a database creates a query log as a side effect of
query execution. We model the query log as a totally ordered set of
events—with ordering based on the time each event was spawned.
Next, we discuss the semantics we require from query logs. To
simplify our discussion, Figure 3 shows a query log created by Azure
SQL DB for our running example in Figure 1.
Event types.
Each event is associated with a query run, and can
have one of the following two types: 1) started and 2) completed, in-
dicating the start or completion of the corresponding query run,
respectively. Furthermore, we assume that each event is drawn
from a collection of event types that encode whether the query
is an individual or batch statement. For instance, in our example
in Figure 3, the entries in the log from Azure SQL DB contain