TraVaS: Differentially Private Trace Variant Selection for Process Mining*

Majid Rafiei, Frederik Wangelik, and Wil M.P. van der Aalst

Chair of Process and Data Science, RWTH Aachen University, Aachen, Germany

Abstract. In the area of industrial process mining, privacy-preserving event data publication is becoming increasingly relevant. Consequently, the trade-off between high data utility and quantifiable privacy poses new challenges. State-of-the-art research mainly focuses on differentially private trace variant construction based on prefix expansion methods. However, these algorithms face several practical limitations such as high computational complexity, introducing fake variants, removing frequent variants, and a bounded variant length. In this paper, we introduce a new approach for direct differentially private trace variant release which uses anonymized partition selection strategies to overcome the aforementioned restraints. Experimental results on real-life event data show that our algorithm outperforms state-of-the-art methods in terms of both plain data utility and result utility preservation.

Keywords: Process Mining · Differential Privacy · Event Data

1 Introduction

In recent years, process mining and event data analysis have been successfully deployed in many industries. The main objectives are to learn process models from event logs for further behavioral inference (so-called process discovery), to extend existing models using event logs (so-called model enhancement), or to assess the alignment between a process model and an event log (so-called conformance checking) [2]. However, the underlying event data are often bound to personal identifiers or other private information. A prominent example is the process management of hospitals, where the cases are patients being treated by staff. Without means of privacy protection, any adversary is able to extract sensitive information about individuals and their properties. Thus, privacy regulations, such as GDPR [1], typically restrict data storage and access, which motivates the development of privacy preservation techniques.

The majority of state-of-the-art privacy preservation techniques are built on Differential Privacy (DP), which offers a noise-based privacy definition. This is due to its important features, such as providing mathematical privacy guarantees and security against predicate-singling-out attacks [3]. The goal of techniques based on DP is to hide the participation of an individual in the released output

* Funded under the Excellence Strategy of the Federal Government and the Länder. We also thank the Alexander von Humboldt Stiftung for supporting our research.

arXiv:2210.14951v1 [cs.CR] 20 Oct 2022

Table 1: A simple event log from the healthcare context including trace variants and their frequencies.

Trace Variant                                             Frequency
⟨register, visit, blood-test, release⟩                    10
⟨register, blood-test, visit, release⟩                    8
⟨register, visit, release⟩                                20
⟨register, visit, blood-test, blood-test, release⟩        5

by injecting noise. The amount of noise is mainly determined by the privacy parameters, ε and δ, and the sensitivity of the underlying data. State-of-the-art research targeting (ε, δ)-DP methods in process mining focuses on releasing raw privatized activity sequences performed for cases, i.e., trace variants. Table 1 shows a sample of such event data in the healthcare context, where each trace variant belongs to a case, i.e., a patient, and one case cannot have more than one trace variant. This format describes the control-flow of event logs, which is the basis for the main process mining activities. The trace variant of a case is considered sensitive information because it contains the complete sequence of activities performed for the case, which can be exploited to infer private information, e.g., patient diseases in the healthcare context.
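As a minimal sketch of the trace variant format, the following Python snippet expands the frequencies of Table 1 into one trace per case and then recovers the variant distribution by counting. The case identifiers and the helper names `build_cases` and `variant_frequencies` are hypothetical and serve only as illustration; they are not part of the paper.

```python
from collections import Counter

# Trace variants and frequencies copied from Table 1.
TABLE_1 = {
    ("register", "visit", "blood-test", "release"): 10,
    ("register", "blood-test", "visit", "release"): 8,
    ("register", "visit", "release"): 20,
    ("register", "visit", "blood-test", "blood-test", "release"): 5,
}


def build_cases(variant_counts):
    """Expand variant frequencies into one trace per (hypothetical) case id."""
    cases = {}
    for variant, count in variant_counts.items():
        for _ in range(count):
            cases[f"case-{len(cases) + 1}"] = variant
    return cases


def variant_frequencies(cases):
    """Count how many cases follow each trace variant."""
    return Counter(cases.values())


cases = build_cases(TABLE_1)
frequencies = variant_frequencies(cases)
for variant, count in frequencies.most_common():
    print(count, "<" + ", ".join(variant) + ">")
```

Because every case maps to exactly one trace, the frequencies sum to the number of cases, which matches the assumption that one case cannot have more than one trace variant.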

To achieve differential privacy for trace variants, the state-of-the-art approach [12] inserts noise drawn from a Laplacian distribution into the variant distribution obtained from an event log. This approach has several drawbacks, including: (1) introducing fake variants, (2) removing frequent true variants, and (3) a limited length for generated trace variants. A recent work called SaCoFa [9] attempts to mitigate drawbacks (1) and (2) by gaining knowledge regarding the underlying process semantics from the original event data. However, the privacy quantification of all extra queries needed to gain this knowledge is not discussed. Moreover, the third drawback remains since this work, similar to [12], employs a prefix-based approach. Prefix-based approaches need to generate all possible unique variants based on a set of activities to provide differential privacy for the original distribution of variants. Since the set of possible trace variants that can be generated from a given set of activities is infinite, prefix-based techniques need to bound the length of generated sequences. Also, to limit the search space, these approaches typically include a pruning parameter to exclude less frequent prefixes.
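To illustrate why a length bound and pruning are needed, the sketch below counts the candidate sequences over a given activity set up to a maximum length. It is only the unpruned upper bound; the techniques in [9] and [12] prune prefixes, so the space they actually explore is smaller. The function name `candidate_count` is an assumption made for this example.

```python
def candidate_count(num_activities: int, max_length: int) -> int:
    """Number of non-empty sequences of length <= max_length over the activity set."""
    return sum(num_activities ** length for length in range(1, max_length + 1))


# Table 1 uses four activities: register, visit, blood-test, release.
for bound in (5, 10, 20):
    print(bound, candidate_count(4, bound))
```

Even for four activities the count grows geometrically with the bound, and it diverges when no bound is imposed.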

We introduce an (ε, δ)-DP approach for releasing the distribution of trace variants that addresses the aforementioned drawbacks. In contrast to the prefix-based approaches, the underlying algorithm is based on (ε, δ)-DP for partition selection, which allows for a direct publication of arbitrarily long sequences [4]. Employing differentially private partition selection techniques, the actual frequencies of all trace variants can be queried directly without guessing (generating) trace variants. Internally, random noise drawn from a specific geometric distribution is injected into the corresponding frequencies, and all variants whose privatized frequencies fall below a threshold are removed. Hence, no fake trace variants are introduced, and only some infrequent variants may disappear from the output. Moreover, no tedious fine-tuning has to be conducted and no computationally expensive search needs to be included. In Section 5, we introduce different metrics to evaluate the data and result utility preservation of our approach. We also run our experiments for the state-of-the-art prefix-based methods and show superior data and result utilities compared to these methods.
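The noise-and-threshold idea described above can be sketched as follows. This is an illustration under stated assumptions, not the TraVaS algorithm: it uses plain (untruncated) two-sided geometric noise, and `threshold` is a hypothetical free parameter rather than a value derived from ε and δ, so the sketch makes no claim of a formal (ε, δ)-DP guarantee. The names `two_sided_geometric` and `select_variants` are likewise assumptions.

```python
import math
import random


def two_sided_geometric(epsilon, rng):
    """Sample integer noise with Pr[X = k] proportional to exp(-epsilon * |k|)."""
    def geometric():
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        return math.floor(-math.log(u) / epsilon)
    # The difference of two i.i.d. geometric variables is two-sided geometric.
    return geometric() - geometric()


def select_variants(frequencies, epsilon, threshold, rng=None):
    """Add noise to each true variant count and drop counts below the threshold.

    Only variants present in `frequencies` are considered, so no fake
    variants can appear. Keeping counts equal to the threshold is an
    assumption of this sketch.
    """
    rng = rng or random.Random()
    released = {}
    for variant, count in frequencies.items():
        noisy = count + two_sided_geometric(epsilon, rng)
        if noisy >= threshold:
            released[variant] = noisy
    return released


# Frequencies from Table 1; epsilon and threshold are arbitrary example values.
table_1 = {
    ("register", "visit", "blood-test", "release"): 10,
    ("register", "blood-test", "visit", "release"): 8,
    ("register", "visit", "release"): 20,
    ("register", "visit", "blood-test", "blood-test", "release"): 5,
}
released = select_variants(table_1, epsilon=1.0, threshold=3, rng=random.Random(0))
print(released)
```

The output is always a subset of the input variants, which reflects the property that variants can be removed but never invented.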

The remainder of this paper is structured as follows. In Section 2, we provide a summary of related work. Preliminaries and notations are provided in Section 3. Section 4 introduces the theoretical background of differentially private partition selection, and describes our TraVaS algorithm. In Section 5, the experimental results based on real-life event logs are shown. Section 6 concludes the paper.

2 Related Work

The research area of privacy and confidentiality in process mining has recently been growing in importance. Several techniques have been proposed to address privacy and confidentiality issues. In this paper, our focus is on the so-called noise-based techniques that are based on the notion of differential privacy. In [12], the authors apply an (ε, δ)-DP mechanism to event logs to privatize directly-follows relations and trace variants. The underlying principle uses a combination of an (ε, δ)-DP noise generator and an iterative query engine that allows an anonymized publication of trace variants with an upper bound for their length. SaCoFa [9] is the most recent extension of the aforementioned (ε, δ)-DP mechanism and attempts to optimize the query structures with the help of underlying semantics. Another extension of [12] is the PRIPEL approach, where more event attributes can be secured using the so-called sequence enrichment [8].

Whereas most of the aforementioned ideas target raw event logs, in [7], the focus is on directly-follows graphs. During the edge generation, connections are randomized using (ε, δ)-DP mechanisms to balance utility preservation and privacy risks. As the main benchmark model for our work, we choose the technique by Mannhardt et al. [12] since it focuses on trace variants and is the basis of most of the other techniques. Moreover, its privacy guarantees are directly proven by (ε, δ)-DP mechanisms, i.e., no extra privacy analysis is required. Nevertheless, we also compare our results with SaCoFa as the most recent extension of the benchmark to demonstrate the superior performance of our approach.

3 Preliminaries

In this section, we introduce the necessary mathematical concepts and definitions utilized throughout the remainder of the paper. Let $A$ be a set. $\mathcal{B}(A)$ is the set of all multisets over $A$. A multiset $A$ can be represented as a set of tuples $\{(a, A(a)) \mid a \in A\}$ where $A(a)$ is the frequency of $a \in A$. Given $A$ and $B$ as two multisets, $A \uplus B$ is the sum over multisets, e.g., $[a^2, b^3] \uplus [b^2, c^2] = [a^2, b^5, c^2]$. We define a finite sequence over $A$ of length $n$ as $\sigma = \langle a_1, a_2, \ldots, a_n \rangle$ where $\sigma(i) = a_i \in A$ for all $i \in \{1, 2, \ldots, n\}$. The set of all finite sequences over $A$ is denoted with $A^*$.
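The multiset sum and the sequence notation can be mirrored with Python's standard library, as a small illustration. Using `collections.Counter` for multisets and tuples for sequences is a choice made for this sketch, not something prescribed by the paper.

```python
from collections import Counter

# Multisets [a^2, b^3] and [b^2, c^2] as element -> frequency mappings.
left = Counter({"a": 2, "b": 3})
right = Counter({"b": 2, "c": 2})

# Counter addition sums the frequencies element-wise, like the multiset sum.
total = left + right
print(total)

# A finite sequence sigma = <a1, a2, a3>; sigma(i) is 1-indexed in the text,
# so it corresponds to sigma[i - 1] here.
sigma = ("a1", "a2", "a3")
print(sigma[0], len(sigma))
```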

3.1 Event Data

The data used by process mining techniques are typically collections of unique

events that are recorded per activity execution and characterized by their at-
