TraVaS: Differentially Private Trace Variant
Selection for Process Mining*
Majid Rafiei, Frederik Wangelik, and Wil M.P. van der Aalst
Chair of Process and Data Science, RWTH Aachen University, Aachen, Germany
Abstract. In the area of industrial process mining, privacy-preserving
event data publication is becoming increasingly relevant. Consequently,
the trade-off between high data utility and quantifiable privacy poses
new challenges. State-of-the-art research mainly focuses on differentially
private trace variant construction based on prefix expansion methods.
However, these algorithms face several practical limitations such as high
computational complexity, introducing fake variants, removing frequent
variants, and a bounded variant length. In this paper, we introduce a
new approach for direct differentially private trace variant release which
uses anonymized partition selection strategies to overcome the afore-
mentioned limitations. Experimental results on real-life event data show
that our algorithm outperforms state-of-the-art methods in terms of both
plain data utility and result utility preservation.
Keywords: Process Mining · Differential Privacy · Event Data
1 Introduction
In recent years, process mining and event data analysis have been successfully
deployed in many industries. The main objectives are to learn process models
from event logs for further behavioral inference (so-called process discovery), to
extend existing models using event logs (so-called model enhancement), or to
assess the alignment between a process model and an event log (so-called con-
formance checking) [2]. However, often the underlying event data are bound to
personal identifiers or other private information. A prominent example is the pro-
cess management of hospitals where the cases are patients being treated by staff.
Without means of privacy protection, any adversary is able to extract sensitive
information about individuals and their properties. Thus, privacy regulations,
such as GDPR [1], typically restrict data storage and access which motivates the
development of privacy preservation techniques.
The majority of state-of-the-art privacy preservation techniques are built on
Differential Privacy (DP), which offers a noise-based privacy definition. This is
due to its important features, such as providing mathematical privacy guaran-
tees and security against predicate-singling-out attacks [3]. The goal of techniques
based on DP is to hide the participation of an individual in the released output
* Funded under the Excellence Strategy of the Federal Government and the Länder. We also thank
the Alexander von Humboldt Stiftung for supporting our research.
arXiv:2210.14951v1 [cs.CR] 20 Oct 2022
Table 1: A simple event log from the healthcare context including trace variants and their frequencies.
Trace Variant                                           Frequency
⟨register, visit, blood-test, release⟩                  10
⟨register, blood-test, visit, release⟩                  8
⟨register, visit, release⟩                              20
⟨register, visit, blood-test, blood-test, release⟩      5
by injecting noise. The amount of noise is mainly determined by the privacy
parameters, ε and δ, and the sensitivity of the underlying data. State-of-the-art
research targeting (ε, δ)-DP methods in process mining focuses on releasing raw
privatized activity sequences performed for cases, i.e., trace variants. Table 1
shows a sample of such event data in the healthcare context, where each trace
variant belongs to a case, i.e., a patient, and one case cannot have more than one
trace variant. This format describes the control-flow of event logs that is the basis
for the main process mining activities. The trace variant of a case is considered
sensitive information because it contains the complete sequence of activities per-
formed for the case that can be exploited to conclude private information, e.g.,
patient diseases in the healthcare context.
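As an illustration of the format in Table 1, the following minimal Python sketch derives variant frequencies from a mapping of cases to traces. The case identifiers and the helper name `variant_frequencies` are hypothetical and only serve the example; they are not part of the original event log.

```python
from collections import Counter

def variant_frequencies(case_traces):
    """Count how many cases share each trace variant.

    case_traces maps a case id to its sequence of activities; every case
    contributes exactly one trace variant.
    """
    return Counter(tuple(trace) for trace in case_traces.values())

# Frequencies as listed in Table 1.
table1 = {
    ("register", "visit", "blood-test", "release"): 10,
    ("register", "blood-test", "visit", "release"): 8,
    ("register", "visit", "release"): 20,
    ("register", "visit", "blood-test", "blood-test", "release"): 5,
}

# Hypothetical cases (one per patient) that reproduce those frequencies.
case_traces = {}
for variant, frequency in table1.items():
    for _ in range(frequency):
        case_traces[f"patient-{len(case_traces)}"] = list(variant)

frequencies = variant_frequencies(case_traces)
```

Because each case maps to exactly one variant, the frequencies sum to the number of cases.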
To achieve differential privacy for trace variants, the state-of-the-art ap-
proach [12] inserts noise drawn from a Laplacian distribution into the variant
distribution obtained from an event log. This approach has several drawbacks
including: (1) introducing fake variants, (2) removing frequent true variants, and
(3) limited length for generated trace variants. A recent work called SaCoFa [9]
attempts to mitigate drawbacks (1) and (2) by gaining knowledge regarding
the underlying process semantics from original event data. However, the privacy
quantification of all extra queries to gain knowledge regarding the underlying
semantics is not discussed. Moreover, the third drawback still remains since
this work, similar to [12], employs a prefix-based approach. The prefix-based ap-
proaches need to generate all possible unique variants based on a set of activities
to provide differential privacy for the original distribution of variants. Since the
set of possible trace variants that can be generated given a unique set of activi-
ties is infinite, the prefix-based techniques need to bound the length of generated
sequences. Also, to limit the search space these approaches typically include a
pruning parameter to exclude less frequent prefixes.
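To illustrate why prefix-based techniques need a length bound, the sketch below enumerates and counts the candidate sequences that can be formed from a set of activities up to a maximal length. The names `enumerate_candidates`, `count_candidates`, and `max_length` are assumptions made for this example; the pruning step of the cited approaches is not modeled.

```python
from itertools import product

def enumerate_candidates(activities, max_length):
    """Yield every sequence over `activities` with length 1..max_length."""
    ordered = sorted(activities)
    for length in range(1, max_length + 1):
        yield from product(ordered, repeat=length)

def count_candidates(num_activities, max_length):
    """Size of the candidate space: sum of |A|^i for i = 1..max_length."""
    return sum(num_activities ** i for i in range(1, max_length + 1))

# The four activities that occur in Table 1.
activities = {"register", "visit", "blood-test", "release"}
candidates = list(enumerate_candidates(activities, 3))
```

With the four activities of Table 1, a bound of 5 (the length of the longest variant in the table) already yields 1364 candidates, whereas only four variants actually occur.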
We introduce an (ε, δ)-DP approach for releasing the distribution of trace
variants that focuses on the aforementioned drawbacks. In contrast to the prefix-
based approaches, the underlying algorithm is based on (ε, δ)-DP for partition
selection that allows for a direct publication of arbitrarily long sequences [4]. Em-
ploying differentially private partition selection techniques, the actual frequencies
of all trace variants can directly be queried without guessing (generating) trace
variants. Internally, random noise drawn from a specific geometric distribution
is injected into the corresponding frequencies, and all variants whose privatized
frequencies fall below a threshold are removed. Hence, no fake trace variants are
introduced, and only some infrequent variants may disappear from the output.
Moreover, no tedious fine-tuning has to be conducted and no computationally
expensive search needs to be included. In Section 5, we introduce different met-
rics to evaluate the data and result utility preservation of our approach. We
also run our experiments for the state-of-the-art prefix-based methods and show
superior data and result utilities compared to these methods.
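The noise-and-threshold idea described above can be sketched as follows. This is a simplified illustration, not the TraVaS algorithm itself: the noise is an untruncated two-sided geometric distribution used as a stand-in for the specific distribution mentioned in the text, and `threshold` is a hypothetical parameter that is not derived from (ε, δ), so the sketch carries no formal privacy guarantee. The function names are likewise assumptions.

```python
import math
import random

def two_sided_geometric(epsilon, rng):
    """Sample integer noise with P(k) proportional to exp(-epsilon * |k|).

    Stand-in noise: the difference of two i.i.d. geometric variables.
    """
    alpha = math.exp(-epsilon)

    def geometric():
        u = 1.0 - rng.random()  # uniform in (0, 1]
        return math.floor(math.log(u) / math.log(alpha))

    return geometric() - geometric()

def select_variants(frequencies, epsilon, threshold, rng):
    """Add noise to each true frequency and keep variants at or above threshold.

    Only variants present in `frequencies` are queried, so the output can
    never contain a variant that is absent from the input.
    `threshold` is hypothetical and NOT calibrated to (epsilon, delta).
    """
    released = {}
    for variant, frequency in frequencies.items():
        noisy = frequency + two_sided_geometric(epsilon, rng)
        if noisy >= threshold:
            released[variant] = noisy
    return released

# Frequencies as listed in Table 1.
frequencies = {
    ("register", "visit", "blood-test", "release"): 10,
    ("register", "blood-test", "visit", "release"): 8,
    ("register", "visit", "release"): 20,
    ("register", "visit", "blood-test", "blood-test", "release"): 5,
}
rng = random.Random(0)  # fixed seed only to make the example repeatable
released = select_variants(frequencies, epsilon=1.0, threshold=7, rng=rng)
```

The loop only iterates over variants that exist in the input, which is why this style of release cannot introduce fake variants; variants with small counts are the ones most likely to fall below the threshold.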
The remainder of this paper is structured as follows. In Section 2, we provide a
summary of related work. Preliminaries and notations are provided in Section 3.
Section 4 introduces the theoretical background of differentially private partition
selection, and describes our TraVaS algorithm. In Section 5, the experimental
results based on real-life event logs are shown. Section 6 concludes the paper.
2 Related Work
The research area of privacy and confidentiality in process mining has recently
been growing in importance. Several techniques have been proposed to address the pri-
vacy and confidentiality issues. In this paper, our focus is on the so-called noise-
based techniques that are based on the notion of differential privacy. In [12],
the authors apply an (ε, δ)-DP mechanism to event logs to privatize directly-
follows relations and trace variants. The underlying principle uses a combina-
tion of an (ε, δ)-DP noise generator and an iterative query engine that allows an
anonymized publication of trace variants with an upper bound for their length.
SaCoFa [9] is the most recent extension of the aforementioned (ε, δ)-DP mecha-
nism that attempts to optimize the query structures with the help of underlying
semantics. Another extension of [12] is the PRIPEL approach, where more event
attributes can be secured using the so-called sequence enrichment [8].
Whereas most of the aforementioned ideas target raw event logs, in [7], the
focus is on directly-follows graphs. During the edge generation, connections are
randomized using (ε, δ)-DP mechanisms to balance utility preservation and pri-
vacy risks. As the main benchmark model for our work, we choose the technique
by Mannhardt et al. [12] since it focuses on trace variants and is the basis of most
of the other techniques. Moreover, its privacy guarantees are directly proven by
(ε, δ)-DP mechanisms, i.e., no extra privacy analysis is required. Nevertheless,
we also compare our results with SaCoFa as the most recent extension of the
benchmark to demonstrate the superior performance of our approach.
3 Preliminaries
In this section, we introduce the necessary mathematical concepts and definitions
utilized throughout the remainder of the paper. Let A be a set. B(A) is the
set of all multisets over A. A multiset A can be represented as a set of tuples
{(a, A(a)) | a ∈ A} where A(a) is the frequency of a ∈ A. Given A and B as two
multisets, A ⊎ B is the sum over multisets, e.g., [a^2, b^3] ⊎ [b^2, c^2] = [a^2, b^5, c^2]. We
define a finite sequence over A of length n as σ = ⟨a_1, a_2, ..., a_n⟩ where σ(i) = a_i ∈ A
for all i ∈ {1, 2, ..., n}. The set of all finite sequences over A is denoted with A*.
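The multiset sum and the sequence notation can be made concrete with a short Python sketch. Representing multisets as `collections.Counter` objects and sequences as tuples is a choice made for this example, and the helper `at` is hypothetical; it only mirrors the 1-based indexing σ(i) used in the text.

```python
from collections import Counter

# Multisets as Counters: [a^2, b^3] and [b^2, c^2].
left = Counter({"a": 2, "b": 3})
right = Counter({"b": 2, "c": 2})

# Counter addition adds frequencies element-wise, i.e., the multiset sum.
total = left + right

# A finite sequence over a set of activities, represented as a tuple.
sigma = ("register", "visit", "release")

def at(sequence, i):
    """Return sigma(i) using the 1-based index of the text."""
    return sequence[i - 1]
```

Here `total` corresponds to [a^2, b^5, c^2] from the example above, and `at(sigma, 1)` is the first activity of the sequence.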
3.1 Event Data
The data used by process mining techniques are typically collections of unique
events that are recorded per activity execution and characterized by their at-