
2 Majid Rafiei et al.
Table 1: A simple event log from the healthcare context including trace variants and their frequencies.
Trace Variant Frequency
hregister, visit, blood-test, releasei10
hregister, blood-test, visit, releasei8
hregister, visit, releasei20
hregister, visit, blood-test, blood-test, releasei5
by injecting noise. The amount of noise is mainly determined by the privacy
parameters, and δ, and the sensitivity of the underlying data. State-of-the-art
research targeting (, δ)-DP methods in process mining focuses on releasing raw
privatized activity sequences performed for cases, i.e., trace variants. Table 1
shows a sample of such event data in the healthcare context, where each trace
variant belongs to a case, i.e., a patient, and one case cannot have more than one
trace variant. This format describes the control-flow of event logs that is basis
for the main process mining activities. The trace variant of a case is considered
sensitive information because it contains the complete sequence of activities per-
formed for the case that can be exploited to conclude private information, e.g.,
patient diseases in the healthcare context.
To achieve differential privacy for trace variants, the state-of-the-art ap-
proach [12] inserts noise drawn from a Laplacian distribution into the variant
distribution obtained from an event log. This approach has several drawbacks
including: (1) introducing fake variants, (2) removing frequent true variants, and
(3) limited length for generated trace variants. A recent work called SaCoFa [9],
attempts to mitigate drawbacks (1) and (2) by gaining knowledge regarding
the underlying process semantics from original event data. However, the privacy
quantification of all extra queries to gain knowledge regarding the underlying
semantics is not discussed. Moreover, the third drawback still remains since
this work, similar to [12], employs a prefix-based approach. The prefix-based ap-
proaches need to generate all possible unique variants based on a set of activities
to provide differential privacy for the original distribution of variants. Since the
set of possible trace variants that can be generated given a unique set of activi-
ties is infinite, the prefix-based techniques need to bound the length of generated
sequences. Also, to limit the search space these approaches typically include a
pruning parameter to exclude less frequent prefixes.
We introduce an (, δ)-DP approach for releasing the distribution of trace
variants that focuses on the aforementioned drawbacks. In contrast to the prefix-
based approaches, the underlying algorithm is based on (, δ)-DP for partition
selection that allows for a direct publication of arbitrarily long sequences [4]. Em-
ploying differentially private partition selection techniques, the actual frequencies
of all trace variants can directly be queried without guessing (generating) trace
variants. Internally, random noise drawn from a specific geometric distribution
is injected into the corresponding frequencies, and all variants whose privatized
frequencies fall beyond a threshold are removed. Hence, no fake trace variants are
introduced, and only some infrequent variants may disappear from the output.
Moreover, no tedious fine-tuning has to be conducted and no computationally
expensive search needs to be included. In Section 5, we introduce different met-
rics to evaluate the data and result utility preservation of our approach. We