idea is to reach a middle ground between
DPT
and
AdaTrace
.
To this end, we employ only the first and second-order Markov
chain model, for three reasons. First, previous studies have
shown that the second-order Markov chain model can achieve
promising accuracy for next step prediction [22,32,41]. Sec-
ond, Markov chain model of higher order is too space and
time-consuming. A
k
-order Markov chain model with
m
states
needs
mk+1
transition probability values. When
m
is 300, and
k
is 3, this requires 30GB of storage. Third, to satisfy DP,
building more models will introduce more noise. Although
more information is extracted, the amount of noise also grows.
As a result, after some specific order of the Markov chain
model, the overall information quality will decrease. In prac-
tice, we observe the threshold is between the second-order
and third-order.
To generate high-quality trajectories, we use both the first-
order model and the second-order model in our random-walk-
based generation process. Every time to predict a state as
the next step, we select between the first-order model and
the second-order model. The selection principle contains two
aspects of considerations: the effect of noise on different
models and the confidence ability of models.
The standard Markov chain model does not have the infor-
mation about where the trajectory starts and ends. We add
virtual start and end in every trajectory to record this informa-
tion. However, due to some actions to bound the sensitivity,
the recorded information is biased. To reduce the bias, we
propose an optimization-based method to estimate the trip
distribution. The optimization problem is built based on the
observation that a trajectory only contributes one to the count
related to the virtual start and end. The experimental results
show that our estimation method is effective, especially when
the privacy budget is small.
We conduct empirical experiments on both synthetic and
real-world trajectory datasets to compare
PrivTrace
with the
state-of-the-art.
PrivTrace
consistently outperforms the ex-
isting methods for a variety of metrics. We further conduct
comprehensive ablation studies to illustrate the effectiveness
of the three main components of
PrivTrace
. One limitation
of our approach is that the accuracy of the generated long
trajectories is worse than that of the short trajectories. This
is due to the fact that long trajectories are more complicated
and more challenging to generate correctly.
To summarize, the main contributions of this paper are
two-fold:
•
We propose a new trajectory data synthesis method
PrivTrace
. Its key insight is to exploit both the first-order
and second-order Markov chain models.
PrivTrace
is built
with a new method to choose between the first-order and
second-order transition information for next-step prediction,
and a new method to estimate the trip distribution from the
first-order Markov chain model without consuming extra
privacy budget.
•
We conduct extensive experiments on both synthetic and
real-world trajectory datasets to validate the effectiveness of
PrivTrace. Our code is open-sourced at https://github.
com/DpTrace/PrivTrace.
Roadmap.
In Section 2, we give the definition of the prob-
lem and present background knowledge of DP. Then we show
the framework of trajectory data generation Section 3. Follow-
ing this framework, we provide the design of our method in
Section 4. The experimental results are presented in Section 5
with discussions in Section 6.
Finally, we discuss related work in Section 7 and provide
concluding remarks in Section 8.
2 Preliminaries
2.1 Problem Definition
In this paper, we consider a dataset consisting of a set of tra-
jectories. Each trajectory is composed of a sequence of points.
We are interested in the following question: Given a sensitive
trajectory dataset
Do
, how to generate a synthetic trajectory
dataset
Ds
that shares similar properties with
Do
while satisfy-
ing DP. Generating the synthetic trajectory dataset facilitates
down-stream data analysis tasks without modifications.
Following prior work [25], we use four statistical metrics to
measure the similarity between
Ds
and
Do
: length distribution,
diameter distribution, trajectory density, and transition pattern.
The trajectory length is the total distance of a trajectory, which
can be used to study the commute distance of people. The
trajectory diameter measures the largest Euclidean distance
between any two points in a trajectory, which gives infor-
mation about the range of individual’s activities. The length
distribution and diameter distribution capture the frequency
of different trajectory lengths and trajectory diameters in a
trajectory dataset, respectively. We use Jensen-Shannon diver-
gence (
JSD
) [34] to measure the similarity of distributions
between
Do
and
Ds
. A smaller
JSD
means the generated
dataset Dsis more similar to the original Do.
The trajectory density calculates the number of trajectories
that pass through a given area on the map, which can be a
good indicator for urban planning, such as estimating the flow
of traffic in a specific area. The transition pattern captures the
frequency of transiting from one place to another, which can
help solve problems like next location prediction. We use the
average relative error (
ARE
) of different randomly generated
queries of trajectory density and transition pattern to measure
their similarity between
Do
and
Ds
. A smaller
ARE
implies
Dsis more similar to Do.
2.2 Markov Chain Model
To generate a synthetic dataset, a commonly used method is
the Markov chain model. A Markov chain model is a stochas-
tic model describing a sequence of possible events in which
the probability of each event depends only on the state at-
tained in the previous events. It is frequently used to analyze
2