PrivTrace: Differentially Private Trajectory Synthesis by Adaptive Markov Models
Haiming Wang (1), Zhikun Zhang (2), Tianhao Wang (3), Shibo He (1),
Michael Backes (2), Jiming Chen (1), Yang Zhang (2)
(1) Zhejiang University
(2) CISPA Helmholtz Center for Information Security
(3) University of Virginia
arXiv:2210.00581v2 [cs.CR] 5 Oct 2022
Abstract
Publishing trajectory data (individual's movement information)
is very useful, but it also raises privacy concerns. To
handle the privacy concerns, in this paper, we apply differential
privacy, the standard technique for data privacy, together
with the Markov chain model, to generate synthetic trajectories.
We observe that existing studies all use the Markov chain model
and thus propose a framework to analyze the usage of the
Markov chain model in this problem. Based on the analysis,
we come up with an effective algorithm, PrivTrace, that uses
the first-order and second-order Markov models adaptively.
We evaluate PrivTrace and existing methods on synthetic
and real-world datasets to demonstrate the superiority of our
method.
1 Introduction
Trajectory data analysis plays an important role in tasks re-
lated to the social benefit such as urban planning and intelli-
gent transportation [4,46]. The trajectory data used usually
comes from users who carry mobile devices to record their
locations. However, the sensitive nature of the trajectory data
also gives rise to privacy risks when being shared. Recent re-
search demonstrated effective privacy attacks despite aggrega-
tion or anonymization, such as trajectory reconstruction [25],
de-anonymization [11], and membership inference [12,39].
Without a sufficient privacy protection technique, trajectory
data analysis is not possible since users will refuse to share
their data.
A promising technique to overcome the privacy concern is
differential privacy (DP) [17,18,47,55], which has become
the golden standard in the privacy community. Intuitively, on
a given dataset, DP defines a random transformation process
(algorithm) so that the output of the transformation contains
some “noise” that can “cover” the existence of any possible
record in the dataset. DP can be used by both companies and
government agencies. With DP protection, hopefully, more
users will be willing to share their data. From the government
side, agencies can also share more data for the public good.
(Zhikun Zhang is the corresponding author.)
Most of the previous studies on differentially private tra-
jectory data analysis focus on designing tailored algorithms
for specific tasks, such as community discovering [52], par-
ticipatory sensing [33], and recommendation [49]. Our paper
focuses on a more general approach, where we publish a syn-
thetic trajectory dataset that shares similar properties with the
original one while satisfying DP. While publishing synthetic
datasets would not work better than directly publishing aggre-
gate statistics [42], this paradigm can easily enable any (even
unseen) down-stream data analysis tasks without modifying
existing algorithms and has been adopted in, e.g., the 2020
US Census data publication [10] and a dataset for population
flow analysis [29]. There are several recent studies focusing
on generating data satisfying DP, and they are typically com-
posed of two steps: model learning and generation [36,53,56].
In the scenario of trajectory data, because the information is
typically continuous, existing works [25,28] also need to first
discretize the data upfront.
Existing Work.
The first work to generate synthetic trajecto-
ries with DP is
DPT
by He et al. [28]. It uses a set of uniform
grids with different granularities to discretize the space and es-
tablishes multiple differentially private prefix trees [13] (each
with a different grid) to model transitions; then generates data
by random walk on the prefix trees. More recently, Gursoy et
al. [25] proposed
AdaTrace
. It discretizes the geographical
space into two kinds of grids using different granularities.
Then a first-order Markov chain model and three other important
features are learned based on the discretization results.
The data is then generated by random walk on the first-order
Markov chain model with the help of the extracted features. The
drawback of
AdaTrace
is that information contained in the
first-order Markov chain model is not enough to generate data
of high quality.
Our Contributions.
Existing approaches either get only
the first-order Markov chain model, which fails to obtain
enough transition information, or get a high-order Markov
chain model, which introduces excessive noise due to DP. Our
idea is to reach a middle ground between DPT and AdaTrace.
To this end, we employ only the first and second-order Markov
chain model, for three reasons. First, previous studies have
shown that the second-order Markov chain model can achieve
promising accuracy for next step prediction [22,32,41]. Sec-
ond, Markov chain model of higher order is too space and
time-consuming. A
k
-order Markov chain model with
m
states
needs
mk+1
transition probability values. When
m
is 300, and
k
is 3, this requires 30GB of storage. Third, to satisfy DP,
building more models will introduce more noise. Although
more information is extracted, the amount of noise also grows.
As a result, after some specific order of the Markov chain
model, the overall information quality will decrease. In prac-
tice, we observe the threshold is between the second-order
and third-order.
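The storage estimate above can be checked with a quick calculation (the 4 bytes per probability value is our assumption, not stated in the text):

```python
# Back-of-envelope storage for a k-order Markov model with m states:
# the dense transition table holds m^(k+1) probabilities
# (m^k contexts, each with m possible successors).
def markov_table_bytes(m: int, k: int, bytes_per_entry: int = 4) -> int:
    """Size in bytes of a dense k-order transition table over m states."""
    return m ** (k + 1) * bytes_per_entry

# m = 300 states, k = 3: roughly the 30GB figure quoted in the text.
size_gb = markov_table_bytes(300, 3) / 1e9
print(f"{size_gb:.1f} GB")  # prints 32.4 GB
```

By contrast, the second-order table for the same 300 states needs only about 108MB, which is why the second order is still practical.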
To generate high-quality trajectories, we use both the first-
order model and the second-order model in our random-walk-
based generation process. Every time to predict a state as
the next step, we select between the first-order model and
the second-order model. The selection principle involves two
considerations: the effect of noise on the different
models and the prediction confidence of the models.
The standard Markov chain model does not carry information
about where a trajectory starts and ends. We add a
virtual start and end state to every trajectory to record this
information. However, due to the steps taken to bound the
sensitivity, the recorded information is biased. To reduce the bias, we
propose an optimization-based method to estimate the trip
distribution. The optimization problem is built based on the
observation that a trajectory only contributes one to the count
related to the virtual start and end. The experimental results
show that our estimation method is effective, especially when
the privacy budget is small.
We conduct empirical experiments on both synthetic and
real-world trajectory datasets to compare
PrivTrace
with the
state-of-the-art.
PrivTrace
consistently outperforms the ex-
isting methods for a variety of metrics. We further conduct
comprehensive ablation studies to illustrate the effectiveness
of the three main components of
PrivTrace
. One limitation
of our approach is that the accuracy of the generated long
trajectories is worse than that of the short trajectories. This
is due to the fact that long trajectories are more complicated
and more challenging to generate correctly.
To summarize, the main contributions of this paper are
two-fold:

• We propose a new trajectory data synthesis method, PrivTrace.
Its key insight is to exploit both the first-order
and second-order Markov chain models. PrivTrace is built
with a new method to choose between the first-order and
second-order transition information for next-step prediction,
and a new method to estimate the trip distribution from the
first-order Markov chain model without consuming extra
privacy budget.

• We conduct extensive experiments on both synthetic and
real-world trajectory datasets to validate the effectiveness of
PrivTrace. Our code is open-sourced at
https://github.com/DpTrace/PrivTrace.
Roadmap.
In Section 2, we give the definition of the problem
and present background knowledge of DP. Then we show
the framework of trajectory data generation in Section 3.
Following this framework, we provide the design of our method in
Section 4. The experimental results are presented in Section 5
with discussions in Section 6.
Finally, we discuss related work in Section 7 and provide
concluding remarks in Section 8.
2 Preliminaries
2.1 Problem Definition
In this paper, we consider a dataset consisting of a set of tra-
jectories. Each trajectory is composed of a sequence of points.
We are interested in the following question: given a sensitive
trajectory dataset D_o, how can we generate a synthetic trajectory
dataset D_s that shares similar properties with D_o while
satisfying DP? Generating the synthetic trajectory dataset facilitates
down-stream data analysis tasks without modifications.
Following prior work [25], we use four statistical metrics to
measure the similarity between D_s and D_o: length distribution,
diameter distribution, trajectory density, and transition pattern.
The trajectory length is the total distance of a trajectory, which
can be used to study the commute distance of people. The
trajectory diameter measures the largest Euclidean distance
between any two points in a trajectory, which gives information
about the range of an individual's activities. The length
distribution and diameter distribution capture the frequency
of different trajectory lengths and trajectory diameters in a
trajectory dataset, respectively. We use the Jensen-Shannon
divergence (JSD) [34] to measure the similarity of distributions
between D_o and D_s. A smaller JSD means the generated
dataset D_s is more similar to the original D_o.

The trajectory density counts the number of trajectories
that pass through a given area on the map, which can be a
good indicator for urban planning, such as estimating the flow
of traffic in a specific area. The transition pattern captures the
frequency of transiting from one place to another, which can
help solve problems like next location prediction. We use the
average relative error (ARE) of randomly generated
queries of trajectory density and transition pattern to measure
their similarity between D_o and D_s. A smaller ARE implies
D_s is more similar to D_o.
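For concreteness, minimal implementations of the two similarity measures might look like the sketch below; the function names and the `sanity_bound` guard against division by zero are our assumptions, not details from the paper.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base-2 logs)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):  # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def are(true_answers, synthetic_answers, sanity_bound=1e-9):
    """Average relative error of query answers on synthetic vs. original data."""
    return sum(abs(t - s) / max(t, sanity_bound)
               for t, s in zip(true_answers, synthetic_answers)) / len(true_answers)

print(jsd([0.5, 0.5], [0.5, 0.5]))  # 0.0 for identical distributions
```

With base-2 logarithms the JSD lies in [0, 1], so smaller values directly indicate more similar length or diameter distributions.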
2.2 Markov Chain Model
To generate a synthetic dataset, a commonly used method is
the Markov chain model. A Markov chain model is a stochas-
tic model describing a sequence of possible events in which
the probability of each event depends only on the state at-
tained in the previous events. It is frequently used to analyze
sequential data such as location trajectories [5,15,22,24,35]
and natural language [19,31,38,44]. Formally, the Markov
chain model is defined as follows:
Definition 1 (Markov Chain Model). Given a finite set
of discrete states Σ = {σ_1, σ_2, σ_3, ..., σ_k}, a sequence
T = (λ_1, λ_2, λ_3, ..., λ_s) is said to follow a k-th-order Markov
process if, for all i ≥ k and all σ_j ∈ Σ,

Pr[λ_{i+1} = σ_j | λ_i ... λ_1] = Pr[λ_{i+1} = σ_j | λ_i ... λ_{i-k+1}]

where Pr[λ_{i+1} | λ_i ... λ_1] is called the transition probability
from λ_i ... λ_1 to λ_{i+1}.

Given a dataset D, if r is a subsequence of some T ∈ D, the
empirical transition probability Pr[σ | r] is defined as

Pr[σ | r] = (Σ_{T ∈ D} N_T(rσ)) / (Σ_{T ∈ D} Σ_{x ∈ Σ} N_T(rx))
          = N_D(rσ) / (Σ_{x ∈ Σ} N_D(rx))                     (1)

Here rx is the sequence in which r is followed by x, N_T(rx) is
the total number of occurrences of rx in T, and we denote
Σ_{T ∈ D} N_T(rx) as N_D(rx).
The k-th-order Markov chain model is the Markov chain
model that can provide Pr[σ | r] for any state σ and any
subsequence r of length k. By calculating N_D(rx) for every
possible length-k subsequence r and every x ∈ Σ, we can learn a
k-th-order Markov chain model from the dataset.
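The counting procedure behind Equation 1 can be sketched as follows; the function names and the toy dataset are ours, for illustration only.

```python
from collections import Counter

def learn_markov(dataset, k):
    """Count N_D(r.x) for every length-k context r in the dataset (Equation 1)."""
    counts = Counter()  # maps (r, x) -> N_D(rx)
    for traj in dataset:
        for i in range(len(traj) - k):
            r, x = tuple(traj[i:i + k]), traj[i + k]
            counts[(r, x)] += 1
    return counts

def transition_prob(counts, r, sigma):
    """Empirical Pr[sigma | r] = N_D(r.sigma) / sum over x of N_D(r.x)."""
    total = sum(n for (ctx, _), n in counts.items() if ctx == r)
    return counts[(r, sigma)] / total if total else 0.0

D = [["A", "B", "A", "B"], ["A", "B", "C"]]
model = learn_markov(D, k=1)
print(transition_prob(model, ("A",), "B"))  # prints 1.0: A is always followed by B
```

To make such a model differentially private, Laplace noise is added to each count N_D(rx) before normalization, as described in Section 2.3.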
2.3 Differential Privacy
Definition 2 (ε-Differential Privacy). An algorithm A satisfies
ε-differential privacy (ε-DP), where ε > 0, if and only if
for any two neighboring datasets D and D′, we have

∀O ⊆ Range(A): Pr[A(D) ∈ O] ≤ e^ε · Pr[A(D′) ∈ O],

where Range(A) denotes the set of all possible outputs of the
algorithm A.
In this paper, we consider two datasets D and D′ to be
neighbors, denoted as D ≃ D′, if and only if D = D′ + S or
D′ = D + S, where D + S denotes the dataset resulting from
adding one trajectory S to the dataset D.
Laplacian Mechanism. The Laplace mechanism computes
a function f on an input dataset D while satisfying ε-DP by
adding random noise to f(D). The magnitude of the noise
depends on GS_f, the global L1 sensitivity of f, defined as

GS_f = max_{D ≃ D′} ||f(D) − f(D′)||_1

When f outputs a single element, such a mechanism A is
given below:

A_f(D) = f(D) + L(GS_f / ε)

where L(β) denotes a random variable sampled from
the Laplace distribution with scale parameter β, whose density
is Pr[L(β) = x] = (1 / 2β) · e^(−|x| / β).

When f outputs a vector, A adds independent samples of
L(GS_f / ε) to each element of the vector. The variance of each
such sample is 2GS_f^2 / ε^2.
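A minimal sketch of the vector-valued Laplace mechanism described above. It relies on the standard fact that a Laplace(β) variable equals the difference of two Exp(1/β) variables; the function name is our own.

```python
import random

def laplace_mechanism(values, sensitivity, epsilon, rng=random):
    """Add i.i.d. Laplace(GS_f / epsilon) noise to each element of f(D)."""
    beta = sensitivity / epsilon
    # A Laplace(beta) sample is the difference of two Exp(1/beta) samples.
    return [v + rng.expovariate(1 / beta) - rng.expovariate(1 / beta)
            for v in values]

# e.g. releasing three noisy cell counts with sensitivity 1 and epsilon = 1
noisy = laplace_mechanism([100.0, 120.0, 80.0], sensitivity=1.0, epsilon=1.0)
```

Each released value has variance 2β² = 2·GS_f²/ε², matching the formula above.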
2.4 Composition Properties of DP
The following composition properties are commonly used
for building complex differentially private algorithms from
simpler subroutines.
Sequential Composition. Combining multiple subroutines
that satisfy DP for ε_1, ..., ε_k results in a mechanism that
satisfies ε-DP for ε = Σ_i ε_i.
Post-processing.
Given an
ε
-DP algorithm
A
, releasing
g(A(D))
for any
g
still satisfies
ε
-DP. That is, post-processing
an output of a differentially private algorithm does not incur
any additional privacy concerns.
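As an illustration of sequential composition, a pipeline can split one total budget across its subroutines and still satisfy ε-DP overall. The helper name and the example weights below are hypothetical, not taken from the paper.

```python
def split_budget(total_epsilon, weights):
    """Allocate a total privacy budget across subroutines in proportion to
    the given weights; by sequential composition, running all subroutines
    satisfies total_epsilon-DP overall."""
    s = sum(weights)
    return [total_epsilon * w / s for w in weights]

# e.g. spend more of the budget on model learning than on discretization
eps_disc, eps_first, eps_second = split_budget(1.0, [1, 2, 2])
```

Post-processing then lets the generation step consume the noisy models freely: no further budget is needed there.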
3 Existing Solutions
Before diving into the descriptions of existing methods that
use Markov chain model to generate synthetic trajectories,
we first present a general recipe that we observe in all these
solutions.
3.1 A Framework for Markov-based Trajec-
tory Synthesis
To generate synthetic trajectories using the Markov chain
model, there are three major components: Geographical space
discretization, model learning, and trajectory generation. We
integrate all the components in a general framework:
Discretization.
The purpose of discretization is to create
discrete states for the Markov chain model. We discretize
the continuous geographical space into one or more grids
where each grid partitions the geographical space. After
discretization, each area in the grid is regarded as a state in
the Markov chain model.
Model Learning.
Given a set of geographical states and a
trajectory dataset, we need to learn some models that can
capture the transition pattern of the trajectory dataset. Here,
the model can be a Markov chain model and other informa-
tion (e.g., trip distribution) extracted from the dataset. To
train the Markov chain model, we calculate all NDneeded
in Equation 1 and then add noise to achieve DP.
Generation.
After the model is learned, we can generate
the synthetic trajectory (e.g., by random walk on the model).
The trajectory generation component is a post-processing
step in the context of DP, thus does not have an additional
privacy concern.
Table 1: Summary of existing methods on different steps.

| Method        | Discretization     | Model Learning                         | Generation                       |
|---------------|--------------------|----------------------------------------|----------------------------------|
| AdaTrace [25] | Adaptive partition | First-order Markov chain model         | Markov sampling + auxiliary info |
| DPT [28]      | Reference system   | Multiple prefix trees                  | Markov sampling                  |
| PrivTrace     | Adaptive partition | First-/second-order Markov chain model | Random walk                      |
3.2 Existing Methods
Table 1 summarizes these three phases of existing work. In
what follows, we review these steps for previous studies.
AdaTrace.
This method extracts the first-order Markov chain
model, trip distribution, and length features from the sensitive
dataset with DP, and then generates private trajectories accord-
ing to these features. In discretization,
AdaTrace
proposes to
discretize the geographical space into grids twice with two
different granularities. It first uniformly discretizes the geo-
graphical space into a coarse-grained grid. For the cells in the
first grid with a large number of trajectories passing through,
AdaTrace further discretizes them into finer-grained grids.
In model learning,
AdaTrace
learns a first-order Markov
chain model using the first grid. To better capture the inherent
patterns of the dataset,
AdaTrace
also extracts the length dis-
tribution and trip distribution (count of pairs of starting and
ending point in the trajectory). Noise is added to achieve DP.
In generation, AdaTrace first samples a starting state and
an ending state from the trip distribution, and samples a
trajectory length |T| from the length distribution. Then, AdaTrace
generates the trajectory from the starting state, and repeatedly
generates the next state by random walking on the Markov chain
model. After |T| − 1 steps, the trajectory directly jumps to the
ending state. Finally, for each generated state, AdaTrace
uniformly samples a location point from the corresponding cell.
If the cell is further partitioned, the location point is sampled
according to the density of the second-layer grid.
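The walk-then-jump generation above can be sketched as follows. The dict-of-dicts representation of the Markov model and the function name are assumptions for illustration, not AdaTrace's actual data structures.

```python
import random

def generate_trajectory(start, end, length, markov, rng=random):
    """Random-walk generation in the style described above: walk length-1
    steps from the start state, then jump directly to the end state."""
    traj = [start]
    for _ in range(length - 1):
        nxt = markov.get(traj[-1])   # {next_state: probability}
        if not nxt:                  # no outgoing transitions recorded
            break
        states, probs = zip(*nxt.items())
        traj.append(rng.choices(states, weights=probs, k=1)[0])
    traj.append(end)
    return traj
```

For example, with a deterministic chain {"A": {"B": 1.0}, "B": {"A": 1.0}}, sampling start "A", end "Z", and |T| = 3 walks two steps before jumping to the end state.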
DPT.
In order to capture more precise information,
DPT
uses multiple uniform grids with different granularities. When
discretizing trajectories,
DPT
will choose the most coarse-
grained grid for which there exists a transition for two consec-
utive points of the trajectory (i.e., using a more coarse-grained
grid, the two locations in the trajectory will be in the same
cell). A trajectory will be transformed into multiple segments
of transition, where transitions in the same segment are on
the same grid.
Now for each grid,
DPT
establishes a differentially private
prefix tree [13] to model the transitions using the transition
segments. A trajectory will be generated following the path
on the prefix tree. Note that there are also transitions to allow
a state in one prefix tree to move to another prefix tree with
different granularity.
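A minimal, non-private version of such a transition prefix tree might look like the sketch below; DPT's actual structure additionally spans multiple grids and adds calibrated Laplace noise to each count, and the class and parameter names here are ours.

```python
from collections import defaultdict

class PrefixTreeNode:
    """A node of a (non-private) prefix tree over transition segments."""
    def __init__(self):
        self.count = 0
        self.children = defaultdict(PrefixTreeNode)

def build_prefix_tree(segments, height):
    """Insert every window of up to `height` consecutive states from each
    segment, incrementing the count of every node along the path."""
    root = PrefixTreeNode()
    for seg in segments:
        for i in range(len(seg)):
            node = root
            for state in seg[i:i + height]:
                node = node.children[state]
                node.count += 1
    return root
```

Generation then follows high-count paths from the root, which is how the random walk on the prefix trees described above proceeds.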
4 Our Proposal
Both
AdaTrace
and
DPT
adopt Markov chain model to cap-
ture the transition pattern of the trajectory dataset. But they
work in two extremes:
AdaTrace
mostly uses the first-order
Markov chain model to reduce the amount of noise. The draw-
back is that the first-order Markov chain model only captures
a limited amount of information, and thus the result is not
accurate. On the other hand,
DPT
uses high-order Markov
chain model, but this leads to excessive noise being added,
and also makes the result inaccurate.
Intuitively, there is a trade-off between the amount of in-
formation we can extract and the amount of noise we have
to add due to the constraint of DP. Our insight is to work in
the middle ground between the two extremes of
AdaTrace
and
DPT
, where we use the Markov chain model in a way
that is neither too coarse-grained nor too fine-grained. In par-
ticular, we employ the first-order and second-order Markov
chain model and select useful information in the two mod-
els for generation. To this end, we propose
PrivTrace
, which
achieves a good trade-off between accuracy and noise.
4.1 Method Overview
PrivTrace
follows the general framework for generating syn-
thetic trajectory data described in Section 3, which consists
of three major parts: geographical space discretization, model
learning, and trajectory generation.
Step 1: Discretization.
To better capture the transition in-
formation of the trajectories, we rely on the observation that
the places with more trajectories passing through should be
discretized in a finer-grained manner. To this end, we use a
density-sensitive two-layer discretization scheme. The core
idea is to first discretize the map into a coarser-grained uni-
form grid (or first-layer grid), and the cells in the first-layer
grid with many trajectories passing through are further dis-
cretized into finer-grained (second-layer) grid. The details of
this step are in Section 4.2.
Step 2: Model Learning.
Our model learning step contains
two components: the Markov chain models learning and trip
distribution estimation.
Markov Chain Models Learning. We first estimate the
first-order and second-order models in a differentially pri-
vate manner. The details of this component are discussed in
Section 4.3.
Figure 1: Method overview. PrivTrace is composed of three parts: discretization, model learning, and trajectory generation.
The discretization step first partitions the space into a coarse-grained uniform grid, and then estimates the density of the
cells with DP to determine which cells need to be expanded. For model learning, we learn both the first-order and second-order
Markov chain models with differential privacy, and obtain the trip distribution from the two models. The trajectory generation
process is a random-walk-based algorithm that generates a synthetic trajectory. We propose a method to select between the two
models during the generation of synthetic trajectories.

Trip Distribution Estimation. We estimate the trip distribution
to obtain the frequency of the starting and ending points
of the trajectories. The Markov chain model already contains
this, but it is biased due to a normalization step when training
the Markov chain model. To this end, we propose a method
to obtain an accurate estimate of the trip distribution.
The details of this component are given in Section 4.4.
Step 3: Generation.
We use a random-walk-based method
to generate trajectories, which starts from a sampled state and
predicts the next state using either the first-order or second-order
Markov chain model. Concretely, we first sample a pair of
starting and ending states from the estimated trip distribution
and then random walk with the two Markov chain models. The
details of this step are given in Section 4.5.
Note that we use two Markov chain models to predict the
next step in the synthetic trajectory. In the prediction process,
the first-order and second-order Markov chain models have
different prediction abilities. To take advantage of both
models, we propose two criteria to choose between them,
which will be described in Section 4.6.
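To make the idea of switching between the two models concrete, here is an illustrative selection rule. This is our own simplification, not the paper's actual criteria (those are described in Section 4.6): it falls back to the first-order model when the second-order counts are too small to stand out from the DP noise.

```python
def choose_model(first_order_row, second_order_row, noise_scale, factor=5):
    """Illustrative selection rule (hypothetical): use the sharper second-order
    transition row only when its total count is large relative to the scale of
    the Laplace noise; otherwise fall back to the first-order row."""
    total2 = sum(second_order_row.values())
    if total2 < factor * noise_scale:   # counts drowned by noise
        return first_order_row
    return second_order_row
```

The `factor` threshold is arbitrary here; the real criteria also account for the models' prediction confidence.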
4.2 Geographical Space Discretization
We borrow the idea of Qardaji et al. [40] to discretize the
geographical space. In particular, we first divide the geographical
space into K × K cells (we call them the first-layer cells) of
equal size. For each cell, we further calculate the number of
trajectories that pass through it, and if the number is
large, we further partition the cell.

To estimate the occurrences of trajectories in all first-layer
cells, we use the Laplacian mechanism with parameter ε_1.
One challenge here is that the sensitivity is unbounded, since
a trajectory can have an arbitrary number of occurrences in
cells, which would require infinite DP noise. To address this
issue, we normalize the trajectory (with the normalization method
introduced in [25]) to bound the sensitivity (proved in Appendix B.1).
For example, in Figure 1, we first discretize the space into
four cells C_1, C_2, C_3, and C_4. Consider a trajectory that occurs
in C_1 and C_2; its total number of occurrences in all cells is
2. After we normalize the occurrences of the trajectory in C_1
and C_2 by the total occurrence, the trajectory contributes to the
trajectory occurrence in C_1 and C_2 by 1/2 and 1/2 instead of
1 and 1.
If a cell has many trajectories passing over it, we expand it
using a more fine-grained cell so that we can understand that
area with more details. In Figure 1, step 1, the right bottom
cell is expanded into four cells.
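The normalize-then-perturb step for the first-layer cell densities can be sketched as follows (cells represented as labels, Laplace noise drawn as the difference of two exponentials; the function name is hypothetical):

```python
from collections import Counter

def noisy_cell_densities(trajectories, epsilon, rng):
    """Normalize each trajectory's cell occurrences to sum to 1 (bounding its
    L1 sensitivity to 1), then add Laplace(1/epsilon) noise to each cell."""
    density = Counter()
    for traj in trajectories:
        occ = Counter(traj)                 # occurrences per first-layer cell
        total = sum(occ.values())
        for cell, n in occ.items():
            density[cell] += n / total      # e.g. 1/2 and 1/2 in the example
    beta = 1.0 / epsilon
    return {c: d + rng.expovariate(1 / beta) - rng.expovariate(1 / beta)
            for c, d in density.items()}
```

Cells whose noisy density exceeds a threshold are the ones expanded into second-layer grids.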
Difference from AdaTrace. AdaTrace [25] adopts a similar
two-layer discretization scheme as
two-layer discretization scheme as
PrivTrace
; however, the
use of two-layer grids is different. In
AdaTrace
, the cells
in the first layer are used as states for Markov chain model,
while the cells in the second layer are used for sampling in
the generation phase. On the other hand, in
PrivTrace
, the
cells from both the first-layer grid (if not divided again in the
second layer) and the second-layer grid are used as states. The
advantage of using cells from all layers as states is that it can
capture finer-grained transition information.