PrivTrace: Differentially Private Trajectory Synthesis by Adaptive Markov Models
Haiming Wang (1), Zhikun Zhang (2), Tianhao Wang (3), Shibo He (1),
Michael Backes (2), Jiming Chen (1), Yang Zhang (2)
(1) Zhejiang University
(2) CISPA Helmholtz Center for Information Security
(3) University of Virginia
arXiv:2210.00581v2 [cs.CR] 5 Oct 2022
Abstract
Publishing trajectory data (individual's movement information)
is very useful, but it also raises privacy concerns. To
handle the privacy concerns, in this paper, we apply differential
privacy, the standard technique for data privacy, together
with the Markov chain model, to generate synthetic trajectories.
We observe that existing studies all use the Markov chain model
and thus propose a framework to analyze the usage of the
Markov chain model in this problem. Based on the analysis,
we come up with an effective algorithm, PrivTrace, that uses
the first-order and second-order Markov models adaptively.
We evaluate PrivTrace and existing methods on synthetic
and real-world datasets to demonstrate the superiority of our
method.
1 Introduction
Trajectory data analysis plays an important role in tasks re-
lated to the social benefit such as urban planning and intelli-
gent transportation [4,46]. The trajectory data used usually
comes from users who carry mobile devices to record their
locations. However, the sensitive nature of the trajectory data
also gives rise to privacy risks when being shared. Recent re-
search demonstrated effective privacy attacks despite aggrega-
tion or anonymization, such as trajectory reconstruction [25],
de-anonymization [11], and membership inference [12,39].
Without a sufficient privacy protection technique, trajectory
data analysis is not possible since users will refuse to share
their data.
A promising technique to overcome the privacy concern is
differential privacy (DP) [17,18,47,55], which has become
the golden standard in the privacy community. Intuitively, on
a given dataset, DP defines a random transformation process
(algorithm) so that the output of the transformation contains
some “noise” that can “cover” the existence of any possible
record in the dataset. DP can be used by both companies and
government agencies. With DP protection, hopefully, more
users will be willing to share their data. From the government
side, agencies can also share more data for the public good.
(Zhikun Zhang is the corresponding author.)
Most of the previous studies on differentially private tra-
jectory data analysis focus on designing tailored algorithms
for specific tasks, such as community discovering [52], par-
ticipatory sensing [33], and recommendation [49]. Our paper
focuses on a more general approach, where we publish a syn-
thetic trajectory dataset that shares similar properties with the
original one while satisfying DP. While publishing synthetic
datasets would not work better than directly publishing aggre-
gate statistics [42], this paradigm can easily enable any (even
unseen) down-stream data analysis tasks without modifying
existing algorithms and has been adopted in, e.g., the 2020
US Census data publication [10] and a dataset for population
flow analysis [29]. There are several recent studies focusing
on generating data satisfying DP, and they are typically com-
posed of two steps: model learning and generation [36,53,56].
In the scenario of trajectory data, because the information is
typically continuous, existing works [25,28] also need to first
discretize the data upfront.
Existing Work.
The first work to generate synthetic trajecto-
ries with DP is
DPT
by He et al. [28]. It uses a set of uniform
grids with different granularities to discretize the space and es-
tablishes multiple differentially private prefix trees [13] (each
with a different grid) to model transitions; then generates data
by random walk on the prefix trees. More recently, Gursoy et
al. [25] proposed
AdaTrace
. It discretizes the geographical
space into two kinds of grids using different granularities.
Then a first-order Markov chain model and three other important
features are learned based on the discretization results.
The data is then generated by random walk on the first-order
Markov chain model with the help of the extracted features. The
drawback of
AdaTrace
is that information contained in the
first-order Markov chain model is not enough to generate data
of high quality.
Our Contributions.
Existing approaches either get only
the first-order Markov chain model, which fails to obtain
enough transition information, or get a high-order Markov
chain model, which introduces excessive noise due to DP. Our
idea is to reach a middle ground between DPT and AdaTrace.
To this end, we employ only the first and second-order Markov
chain model, for three reasons. First, previous studies have
shown that the second-order Markov chain model can achieve
promising accuracy for next step prediction [22,32,41]. Sec-
ond, Markov chain model of higher order is too space and
time-consuming. A
k
-order Markov chain model with
m
states
needs
mk+1
transition probability values. When
m
is 300, and
k
is 3, this requires 30GB of storage. Third, to satisfy DP,
building more models will introduce more noise. Although
more information is extracted, the amount of noise also grows.
As a result, after some specific order of the Markov chain
model, the overall information quality will decrease. In prac-
tice, we observe the threshold is between the second-order
and third-order.
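The storage estimate above can be checked with a quick calculation (the 4 bytes per probability value is our assumption, not stated in the text):

```python
# Back-of-envelope storage for a k-order Markov model with m states:
# the dense transition table holds m^(k+1) probabilities
# (m^k contexts, each with m possible successors).
def markov_table_bytes(m: int, k: int, bytes_per_entry: int = 4) -> int:
    """Size in bytes of a dense k-order transition table over m states."""
    return m ** (k + 1) * bytes_per_entry

# m = 300 states, k = 3: roughly the 30GB figure quoted in the text.
size_gb = markov_table_bytes(300, 3) / 1e9
print(f"{size_gb:.1f} GB")  # prints 32.4 GB
```

By contrast, the second-order table for the same 300 states needs only about 108MB, which is why the second order is still practical.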
To generate high-quality trajectories, we use both the first-
order model and the second-order model in our random-walk-
based generation process. Every time to predict a state as
the next step, we select between the first-order model and
the second-order model. The selection principle involves two
considerations: the effect of noise on the different
models and the prediction confidence of the models.
The standard Markov chain model does not carry information
about where a trajectory starts and ends. We add a
virtual start and end state to every trajectory to record this
information. However, due to the steps taken to bound the
sensitivity, the recorded information is biased. To reduce the bias, we
propose an optimization-based method to estimate the trip
distribution. The optimization problem is built based on the
observation that a trajectory only contributes one to the count
related to the virtual start and end. The experimental results
show that our estimation method is effective, especially when
the privacy budget is small.
We conduct empirical experiments on both synthetic and
real-world trajectory datasets to compare
PrivTrace
with the
state-of-the-art.
PrivTrace
consistently outperforms the ex-
isting methods for a variety of metrics. We further conduct
comprehensive ablation studies to illustrate the effectiveness
of the three main components of
PrivTrace
. One limitation
of our approach is that the accuracy of the generated long
trajectories is worse than that of the short trajectories. This
is due to the fact that long trajectories are more complicated
and more challenging to generate correctly.
To summarize, the main contributions of this paper are
two-fold:

• We propose a new trajectory data synthesis method, PrivTrace.
Its key insight is to exploit both the first-order
and second-order Markov chain models. PrivTrace is built
with a new method to choose between the first-order and
second-order transition information for next-step prediction,
and a new method to estimate the trip distribution from the
first-order Markov chain model without consuming extra
privacy budget.

• We conduct extensive experiments on both synthetic and
real-world trajectory datasets to validate the effectiveness of
PrivTrace. Our code is open-sourced at
https://github.com/DpTrace/PrivTrace.
Roadmap.
In Section 2, we give the definition of the problem
and present background knowledge of DP. Then we show
the framework of trajectory data generation in Section 3.
Following this framework, we provide the design of our method in
Section 4. The experimental results are presented in Section 5
with discussions in Section 6.
Finally, we discuss related work in Section 7 and provide
concluding remarks in Section 8.
2 Preliminaries
2.1 Problem Definition
In this paper, we consider a dataset consisting of a set of tra-
jectories. Each trajectory is composed of a sequence of points.
We are interested in the following question: given a sensitive
trajectory dataset D_o, how can we generate a synthetic trajectory
dataset D_s that shares similar properties with D_o while
satisfying DP? Generating the synthetic trajectory dataset facilitates
down-stream data analysis tasks without modifications.
Following prior work [25], we use four statistical metrics to
measure the similarity between D_s and D_o: length distribution,
diameter distribution, trajectory density, and transition pattern.
The trajectory length is the total distance of a trajectory, which
can be used to study the commute distance of people. The
trajectory diameter measures the largest Euclidean distance
between any two points in a trajectory, which gives information
about the range of an individual's activities. The length
distribution and diameter distribution capture the frequency
of different trajectory lengths and trajectory diameters in a
trajectory dataset, respectively. We use the Jensen-Shannon
divergence (JSD) [34] to measure the similarity of distributions
between D_o and D_s. A smaller JSD means the generated
dataset D_s is more similar to the original D_o.

The trajectory density counts the number of trajectories
that pass through a given area on the map, which can be a
good indicator for urban planning, such as estimating the flow
of traffic in a specific area. The transition pattern captures the
frequency of transiting from one place to another, which can
help solve problems like next location prediction. We use the
average relative error (ARE) of randomly generated
queries of trajectory density and transition pattern to measure
their similarity between D_o and D_s. A smaller ARE implies
D_s is more similar to D_o.
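For concreteness, minimal implementations of the two similarity measures might look like the sketch below; the function names and the `sanity_bound` guard against division by zero are our assumptions, not details from the paper.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base-2 logs)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):  # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def are(true_answers, synthetic_answers, sanity_bound=1e-9):
    """Average relative error of query answers on synthetic vs. original data."""
    return sum(abs(t - s) / max(t, sanity_bound)
               for t, s in zip(true_answers, synthetic_answers)) / len(true_answers)

print(jsd([0.5, 0.5], [0.5, 0.5]))  # 0.0 for identical distributions
```

With base-2 logarithms the JSD lies in [0, 1], so smaller values directly indicate more similar length or diameter distributions.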
2.2 Markov Chain Model
To generate a synthetic dataset, a commonly used method is
the Markov chain model. A Markov chain model is a stochas-
tic model describing a sequence of possible events in which
the probability of each event depends only on the state at-
tained in the previous events. It is frequently used to analyze
sequential data such as location trajectories [5,15,22,24,35]
and natural language [19,31,38,44]. Formally, the Markov
chain model is defined as follows:
Definition 1 (Markov Chain Model). Given a finite set
of discrete states Σ = {σ_1, σ_2, σ_3, ..., σ_k}, a sequence
T = (λ_1, λ_2, λ_3, ..., λ_s) is said to follow a k-th-order Markov
process if, for all i ≥ k and all σ_j ∈ Σ,

Pr[λ_{i+1} = σ_j | λ_i ... λ_1] = Pr[λ_{i+1} = σ_j | λ_i ... λ_{i-k+1}]

where Pr[λ_{i+1} | λ_i ... λ_1] is called the transition probability
from λ_i ... λ_1 to λ_{i+1}.

Given a dataset D, if r is a subsequence of some T ∈ D, the
empirical transition probability Pr[σ | r] is defined as

Pr[σ | r] = (Σ_{T ∈ D} N_T(rσ)) / (Σ_{T ∈ D} Σ_{x ∈ Σ} N_T(rx))
          = N_D(rσ) / (Σ_{x ∈ Σ} N_D(rx))                     (1)

Here rx is the sequence in which r is followed by x, N_T(rx) is
the total number of occurrences of rx in T, and we denote
Σ_{T ∈ D} N_T(rx) as N_D(rx).
The k-th-order Markov chain model is the Markov chain
model that can provide Pr[σ | r] for any state σ and any
subsequence r of length k. By calculating N_D(rx) for every
possible length-k subsequence r and every x ∈ Σ, we can learn a
k-th-order Markov chain model from the dataset.
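The counting procedure behind Equation 1 can be sketched as follows; the function names and the toy dataset are ours, for illustration only.

```python
from collections import Counter

def learn_markov(dataset, k):
    """Count N_D(r.x) for every length-k context r in the dataset (Equation 1)."""
    counts = Counter()  # maps (r, x) -> N_D(rx)
    for traj in dataset:
        for i in range(len(traj) - k):
            r, x = tuple(traj[i:i + k]), traj[i + k]
            counts[(r, x)] += 1
    return counts

def transition_prob(counts, r, sigma):
    """Empirical Pr[sigma | r] = N_D(r.sigma) / sum over x of N_D(r.x)."""
    total = sum(n for (ctx, _), n in counts.items() if ctx == r)
    return counts[(r, sigma)] / total if total else 0.0

D = [["A", "B", "A", "B"], ["A", "B", "C"]]
model = learn_markov(D, k=1)
print(transition_prob(model, ("A",), "B"))  # prints 1.0: A is always followed by B
```

To make such a model differentially private, Laplace noise is added to each count N_D(rx) before normalization, as described in Section 2.3.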
2.3 Differential Privacy
Definition 2 (ε-Differential Privacy). An algorithm A satisfies
ε-differential privacy (ε-DP), where ε > 0, if and only if
for any two neighboring datasets D and D′, we have

∀O ⊆ Range(A): Pr[A(D) ∈ O] ≤ e^ε · Pr[A(D′) ∈ O],

where Range(A) denotes the set of all possible outputs of the
algorithm A.
In this paper, we consider two datasets D and D′ to be
neighbors, denoted as D ≃ D′, if and only if D = D′ + S or
D′ = D + S, where D + S denotes the dataset resulting from
adding one trajectory S to the dataset D.
Laplacian Mechanism. The Laplace mechanism computes
a function f on an input dataset D while satisfying ε-DP by
adding random noise to f(D). The magnitude of the noise
depends on GS_f, the global L1 sensitivity of f, defined as

GS_f = max_{D ≃ D′} ||f(D) − f(D′)||_1

When f outputs a single element, such a mechanism A is
given below:

A_f(D) = f(D) + L(GS_f / ε)

where L(β) denotes a random variable sampled from
the Laplace distribution with scale parameter β, whose density
is Pr[L(β) = x] = (1 / 2β) · e^(−|x| / β).

When f outputs a vector, A adds independent samples of
L(GS_f / ε) to each element of the vector. The variance of each
such sample is 2GS_f^2 / ε^2.
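A minimal sketch of the vector-valued Laplace mechanism described above. It relies on the standard fact that a Laplace(β) variable equals the difference of two Exp(1/β) variables; the function name is our own.

```python
import random

def laplace_mechanism(values, sensitivity, epsilon, rng=random):
    """Add i.i.d. Laplace(GS_f / epsilon) noise to each element of f(D)."""
    beta = sensitivity / epsilon
    # A Laplace(beta) sample is the difference of two Exp(1/beta) samples.
    return [v + rng.expovariate(1 / beta) - rng.expovariate(1 / beta)
            for v in values]

# e.g. releasing three noisy cell counts with sensitivity 1 and epsilon = 1
noisy = laplace_mechanism([100.0, 120.0, 80.0], sensitivity=1.0, epsilon=1.0)
```

Each released value has variance 2β² = 2·GS_f²/ε², matching the formula above.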
2.4 Composition Properties of DP
The following composition properties are commonly used
for building complex differentially private algorithms from
simpler subroutines.
Sequential Composition. Combining multiple subroutines
that satisfy DP for ε_1, ..., ε_k results in a mechanism that
satisfies ε-DP for ε = Σ_i ε_i.
Post-processing.
Given an
ε
-DP algorithm
A
, releasing
g(A(D))
for any
g
still satisfies
ε
-DP. That is, post-processing
an output of a differentially private algorithm does not incur
any additional privacy concerns.
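As an illustration of sequential composition, a pipeline can split one total budget across its subroutines and still satisfy ε-DP overall. The helper name and the example weights below are hypothetical, not taken from the paper.

```python
def split_budget(total_epsilon, weights):
    """Allocate a total privacy budget across subroutines in proportion to
    the given weights; by sequential composition, running all subroutines
    satisfies total_epsilon-DP overall."""
    s = sum(weights)
    return [total_epsilon * w / s for w in weights]

# e.g. spend more of the budget on model learning than on discretization
eps_disc, eps_first, eps_second = split_budget(1.0, [1, 2, 2])
```

Post-processing then lets the generation step consume the noisy models freely: no further budget is needed there.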
3 Existing Solutions
Before diving into the descriptions of existing methods that
use Markov chain model to generate synthetic trajectories,
we first present a general recipe that we observe in all these
solutions.
3.1 A Framework for Markov-based Trajec-
tory Synthesis
To generate synthetic trajectories using the Markov chain
model, there are three major components: Geographical space
discretization, model learning, and trajectory generation. We
integrate all the components in a general framework:
Discretization.
The purpose of discretization is to create
discrete states for the Markov chain model. We discretize
the continuous geographical space into one or more grids
where each grid partitions the geographical space. After
discretization, each area in the grid is regarded as a state in
the Markov chain model.
Model Learning.
Given a set of geographical states and a
trajectory dataset, we need to learn some models that can
capture the transition pattern of the trajectory dataset. Here,
the model can be a Markov chain model and other informa-
tion (e.g., trip distribution) extracted from the dataset. To
train the Markov chain model, we calculate all NDneeded
in Equation 1 and then add noise to achieve DP.
Generation.
After the model is learned, we can generate
the synthetic trajectory (e.g., by random walk on the model).
The trajectory generation component is a post-processing
step in the context of DP, thus does not have an additional
privacy concern.
Table 1: Summary of existing methods on different steps.

| Method        | Discretization     | Model Learning                         | Generation                       |
|---------------|--------------------|----------------------------------------|----------------------------------|
| AdaTrace [25] | Adaptive partition | First-order Markov chain model         | Markov sampling + auxiliary info |
| DPT [28]      | Reference system   | Multiple prefix trees                  | Markov sampling                  |
| PrivTrace     | Adaptive partition | First-/second-order Markov chain model | Random walk                      |
3.2 Existing Methods
Table 1 summarizes these three phases of existing work. In
what follows, we review these steps for previous studies.
AdaTrace.
This method extracts the first-order Markov chain
model, trip distribution, and length features from the sensitive
dataset with DP, and then generates private trajectories accord-
ing to these features. In discretization,
AdaTrace
proposes to
discretize the geographical space into grids twice with two
different granularities. It first uniformly discretizes the geo-
graphical space into a coarse-grained grid. For the cells in the
first grid with a large number of trajectories passing through,
AdaTrace further discretizes them into finer-grained grids.
In model learning,
AdaTrace
learns a first-order Markov
chain model using the first grid. To better capture the inherent
patterns of the dataset,
AdaTrace
also extracts the length dis-
tribution and trip distribution (count of pairs of starting and
ending point in the trajectory). Noise is added to achieve DP.
In generation, AdaTrace first samples a starting state and
an ending state from the trip distribution, and samples a
trajectory length |T| from the length distribution. Then, AdaTrace
generates the trajectory from the starting state, and repeatedly
generates the next state by random walking on the Markov chain
model. After |T| − 1 steps, the trajectory directly jumps to the
ending state. Finally, for each generated state, AdaTrace
uniformly samples a location point from the corresponding cell.
If the cell is further partitioned, the location point is sampled
according to the density of the second-layer grid.
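The walk-then-jump generation above can be sketched as follows. The dict-of-dicts representation of the Markov model and the function name are assumptions for illustration, not AdaTrace's actual data structures.

```python
import random

def generate_trajectory(start, end, length, markov, rng=random):
    """Random-walk generation in the style described above: walk length-1
    steps from the start state, then jump directly to the end state."""
    traj = [start]
    for _ in range(length - 1):
        nxt = markov.get(traj[-1])   # {next_state: probability}
        if not nxt:                  # no outgoing transitions recorded
            break
        states, probs = zip(*nxt.items())
        traj.append(rng.choices(states, weights=probs, k=1)[0])
    traj.append(end)
    return traj
```

For example, with a deterministic chain {"A": {"B": 1.0}, "B": {"A": 1.0}}, sampling start "A", end "Z", and |T| = 3 walks two steps before jumping to the end state.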
DPT.
In order to capture more precise information,
DPT
uses multiple uniform grids with different granularities. When
discretizing trajectories,
DPT
will choose the most coarse-
grained grid for which there exists a transition for two consec-
utive points of the trajectory (i.e., using a more coarse-grained
grid, the two locations in the trajectory will be in the same
cell). A trajectory will be transformed into multiple segments
of transition, where transitions in the same segment are on
the same grid.
Now for each grid,
DPT
establishes a differentially private
prefix tree [13] to model the transitions using the transition
segments. A trajectory will be generated following the path
on the prefix tree. Note that there are also transitions to allow
a state in one prefix tree to move to another prefix tree with
different granularity.
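A minimal, non-private version of such a transition prefix tree might look like the sketch below; DPT's actual structure additionally spans multiple grids and adds calibrated Laplace noise to each count, and the class and parameter names here are ours.

```python
from collections import defaultdict

class PrefixTreeNode:
    """A node of a (non-private) prefix tree over transition segments."""
    def __init__(self):
        self.count = 0
        self.children = defaultdict(PrefixTreeNode)

def build_prefix_tree(segments, height):
    """Insert every window of up to `height` consecutive states from each
    segment, incrementing the count of every node along the path."""
    root = PrefixTreeNode()
    for seg in segments:
        for i in range(len(seg)):
            node = root
            for state in seg[i:i + height]:
                node = node.children[state]
                node.count += 1
    return root
```

Generation then follows high-count paths from the root, which is how the random walk on the prefix trees described above proceeds.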
4 Our Proposal
Both
AdaTrace
and
DPT
adopt Markov chain model to cap-
ture the transition pattern of the trajectory dataset. But they
work in two extremes:
AdaTrace
mostly uses the first-order
Markov chain model to reduce the amount of noise. The draw-
back is that the first-order Markov chain model only captures
a limited amount of information, and thus the result is not
accurate. On the other hand,
DPT
uses high-order Markov
chain model, but this leads to excessive noise being added,
and also makes the result inaccurate.
Intuitively, there is a trade-off between the amount of in-
formation we can extract and the amount of noise we have
to add due to the constraint of DP. Our insight is to work in
the middle ground between the two extremes of
AdaTrace
and
DPT
, where we use the Markov chain model in a way
that is neither too coarse-grained nor too fine-grained. In par-
ticular, we employ the first-order and second-order Markov
chain model and select useful information in the two mod-
els for generation. To this end, we propose
PrivTrace
, which
achieves a good trade-off between accuracy and noise.
4.1 Method Overview
PrivTrace
follows the general framework for generating syn-
thetic trajectory data described in Section 3, which consists
of three major parts: geographical space discretization, model
learning, and trajectory generation.
Step 1: Discretization.
To better capture the transition in-
formation of the trajectories, we rely on the observation that
the places with more trajectories passing through should be
discretized in a finer-grained manner. To this end, we use a
density-sensitive two-layer discretization scheme. The core
idea is to first discretize the map into a coarser-grained uni-
form grid (or first-layer grid), and the cells in the first-layer
grid with many trajectories passing through are further dis-
cretized into finer-grained (second-layer) grid. The details of
this step are in Section 4.2.
Step 2: Model Learning.
Our model learning step contains
two components: the Markov chain models learning and trip
distribution estimation.
Markov Chain Models Learning. We first estimate the
first-order and second-order models in a differentially pri-
vate manner. The details of this component are discussed in
Section 4.3.
Figure 1: Method overview. PrivTrace is composed of three parts: discretization, model learning, and trajectory generation.
The discretization step first partitions the space into a coarse-grained uniform grid, and then estimates the density of the
cells with DP to determine which cells need to be expanded. For model learning, we learn both the first-order and second-order
Markov chain models with differential privacy, and obtain the trip distribution from the two models. The trajectory generation
process is a random-walk-based algorithm that generates a synthetic trajectory. We propose a method to select between the two
models during the generation of synthetic trajectories.

Trip Distribution Estimation. We estimate the trip distribution
to obtain the frequency of the starting and ending points
of the trajectories. The Markov chain model already contains
this, but it is biased due to a normalization step when training
the Markov chain model. To this end, we propose a method
to obtain an accurate estimate of the trip distribution.
The details of this component are given in Section 4.4.
Step 3: Generation.
We use a random-walk-based method
to generate trajectories, which starts from a sampled state and
predicts the next state using either the first-order or second-order
Markov chain model. Concretely, we first sample a pair of
starting and ending states from the estimated trip distribution
and then random walk with the two Markov chain models. The
details of this step are given in Section 4.5.
Note that we use two Markov chain models to predict the
next step in the synthetic trajectory. In the prediction process,
the first-order and second-order Markov chain models have
different prediction abilities. To take advantage of both
models, we propose two criteria to choose between them,
which will be described in Section 4.6.
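To make the idea of switching between the two models concrete, here is an illustrative selection rule. This is our own simplification, not the paper's actual criteria (those are described in Section 4.6): it falls back to the first-order model when the second-order counts are too small to stand out from the DP noise.

```python
def choose_model(first_order_row, second_order_row, noise_scale, factor=5):
    """Illustrative selection rule (hypothetical): use the sharper second-order
    transition row only when its total count is large relative to the scale of
    the Laplace noise; otherwise fall back to the first-order row."""
    total2 = sum(second_order_row.values())
    if total2 < factor * noise_scale:   # counts drowned by noise
        return first_order_row
    return second_order_row
```

The `factor` threshold is arbitrary here; the real criteria also account for the models' prediction confidence.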
4.2 Geographical Space Discretization
We borrow the idea of Qardaji et al. [40] to discretize the
geographical space. In particular, we first divide the geographical
space into K × K cells (we call them the first-layer cells) of
equal size. For each cell, we further calculate the number of
trajectories that pass through it, and if the number is
large, we further partition the cell.

To estimate the occurrences of trajectories in all first-layer
cells, we use the Laplacian mechanism with parameter ε_1.
One challenge here is that the sensitivity is unbounded, since
a trajectory can have an arbitrary number of occurrences in
cells, which would require infinite DP noise. To address this
issue, we normalize the trajectory (with the normalization method
introduced in [25]) to bound the sensitivity (proved in Appendix B.1).
For example, in Figure 1, we first discretize the space into
four cells C_1, C_2, C_3, and C_4. Consider a trajectory that occurs
in C_1 and C_2; its total number of occurrences in all cells is
2. After we normalize the occurrences of the trajectory in C_1
and C_2 by the total occurrence, the trajectory contributes to the
trajectory occurrence in C_1 and C_2 by 1/2 and 1/2 instead of
1 and 1.
If a cell has many trajectories passing over it, we expand it
using a more fine-grained cell so that we can understand that
area with more details. In Figure 1, step 1, the right bottom
cell is expanded into four cells.
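The normalize-then-perturb step for the first-layer cell densities can be sketched as follows (cells represented as labels, Laplace noise drawn as the difference of two exponentials; the function name is hypothetical):

```python
from collections import Counter

def noisy_cell_densities(trajectories, epsilon, rng):
    """Normalize each trajectory's cell occurrences to sum to 1 (bounding its
    L1 sensitivity to 1), then add Laplace(1/epsilon) noise to each cell."""
    density = Counter()
    for traj in trajectories:
        occ = Counter(traj)                 # occurrences per first-layer cell
        total = sum(occ.values())
        for cell, n in occ.items():
            density[cell] += n / total      # e.g. 1/2 and 1/2 in the example
    beta = 1.0 / epsilon
    return {c: d + rng.expovariate(1 / beta) - rng.expovariate(1 / beta)
            for c, d in density.items()}
```

Cells whose noisy density exceeds a threshold are the ones expanded into second-layer grids.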
Difference from AdaTrace. AdaTrace [25] adopts a similar
two-layer discretization scheme as
two-layer discretization scheme as
PrivTrace
; however, the
use of two-layer grids is different. In
AdaTrace
, the cells
in the first layer are used as states for Markov chain model,
while the cells in the second layer are used for sampling in
the generation phase. On the other hand, in
PrivTrace
, the
cells from both the first-layer grid (if not divided again in the
second layer) and the second-layer grid are used as states. The
advantage of using cells from all layers as states is that it can
capture finer-grained transition information.