Modeling Data Correlations in Private Data Mining with Markov Model and Markov Networks
Modeling Data Correlations in Private Data Mining with Markov Model and Markov Networks Yang Cao Emory University 2017.11.15 Outline Data Mining with Di ff erential Privacy (DP) Scenario: Spatiotemporal Data Mining using DP Markov


SLIDE 1

Modeling Data Correlations in Private Data Mining with Markov Model and Markov Networks

Yang Cao, Emory University, 2017.11.15

SLIDE 2

Outline

  • Data Mining with Differential Privacy (DP)
  • Scenario: Spatiotemporal Data Mining using DP
  • Markov Chain for temporal correlations
  • Gaussian Markov Random Field for user-user correlations
  • Summary and open problems
SLIDE 3

Outline

  • Data Mining with Differential Privacy
  • Scenario: Spatiotemporal Data Mining using DP
  • Markov Chain for temporal correlations
  • Gaussian Markov Random Field for user-user correlations
  • Summary and open problems
SLIDE 4

Data Mining

[Diagram: a company or institute mines a sensitive database and publishes results to the public; an attacker can attack the published output.]

SLIDE 5

Privacy-Preserving Data Mining (PPDM)

[Diagram: instead of the sensitive data, the institute releases noisy data, so the adversary's attack fails.]

How? ε-Differential Privacy!

SLIDE 6

SLIDE 7

SLIDE 8

What is Differential Privacy

  • Privacy: the right to be forgotten.
  • DP: the output of an algorithm should NOT be significantly affected by any individual's data.

[Illustration: neighboring databases D and D' differ in one user's tuple, yet the outputs M(Q(D)) and M(Q(D')) are approximately the same.]

  • Formally, M satisfies ε-DP if, for all neighboring databases D, D' and every output r:

$$\log \frac{\Pr(M(Q(D)) = r)}{\Pr(M(Q(D')) = r)} \le \varepsilon$$

  • e.g., Laplace mechanism: add Lap(1/ε) noise to Q(D)
  • Sequential composition: e.g., running M twice gives 2ε-DP

ε ⬆, privacy ⬇. e.g., 2ε-DP permits more privacy loss than ε-DP.
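Below is a minimal sketch of the Laplace mechanism (a hedged illustration, assuming a numeric query of sensitivity 1, matching the Lap(1/ε) calibration on this slide):

```python
import numpy as np

def laplace_mechanism(query_answer, epsilon, sensitivity=1.0):
    """Release an epsilon-DP answer by adding Laplace(sensitivity/epsilon) noise."""
    return query_answer + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# e.g., a count query Q(D) = 42 released under epsilon = 0.5
noisy_answer = laplace_mechanism(42, epsilon=0.5)
```

Releasing two such answers about the same data consumes the budget twice, giving 2ε-DP by sequential composition.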

SLIDE 9

An Open Problem: DP on Correlated Data

  • When data are independent:

[Illustration: D and D' differ in one user's tuple, and M(Q(D)) ≈ M(Q(D'))] ⇒ ε-DP

  • When data are correlated (e.g., u1 and u3 are always the same):

[Illustration: changing u1's tuple also changes u3's tuple, so M(Q(D)) and M(Q(D')) may differ by more than the intended bound] ⇒ ?-DP

  • The "guarantee" of DP on correlated data is still controversial [*][**].

[*] Differential Privacy as a Causal Property, https://arxiv.org/abs/1710.05899
[**] https://github.com/frankmcsherry/blog/blob/master/posts/2016-08-29.md

SLIDE 10

Quantifying DP on Correlated Data

  • A few recent papers [Cao17][Yang15][Song17] use a quantification approach to achieve ε-DP (protecting each user's private data value).

Traditional approach (if the attacker knows correlations, ε-DP may not hold):

sensitive data → Laplace mechanism Lap(1/ε) → ε-DP data

Quantification approach (protects against attackers with knowledge of correlations):

sensitive data → model data correlations → attacker inference → Laplace mechanism Lap(1/ε') → ε-DP data

[Cao17]: Markov Chain; [Yang15]: Gaussian Markov Random Field (GMRF); [Song17]: Bayesian Network
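As a sketch of the calibration step in the quantification approach, assume a hypothetical oracle loss_under(eps) (e.g., the TPL/SPL quantifications described later in this talk) that returns the worst-case privacy loss against a correlation-aware attacker when Lap(1/eps) noise is used; the pipeline then searches for the largest ε' whose quantified loss stays within the target ε:

```python
def calibrate_epsilon(target_eps, loss_under, tol=1e-6):
    """Binary-search the per-release parameter eps' such that the quantified
    privacy loss against a correlation-aware attacker is at most target_eps.
    Assumes loss_under(eps) is monotonically increasing in eps and
    loss_under(eps) >= eps (correlations can only increase the loss)."""
    lo, hi = 0.0, target_eps
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if loss_under(mid) <= target_eps:
            lo = mid   # mid is feasible; try a larger eps'
        else:
            hi = mid
    return lo

# usage: eps_prime = calibrate_epsilon(1.0, loss_under=my_tpl_bound)
```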

SLIDE 11

Outline

  • Data Mining with Differential Privacy
  • Scenario: Spatiotemporal Data Mining using DP
  • Markov Chain for temporal correlations
  • Gaussian Markov Random Field for user-user correlations
  • Summary and open problems
SLIDE 12

Spatiotemporal Data Mining with DP

(a) Location Data (sensitive data):

user   t=1    t=2    t=3    …
u1     loc3   loc1   loc1   …
u2     loc2   loc4   loc5   …
u3     loc2   loc4   loc5   …
u4     loc4   loc5   loc3   …

(b) True Counts: a count query on each snapshot D1, D2, D3, … gives the number of users at each of loc1-loc5 at each timestamp.

(c) Private Counts: Lap(1/ε) noise is added to each true count, and each noisy release r1, r2, r3, … satisfies ε-DP.
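A small sketch of the release pipeline on this slide, adding Lap(1/ε) noise to each per-location count at one timestamp (following the slide's Lap(1/ε) calibration; the snapshot values are taken from table (a)):

```python
import numpy as np

LOCATIONS = ["loc1", "loc2", "loc3", "loc4", "loc5"]

def private_counts(snapshot, epsilon):
    """True per-location counts plus Lap(1/epsilon) noise per cell."""
    true_counts = {loc: sum(1 for v in snapshot.values() if v == loc)
                   for loc in LOCATIONS}
    return {loc: c + np.random.laplace(scale=1.0 / epsilon)
            for loc, c in true_counts.items()}

# r1: the noisy release for the t=1 snapshot D1 from table (a)
r1 = private_counts({"u1": "loc3", "u2": "loc2", "u3": "loc2", "u4": "loc4"},
                    epsilon=1.0)
```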

SLIDE 13

What types of data correlations?

(a) Location Data:

user   7:00   8:00   9:00   …
u1     loc3   loc1   loc1   …
u2     loc2   loc1   loc1   …
u3     loc2   loc4   loc5   …
u4     loc4   loc5   loc3   …

(b) Road Network and (c) Social Ties [figures; e.g., u2 and u1 are a couple, u1 and u3 are colleagues]

  • temporal correlation: within a single user's trajectory (across the snapshots D1, D2, D3, …)
  • spatial correlation: user-user (e.g., induced by the road network and social ties)

SLIDE 14

Outline

  • Data Mining with Differential Privacy
  • Scenario: Spatiotemporal Data Mining using DP
  • Markov Chain for temporal correlations
  • Gaussian Markov Random Field for user-user correlations
  • Summary and open problems
  • what is MC
  • how can (attacker) learn MC from data
  • how can (attacker) infer private data using MC
SLIDE 15

What is Markov Chain

Transition Matrix (row: state at time t; column: state at time t+1):

        loc1   loc2   loc3
loc1    0.2    0.1    0.7
loc2    0.1    0.2    0.7
loc3    0.3    0.4    0.3

Raw Trajectories:

user   7:00   8:00   9:00   …
u1     loc1   loc3   loc2   …
u2     loc2   loc2   loc2   …
u3     loc3   loc1   loc1   …
u4     loc1   loc2   loc2   …

  • A Markov chain is a stochastic process with the Markov property.
  • First-order Markov property: the state at time t depends only on the state at time t-1:

$$\Pr(x_t \mid x_{t-1}) = \Pr(x_t \mid x_{t-1}, \ldots, x_1) \quad \forall t > 0$$

  • Time-homogeneous: the transition matrix is the same after each step:

$$\Pr(x_{t+1} \mid x_t) = \Pr(x_{t+2} \mid x_{t+1})$$
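As a quick illustration, such a chain can be sampled directly from its transition matrix (a minimal sketch using the matrix in the table above):

```python
import numpy as np

STATES = ["loc1", "loc2", "loc3"]
P = np.array([[0.2, 0.1, 0.7],
              [0.1, 0.2, 0.7],
              [0.3, 0.4, 0.3]])  # transition matrix from the table above

def simulate(start, steps, rng=np.random.default_rng(0)):
    """Sample a trajectory from the first-order, time-homogeneous chain."""
    path, state = [start], STATES.index(start)
    for _ in range(steps):
        state = rng.choice(len(STATES), p=P[state])  # next state given current
        path.append(STATES[state])
    return path

traj = simulate("loc1", steps=3)
```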

SLIDE 16

How can (attacker) learn MC

  • If the attacker knows part of a user's trajectory, he can directly learn the transition matrix by maximum likelihood estimation, as sketched below.
  • If the attacker knows the road network, he may learn the MC using a Google-like model [*].

[*] E. Crisostomi, S. Kirkland, and R. Shorten, “A Google-like model of road network dynamics and its application to regulation and control,” International Journal of Control, vol. 84, no. 3, pp. 633–651, Mar. 2011.
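A minimal sketch of the first bullet: the maximum-likelihood estimate of a first-order, time-homogeneous transition matrix is just normalized transition counts (the trajectories below are the four rows of the Raw Trajectories table):

```python
import numpy as np

STATES = ["loc1", "loc2", "loc3"]
IDX = {s: i for i, s in enumerate(STATES)}

def mle_transition_matrix(trajectories):
    """MLE of the transition matrix: P[i, j] = count(i -> j) / count(i -> *)."""
    counts = np.zeros((len(STATES), len(STATES)))
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            counts[IDX[a], IDX[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # rows with no observed transitions stay all-zero
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)

P_hat = mle_transition_matrix([
    ["loc1", "loc3", "loc2"],
    ["loc2", "loc2", "loc2"],
    ["loc3", "loc1", "loc1"],
    ["loc1", "loc2", "loc2"],
])
```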

SLIDE 17

How can (attacker) infer private data using MC
[Model Attacker → Define TPL → Find structure of TPL]

  • Model temporal correlations using a Markov chain, e.g., user i: loc1 → loc3 → loc2 → …

(a) Backward temporal correlation $P_i^B$: $\Pr(l_i^{t-1} \mid l_i^t)$ (row: location at time t; column: location at time t-1):

        loc1   loc2   loc3
loc1    0.1    0.2    0.7
loc2    1      0      0
loc3    0.3    0.3    0.4

(b) Forward temporal correlation $P_i^F$: $\Pr(l_i^t \mid l_i^{t-1})$ (row: location at time t-1; column: location at time t):

        loc1   loc2   loc3
loc1    0.2    0.3    0.5
loc2    0.1    0.1    0.8
loc3    0.6    0.2    0.2
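One way an attacker can obtain the backward matrix from the forward one is Bayes' rule, Pr(l^{t-1} | l^t) ∝ Pr(l^t | l^{t-1}) Pr(l^{t-1}). A sketch under an assumed uniform prior over locations (the slide's example matrices are illustrative and need not satisfy this relation exactly):

```python
import numpy as np

# Forward matrix P_F[i, j] = Pr(l^t = j | l^{t-1} = i), from table (b)
P_F = np.array([[0.2, 0.3, 0.5],
                [0.1, 0.1, 0.8],
                [0.6, 0.2, 0.2]])

def backward_matrix(P_F, marginal):
    """P_B[j, i] = Pr(l^{t-1} = i | l^t = j), obtained by Bayes' rule:
    normalize the joint Pr(l^{t-1} = i, l^t = j) over each current state j."""
    joint = marginal[:, None] * P_F      # joint[i, j] = Pr(l^{t-1}=i, l^t=j)
    return (joint / joint.sum(axis=0)).T # column-normalize, then transpose

pi = np.array([1/3, 1/3, 1/3])           # assumed uniform prior over locations
P_B = backward_matrix(P_F, pi)           # each row of P_B sums to 1
```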

SLIDE 18

How can (attacker) infer private data using MC
[Model Attacker → Define TPL → Find structure of TPL]

  • DP can protect against an attacker $A_i(D^K)$ who knows all tuples except the victim's.

[Illustration: the database D at t=1 (u1: loc3, u2: loc2, u3: loc2, u4: loc4); the attacker knows every tuple $D^K$ except the victim's value $l_i$.]

  • What if the attacker also knows temporal correlations? Three attackers: (i) $A_i^T(D^K, P_i^B, \emptyset)$; (ii) $A_i^T(D^K, \emptyset, P_i^F)$; (iii) $A_i^T(D^K, P_i^B, P_i^F)$.

SLIDE 19

How can (attacker) infer private data using MC
[Model Attacker → Define TPL → Find structure of TPL]

  • Recall the definition of DP: if $PL_0(M) \le \varepsilon$, then $M$ satisfies ε-DP.
  • Definition of TPL: the Temporal Privacy Loss replaces $PL_0$ with the privacy loss w.r.t. an attacker $A_i^T$ who also knows the temporal correlations $P_i^B$, $P_i^F$.

SLIDE 20

How can (attacker) infer private data using MC
[Model Attacker → Define TPL → Find structure of TPL]

  • If there is no temporal correlation:

$$\text{Eqn}(2) = \underbrace{\log \frac{\Pr(r_1 \mid l_i^t, D_K^t)}{\Pr(r_1 \mid {l_i^t}', D_K^t)}}_{0} + \cdots + \underbrace{\log \frac{\Pr(r_t \mid l_i^t, D_K^t)}{\Pr(r_t \mid {l_i^t}', D_K^t)}}_{PL_0} + \cdots + \underbrace{\log \frac{\Pr(r_T \mid l_i^t, D_K^t)}{\Pr(r_T \mid {l_i^t}', D_K^t)}}_{0}$$

  • Definition of TPL: only the term for the release at time t is nonzero, so TPL = $PL_0$.

SLIDE 21

How can (attacker) infer private data using MC
[Model Attacker → Define TPL → Find structure of TPL]

  • If there are temporal correlations:

$$\text{Eqn}(2) = \underbrace{\log \frac{\Pr(r_1 \mid l_i^t, D_K^t)}{\Pr(r_1 \mid {l_i^t}', D_K^t)}}_{?} + \cdots + \underbrace{\log \frac{\Pr(r_t \mid l_i^t, D_K^t)}{\Pr(r_t \mid {l_i^t}', D_K^t)}}_{PL_0} + \cdots + \underbrace{\log \frac{\Pr(r_T \mid l_i^t, D_K^t)}{\Pr(r_T \mid {l_i^t}', D_K^t)}}_{?}$$

  • Definition of TPL: TPL = ? The terms for the other releases no longer vanish, and Eqn(2) is hard to quantify directly.

SLIDE 22

How can (attacker) infer private data using MC
[Model Attacker → Define TPL → Find structure of TPL]

  • Structure of TPL: the releases r1, …, rt-1, rt, rt+1, …, rT split into a backward part and a forward part:
  • (i) $A_i^T(D^K, P_i^B, \emptyset)$: Backward Privacy Loss (BPL), caused by r1, …, rt.
  • (ii) $A_i^T(D^K, \emptyset, P_i^F)$: Forward Privacy Loss (FPL), caused by rt, …, rT.
  • (iii) $A_i^T(D^K, P_i^B, P_i^F)$: both BPL and FPL.

SLIDE 23

How can (attacker) infer private data using MC
[Model Attacker → Define TPL → Find structure of TPL]

  • Analyze BPL: the backward privacy loss is expressed as a function of the backward temporal correlations $P_i^B$ (Eqn (6)). How can we calculate it?

SLIDE 24

How can (attacker) infer private data using MC
[Model Attacker → Define TPL → Find structure of TPL]

  • Analyze FPL: symmetrically, the forward privacy loss is expressed as a function of the forward temporal correlations $P_i^F$. How can we calculate it?

SLIDE 25

Calculating BPL & FPL

  • We convert the problem of BPL/FPL calculation to finding an
  • ptimal solution of a linear-fractional programming problem.
  • This problem can be solved by simplex algorithm in O(2n).
  • We designed a O(n2) algorithm for quantifying BPL/FPL.

Privacy Quantification Upper bound
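For intuition on the linear-fractional programming step, here is a generic solver sketch using the Charnes-Cooper transformation to an LP. This is the textbook reduction, not the paper's O(n^2) algorithm; all inputs are hypothetical numpy arrays:

```python
import numpy as np
from scipy.optimize import linprog

def solve_lfp(c, alpha, d, beta, A, b):
    """Maximize (c@x + alpha) / (d@x + beta) s.t. A@x <= b, x >= 0,
    assuming d@x + beta > 0 on the feasible set, via the Charnes-Cooper
    substitution y = t*x, which turns the problem into an LP."""
    n = len(c)
    obj = -np.concatenate([c, [alpha]])            # linprog minimizes
    A_ub = np.hstack([A, -b.reshape(-1, 1)])       # A@y - b*t <= 0
    b_ub = np.zeros(A.shape[0])
    A_eq = np.concatenate([d, [beta]]).reshape(1, -1)  # d@y + beta*t = 1
    b_eq = np.array([1.0])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (n + 1))
    y, t = res.x[:n], res.x[n]
    return y / t                                   # recover the optimal x

# tiny usage example: maximize (x1 + 1) / (x2 + 2) s.t. x1 + x2 <= 1, x >= 0
x_opt = solve_lfp(c=np.array([1.0, 0.0]), alpha=1.0,
                  d=np.array([0.0, 1.0]), beta=2.0,
                  A=np.array([[1.0, 1.0]]), b=np.array([1.0]))
```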

SLIDE 26

Calculating BPL & FPL

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

t=1 2 3 4 5 6 7 8 9 10

0.10 0.18 0.25 0.30 0.35 0.39 0.42 0.45 0.48 0.50

(i) Strong temporal corr. (ii) Moderate temporal corr. (iii) No temporal corr.

Privacy Loss Time

  • Example of BPL under different temporal corr.

Privacy Quantification Upper bound

SLIDE 27

Calculating BPL & FPL

20 40 60 80 100 t 0.2 0.4 0.6 0.8 BPL

q = 0.8; d = 0.1; ε = 0.23

20 40 60 80 100 t 0.5 1.0 1.5 2.0 2.5 3.0 3.5 BPL

q = 0.8; d = 0; ε = 0.23

(d) (c) (b) (a)

Pi=( )

1 1

Pi=( )

0.8 0.2 1

Pi=( )

0.8 0.2 1

Pi=( )

0.8 0.1 0.2 0.9

B B B B

20 40 60 80 100 t 5 10 15 20 BPL

q = 1; d = 0; ε = 0.23

20 40 60 80 100 t 0.2 0.4 0.6 0.8 1.0 1.2 BPL

q = 0.8; d = 0; ε = 0.15

q=0.8, d=0.1, ε=0.23 q=1, d=0, ε=0.23 q=0.8, d=0, ε=0.23 q=0.8, d=0, ε=0.15

case 4 case 3 case 2 case 1

Privacy Quantification Upper bound

Refer to Theorem 5 in our paper

Privacy Loss

time

SLIDE 28

Outline

  • Data Mining with Differential Privacy
  • Scenario: Spatiotemporal Data Mining using DP
  • Markov Chain for temporal correlations
  • Gaussian Markov Random Field for user-user correlations
  • Summary and open problems
  • what is GMRF
  • how can (attacker) learn GMRF from data
  • how can (attacker) infer private data using GMRF
SLIDE 29

What is GMRF

  • A Gaussian Markov Random Field is a probabilistic graphical model in which each node is a Gaussian variable and the (undirected) edges indicate the dependencies between variables.
  • Data correlation = (joint) distribution over data
  • We choose the Gaussian Markov random field:
    • Rich representation
    • Easy to construct
    • Unified form (Gaussian)
    • Easy to compute
SLIDE 30

How can (attacker) learn GMRF

  • If the attacker knows user-location frequencies, he can estimate the GMRF from them.
  • If the attacker knows the user-user social network, we can construct a GMRF from the weighted undirected graph, taking the precision matrix to be the graph's Laplacian matrix:

$$p(x) \propto \exp\left(-\frac{1}{2}\, x^\top \Sigma^{-1} x\right), \qquad \Sigma^{-1} = \text{Laplacian matrix of the graph}$$

[Example: a weighted social graph over users A, B, C, D and its Laplacian matrix (diagonal degrees 5, 5, 5, 9), used as the precision matrix.]
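A sketch of the second bullet: build the Laplacian of a weighted social graph and use it as the GMRF precision matrix. The edge weights below are hypothetical, and since the Laplacian is singular, a small ridge term is added so the matrix is a valid (positive definite) precision matrix:

```python
import numpy as np

# Hypothetical weighted social graph over users A-D: (user_a, user_b, weight)
edges = [("A", "B", 2.0), ("A", "D", 3.0), ("B", "C", 1.0), ("C", "D", 5.0)]
nodes = ["A", "B", "C", "D"]
idx = {n: i for i, n in enumerate(nodes)}

def gmrf_precision(nodes, edges, ridge=1e-3):
    """Precision matrix = graph Laplacian L = D - W, plus a small ridge
    term so it is positive definite."""
    W = np.zeros((len(nodes), len(nodes)))
    for a, b, w in edges:
        W[idx[a], idx[b]] = W[idx[b], idx[a]] = w
    L = np.diag(W.sum(axis=1)) - W
    return L + ridge * np.eye(len(nodes))

Q = gmrf_precision(nodes, edges)
# GMRF density up to normalization: p(x) ∝ exp(-0.5 * x @ Q @ x)
```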

SLIDE 31

How can (attacker) infer private data using GMRF
[Model Attacker → Define SPL]

  • Each user can define their privacy level as R(G, δ), where G is a subgraph of the GMRF and δ is the percentage of the data known by adversaries.

[Illustration: two examples over nodes x1, …, x9, with known, unknown, and victim nodes marked: (a) R1(G1, 1/3); (b) R'1(G'1, 0.5).]

SLIDE 32

How can (attacker) infer private data using GMRF
[Model Attacker → Define SPL]

  • Bayesian inference: "what is the probability of returning r if the victim is at loc_i?"
  • On the GMRF: "what is the difference in probability between 'the victim is at loc_i' and 'the victim is at loc_i''?"
  • Marginalizing out D_u (D_u: unknown, to be inferred; D_k: known to the adversary) combines the Laplace mechanism with the data correlation (GMRF):

$$\Pr(r \mid loc_i) = \sum_{D_u} \Pr(r \mid D)\,\Pr(D_u \mid loc_i, D_k)$$

  • Definition of SPL:

$$\mathrm{SPL} = \log \frac{\Pr(r \mid loc_i)}{\Pr(r \mid loc_i')} = \log \frac{\sum_{D_u} \Pr(r \mid D)\,\Pr(D_u \mid loc_i, D_k)}{\sum_{D_u} \Pr(r \mid D')\,\Pr(D_u \mid loc_i', D_k)}$$
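A toy numeric instance of this marginalization (all probabilities below are hypothetical): the release r is the count of users at loc1 plus Lap(1/ε) noise, and the other user's location D_u is correlated with the victim's through a GMRF-like prior.

```python
import numpy as np

def lap_pdf(x, scale):
    """Density of the Laplace(0, scale) distribution."""
    return np.exp(-abs(x) / scale) / (2 * scale)

eps = 1.0
# Pr(D_u = loc1 | victim's location), a hypothetical correlation prior
pr_u2_at_loc1 = {"loc1": 0.9, "loc2": 0.2}

def pr_r_given_victim(r, victim_loc):
    """Pr(r | loc_i) = sum over D_u of Pr(r | D) * Pr(D_u | loc_i, D_k)."""
    p = pr_u2_at_loc1[victim_loc]
    count_if_u2_loc1 = (victim_loc == "loc1") + 1  # victim + other user
    count_if_u2_loc2 = (victim_loc == "loc1") + 0  # victim only
    return (p * lap_pdf(r - count_if_u2_loc1, 1 / eps)
            + (1 - p) * lap_pdf(r - count_if_u2_loc2, 1 / eps))

r = 1.7  # an observed noisy release
spl = np.log(pr_r_given_victim(r, "loc1") / pr_r_given_victim(r, "loc2"))
```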

SLIDE 33

Impact of Pi

  • s: parameter of Laplacian smoothing. The smaller s is, the higher the attacker's prior knowledge about Pi.

[Plots: TPL over time for s = 0.0, 0.005, 0.05, 1.0: (a) TPL for ε = 0.1; (b) TPL for ε = 1.]

SLIDE 34

Impact of Gi and m

  • p: density of the graph.
  • SPL vs. p: privacy loss on sparser graphs tends to be higher.
  • SPL vs. |Gi|: privacy loss for larger |Gi| tends to be higher.

[Plots of average SPL: (a) SPL vs. p on G0.8_50, Brightkite_50, G0.2_50, for m = 10, 20, 30, 40; (b) SPL vs. |Gi| on G0.8, G0.5, G0.2, for |Gi| = 20, 30, 40, 50 (m = 10).]

SLIDE 35

Outline

  • Data Mining with Differential Privacy
  • Scenario: Spatiotemporal Data Mining using DP
  • Markov Chain for temporal correlations
  • Gaussian Markov Random Field for user-user correlations
  • Summary and open problems
SLIDE 36

Summary

  • When mining private data using DP, we need to model data correlations.
  • For spatiotemporal data, we naturally model temporal correlations using a Markov chain and user-user correlations using a GMRF, as the attacker's knowledge.
  • Based on inference/computation over the MC and the GMRF, we can calibrate the privacy parameter for DP:

sensitive data → model data correlations → attacker inference → Laplace mechanism Lap(1/ε') → ε-DP data

SLIDE 37

Open Problems

  • How do we model spatiotemporal correlations as a whole, or find an appropriate combination of MC and GMRF?
  • Data correlations may degrade the utility of the mined private data (we need to add more noise for the same level of privacy): how can the data miner take advantage of data correlations to improve data utility?