Modeling Data Correlations in Private Data Mining with Markov Model and Markov Networks
Yang Cao Emory University 2017.11.15
Outline
Data Mining with Differential Privacy (DP)
Scenario: Spatiotemporal Data Mining using DP
Markov
[Figure: a company or institute holds a sensitive database and releases either the sensitive data or noisy data to the public; an adversary may attack the release.]
How? ε-Differential Privacy!
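Intuition: the released output is not significantly affected by any individual's data.
For neighboring databases D and D' (differing in one individual's record), a mechanism M answering query Q satisfies ε-DP if, for every output r,
$$\log \frac{\Pr(M(Q(D)) = r)}{\Pr(M(Q(D')) = r)} \le \varepsilon$$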
ε ⬆, privacy ⬇.
e.g. 2ε-DP means more privacy loss than ε-DP.
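As a quick numeric check of the definition, the sketch below evaluates the log-ratio for a count query released with Lap(1/ε) noise on two neighboring databases whose true counts differ by 1. The helper names (`laplace_pdf`, `max_log_ratio`) and the example counts are illustrative assumptions, not taken from the slides.

```python
import math

def laplace_pdf(r, mu, scale):
    """Density at r of a Laplace distribution centered at mu."""
    return math.exp(-abs(r - mu) / scale) / (2.0 * scale)

def max_log_ratio(count_d, count_d_prime, eps, outputs):
    """Largest |log Pr(M(Q(D)) = r) / Pr(M(Q(D')) = r)| over candidate outputs r."""
    scale = 1.0 / eps  # Lap(1/eps): a count changes by at most 1 between neighbors
    return max(
        abs(math.log(laplace_pdf(r, count_d, scale)
                     / laplace_pdf(r, count_d_prime, scale)))
        for r in outputs
    )

eps = 0.5
outputs = [x / 4.0 for x in range(-40, 41)]  # candidate released values r
# Hypothetical true counts: Q(D) = 2 and Q(D') = 1.
loss = max_log_ratio(2, 1, eps, outputs)
print(loss <= eps + 1e-9)
```

On this grid the largest log-ratio stays at ε, which is what ε-DP requires of the Laplace mechanism for a query whose answer changes by at most 1.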
Independent records: D and D' differ in one record ⇒ ε-DP
Correlated records: changing one record may change others ⇒ ?-DP
[*] Differential Privacy as a Causal Property, https://arxiv.org/abs/1710.05899
[**] https://github.com/frankmcsherry/blog/blob/master/posts/2016-08-29.md
A quantification approach to achieve ε-DP (protecting each user's private data value)

Traditional approach (if the attacker knows correlations, ε-DP may not hold):
sensitive data → Laplace Mechanism Lap(1/ε) → ε-DP data

Quantification approach (protects against attackers with knowledge of correlations):
sensitive data → model data correlations → attacker inference → Laplace Mechanism Lap(1/ε') → ε-DP data

Correlation models:
[Cao17]: Markov Chain
[Yang15]: Gaussian Markov Random Field (GMRF)
[Song17]: Bayesian Network
(a) Location Data (sensitive data)

      t=1   t=2   t=3   …
u1    loc3  loc1  loc1  …
u2    loc2  loc4  loc5  …
u3    loc2  loc4  loc5  …
u4    loc4  loc5  loc3  …

(b) True Counts: a count query over each snapshot D1, D2, D3, … gives the number of users at loc1, …, loc5 at each time t.

(c) Private Counts (private data): Laplace noise Lap(1/ε) is added to the true counts, giving releases r1, r2, r3, …; each release satisfies ε-DP.
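The release pipeline in this figure can be sketched as follows: count the users per location in each snapshot, then add Lap(1/ε) noise to every count. The function names, the seed, and the value of ε are assumptions for illustration; the trajectories are those of table (a), and the noise scale 1/ε follows the slide.

```python
import random
from collections import Counter

LOCATIONS = ["loc1", "loc2", "loc3", "loc4", "loc5"]

# Trajectories from table (a): one location per user and time step.
TRAJECTORIES = {
    "u1": ["loc3", "loc1", "loc1"],
    "u2": ["loc2", "loc4", "loc5"],
    "u3": ["loc2", "loc4", "loc5"],
    "u4": ["loc4", "loc5", "loc3"],
}

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def true_counts(t):
    """Count query on snapshot D_t (t is a 0-based time index)."""
    counts = Counter(traj[t] for traj in TRAJECTORIES.values())
    return {loc: counts.get(loc, 0) for loc in LOCATIONS}

def private_counts(t, eps, rng):
    """Release r_t: true counts plus independent Lap(1/eps) noise."""
    return {loc: c + laplace_noise(1.0 / eps, rng)
            for loc, c in true_counts(t).items()}

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
eps = 1.0               # hypothetical privacy budget per time step
releases = [private_counts(t, eps, rng) for t in range(3)]
print(true_counts(0))
print(releases[0])
```

Each time step draws fresh noise, so every release r_t is treated on its own; the rest of the deck asks what this means once a user's locations are correlated across time.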
(a) Location Data

      7:00  8:00  9:00  …
u1    loc3  loc1  loc1  …
u2    loc2  loc1  loc1  …
u3    loc2  loc4  loc5  …
u4    loc4  loc5  loc3  …

Snapshots: D1 D2 D3 …

Temporal correlation for a single user, e.g. (a) Road Network [Figure: road network linking loc4, loc5, loc3]
Spatial correlation between users, e.g. (b) Social Ties [Figure: u2, u1, u3 linked by "couple" and "colleague" ties]
Raw Trajectories

      7:00  8:00  9:00  …
u1    loc1  loc3  loc2  …
u2    loc2  loc2  loc2  …
u3    loc3  loc1  loc1  …
u4    loc1  loc2  loc2  …

Transition Matrix (rows: location at time t; columns: location at time t+1)

        loc1  loc2  loc3
loc1    0.2   0.1   0.7
loc2    0.1   0.2   0.7
loc3    0.3   0.4   0.3

Markov property: $\Pr(x_t \mid x_{t-1}) = \Pr(x_t \mid x_{t-1}, \dots, x_1)$
Time-homogeneity: $\forall t > 0,\ \Pr(x_{t+1} \mid x_t) = \Pr(x_{t+2} \mid x_{t+1})$

Learn the transition matrix by maximum likelihood estimation.
Google-like model [*]
[*] E. Crisostomi, S. Kirkland, and R. Shorten, “A Google-like model of road network dynamics and its application to regulation and control,” International Journal of Control, vol. 84, no. 3, pp. 633–651, Mar. 2011.
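Maximum likelihood estimation of a first-order, time-homogeneous transition matrix reduces to counting observed transitions and normalizing each row. The sketch below applies this to the four raw trajectories listed on this slide; the function name and the uniform fallback for states with no observed outgoing transition are assumptions.

```python
def estimate_transition_matrix(trajectories, states):
    """MLE of Pr(x_{t+1} = b | x_t = a): transition counts, normalized per row."""
    counts = {a: {b: 0 for b in states} for a in states}
    for traj in trajectories:
        for prev, cur in zip(traj, traj[1:]):
            counts[prev][cur] += 1
    matrix = {}
    for a in states:
        total = sum(counts[a].values())
        # Assumption: a state never seen as a source gets a uniform row.
        matrix[a] = {b: (counts[a][b] / total if total else 1.0 / len(states))
                     for b in states}
    return matrix

STATES = ["loc1", "loc2", "loc3"]
RAW_TRAJECTORIES = [
    ["loc1", "loc3", "loc2"],  # u1
    ["loc2", "loc2", "loc2"],  # u2
    ["loc3", "loc1", "loc1"],  # u3
    ["loc1", "loc2", "loc2"],  # u4
]

P = estimate_transition_matrix(RAW_TRAJECTORIES, STATES)
for state in STATES:
    print(state, P[state])
```

With only three time steps per user the estimate is coarse, and it differs from the illustrative matrix on the slide.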
Model Attacker → Define TPL → Find structure of TPL

e.g., user i: loc1 → loc3 → loc2 → …

(a) Backward Temporal Correlation $P_i^B$: $\Pr(l_i^{t-1} \mid l_i^t)$ (rows: time t; columns: time t-1)

        loc1  loc2  loc3
loc1    0.1   0.2   0.7
loc2    (a single entry of 1; the other entries are 0)
loc3    0.3   0.3   0.4

(b) Forward Temporal Correlation $P_i^F$: $\Pr(l_i^t \mid l_i^{t-1})$ (rows: time t-1; columns: time t)

        loc1  loc2  loc3
loc1    0.2   0.3   0.5
loc2    0.1   0.1   0.8
loc3    0.6   0.2   0.2
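DP attacker $A_i(D_K)$: knows all the data $D_K$ in the database $D$ except the one of the victim, $l_i$.
[Figure: at t = 1, u1 is at loc3, u2 at loc2, u3 at loc2, u4 at loc4; the victim's location is unknown (?).]

+ Temporal Correlation? Attacker $A_i^T(D_K, P_i^B, P_i^F)$, with three types:
(i) $A_i^T(D_K, P_i^B, \emptyset)$
(ii) $A_i^T(D_K, \emptyset, P_i^F)$
(iii) $A_i^T(D_K, P_i^B, P_i^F)$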
If $PL_0(M) \le \varepsilon$, then $M$ satisfies ε-DP.
$$\text{Eqn (2)} = \log\frac{\Pr(r_1 \mid l_i^t, D_k^t)}{\Pr(r_1 \mid l_i^{t\prime}, D_k^t)} + \dots + \log\frac{\Pr(r_t \mid l_i^t, D_k^t)}{\Pr(r_t \mid l_i^{t\prime}, D_k^t)} + \dots + \log\frac{\Pr(r_T \mid l_i^t, D_k^t)}{\Pr(r_T \mid l_i^{t\prime}, D_k^t)}$$

Without temporal correlations, only the $r_t$ term contributes, and it is $PL_0$: TPL = $PL_0$.
With temporal correlations, the $r_t$ term is still $PL_0$, but the terms for the other releases are unknown (?): TPL = ?
Hard to quantify Eqn (2)…
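Attacker (iii) $A_i^T(D_K, P_i^B, P_i^F)$ ⇒ analyze attackers (i) and (ii) separately:
(i) $A_i^T(D_K, P_i^B, \emptyset)$: backward privacy loss (BPL)
(ii) $A_i^T(D_K, \emptyset, P_i^F)$: forward privacy loss (FPL)
Releases over time: $r_1, \dots, r_{t-1}, r_t, r_{t+1}, \dots, r_T$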
BPL ⇒ Eqn (6): the backward privacy loss function, determined by the backward temporal correlations. How to calculate it?
FPL ⇒ the forward privacy loss function, determined by the forward temporal correlations. How to calculate it?
Privacy Quantification Upper bound

[Figure: privacy loss over time, t = 1, …, 10, under (i) strong temporal correlation, (ii) moderate temporal correlation, (iii) no temporal correlation.]
Values in the figure:
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0.10 0.18 0.25 0.30 0.35 0.39 0.42 0.45 0.48 0.50
[Figure: BPL over time (t up to 100) for four backward correlation matrices $P_i^B$ (cases 1 to 4, panels (a) to (d)):]
q = 0.8, d = 0.1, ε = 0.23, $P_i^B = \begin{pmatrix} 0.8 & 0.2 \\ 0.1 & 0.9 \end{pmatrix}$ (BPL axis up to 0.8)
q = 0.8, d = 0, ε = 0.23, $P_i^B = \begin{pmatrix} 0.8 & 0.2 \\ 0 & 1 \end{pmatrix}$ (BPL axis up to 3.5)
q = 1, d = 0, ε = 0.23, $P_i^B = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ (BPL axis up to 20)
q = 0.8, d = 0, ε = 0.15, $P_i^B = \begin{pmatrix} 0.8 & 0.2 \\ 0 & 1 \end{pmatrix}$ (BPL axis up to 1.2)
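To get a feel for how q, d and ε shape these curves, the sketch below iterates a scalar recursion BPL(t) = log((q·(e^BPL(t-1) − 1) + 1) / (d·(e^BPL(t-1) − 1) + 1)) + ε, starting from BPL(1) = ε. This closed form is an assumption on my part: it does not appear in the slide text, and the paper should be consulted for the actual loss function and its conditions. The function names are hypothetical. With q = 0.8, d = 0, ε = 0.1 the recursion comes out close to the series 0.10, 0.18, …, 0.50 shown earlier, although the slides do not say which parameters produced that series.

```python
import math

def backward_loss_step(prev_bpl, q, d, eps):
    """One step of the assumed recursion: loss carried over from t-1, plus eps."""
    growth = math.exp(prev_bpl) - 1.0
    return math.log((q * growth + 1.0) / (d * growth + 1.0)) + eps

def bpl_series(q, d, eps, steps):
    """BPL(1), ..., BPL(steps), with BPL(1) = eps."""
    series = [eps]
    for _ in range(steps - 1):
        series.append(backward_loss_step(series[-1], q, d, eps))
    return series

# The four parameter settings listed for the figure.
settings = [(0.8, 0.1, 0.23), (0.8, 0.0, 0.23), (1.0, 0.0, 0.23), (0.8, 0.0, 0.15)]
for q, d, eps in settings:
    final = bpl_series(q, d, eps, 100)[-1]
    print(f"q={q}, d={d}, eps={eps}: BPL(100) = {final:.3f}")

moderate = bpl_series(0.8, 0.0, 0.1, 10)
print([round(v, 2) for v in moderate])
```

Under this assumed recursion, q = 1 and d = 0 (the identity matrix) makes the loss grow linearly in t, while q = 0.8 and d = 0.1 makes it level off, which matches the very different axis ranges of the four panels.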
Privacy Quantification Upper bound
Refer to Theorem 5 in our paper.
[Figure: privacy loss over time]
Gaussian Markov Random Field (GMRF): an undirected graph in which each node is a Gaussian variable and the (undirected) edges indicate the dependencies between variables.

[Figure: undirected weighted graph over nodes A, B, C, D and its Laplacian matrix]

$$p(x_i \mid \Sigma) \propto \exp\left(-\frac{1}{2}\, x_i^T \Sigma^{-1} x_i\right)$$
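A minimal sketch of how a weighted undirected graph gives a Laplacian matrix and an unnormalized Gaussian log-density. The edge list and weights are hypothetical (the graph in the figure cannot be read from the extracted text), and using the Laplacian in the role of $\Sigma^{-1}$ is an assumption the slide does not spell out.

```python
NODES = ["A", "B", "C", "D"]
# Hypothetical weighted edges; the actual graph of the slide is not recoverable.
EDGES = {("A", "B"): 2.0, ("B", "C"): 1.0, ("C", "D"): 3.0, ("A", "D"): 5.0}

def laplacian(nodes, edges):
    """Graph Laplacian: weighted degrees on the diagonal, -weight off the diagonal."""
    index = {n: i for i, n in enumerate(nodes)}
    lap = [[0.0] * len(nodes) for _ in nodes]
    for (u, v), w in edges.items():
        i, j = index[u], index[v]
        lap[i][i] += w
        lap[j][j] += w
        lap[i][j] -= w
        lap[j][i] -= w
    return lap

def log_density_unnormalized(x, precision):
    """-1/2 * x^T Q x, with Q standing in for the inverse covariance (assumption)."""
    n = len(x)
    quad = sum(x[i] * precision[i][j] * x[j] for i in range(n) for j in range(n))
    return -0.5 * quad

L = laplacian(NODES, EDGES)
x = [1.0, 0.0, 2.0, -1.0]
print(L)
print(log_density_unnormalized(x, L))
```

A zero off-diagonal entry (here between A and C) corresponds to a missing edge, and the quadratic form equals the weighted sum of squared differences across edges, so strongly tied nodes are pushed towards similar values.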
Model Attacker → Define SPL

[Figure: two GMRF graphs G1 and G'1 over nodes x1, …, x9, with nodes marked as known, unknown, or victim: (a) $R_1(G_1, 1/3)$, (b) $R'_1(G'_1, 0.5)$]

G: subgraph of the GMRF
δ: percentage of the data known by adversaries
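The adversary tries to distinguish "the victim is at $\mathrm{loc}_i$" from "the victim is at $\mathrm{loc}_i'$".

$$\Pr(r \mid \mathrm{loc}_i) = \sum_{D_u} \Pr(r \mid D)\,\Pr(D_u \mid \mathrm{loc}_i, D_k)$$

$\Pr(r \mid D)$: Laplace Mechanism
$\Pr(D_u \mid \mathrm{loc}_i, D_k)$: Data Correlation (GMRF)
$\sum_{D_u}$: Marginalization
$D_u$: unknown to the adversary, who tries to infer it; $D_k$: known to the adversary

$$SPL = \frac{\Pr(r \mid \mathrm{loc}_i)}{\Pr(r \mid \mathrm{loc}_i')} = \frac{\sum_{D_u} \Pr(r \mid D)\,\Pr(D_u \mid \mathrm{loc}_i, D_k)}{\sum_{D_u} \Pr(r \mid D')\,\Pr(D_u \mid \mathrm{loc}_i', D_k)}$$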
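The marginalization over $D_u$ can be illustrated with a toy case: a noisy count of users at loc1, one victim, one unknown user, and a fixed known count. Everything in the sketch is an assumption for illustration: the conditional table `cond` stands in for the GMRF correlation model, the names are hypothetical, and the logarithm of the SPL ratio is taken only so that it can be compared with ε.

```python
import math

def laplace_pdf(r, mu, eps):
    """Density at r of mu + Lap(1/eps)."""
    return 0.5 * eps * math.exp(-eps * abs(r - mu))

def pr_release(r, victim_at_loc1, p_unknown_at_loc1, known_count, eps):
    """Pr(r | loc_i) = sum over D_u of Pr(r | D) * Pr(D_u | loc_i, D_k)."""
    total = 0.0
    for unknown_at_loc1 in (0, 1):  # marginalize over the unknown record D_u
        p_du = p_unknown_at_loc1 if unknown_at_loc1 else 1.0 - p_unknown_at_loc1
        true_count = known_count + int(victim_at_loc1) + unknown_at_loc1
        total += laplace_pdf(r, true_count, eps) * p_du
    return total

def log_spl(cond, known_count, eps, grid):
    """Largest |log SPL| over candidate outputs r.

    cond[v] = Pr(unknown user at loc1 | victim at loc1 is v), a toy stand-in
    for the correlation model.
    """
    worst = 0.0
    for r in grid:
        at_loc = pr_release(r, True, cond[True], known_count, eps)
        elsewhere = pr_release(r, False, cond[False], known_count, eps)
        worst = max(worst, abs(math.log(at_loc / elsewhere)))
    return worst

eps = 0.5
grid = [x / 10.0 for x in range(-100, 101)]
independent = {True: 0.5, False: 0.5}  # unknown user ignores the victim
correlated = {True: 1.0, False: 0.0}   # unknown user always follows the victim
leak_independent = log_spl(independent, 1, eps, grid)
leak_correlated = log_spl(correlated, 1, eps, grid)
print(leak_independent, leak_correlated)
```

In this toy case the independent setting stays within ε, while perfect correlation doubles the loss to 2ε, because moving the victim also moves the unknown user and the count changes by 2.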
The smaller s is, the higher the prior knowledge about $P_i$.
[Figure: TPL (log scale, 1 to 100) over time t for s = 0.0, 0.005, 0.05, 1.0: (a) TPL for ε = 0.1, (b) TPL for ε = 1]
SPL vs. p: PL on sparser graphs tends to be higher.
SPL vs. |Gi|: PL on larger |Gi| tends to be higher.
[Figure: average SPL. (a) SPL vs. p, for m = 40, 30, 20, 10 on G0.8_50, Brightkite_50, G0.2_50. (b) SPL vs. |Gi| (m = 10), for |Gi| = 20, 30, 40, 50 on G0.8, G0.5, G0.2.]
…, we need to model data correlations.
Model temporal correlations using a Markov Chain; model user-user correlations using a GMRF as the attacker's knowledge.
…, we can calibrate the privacy parameter for DP.
sensitive data → model data correlations → attacker inference → Laplace Mechanism Lap(1/ε') → ε-DP data
What is the appropriate combination of MC and GMRF?
… private data (need to add more noise for the same level of privacy): how can a data miner take advantage of data correlations to improve data utility?