Expectation-Maximization Tensor Factorization for Practical Location Privacy Attacks
Takao Murakami (AIST*, Japan)
*AIST: National Institute of Advanced Industrial Science & Technology
Markov Chain Model-based Attacks

An attacker can de-anonymize traces (or infer locations) with high accuracy when the amount of training data is very large.
In reality, training data can be sparsely distributed over time: many users disclose a small number of locations not continuously but "sporadically" via SNS (e.g. one or two check-ins per day/week/month).

[Figure: a mobility trace over regions x1, x2, x3 published under pseudonym 63427 is de-anonymized using transition matrices trained at a fixed time interval (e.g. 30 min, 1 hour); elements that cannot be trained from the sparse training trace remain unobserved ("?").]

[Shokri+,S&P11] [Gambs+,JCSS14] [Mulder+,WPES08] [Xue+,ICDE13] etc.
Our Contributions

[Figure: training traces of users u1-u4 and their transition matrices P1-P4 over regions x1-x5, trained via ML (Maximum Likelihood Estimation). P1 and P4 each have only a few observed elements; all elements of P2 and P3 are unobserved ("?").]

Worst-case scenario for attackers (= reality?): no elements are observed in P2 & P3, so the traces of u2 & u3 cannot be de-anonymized. Can the attacker still do better than guessing? We show the answer is "yes": we propose a training method that outperforms a random guess even when no elements are observed in more than 70% of cases.
Contents

Introduction (Location Privacy, Related Work)
Our Proposal (EMTF: Expectation-Maximization Tensor Factorization)
Experiments
Location-based Services (LBS)

Many people are using LBS (e.g. map, route finding, check-in). "Spatial Big Data" can be provided to a third party for analysis (e.g. popular places), or made public to provide traffic information.

Privacy Issues

A mobility trace can contain sensitive locations (e.g. homes, hospitals), and an anonymized trace may be de-anonymized.

[Figure: users send mobility traces to an LBS provider, which shares Spatial Big Data with third parties for analysis; an anonymized trace under pseudonym 63427 can be de-anonymized using a Markov chain model.]
Markov Chain Model for De-anonymization

The attacker is anyone who has the anonymized traces (except the LBS provider). The attacker obtains training locations that are made public (e.g. via SNS), and de-anonymizes the traces using the trained transition matrices.

[Figure: mobility traces of users u1-uN are anonymized under pseudonyms (e.g. 32091, 29619); the attacker trains a transition matrix Pn for each user n from public training traces, where element pij is the transition probability from region xi to region xj, and matches each anonymized trace against the matrices P1-PN.]

[Shokri+,S&P11] [Gambs+,JCSS14] [Mulder+,WPES08] etc.
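The training step above (estimating pij by maximum likelihood from consecutive observed locations) can be sketched as follows. This is our own minimal illustration, not the paper's implementation; the 5-region example trace is made up:

```python
import numpy as np

def train_transition_matrix(trace, n_regions):
    """ML estimate of a Markov transition matrix from one training trace.

    trace: list of region indices (0-based); None marks a missing location.
    Rows with no observed transitions stay NaN ("?" in the slides).
    """
    counts = np.zeros((n_regions, n_regions))
    for cur, nxt in zip(trace, trace[1:]):
        if cur is not None and nxt is not None:  # count only consecutive observed pairs
            counts[cur, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        P = counts / row_sums              # normalize each row to probabilities
    P[row_sums.ravel() == 0] = np.nan      # unobserved rows remain "?"
    return P

# Sporadic trace over 5 regions: only one consecutive pair (x2 -> x3) is observed.
trace = [1, 2, None, None, 3]
P = train_transition_matrix(trace, 5)
print(P[1])  # row for x2: transition to x3 with probability 1; other rows are all NaN
```

With sporadic data, almost every row stays NaN, which is exactly the "?"-filled matrix the attacks above cannot use.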
Sporadic Training Data (training data are sparsely distributed over time)

Many users disclose a small number of locations "sporadically" (via SNS). If we do not estimate the missing locations, we cannot train P2 and P3, and therefore cannot de-anonymize the traces of u2 and u3 using these matrices. We need to "somehow" estimate the missing locations.

[Figure: sporadic training traces of users u1-u4 and the resulting ML (Maximum Likelihood Estimation) transition matrices P1-P4 over regions x1-x5; P1 and P4 have a few observed elements, while P2 and P3 are entirely unobserved ("?").]
Gibbs Sampling Method [Shokri+, S&P11]

Alternates between estimating Pn and estimating the missing locations of user un, independently of the other users.

Challenge: when there are few continuous locations in the training traces, (1) Pn cannot be estimated accurately, and (2) the missing locations cannot be estimated accurately using the inaccurate Pn from (1).

We address this challenge by estimating Pn with the help of "other users" (instead of estimating Pn independently).

[Figure: Gibbs sampling alternates between estimating the matrix Pn and estimating the missing locations in un's training trace; with sporadic data, most elements of Pn remain unobserved ("?").]
Our Proposal (EMTF: Expectation-Maximization Tensor Factorization)
Overview of EMTF

(1) Training transition matrices: we estimate the unobserved elements ("?") with the help of "similar users", and substitute the average matrix over all users for completely unobserved matrices.

(2) Estimating missing locations: we then estimate the missing locations (which we can also do with the help of "similar users"), and go back to (1).

We use the help of "similar users" (other users who have similar behavior) via TF (Tensor Factorization) [Murakami+, TIFS16]. Each matrix captures a unique feature of each user's behavior, since each trace is accurate and user-specific.

[Figure: sparse transition matrices are completed using similar users' matrices or the average matrix over all users; the completed matrices are then used to fill in missing locations in the training traces.]
Details of EMTF

TF (Tensor Factorization): used for item recommendation; factorizes a tensor into low-rank matrices, and estimates unobserved elements ("?") with the help of "similar users".

EM (Expectation-Maximization): trains a parameter Θ from observed data x while estimating missing data z; each EM cycle is guaranteed to increase the posterior probability Pr(Θ|x).

EMTF regards the N transition matrices as a 3rd-order tensor, and alternates between estimating the missing data z (E-step) and training the parameter Θ via TF (M-step). Example: observed locations x = (x2, x3, x1, x4, x3, x5), missing locations z = (x3, x4, x2, x4, x1, x4). EMTF can thus find the most probable Θ and z with the help of "similar users".
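To illustrate how tensor factorization lets a sparse user borrow statistics from similar users, here is a minimal toy sketch of our own (not the paper's EMTF implementation): a rank-2 CP model is fit by gradient descent to the observed entries of a stack of transition matrices, and the shared region factors then fill in a hypothetical sparse user's "?" entries.

```python
import numpy as np

rng = np.random.default_rng(0)

def factorize(T, mask, rank=2, lr=0.05, iters=3000, lam=0.001):
    """Fit a rank-`rank` CP model T[n,i,j] ~ sum_r U[n,r]*V[i,r]*W[j,r]
    to the observed entries (mask == True) of a 3rd-order tensor of
    transition matrices, by gradient descent with L2 regularization.
    The region factors V, W are shared across users, so a sparse user's
    matrix is completed with the help of users who behave similarly."""
    N, I, J = T.shape
    U = 0.3 * rng.standard_normal((N, rank))
    V = 0.3 * rng.standard_normal((I, rank))
    W = 0.3 * rng.standard_normal((J, rank))
    for _ in range(iters):
        That = np.einsum('nr,ir,jr->nij', U, V, W)
        E = np.where(mask, That - T, 0.0)  # error on observed entries only
        U -= lr * (np.einsum('nij,ir,jr->nr', E, V, W) + lam * U)
        V -= lr * (np.einsum('nij,nr,jr->ir', E, U, W) + lam * V)
        W -= lr * (np.einsum('nij,nr,ir->jr', E, U, V) + lam * W)
    return np.einsum('nr,ir,jr->nij', U, V, W)

# Two toy users with identical behavior; user 1's matrix is almost
# entirely unobserved ("?").
P = np.array([[0.1, 0.9],
              [0.8, 0.2]])
T = np.stack([P, P])
mask = np.ones_like(T, dtype=bool)
mask[1] = False
mask[1, 0, 0] = True  # user 1 reveals a single element
That = factorize(T, mask)
print(That[1, 0, 1])  # model's estimate for one of user 1's "?" elements
```

The observed entries are reconstructed accurately, and the unobserved entries of the sparse user are no longer "?" but model-based estimates; the actual EMTF objective and optimizer differ from this sketch.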
EMTF Algorithm

E-step: estimate the distribution Q(z) of the missing location vector z:

Q(z) = Pr(z | x, Θ)

(computed via the Forward-Backward algorithm).

M-step: estimate the parameter Θ of TF:

Θ̂ = argmax_{Θ ≥ 0} Σ_z Q(z) log Pr(x, z | Θ)

Since the maximum of the log-posterior equals the minimum of a regularized squared error, this is a quadratic problem (w.r.t. one parameter):

Θ̂ = argmin_{Θ ≥ 0} Σ_z Q(z) ( ||A − Â||_F² + λ ||Θ||_F² )

where A is the tensor of transition matrices computed from (x, z) and Â is its reconstruction by TF with parameter Θ. Time complexity is exponential in the number of missing locations.
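The Forward-Backward pass used in the E-step can be sketched as follows: a generic HMM-style implementation of the per-position marginals of Q(z), assuming the transition matrix is known. The 3-region chain is our own made-up example, not the paper's code:

```python
import numpy as np

def missing_location_marginals(trace, P, pi):
    """Posterior marginals Pr(location_t = j | observed locations) for a
    Markov chain with transition matrix P and initial distribution pi,
    where None in `trace` marks a missing location (forward-backward)."""
    T, M = len(trace), P.shape[0]

    def evidence(t):
        if trace[t] is None:
            return np.ones(M)          # missing: any region is possible
        e = np.zeros(M)
        e[trace[t]] = 1.0              # observed: region is pinned down
        return e

    alpha = np.zeros((T, M))
    beta = np.ones((T, M))
    alpha[0] = pi * evidence(0)
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):              # forward pass
        alpha[t] = evidence(t) * (alpha[t - 1] @ P)
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):     # backward pass
        beta[t] = P @ (evidence(t + 1) * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# 3 regions; from region i the user usually moves to region (i+1) mod 3.
P = np.array([[0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1]])
pi = np.full(3, 1 / 3)
gamma = missing_location_marginals([0, None, 2], P, pi)
print(gamma[1])  # missing middle location: region 1 is most likely
```

This yields the marginals of Q(z) for one user; the exponential cost discussed next comes from summing over all joint values of z, not from this per-position computation.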
Time Complexity of EMTF

The number of possible missing location vectors z is exponential in their length. E.g. with #(regions) = 256 and #(missing locations) = 8, the number of possible z is 256^8 = 2^64.

Approximation of EMTF: two approximation methods, both of which reduce the time complexity from exponential to linear:

[Method I] Viterbi: approximates Q(z) by the most probable value z*.
[Method II] FFBS (Forward Filtering Backward Sampling): approximates Q(z) by random samples z1, …, zS, which approximate Q(z) in a more accurate manner.

[Figure: Q(z), the distribution over completions of a training trace such as z = (x224, x204, x140, x156, x186, x192, x224, x256), is approximated by its mode z* (Viterbi) or by samples z1, …, zS (FFBS).]
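Method I can be sketched with a standard Viterbi pass (a generic illustration under an assumed transition matrix, not the paper's code): positions with a missing location may take any region, observed positions are fixed, and the most probable completion z* is recovered by backtracking.

```python
import numpy as np

def viterbi_complete(trace, P, pi):
    """Most probable completion z* of a trace (None = missing location)
    under a Markov chain with transition matrix P and initial dist pi."""
    T, M = len(trace), P.shape[0]
    logP = np.log(P + 1e-12)
    logpi = np.log(pi + 1e-12)
    allowed = [range(M) if x is None else [x] for x in trace]
    delta = np.full((T, M), -np.inf)   # best log-probability ending in region j
    psi = np.zeros((T, M), dtype=int)  # backpointers
    for j in allowed[0]:
        delta[0, j] = logpi[j]
    for t in range(1, T):
        for j in allowed[t]:
            scores = delta[t - 1] + logP[:, j]
            psi[t, j] = scores.argmax()
            delta[t, j] = scores.max()
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):      # backtrack along the best path
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy chain (our own): regions usually cycle 0 -> 1 -> 2 -> 0.
P = np.array([[0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1]])
pi = np.full(3, 1 / 3)
z_star = viterbi_complete([0, None, None, 0], P, pi)
print(z_star)  # most probable completion: [0, 1, 2, 0]
```

The cost is linear in the trace length (times M^2 per step), which is the reduction from exponential to linear that both approximation methods achieve.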
Experiments
Experimental Set-up: Gowalla Dataset

(Here we explain only the most important part; please see our paper for details.) We used traces in New York & Philadelphia (16 x 16 regions). Training: 250 users x 1 trace x 10 locations (time interval: more than 30 min). Testing: 250 users x 9 traces x 10 locations. We randomly deleted each training location with probability 80%, yielding extremely sporadic training data (a worst-case scenario for attackers): no elements in a transition matrix trained via ML (Maximum Likelihood Estimation) were observed in more than 70% of cases.
Experimental Results: De-anonymization Accuracy

We performed the Bayesian de-anonymization attack, which selects, for each testing trace, the K (< 250) candidates whose posterior probabilities are the highest.

ML & TF ≈ random guess, since they did not estimate missing locations. GS < random guess, since it did not accurately estimate missing locations. EMTF (both Viterbi and FFBS) outperformed the random guess in the sporadic training data scenario.

[Figure: left, example posterior probabilities over candidate ranks 1st to 250th (e.g. 0.34, 0.06, 0.01); right, accuracy [%] vs. number of candidates K (50-250) for ML (Maximum Likelihood), GS (Gibbs Sampling), TF (Tensor Factorization), EMTF-Viterbi, EMTF-FFBS, and the random guess.]
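The Bayesian attack ranks candidate users by posterior probability; with a uniform prior over users this reduces to ranking by the likelihood of the trace under each user's trained matrix. A minimal sketch with toy users and matrices of our own, not the experimental code:

```python
import numpy as np

def rank_candidates(anon_trace, matrices, pi):
    """Rank users for one anonymized trace by posterior probability.
    With a uniform prior over users, the posterior is proportional to
    the likelihood of the trace under each user's transition matrix."""
    scores = []
    for P in matrices:
        ll = np.log(pi[anon_trace[0]] + 1e-12)   # initial-location term
        for cur, nxt in zip(anon_trace, anon_trace[1:]):
            ll += np.log(P[cur, nxt] + 1e-12)    # transition terms
        scores.append(ll)
    return [int(i) for i in np.argsort(scores)[::-1]]  # best candidate first

# Two toy users: u0 tends to stay in a region, u1 tends to alternate.
P0 = np.array([[0.9, 0.1],
               [0.1, 0.9]])
P1 = np.array([[0.1, 0.9],
               [0.9, 0.1]])
pi = np.array([0.5, 0.5])
ranking = rank_candidates([0, 1, 0, 1], [P0, P1], pi)
print(ranking)  # [1, 0]: the alternating trace is attributed to u1
```

Selecting the top K entries of this ranking gives the K candidates evaluated in the accuracy plot; the quality of the ranking is entirely determined by how well the matrices were trained.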
Conclusion

Summary of results: our training method (EMTF) significantly outperformed a random guess, even when no elements were observed in more than 70% of cases.

Future work: evaluation of state-of-the-art obfuscation (e.g. geo-indistinguishability [Andres+, CCS13]) applied to sporadic training traces, i.e. training from traces to which noise has been added.
Appendix: Similar Users in the Gowalla Dataset

TF (Tensor Factorization) can automatically find a set of users who have "similar behavior", and trains the matrices so that each matrix is influenced by the similar users. We visualized "similar users" in Gowalla based on the trained parameters [Murakami+, TIFS16]: e.g. users who always stay in Manhattan, or who go to the western part of Manhattan.

[Figure: all users vs. users who had a large value in the 1st parameter vs. users who had a large value in the 2nd parameter.]