

SLIDE 1

Expectation-Maximization Tensor Factorization for Practical Location Privacy Attacks

Takao Murakami (AIST*, Japan)

*AIST: National Institute of Advanced Industrial Science & Technology

SLIDE 2

Outline: Markov Chain Model-based Attacks [Shokri+,S&P11] [Gambs+,JCSS14] [Mulder+,WPES08] [Xue+,ICDE13] etc.

- An attacker can de-anonymize traces (or infer locations) with high accuracy when the amount of training data is very large.

- In reality, training data can be sparsely distributed over time: many users disclose a small number of locations not continuously but "sporadically" via SNS (e.g. one or two check-ins per day/week/month).

[Figure: a mobility trace (x2, x3, x1) over regions xi, sampled at a fixed time interval (e.g. 30 min, 1 hour), is de-anonymized under pseudonym 63427 using per-user transition matrices (xi → xj); matrices trained from a sparse training trace contain unobserved elements ("?") and missing locations.]

SLIDE 3

Outline (continued)

- Worst-case scenario for attackers (= reality?): no elements are observed in P2 & P3, so we cannot de-anonymize u2 & u3.

- Q. Is it possible to de-anonymize traces using such training data?

- Our contributions: we show the answer is "yes". We propose a training method that outperforms a random guess even when no elements are observed in more than 70% of cases.

[Figure: training traces of users u1 (x2, x3), u2 (x1), u3 (x4), u4 (x4, x5, x3) and their ML-estimated transition matrices P1..P4 over regions x1..x5; most elements are unobserved ("?"), and P2 & P3 are completely unobserved. (ML: Maximum Likelihood Estimation)]

SLIDE 4

Contents

- Introduction (Location Privacy, Related Work)
- Our Proposal (EMTF: Expectation-Maximization Tensor Factorization)
- Experiments

SLIDE 5

Location Privacy

- Location-based Services (LBS): many people are using LBS (e.g. map, route finding, check-in). "Spatial Big Data" can be provided to a third party for analysis (e.g. finding popular places), or made public to provide traffic information.

- Privacy issues: a mobility trace can contain sensitive locations (e.g. homes, hospitals), and an anonymized trace may be de-anonymized.

[Figure: users send mobility traces to an LBS provider, which aggregates them into Spatial Big Data; an anonymized trace (x2, x3, x1) under pseudonym 63427 is de-anonymized using a Markov chain model.]

SLIDE 6

Related Work: Markov Chain Model for De-anonymization [Shokri+,S&P11] [Gambs+,JCSS14] [Mulder+,WPES08] etc.

- Attacker = anyone who has the anonymized traces (except the LBS provider or the third party analyzing them).

- The attacker obtains training locations that are made public (e.g. via SNS).

- The attacker de-anonymizes traces using the trained transition matrices, where pij is the transition probability from region xi to region xj.

[Figure: mobility traces of users u1..uN are anonymized into traces under pseudonyms 32091, 29619, ...; the attacker trains matrices P1..PN from public training traces (e.g. u1: x1, x3, x2 and x2, x4, x3) and matches anonymized traces back to users.]
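The basic training step on this slide, estimating each pij by counting transitions in a fully observed trace, can be sketched as follows. This is a minimal illustration, not the paper's code; the function name and the toy trace over region indices 0..4 (standing for x1..x5) are invented for the example.

```python
# Minimal sketch: maximum-likelihood estimation of one user's transition
# matrix from a fully observed training trace (Markov chain model).
import numpy as np

def ml_transition_matrix(trace, num_regions):
    """Count transitions x_i -> x_j and normalize each row to probabilities."""
    counts = np.zeros((num_regions, num_regions))
    for i, j in zip(trace[:-1], trace[1:]):
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observed transitions stay unknown ("?"), here NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        P = counts / row_sums
    return P

trace = [0, 2, 1, 2]          # e.g. x1 -> x3 -> x2 -> x3
P = ml_transition_matrix(trace, 5)
```

Rows for regions that never appear as a source (here x4 and x5) remain entirely unobserved, which is exactly the sparsity problem the following slides address.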

SLIDE 7

Related Work: Sporadic Training Data (training data sparsely distributed over time)

- Many users disclose a small number of locations "sporadically" (via SNS).

- If we do not estimate missing locations, we cannot train P2 and P3, so we cannot de-anonymize the traces of u2 and u3 using these matrices.

- We need to "somehow" estimate missing locations.

[Figure: training traces of users u1..u4 and their ML-estimated transition matrices P1..P4 over regions x1..x5; P2 and P3 contain no observed elements ("?"). Anonymized traces under pseudonyms 32091, 29619 belong to u2 and u3. (ML: Maximum Likelihood Estimation)]

SLIDE 8

Related Work: Gibbs Sampling Method [Shokri+,S&P11]

- Alternates between estimating Pn and estimating the missing locations of un, independently of other users.

- Challenge: when there are few continuous locations in the training traces, (1) Pn cannot be estimated accurately, and (2) the missing locations cannot be estimated accurately using the inaccurate Pn from (1).

- We address this challenge by estimating Pn with the help of "other users" (instead of estimating Pn independently).

[Figure: user un's sparse training trace (x2, x3, x1, x4, x5, x3 with gaps) and matrix Pn; the method alternates between "estimate matrix" and "estimate missing locations".]
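The alternation this slide describes can be sketched for a single user. This is not the actual Gibbs sampler of [Shokri+,S&P11] (which samples rather than taking the most probable value); it is a minimal illustration of the alternating structure, with an invented function name and toy two-region data.

```python
# Sketch of the alternation: (1) re-estimate P from the currently completed
# trace, (2) re-fill each missing location with its most probable value.
import numpy as np

def alternate_estimation(trace, missing, num_regions, iters=10):
    trace = list(trace)
    for _ in range(iters):
        # (1) Estimate P from the completed trace (add-one smoothing so
        # every row is a valid distribution).
        counts = np.ones((num_regions, num_regions))
        for i, j in zip(trace[:-1], trace[1:]):
            counts[i, j] += 1
        P = counts / counts.sum(axis=1, keepdims=True)
        # (2) Fill each missing slot with the most probable region given
        # its current neighbors.
        for t in missing:
            prev, nxt = trace[t - 1], trace[t + 1]
            trace[t] = int(np.argmax(P[prev] * P[:, nxt]))
    return trace, P

trace = [0, 1, 0, 1, 0, 1, 0, 1]   # alternating two-region pattern
missing = [3]                      # pretend position 3 was unobserved
trace[3] = 0                       # arbitrary initialization of the gap
filled, P = alternate_estimation(trace, missing, 2)
```

With this much regular structure the alternation recovers the gap; the slide's point is that with few continuous locations, step (1) has almost nothing to count, so the loop converges to a poor solution.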

SLIDE 9

Contents

- Introduction (Location Privacy, Related Work)
- Our Proposal (EMTF: Expectation-Maximization Tensor Factorization)
- Experiments

SLIDE 10

Overview of EMTF

We use the help of "similar users" (other users who have similar behavior):

(1) Training transition matrices: we estimate unobserved elements ("?") with the help of "similar users" via TF (Tensor Factorization) [Murakami+,TIFS16], and substitute the average matrix over all users for completely unobserved matrices.

(2) Estimating missing locations: we estimate missing locations via EM (Expectation-Maximization); we can do this with the help of "similar users". Then go back to (1).

Each matrix captures a unique feature of each user's behavior, since each trace is accurate and user-specific.

[Figure: sparse transition matrices with unobserved elements ("?") are completed using similar users' matrices and the average matrix; estimated locations fill the gaps between user-specific locations in each trace.]
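Step (1)'s fallback for completely unobserved matrices can be sketched directly. This is an illustrative sketch, not the paper's code; the function name and the toy 2×2 matrices are invented, and NaN stands in for an unobserved element ("?").

```python
# Sketch of the fallback in step (1): for a user whose matrix is completely
# unobserved, substitute the average matrix over the other users' matrices.
import numpy as np

def fill_unobserved_matrices(matrices):
    """matrices: list of (R x R) arrays; an all-NaN matrix is unobserved."""
    observed = [P for P in matrices if not np.isnan(P).all()]
    # Elementwise mean over observed matrices, ignoring any remaining NaNs.
    avg = np.nanmean(np.stack(observed), axis=0)
    return [avg.copy() if np.isnan(P).all() else P for P in matrices]

P1 = np.array([[0.2, 0.8], [0.5, 0.5]])
P2 = np.full((2, 2), np.nan)            # completely unobserved user
P3 = np.array([[0.4, 0.6], [0.5, 0.5]])
filled = fill_unobserved_matrices([P1, P2, P3])
```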

SLIDE 11

Details of EMTF

- TF (Tensor Factorization): used for item recommendation. Factorizes a tensor into low-rank matrices, and estimates unobserved elements ("?") with the help of "similar users".

- EM (Expectation-Maximization): trains the parameter Θ from the observed data x while estimating the missing data z. Each EM cycle is guaranteed to increase the posterior probability Pr(Θ|x).

- EMTF regards the set of transition matrices as a 3rd-order tensor and alternates between estimating the missing data z (E-step), e.g. z = (x3, x4, x2, x4, x1, x4) given x = (x2, x3, x1, x4, x3, x5), and training the parameter Θ via TF (M-step).

- EMTF can find the most probable Θ and z with the help of "similar users".
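The low-rank completion idea behind TF can be sketched on a toy tensor. This is a hedged illustration, not the model of [Murakami+,TIFS16]: it fits a rank-1 CP factorization of a (user × region × region) tensor to the observed entries only, by plain gradient descent; the function name, tensor values, and hyperparameters are all invented for the example.

```python
# Sketch: complete a 3rd-order (user x region x region) tensor by fitting
# low-rank CP factors to the observed entries; NaN marks an unobserved "?".
import numpy as np

def cp_complete(A, rank=1, lr=0.05, steps=4000, seed=0):
    rng = np.random.default_rng(seed)
    N, R, _ = A.shape
    U = rng.uniform(0.4, 0.6, (N, rank))   # user factors
    V = rng.uniform(0.4, 0.6, (R, rank))   # from-region factors
    W = rng.uniform(0.4, 0.6, (R, rank))   # to-region factors
    mask = ~np.isnan(A)
    for _ in range(steps):
        Ahat = np.einsum('nk,ik,jk->nij', U, V, W)
        err = np.where(mask, Ahat - A, 0.0)   # fit observed entries only
        U -= lr * np.einsum('nij,ik,jk->nk', err, V, W)
        V -= lr * np.einsum('nij,nk,jk->ik', err, U, W)
        W -= lr * np.einsum('nij,nk,ik->jk', err, U, V)
    return np.einsum('nk,ik,jk->nij', U, V, W)

# Three users; users 0 and 1 behave identically ("similar users").
M = np.outer([1.0, 0.5], [0.6, 0.4])   # a rank-1 2x2 transition pattern
A = np.stack([M, M, 0.5 * M])
A[1, 0, 0] = np.nan                    # one of user 1's elements is unobserved
Ahat = cp_complete(A)
```

Because user 1 shares the low-rank structure of user 0, the reconstruction fills in the missing element from the other users' observations, which is the "help of similar users" in miniature.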

SLIDE 12

EMTF Algorithm

E-step: estimate a distribution of the missing location vector z:

  Q(z) := Pr(z | x, Θ),

computed via the Forward-Backward algorithm. Naively, handling Q(z) has time complexity exponential in the number of missing locations.

M-step: estimate the parameter Θ̂ in TF, given by

  Θ̂ = argmax_{Θ ≥ 0} Σ_z Q(z) log Pr(x, z | Θ).

Since the maximum of the log-posterior equals the minimum of a regularized square error, this is equivalent to

  Θ̂ = argmin_{Θ ≥ 0} Σ_z Q(z) ( ||A − Â||²_F + λ ||Θ||²_F ),

where A is the tensor built from x and z, Â is its low-rank reconstruction from Θ, and λ is a regularization parameter. This is a quadratic problem with respect to each single parameter.

[Figure: x = (x2, x3, x1, x4, x3, x5), z = (x3, x4, x2, x4, x1, x4); the algorithm alternates between estimating locations (E-step) and training Θ via TF (M-step).]
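The Forward-Backward computation in the E-step can be sketched for a single trace. This is an illustrative sketch under simplifying assumptions (first-order Markov chain, known transition matrix P, uniform initial distribution); the function name and the toy 2-region matrix are not from the paper. It returns the posterior marginals Pr(z_t | x, Θ) at each time step rather than the full joint Q(z).

```python
# Sketch of the E-step: forward-backward over a trace with missing steps.
# None marks a missing (hidden) location; observed steps are clamped.
import numpy as np

def forward_backward(obs, P):
    """obs: list where an int is an observed region and None is missing."""
    R, T = len(P), len(obs)
    def clamp(t):  # observation "likelihood" vector at time t
        return np.ones(R) if obs[t] is None else np.eye(R)[obs[t]]
    fwd = np.zeros((T, R))
    fwd[0] = clamp(0) / R                       # uniform initial distribution
    for t in range(1, T):
        fwd[t] = clamp(t) * (fwd[t - 1] @ P)
    bwd = np.ones((T, R))
    for t in range(T - 2, -1, -1):
        bwd[t] = P @ (clamp(t + 1) * bwd[t + 1])
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)

P = np.array([[0.9, 0.1], [0.1, 0.9]])          # "sticky" 2-region chain
marginals = forward_backward([0, None, 0], P)    # region 0 observed, gap, 0
```

With both neighbors observed in region 0 and a sticky chain, the missing middle step is overwhelmingly likely to be region 0 as well.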

SLIDE 13

Approximation of EMTF

- Time complexity of EMTF: the number of possible missing location vectors z is exponential in its length; e.g. with #(regions) = 256 and #(missing locations) = 8, the number of possible z is 256^8 = 2^64.

- Two approximation methods:
  - [Method I] Viterbi: approximates Q(z) by the single most probable value z*.
  - [Method II] FFBS (Forward Filtering Backward Sampling): approximates Q(z) by random samples z1, ..., zS, approximating Q(z) more accurately than Method I.

- Both methods reduce the time complexity from exponential to linear.

[Figure: a training trace z = (x224, x204, x140, x156, x186, x192, x224, x256); Viterbi keeps only the mode z* of the distribution Q(z), while FFBS draws samples z1, ..., zS from Q(z).]
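Approximation Method I can be sketched directly. This is a hedged illustration of the Viterbi idea (find the single most probable completion z*), assuming a known transition matrix and uniform initial distribution; the function name and the toy 2-region data are invented.

```python
# Sketch of Method I: Viterbi fills missing steps (None) with the jointly
# most probable sequence z* under transition matrix P.
import numpy as np

def viterbi_fill(obs, P):
    R, T = len(P), len(obs)
    logP = np.log(P + 1e-12)
    allowed = [range(R) if o is None else [o] for o in obs]
    score = np.full((T, R), -np.inf)
    back = np.zeros((T, R), dtype=int)
    for s in allowed[0]:
        score[0, s] = 0.0                        # uniform init: constant dropped
    for t in range(1, T):
        for s in allowed[t]:
            prev = score[t - 1] + logP[:, s]
            back[t, s] = int(np.argmax(prev))
            score[t, s] = prev[back[t, s]]
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):                # trace back pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

P = np.array([[0.8, 0.2], [0.3, 0.7]])
z_star = viterbi_fill([0, None, 1], P)
```

Here staying in region 0 then jumping (0.8 × 0.2) beats jumping early (0.2 × 0.7), so the gap is filled with region 0. FFBS would instead sample the backward pass, keeping several plausible completions.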

SLIDE 14

Contents

- Introduction (Location Privacy, Related Work)
- Our Proposal (EMTF: Expectation-Maximization Tensor Factorization)
- Experiments

SLIDE 15

Experimental Set-up (here we explain only the most important part; please see our paper for details)

- Gowalla dataset: we used traces in New York & Philadelphia (16 × 16 regions).

- Training: 250 users × 1 trace × 10 locations (time interval: more than 30 min).

- Testing: 250 users × 9 traces × 10 locations.

- We randomly deleted each training location with probability 80%, so no elements in a matrix were observed in more than 70% of cases: extremely sporadic training data (a worst-case scenario for attackers).

[Figure: a sparse training trace (x2, x3) yields an ML transition matrix with no observed elements in more than 70% of cases. (ML: Maximum Likelihood Estimation)]
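The sparsification step of the set-up can be reproduced in a couple of lines. This is only an illustrative sketch (the paper's exact preprocessing is not shown here); the function name is invented and the seed is fixed purely for repeatability.

```python
# Sketch of the set-up's deletion step: drop each training location
# independently with probability 0.8; None marks a deleted (missing) location.
import random

def sparsify(trace, delete_prob=0.8, seed=1):
    rng = random.Random(seed)
    return [None if rng.random() < delete_prob else x for x in trace]

trace = list(range(10))     # a 10-location training trace
sparse = sparsify(trace)
```

With 10 locations per trace and an 80% deletion rate, a typical trace keeps only one or two locations, which is why most transition matrices end up with no observed elements at all.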

SLIDE 16

Experimental Results: De-anonymization Accuracy

- We performed the Bayesian de-anonymization attack, which selects, for each testing trace, the K (< 250) candidates whose posterior probabilities are the highest.

- ML & TF ≈ random guess, since they did not estimate missing locations.

- GS < random guess, since it did not accurately estimate missing locations.

- EMTF outperformed the random guess in the sporadic training data scenario.

[Figure: accuracy [%] (25 to 100) vs. number of candidates K (50 to 250) for ML (Maximum Likelihood), GS (Gibbs Sampling), TF (Tensor Factorization), EMTF-Viterbi, EMTF-FFBS, and random; inset shows posterior probability by rank (1st: 0.34, Kth: 0.06, 250th: 0.01).]
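The ranking attack described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it scores each candidate user by the likelihood of the anonymized trace under that user's transition matrix (with a uniform prior, ranking by likelihood equals ranking by posterior) and keeps the top-K candidates; the function name and toy matrices are invented.

```python
# Sketch of the Bayesian de-anonymization attack: rank candidate users by
# the (log-)likelihood of an anonymized trace under each trained matrix.
import numpy as np

def top_k_candidates(trace, matrices, k):
    scores = []
    for n, P in enumerate(matrices):
        loglik = sum(np.log(P[i, j] + 1e-12)
                     for i, j in zip(trace[:-1], trace[1:]))
        scores.append((loglik, n))
    scores.sort(reverse=True)          # highest likelihood first
    return [n for _, n in scores[:k]]

P_a = np.array([[0.9, 0.1], [0.1, 0.9]])  # user who tends to stay put
P_b = np.array([[0.1, 0.9], [0.9, 0.1]])  # user who tends to alternate
trace = [0, 1, 0, 1]                       # alternating anonymized trace
candidates = top_k_candidates(trace, [P_a, P_b], k=1)
```

The alternating trace is far more likely under the "alternating" user's matrix, so that user tops the candidate list.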

SLIDE 17

Conclusion

- Summary of results: our training method (EMTF) significantly outperformed a random guess, even when no elements were observed in more than 70% of cases.

- Future work: evaluation of state-of-the-art obfuscation (e.g. geo-indistinguishability [Andres+,CCS13]) applied to sporadic training traces.

[Figure: noise added to sporadic training traces (x2, x3) still leaves transition matrices with no observed elements in more than 70% of cases.]
SLIDE 18

Thank you for listening.

SLIDE 19

Appendix: Similar Users in the Gowalla Dataset

- TF (Tensor Factorization) can automatically find a set of users who have "similar behavior", and trains the matrices so that each matrix is influenced by similar users.

- Visualization of "similar users" [Murakami+,TIFS16]: we visualized "similar users" in Gowalla based on the trained parameters; e.g. users who always stay in Manhattan, or who go to the western part of Manhattan.

[Figure: all users vs. users who had a large value in the 1st parameter vs. users who had a large value in the 2nd parameter.]