Privacy Preserving Data Mining, Moheeb Rajab (PowerPoint PPT Presentation)



SLIDE 1

Privacy Preserving Data Mining

Moheeb Rajab

SLIDE 2

Agenda

- Overview and Terminology
- Motivation
- Active Research Areas
  - Secure Multi-party Computation (SMC)
  - Randomization approach
- Limitations
- Summary and Insights

SLIDE 3

Overview

- What is Data Mining?
  - Extracting implicit, non-obvious patterns and relationships from warehoused data sets.
  - This information can be useful for increasing the efficiency of the organization and aids future planning.
- Can be done at an organizational level, by establishing a data warehouse.
- Can also be done at a global scale.

SLIDE 4

Data Mining System Architecture

[Figure: Entity I, Entity II, ..., Entity n each contribute their local data (bar charts) to a Global Aggregator]

SLIDE 5

Distributed Data Mining Architecture

- Lower-scale mining

[Figure: several sites each perform lower-scale mining over their own local data (bar charts)]
SLIDE 6

Challenges

- Privacy concerns
- Proprietary information disclosure
- Concerns about association breaches
- Misuse of mining

These concerns provide the motivation for privacy preserving data mining solutions.

SLIDE 7

Approaches to preserve privacy

- Restrict access to the data (protect individual records)
- Protect both the data and its source:
  - Secure Multi-party Computation (SMC)
  - Input Data Randomization
- No single solution fits all purposes.

SLIDE 8

SMC vs Randomization

[Figure, after Pinkas et al.: SMC and randomization schemes compared along overhead, accuracy, and privacy]

SLIDE 9

Secure Multi-party Computation

- Multiple parties share the burden of creating the data aggregate.
- Final processing, if needed, can be delegated to any party.
- The computation is considered secure if each party only knows its input and the result of its computation.

SLIDE 10

SMC

[Figure: several parties (bar charts) jointly compute an aggregate]

Each party knows its input and the result of the operation, and nothing else.

SLIDE 11

Key Assumptions

- The ONLY information that can be leaked is the information that we can get as an overall output of the computation (aggregation) process.
- Users are not malicious, but can be honest-but-curious.
- All users are assumed to abide by the SMC protocol.
- Otherwise, the case of malicious participants is not easy to model! [Pinkas et al., Agrawal]

SLIDE 12

"Tools for Privacy Preserving Distributed Data Mining", Clifton et al. [SIGKDD]

- Secure Sum
- Given values x_1, x_2, ..., x_n belonging to n entities
- We need to compute the sum

  S = x_1 + x_2 + ... + x_n = Σ_{i=1}^{n} x_i

- such that each entity ONLY knows its input and the result of the computation (the aggregate sum of the data)

SLIDE 13

Examples (Secure Sum)

The master site picks a random R (here R = 15) and passes R plus its contribution around a ring; each site adds its own value and forwards the running total. With site values 10, 20, 15, 45, 50 the forwarded totals are R+10, R+30, R+45, R+90, R+140, and the master subtracts R at the end: Sum = (R + 140) - R = 140.

- Problem:
  - Colluding members (a site's neighbors can collude to learn its value)
- Solution:
  - Divide values into shares and have each share traverse a disjoint permuted path (no site has the same neighbor twice)
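The ring protocol above can be sketched in a few lines of Python. The modulus is an assumption on my part (secure sum is usually stated modulo some large m so partial totals look uniformly random); the site values are the ones from the slide:

```python
import random

def secure_sum(values, modulus=10**6, seed=None):
    """Ring-based secure sum sketch: the master adds a random mask R, every
    site adds its own value to the running total, and the master removes R."""
    rng = random.Random(seed)
    R = rng.randrange(modulus)          # random mask chosen by the master
    running = R
    for v in values:                    # each site sees only the masked running total
        running = (running + v) % modulus
    return (running - R) % modulus      # master subtracts R to recover the sum

sites = [10, 20, 15, 45, 50]            # the site values from the slide
assert secure_sum(sites) == 140
```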

SLIDE 14

Split path solution

Each site splits its value into two halves, and two random seeds R1 = 15 and R2 = 12 travel disjoint paths. For the half-values 5, 10, 7.5, 22.5, 25, one path accumulates R1+5, R1+15, R1+22.5, R1+45, R1+70. Sum = (R1 + 70) - R1 + (R2 + 70) - R2 = 140.
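A minimal sketch of the split-path idea, assuming two equal shares per site and independently chosen random paths (the slide additionally requires the paths to be disjoint; enforcing that is left out here):

```python
import random

def split_path_secure_sum(values, shares=2, modulus=10**6, seed=1):
    """Each site splits its value into equal shares; each share set travels
    its own randomly permuted path under its own mask (R1, R2, ...), so no
    pair of colluding neighbors observes a site's whole value."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(shares):
        R = rng.randrange(modulus)                     # per-path mask
        path = rng.sample(range(len(values)), len(values))
        running = R
        for i in path:                                 # visit sites in permuted order
            running = (running + values[i] / shares) % modulus
        total += (running - R) % modulus               # each path yields sum/shares
    return total

sites = [10, 20, 15, 45, 50]
assert split_path_secure_sum(sites) == 140.0
```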

SLIDE 15

Secure Set Union

- Consider n sets S_1, S_2, ..., S_n
- Compute U = S_1 ∪ S_2 ∪ S_3 ∪ ... ∪ S_n, such that each entity ONLY knows U and nothing else.

SLIDE 16

Secure Set Union

- Uses the properties of commutative encryption.
- For any two permutations i, j of the keys K_1, ..., K_n, the following holds:

  E_{K_i1}(... E_{K_in}(M) ...) = E_{K_j1}(... E_{K_jn}(M) ...)

- and for two distinct messages M_1 and M_2, the probability of a collision

  P( E_{K_i1}(... E_{K_in}(M_1) ...) = E_{K_j1}(... E_{K_jn}(M_2) ...) )

  is negligible (less than a small ε).

SLIDE 17

Secure Set Union

Computing the global union set U, with commutative keys K_1, K_2, ..., K_n.

Each site:
- Encrypts its items
- Creates an array M[n] and adds it to U

Upon receiving U, an entity encrypts all items in U that it did not encrypt before.

In the end: all entries are encrypted with all keys.

Removing the duplicates:
- Identical plaintexts result in the same ciphertext, regardless of the order in which the encryption keys were applied.

Decrypting U:
- Done by all entities, in any order.

SLIDE 18

Secure Set Union

[Figure: an n-entry array of items 1, 2, 3, ..., n, each entry encrypted as E_{K_i1}(... E_{K_in}(M) ...)]

SLIDE 19

Sites 1, 2, and 3 hold items A, C, and A respectively:

U = {E1(A)}
U = {E2(E1(A)), E2(C)}
U = {E3(E2(E1(A))), E3(E2(C)), E3(A)}
U = {E3(E2(E1(A))), E1(E3(E2(C))), E1(E3(A))}
U = {E3(E2(E1(A))), E1(E3(E2(C))), E2(E1(E3(A)))}

Since the encryption is commutative, E2(E1(E3(A))) = E3(E2(E1(A))), so the duplicate copy of A is detected and removed.

- Problem:
  - Computation overhead; the number of exchanged messages is O(n * m)
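The trace above can be reproduced with a toy commutative cipher. This sketch uses Pohlig-Hellman-style exponentiation modulo a prime; the prime, the keys, and the integer encodings of A and C are illustrative assumptions, not from the slides:

```python
import math
import random

P = 2**61 - 1                     # toy prime modulus; a real deployment would
                                  # use a much larger, carefully chosen prime

def keygen(rng):
    """Pick an exponent coprime to P-1 so a matching decryption exponent exists."""
    while True:
        k = rng.randrange(3, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k

def enc(m, k):
    """Commutative encryption: m^k mod P, so enc(enc(m, a), b) == enc(enc(m, b), a)."""
    return pow(m, k, P)

rng = random.Random(42)
k1, k2, k3 = keygen(rng), keygen(rng), keygen(rng)
A, C = 1234567, 7654321           # the items, encoded as integers

# Order of encryption does not matter:
assert enc(enc(A, k1), k2) == enc(enc(A, k2), k1)

# Sites 1, 2, 3 hold A, C, A.  Once every item is encrypted under all three
# keys, the two copies of A produce identical ciphertexts, so the duplicate
# can be removed without any site seeing another site's plaintext.
U = {enc(enc(enc(x, k1), k2), k3) for x in (A, C, A)}
assert len(U) == 2
```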

SLIDE 20

Problems with SMC

- Scalability
- High overhead
- Details of the trust model assumptions: users are honest and follow the protocol

SLIDE 21

Randomization Approach

- "Privacy Preserving Data Mining", Agrawal et al. [SIGMOD]
- Generally applied to provide estimates of data distributions rather than single point estimates
- A user is allowed to alter the value provided to the aggregator
- The alteration scheme should be known to the aggregator
- The aggregator estimates the overall global distribution of the input by removing the randomization from the aggregate data

SLIDE 22

Randomization Approach (contd.)

- Assumptions:
  - Users are willing to divulge some form of their data
  - The aggregator is not malicious, but may be honest-but-curious (it follows the protocol)
- Two main data perturbation schemes:
  - Value-class membership (discretization)
  - Value distortion

SLIDE 23

Randomization Methods

- Value Distortion Method
- Given a value x_i, the client is allowed to report a distorted value (x_i + r), where r is a random variable drawn from a known distribution:
  - Uniform distribution: r ~ U[-α, +α], with mean μ = 0
  - Gaussian distribution: r ~ N(μ = 0, σ)
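A sketch of the value distortion method. The age data and noise parameters below are illustrative assumptions (for reference, the deck's evaluation slides use a Gaussian with μ = 0, σ = 0.25 and a uniform on [-0.5, 0.5]):

```python
import random

def distort_uniform(x, alpha, rng):
    """Report x + r with r ~ U[-alpha, +alpha] (mean 0)."""
    return x + rng.uniform(-alpha, alpha)

def distort_gaussian(x, sigma, rng):
    """Report x + r with r ~ N(mu=0, sigma)."""
    return x + rng.gauss(0.0, sigma)

rng = random.Random(7)
ages = [23, 31, 45, 52, 60]
reported = [distort_gaussian(a, 5.0, rng) for a in ages]   # what the aggregator sees

# The noise has zero mean, so it averages out over large samples:
noise = [distort_uniform(0.0, 0.5, rng) for _ in range(100_000)]
assert all(-0.5 <= r <= 0.5 for r in noise)
assert abs(sum(noise) / len(noise)) < 0.01
```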

SLIDE 24

Quantifying the privacy of different randomization schemes

Width of the interval that contains the true value at each confidence level (W is the discretization interval width; the uniform noise is drawn from [-α, +α]):

Distribution    |   50%     |   95%     |  99.9%
Discretization  | 0.5 x W   | 0.95 x W  | 0.999 x W
Uniform         | 0.5 x 2α  | 0.95 x 2α | 0.999 x 2α
Gaussian        | 1.34 x σ  | 3.92 x σ  | 6.8 x σ

The Gaussian distribution provides the most privacy at higher confidence levels.
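The Gaussian row can be sanity-checked against standard normal quantiles: the central interval with 95% coverage has width 2 x 1.96σ, i.e. about 3.92σ. A quick check using only the standard library:

```python
from statistics import NormalDist

std_normal = NormalDist()        # mu = 0, sigma = 1

def central_width(confidence):
    """Width (in units of sigma) of the central interval with the given coverage."""
    return 2 * std_normal.inv_cdf((1 + confidence) / 2)

assert abs(central_width(0.95) - 3.92) < 0.01   # the table's 95% entry
assert abs(central_width(0.50) - 1.34) < 0.02   # the table's 50% entry
```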

SLIDE 25

Problem Statement

[Figure: Entity I, Entity II, ..., Entity m send their randomized values to the Global Aggregator]

Entity I reports:  x_11 + y_11, x_12 + y_12, ..., x_1k + y_1k
Entity II reports: x_21 + y_21, x_22 + y_22, ..., x_2k + y_2k
...
Entity m reports:  x_m1 + y_m1, x_m2 + y_m2, ..., x_mk + y_mk

The aggregator sees the combined stream x_1 + y_1, x_2 + y_2, ..., x_n + y_n and, knowing the noise density f_Y(a), runs an estimator to recover the data density f_X(z).

SLIDE 26

Reconstruction of the Original Distribution

- The reconstruction problem can be viewed in the general framework of "inverse problems".
- Inverse problems: describing a system's internal structure from indirect, noisy data.
- Bayesian estimation is an effective tool for such settings.

SLIDE 27

Formal problem statement

- Given a one-dimensional array of randomized data: x_1 + y_1, x_2 + y_2, ..., x_n + y_n
- where the x_i's are iid random variables, each with the same distribution as the random variable X,
- and the y_i's are realizations of a globally known random variable Y with CDF F_Y.
- Purpose: estimate F_X.

SLIDE 28

Background: Bayesian Inference

- An estimation method that involves collecting observational data and using it to adjust (either support or refute) a prior belief.
- The previous knowledge (hypothesis) has an established probability, called the prior probability.
- The adjusted hypothesis given the new observational data is called the posterior probability.

SLIDE 29

Bayesian Inference

- Let P(H) be the prior probability; then Bayes' rule states that the posterior probability of H given an observation D is given by:

  P(H | D) = P(D | H) P(H) / P(D)

- Bayes' rule is a cyclic application of the general form of the joint probability theorem:

  P(D, H) = P(H | D) P(D)

SLIDE 30

Bayesian Inference (Classical Example)

- Two boxes:
  - Box-I: 30 red balls and 10 white balls
  - Box-II: 20 red balls and 20 white balls
- A person draws a red ball. What is the probability that the ball came from Box-I?
- Prior probability: P(Box-I) = 0.5
- From the data we know that:
  - P(Red | Box-I) = 30/40 = 0.75
  - P(Red | Box-II) = 20/40 = 0.5

SLIDE 31

Example (contd.)

- Now, given the new observation (the red ball), we want the posterior probability of Box-I, i.e. P(Box-I | Red):

  P(Box-I | Red) = P(Red | Box-I) P(Box-I) / P(Red)

  P(Red) = P(Red, Box-I) + P(Red, Box-II)
         = P(Red | Box-I) P(Box-I) + P(Red | Box-II) P(Box-II)
         = 0.75 x 0.5 + 0.5 x 0.5

SLIDE 32

Example (contd.)

- Computing the joint probability:

  P(Red) = P(Red | Box-I) P(Box-I) + P(Red | Box-II) P(Box-II) = 0.75 x 0.5 + 0.5 x 0.5 = 0.625

- Substituting:

  P(Box-I | Red) = (0.75 x 0.5) / 0.625 = 0.6

- The posterior probability of Box-I is amplified by the observation of the red ball.
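The box example can be written as a small helper that applies Bayes' rule over a discrete set of hypotheses:

```python
def posterior(prior, likelihood):
    """Bayes' rule P(h | D) = P(D | h) P(h) / P(D) for discrete hypotheses.
    prior[h] = P(h); likelihood[h] = P(D | h)."""
    evidence = sum(prior[h] * likelihood[h] for h in prior)     # P(D)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

# The two-box example from the slides:
prior = {"Box-I": 0.5, "Box-II": 0.5}
likelihood_red = {"Box-I": 30 / 40, "Box-II": 20 / 40}          # P(Red | box)

post = posterior(prior, likelihood_red)
assert abs(post["Box-I"] - 0.6) < 1e-12      # amplified from the 0.5 prior
```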

SLIDE 33

Back: Formal problem statement

- Given a one-dimensional array of randomized data: x_1 + y_1, x_2 + y_2, ..., x_n + y_n
- where the x_i's are iid random variables, each with the same distribution as the random variable X,
- and the y_i's are realizations of a globally known random variable Y with CDF F_Y.
- Purpose: estimate F_X.

SLIDE 34

Continuous probability distributions

  P{r <= z} = ∫_{-∞}^{z} f_X(k) dk = CDF(z) = F_X(z)

  P{r = z} = 0

  ∫_{-∞}^{+∞} f_X(k) dk = 1
SLIDE 35

CDF and PDF

SLIDE 36

Estimation of F_X

- Bayes' rule: P(H | D) = P(D | H) P(H) / P(D)
- Posterior probability of X_1 given the first randomized value w_1 = x_1 + y_1, applying Bayes' rule:

  F'_X1(a) = ∫_{-∞}^{a} f_{X1 | X1+Y1=w1}(z) dz
           = ∫_{-∞}^{a} f_{X1+Y1}(w_1 | X_1 = z) f_X1(z) dz / f_{X1+Y1}(w_1)

SLIDE 37

Estimation of F_X

- We want to evaluate f_{X1+Y1}(w_1):

  f_{X1+Y1}(w_1) = ∫_{-∞}^{+∞} f_{X1+Y1}(w_1 | X_1 = k) f_X1(k) dk

- Substituting:

  F'_X1(a) = ∫_{-∞}^{a} f_{X1+Y1}(w_1 | X_1 = z) f_X1(z) dz / ∫_{-∞}^{+∞} f_{X1+Y1}(w_1 | X_1 = k) f_X1(k) dk

SLIDE 38

Estimation of F_X

- Simplification (X and Y independent): f_{X1+Y1}(w_i | X_1 = z) = f_Y(w_i - z), giving

  F'_X(a) = ∫_{-∞}^{a} f_Y(w_i - z) f_X(z) dz / ∫_{-∞}^{+∞} f_Y(w_i - z) f_X(z) dz

SLIDE 39

Estimation of F_X

- Averaging over all n observations:

  F'_X(a) = (1/n) Σ_{i=1}^{n} [ ∫_{-∞}^{a} f_Y(w_i - z) f_X(z) dz / ∫_{-∞}^{+∞} f_Y(w_i - z) f_X(z) dz ]

SLIDE 40

Estimation of the PDF f_X

- The density f_X is just the derivative of the CDF:

  f'_X(a) = (1/n) Σ_{i=1}^{n} [ f_Y(w_i - a) f_X(a) / ∫_{-∞}^{+∞} f_Y(w_i - z) f_X(z) dz ]

SLIDE 41

Algorithm

  f^0_X := uniform distribution
  j := 0
  While (not stopping condition):
      f^{j+1}_X(a) := (1/n) Σ_{i=1}^{n} [ f_Y(w_i - a) f^j_X(a) / ∫_{-∞}^{+∞} f_Y(w_i - z) f^j_X(z) dz ]
      j := j + 1
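The loop above can be sketched numerically on a discretized grid. Everything here except the update rule itself is an illustrative assumption: the triangular "true" distribution, the sample size, the grid, and a fixed iteration cap standing in for the goodness-of-fit test (the noise σ = 0.25 matches the deck's Gaussian evaluation setting):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5000
x = rng.triangular(0.2, 0.5, 0.8, n)        # hidden true values x_i (assumed)
sigma = 0.25
w = x + rng.normal(0.0, sigma, n)           # randomized values the aggregator sees

def f_Y(t):
    """Known noise density: Gaussian with mu = 0."""
    return np.exp(-t**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

grid = np.linspace(-0.5, 1.5, 101)          # discretized x-domain
dz = grid[1] - grid[0]
f = np.full_like(grid, 1 / (grid[-1] - grid[0]))    # f^0 := uniform

for _ in range(50):
    K = f_Y(w[:, None] - grid[None, :])     # K[i, a] = f_Y(w_i - z_a)
    denom = (K * f).sum(axis=1) * dz        # ∫ f_Y(w_i - z) f^j(z) dz, per observation
    f_new = f * (K / denom[:, None]).mean(axis=0)
    if np.abs(f_new - f).sum() * dz < 1e-4: # stop when successive estimates agree
        f = f_new
        break
    f = f_new

mean_est = (grid * f).sum() * dz            # should land near the true mean 0.5
```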

SLIDE 42

Stopping Criteria

- The algorithm should terminate when the estimate stops changing significantly.
- For each round a goodness-of-fit (χ²) test is performed.
- Iteration is stopped when the difference between the two successive estimates f^j_X(a) and f^{j+1}_X(a) is too small (lower than a certain threshold).

SLIDE 43

Evaluation

Gaussian Randomizing Function µ=0, σ = 0.25

SLIDE 44

Evaluation

Uniform Randomizing Function [-0.5,0.5]

SLIDE 45

How is this different from the Kalman Estimator?

- Both are estimation techniques.
- Kalman is stateless.
- In the Kalman filter case we knew the distribution, and estimation is used to validate whether the trend of the data matches that distribution.
- In Bayesian inference, the observational data is used to adjust the prior hypothesis (probability distribution).

SLIDE 46

Is the Problem Solved?

- Suppose a client randomizes age records using a uniform random variable over [-50, 50].
- If the aggregator receives the value 120, it knows with 100% confidence that the actual age is at least 70.
- Simple randomization does not guarantee absolute privacy.

SLIDE 47

How to achieve a better randomization scheme

- "Limiting Privacy Breaches in Privacy Preserving Data Mining", Evfimievski et al.
- Define an evaluation metric of how privacy preserving a scheme is.
- Based on the developed metric, design a randomization scheme that abides by it.

SLIDE 48

How privacy preserving is a scheme?

- Information-theoretic approach:
  - Computes the average information disclosed by a randomized attribute, via the mutual information between the actual and the randomized attribute.
- Privacy breach:
  - Defines a criterion that should be satisfied for a randomization scheme to be privacy preserving.

SLIDE 49

What is a privacy breach?

- A privacy breach occurs when the disclosure of a randomized value y_i to the aggregator reveals that a certain property Q(x_i) of the "individual" input x_i holds with high probability.

SLIDE 50

Privacy Breach

- Back to Bayes':
- Prior probability: P(Q(x)), where Q(x) is the property.
- Posterior probability: P(Q(x) | y_i).

SLIDE 51

Amplification

- Defined in terms of the transition probability P[x -> y], where y is a fixed randomized output value.
- Intuitive definition: if there are many x_i's that can be mapped to y by the randomizing scheme, then disclosing y gives little information about x_i.
- The randomization operator is at most γ-amplifying for y if, for all inputs x_1, x_2:

  P[x_1 -> y] / P[x_2 -> y] <= γ

SLIDE 52

Amplification factor

Let R be a randomization operator, and y = R(x) a randomized value of x. Revealing y will not cause a privacy breach (from prior p_1 to posterior p_2) if:

  p_2 (1 - p_1) / (p_1 (1 - p_2)) > γ

where γ bounds the transition-probability ratios P[x_1 -> y] / P[x_2 -> y] for all y.
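A toy check of the condition. The randomization operator here (keep the true value with probability 0.5, otherwise report one of the two other values uniformly), the prior, and the property are my own assumptions for illustration:

```python
import itertools

VALUES = [0, 1, 2]

def trans(x, y):
    """Transition probability P[x -> y] of the toy randomization operator."""
    return 0.5 if x == y else 0.25

# Amplification gamma: worst-case ratio P[x1 -> y] / P[x2 -> y] over all x1, x2, y.
gamma = max(trans(x1, y) / trans(x2, y)
            for x1, x2, y in itertools.product(VALUES, repeat=3))
assert gamma == 2.0

# Breach condition: revealing y cannot raise a prior of p1 to a posterior of p2
# when p2 (1 - p1) / (p1 (1 - p2)) > gamma.
p1, p2 = 1 / 3, 0.6
assert p2 * (1 - p1) / (p1 * (1 - p2)) > gamma    # 3.0 > 2.0: no (p1, p2) breach

# Cross-check with Bayes under a uniform prior and the property Q(x): x == 0.
post = trans(0, 0) / sum(trans(x, 0) for x in VALUES)   # P(x = 0 | y = 0)
assert post == 0.5 and post < p2                        # posterior stays below p2
```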

SLIDE 53

Summary

- No one solution can fit all.
- Which area looks more promising?
- Can we create randomization schemes robust across a wide range of applications and different data distributions?
- How do we deal with the case of malicious participants?