Privacy preserving data mining multiplicative perturbation - - PowerPoint PPT Presentation



SLIDE 1

Privacy preserving data mining – multiplicative perturbation techniques

Li Xiong

CS573 Data Privacy and Anonymity

SLIDE 2

Outline

  • Review and critique of randomization approaches (additive noise)
  • Multiplicative data perturbations
      • Rotation perturbation
      • Geometric data perturbation
      • Random projection
  • Comparison

SLIDE 3


Additive noise (randomization)

  • Reveal the entire database, but randomize the entries: the database holds x1 … xn; add random noise δi to each database entry xi, so the user sees x1+δ1 … xn+δn

  • For example, if the noise distribution has mean 0, the user can still compute the average of the xi
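A minimal numpy sketch of this scheme (the data and noise parameters are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensitive column (e.g. ages); values are illustrative only.
x = rng.integers(20, 70, size=10_000).astype(float)

# Add zero-mean random noise delta_i to each entry x_i; only x + delta is published.
delta = rng.normal(loc=0.0, scale=15.0, size=x.size)
published = x + delta

# Because the noise has mean 0, the mean of the published column
# is still an unbiased estimate of mean(x).
print(abs(published.mean() - x.mean()))
```

With 10,000 records and noise of standard deviation 15, the error of the estimated mean is on the order of 15/√10000 = 0.15.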

SLIDE 4

Learning decision tree on randomized data

[Figure: pipeline for learning a decision tree on randomized data]

  • Original records (e.g. 30 | 70K | ..., 50 | 40K | ...) pass through a Randomizer: a random number is added to Age, e.g. Alice's age 30 becomes 65 (30+35)
  • Randomized records (e.g. 65 | 20K | ..., 25 | 60K | ...) are published
  • The distributions f_Age and f_Salary are reconstructed from the randomized data
  • A classification algorithm builds the model from the reconstructed distributions

SLIDE 5

Summary on additive perturbations

  • Benefits
      • Easy to apply – applied separately to each data point (record)
      • Low cost
      • Can be used for both the web model and the corporate model

[Figure: web model – users 1..n hold private values x1 … xn and each submits a randomized value xi+δi to the web application]

SLIDE 6

Additive perturbations - privacy

  • Need to publish the noise distribution
  • The column distribution is disclosed
  • Subject to data value attacks!

On the Privacy Preserving Properties of Random Data Perturbation Techniques, Kargupta et al., 2003

SLIDE 7

The spectral filtering technique can be used to estimate the original data

SLIDE 8
SLIDE 9

The spectral filtering technique can perform poorly when there is an inherent random component in the original data

SLIDE 10

Randomization – data utility

  • Only preserves the column distribution
  • Need to redesign/modify existing data mining algorithms
  • Limited data mining applications
      • Decision tree and naïve Bayes classifier

SLIDE 11

Randomization approaches

[Figure: trade-off between privacy guarantee and data utility/model accuracy]

  • Difficult to balance the two factors
  • Low data utility
  • Subject to attacks
SLIDE 12

More thoughts about perturbation

  • 1. Preserve Privacy
      • Hide the original data
      • Make it not easy to estimate the original values from the perturbed data
      • Protect from data reconstruction techniques
      • Assume the attacker has prior knowledge of the published data
  • 2. Preserve Data Utility for Tasks
      • Single-dimensional properties – column distribution, etc. → decision tree, Bayesian classifier
      • Multi-dimensional properties – covariance matrix, distance, etc. → SVM classifier, kNN classification, clustering

SLIDE 13

Multiplicative perturbations

  • Preserving multidimensional data properties
  • Geometric data perturbation (GDP) [Chen ’07]
      • Rotation data perturbation
      • Translation data perturbation
      • Noise addition
  • Random projection perturbation (RPP) [Liu ’06]

Chen, K. and Liu, L. Towards attack-resilient geometric data perturbation. SDM, 2007
Liu, K., Kargupta, H., and Ryan, J. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. TKDE, 2006

SLIDE 14

Rotation Perturbation

  • G(X) = R*X
  • Key features
      • preserves Euclidean distance and inner product of data points
      • preserves geometric shapes such as hyperplanes and hyper-curved surfaces in the multidimensional space

Rm*m – an orthonormal matrix (RTR = RRT = I)
Xm*n – original data set with n m-dimensional data points
G(X)m*n – rotated data set

Example:

Original data X:
  ID 001: age 30, rent 1350, tax 4230
  ID 002: age 25, rent 1000, tax 3320

Perturbed data G(X) = R*X:
  ID 001: age 1176, rent 3112, tax -2920
  ID 002: age  948, rent 2392, tax -2309

with R = [ .83  -.40   .40
           .20   .86   .46
           .53   .30  -.79 ]
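These invariants can be checked in a few lines of numpy, using a random orthonormal R generated via QR decomposition (synthetic data, not the table above):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5                          # m attributes, n records

X = rng.normal(size=(m, n))          # original data, one column per record

# Random orthonormal R (R^T R = R R^T = I) from the QR decomposition
# of a Gaussian matrix.
R, _ = np.linalg.qr(rng.normal(size=(m, m)))

GX = R @ X                           # rotation perturbation G(X) = R*X

# Euclidean distances and inner products are preserved exactly
# (up to floating-point error).
d_orig = np.linalg.norm(X[:, 0] - X[:, 1])
d_pert = np.linalg.norm(GX[:, 0] - GX[:, 1])
print(np.allclose(d_orig, d_pert), np.allclose(X.T @ X, GX.T @ GX))
```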

SLIDE 15

Illustration of multiplicative data perturbation

Preserving distances while perturbing each individual dimension

SLIDE 16

Data properties

  • A model is invariant to geometric perturbation if distance plays an important role
      • Class/cluster members and decision boundaries are correlated in terms of distance, not the concrete locations

[2D example: rotation and translation leave the classification boundary between Class 1 and Class 2 unchanged; distance perturbation (noise addition) only slightly changes the boundary]
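The invariance can be checked directly: a 1-nearest-neighbour classifier (a distance-based model) gives identical predictions before and after rotation plus translation. A numpy sketch with synthetic 2-D classes (all data and parameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two hypothetical 2-D classes (illustrative data, not from the slides).
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
               rng.normal([3, 3], 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
query = rng.normal([1.5, 1.5], 1.0, (20, 2))

def nn_predict(train, labels, q):
    # 1-nearest-neighbour by Euclidean distance.
    d = np.linalg.norm(train[None, :, :] - q[:, None, :], axis=2)
    return labels[d.argmin(axis=1)]

# Rotation + translation: all pairwise distances are unchanged,
# so the predictions agree exactly.
R, _ = np.linalg.qr(rng.normal(size=(2, 2)))
T = np.array([10.0, -7.0])
pred_orig = nn_predict(X, y, query)
pred_pert = nn_predict(X @ R.T + T, y, query @ R.T + T)
print((pred_orig == pred_pert).all())
```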

SLIDE 17

Applicable DM algorithms

  • Models “invariant” to GDP
      • all Euclidean-distance-based clustering algorithms
      • Classification algorithms
          • K Nearest Neighbors
          • Kernel methods
          • Linear classifiers
          • Support vector machines
      • Most regression models
      • And potentially more …

SLIDE 18

When to Use Multiplicative Data Perturbation

[Figure: the data owner computes G(X) = RX+T+D and sends it to the service provider/data user, who returns mined models/patterns F(G(X), )]

Good for the corporate model or dataset publishing.
Major issue!! Curious service providers/data users may try to break G(X)

SLIDE 19

Attacks!

  • Three levels of knowledge
      • Know nothing → naïve estimation
      • Know column distributions → Independent Component Analysis
      • Know specific points (original points and their images in perturbed data) → distance inference

SLIDE 20

Attack 1: naïve estimation

  • Estimate the original points purely based on the perturbed data

If using “random rotation” only:
  • Intensity of perturbation matters
  • Points around the origin are vulnerable

[Figure: classification boundaries between Class 1 and Class 2 in the X-Y plane before and after rotation about the origin]
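A short numpy sketch of why rotation-only perturbation is weak around the origin: rotation fixes the origin and preserves every point's norm, so the perturbed data reveals each record's exact distance from the origin (synthetic data, for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 100))                 # original records (columns)
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random rotation
Y = R @ X

# Norms are invariant under rotation: the attacker reads them off Y directly.
print(np.allclose(np.linalg.norm(Y, axis=0), np.linalg.norm(X, axis=0)))

# A point at the origin is not perturbed at all.
print(np.allclose(R @ np.zeros(3), np.zeros(3)))
```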

SLIDE 21

Countering naïve estimation

  • Maximize intensity
      • Based on formal analysis of “rotation intensity”
      • Method to maximize intensity: the Fast_Opt algorithm in GDP
  • “Random translation” T
      • Hides the origin
      • Increases the difficulty of attacking: need to estimate R first, in order to find out T

SLIDE 22

Attack 2: ICA based attacks

  • Independent Component Analysis (ICA)
      • Tries to separate R and X from Y = R*X

SLIDE 23

Characteristics of ICA

  • 1. Ordering of dimensions is not preserved
  • 2. Intensity (value range) is not preserved

Conditions for an effective ICA attack:
  • 1. Knowing the column distributions
  • 2. Knowing the value ranges

SLIDE 24

Countering ICA attack

  • Weaknesses of the ICA attack
      • Needs a certain amount of knowledge
      • Cannot effectively handle dependent columns
  • In reality…
      • Most datasets have correlated columns
      • We can find an optimal rotation perturbation maximizing the difficulty of ICA attacks

SLIDE 25

Attack 3: distance-inference attack

With only rotation/translation perturbation, when the attacker knows a set of original points and their mapping…

[Figure: original points, perturbed points, and a known point–image pair]

SLIDE 26

How is the Attack done …

  • Knowing points and their images …
      • find the exact images of the known points
          • Enumerate pairs by matched distances
          • … Less effective for large data …
          • we assume pairs are successfully identified
      • Estimation
          • 1. Cancel the random translation T from the pairs (x, x’)
          • 2. Calculate R from the pairs: Y = RX → R = Y*X^-1
          • 3. Calculate T with R and the known pairs
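The three estimation steps can be sketched as follows; differencing against one known pair to cancel T and solving for R with a pseudo-inverse is one straightforward implementation, not necessarily the one in the cited work (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 3
R_true, _ = np.linalg.qr(rng.normal(size=(m, m)))  # secret rotation
T_true = rng.normal(size=(m, 1))                   # secret translation

# Attacker knows k known original points and their perturbed images.
k = 5
X = rng.normal(size=(m, k))
Y = R_true @ X + T_true

# Step 1: cancel T by differencing against the first known pair.
dX = X[:, 1:] - X[:, :1]
dY = Y[:, 1:] - Y[:, :1]

# Step 2: solve dY = R*dX for R via the pseudo-inverse.
R_est = dY @ np.linalg.pinv(dX)

# Step 3: recover T from any known pair.
T_est = Y[:, :1] - R_est @ X[:, :1]
print(np.allclose(R_est, R_true), np.allclose(T_est, T_true))
```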
SLIDE 27

Countering distance-inference: Noise addition

  • Noise brings enough variance into the estimation of R and T
  • Can the noise be easily filtered?
      • Need to know the noise distribution
      • Need to know the distribution of RX + T
      • Neither distribution is published, however

Note: this is very different from the attacks on additive-noise data perturbation [Kargupta03]

SLIDE 28

Attackers with more knowledge?

  • What if attackers know a large number of original records?
      • They can accurately estimate the covariance matrix, column distributions, column ranges, etc., of the original data
      • Methods such as PCA can then be used
  • What do we do?
      • Stop releasing any kind of data anymore

SLIDE 29

Benefits of Geometric Data Perturbation

  • Privacy guarantee and data utility/model accuracy are decoupled
  • Applicable to many DM algorithms
      • Distance-based clustering
      • Classification: linear, KNN, kernel, SVM, …
  • Makes optimization and balancing easier!
      • Almost fully preserves model accuracy – we optimize privacy only
SLIDE 30

A randomized perturbation optimization algorithm

  • Start with a random rotation
  • Goal: pass tests on simulated attacks
  • Not simply random – a hill-climbing method

  • 1. Iteratively determine R:
      • Test on naïve estimation (Fast_opt)
      • Test on ICA (2nd level)
      • → find a better rotation R
  • 2. Append a random translation component
  • 3. Append an appropriate noise component
SLIDE 31

Privacy guarantee: GDP

  • In terms of naïve estimation and ICA-based attacks
  • Uses only the random rotation and translation components (R*X + T)

[Figure: privacy guarantee for the worst perturbation (no optimization), the perturbation optimized for naïve estimation only, and the perturbation optimized for both attacks]

SLIDE 32

Privacy guarantee: GDP

  • In terms of distance-inference attacks
      • Uses all three components (R*X + T + D)
      • Noise D: Gaussian N(0, σ²)
      • Assumes pairs of (original, image) points are identified by attackers
  • With no noise addition (σ = 0), the privacy guarantee is 0
  • Considerably high privacy guarantee at a small perturbation σ = 0.1

SLIDE 33

Data utility: GDP with noise addition

  • Noise addition vs. model accuracy – noise: N(0, 0.1²)
  • Boolean data is more sensitive to distance perturbation

SLIDE 34

Random Projection Perturbation

  • Random projection
      • projects a set of data points from a high-dimensional space to a lower-dimensional subspace
  • F(X) = P*X
      • X is an m*n matrix: n m-dimensional data points
      • P is a k*m random matrix, k <= m
  • Johnson-Lindenstrauss Lemma
      • There is a random projection F() with a small number e < 1 such that (1-e)||x-y|| <= ||F(x)-F(y)|| <= (1+e)||x-y||, i.e., distance is approximately preserved
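A numpy sketch of RPP using one common construction of P (i.i.d. Gaussian entries scaled by 1/√k, which preserves squared distances in expectation; other constructions also satisfy the lemma). All sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
m, k, n = 1000, 200, 50           # original dims, projected dims, points

X = rng.normal(size=(m, n))       # original data, one column per point

# Random projection P: k*m Gaussian matrix, scaled so that
# E[||P(x-y)||^2] = ||x-y||^2.
P = rng.normal(size=(k, m)) / np.sqrt(k)
FX = P @ X                        # F(X) = P*X

# Distances are preserved only approximately, within the JL distortion.
d_orig = np.linalg.norm(X[:, 0] - X[:, 1])
d_proj = np.linalg.norm(FX[:, 0] - FX[:, 1])
print(d_proj / d_orig)            # close to 1
```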

SLIDE 35

Data Utility: RPP

  • Reduced number of dimensions vs. model accuracy

[Figure: accuracy of KNN classifiers and SVMs as the number of projected dimensions varies]

SLIDE 36

Random projection vs. geometric perturbation

  • Privacy preservation
      • Subject to similar kinds of attacks
      • RPP is more resilient to distance-based attacks
  • Utility preservation (model accuracy)
      • GDP: R and T exactly preserve distances; the effect of D needs experimental evaluation
      • RPP: approximately preserves distances; # of perturbed dimensions vs. utility

SLIDE 37

Coming up

  • Output perturbation
  • Cryptographic protocols

SLIDE 38

Methodology of attack analysis

  • An attack is an estimate of the original data
      • Original O(x1, x2, …, xn) vs. estimate P(x’1, x’2, …, x’n)
      • How similar are these two series?
      • One effective method is to evaluate the variance/standard deviation of the difference [Rakesh00]: Var(P–O) or std(P–O), P: estimated, O: original

SLIDE 39

Two multi-column privacy metrics

qi: privacy guarantee for column i

  • qi = std(Pi–Oi), Oi: normalized column values, Pi: estimated column values
  • Min privacy guarantee: the weakest link of all columns → min { qi, i=1..d }
  • Avg privacy guarantee: overall privacy guarantee → (1/d) Σ qi
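Both metrics are direct to compute; in the sketch below the data is synthetic and the function name is ours:

```python
import numpy as np

def privacy_guarantees(O, P):
    """Multi-column privacy metrics from the slides.

    O: original (normalized) data, one column per attribute.
    P: attacker's estimate of O, same shape.
    q_i = std(P_i - O_i) per column; report min (weakest link) and average.
    """
    q = (P - O).std(axis=0)
    return q.min(), q.mean()

rng = np.random.default_rng(6)
O = rng.random((1000, 4))                                 # illustrative data
P = O + rng.normal(0, [0.05, 0.1, 0.2, 0.4], (1000, 4))   # column-wise error

min_q, avg_q = privacy_guarantees(O, P)
print(min_q, avg_q)   # min is set by the most accurately estimated column
```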