Privacy preserving data mining – multiplicative perturbation techniques
Li Xiong
CS573 Data Privacy and Anonymity
Outline
- Review and critique of randomization approaches (additive noise)
- Multiplicative data perturbations
  - Rotation perturbation
  - Geometric data perturbation
  - Random projection
Randomization by additive noise:
- Database holds x1, …, xn; released values are x1+δ1, …, xn+δn, i.e., random noise δi is added to each database entry xi
- For example, if the distribution of the noise has mean 0, the user can still compute the average of the xi
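A minimal NumPy sketch of this scheme (the value range and noise scale below are illustrative, not from the slides): every entry gets independent zero-mean noise, yet aggregates such as the mean survive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensitive column (e.g. ages of 10,000 users).
x = rng.integers(20, 60, size=10_000).astype(float)

# Each user adds independent zero-mean noise delta_i before release.
delta = rng.normal(loc=0.0, scale=15.0, size=x.shape)
y = x + delta                     # the only values the server sees

# Because E[delta] = 0, the average is still recoverable from y.
err = abs(y.mean() - x.mean())
```

With n = 10,000 records the error of the reconstructed mean is on the order of 15/√n ≈ 0.15, even though each individual value is heavily distorted.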
[Figure: randomization at work — records such as (50 | 40K | ...) and (30 | 70K | ...) each pass through a randomizer before release, e.g. a random number is added to Age so Alice's age 30 becomes 65 (30+35); the server reconstructs the distribution from the randomized records (65 | 20K | ..., 25 | 60K | ...) and feeds it to the classification algorithm to build the model.]
- Easy to apply: applied separately to each data point (record)
- Low cost
- Can be used for both the web model and the corporate model
[Figure: web model — each of user 1 … user n perturbs their private info locally, so the web application only ever receives x1+δ1, …, xn+δn instead of x1, …, xn.]
Subject to data value attacks! [Kargupta et al., On the Privacy Preserving Properties of Random Data Perturbation Techniques, 2003]
The spectral filtering technique can be used to estimate the original data
The spectral filtering technique can perform poorly when there is an inherent random component in the original data
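The spectral filtering idea can be sketched as follows (a simplified reconstruction in the spirit of the Kargupta et al. attack, not their exact algorithm; it assumes the noise level σ is known): keep only the eigen-directions of the perturbed data whose variance clearly exceeds the noise floor, and project onto them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated original data: 2 latent factors spread over 5 attributes.
n, d, k = 2000, 5, 2
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
sigma = 0.5                                     # noise level (assumed known)
Y = X + rng.normal(scale=sigma, size=X.shape)   # the perturbed release

# Eigen-decompose the covariance of the release and keep only the
# directions whose variance clearly exceeds the noise floor.
Yc = Y - Y.mean(axis=0)
w, V = np.linalg.eigh(Yc.T @ Yc / n)
keep = V[:, w > 2 * sigma**2]                   # crude threshold
X_hat = Yc @ keep @ keep.T + Y.mean(axis=0)     # filtered estimate of X

err_filtered = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
err_raw = np.linalg.norm(Y - X) / np.linalg.norm(X)
```

The filtered estimate is closer to X than the raw release Y, which is exactly the privacy problem: correlated columns leak through additive noise.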
Decision tree and naïve Bayes classifiers
Privacy guarantee Data Utility/ Model accuracy
Privacy guarantee:
- Hide the original data: it should not be easy to estimate the original values from the perturbed data
- Protect against data reconstruction techniques
- Assume the attacker has prior knowledge about the published data
Data utility/model accuracy:
- Single-dimensional properties (column distribution, etc.): decision tree, Bayesian classifier
- Multi-dimensional properties (covariance matrix, distance, etc.): SVM classifier, kNN classification, clustering
- Rotation data perturbation
- Translation data perturbation
- Noise addition
- Chen, K. and Liu, L. Towards attack-resilient geometric data perturbation. SDM, 2007.
- Liu, K., Kargupta, H., and Ryan, J. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. TKDE, 2006.
Rotation perturbation: G(X) = R*X
- R (m×m): an orthonormal matrix (R^T R = R R^T = I)
- X (m×n): original data set with n m-dimensional data points
- G(X) (m×n): rotated data set
- Preserves the Euclidean distance and inner product of data points
- Preserves geometric shapes such as hyperplanes and hyper-curved surfaces in the multidimensional space
Example (two records, attributes age/rent/tax; R is a 3×3 orthonormal matrix):

  Original X:              Rotated G(X) = R*X:
  ID   age  rent  tax      ID   age   rent  tax
  001  30   1350  4230     001  1176  3112  …
  002  25   1000  3320     002  948   2392  …
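A quick NumPy check of these properties (a sketch: here R is a random orthonormal matrix drawn via QR decomposition, not the matrix from the example above):

```python
import numpy as np

rng = np.random.default_rng(2)

m, n = 3, 4                                  # 3 attributes, 4 records (columns)
X = rng.uniform(0, 100, size=(m, n))

# A random orthonormal R: QR-decompose a random Gaussian matrix.
R, _ = np.linalg.qr(rng.normal(size=(m, m)))
assert np.allclose(R.T @ R, np.eye(m))       # R^T R = R R^T = I

GX = R @ X                                   # rotated data set G(X) = R*X

# Euclidean distances and inner products between records are preserved.
d_orig = np.linalg.norm(X[:, 0] - X[:, 1])
d_pert = np.linalg.norm(GX[:, 0] - GX[:, 1])
```

Each individual attribute value changes, yet every pairwise distance and inner product is exactly what it was before.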
Preserving distances while perturbing each individual dimension: class/cluster membership and decision boundaries are determined by distances between points, not by their concrete locations.
[2D example: rotation and translation leave the classification boundary between Class 1 and Class 2 intact relative to the data; distance perturbation (noise addition) only slightly changes the classification boundary.]
Models invariant under distance-preserving perturbation:
- All Euclidean-distance-based clustering algorithms
- Classification algorithms: k-nearest neighbors, kernel methods, linear classifiers, support vector machines
- Most regression models
- And potentially more…
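A small sketch of why such models are unaffected (an illustrative hand-rolled 1-nearest-neighbor classifier, to avoid external dependencies): rotation plus translation preserves all pairwise distances, so predictions on consistently transformed data are identical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two Gaussian classes in 3-D; columns are data points, as in G(X) = R*X.
X = np.hstack([rng.normal(0, 1, (3, 50)), rng.normal(4, 1, (3, 50))])
labels = np.array([0] * 50 + [1] * 50)

R, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # rotation component
T = rng.uniform(-5, 5, size=(3, 1))            # translation component
GX = R @ X + T

def nn_label(data, q):
    """1-nearest-neighbour label of query column q in data."""
    return labels[np.argmin(np.linalg.norm(data - q[:, None], axis=0))]

# The query must be transformed consistently with the training data.
q = rng.normal(4, 1, size=3)
same = nn_label(X, q) == nn_label(GX, (R @ q[:, None] + T).ravel())
```

Since ||(R x + T) − (R q + T)|| = ||x − q|| for every training point x, the nearest neighbor, and hence the prediction, is the same in both spaces.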
[Figure: the data owner publishes the perturbed data G(X) = R*X + T + D to the service provider/data user, who runs the mining algorithm F(G(X), …) and returns the mined models/patterns.]

Good for the corporate model or dataset publishing. Major issue: curious service providers/data users may try to break G(X).
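The full geometric data perturbation transform can be sketched as follows (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

m, n = 3, 200
X = rng.uniform(size=(m, n))                   # normalized original data

R, _ = np.linalg.qr(rng.normal(size=(m, m)))   # rotation component
T = rng.uniform(size=(m, 1))                   # translation: hides the origin
sigma = 0.05
D = rng.normal(scale=sigma, size=(m, n))       # distance perturbation

GX = R @ X + T + D                             # G(X) = R*X + T + D

# R and T preserve distances exactly; D distorts them only slightly.
d0 = np.linalg.norm(X[:, 0] - X[:, 1])
d1 = np.linalg.norm(GX[:, 0] - GX[:, 1])
```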
Attacker knowledge levels:
- Knows nothing: naïve estimation
- Knows column distributions: Independent Component Analysis (ICA)
- Knows specific points (original points and their images)

The intensity of perturbation matters; under a pure rotation, points around the origin barely move.
Resilience to naïve estimation:
- Based on a formal analysis of "rotation intensity"
- Method to maximize intensity: the Fast_Opt algorithm in GDP

Translation perturbation:
- Hides the origin and increases the difficulty of attacking
- The attacker needs to estimate R first in order to find out T
ICA-based attack:
- Tries to separate R and X from Y = R*X
- Conditions for an effective ICA attack: (1) knowing the column distributions, (2) knowing the value ranges
- So it needs a certain amount of prior knowledge, and it cannot effectively handle dependent columns
- Most datasets have correlated columns, and we can find an optimal rotation perturbation
Known point-image attack:
- If only rotation/translation perturbation is used, an attacker who knows a set of original points and their mapping can find the exact images of the known points
- Candidate pairs can be enumerated by matching distances; we assume the pairs are successfully identified
- With the noise component added, estimation also requires the noise distribution and the distribution of R*X + T; neither distribution is published, however
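Assuming the pairs have been identified and no noise component is present, the attack idea can be sketched with an orthogonal Procrustes fit (a sketch, not the exact procedure from the paper): centering cancels T, and an SVD of the cross-covariance recovers R.

```python
import numpy as np

rng = np.random.default_rng(6)

m = 3
R, _ = np.linalg.qr(rng.normal(size=(m, m)))   # the owner's secret R
T = rng.uniform(size=(m, 1))                   # and secret T

X_known = rng.uniform(size=(m, 5))             # points the attacker knows
Y_known = R @ X_known + T                      # their identified images

# Centering both sides cancels T; the orthogonal Procrustes solution
# (SVD of the cross-covariance) then recovers R exactly.
Xc = X_known - X_known.mean(axis=1, keepdims=True)
Yc = Y_known - Y_known.mean(axis=1, keepdims=True)
U, _, Vt = np.linalg.svd(Yc @ Xc.T)
R_hat = U @ Vt
T_hat = Y_known.mean(axis=1, keepdims=True) - R_hat @ X_known.mean(axis=1, keepdims=True)
```

With m linearly independent known points the recovery is exact, which is why the noise component D is needed on top of R and T.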
Note: this is very different from the attacks on noise-addition data perturbation [Kargupta03], where the attacker can accurately estimate the covariance matrix and column distributions, so methods such as PCA can be used. What do we do?
- Privacy guarantee and data utility/model accuracy are decoupled
- Applicable to many DM algorithms
- Makes optimization and balancing easier!
Randomized perturbation optimization:
- Start with a random rotation
- Goal: pass tests on simulated attacks
- Not simply random: a hill-climbing method to find a better rotation R
- Attacks simulated: naïve estimation and ICA-based attacks
- Uses only the random rotation and translation components (R*X + T)
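The optimization loop can be caricatured as below. This is not the Fast_Opt algorithm: the candidate generation (fresh random restarts rather than local hill climbing) and the score (a stand-in for the simulated naïve-estimation attack, the minimum per-attribute std of the change) are deliberate simplifications.

```python
import numpy as np

rng = np.random.default_rng(5)

m, n = 3, 500
X = rng.uniform(size=(m, n))

def privacy_score(R):
    # Stand-in for the simulated-attack tests: minimum per-attribute std
    # of the change, i.e. how badly the weakest column resists naive
    # estimation of the original values.
    return np.min(np.std(R @ X - X, axis=1))

def random_rotation(m):
    Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
    return Q

R_best = random_rotation(m)
best = privacy_score(R_best)
for _ in range(200):
    R_cand = random_rotation(m)      # fresh restarts here; Fast_Opt instead
    s = privacy_score(R_cand)        # refines the current rotation locally
    if s > best:
        R_best, best = R_cand, s
```

The identity rotation scores 0 (the release equals the original), so any accepted candidate strictly improves the weakest column's protection.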
Compared settings:
- Worst perturbation (no optimization)
- Optimized for naïve estimation only
- Optimized perturbation for both attacks

Setup: use all three components (R*X + T + D); noise D is Gaussian N(0, σ²); assume pairs of (original, image) points are identified by the attacker.
- With no noise addition (σ = 0), the privacy guarantee is 0
- Considerably high privacy guarantee already at small perturbation σ = 0.1
- Boolean data is more sensitive to distance perturbation
Random projection perturbation (RPP) projects a set of data points from a high-dimensional space to a randomly chosen lower-dimensional subspace:
- X is an m×n matrix: n m-dimensional data points as columns
- P is a k×m random matrix, k <= m; the perturbed data is P*X
- Johnson–Lindenstrauss lemma: there is a random projection F() with ε a small number < 1 such that (1−ε)·||x−y|| <= ||F(x)−F(y)|| <= (1+ε)·||x−y||, i.e., distance is approximately preserved
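A sketch of the lemma in action (dimensions chosen arbitrarily): with i.i.d. Gaussian entries of variance 1/k, squared lengths are preserved in expectation, and pairwise distances concentrate within roughly 1/√k of their true values.

```python
import numpy as np

rng = np.random.default_rng(7)

m, n, k = 1000, 20, 300            # project 1000-D points down to 300-D
X = rng.normal(size=(m, n))        # columns are data points

# Entries with variance 1/k make squared norms unbiased: E||Px||^2 = ||x||^2.
P = rng.normal(scale=1 / np.sqrt(k), size=(k, m))
Y = P @ X                          # perturbed (projected) data

d_orig = np.linalg.norm(X[:, 0] - X[:, 1])
d_proj = np.linalg.norm(Y[:, 0] - Y[:, 1])
ratio = d_proj / d_orig            # close to 1, within roughly 1/sqrt(k)
```

Unlike a rotation, P is not invertible (k < m), which is the source of RPP's extra resilience, at the price of only approximate distance preservation.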
- Works for kNN classifiers and SVMs
- Subject to similar kinds of attacks; RPP is more resilient to distance-based attacks
GDP:
- R and T exactly preserve distances; the effect of D needs experimental evaluation
RPP:
- Approximately preserves distances; the number of projected dimensions trades off against utility
One effective method is to evaluate the variance/standard deviation of the difference between the estimated and the original data [Rakesh00]: Var(P − O) or std(P − O), where P is the estimated data and O the original.
Per column: qi = std(Pi − Oi), where Oi are the normalized original column values and Pi the estimated column values.
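A sketch of computing this per-column metric (the attacker's estimate is simulated here by adding noise to the original):

```python
import numpy as np

rng = np.random.default_rng(8)

O = rng.uniform(10, 90, size=(1000, 3))          # original columns
P_est = O + rng.normal(scale=5.0, size=O.shape)  # simulated attacker estimate

def normalize(A):
    """Zero-mean, unit-variance columns, so q_i is comparable across attributes."""
    return (A - A.mean(axis=0)) / A.std(axis=0)

# q_i = std(P_i - O_i) on normalized columns: one privacy value per attribute.
q = np.std(normalize(P_est) - normalize(O), axis=0)
```

Higher q_i means the attacker's estimate of column i is worse, i.e., that attribute enjoys a stronger privacy guarantee.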