When Does Randomization Fail to Protect Privacy? Wenliang (Kevin) Du



SLIDE 1

When Does Randomization Fail to Protect Privacy?

Wenliang (Kevin) Du Department of EECS, Syracuse University

SLIDE 2

Random Perturbation

Agrawal and Srikant’s SIGMOD paper: Y = X + R

Original Data X + Random Noise R = Disguised Data Y
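A minimal sketch of the perturbation Y = X + R may help fix the notation. The attribute range, the Gaussian noise, and the noise level are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical original data X: 1000 records of one numeric attribute.
X = rng.uniform(20, 80, size=1000)

# Independent zero-mean noise R (Gaussian with sigma = 10 is an arbitrary choice).
R = rng.normal(loc=0.0, scale=10.0, size=X.shape)

# Disguised data Y = X + R is what gets released.
Y = X + R

print("mean of X:", X.mean())
print("mean of Y:", Y.mean())   # close to the mean of X, since R has zero mean
```

Individual values of Y differ noticeably from X, while aggregate statistics such as the mean stay close.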

SLIDE 3

Random Perturbation

Most of the security analysis methods based on randomization treat each attribute separately. Is that enough?

Does the relationship among data affect privacy?

SLIDE 4

As we all know …

We can’t perturb the same number several times with independent noise. If we do, the original data can be estimated:

Let t be the original data.

Disguised data: t + R1, t + R2, …, t + Rm

Let Z = [(t+R1) + … + (t+Rm)] / m

Mean: E(Z) = t (for zero-mean noise)

Variance: Var(Z) = Var(R) / m (for independent noise)
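The averaging attack can be checked numerically. The sketch below assumes independent zero-mean Gaussian noise. The value of `t`, the noise level, and the helper name `estimate_original` are hypothetical.

```python
import numpy as np

def estimate_original(disguised):
    """Z = average of the disguised copies t + R_1, ..., t + R_m."""
    return float(np.mean(disguised))

rng = np.random.default_rng(1)
t, sigma = 42.0, 10.0          # hypothetical secret value and noise level

for m in (1, 10, 100, 10_000):
    disguised = t + rng.normal(0.0, sigma, size=m)   # independent zero-mean noise
    z = estimate_original(disguised)
    print(f"m={m:>6}  Z={z:8.3f}  Var(R)/m={sigma**2 / m:.4f}")

# Empirical check of Var(Z) = Var(R) / m for m = 25, using 2000 repeated trials.
trials = t + rng.normal(0.0, sigma, size=(2000, 25))
var_Z = trials.mean(axis=1).var()
print("empirical Var(Z):", var_Z, " predicted:", sigma**2 / 25)
```

As m grows, Z concentrates around t, which is why perturbing the same value repeatedly is unsafe.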

SLIDE 5

This looks familiar …

This is the data set (x, x, x, x, x, x, x, x).

Random perturbation: (x+r1, x+r2, …, x+rm)

We know this is NOT safe.

Observation: the data set is highly correlated.

SLIDE 6

Let’s Generalize!

Data set: (x1, x2, x3, …, xm)

If the correlation among the data attributes is high, can we use it to improve our estimation (from the disguised data)?

SLIDE 7

Introduction

A heuristic approach toward privacy analysis

Principal Component Analysis (PCA)

PCA-based data reconstruction

Experiment results

Conclusion and future work

SLIDE 8

Privacy Quantification: A Heuristic Approach

Our goal:

to find a best-effort algorithm that reconstructs the original data, based on the available information.

Definition

$$
PM_F = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} L\left(D^{*}_{i,j},\, D_{i,j}\right)
$$
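One possible reading of this definition is an average, over n records and m attributes, of a per-entry loss L between a reconstructed data set D* and the original D. The sketch below follows that reading. The function name `privacy_measure` is hypothetical, and squared difference is an assumed choice of L, since the slide does not define it.

```python
import numpy as np

def privacy_measure(D_star, D, loss=lambda a, b: (a - b) ** 2):
    """Average per-entry loss between reconstructed data D_star and original data D.

    The squared-difference default for `loss` is an assumption of this sketch.
    """
    D_star = np.asarray(D_star, dtype=float)
    D = np.asarray(D, dtype=float)
    if D_star.shape != D.shape:
        raise ValueError("D_star and D must have the same shape (n records x m attributes)")
    n, m = D.shape
    return float(loss(D_star, D).sum() / (n * m))

# Tiny hypothetical example: n = 2 records, m = 2 attributes.
D = np.array([[1.0, 2.0], [3.0, 4.0]])
D_star = np.array([[1.0, 2.5], [2.0, 4.0]])
print(privacy_measure(D_star, D))  # (0 + 0.25 + 1 + 0) / 4 = 0.3125
```

Under this reading, a smaller value means the reconstruction is closer to the original data, so less privacy is preserved.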

SLIDE 9

How to use the correlation?

High Correlation → Data Redundancy

Data Redundancy → Compression

Our goal: lossy compression.

We do want to lose information, but

we don’t want to lose too much data;

we do want to lose the added noise.

SLIDE 10

PCA Introduction

The main use of PCA: reduce the dimensionality while retaining as much information as possible.

1st PC: contains the greatest amount of variation.

2nd PC: contains the next largest amount of variation.

SLIDE 11

Original Data

SLIDE 12

After Dimension Reduction

SLIDE 13

For the Original Data

They are correlated. If we remove 50% of the dimensions, the actual information loss might be less than 10%.

SLIDE 14

For the Random Noises

They are not correlated. Their variance is evenly distributed in every direction. If we remove 50% of the dimensions, the actual noise loss should be about 50%.
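The contrast between correlated data and uncorrelated noise can be illustrated with synthetic data. The sketch below uses a hypothetical data set of 10 attributes driven by 2 latent factors. The helper `variance_kept` and all parameter values are assumptions made for illustration.

```python
import numpy as np

def variance_kept(data, p):
    """Fraction of total variance captured by the top-p principal components."""
    C = np.cov(data, rowvar=False)
    eigvals = np.linalg.eigvalsh(C)[::-1]      # eigenvalues in descending order
    return float(eigvals[:p].sum() / eigvals.sum())

rng = np.random.default_rng(2)
n, m = 5000, 10

# Hypothetical correlated data: 10 attributes driven by 2 latent factors plus small jitter.
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, m)) + 0.1 * rng.normal(size=(n, m))

# Uncorrelated noise with the same shape.
R = rng.normal(scale=1.0, size=(n, m))

p = m // 2   # keep half of the dimensions
kept_X = variance_kept(X, p)
kept_R = variance_kept(R, p)
print("data variance kept: ", kept_X)   # close to 1: little information is lost
print("noise variance kept:", kept_R)   # close to 0.5: about half the noise is lost
```

The noise figure lands slightly above 0.5 here because the sketch picks the noise's own strongest sample directions; along directions chosen independently of the noise, the expected share is p/m.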

SLIDE 15

Data Reconstruction

Applying PCA

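Find the principal components from the eigendecomposition of the covariance matrix: $C = Q \Lambda Q^T$

Set $\hat{Q}$ to be the first p columns of Q.

Reconstruct the data:

$$
\hat{X} = Y \hat{Q}\hat{Q}^T = (X + R)\,\hat{Q}\hat{Q}^T = X \hat{Q}\hat{Q}^T + R\, \hat{Q}\hat{Q}^T
$$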
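The reconstruction step can be sketched on synthetic data as follows. This is an illustration under assumptions, not the authors' implementation. The function name `pca_reconstruct`, the mean centering, the known noise level `sigma`, and the rank-2 test data are all choices made for this sketch.

```python
import numpy as np

def pca_reconstruct(Y, sigma, p):
    """Estimate X from Y = X + R by projecting onto the top-p principal components.

    sigma is the standard deviation of the i.i.d. noise, assumed to be known.
    """
    mean = Y.mean(axis=0)
    Yc = Y - mean                                  # centering is a choice of this sketch
    m = Y.shape[1]
    # Covariance of Y with the noise variance removed from the diagonal.
    C = np.cov(Yc, rowvar=False) - sigma**2 * np.eye(m)
    eigvals, Q = np.linalg.eigh(C)                 # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    Q_hat = Q[:, order[:p]]                        # eigenvectors of the p largest eigenvalues
    return Yc @ Q_hat @ Q_hat.T + mean

rng = np.random.default_rng(3)
n, m, p, sigma = 2000, 20, 2, 1.0

# Hypothetical highly correlated data: 20 attributes spanned by 2 latent factors.
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, m))
Y = X + rng.normal(scale=sigma, size=(n, m))

X_hat = pca_reconstruct(Y, sigma, p)
mse_raw = np.mean((Y - X) ** 2)       # error of using Y directly, about sigma^2
mse_pca = np.mean((X_hat - X) ** 2)   # error after PCA filtering, much smaller here
print("MSE of Y:    ", mse_raw)
print("MSE of X_hat:", mse_pca)
```

Because the synthetic X lies in a 2-dimensional subspace, projecting onto two components keeps almost all of X and discards most of the noise.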

SLIDE 16

Random Noise R

How does $R\,\hat{Q}\hat{Q}^T$ affect accuracy?

Theorem:

$$
Var(R\,\hat{Q}\hat{Q}^T) = Var(R) \cdot \frac{p}{m}
$$
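The p/m factor can be checked empirically. The sketch below assumes i.i.d. zero-mean noise and reads Var as the average variance per entry. The orthonormal matrix standing in for the first p principal directions is generated at random, which is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, p, sigma = 20000, 10, 3, 2.0

# Any m x p matrix with orthonormal columns can stand in for Q_hat here;
# the reduced QR factorization of a random matrix provides one.
Q_hat, _ = np.linalg.qr(rng.normal(size=(m, p)))

R = rng.normal(scale=sigma, size=(n, m))     # i.i.d. zero-mean noise
projected = R @ Q_hat @ Q_hat.T              # the noise that survives the projection

var_R = np.mean(R ** 2)                      # average per-entry variance of R
var_projected = np.mean(projected ** 2)      # average per-entry variance after projection
ratio = var_projected / var_R
print("observed ratio:", ratio, " p/m:", p / m)
```

Keeping fewer components (smaller p) removes more noise, at the cost of also discarding more of the original data.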

SLIDE 17

How to Conduct PCA on Disguised Data?

Estimating Covariance Matrix

$$
Cov(Y_i, Y_j) = Cov(X_i + R_i,\, X_j + R_j) =
\begin{cases}
Cov(X_i, X_j) + \sigma^2, & \text{for } i = j \\
Cov(X_i, X_j), & \text{for } i \neq j
\end{cases}
$$
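In code, this relation means the covariance of the original data can be estimated from the disguised data by subtracting the noise variance from the diagonal. The sketch below assumes the noise is independent of X, i.i.d. across attributes, and has a known standard deviation `sigma`. The helper name `estimate_cov_x` and the synthetic data are hypothetical.

```python
import numpy as np

def estimate_cov_x(Y, sigma):
    """Estimate Cov(X) from Y = X + R by removing sigma^2 from the diagonal of Cov(Y)."""
    m = Y.shape[1]
    return np.cov(Y, rowvar=False) - sigma**2 * np.eye(m)

rng = np.random.default_rng(5)
n, m, sigma = 50000, 4, 1.5

# Hypothetical correlated original data and its disguised version.
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
Y = X + rng.normal(scale=sigma, size=(n, m))

C_true = np.cov(X, rowvar=False)
C_naive = np.cov(Y, rowvar=False)
C_est = estimate_cov_x(Y, sigma)

naive_diag_err = np.abs(np.diag(C_naive - C_true))
max_err = np.abs(C_est - C_true).max()
print("diagonal error without correction:", naive_diag_err)   # about sigma^2 each
print("max error with correction:        ", max_err)          # small sampling error
```

The off-diagonal entries need no correction, since independent noise adds nothing to the covariance between different attributes.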

SLIDE 18

Experiment 1: Increasing the Number of Attributes

Normal Distribution Uniform Distribution

SLIDE 19

Experiment 2: Increasing the Number of Principal Components

Normal Distribution Uniform Distribution

SLIDE 20

Experiment 3: Increasing Standard Deviation of Noises

Normal Distribution Uniform Distribution

SLIDE 21

Conclusions

Privacy analysis based on individual attributes is not sufficient. Correlation can disclose information.

PCA can filter out some randomness from a highly correlated data set.

When does randomization fail?

Answer: when the data correlation is high.

Can it be cured?

SLIDE 22

Future Work

How to improve the randomization to reduce the information disclosure?

Making random noises correlated?

How to combine the PCA with the univariate data reconstruction?