
Data Anonymization that Leads to the Most Accurate Estimates of Statistical Characteristics - PowerPoint PPT Presentation



  1. Data Anonymization that Leads to the Most Accurate Estimates of Statistical Characteristics
     Gang Xiang (1) and Vladik Kreinovich (2)
     (1) Applied Biomathematics, 100 North Country Rd., Setauket, NY 11733, USA, gxiang@sigmaxi.net
     (2) Department of Computer Science, University of Texas at El Paso, El Paso, TX 79968, USA, vladik@utep.edu

  2. 1. Need to Preserve Privacy
     • One of the main objectives of engineering is to help people:
       – civil engineering designs houses in which we live and roads along which we travel,
       – electrical engineering designs appliances and the electric networks that make it possible to use these appliances.
     • To better serve customers, it is important to know as much as possible about the potential customers.
     • Customers are reluctant to share information, since this information can potentially be used against them.
     • For example, age can be used by companies to (unlawfully) discriminate against older job applicants.
     • It is thus important to preserve privacy when storing customer data.

  3. 2. How to Preserve Privacy: k-Anonymity and ℓ-Diversity
     • To maintain privacy, we divide the space of all possible combinations of values $(x_1, \ldots, x_n)$ into boxes.
     • For each record, instead of storing the actual values $x_i$, we only store the label of the box containing $x = (x_1, \ldots, x_n)$.
     • To avoid further loss of privacy, it is important to make sure that location in a box does not identify a person.
     • This is usually achieved by requiring that, for some fixed $k$, each box contains at least $k$ records.
     • It is also not good if all records within a box have the same value of the $i$-th quantity $x_i$.
     • It is thus required that, for some $\ell$, each box contains at least $\ell$ different values of each $x_i$.
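The two requirements above can be checked mechanically once every record has been assigned a box label. The following is a minimal sketch, not the authors' implementation; the function name `check_boxes`, the sample records, and the labels are all hypothetical.

```python
from collections import defaultdict

def check_boxes(records, box_labels, k, ell):
    """Return True if every box has >= k records and, for each quantity,
    >= ell distinct values (hypothetical helper for illustration)."""
    boxes = defaultdict(list)
    for record, label in zip(records, box_labels):
        boxes[label].append(record)
    for members in boxes.values():
        # k-anonymity: a box must contain at least k records.
        if len(members) < k:
            return False
        # ell-diversity: each quantity x_i must take at least ell values.
        for i in range(len(members[0])):
            if len({rec[i] for rec in members}) < ell:
                return False
    return True

# Made-up records (age, income) with made-up box labels.
records = [(25, 40.0), (27, 42.5), (31, 55.0), (33, 61.0)]
labels = ["A", "A", "B", "B"]
ok = check_boxes(records, labels, k=2, ell=2)
```

With these sample values, each box holds two records with two distinct ages and two distinct incomes, so the check passes for k = 2 and ℓ = 2 and fails for k = 3.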

  4. 3. Statistical Data Processing
     • Our main objective is to predict the desired characteristic $x_{i_0}$.
     • In most cases, the dependence is linear, so we must find coefficients $c_q$ such that
       $$x_{i_0} \approx c_0 + \sum_{q=1}^{m} c_q \cdot x_{i_q}.$$
     • The Least Squares approach leads to:
       $$\sum_{r=1}^{m} c_r \cdot C_{i_q i_r} = C_{i_0 i_q}; \qquad c_0 = E_{i_0} - \sum_{q=1}^{m} c_q \cdot E_{i_q}.$$
     • We also want to know which quantities are correlated, i.e., we want to estimate
       $$\rho_{ij} = \frac{C_{ij}}{\sigma_i \cdot \sigma_j}.$$
     • In all these tasks, we need to estimate averages $E_i$, variances $V_i = \sigma_i^2$, covariances $C_{ij}$, and correlations $\rho_{ij}$.
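To make the least squares step concrete, here is a small sketch that builds the covariance system for the coefficients, solves it, and then recovers the intercept from the means. The helper names (`mean`, `cov`, `solve`, `fit_linear`) and the data are assumptions for illustration; the Gaussian elimination is just one convenient way to solve the system.

```python
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Covariance with the 1/N normalization."""
    eu, ev = mean(u), mean(v)
    return sum((a - eu) * (b - ev) for a, b in zip(u, v)) / len(u)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_linear(target, predictors):
    """Solve sum_r c_r * C[q][r] = C[0][q], then c_0 = E_0 - sum_q c_q * E_q."""
    A = [[cov(xq, xr) for xr in predictors] for xq in predictors]
    b = [cov(target, xq) for xq in predictors]
    c = solve(A, b)
    c0 = mean(target) - sum(cq * mean(xq) for cq, xq in zip(c, predictors))
    return c0, c

# Made-up data generated from x0 = 1 + 2*x1 - 0.5*x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
x0 = [1 + 2 * a - 0.5 * b for a, b in zip(x1, x2)]
c0, c = fit_linear(x0, [x1, x2])
```

Because the sample target is an exact linear function of the predictors, the fit should recover the coefficients 1, 2, and -0.5 up to rounding.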

  5. 4. Statistical Characteristics: Reminder
     • The means are usually estimated as follows:
       $$E_i = \frac{1}{N} \cdot \sum_{p=1}^{N} x_i^{(p)}, \qquad E_j = \frac{1}{N} \cdot \sum_{p=1}^{N} x_j^{(p)}.$$
     • The covariance is usually estimated as:
       $$C_{ij} = \frac{1}{N} \cdot \sum_{p=1}^{N} \left(x_i^{(p)} - E_i\right) \cdot \left(x_j^{(p)} - E_j\right).$$
     • The variance is usually estimated as:
       $$V_i = \frac{1}{N} \cdot \sum_{p=1}^{N} \left(x_i^{(p)} - E_i\right)^2.$$
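These estimates translate directly into code. The sketch below uses the same 1/N normalization as the formulas above; the function names and the sample values are made up for illustration.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    ex, ey = mean(xs), mean(ys)
    return sum((x - ex) * (y - ey) for x, y in zip(xs, ys)) / len(xs)

def variance(xs):
    # The variance is the covariance of a quantity with itself.
    return covariance(xs, xs)

def correlation(xs, ys):
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))

# Hypothetical sample: incomes depend exactly linearly on ages.
ages = [20, 30, 40, 50]
incomes = [30, 50, 70, 90]
e_age = mean(ages)
v_age = variance(ages)
c_age_income = covariance(ages, incomes)
rho = correlation(ages, incomes)
```

For this sample the mean age is 35, the variance is 125, the covariance is 250, and the correlation is 1, since the second quantity is an increasing linear function of the first.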

  6. 5. In Statistical Data Processing, Privacy Leads to Uncertainty
     • To maintain privacy, we replace each numerical value $x_i^{(p)}$ with the corresponding interval.
     • Different values from these intervals lead to different values of the resulting statistical characteristics.
     • Hence, for each characteristic, we get a whole interval of possible values.
     • If this interval is too wide, the resulting range is useless: e.g., $[-1, 1]$ for correlation.
     • It is therefore desirable to select:
       – among all possible subdivisions into boxes which preserve k-anonymity (and ℓ-diversity),
       – the one which leads to the narrowest intervals for the desired statistical characteristic.
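A simple illustration of how intervals propagate: the mean is increasing in every value, so its exact range is obtained by averaging all lower endpoints and all upper endpoints. The sketch below covers only this easy case (other characteristics, such as variance or correlation, need more work); the function name and the interval data are hypothetical.

```python
def mean_range(intervals):
    """Exact range of the mean when each value lies in its interval [lo, hi]."""
    n = len(intervals)
    lo = sum(a for a, _ in intervals) / n
    hi = sum(b for _, b in intervals) / n
    return lo, hi

# Made-up example: ages stored only as ten-year boxes.
age_boxes = [(20, 30), (20, 30), (30, 40), (40, 50)]
lo, hi = mean_range(age_boxes)
```

Here the mean age is only known to lie between 27.5 and 37.5; halving the box widths would halve the width of this range.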

  7. 6. Estimating Accuracy Caused by Privacy-Based Subdivision into Boxes: Case of k-Anonymity
     • To minimize uncertainty, we select the smallest boxes.
     • Hence, each box $B$ should have exactly $k$ records.
     • For intervals $[\tilde x_i - \Delta_i, \tilde x_i + \Delta_i]$, instead of $C(x_1^{(1)}, \ldots, x_n^{(N)})$, we get:
       $$C\left(\tilde x_1^{(1)} + \Delta x_1^{(1)}, \ldots, \tilde x_n^{(N)} + \Delta x_n^{(N)}\right), \quad \text{where } |\Delta x_i^{(p)}| \le \Delta_i.$$
     • When we have many records, boxes are small, so we can use a linear approximation:
       $$C = \tilde C + \sum_{p=1}^{N} \sum_{i=1}^{n} \frac{\partial C}{\partial x_i} \cdot \Delta x_i^{(p)}.$$
     • The range of this linear expression is $[\tilde C - \Delta, \tilde C + \Delta]$, where
       $$\Delta \stackrel{\text{def}}{=} \sum_{p=1}^{N} \sum_{i=1}^{n} \left|\frac{\partial C}{\partial x_i}\right| \cdot \Delta_i^{(p)} = k \cdot \sum_{B} \sum_{i=1}^{n} \left|\frac{\partial C}{\partial x_i}\right| \cdot \Delta_i.$$
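The linearized half-width Δ is just a weighted sum of the box half-widths, with the absolute partial derivatives as weights. A minimal sketch for a single quantity (n = 1) follows; the function name, the midpoints, and the half-widths are assumptions chosen for illustration, and the derivative expressions used are the standard ones for the mean and the 1/N variance.

```python
def linearized_half_width(derivs, half_widths):
    """Delta = sum over records p of |dC/dx^(p)| * Delta^(p), for n = 1."""
    return sum(abs(d) * h for d, h in zip(derivs, half_widths))

# Made-up box midpoints and half-widths for four records.
midpoints = [25.0, 25.0, 35.0, 45.0]
half_widths = [5.0, 5.0, 5.0, 5.0]
N = len(midpoints)
E = sum(midpoints) / N

# Mean: the derivative with respect to each record is 1/N.
delta_mean = linearized_half_width([1.0 / N] * N, half_widths)

# Variance: the derivative with respect to record p is 2*(x^(p) - E)/N.
d_var = [2.0 * (x - E) / N for x in midpoints]
delta_var = linearized_half_width(d_var, half_widths)
```

For the mean, the linear expression is exact, so `delta_mean` equals the average half-width (5 here). For the variance, `delta_var` is only a first-order estimate of the true half-width.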

  8. 7. Expressions for the Corresponding Partial Derivatives
     • The estimate for the accuracy $\Delta$ is described in terms of the partial derivatives $\dfrac{\partial C}{\partial x_i}$ of the statistical characteristic $C$.
     • For the mean $E_i$, the derivative is equal to $\dfrac{\partial E_i}{\partial x_i} = \dfrac{1}{N}$.
     • For the variance $V_i$, we have $\dfrac{\partial V_i}{\partial x_i} = \dfrac{2 \cdot (x_i - E_i)}{N}$.
     • Therefore, for $\sigma_i = \sqrt{V_i}$, the chain rule gives $\dfrac{\partial \sigma_i}{\partial x_i} = \dfrac{1}{2 \sigma_i} \cdot \dfrac{\partial V_i}{\partial x_i} = \dfrac{x_i - E_i}{N \cdot \sigma_i}$.
     • For the covariance $C_{ij}$, we have $\dfrac{\partial C_{ij}}{\partial x_i} = \dfrac{x_j - E_j}{N}$.
     • For the correlation $\rho_{ij}$, we have:
       $$\frac{\partial \rho_{ij}}{\partial x_i} = \frac{1}{N} \cdot \frac{(x_j - E_j) - \dfrac{C_{ij}}{\sigma_i^2} \cdot (x_i - E_i)}{\sigma_i \cdot \sigma_j}.$$
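One way to sanity-check a derivative formula such as the one for the correlation is to compare it with a central finite difference. The sketch below does this for made-up data; all function names are hypothetical, and the step size `h` is an arbitrary choice.

```python
import math

def stats(xi, xj):
    """Return E_i, E_j, C_ij, sigma_i, sigma_j with 1/N normalization."""
    n = len(xi)
    ei, ej = sum(xi) / n, sum(xj) / n
    cij = sum((a - ei) * (b - ej) for a, b in zip(xi, xj)) / n
    si = math.sqrt(sum((a - ei) ** 2 for a in xi) / n)
    sj = math.sqrt(sum((b - ej) ** 2 for b in xj) / n)
    return ei, ej, cij, si, sj

def correlation(xi, xj):
    _, _, cij, si, sj = stats(xi, xj)
    return cij / (si * sj)

def d_correlation(xi, xj, p):
    """Analytic derivative of rho_ij with respect to the p-th value of x_i."""
    n = len(xi)
    ei, ej, cij, si, sj = stats(xi, xj)
    return ((xj[p] - ej) - cij / si ** 2 * (xi[p] - ei)) / (n * si * sj)

def finite_difference(xi, xj, p, h=1e-6):
    """Central finite-difference approximation of the same derivative."""
    up, dn = xi[:], xi[:]
    up[p] += h
    dn[p] -= h
    return (correlation(up, xj) - correlation(dn, xj)) / (2 * h)

# Made-up sample data.
xi = [1.0, 2.0, 4.0, 7.0]
xj = [2.0, 1.0, 5.0, 6.0]
errors = [abs(d_correlation(xi, xj, p) - finite_difference(xi, xj, p))
          for p in range(len(xi))]
```

If the analytic expression is right, every entry of `errors` should be tiny compared with the derivative values themselves.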

  9. 8. Towards Optimal Subdivision into Boxes
     • The overall expression for $\Delta$ is a sum of terms corresponding to different points.
     • So, to minimize $\Delta$, we must, for each point, minimize the corresponding term $\sum_{i=1}^{n} a_i \cdot \Delta_i$, where $a_i \stackrel{\text{def}}{=} \left|\dfrac{\partial C}{\partial x_i}\right|$.
     • The only constraint on the values $\Delta_i$ is that the corresponding box should contain exactly $k$ different points.
     • The number of points can be obtained by multiplying the data density $\rho(x)$ by the box volume $\prod_{i=1}^{n} (2\Delta_i)$.
     • The data density can be estimated based on the data.
     • So, we minimize $\sum_{i=1}^{n} a_i \cdot \Delta_i$ under the constraint
       $$\rho(x) \cdot 2^n \cdot \prod_{i=1}^{n} \Delta_i = k.$$

  10. 9. First Result: (Asymptotically) Optimal Subdivision into Boxes (Case of k-Anonymity)
     • Method: the Lagrange multiplier technique leads to $\Delta_i = \dfrac{c(x)}{a_i}$, where $a_i = \left|\dfrac{\partial C}{\partial x_i}\right|$.
     • From the constraint, we get
       $$c(x) = \frac{1}{2} \cdot \sqrt[n]{\frac{k}{\rho(x)} \cdot \prod_{j=1}^{n} a_j}.$$
     • Conclusion: around each point $x$, we need to select the box with half-widths
       $$\Delta_i = \frac{1}{2} \cdot \sqrt[n]{\frac{k}{\rho(x)}} \cdot \frac{\sqrt[n]{\prod_{j=1}^{n} a_j}}{a_i}.$$
     • The resulting accuracy: $\Delta = n \cdot \sum_{x} c(x)$, where the sum is taken over all $N$ data points $x$.
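The closed-form half-widths are easy to evaluate once the sensitivities $a_i$ and the local density $\rho(x)$ are known. The following sketch computes them and compares the resulting cost with a cube of the same volume; the function name and all numeric values are made up, and in practice the density would have to be estimated from the data.

```python
import math

def optimal_half_widths(a, density, k):
    """Return (half_widths, c) with Delta_i = c / a_i and
    c = 0.5 * ((k / density) * prod(a)) ** (1/n)."""
    n = len(a)
    c = 0.5 * (k / density * math.prod(a)) ** (1.0 / n)
    return [c / ai for ai in a], c

# Hypothetical inputs: two quantities, local density 10, k = 5.
a = [0.5, 2.0]
density = 10.0
k = 5
half_widths, c = optimal_half_widths(a, density, k)

# The box should contain k points: density * volume = k.
points_in_box = density * math.prod(2 * d for d in half_widths)

# Cost of the optimal box versus a cube with the same volume.
cost_optimal = sum(ai * di for ai, di in zip(a, half_widths))
side = math.prod(half_widths) ** (1.0 / len(a))
cost_cube = sum(ai * side for ai in a)
```

The optimal box is narrower along the quantity to which the characteristic is more sensitive (larger $a_i$), which is why `cost_optimal` comes out below `cost_cube` whenever the $a_i$ differ.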
