  1. CS378 Introduction to Data Mining Privacy Preserving Data Mining Li Xiong Department of Mathematics and Computer Science Department of Biomedical Informatics Emory University

  2. Netflix Sequel • 2006, Netflix announced the challenge • 2007, researchers from University of Texas identified individuals by matching Netflix datasets with IMDB • July 2009, $1M grand prize awarded • August 2009, Netflix announced the second challenge • December 2009, four Netflix users filed a class action lawsuit against Netflix • March 2010, Netflix canceled the second challenge

  6. Facebook-Cambridge Analytica • April 2010, Facebook launches Open Graph • 2013, 300,000 users took the psychographic personality test app "thisisyourdigitallife" • 2016, Trump's campaign invested heavily in Facebook ads • March 2018, reports revealed that 50 million (later revised to 87 million) Facebook profiles were harvested for Cambridge Analytica and used for Trump's campaign • April 11, 2018, Zuckerberg testified before Congress

  8. • How many people know we are here? (a) no one (b) 1-10 i.e. family and friends (c) 10-100 i.e. colleagues and more (social network) friends

  9. • 73% / 33% of Android apps shared personal info (i.e. email) / GPS coordinates with third parties • 45% / 47% of iOS apps shared email / GPS coordinates with third parties Location data sharing by iOS apps (left) to domains (right) Who Knows What About Me? A Survey of Behind the Scenes Personal Data Sharing to Third Parties by Mobile Apps, 2015-10-30 https://techscience.org/a/2015103001/

  10. The EHR Data Map

  11. Shopping records

  12. Big Data Goes Personal • Movie ratings • Social network/media data • Mobile GPS data • Electronic medical records • Shopping history • Online browsing history

  13. Data Mining

  14. Data Mining … the dark side

  15. Privacy Preserving Data Mining • Diagram: Private Data → privacy preserving data mining → Sanitized Data / Models • Privacy goal: personal data is not revealed and cannot be inferred • Utility goal: sanitized data/models as close to the private data as possible

  16. Privacy preserving data mining • Differential privacy • Definition • Building blocks (primitive mechanisms) • Composition rules • Data mining algorithms with differential privacy • k-means clustering w/ differential privacy • Frequent pattern mining w/ differential privacy

  17. Differential Privacy

  18. Traditional De-identification and Anonymization • Attribute suppression, perturbation, generalization • Inference possible with external data • Diagram: Original Data → de-identification / anonymization → Sanitized View

  19. Massachusetts GIC Incident (1990s) • Massachusetts Group Insurance Commission (GIC) encounter data ("de-identified"), mid 1990s • External information: voter roll from the city of Cambridge • Governor's health records identified • 87% of Americans can be uniquely identified using zip, birthdate, and sex (2000)
      Name     SSN        Birth date  Zip    Diagnosis
      Alice    123456789  44          48202  AIDS
      Bob      323232323  44          48202  AIDS
      Charley  232345656  44          48201  Asthma
      Dave     333333333  55          48310  Asthma
      Eva      666666666  55          48310  Diabetes

  20. AOL Query Log Release (2006) • 20 million Web search queries by AOL
      AnonID  Query                 QueryTime            ItemRank  ClickURL
      217     lottery               2006-03-01 11:58:51  1         http://www.calottery.com
      217     lottery               2006-03-27 14:10:38  1         http://www.calottery.com
      1268    gall stones           2006-05-11 02:12:51
      1268    gallstones            2006-05-11 02:13:02  1         http://www.niddk.nih.gov
      1268    ozark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com
      • User 4417749: "numb fingers", "60 single men", "dog that urinates on everything", "landscapers in Lilburn, Ga", several people's names with the last name Arnold, "homes sold in shadow lake subdivision gwinnett county georgia"

  21. The Genome Hacker (2013)

  22. Differential Privacy • Statistical outcome (view) is indistinguishable regardless of whether a particular user is included in the data

  24. Differential Privacy • View is indistinguishable regardless of the input • Diagram: Private Data D (or neighboring Private Data D') → privacy preserving data mining / sharing mechanism → Models / Data
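For reference, the standard formal statement behind this picture (not spelled out on these slides): a randomized mechanism M satisfies ε-differential privacy if, for every pair of neighboring databases D and D' (differing in one individual's record) and every set S of possible outputs,

      Pr[ M(D) ∈ S ] ≤ e^ε · Pr[ M(D') ∈ S ]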

  25. Differential privacy: an example • Figure: original records, the original histogram, and the perturbed histogram produced with differential privacy

  26. Laplace Mechanism • Diagram: query q is sent to the private data; the true answer q(D) is released as q(D) + η, where the noise η is drawn from the Laplace distribution Lap(S/ε) • [Figure: Laplace distribution PDF]

  27. Laplace Distribution • PDF: f(x | u, b) = exp(−|x − u| / b) / (2b) • Denoted as Lap(b) when u = 0 • Mean u • Variance 2b^2
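A quick numerical sanity check of these properties (a sketch, assuming numpy; the seed and scale are arbitrary):

    import numpy as np

    # Laplace PDF with location u and scale b: f(x) = exp(-|x - u| / b) / (2 * b)
    def laplace_pdf(x, u=0.0, b=1.0):
        return np.exp(-np.abs(x - u) / b) / (2.0 * b)

    # Empirically confirm mean u and variance 2*b^2 for Lap(b) with u = 0
    b = 2.0
    samples = np.random.default_rng(0).laplace(loc=0.0, scale=b, size=1_000_000)
    print(samples.mean())           # close to 0
    print(samples.var(), 2 * b**2)  # close to 2*b^2 = 8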

  28. How much noise for privacy? [Dwork et al., TCC 2006] • Sensitivity: for a query q: I → R, S(q) is the smallest number such that for any neighboring tables D, D', |q(D) − q(D')| ≤ S(q) • Theorem: if the sensitivity of the query is S(q), then the algorithm A(D) = q(D) + Lap(S(q)/ε) guarantees ε-differential privacy
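A minimal sketch of the Laplace mechanism from the theorem, assuming numpy; the query answer and its sensitivity are supplied by the caller (the function name laplace_mechanism is ours, not from the slides):

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
        """Return q(D) + eta, with eta ~ Lap(sensitivity / epsilon)."""
        rng = np.random.default_rng() if rng is None else rng
        return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

    # Example: a numeric query whose sensitivity the analyst has bounded by 1
    noisy = laplace_mechanism(true_answer=42.0, sensitivity=1.0, epsilon=0.1)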

  29. Example: COUNT query • Number of people having HIV+ • Sensitivity = ?

  30. Example: COUNT query • Number of people having HIV+ • Sensitivity = 1 • ε-differentially private count: 3 + η, where η is drawn from Lap(1/ε) • Noise mean = 0 • Noise variance = 2/ε^2
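Continuing the example as a sketch, assuming numpy and the true count of 3 from the slide:

    import numpy as np

    # epsilon-differentially private COUNT: sensitivity 1, noise ~ Lap(1/epsilon)
    rng = np.random.default_rng()
    epsilon = 0.1
    true_count = 3                     # number of HIV+ records in the table
    noisy_count = true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)
    # The added noise has mean 0 and variance 2 / epsilon**2, so a smaller
    # epsilon (stronger privacy) gives a noisier count.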

  31. Example: Sum (Average) query • Sum of Age (suppose Age is in [a,b]) • Sensitivity = ?

  32. Example: Sum (Average) query • Sum of Age (suppose Age is in [a,b]) • Sensitivity = b
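A sketch of the noisy SUM (and AVERAGE) query, assuming numpy, that ages are clipped to a known range [a, b] with a ≥ 0 so the stated sensitivity b applies, and that the average is formed from a separately noised count (sensitivity 1); the two noisy queries together cost eps_sum + eps_count under sequential composition, introduced on the next slides:

    import numpy as np

    def dp_sum_and_avg(ages, a, b, eps_sum, eps_count, rng=None):
        """Noisy sum of ages (sensitivity b, assuming ages clipped to [a, b] with a >= 0)
        and a noisy average built from a separately noised count (sensitivity 1)."""
        rng = np.random.default_rng() if rng is None else rng
        ages = np.clip(np.asarray(ages, dtype=float), a, b)
        noisy_sum = ages.sum() + rng.laplace(0.0, b / eps_sum)
        noisy_count = len(ages) + rng.laplace(0.0, 1.0 / eps_count)
        # Dividing the two noisy answers is postprocessing and costs no extra budget.
        return noisy_sum, noisy_sum / max(noisy_count, 1.0)

    print(dp_sum_and_avg([23, 35, 47, 52, 31], a=0, b=100, eps_sum=0.5, eps_count=0.5))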

  33. Composition theorems • Sequential composition: ∑i εi - differential privacy • Parallel composition: max(εi) - differential privacy

  34. Sequential Composition • If M1, M2, ..., Mk are algorithms that access a private database D such that each Mi satisfies εi-differential privacy, then the combination of their outputs satisfies ε-differential privacy with ε = ε1 + ... + εk

  35. Parallel Composition • If M1, M2, ..., Mk are algorithms that access disjoint databases D1, D2, ..., Dk such that each Mi satisfies εi-differential privacy, then the combination of their outputs satisfies ε-differential privacy with ε = max{ε1, ..., εk}
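A small sketch contrasting the two rules, assuming numpy; the database, queries, and budget values are illustrative only:

    import numpy as np

    rng = np.random.default_rng()

    def noisy_count(records, epsilon):
        # COUNT query with sensitivity 1 under the Laplace mechanism
        return len(records) + rng.laplace(0.0, 1.0 / epsilon)

    db = list(range(1000))  # stand-in for a private table

    # Sequential composition: both queries are computed from the SAME data,
    # so the total privacy cost is 0.3 + 0.2 = 0.5.
    q1 = noisy_count(db, epsilon=0.3)
    q2 = noisy_count([r for r in db if r % 2 == 0], epsilon=0.2)  # still derived from db

    # Parallel composition: the queries run on DISJOINT partitions of the data,
    # so the total privacy cost is max(0.3, 0.2) = 0.3.
    part_a, part_b = db[:500], db[500:]
    p1 = noisy_count(part_a, epsilon=0.3)
    p2 = noisy_count(part_b, epsilon=0.2)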

  36. Postprocessing • If M1 is an ε-differentially private algorithm that accesses a private database D, then outputting M2(M1(D)) also satisfies ε-differential privacy (where M2 operates only on M1's output, not on D). [Source: tutorial "Differential Privacy in the Wild"]
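For example, cleaning up noisy counts after release (clipping negatives, rounding) is pure postprocessing of M1's output and costs no extra budget; a sketch, assuming numpy:

    import numpy as np

    def postprocess(noisy_counts):
        # M2: any function of the DP output alone; here, clip negatives and round.
        # Because it never touches the private data again, the overall release
        # still satisfies the same epsilon-differential privacy as M1.
        return np.round(np.clip(noisy_counts, 0, None)).astype(int)

    print(postprocess(np.array([4.7, -1.2, 9.9, 0.3])))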

  37. Differential privacy: an example • Figure: original records, the original histogram, and the perturbed histogram produced with differential privacy
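A minimal sketch of this example, assuming numpy and illustrative bin edges: each record falls in exactly one bin, so each count has sensitivity 1, and since the bins are disjoint, parallel composition makes the whole noisy histogram ε-differentially private.

    import numpy as np

    def dp_histogram(values, bin_edges, epsilon, rng=None):
        """Add independent Lap(1/epsilon) noise to each bin count."""
        rng = np.random.default_rng() if rng is None else rng
        counts, _ = np.histogram(values, bins=bin_edges)
        return counts + rng.laplace(0.0, 1.0 / epsilon, size=counts.shape)

    # Hypothetical original records: ages
    ages = [23, 35, 47, 52, 31, 28, 64, 45, 39, 58]
    print(dp_histogram(ages, bin_edges=[20, 30, 40, 50, 60, 70], epsilon=0.5))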

  38. Privacy preserving data mining • Differential privacy • Definition • Building blocks (primitive mechanisms) • Composition rules • Data mining algorithms with differential privacy • k-means clustering w/ differential privacy • Frequent itemsets mining w/ differential privacy

  39. Privacy Preserving Data Mining as Constrained Optimization • Two goals • Privacy • Error (utility) • Given a task and privacy budget ε, how to design a set of queries (functions) and allocate the budget such that the error is minimized?

  40. Data mining algorithms with differential privacy • General algorithmic framework • Decompose a data mining algorithm into a set of functions • Allocate a privacy budget εi to each function • Implement each function with εi-differential privacy • Compute the noisy output using the Laplace mechanism based on the sensitivity of the function and εi • Compose them using the composition theorems (see the budget-accounting sketch below) • Optimization techniques • Decomposition design • Budget allocation • Sensitivity reduction for each function
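One way to read the framework as code is a small budget accountant that splits a total ε across the functions an algorithm is decomposed into and tracks spending under sequential composition (an illustrative sketch; the class and names are ours, not from the slides):

    class BudgetAccountant:
        """Illustrative helper: allocate a total epsilon across sub-functions
        and track spending under sequential composition."""

        def __init__(self, total_epsilon):
            self.total = total_epsilon
            self.spent = 0.0

        def allocate(self, fraction):
            eps_i = fraction * self.total
            if self.spent + eps_i > self.total + 1e-12:
                raise ValueError("privacy budget exceeded")
            self.spent += eps_i
            return eps_i

    # e.g. give half the budget to noisy cluster counts, half to noisy sums
    acct = BudgetAccountant(total_epsilon=1.0)
    eps_counts = acct.allocate(0.5)
    eps_sums = acct.allocate(0.5)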

  41. Review: K-means Clustering

  42. K-means Problem • Partition a set of points x1, x2, ..., xn into k clusters S1, S2, ..., Sk such that the SSE is minimized: SSE = ∑i ∑_{x ∈ Si} ||x − μi||^2, where μi is the mean of the cluster Si

  43. K-means Algorithm • Initialize a set of k centers • Repeat until convergence 1. Assign each point to its nearest center 2. Update the set of centers • Output final set of k centers and the points in each cluster
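To connect the review back to the framework on slide 40, here is a sketch of one common way to make Lloyd's algorithm differentially private (in the spirit of DPLloyd, not necessarily the exact variant covered later): split the budget evenly across a fixed number of iterations, and in each iteration release noisy per-cluster counts (sensitivity 1) and noisy per-cluster coordinate sums (L1 sensitivity d·r after clipping coordinates to [-r, r]); the published centers are postprocessing of those noisy statistics. Names and the budget split are illustrative, assuming numpy.

    import numpy as np

    def dp_kmeans(points, k, epsilon, iters=5, r=1.0, rng=None):
        """Sketch of differentially private k-means (Lloyd-style).

        Assumes every coordinate of every point is clipped to [-r, r].
        Each iteration spends eps_it = epsilon / iters, split between noisy
        cluster counts (sensitivity 1) and noisy per-cluster coordinate sums
        (L1 sensitivity d * r). Iterations compose sequentially.
        """
        rng = np.random.default_rng() if rng is None else rng
        X = np.clip(np.asarray(points, dtype=float), -r, r)
        n, d = X.shape
        centers = rng.uniform(-r, r, size=(k, d))      # random initial centers

        eps_it = epsilon / iters
        eps_count, eps_sum = eps_it / 2, eps_it / 2

        for _ in range(iters):
            # 1. Assign each point to its nearest center (internal step; only the
            #    noisy counts and sums below are released).
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)

            # 2. Noisy counts and noisy sums per cluster, then update centers.
            for j in range(k):
                members = X[labels == j]
                noisy_n = len(members) + rng.laplace(0.0, 1.0 / eps_count)
                noisy_s = members.sum(axis=0) + rng.laplace(0.0, d * r / eps_sum, size=d)
                if noisy_n > 1.0:
                    centers[j] = np.clip(noisy_s / noisy_n, -r, r)
        return centers

    # Illustrative use on synthetic 2-D points already scaled to [-1, 1]
    demo = np.random.default_rng(0).uniform(-1, 1, size=(200, 2))
    print(dp_kmeans(demo, k=3, epsilon=1.0))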
