
De-anonymizing Data (CompSci 590.03, Instructor: Ashwin Machanavajjhala)



  1. De-anonymizing Data • CompSci 590.03, Instructor: Ashwin Machanavajjhala • Lecture 2 • Comic source: http://xkcd.org/834/

  2. Announcements • Project ideas will be posted on the site by Friday. – You are welcome to send me (or talk to me about) your own ideas.

  3. Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks – Passive Attacks – Active Attacks

  4. Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks – Passive Attacks – Active Attacks

  5. Personal Big-Data • Individuals (Person 1 … Person N) contribute records r1 … rN to databases such as Google, the Census, and hospital DBs • These databases in turn serve information retrieval and recommendation algorithms, medical researchers, doctors, and economists

  6. The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002] • Medical Data: Name, SSN, Zip, Visit Date, Birth date, Diagnosis, Procedure, Medication, Sex, Total Charge • Voter List: Name, Address, Date Registered, Party affiliation, Date last voted • The Governor of MA was uniquely identified using ZipCode, Birth Date, and Sex, linking his Name to his Diagnosis

  7. The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002] (contd.) • {Zip, Birth date, Sex}, shared by the Medical Data and the Voter List, forms a Quasi Identifier • 87% of the US population is uniquely identified using ZipCode, Birth Date, and Sex

  8. Statistical Privacy (Trusted Collector) Problem • Individuals 1 … N contribute records r1 … rN to a trusted server holding a database DB • Utility: … • Privacy: no breach about any individual

  9. Statistical Privacy (Untrusted Collector) Problem • Individuals 1 … N send their records r1 … rN through a function f(·) before they reach the server's database DB

  10. Randomized Response • Flip a coin – heads with probability p, and – tails with probability 1-p (p > ½) • Answer the question according to the coin: Heads – report the true answer (Yes/No); Tails – report the opposite answer
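A minimal sketch of this coin-flip mechanism in Python (the value p = 0.75, the record count, and the estimator function are illustrative choices, not part of the slide); the estimator shows why aggregate statistics survive the perturbation even though no individual answer is trustworthy:

```python
import random

def randomized_response(true_answer: bool, p: float = 0.75) -> bool:
    """Heads (prob. p): report the true answer; tails: report the opposite."""
    assert p > 0.5
    heads = random.random() < p
    return true_answer if heads else not true_answer

def estimate_true_fraction(responses, p: float = 0.75) -> float:
    """Unbias the observed 'yes' fraction.

    E[observed] = p*f + (1-p)*(1-f)  =>  f = (observed - (1-p)) / (2p - 1)
    """
    observed = sum(responses) / len(responses)
    return (observed - (1 - p)) / (2 * p - 1)

if __name__ == "__main__":
    true_answers = [random.random() < 0.3 for _ in range(100_000)]  # 30% true "yes"
    noisy = [randomized_response(a) for a in true_answers]
    print(estimate_true_fraction(noisy))  # close to 0.3
```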

  11. Statistical Privacy (Trusted Collector) Problem • (Same setting as before: individuals 1 … N send records r1 … rN to a trusted server holding DB)

  12. Query Answering • Analysts pose queries against the hospital's database DB, e.g. “How many allergy patients?” or “Correlate Genome to disease”

  13. Query Answering • Need to know the list of questions up front • Each answer leaks some information about individuals; after answering a few questions, the server will run out of privacy budget and not be able to answer any more • We will see this in detail later in the course

  14. Anonymous/Sanitized Data Publishing • The hospital holds a database DB • Analyst: “I won't tell you what questions I am interested in!” (image: writingcenterunderground.wordpress.com)

  15. Anonymous/Sanitized Data Publishing • The hospital publishes a sanitized version DB' of its database DB • Any number of questions can then be answered directly on DB' without any modifications

  16. Today's class • Identifying individual records and their sensitive values from data publishing (with insufficient sanitization)

  17. Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks – Passive Attacks – Active Attacks

  18. Terms • Coin tosses of an algorithm • Union Bound • Heavy Tailed Distribution

  19. Terms (contd.) • Heavy Tailed Distribution – the Normal distribution is not heavy tailed

  20. Terms (contd.) • Heavy Tailed Distribution – the Laplace distribution is heavy tailed

  21. Terms (contd.) • Heavy Tailed Distribution – the Zipf distribution is heavy tailed

  22. Terms (contd.) • Cosine Similarity – the cosine of the angle θ between two vectors • Collaborative filtering – the problem of recommending new items to a user based on their ratings of previously seen items
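A small Python sketch of cosine similarity on sparse rating vectors (the dict-of-ratings representation and the example values are illustrative, not taken from the slide):

```python
import math

def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine of the angle between two sparse rating vectors,
    represented as {movie_id: rating} dicts (missing entries count as 0)."""
    dot = sum(r * v[m] for m, r in u.items() if m in v)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Two users who overlap on a single movie they rated similarly:
print(cosine_similarity({1: 5, 2: 3}, {1: 4, 3: 2}))
```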

  23. Netflix Dataset • A Users × Movies matrix: each record (row) r is a user, each column/attribute is a movie, and each non-null cell contains a Rating + TimeStamp

  24. Definitions • Support – set (or number) of non-null attributes in a record or column • Similarity • Sparsity
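The Similarity and Sparsity formulas appear only as images on the slide; the sketch below is one plausible instantiation in Python (exact-match per-attribute similarity and a nearest-neighbor-style sparsity check are placeholder choices, not the paper's exact definitions):

```python
def support(record: dict) -> set:
    """Support of a record: the set of non-null attributes (movies rated)."""
    return {a for a, v in record.items() if v is not None}

def similarity(r1: dict, r2: dict) -> float:
    """Placeholder record similarity: fraction of attributes in either
    support on which the two records agree exactly."""
    attrs = support(r1) | support(r2)
    if not attrs:
        return 0.0
    agree = sum(1 for a in attrs if a in r1 and a in r2 and r1[a] == r2[a])
    return agree / len(attrs)

def sparsity(dataset: list, threshold: float) -> float:
    """Fraction of records whose most similar *other* record stays below
    `threshold`; a dataset is 'sparse' when this fraction is close to 1."""
    def isolated(i):
        return all(similarity(dataset[i], r) < threshold
                   for j, r in enumerate(dataset) if j != i)
    return sum(isolated(i) for i in range(len(dataset))) / len(dataset)
```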

  25. Adversary Model • Aux(r) – the adversary's auxiliary information: some subset of attributes from r

  26. Privacy Breach • Definition 1: an algorithm A outputs a record r' such that … • Definition 2: (when only a sample of the dataset is input) …

  27. Algorithm Scoreboard • For each record r', compute Score(r', aux) as the minimum similarity of an attribute in aux to the same attribute in r' • Pick the r' with the maximum score, OR • Return all records with Score > α
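A compact sketch of Scoreboard under the same placeholder representation as above (exact-match attribute similarity and the dict-based records are illustrative assumptions; the paper scores attributes with a general similarity measure):

```python
def scoreboard(aux: dict, dataset: list, alpha=None):
    """Match the adversary's auxiliary attributes against every record.

    Score(r', aux) = min over attributes in aux of the similarity between
    aux's value and r's value for that attribute.
    """
    def attr_sim(a, b):
        # Placeholder attribute similarity: exact match only.
        return 1.0 if (b is not None and a == b) else 0.0

    def score(record):
        return min(attr_sim(v, record.get(m)) for m, v in aux.items())

    if alpha is None:
        # Variant 1: return the single best-scoring record.
        return max(dataset, key=score)
    # Variant 2: return every record scoring above the threshold alpha.
    return [r for r in dataset if score(r) > alpha]
```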

  28. Analysis • Theorem 1: Suppose we use Scoreboard with α = 1 – ε. If Aux contains m randomly chosen attributes with m > log(N/ε) / log(1/(1 – δ)), then Scoreboard returns a record r' such that Pr[ Sim(Aux, r') > 1 – ε – δ ] > 1 – ε

  29. Proof of Theorem 1 • Call r' a false match if Sim(Aux, r') < 1 – ε – δ • For a false match, each attribute satisfies Pr[ Sim(Aux_i, r_i') > 1 – ε ] < 1 – δ • Sim(Aux, r') = min_i Sim(Aux_i, r_i') • Therefore Pr[ Sim(Aux, r') > 1 – ε ] < (1 – δ)^m • By a union bound over the N records, Pr[ some false match has similarity > 1 – ε ] < N(1 – δ)^m • N(1 – δ)^m < ε when m > log(N/ε) / log(1/(1 – δ))
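Written out as one chain of inequalities (same symbols as the slide; this only makes the union-bound step explicit):

```latex
\[
\Pr\big[\exists\ \text{false match } r' : \mathrm{Sim}(\mathrm{Aux}, r') > 1-\epsilon\big]
  \;\le\; N\,(1-\delta)^m
  \;<\; \epsilon
  \quad\text{whenever}\quad
  m \;>\; \frac{\log(N/\epsilon)}{\log\!\big(1/(1-\delta)\big)},
\]
so with probability at least \(1-\epsilon\) every record scoring above \(\alpha = 1-\epsilon\)
is a true match, i.e. has \(\mathrm{Sim}(\mathrm{Aux}, r') \ge 1-\epsilon-\delta\).
```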

  30. Other results • If dataset D is (1 – ε – δ, ε)-sparse, then D can be (1, 1 – ε)-deanonymized • Analogous results hold when a list of candidate records is returned

  31. Netflix Dataset • Slightly different algorithm

  32. Summary of Netflix Paper • An adversary can use a subset of ratings made by a user to uniquely identify that user's record in the “anonymized” dataset with high probability • The simple Scoreboard algorithm provably guarantees identification of records • A variant of Scoreboard can de-anonymize the Netflix dataset • The algorithms are robust to noise in the adversary's background knowledge

  33. Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks – Passive Attacks – Active Attacks

  34. Social Network Data • Social networks: graphs where each node represents a social entity and each edge represents a relationship between two entities • Examples: email communication graphs, social interactions as in Facebook, Yahoo! Messenger, etc.

  35. Anonymizing Social Networks • Example graph with nodes Alice, Bob, Cathy, Diane, Ed, Fred, Grace • Naïve anonymization removes the label of each node and publishes only the structure of the network • Information leaks: nodes may still be re-identified based on network structure
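A minimal sketch of naïve anonymization on a toy edge list (the edges below are made up for illustration and are not the graph drawn on the slide):

```python
import random

def naive_anonymize(edges):
    """Replace every node label with a random integer id and publish only
    the relabelled edge list; the graph structure is left untouched."""
    nodes = sorted({u for e in edges for u in e})
    ids = dict(zip(nodes, random.sample(range(len(nodes)), len(nodes))))
    return [(ids[u], ids[v]) for u, v in edges]

# Hypothetical toy graph (not the slide's figure).
toy_graph = [("Alice", "Bob"), ("Alice", "Cathy"), ("Bob", "Cathy"),
             ("Cathy", "Diane"), ("Diane", "Ed"), ("Ed", "Fred"),
             ("Fred", "Grace")]
print(naive_anonymize(toy_graph))
```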

  36. Passive Attacks on an Anonymized Network • Consider the email communication graph above (Alice, Bob, Cathy, Diane, Ed, Fred, Grace) – Each node represents an individual – An edge between two individuals indicates that they have exchanged emails

  37. Passive Attacks on an Anonymized Network • Alice has sent emails to three individuals only

  38. Passive Attacks on an Anonymized Network • Alice has sent emails to three individuals only • Only one node in the anonymized network has degree three • Hence, Alice can re-identify herself

  39. Passive Attacks on an Anonymized Network • Cathy has sent emails to five individuals

  40. Passive Attacks on an Anonymized Network • Cathy has sent emails to five individuals • Only one node has degree five • Hence, Cathy can re-identify herself – see the degree-lookup sketch below
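A sketch of the degree-based re-identification Alice and Cathy perform (the helper name is illustrative; any published edge list, e.g. the output of naive_anonymize above, can be passed in):

```python
from collections import Counter

def unique_degree_nodes(edges):
    """Map each degree value that occurs exactly once to the lone node
    having it; a participant who knows how many people they emailed can
    re-identify themselves when that degree is unique in the graph."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    degree_counts = Counter(degree.values())
    return {d: node for node, d in degree.items() if degree_counts[d] == 1}

# e.g. Alice, knowing she emailed exactly three people, looks up degree 3:
# unique_degree_nodes(published_edges).get(3)
```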

  41. Passive Attacks on an Anonymized Network • Now consider that Alice and Cathy share their knowledge about the anonymized network • What can they learn about the other individuals?
