Data Privacy Anonymization Li Xiong CS573 Data Privacy and - PowerPoint PPT Presentation

Data Privacy – Anonymization Li Xiong CS573 Data Privacy and Security

Outline • Inference control • Anonymization problem • Anonymization notions and approaches (and how they fail to work!) – k-anonymity – l-diversity – t-closeness • Takeaways

Access Control vs. Inference Control • Access control: protecting information from being accessed by unauthorized users [Figure: Data → Access Control] • Inference control (disclosure control): protecting private data from being inferred from sanitized data or models by authorized users [Figure: Original Data → Inference Control → Sanitized Data/Models]

Disclosure Risk and Information Loss • Privacy (disclosure risk): the risk that a given form of disclosure will arise if the data is released • Utility (information loss): the information that exists in the initial data but not in the released data because of disclosure control methods [Figure: Original Data → Inference Control → Sanitized Data/Models]

What to Protect: Classical Intuition for Privacy • Uninformative principle (Dalenius 1977) – Access to the published data does not reveal anything extra about any target victim, even in the presence of the attacker’s background knowledge obtained from other sources • Similar to semantic security of encryption – Knowledge of the ciphertext (and length) of some unknown message does not reveal any additional information on the message that can be feasibly extracted

What to Protect: Types of Disclosure • Membership disclosure: Attacker can tell that a given person is in the dataset • Identity disclosure: Attacker can tell which record corresponds to a given person • Sensitive attribute disclosure: Attacker can tell that a given person or record has a certain sensitive attribute

What’s published • Microdata represents a set of records containing information on an individual unit such as a person, a firm, an institution • Macrodata represents computed/derived statistics • Models and patterns from machine learning and data mining

Disclosure Control for Microdata
Initial Microdata
Name   Age  Diagnosis     Income
Wayne  44   AIDS          45,500
Gore   44   Asthma        37,900
Banks  55   AIDS          67,000
Casey  44   Asthma        21,000
Stone  55   Asthma        90,000
Kopi   45   Diabetes      48,000
Simms  25   Diabetes      49,000
Wood   35   AIDS          66,000
Aaron  55   AIDS          69,000
Pall   45   Tuberculosis  34,000
Masked Microdata
Age  Diagnosis  Income
44   AIDS       50,000
44   Asthma     40,000
55   AIDS       70,000
44   Asthma     20,000
55   Asthma     90,000
45   Diabetes   50,000
-    Diabetes   50,000
-    AIDS       70,000
55   AIDS       70,000
45   -          30,000
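The masked microdata above is consistent with three simple rules: drop the name, suppress any age or diagnosis that occurs only once, and round income to the nearest 10,000. The slide does not state these rules, so the Python sketch below is an inferred illustration rather than the method used to build the slide; `mask_microdata`, `min_count`, and `income_step` are hypothetical names.

```python
from collections import Counter

# Initial microdata from the slide: (name, age, diagnosis, income).
INITIAL = [
    ("Wayne", 44, "AIDS", 45500),
    ("Gore", 44, "Asthma", 37900),
    ("Banks", 55, "AIDS", 67000),
    ("Casey", 44, "Asthma", 21000),
    ("Stone", 55, "Asthma", 90000),
    ("Kopi", 45, "Diabetes", 48000),
    ("Simms", 25, "Diabetes", 49000),
    ("Wood", 35, "AIDS", 66000),
    ("Aaron", 55, "AIDS", 69000),
    ("Pall", 45, "Tuberculosis", 34000),
]


def mask_microdata(records, min_count=2, income_step=10000):
    """Drop names, suppress rare values, and round incomes.

    Assumed rules (inferred from the example, not stated on the slide):
    an age or diagnosis seen fewer than `min_count` times becomes None.
    """
    age_counts = Counter(age for _, age, _, _ in records)
    diag_counts = Counter(diag for _, _, diag, _ in records)
    masked = []
    for _name, age, diag, income in records:
        masked.append((
            age if age_counts[age] >= min_count else None,
            diag if diag_counts[diag] >= min_count else None,
            # Round half up to the nearest multiple of income_step.
            (income + income_step // 2) // income_step * income_step,
        ))
    return masked


MASKED = mask_microdata(INITIAL)
for row in MASKED:
    print(row)
```

Suppressed values appear as `None`, which corresponds to the "-" entries in the masked table.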

Disclosure Control for Macrodata (Statistics Tables)
Initial Microdata
Name   Age  Diagnosis     Income
Wayne  44   AIDS          45,500
Gore   44   Asthma        37,900
Banks  55   AIDS          67,000
Casey  44   Asthma        21,000
Stone  55   Asthma        90,000
Kopi   45   Diabetes      48,000
Simms  25   Diabetes      49,000
Wood   35   AIDS          66,000
Aaron  55   AIDS          69,000
Pall   45   Tuberculosis  34,000
Table 1 - Count by Diagnosis
Count  Diagnosis
4      AIDS
3      Asthma
2      Diabetes
1      Tuberculosis
Masked Table 1
Count  Diagnosis
4      AIDS
3      Asthma
Table 2 - Total Income by Age
Count  Age      Income
1      <= 30    49,000
1      31 - 40  66,000
5      41 - 50  186,400
3      51 - 60  226,000
0      > 60     0
Masked Table 2
Count  Age      Income
5      41 - 50  186,400
3      51 - 60  226,000
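Masked Table 1 keeps exactly the cells whose count is at least 3. The slide does not name the rule, so the following Python sketch assumes small-cell suppression with a hypothetical `threshold` of 3; `suppress_small_cells` is likewise an illustrative name.

```python
from collections import Counter

# Diagnosis column of the initial microdata, in slide order.
DIAGNOSES = [
    "AIDS", "Asthma", "AIDS", "Asthma", "Asthma",
    "Diabetes", "Diabetes", "AIDS", "AIDS", "Tuberculosis",
]


def suppress_small_cells(table, threshold=3):
    """Keep only cells whose count reaches the (assumed) threshold."""
    return {key: count for key, count in table.items() if count >= threshold}


# Table 1: number of records per diagnosis.
TABLE_1 = dict(Counter(DIAGNOSES))
# Masked Table 1: cells with fewer than 3 records are withheld.
MASKED_TABLE_1 = suppress_small_cells(TABLE_1)

print(TABLE_1)
print(MASKED_TABLE_1)
```

Withholding the Diabetes and Tuberculosis cells hides the rare cases, at the cost of the information those counts carried.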

Disclosure Control for Data Mining/Machine Learning Models
Initial Microdata
Name   Age  Diagnosis     Income
Wayne  44   AIDS          45,500
Gore   44   Asthma        37,900
Banks  55   AIDS          67,000
Casey  44   Asthma        21,000
Stone  55   Asthma        90,000
Kopi   45   Diabetes      48,000
Simms  25   Diabetes      49,000
Wood   35   AIDS          66,000
Aaron  55   AIDS          69,000
Pall   45   Tuberculosis  34,000

Inference Control Methods • Microdata release (anonymization) – Input perturbation: attribute suppression, generalization, perturbation • Macrodata release – Output perturbation: summary statistics with perturbation • Query restriction/auditing (interactive version) – Auditor decides which queries are OK and what type of noise to add

Outline • Anonymization problem • Anonymization notions and approaches (and how they fail to work!) – Basic attempt: de-identification – k-anonymity – l-diversity – t-closeness • Takeaways

Basic Attempt • Remove/replace identifier attributes [Figure: Original Data → De-identification → Sanitized Records]

Data “Anonymization” • Remove “personally identifying information” (PII) – Name, Social Security number, phone number, email, address… what else? • Problem: PII has no technical meaning or common definition – Defined in sectoral laws such as HIPAA (PHI: Protected Health Information), which lists 18 identifiers – Any information can be personally identifying, e.g., a rare disease condition • Many de-anonymization examples: GIC, AOL dataset, Netflix Prize dataset

Massachusetts GIC Incident • Massachusetts GIC released “anonymized” data on state employees’ hospital visits • Then-Governor William Weld assured the public that privacy was protected
GIC data
Name     SSN        Age  Zip    Diagnosis
Alice    123456789  44   48202  AIDS
Bob      323232323  44   48202  AIDS
Charley  232345656  44   48201  Asthma
Dave     333333333  55   48310  Asthma
Eva      666666666  55   48310  Diabetes
Anonymized release
Age  Zip    Diagnosis
44   48202  AIDS
44   48202  AIDS
44   48201  Asthma
55   48310  Asthma
55   48310  Diabetes

Massachusetts GIC • Then-graduate student Sweeney linked the data with voter registration data for Cambridge and identified Governor Weld’s record
GIC data
Name     SSN        Age  Zip    Diagnosis
Alice    123456789  44   48202  AIDS
Bob      323232323  44   48202  AIDS
Charley  232345656  44   48201  Asthma
Dave     333333333  55   48310  Asthma
Eva      666666666  55   48310  Diabetes
Anonymized release
Age  Zip    Diagnosis
44   48202  AIDS
44   48202  AIDS
44   48201  Asthma
55   48310  Asthma
55   48310  Diabetes
Voter roll for Cambridge
Name     Age  Zip
Alice    44   48202
Charley  44   48201
Dave     55   48310
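The linking step can be sketched in Python as a join on the attributes the two tables share (Age and Zip). This is an illustration on the toy rows from the slide, not Sweeney's actual procedure; `link` and the variable names are hypothetical.

```python
from collections import defaultdict

# Anonymized release from the slide: (age, zip, diagnosis).
RELEASED = [
    (44, "48202", "AIDS"),
    (44, "48202", "AIDS"),
    (44, "48201", "Asthma"),
    (55, "48310", "Asthma"),
    (55, "48310", "Diabetes"),
]

# Voter roll from the slide: (name, age, zip).
VOTER_ROLL = [
    ("Alice", 44, "48202"),
    ("Charley", 44, "48201"),
    ("Dave", 55, "48310"),
]


def link(released, voters):
    """Map each voter to the diagnoses of all released rows sharing age and zip."""
    by_key = defaultdict(list)
    for age, zip_code, diagnosis in released:
        by_key[(age, zip_code)].append(diagnosis)
    return {name: by_key.get((age, zip_code), []) for name, age, zip_code in voters}


CANDIDATES = link(RELEASED, VOTER_ROLL)
for name, diagnoses in CANDIDATES.items():
    if len(diagnoses) == 1:
        print(f"{name}: unique match, diagnosis {diagnoses[0]}")
    elif len(set(diagnoses)) == 1:
        print(f"{name}: {len(diagnoses)} matches, all with diagnosis {diagnoses[0]}")
    else:
        print(f"{name}: {len(diagnoses)} matches, diagnosis ambiguous")
```

On these rows Charley matches a single record (identity disclosure), while Alice matches two records that share one diagnosis, so her sensitive attribute is disclosed even though her exact record is not.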

Re-identification

AOL Query Log Release • 20 million Web search queries released by AOL
AnonID  Query                 QueryTime            ItemRank  ClickURL
217     lottery               2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery               2006-03-27 14:10:38  1         http://www.calottery.com
1268    gall stones           2006-05-11 02:12:51
1268    gallstones            2006-05-11 02:13:02  1         http://www.niddk.nih.gov
1268    ozark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com
(Source: AOL Query Log)
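Because every query carries the same AnonID for a given user, the log can be regrouped into per-user search profiles even though no name is present. A minimal Python sketch on the sample rows above; `build_profiles` is a hypothetical helper name.

```python
from collections import defaultdict

# (AnonID, Query) pairs taken from the sample rows of the log.
LOG = [
    (217, "lottery"),
    (217, "lottery"),
    (1268, "gall stones"),
    (1268, "gallstones"),
    (1268, "ozark horse blankets"),
]


def build_profiles(log):
    """Collect the distinct queries of each pseudonymous user, in first-seen order."""
    profiles = defaultdict(list)
    for anon_id, query in log:
        if query not in profiles[anon_id]:
            profiles[anon_id].append(query)
    return dict(profiles)


PROFILES = build_profiles(LOG)
for anon_id, queries in PROFILES.items():
    print(anon_id, queries)
```

Replacing the user name with a stable number removes the identifier but keeps the linkage between queries, which is what makes the accumulated profile identifying.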


User No. 4417749 • Queries from user 4417749: – “numb fingers” – “60 single men” – “dog that urinates on everything” – “landscapers in Lilburn, Ga” – Several people’s names with the last name Arnold – “homes sold in shadow lake subdivision gwinnett county georgia” • Identified as Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments, and loves her dogs

The Genome Hacker (2013)


K-Anonymity • The term was introduced in 1998 by Samarati and Sweeney. • Important papers: – Sweeney L. (2002), k-Anonymity: A Model for Protecting Privacy, International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 10, No. 5, 557-570 – Sweeney L. (2002), Achieving k-Anonymity Privacy Protection Using Generalization and Suppression, International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 10, No. 5, 571-588 – Samarati P. (2001), Protecting Respondents’ Identities in Microdata Release, IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 6, 1010-1027 • Hundreds of papers on the topic in the past decade – Theoretical results – Many algorithms achieving k-anonymity – Many improved principles and algorithms


Motivating Example [Figure: Original Data → De-identification → Sanitized Records]
Original data (Zip, Age, Nationality: non-sensitive; Name, Condition: sensitive)
#  Zip    Age  Nationality  Name     Condition
1  13053  28   Brazilian    Ronaldo  Heart Disease
2  13067  29   US           Bob      Heart Disease
3  13053  37   Indian       Kumar    Cancer
4  13067  36   Japanese     Umeko    Cancer
Sanitized records (Zip, Age, Nationality: non-sensitive; Condition: sensitive)
#  Zip    Age  Nationality  Condition
1  13053  28   Brazilian    Heart Disease
2  13067  29   US           Heart Disease
3  13053  37   Indian       Cancer
4  13067  36   Japanese     Cancer
Attacker’s knowledge: voter registration list
#  Name   Zip    Age  Nationality
1  John   13067  45   US
2  Paul   13067  22   US
3  Bob    13067  29   US
4  Chris  13067  23   US
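One way to see why removing names is not enough here is to count how many sanitized records share each (Zip, Age, Nationality) combination: the smallest such group size is the k that the table provides. The Python sketch below applies this to the rows above and then matches the voter list against them; `smallest_group` and `reidentify` are hypothetical helper names, and treating these three columns as the quasi-identifier is an assumption based on the attacker's list.

```python
from collections import Counter

# Sanitized records from the slide: (zip, age, nationality, condition).
SANITIZED = [
    ("13053", 28, "Brazilian", "Heart Disease"),
    ("13067", 29, "US", "Heart Disease"),
    ("13053", 37, "Indian", "Cancer"),
    ("13067", 36, "Japanese", "Cancer"),
]

# Attacker's voter registration list: (name, zip, age, nationality).
VOTERS = [
    ("John", "13067", 45, "US"),
    ("Paul", "13067", 22, "US"),
    ("Bob", "13067", 29, "US"),
    ("Chris", "13067", 23, "US"),
]


def smallest_group(records):
    """Size of the smallest group of records sharing (zip, age, nationality)."""
    sizes = Counter(record[:3] for record in records)
    return min(sizes.values())


def reidentify(records, voters):
    """Return voters whose (zip, age, nationality) matches exactly one record."""
    matches = {}
    for name, zip_code, age, nationality in voters:
        hits = [r[3] for r in records if r[:3] == (zip_code, age, nationality)]
        if len(hits) == 1:
            matches[name] = hits[0]
    return matches


K = smallest_group(SANITIZED)
REIDENTIFIED = reidentify(SANITIZED, VOTERS)
print("k =", K)
print(REIDENTIFIED)
```

Every combination is unique in this table, so k is 1 and the attacker learns Bob's condition from a single exact match.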
