Privacy Preserving Data Mining Li Xiong Department of Mathematics - PowerPoint PPT Presentation

CS378 Introduction to Data Mining Privacy Preserving Data Mining Li Xiong Department of Mathematics and Computer Science Department of Biomedical Informatics Emory University

Netflix Sequel • 2006, Netflix announced the challenge • 2007, researchers from University of Texas identified individuals by matching Netflix datasets with IMDB • July 2009, $1M grand prize awarded • August 2009, Netflix announced the second challenge • December 2009, four Netflix users filed a class action lawsuit against Netflix • March 2010, Netflix canceled the second challenge

Facebook-Cambridge Analytica • April 2010, Facebook launched Open Graph • 2013, 300,000 users took the psychographic personality test app “thisisyourdigitallife” • 2016, Trump’s campaign invested heavily in Facebook ads • March 2018, reports revealed that 50 million (later revised to 87 million) Facebook profiles were harvested for Cambridge Analytica and used for Trump’s campaign • April 11, 2018, Zuckerberg testified before Congress

• How many people know we are here? (a) no one (b) 1-10, e.g. family and friends (c) 10-100, e.g. colleagues and more (social network) friends

• 73% / 33% of Android apps shared personal info (e.g. email) / GPS coordinates with third parties • 45% / 47% of iOS apps shared email / GPS coordinates with third parties
[Figure: location data sharing by iOS apps (left) to domains (right)]
Source: Who Knows What About Me? A Survey of Behind the Scenes Personal Data Sharing to Third Parties by Mobile Apps, 2015-10-30 https://techscience.org/a/2015103001/

The EHR Data Map

Shopping records

Big Data Goes Personal • Movie ratings • Social network/media data • Mobile GPS data • Electronic medical records • Shopping history • Online browsing history

Data Mining

Data Mining … the dark side

Privacy Preserving Data Mining [Diagram: Private Data → Privacy Preserving Data Mining → Sanitized Data/Models] • Privacy goal: personal data is not revealed and cannot be inferred • Utility goal: the sanitized data/models are as close to the private data as possible

Privacy preserving data mining • Differential privacy • Definition • Building blocks (primitive mechanisms) • Composition rules • Data mining algorithms with differential privacy • k-means clustering w/ differential privacy • Frequent pattern mining w/ differential privacy

Differential Privacy

Traditional De-identification and Anonymization • Attribute suppression, perturbation, generalization • Inference possible with external data [Diagram: Original Data → De-identification / anonymization → Sanitized View]

Massachusetts GIC Incident (1990s) • Massachusetts Group Insurance Commission (GIC) encounter data (“de-identified”), mid 1990s • External information: voter roll from the city of Cambridge • Governor’s health records identified • 87% of Americans can be uniquely identified using zip code, birthdate, and sex (2000)

Name     SSN        Birth date  Zip    Diagnosis
Alice    123456789  44          48202  AIDS
Bob      323232323  44          48202  AIDS
Charley  232345656  44          48201  Asthma
Dave     333333333  55          48310  Asthma
Eva      666666666  55          48310  Diabetes

AOL Query Log Release (2006) • 20 million Web search queries released by AOL

AnonID  Query                 QueryTime            ItemRank  ClickURL
217     lottery               2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery               2006-03-27 14:10:38  1         http://www.calottery.com
1268    gall stones           2006-05-11 02:12:51
1268    gallstones            2006-05-11 02:13:02  1         http://www.niddk.nih.gov
1268    ozark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com

User 4417749 • “numb fingers” • “60 single men” • “dog that urinates on everything” • “landscapers in Lilburn, Ga” • Several people's names with the last name Arnold • “homes sold in shadow lake subdivision gwinnett county georgia”

The Genome Hacker (2013)

Differential Privacy • Statistical outcome (view) is indistinguishable regardless of whether a particular user is included in the data

Differential Privacy • View is indistinguishable regardless of the input [Diagram: Private Data D and Private Data D’ each pass through the privacy preserving data mining/sharing mechanism, which outputs Models/Data]
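The indistinguishability requirement is usually formalized as below. This is the standard statement of ε-differential privacy, included here as a reminder rather than as the slide's own wording; the symbol O for a set of outputs is notation chosen for this note, while A, D, and D’ follow the notation used later in the deck.

```latex
% A randomized algorithm A is \varepsilon-differentially private if,
% for all neighboring datasets D, D' (differing in one individual's record)
% and for every set of possible outputs O:
\Pr[A(D) \in O] \;\le\; e^{\varepsilon} \cdot \Pr[A(D') \in O]
```

A smaller ε forces the two output distributions closer together, which means stronger privacy and typically more noise.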

Differential privacy: an example [Figure: original records, the original histogram, and the perturbed histogram with differential privacy]

Laplace Mechanism • Query q on private data D has true answer q(D) • The mechanism returns q(D) + η • The noise η is drawn from the Laplace distribution $\mathrm{Lap}(S/\varepsilon)$ [Figure: Laplace density plotted for x from -10 to 10]

Laplace Distribution • PDF: $f(x \mid u, b) = \frac{1}{2b} \exp\left(-\frac{|x - u|}{b}\right)$ • Denoted as Lap(b) when u = 0 • Mean u • Variance $2b^2$

How much noise for privacy? [Dwork et al., TCC 2006] Sensitivity: Consider a query $q: I \to \mathbb{R}$. $S(q)$ is the smallest number such that for any neighboring tables D, D’, $|q(D) - q(D')| \le S(q)$. Theorem: If the sensitivity of the query is $S(q)$, then the algorithm $A(D) = q(D) + \mathrm{Lap}(S(q)/\varepsilon)$ guarantees ε-differential privacy.
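The theorem above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a hardened implementation; the function names `laplace_noise` and `laplace_mechanism` are hypothetical, and floating-point sampling like this is known to be unsuitable for production-grade privacy guarantees.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Lap(scale): mean 0, variance 2 * scale**2."""
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer: float, sensitivity: float,
                      epsilon: float, rng: random.Random) -> float:
    """Return q(D) + Lap(S(q) / epsilon) for a numeric query answer."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


# Example: a count whose true value is 3, sensitivity 1, epsilon 0.5.
rng = random.Random()
print(laplace_mechanism(3.0, sensitivity=1.0, epsilon=0.5, rng=rng))
```

Each call prints a different value; the noise has scale 1/0.5 = 2, so answers several units away from 3 are common at this ε.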

Example: COUNT query • Number of people having HIV+ • Sensitivity = ?

Example: COUNT query • Number of people having HIV+ • Sensitivity = 1 • ε-differentially private count: 3 + η, where η is drawn from Lap(1/ε) • Mean of η = 0 • Variance of η = $2/\varepsilon^2$

Example: Sum (Average) query • Sum of Age (suppose Age is in [a,b]) • Sensitivity = ?

Example: Sum (Average) query • Sum of Age (suppose Age is in [a,b]) • Sensitivity = b (adding or removing one record changes the sum by at most b, assuming 0 ≤ a ≤ b)
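A noisy sum along these lines might look like the sketch below. It assumes neighboring tables differ by adding or removing one record and that 0 ≤ a ≤ b, so the sensitivity is b; the name `noisy_sum` and the clamping step are illustrative choices, not something the slides specify. Clamping enforces the assumed range so that the sensitivity bound actually holds for the data at hand.

```python
import random
from typing import Sequence


def noisy_sum(values: Sequence[float], lower: float, upper: float,
              epsilon: float, rng: random.Random) -> float:
    """Release sum(values) with Laplace noise, assuming values lie in [lower, upper]."""
    if not (0 <= lower <= upper) or upper <= 0:
        raise ValueError("expected 0 <= lower <= upper with upper > 0")
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # Clamp every value into the declared range so one record can shift
    # the sum by at most `upper` (the sensitivity assumed here).
    clamped = [min(max(v, lower), upper) for v in values]
    scale = upper / epsilon
    # Difference of two exponentials with mean `scale` is Lap(scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(clamped) + noise


ages = [23, 35, 41, 58, 67]
print(noisy_sum(ages, lower=0, upper=100, epsilon=1.0, rng=random.Random()))
```

A tighter upper bound b reduces the noise, which is one instance of the sensitivity reduction idea.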

Composition theorems • Sequential composition: $\sum_i \varepsilon_i$-differential privacy • Parallel composition: $\max_i(\varepsilon_i)$-differential privacy

Sequential Composition • If $M_1, M_2, \dots, M_k$ are algorithms that access a private database D such that each $M_i$ satisfies $\varepsilon_i$-differential privacy, then the combination of their outputs satisfies ε-differential privacy with $\varepsilon = \varepsilon_1 + \dots + \varepsilon_k$

Parallel Composition • If $M_1, M_2, \dots, M_k$ are algorithms that access disjoint databases $D_1, D_2, \dots, D_k$ such that each $M_i$ satisfies $\varepsilon_i$-differential privacy, then the combination of their outputs satisfies ε-differential privacy with $\varepsilon = \max\{\varepsilon_1, \dots, \varepsilon_k\}$
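The two composition rules can be illustrated with a small budget-accounting sketch. The helper names below are hypothetical. The histogram example assumes each record falls into exactly one category, so the per-category counts operate on disjoint partitions and the whole histogram costs ε rather than ε times the number of categories.

```python
import random
from collections import Counter
from typing import Dict, Hashable, Sequence


def sequential_epsilon(epsilons: Sequence[float]) -> float:
    """Total budget when all mechanisms access the same database."""
    return sum(epsilons)


def parallel_epsilon(epsilons: Sequence[float]) -> float:
    """Total budget when the mechanisms access disjoint databases."""
    return max(epsilons)


def noisy_histogram(records: Sequence[Hashable], categories: Sequence[Hashable],
                    epsilon: float, rng: random.Random) -> Dict[Hashable, float]:
    """Add Lap(1/epsilon) noise to each count; costs epsilon overall (parallel)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    counts = Counter(records)
    scale = 1.0 / epsilon  # each count has sensitivity 1
    return {
        c: counts.get(c, 0) + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        for c in categories
    }


rng = random.Random()
diagnoses = ["AIDS", "AIDS", "Asthma", "Asthma", "Diabetes"]
print(noisy_histogram(diagnoses, ["AIDS", "Asthma", "Diabetes"], 0.5, rng))
# Releasing two such histograms over the same records composes sequentially.
print(sequential_epsilon([0.5, 0.5]))
```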

Postprocessing • If $M_1$ is an ε-differentially private algorithm that accesses a private database D, then outputting $M_2(M_1(D))$ also satisfies ε-differential privacy, for any $M_2$ that does not itself access D. (Source: Tutorial: Differential Privacy in the Wild, Module 2)
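As a small illustration of postprocessing, a noisy count can be rounded and clipped at zero without spending any additional privacy budget, because the cleanup step only looks at the already-noisy value. The function name `postprocess_count` is a hypothetical choice for this sketch.

```python
def postprocess_count(noisy_count: float) -> int:
    """Turn a noisy count into a non-negative integer.

    This touches only the mechanism's output, never the private data,
    so the epsilon guarantee of the noisy count is unchanged.
    """
    return max(0, round(noisy_count))


print(postprocess_count(-1.7))
print(postprocess_count(3.2))
```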

Privacy preserving data mining • Differential privacy • Definition • Building blocks (primitive mechanisms) • Composition rules • Data mining algorithms with differential privacy • k-means clustering w/ differential privacy • Frequent itemset mining w/ differential privacy

Privacy Preserving Data Mining as Constrained Optimization • Two goals • Privacy • Error (utility) • Given a task and privacy budget ε, how to design a set of queries (functions) and allocate the budget such that the error is minimized?

Data mining algorithms with differential privacy • General algorithmic framework: (1) Decompose a data mining algorithm into a set of functions (2) Allocate privacy budget to each function (3) Implement each function with $\varepsilon_i$-differential privacy, computing the noisy output with the Laplace mechanism based on the sensitivity of the function and $\varepsilon_i$ (4) Compose them using the composition theorems • Optimization techniques: decomposition design, budget allocation, sensitivity reduction for each function
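The framework can be sketched on a deliberately small task: a private average. The sketch decomposes the average into a sum and a count, splits the budget between them, adds Laplace noise to each, and relies on sequential composition for the total. The name `noisy_average`, the `sum_share` parameter, and the even default split are assumptions made for illustration; values are assumed to lie in [0, upper].

```python
import random
from typing import Sequence


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Lap(scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_average(values: Sequence[float], upper: float, epsilon: float,
                  rng: random.Random, sum_share: float = 0.5) -> float:
    """Private average via decomposition into a noisy sum and a noisy count."""
    if upper <= 0 or epsilon <= 0 or not (0 < sum_share < 1):
        raise ValueError("need upper > 0, epsilon > 0, 0 < sum_share < 1")
    # Step 2: allocate the budget between the two functions.
    eps_sum = epsilon * sum_share
    eps_count = epsilon - eps_sum
    # Step 3: each function gets Laplace noise scaled to its sensitivity.
    clamped = [min(max(v, 0.0), upper) for v in values]
    noisy_sum = sum(clamped) + laplace(upper / eps_sum, rng)    # sensitivity = upper
    noisy_count = len(clamped) + laplace(1.0 / eps_count, rng)  # sensitivity = 1
    # Step 4: sequential composition gives eps_sum + eps_count = epsilon.
    # The division is postprocessing; the floor of 1 avoids dividing by ~0.
    return noisy_sum / max(noisy_count, 1.0)


ages = [23, 35, 41, 58, 67]
print(noisy_average(ages, upper=100, epsilon=1.0, rng=random.Random()))
```

Changing `sum_share` is a simple form of budget allocation: giving more budget to the noisier component may reduce the overall error.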

Review: K-means Clustering

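K-means Problem • Partition a set of points $x_1, x_2, \dots, x_n$ into k clusters $S_1, S_2, \dots, S_k$ such that the sum of squared errors (SSE) is minimized: $\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$, where $\mu_i$ is the mean of the cluster $S_i$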

K-means Algorithm • Initialize a set of k centers • Repeat until convergence 1. Assign each point to its nearest center 2. Update the set of centers • Output final set of k centers and the points in each cluster
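The loop above can be sketched in plain Python as follows. This is the ordinary (non-private) algorithm, written as an illustration; the random choice of initial centers, the `max_iters` cap, and keeping an empty cluster's old center are assumptions of this sketch, since the slide leaves those details open.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points: Sequence[Point], centers: Sequence[Point]) -> List[List[Point]]:
    """Step 1: put each point in the cluster of its nearest center."""
    clusters: List[List[Point]] = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points: Sequence[Point], k: int, rng: random.Random,
           max_iters: int = 100) -> Tuple[List[Point], List[List[Point]]]:
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = rng.sample(list(points), k)  # initialize with k distinct input points
    for _ in range(max_iters):
        clusters = assign(points, centers)
        # Step 2: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(members[0])))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, assign(points, centers)


pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
final_centers, final_clusters = kmeans(pts, k=2, rng=random.Random())
print(final_centers)
```

In the differentially private variants named in the outline, the center update is the natural place to add noise, since it reduces to per-cluster sums and counts.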
