CS378 Introduction to Data Mining
Privacy Preserving Data Mining
Li Xiong
Department of Mathematics and Computer Science
Department of Biomedical Informatics
Emory University
Netflix Sequel
- 2006, Netflix announced the challenge
- 2007, researchers from the University of Texas identified individuals by matching Netflix datasets with IMDb
- July 2009, $1M grand prize awarded
- August 2009, Netflix announced the second challenge
- December 2009, four Netflix users filed a class action lawsuit against Netflix
- March 2010, Netflix canceled the second challenge
Facebook-Cambridge Analytica
- April 2010, Facebook launched Open Graph
- 2013, 300,000 users took the psychographic personality test app “thisisyourdigitallife”
- 2016, Trump’s campaign invested heavily in Facebook ads
- March 2018, reports revealed that 50 million (later revised to 87 million) Facebook profiles were harvested for Cambridge Analytica and used for Trump’s campaign
- April 11, 2018, Zuckerberg testified before Congress
- How many people know we are here?
(a) no one
(b) 1-10, e.g. family and friends
(c) 10-100, e.g. colleagues and more (social network) friends
Who Knows What About Me? A Survey of Behind the Scenes Personal Data Sharing to Third Parties by Mobile Apps, 2015-10-30 https://techscience.org/a/2015103001/
- 73% of Android apps shared personal info (e.g. email) with third parties; 33% shared GPS coordinates
- 45% of iOS apps shared email with third parties; 47% shared GPS coordinates
Location data sharing by iOS apps (left) to domains (right)
The EHR Data Map
Shopping records
Big Data Goes Personal
- Movie ratings
- Social network/media data
- Mobile GPS data
- Electronic medical records
- Shopping history
- Online browsing history
Data Mining
Data Mining … the dark side
[Figure: Private Data → Privacy Preserving Data Mining → Sanitized Data / Models]
Privacy Preserving Data Mining
- Privacy goal: personal data is not revealed and cannot be inferred
- Utility goal: data/models as close to the private data as possible
Privacy preserving data mining
- Differential privacy
- Definition
- Building blocks (primitive mechanisms)
- Composition rules
- Data mining algorithms with differential privacy
- k-means clustering w/ differential privacy
- Frequent pattern mining w/ differential privacy
Differential Privacy
[Figure: Original Data → de-identification / anonymization → Sanitized View]
Traditional De-identification and Anonymization
- Attribute suppression, perturbation, generalization
- Inference possible with external data
Massachusetts GIC Incident (1990s)
- Massachusetts Group Insurance Commission (GIC) encounter data (“de-identified”), mid 1990s
- External information: voter roll from the city of Cambridge
- Governor’s health records identified
- 87% of Americans can be uniquely identified using ZIP code, birth date, and sex (2000)
Name     SSN        Birth date  Zip    Diagnosis
Alice    123456789  44          48202  AIDS
Bob      323232323  44          48202  AIDS
Charley  232345656  44          48201  Asthma
Dave     333333333  55          48310  Asthma
Eva      666666666  55          48310  Diabetes
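The linkage step behind this kind of incident can be sketched in a few lines. This is a minimal illustration, not the procedure actually used in the GIC case: all records, names, and the helper `link_records` are hypothetical, and the only idea taken from the slide is that ZIP code, birth date, and sex together act as a join key against an external voter roll.

```python
# Hypothetical "de-identified" medical records: names and SSNs removed,
# but ZIP code, birth date, and sex are still present.
medical_records = [
    {"zip": "02138", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "Asthma"},
    {"zip": "02139", "birth_date": "1962-07-04", "sex": "F", "diagnosis": "Diabetes"},
    {"zip": "02139", "birth_date": "1980-11-30", "sex": "F", "diagnosis": "Flu"},
]

# Hypothetical public voter roll: names plus the same three attributes.
voter_roll = [
    {"name": "Voter A", "zip": "02138", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Voter B", "zip": "02139", "birth_date": "1962-07-04", "sex": "F"},
    {"name": "Voter C", "zip": "02139", "birth_date": "1980-11-30", "sex": "F"},
    {"name": "Voter D", "zip": "02139", "birth_date": "1980-11-30", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link_records(records, roll, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for records that match exactly one voter."""
    names_by_key = {}
    for voter in roll:
        key = tuple(voter[k] for k in keys)
        names_by_key.setdefault(key, []).append(voter["name"])

    linked = []
    for record in records:
        names = names_by_key.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            linked.append((names[0], record["diagnosis"]))
    return linked


reidentified = link_records(medical_records, voter_roll)
print(reidentified)
```

In this toy data the third medical record matches two voters, so it stays ambiguous; the first two are linked to a single name each.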
AOL Query Log Release (2006)
- User 4417749
- “numb fingers”
- “60 single men”
- “dog that urinates on everything”
- “landscapers in Lilburn, Ga”
- Several people's names with the last name Arnold
- “homes sold in shadow lake subdivision gwinnett county georgia”
AnonID  Query                 QueryTime            ItemRank  ClickURL
217     lottery               2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery               2006-03-27 14:10:38  1         http://www.calottery.com
1268    gall stones           2006-05-11 02:12:51
1268    gallstones            2006-05-11 02:13:02  1         http://www.niddk.nih.gov
1268    ozark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com
20 million Web search queries by AOL
The Genome Hacker (2013)
Differential Privacy
- Statistical outcome (view) is indistinguishable regardless of whether a particular user is included in the data
[Figure: Private Data D → privacy preserving data mining/sharing mechanism → Models / Data]
Differential Privacy
- View is indistinguishable regardless of the input
Private Data D’
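For reference, the informal statement above is usually written as the following inequality. This is the standard textbook form of ε-differential privacy, recalled here as a reminder rather than copied from the slides; the symbol O for a set of possible outputs is a notation chosen for this note, and A, D, D’ follow the names used elsewhere in the deck.

```latex
\Pr[A(D) \in O] \;\le\; e^{\varepsilon} \cdot \Pr[A(D') \in O]
\quad \text{for all neighboring } D, D' \text{ and all sets of outputs } O .
```

A smaller ε forces the two output distributions closer together, which means stronger privacy and, typically, more noise.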
Differential privacy: an example
[Figure: original records, original histogram, and perturbed histogram with differential privacy]
Laplace Mechanism
[Figure: density of the Laplace distribution Lap(S/ε), x axis from -10 to 10]
[Figure: query q on the private data D; true answer q(D); released answer q(D) + η, with noise η drawn from Lap(S/ε)]
Laplace Distribution
- PDF: $f(x \mid u, b) = \frac{1}{2b} \exp\left(-\frac{|x - u|}{b}\right)$
- Denoted as Lap(b) when u = 0
- Mean u
- Variance $2b^2$
How much noise for privacy?
Sensitivity: Consider a query q: I → R. S(q) is the smallest number s.t. for any neighboring tables D, D’, |q(D) - q(D’)| ≤ S(q)
Theorem: If the sensitivity of the query is S(q), then the algorithm A(D) = q(D) + Lap(S(q)/ε) guarantees ε-differential privacy
[Dwork et al., TCC 2006]
Example: COUNT query
- Number of people having HIV+
- Sensitivity = 1
- ε-differentially private count: 3 + η, where 3 is the true count and η is drawn from Lap(1/ε)
- Mean of η = 0
- Variance of η = 2/ε²
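The count example can be sketched with the standard library only. This is a minimal illustration under stated assumptions: the helper names `laplace_noise` and `laplace_mechanism` are made up for this note, the noise is sampled as the difference of two independent exponentials (one known way to obtain a Laplace sample), and Python's `random` module is not a cryptographically secure noise source, so this is not production code.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Lap(scale), i.e. a Laplace distribution with mean 0.

    The difference of two independent exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Return true_answer + Lap(sensitivity / epsilon)."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


# COUNT query from the slide: true count 3, sensitivity 1.
rng = random.Random(0)
epsilon = 0.5
samples = [laplace_mechanism(3, 1, epsilon, rng) for _ in range(20000)]

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, variance)  # should be close to 3 and to 2 / epsilon**2 = 8
```

Averaged over many draws the noisy count is centered on the true count, while any single release can be off by several units when ε is small.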
Example: Sum (Average) query
- Sum of Age (suppose Age is in [a,b])
- Sensitivity = b (for 0 ≤ a ≤ b, adding or removing one person changes the sum by at most b)
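A bounded sum, and an average built from it, might look as follows. This is a sketch under assumptions that the slide does not spell out: neighboring tables differ by adding or removing one record, ages lie in [0, upper] (values outside are clipped so the sensitivity bound actually holds), and the average is obtained by splitting ε evenly between a noisy sum and a noisy count. The function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sum(values, upper, epsilon, rng):
    """Noisy sum of values clipped to [0, upper]; the sensitivity is `upper`."""
    clipped = [min(max(v, 0.0), upper) for v in values]
    return sum(clipped) + laplace_noise(upper / epsilon, rng)


def private_average(values, upper, epsilon, rng):
    """Noisy average: half of epsilon for the sum, half for the count."""
    half = epsilon / 2.0
    noisy_sum = private_sum(values, upper, half, rng)
    noisy_count = len(values) + laplace_noise(1.0 / half, rng)  # count sensitivity 1
    # Guard against a tiny or negative noisy count (post-processing only).
    return noisy_sum / max(noisy_count, 1.0)


ages = [30.0, 40.0, 50.0]
rng = random.Random(1)
print(private_sum(ages, upper=100.0, epsilon=1.0, rng=rng))
print(private_average(ages, upper=100.0, epsilon=1.0, rng=rng))
```

With only three records and an upper bound of 100, the noise dominates the answer; the same budget gives far better relative accuracy on large tables.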
Composition theorems
- Sequential composition: ∑i εi-differential privacy
- Parallel composition: max(εi)-differential privacy
Sequential Composition
- If M1, M2, ..., Mk are algorithms that access a private database D such that each Mi satisfies εi-differential privacy, then the combination of their outputs satisfies ε-differential privacy with ε = ε1 + ... + εk
Parallel Composition
- If M1, M2, ..., Mk are algorithms that access disjoint databases D1, D2, …, Dk such that each Mi satisfies εi-differential privacy, then the combination of their outputs satisfies ε-differential privacy with ε = max{ε1, ..., εk}
Postprocessing
- If M1 is an ε-differentially private algorithm that accesses a private database D, then outputting M2(M1(D)), for any algorithm M2 that does not access D directly, also satisfies ε-differential privacy.
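The two composition rules can be illustrated with a noisy histogram. This is a hedged sketch: `private_histogram`, `sequential_cost`, and `parallel_cost` are hypothetical helpers, the bins are assumed to partition the records (each record falls in exactly one bin), and neighboring tables are assumed to differ by one added or removed record.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def sequential_cost(epsilons):
    """Total budget when every mechanism sees the same database."""
    return sum(epsilons)


def parallel_cost(epsilons):
    """Total budget when the mechanisms see disjoint parts of the database."""
    return max(epsilons)


def private_histogram(values, bins, epsilon, rng):
    """Noisy count per bin.

    Each bin count has sensitivity 1 and the bins are disjoint, so by parallel
    composition the whole histogram costs epsilon, not len(bins) * epsilon.
    """
    counts = Counter(values)
    return {b: counts.get(b, 0) + laplace_noise(1.0 / epsilon, rng) for b in bins}


rng = random.Random(2)
diagnoses = ["AIDS", "AIDS", "Asthma", "Asthma", "Diabetes"]
histogram = private_histogram(diagnoses, ["AIDS", "Asthma", "Diabetes"], 0.5, rng)
print(histogram)

# One histogram: three bin counts in parallel, each with epsilon = 0.5.
print(parallel_cost([0.5, 0.5, 0.5]))   # 0.5
# Two such histograms released on the same data compose sequentially.
print(sequential_cost([0.5, 0.5]))      # 1.0
```

Rounding the noisy counts or clamping them at zero afterwards is post-processing and does not consume additional budget.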
Tutorial: Differential Privacy in the Wild, Module 2
Privacy Preserving Data Mining as Constrained Optimization
- Two goals
- Privacy
- Error (utility)
- Given a task and privacy budget ε, how to design a set of queries (functions) and allocate the budget such that the error is minimized?
Data mining algorithms with differential privacy
- General algorithmic framework
- Decompose a data mining algorithm into a set of functions
- Allocate privacy budget to each function
- Implement each function with εi-differential privacy
- Compute noisy output using the Laplace mechanism based on the sensitivity of the function and εi
- Compose them using composition theorem
- Optimization techniques
- Decomposition design
- Budget allocation
- Sensitivity reduction for each function
Review: K-means Clustering
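K-means Problem
- Partition a set of points x1, x2, …, xn into k clusters S1, S2, …, Sk such that the SSE is minimized:
$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$
where $\mu_i$ is the mean of the cluster Si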
K-means Algorithm
- Initialize a set of k centers
- Repeat until convergence
- 1. Assign each point to its nearest center
- 2. Update the set of centers
- Output final set of k centers and the points in each cluster
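The two alternating steps above can be sketched in plain Python. This is a minimal, non-private illustration: the function names are chosen for this note, points are tuples of numbers, distance is squared Euclidean, and an empty cluster simply keeps its previous center (one of several possible conventions).

```python
def squared_distance(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))


def assign(points, centers):
    """Step 1: index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
            for p in points]


def update(points, labels, centers):
    """Step 2: move each center to the mean of the points assigned to it."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centers.append(old)  # empty cluster: keep the previous center
        else:
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centers


def kmeans(points, centers, max_iter=100):
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, assign(points, centers)


def sse(points, labels, centers):
    """Sum of squared distances from each point to its cluster center."""
    return sum(squared_distance(p, centers[label]) for p, label in zip(points, labels))


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, labels = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers, labels, sse(points, labels, centers))
```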
Differentially Private K-means
- Initialize a set of k centers
- Repeat iterations until convergence
- In each iteration (given a set of centers):
- 1. Assign the points to the closest center
- 2. Compute the size of each cluster
- 3. Compute the sum (centroid) of points in each cluster
- Output the final centroid and size of each cluster
[BDMN 05]
Differentially Private K-means
- Initialize a set of k centers
- Suppose we fix the number of iterations to T
- In each iteration (given a set of centers):
- 1. Assign the points to the closest center
- 2. Compute the noisy size of each cluster
- 3. Compute the noisy sum (centroid) of points in each cluster
- Output the final centroid and size of each cluster
[BDMN 05]
Each iteration uses ε/T privacy, total privacy is ε
Sensitivity: S = 1 for the size of each cluster, S = Dom for the sum of points in each cluster
Noise: Laplace(2T/ε) for the size of each cluster, Laplace(2T·Dom/ε) for the sum of points in each cluster
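The noisy iteration described above can be sketched as follows. This is an illustration under explicit assumptions, not the exact algorithm of [BDMN 05]: `dom` is taken to bound the L1 norm of every point (the caller must guarantee this), the initial centers are chosen without looking at the data, each iteration spends ε/(2T) on the cluster sizes and ε/(2T) on the cluster sums, and a cluster whose noisy size falls below 1 keeps its previous center. The helper names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def nearest(p, centers):
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))


def dp_kmeans(points, centers, epsilon, T, dom, rng):
    """Run T noisy k-means iterations with a total privacy budget of epsilon."""
    dim = len(centers[0])
    centers = [tuple(c) for c in centers]
    size_scale = 2.0 * T / epsilon        # Laplace(2T/eps): sensitivity 1, budget eps/(2T)
    sum_scale = 2.0 * T * dom / epsilon   # Laplace(2T*dom/eps): sensitivity dom, budget eps/(2T)

    for _ in range(T):
        sizes = [0] * len(centers)
        sums = [[0.0] * dim for _ in centers]
        for p in points:                  # step 1: assign points to the closest center
            i = nearest(p, centers)
            sizes[i] += 1
            for d in range(dim):
                sums[i][d] += p[d]

        new_centers = []
        for i, old in enumerate(centers):
            noisy_size = sizes[i] + laplace_noise(size_scale, rng)            # step 2
            noisy_sum = [s + laplace_noise(sum_scale, rng) for s in sums[i]]  # step 3
            if noisy_size < 1.0:
                new_centers.append(old)   # too little signal: keep the old center
            else:
                # Clamping to [-dom, dom] is post-processing and costs no budget.
                new_centers.append(tuple(min(max(s / noisy_size, -dom), dom)
                                         for s in noisy_sum))
        centers = new_centers
    return centers


points = [(1.0, 1.0)] * 200 + [(9.0, 9.0)] * 200
result = dp_kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)],
                   epsilon=1.0, T=5, dom=18.0, rng=random.Random(3))
print(result)
```

Because the noise scale grows linearly with T, fixing a small number of iterations is part of the design: more iterations refine the centers but leave less budget for each one.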
Results (T = 10 iterations, random initialization)
[Figure: clusters found by the original K-means algorithm and by the Laplace K-means algorithm]
- Laplace k-means can distinguish clusters that are far apart
- Laplace k-means can’t distinguish small clusters that are close by.
Frequent Sequence Mining (FSM)
Database D
ID   Record
100  a→c→d
200  b→c→d
300  a→b→c→e→d
400  d→b
500  a→d→c→d
Scan D → C1: cand 1-seqs
Sequence  Sup.
{a}       3
{b}       3
{c}       4
{d}       4
{e}       1
F1: freq 1-seqs
Sequence  Sup.
{a}       3
{b}       3
{c}       4
{d}       4
Scan D → C2: cand 2-seqs
Sequence  Sup.
{a→a}
{a→b}     1
{a→c}     3
{a→d}     3
{b→a}
{b→b}
{b→c}     2
{b→d}     2
{c→a}
{c→b}
{c→c}
{c→d}     4
{d→a}
{d→b}     1
{d→c}     1
{d→d}     1
F2: freq 2-seqs
Sequence  Sup.
{a→c}     3
{a→d}     3
{c→d}     4
Scan D → C3: cand 3-seqs
Sequence  Sup.
{a→c→d}   3
F3: freq 3-seqs
Sequence  Sup.
{a→c→d}   3
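The "Scan D" step, counting how many records contain a candidate as a subsequence, can be sketched as below. This is an illustration with hypothetical helper names; the records are the five shown in Database D, a pattern counts as supported when its items occur in order but not necessarily next to each other, and the threshold of 3 is an assumption inferred from which sequences the slide marks as frequent.

```python
from itertools import product


def is_subsequence(pattern, record):
    """True if the items of `pattern` occur in `record` in order (gaps allowed)."""
    remaining = iter(record)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of records that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, record) for record in database)


def frequent(candidates, database, threshold):
    """Keep the candidates whose support reaches the threshold."""
    result = {}
    for candidate in candidates:
        s = support(candidate, database)
        if s >= threshold:
            result[candidate] = s
    return result


database = [tuple("acd"), tuple("bcd"), tuple("abced"), tuple("db"), tuple("adcd")]

# All 2-sequences over the frequent items a, b, c, d.
c2 = list(product("abcd", repeat=2))
f2 = frequent(c2, database, threshold=3)
print(f2)

print(support(("a", "c", "d"), database))
```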
Baseline Differentially Private FSM
Database D
ID   Record
100  a→c→d
200  b→c→d
300  a→b→c→e→d
400  d→b
500  a→d→c→d
Scan D → C1: cand 1-seqs, noise drawn from Lap(|C1| / ε1)
Sequence  noise  Sup.
{a}       0.2    3
{b}       -0.4   3
{c}       0.4    4
{d}       -0.5   4
{e}       0.8    1
F1: freq 1-seqs
Sequence  Noisy Sup.
{a}       3.2
{c}       4.4
{d}       3.5
Scan D → C2: cand 2-seqs, noise drawn from Lap(|C2| / ε2)
Sequence  noise  Sup.
{a→a}     0.2
{a→c}     0.3    3
{a→d}     0.2    3
{c→a}     -0.5
{c→c}     0.8
{c→d}     0.2    4
{d→a}     0.3
{d→c}     2.1    1
{d→d}     -0.5
F2: freq 2-seqs
Sequence  Noisy Sup.
{a→c}     3.3
{a→d}     3.2
{c→d}     4.2
{d→c}     3.1
Scan D → C3: cand 3-seqs, noise drawn from Lap(|C3| / ε3) (noise shown: 0.3)
Sequence  Sup.
{a→c→d}   3
{a→d→c}   1
F3: freq 3-seqs
Sequence  Noisy Sup.
{a→c→d}   3
- S. Xu, S. Su, X. Cheng, Z. Li, L. Xiong, Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning. ICDE 2015
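One level of the baseline can be sketched as follows. This is a hedged illustration, not the algorithm from the cited paper: `noisy_frequent` is a hypothetical helper, and the scale |Ck| / εk is read from the slide, with the interpretation that a single record can change the support of every candidate in Ck by 1, so the whole vector of supports has sensitivity |Ck|.

```python
import random
from itertools import product


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def is_subsequence(pattern, record):
    remaining = iter(record)
    return all(item in remaining for item in pattern)


def noisy_frequent(candidates, database, threshold, epsilon_k, rng):
    """Add Lap(|C_k| / epsilon_k) to each support; keep candidates above threshold."""
    scale = len(candidates) / epsilon_k
    result = {}
    for candidate in candidates:
        true_support = sum(is_subsequence(candidate, record) for record in database)
        noisy_support = true_support + laplace_noise(scale, rng)
        if noisy_support >= threshold:
            result[candidate] = noisy_support
    return result


database = [tuple("acd"), tuple("bcd"), tuple("abced"), tuple("db"), tuple("adcd")]
c2 = list(product("acd", repeat=2))  # 9 candidates built from F1 = {a, c, d}

rng = random.Random(4)
f2_noisy = noisy_frequent(c2, database, threshold=3, epsilon_k=1.0, rng=rng)
print(f2_noisy)
```

With 9 candidates and εk = 1 the noise scale is 9, larger than any true support in this toy database, which is the weakness of the baseline: the noise grows with the size of the candidate set, so pruning candidates matters.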
Frequent pattern (subgraph) mining
- Represent each record as a graph
- Modeling the co-occurrence between diagnosis, procedures, medications
- Frequent subgraph mining with differential privacy
[Figure: input graphs over vertices v1, v2, v3, v4 (left) and frequent subgraphs at threshold = 3 (right), including a subgraph on v1 and v4 with support = 3]
- S. Xu, S. Su, L. Xiong, X. Cheng, K. Xiao, Differentially Private Frequent Subgraph Mining. ICDE 2016
Acknowledgements
- Research support
- Center for Comprehensive Informatics
- Woodrow Wilson Foundation
- Cisco research award
- Students
- James Gardner
- Yonghui Xiao
- Collaborators
- Andrew Post, CCI
- Fusheng Wang, CCI
- Tyrone Grandison, IBM
- Chun Yuan, Tsinghua
Emory Assured Information Management and Sharing (AIMS) Lab
- Collect, use, analyze, share data