

SLIDE 1

Privacy Preserving Data Mining

Li Xiong Department of Mathematics and Computer Science Department of Biomedical Informatics Emory University

CS378 Introduction to Data Mining

SLIDE 2

Netflix Sequel

  • 2006, Netflix announced the challenge
  • 2007, researchers from the University of Texas identified individuals by matching Netflix datasets with IMDB
  • July 2009, $1M grand prize awarded
  • August 2009, Netflix announced the second challenge
  • December 2009, four Netflix users filed a class action lawsuit against Netflix
  • March 2010, Netflix canceled the second challenge

SLIDE 6

Facebook-Cambridge Analytica

  • April 2010, Facebook launched Open Graph
  • 2013, 300,000 users took the psychographic personality test app "thisisyourdigitallife"
  • 2016, Trump's campaign invested heavily in Facebook ads
  • March 2018, reports revealed that 50 million (later revised to 87 million) Facebook profiles were harvested for Cambridge Analytica and used for Trump's campaign
  • April 11, 2018, Zuckerberg testified before Congress
SLIDE 8
  • How many people know we are here?
    (a) no one
    (b) 1-10, e.g., family and friends
    (c) 10-100, e.g., colleagues and more (social network) friends

SLIDE 10

Who Knows What About Me? A Survey of Behind the Scenes Personal Data Sharing to Third Parties by Mobile Apps, 2015-10-30, https://techscience.org/a/2015103001/

  • 73% / 33% of Android apps shared personal info (e.g., email) / GPS coordinates with third parties
  • 45% / 47% of iOS apps shared email / GPS coordinates with third parties

[Figure: location data sharing by iOS apps (left) to domains (right)]

SLIDE 11

The EHR Data Map

SLIDE 12

Shopping records

SLIDE 13

Big Data Goes Personal

  • Movie ratings
  • Social network/media data
  • Mobile GPS data
  • Electronic medical records
  • Shopping history
  • Online browsing history
SLIDE 15

Data Mining

SLIDE 16

Data Mining … the dark side

SLIDE 17

Privacy Preserving Data Mining

[Diagram: Private Data → privacy preserving data mining → Sanitized Data/Models]

  • Privacy goal: personal data is not revealed and cannot be inferred
  • Utility goal: sanitized data/models are as close to the private data as possible

SLIDE 18

Privacy preserving data mining

  • Differential privacy
    • Definition
    • Building blocks (primitive mechanisms)
    • Composition rules
  • Data mining algorithms with differential privacy
    • k-means clustering w/ differential privacy
    • Frequent pattern mining w/ differential privacy
SLIDE 19

Differential Privacy

SLIDE 20

Traditional De-identification and Anonymization

[Diagram: Original Data → de-identification/anonymization → Sanitized View]

  • Attribute suppression, perturbation, generalization
  • Inference possible with external data
SLIDE 21

Massachusetts GIC Incident (1990s)

  • Massachusetts Group Insurance Commission (GIC) encounter data ("de-identified"), mid 1990s
  • External information: voter roll from the city of Cambridge
  • The Governor's health records were identified
  • 87% of Americans can be uniquely identified using zip code, birth date, and sex (2000)

Name     SSN        Birth date  Zip    Diagnosis
Alice    123456789  44          48202  AIDS
Bob      323232323  44          48202  AIDS
Charley  232345656  44          48201  Asthma
Dave     333333333  55          48310  Asthma
Eva      666666666  55          48310  Diabetes

SLIDE 22

AOL Query Log Release (2006)

20 million Web search queries released by AOL

  • User 4417749
    • "numb fingers"
    • "60 single men"
    • "dog that urinates on everything"
    • "landscapers in Lilburn, Ga"
    • several people's names with the last name Arnold
    • "homes sold in shadow lake subdivision gwinnett county georgia"

AnonID  Query                QueryTime            ItemRank  ClickURL
217     lottery              2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery              2006-03-27 14:10:38  1         http://www.calottery.com
1268    gall stones          2006-05-11 02:12:51
1268    gallstones           2006-05-11 02:13:02  1         http://www.niddk.nih.gov
1268    zark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com

SLIDE 23

The Genome Hacker (2013)

SLIDE 24

Differential Privacy

  • The statistical outcome (view) is indistinguishable regardless of whether a particular user is included in the data


SLIDE 26

Differential Privacy

[Diagram: Private Data D and neighboring Private Data D′ both feed a privacy-preserving data mining/sharing mechanism that outputs Models/Data]

  • The view is indistinguishable regardless of the input (D or D′)
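Formally (the standard definition from Dwork et al., TCC 2006, cited on a later slide; spelled out here since the slide only shows the picture): a randomized mechanism M is ε-differentially private if for all neighboring databases D, D′ differing in a single record, and for every set S of possible outputs,

    Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]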

SLIDE 28

Differential privacy: an example

[Figure: original records → original histogram → perturbed histogram with differential privacy]

SLIDE 29

Laplace Mechanism

[Plot: Laplace distribution Lap(S/ε)]

[Diagram: a query q is issued against the Private Data; the true answer q(D) is released as q(D) + η, where η is Laplace noise]

SLIDE 30

Laplace Distribution

  • PDF: f(x | u, b) = (1 / (2b)) · exp(−|x − u| / b)
  • Denoted as Lap(b) when u = 0
  • Mean u
  • Variance 2b²
SLIDE 31

How much noise for privacy?

Sensitivity: Consider a query q: I → R. S(q) is the smallest number such that for any neighboring tables D, D′, |q(D) − q(D′)| ≤ S(q).

Theorem: If the sensitivity of the query is S(q), then the algorithm A(D) = q(D) + Lap(S(q)/ε) guarantees ε-differential privacy (see the sketch below).

[Dwork et al., TCC 2006]
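A minimal Python sketch of the mechanism described by the theorem (the function name and numpy usage are mine, not from the slides):

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
        """Release true_answer + η with η ~ Lap(S(q)/ε); ε-DP by the theorem above."""
        if rng is None:
            rng = np.random.default_rng()
        scale = sensitivity / epsilon          # b = S(q) / ε
        return true_answer + rng.laplace(0.0, scale)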

SLIDE 32

Example: COUNT query

  • Number of people having HIV+
  • Sensitivity = ?
SLIDE 33

Example: COUNT query

  • Number of people having HIV+
  • Sensitivity = 1 (adding or removing one person changes the count by at most 1)
  • ε-differentially private count: 3 + η, where η is drawn from Lap(1/ε) (see the snippet below)
    • Mean = 0
    • Variance = 2/ε²
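Using the laplace_mechanism sketch from SLIDE 31 (the data are toy records; the true count of 3 matches the slide):

    # COUNT has sensitivity 1: one person changes the count by at most 1.
    statuses = ["HIV+", "HIV-", "HIV+", "HIV+", "HIV-"]   # illustrative records
    true_count = sum(s == "HIV+" for s in statuses)       # 3
    noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
    # η ~ Lap(1/ε): mean 0, variance 2/ε² (here 8 for ε = 0.5)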
SLIDE 34

Example: Sum (Average) query

  • Sum of Age (suppose Age is in [a,b])
  • Sensitivity = ?
SLIDE 35

Example: Sum (Average) query

  • Sum of Age (suppose Age is in [a, b])
  • Sensitivity = b (adding or removing one person changes the sum by at most b, assuming a ≥ 0; see the sketch below)
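For the average, one common approach (a sketch, not necessarily what the slide intends) is to split the budget between a noisy sum and a noisy count, using the sequential composition rule from the next slides; the age bound b = 120 is an assumption:

    ages, b, eps = [23, 35, 41, 58], 120, 1.0   # ages in [0, b], b assumed
    # SUM has sensitivity b: one person adds or removes at most b (since a ≥ 0).
    noisy_sum = laplace_mechanism(sum(ages), sensitivity=b, epsilon=eps / 2)
    noisy_cnt = laplace_mechanism(len(ages), sensitivity=1, epsilon=eps / 2)
    noisy_avg = noisy_sum / noisy_cnt           # total budget: eps/2 + eps/2 = eps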
SLIDE 36

Composition theorems

  • Sequential composition: (∑ᵢ εᵢ)-differential privacy
  • Parallel composition: max(εᵢ)-differential privacy

SLIDE 37

Sequential Composition

  • If M1, M2, ..., Mk are algorithms that access a private database D such that each Mi satisfies εi-differential privacy, then the combination of their outputs satisfies ε-differential privacy with ε = ε1 + ... + εk

SLIDE 38

Parallel Composition

  • If M1, M2, ..., Mk are algorithms that access disjoint databases D1, D2, ..., Dk such that each Mi satisfies εi-differential privacy, then the combination of their outputs satisfies ε-differential privacy with ε = max{ε1, ..., εk} (contrasted with sequential composition in the sketch below)
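A small numeric sketch of how the two rules differ, using a histogram (disjoint buckets) as the parallel case; the numbers are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    bucket_counts = np.array([12.0, 3.0, 7.0, 9.0, 4.0])  # toy histogram
    eps, k = 1.0, len(bucket_counts)

    # Sequential: k count queries over the SAME database share the budget,
    # so each runs with ε/k and needs Lap(k/ε) noise.
    sequential = bucket_counts + rng.laplace(0.0, k / eps, size=k)

    # Parallel: each bucket count touches a DISJOINT set of records, so each
    # can spend the full ε; the total cost is max(ε, ..., ε) = ε.
    parallel = bucket_counts + rng.laplace(0.0, 1.0 / eps, size=k)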

SLIDE 39

Postprocessing

  • If M1 is an ε-differentially private algorithm that accesses a private database D, then outputting M2(M1(D)) also satisfies ε-differential privacy

[Source: tutorial "Differential Privacy in the Wild", Module 2]


SLIDE 41

Privacy preserving data mining

  • Differential privacy
    • Definition
    • Building blocks (primitive mechanisms)
    • Composition rules
  • Data mining algorithms with differential privacy
    • k-means clustering w/ differential privacy
    • Frequent itemset mining w/ differential privacy
SLIDE 42

Privacy Preserving Data Mining as Constrained Optimization

  • Two goals
    • Privacy
    • Error (utility)
  • Given a task and a privacy budget ε, how to design a set of queries (functions) and allocate the budget such that the error is minimized?

SLIDE 43

Data mining algorithms with differential privacy

  • General algorithmic framework (instantiated for k-means below)
    • Decompose a data mining algorithm into a set of functions
    • Allocate privacy budget εi to each function
    • Implement each function with εi-differential privacy
      • Compute noisy output using the Laplace mechanism based on the sensitivity of the function and εi
    • Compose them using the composition theorems
  • Optimization techniques
    • Decomposition design
    • Budget allocation
    • Sensitivity reduction for each function
SLIDE 44

Review: K-means Clustering

SLIDE 45

K-means Problem

  • Partition a set of points x1, x2, ..., xn into k clusters S1, S2, ..., Sk such that the SSE is minimized:

SSE = ∑_{i=1}^{k} ∑_{x ∈ Sᵢ} ‖x − μᵢ‖², where μᵢ is the mean of the cluster Sᵢ

SLIDE 46

K-means Algorithm

  • Initialize a set of k centers
  • Repeat until convergence:
    • 1. Assign each point to its nearest center
    • 2. Update the set of centers
  • Output the final set of k centers and the points in each cluster (a reference sketch follows below)
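For reference, a compact (non-private) version in numpy; a fixed iteration count stands in for the convergence test, and the names are mine:

    import numpy as np

    def kmeans(X, k, iters=10, rng=None):
        """Plain Lloyd's algorithm: X is an (n, d) array; returns (k, d) centers."""
        if rng is None:
            rng = np.random.default_rng()
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(iters):
            # 1. assign each point to its nearest center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 2. update each center to the mean of its cluster
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        return centers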
SLIDE 47

Differentially Private K-means

  • Initialize a set of k centers
  • Repeat until convergence; in each iteration (given a set of centers):
    • 1. Assign the points to the closest center
    • 2. Compute the size of each cluster
    • 3. Compute the sum (centroid) of points in each cluster
  • Output the final centroid and size of each cluster

[BDMN 05]

SLIDE 48

Differentially Private K-means

  • Initialize a set of k centers
  • Suppose we fix the number of iterations to T
  • In each iteration (given a set of centers):
    • 1. Assign the points to the closest center
    • 2. Compute the noisy size of each cluster
    • 3. Compute the noisy sum (centroid) of points in each cluster
  • Output the final centroid and size of each cluster

[BDMN 05]

Each iteration uses ε/T of the budget, so the total privacy cost is ε by sequential composition. Within an iteration the budget is split between the two queries: the size has sensitivity S = 1 and gets Laplace(2T/ε) noise; the sum has sensitivity S = Dom (the domain bound) and gets Laplace(2T·Dom/ε) noise. (A sketch follows below.)

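Putting the annotations together, a sketch of the [BDMN 05] recipe, assuming every coordinate of the data lies in [0, dom] and following the slide's per-coordinate sensitivity of Dom for the sum; the names and the clipping step are mine:

    import numpy as np

    def dp_kmeans(X, k, epsilon, T=10, dom=1.0, rng=None):
        """ε-DP k-means sketch: T iterations at ε/T each, split between the
        noisy size (sensitivity 1, noise Lap(2T/ε)) and the noisy per-coordinate
        sum (sensitivity dom, noise Lap(2T·dom/ε))."""
        if rng is None:
            rng = np.random.default_rng()
        n, d = X.shape
        centers = rng.uniform(0.0, dom, size=(k, d))   # random initial centers
        for _ in range(T):
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for j in range(k):
                noisy_size = (labels == j).sum() + rng.laplace(0.0, 2 * T / epsilon)
                noisy_sum = (X[labels == j].sum(axis=0)
                             + rng.laplace(0.0, 2 * T * dom / epsilon, size=d))
                if noisy_size > 1:                     # guard against tiny/negative sizes
                    centers[j] = np.clip(noisy_sum / noisy_size, 0.0, dom)
        return centers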

SLIDE 52

Results (T = 10 iterations, random initialization)

[Figure: original k-means algorithm vs. Laplace k-means algorithm]

  • Laplace k-means can distinguish clusters that are far apart
  • Laplace k-means can't distinguish small clusters that are close by
SLIDE 53

Privacy preserving data mining

  • Differential privacy
    • Definition
    • Building blocks (primitive mechanisms)
    • Composition rules
  • Data mining algorithms with differential privacy
    • k-means clustering w/ differential privacy
    • Frequent itemset/sequence mining w/ differential privacy

SLIDE 54

Frequent Sequence Mining (FSM)

Database D:

ID   Record
100  a→c→d
200  b→c→d
300  a→b→c→e→d
400  d→b
500  a→d→c→d

Scan D → C1 (candidate 1-seqs): {a}: 3, {b}: 3, {c}: 4, {d}: 4, {e}: 1
→ F1 (frequent 1-seqs, threshold 3): {a}: 3, {b}: 3, {c}: 4, {d}: 4

Scan D → C2 (candidate 2-seqs, all pairs over F1): {a→b}: 1, {a→c}: 3, {a→d}: 3, {b→c}: 2, {b→d}: 2, {c→d}: 4, {d→b}: 1, {d→c}: 1, {d→d}: 1, all other pairs 0
→ F2 (frequent 2-seqs): {a→c}: 3, {a→d}: 3, {c→d}: 4

Scan D → C3 (candidate 3-seqs): {a→c→d}
→ F3 (frequent 3-seqs): {a→c→d}: 3

SLIDE 55

Baseline Differentially Private FSM

Same database D as before. At each level i, perturb every candidate's support with Laplace noise Lap(|Ci| / εi), then keep the candidates whose noisy support clears the threshold (see the sketch below).

Scan D → C1 (candidate 1-seqs) with noise: {a}: 3 + 0.2 = 3.2, {b}: 3 − 0.4 = 2.6, {c}: 4 + 0.4 = 4.4, {d}: 4 − 0.5 = 3.5, {e}: 1 + 0.8 = 1.8
→ F1 (noisy frequent 1-seqs): {a}: 3.2, {c}: 4.4, {d}: 3.5 (the noise drops {b}, a false negative)

Scan D → C2 (candidate 2-seqs over F1): {a→a}, {a→c}, {a→d}, {c→a}, {c→c}, {c→d}, {d→a}, {d→c}, {d→d}, each perturbed with Lap(|C2| / ε2)
→ F2 (noisy frequent 2-seqs): {a→c}: 3.3, {a→d}: 3.2, {c→d}: 4.2, {d→c}: 3.1 ({d→c} has true support 1; noise of 2.1 makes it a false positive)

Scan D → C3 (candidate 3-seqs): {a→c→d}, {a→d→c}, each perturbed with Lap(|C3| / ε3)
→ F3 (noisy frequent 3-seqs): {a→c→d}: 3

S. Xu, S. Su, X. Cheng, Z. Li, L. Xiong. Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning. ICDE 2015.
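One level of this baseline in Python; is_subseq treats records as sequences with gaps allowed, and all names are mine rather than the paper's:

    import numpy as np

    def is_subseq(pattern, record):
        """True if pattern occurs in record as a (possibly non-contiguous) subsequence."""
        it = iter(record)
        return all(item in it for item in pattern)

    def noisy_frequent(candidates, database, threshold, eps_level, rng=None):
        """Baseline DP level: perturb each support with Lap(|C|/ε), then threshold.
        One record shifts each of the |C| supports by at most 1, so the vector
        of candidate supports has sensitivity |C| — noise grows with |C|."""
        if rng is None:
            rng = np.random.default_rng()
        scale = len(candidates) / eps_level
        result = {}
        for cand in candidates:
            support = sum(is_subseq(cand, rec) for rec in database)
            noisy = support + rng.laplace(0.0, scale)
            if noisy >= threshold:
                result[cand] = noisy
        return result

    # e.g. noisy_frequent([("a","c"), ("c","d")], [("a","c","d"), ("b","c","d")], 3, 0.3)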

SLIDE 56

Frequent pattern (subgraph) mining

  • Represent each record as a graph
  • Model the co-occurrence between diagnoses, procedures, and medications
  • Frequent subgraph mining with differential privacy

[Figure: three input graphs over vertices v1-v4; with threshold = 3, the subgraph on (v1, v4) is frequent with support 3]

S. Xu, S. Su, L. Xiong, X. Cheng, K. Xiao. Differentially Private Frequent Subgraph Mining. ICDE 2016.

SLIDE 57

Acknowledgements

  • Research support
    • Center for Comprehensive Informatics
    • Woodrow Wilson Foundation
    • Cisco research award
  • Students
    • James Gardner
    • Yonghui Xiao
  • Collaborators
    • Andrew Post, CCI
    • Fusheng Wang, CCI
    • Tyrone Grandison, IBM
    • Chun Yuan, Tsinghua
SLIDE 58

Emory Assured Information Management and Sharing (AIMS) Lab

  • Collect, use, analyze, and share data without compromising privacy