1 Data Mining and Privacy The primary task in data mining: Develop - PDF document

Privacy Breaches in Privacy-Preserving Data Mining

Johannes Gehrke
Department of Computer Science, Cornell University
Joint work with Sasha Evfimievski (Cornell), Ramakrishnan Srikant (IBM), and Rakesh Agrawal (IBM)

Motivation: Information Spheres

Local information sphere
• Within each organization
• Continuously process distributed high-speed data streams
• Online evaluation of thousands of triggers
• Storage/archival and data provenance of all data are important
• One view: the "real-time" enterprise

Global information sphere
• Between organizations
• Share data in a privacy-preserving way

Global Information Sphere

Distributed privacy-preserving information integration and mining

Technical challenges:
• Collaboration of different distributed parties without revealing private data

Data Mining and Privacy
• The primary task in data mining: develop models about aggregated data.
• Can we develop accurate models without access to precise information in individual data records?

Randomization Overview
[Figure: Alice (J.S. Bach, painting, nasa.gov, …), Bob (B. Spears, baseball, cnn.com, …), and Chris (B. Marley, camping, linux.org, …) each send their transaction to a recommendation service.]

Randomization Overview (continued)
[Figure: the recommendation service mines associations from the collected transactions and returns recommendations. With randomization, each user sends a randomized transaction instead (Alice: Metallica, painting, nasa.gov, …; Bob: B. Spears, soccer, bbc.co.uk, …; Chris: B. Marley, camping, microsoft.com, …), and the service performs support recovery before mining associations.]

Associations Recap
• A transaction t is a set of items (e.g. books).
• All transactions form a set T of transactions.
• Any itemset A has support s in T if
  $s = \mathrm{supp}(A) = \frac{\#\{t \in T \mid A \subseteq t\}}{|T|}$
• Itemset A is frequent if $s \ge s_{\min}$.
• If A ⊆ B, then supp(A) ≥ supp(B).
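The support definition in the recap can be illustrated with a short sketch. The function name `supp` mirrors the slide's notation; the toy transactions are made up for illustration and do not come from the talk.

```python
def supp(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    a = set(itemset)
    return sum(a <= set(t) for t in transactions) / len(transactions)


# Hypothetical toy data, not from the talk.
T = [{"x", "y"}, {"x"}, {"y", "z"}, {"x", "y", "z"}]

supp_x = supp({"x"}, T)         # 3 of 4 transactions contain x
supp_xy = supp({"x", "y"}, T)   # 2 of 4 transactions contain both x and y

# Anti-monotonicity: a superset can never have larger support.
assert supp_x >= supp_xy
```

The final assertion is the property Apriori relies on when it prunes candidates.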

Associations Recap (continued)
• Example:
  • 20% of transactions contain X,
  • 5% of transactions contain X and Y;
  • then the confidence of "X ⇒ Y" is 5/20 = 0.25 = 25%.

The Problem
• How to randomize transactions so that
  • we can find frequent itemsets
  • while preserving privacy at the transaction level?

Talk Outline
• Problem Definition
• Uniform Randomization and Privacy Breaches
• Cut-and-Paste Randomization
• Experimental Evaluation
• Generalized Privacy Breaches

Uniform Randomization
• Given a transaction,
  • keep each item with 20% probability,
  • replace it with a new random item with 80% probability.

Example: {x, y, z}
10 M transactions of size 10 with 10 K items:
• 1% have {x, y, z}
• 5% have {x, y}, {x, z}, or {y, z} only
• 94% have one or zero items of {x, y, z}

Uniform randomization: how many have {x, y, z}?

Example: {x, y, z} (continued)
10 M transactions of size 10 with 10 K items. After uniform randomization, how many randomized transactions contain {x, y, z}?

Original transaction has          Share   Chance that t' has {x, y, z}     Share of all          Count              Share among hits
--------------------------------  ------  -------------------------------  --------------------  -----------------  ----------------
{x, y, z}                         1%      0.2^3                            0.008%                800 transactions   97.8%
{x, y}, {x, z}, or {y, z} only    5%      0.2^2 · 8/10,000                 0.00016%              16 transactions    1.9%
one or zero items of {x, y, z}    94%     at most 0.2 · (9/10,000)^2       less than 0.00002%    2 transactions     0.3%

Example: {x, y, z}
• Given nothing, we have only a 1% probability that {x, y, z} occurs in the original transaction.
• Given {x, y, z} in the randomized transaction, we have about 98% certainty of {x, y, z} in the original one.
• This is what we call a privacy breach.
• Uniform randomization preserves privacy "on average," but not "in the worst case."
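The arithmetic behind this example can be reproduced in a few lines. All constants are taken from the slide; the variable names are only illustrative. The third group uses the slide's upper bound, which gives roughly 1.5 expected transactions (the slide rounds this up to 2).

```python
# Constants from the slide.
N = 10_000_000   # number of transactions
ITEMS = 10_000   # number of distinct items
KEEP = 0.2       # probability that an item is kept

# group -> (share of original transactions,
#           chance that the randomized transaction contains {x, y, z})
groups = {
    "all three":   (0.01, KEEP ** 3),
    "exactly two": (0.05, KEEP ** 2 * 8 / ITEMS),
    "one or zero": (0.94, KEEP * (9 / ITEMS) ** 2),  # upper bound from the slide
}

# Expected number of randomized transactions containing {x, y, z}, per group.
expected = {name: N * share * p for name, (share, p) in groups.items()}

# Posterior: given {x, y, z} in t', which group did the original t come from?
total = sum(expected.values())
posterior = {name: count / total for name, count in expected.items()}

for name in groups:
    print(f"{name:12s} expected={expected[name]:8.2f} posterior={posterior[name]:.3f}")
```

The posterior for "all three" comes out at roughly 0.98, which is the "about 98% certainty" quoted on the slide, compared with a prior of 1%.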

Privacy Breaches
• Suppose:
  • t is an original transaction;
  • t' is the corresponding randomized transaction;
  • A is a (frequent) itemset.
• Definition: Itemset A causes a privacy breach of level ρ (e.g. 50%) if, for some item z ∈ A,
  $\Pr[\, z \in t \mid A \subseteq t' \,] \ge \rho$
• Assumption: no external information besides t'.

Our Solution
"Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?" "He grows a forest to hide it in." (G.K. Chesterton)
• Insert many false items into each transaction
• Hide true itemsets among false ones
• Can we still find frequent itemsets while having sufficient privacy?

Definition of cut-and-paste
• Given transaction t of size m, construct t':
  • Choose a number j between 0 and K_m (cutoff);
  • Include j items of t into t'.

Example: t = a, b, c, u, v, w, x, y, z; j = 4; t' = b, v, x, z

Definition of cut-and-paste (continued)
• Given transaction t of size m, construct t':
  • Each other item is included into t' with probability p_m.
• The choice of K_m and p_m is based on the desired level of privacy.

Example: t = a, b, c, u, v, w, x, y, z; j = 4; t' = b, v, x, z, d, e, g, h, l, m, n, p, s, …

Partial Supports
To recover the original support of an itemset, we need the randomized supports of its subsets.
• Given an itemset A of size k and transaction size m,
• a vector of partial supports of A is
  $\vec{s} = (s_0, s_1, \ldots, s_k)$, where $s_l = \frac{1}{|T|} \cdot \#\{t \in T \mid \#(t \cap A) = l\}$
• Here $s_k$ is the same as the support of A.
• Randomized partial supports are denoted by $\vec{s}\,'$.

Transition Matrix
• Let k = |A|, m = |t|.
• The transition matrix P = P(k, m) connects randomized partial supports with original ones:
  $E\,\vec{s}\,' = P \cdot \vec{s}$, where $P_{l',l} = \Pr[\, \#(t' \cap A) = l' \mid \#(t \cap A) = l \,]$
• Randomized supports are distributed as a sum of multinomial distributions.
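The cut-and-paste operator and the partial-support vector defined above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that j is drawn uniformly from 0..K_m and capped at the transaction size, and that "each other item" means every item of the universe that was not among the j chosen ones; the slides do not spell out either detail. The names `cut_and_paste`, `partial_supports`, and `universe` are hypothetical.

```python
import random
from collections import Counter


def cut_and_paste(t, universe, k_m, p_m, rng=random):
    """Randomize one transaction t into t' (both are sets of items)."""
    t = set(t)
    # Assumption: j is uniform on 0..k_m and cannot exceed the size of t.
    j = min(rng.randint(0, k_m), len(t))
    kept = set(rng.sample(sorted(t), j))
    # Assumption: every item not chosen above, whether in t or not,
    # enters t' independently with probability p_m.
    others = (item for item in universe if item not in kept)
    return kept | {item for item in others if rng.random() < p_m}


def partial_supports(transactions, itemset):
    """Return [s_0, ..., s_k]; s_l is the fraction of transactions
    that share exactly l items with `itemset`."""
    a = set(itemset)
    counts = Counter(len(a & set(t)) for t in transactions)
    n = len(transactions)
    return [counts[l] / n for l in range(len(a) + 1)]


# Hypothetical usage with made-up items.
rng = random.Random(0)
universe = [f"item{i}" for i in range(50)]
originals = [{"item1", "item2", "item3"}, {"item2", "item7"}, {"item9"}]
randomized = [cut_and_paste(t, universe, k_m=2, p_m=0.1, rng=rng) for t in originals]

s = partial_supports(originals, {"item1", "item2"})         # original partial supports
s_prime = partial_supports(randomized, {"item1", "item2"})  # randomized partial supports
```

Comparing `s` with `s_prime` over many transactions is what the transition matrix P describes in expectation.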

The Unbiased Estimators
• Given randomized partial supports, we can estimate original partial supports:
  $\vec{s}_{est} = Q \cdot \vec{s}\,'$, where $Q = P^{-1}$
• Covariance matrix for this estimator:
  $\mathrm{Cov}\,\vec{s}_{est} = \frac{1}{|T|} \sum_{l=0}^{k} s_l \cdot Q\, D[l]\, Q^T$, where $D[l]_{i,j} = P_{i,l} \cdot \delta_{i=j} - P_{i,l} \cdot P_{j,l}$
• To estimate it, substitute $s_l$ with $(\vec{s}_{est})_l$.
• Special case: estimators for the support and its variance.

Class of Randomizations
• Our analysis works for any randomization that satisfies two properties:
  • A per-transaction randomization applies the same procedure to each transaction, using no information about other transactions;
  • An item-invariant randomization does not depend on any ordering or naming of items.
• Both uniform and cut-and-paste randomizations satisfy these two properties.

Apriori
Let k = 1, candidate sets = all 1-itemsets.
Repeat:
  1. Count support for all candidate sets.
  2. Output the candidate sets with support ≥ s_min.
  3. New candidate sets = all (k + 1)-itemsets such that all their k-subsets are candidate sets with support ≥ s_min.
  4. Let k = k + 1.
Stop when there are no more candidate sets.
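The Apriori loop above can be sketched in Python as follows. This minimal version counts exact supports on the given transactions; in the privacy-preserving setting, step 1 would presumably use supports estimated from randomized data instead, which is not shown here. The function name `apriori` and the toy data are illustrative.

```python
from itertools import combinations


def apriori(transactions, s_min):
    """Return {itemset: support} for every itemset with support >= s_min."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # k = 1: candidate sets are all 1-itemsets.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Step 1: count support for all candidate sets.
        support = {c: sum(c <= t for t in transactions) / n for c in candidates}
        # Step 2: output the candidate sets with support >= s_min.
        level = {c: s for c, s in support.items() if s >= s_min}
        frequent.update(level)
        # Step 3: new candidates are the (k+1)-itemsets whose k-subsets
        # are all frequent at the current level.
        items = sorted(set().union(*level)) if level else []
        candidates = {
            frozenset(c)
            for c in combinations(items, k + 1)
            if all(frozenset(sub) in level for sub in combinations(c, k))
        }
        # Step 4: move on to the next itemset size.
        k += 1
    return frequent


# Hypothetical toy data, not from the talk.
T = [{"x", "y"}, {"x", "y", "z"}, {"x"}, {"y", "z"}]
result = apriori(T, s_min=0.5)
```

Here {x, z} appears in only one of four transactions, so it is not output and {x, y, z} is never generated as a candidate.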
