Privacy preserving data mining randomized response and association - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Privacy preserving data mining – randomized response and association rule hiding

Li Xiong

CS573 Data Privacy and Anonymity

Partial slides credit: W. Du, Syracuse University, Y. Gao, Peking University

slide-2
SLIDE 2

Privacy Preserving Data Mining Techniques

• Protecting sensitive raw data
  • Randomization (additive noise)
  • Geometric perturbation and projection (multiplicative noise)
  • Randomized response technique
    • Categorical data perturbation in data collection model
• Protecting sensitive knowledge (knowledge hiding)

slide-3
SLIDE 3

Data Collection Model

Data cannot be shared directly because of privacy concerns

slide-4
SLIDE 4

Background: Randomized Response

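Do you smoke? (The true answer is “Yes”)

Biased coin: $P(\text{Head}) = \theta$, with $\theta \neq 0.5$

• Head: report the true answer (“Yes”)
• Tail: report the opposite answer (“No”)

$P'(\text{Yes}) = P(\text{Yes}) \cdot \theta + P(\text{No}) \cdot (1 - \theta)$
$P'(\text{No}) = P(\text{Yes}) \cdot (1 - \theta) + P(\text{No}) \cdot \theta$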
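The relation P'(Yes) = P(Yes)·θ + P(No)·(1 − θ), with P(No) = 1 − P(Yes), can be solved for P(Yes), which is how the collector recovers the aggregate from the randomized answers. The following is a minimal sketch under that model; the function names, the coin bias 0.7, and the true rate 0.3 are illustrative assumptions, not values from the slides.

```python
import random

def randomize(true_answer: bool, theta: float, rng: random.Random) -> bool:
    """Report the true answer with probability theta, the opposite otherwise."""
    return true_answer if rng.random() < theta else not true_answer

def estimate_true_yes(observed_yes: float, theta: float) -> float:
    """Invert P'(Yes) = P(Yes)*theta + (1 - P(Yes))*(1 - theta) for P(Yes)."""
    if abs(2 * theta - 1) < 1e-12:
        raise ValueError("theta = 0.5 makes the responses uninformative")
    return (observed_yes - (1 - theta)) / (2 * theta - 1)

rng = random.Random(0)
theta = 0.7        # assumed coin bias
true_rate = 0.3    # assumed true fraction of "Yes" respondents
answers = [rng.random() < true_rate for _ in range(100_000)]
reports = [randomize(a, theta, rng) for a in answers]
observed = sum(reports) / len(reports)

# The estimate should land close to the true fraction of "Yes" answers.
print(round(estimate_true_yes(observed, theta), 3))
```

The division by 2θ − 1 shows why θ = 0.5 is excluded: a fair coin makes the reported answers independent of the true ones.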

slide-5
SLIDE 5

Decision Tree Mining using Randomized Response

• Multiple attributes encoded in bits

Biased coin: $P(\text{Head}) = \theta$, with $\theta \neq 0.5$

• Head: report the true answer E: 110
• Tail: report the false answer !E: 001

• Column distribution can be estimated for learning a decision tree!

Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003

slide-6
SLIDE 6

Accuracy of Decision tree built on randomized response

slide-7
SLIDE 7

Generalization for Multi-Valued Categorical Data

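True value $s_i$ is reported as $s_i$, $s_{i+1}$, $s_{i+2}$, $s_{i+3}$ (indices wrap around) with probabilities $q_1$, $q_2$, $q_3$, $q_4$ respectively.

$$
\begin{pmatrix} P'(s_1) \\ P'(s_2) \\ P'(s_3) \\ P'(s_4) \end{pmatrix}
=
\underbrace{\begin{pmatrix}
q_1 & q_4 & q_3 & q_2 \\
q_2 & q_1 & q_4 & q_3 \\
q_3 & q_2 & q_1 & q_4 \\
q_4 & q_3 & q_2 & q_1
\end{pmatrix}}_{M}
\begin{pmatrix} P(s_1) \\ P(s_2) \\ P(s_3) \\ P(s_4) \end{pmatrix}
$$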
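Since the observed distribution is P' = M·P, the original distribution can be estimated by solving that linear system whenever M is invertible. The sketch below builds the circulant matrix from q1..q4 and solves the system with exact fractions. The helper names and the numeric values of q and P are assumptions chosen for illustration.

```python
from fractions import Fraction

def circulant_rr_matrix(q):
    """Column j is the report distribution for true value s_j:
    M[i][j] = q[(i - j) mod n]."""
    n = len(q)
    return [[q[(i - j) % n] for j in range(n)] for i in range(n)]

def apply_matrix(M, p):
    """Compute the observed distribution P' = M P."""
    return [sum(M[i][j] * p[j] for j in range(len(p))) for i in range(len(M))]

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(M)
    A = [list(row) + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            raise ValueError("RR matrix is singular; P cannot be recovered")
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

# Assumed example values: keep the true value with probability 7/10.
q = [Fraction(7, 10), Fraction(1, 10), Fraction(1, 10), Fraction(1, 10)]
M = circulant_rr_matrix(q)
p_true = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
p_observed = apply_matrix(M, p_true)
p_estimated = solve(M, p_observed)
print(p_estimated == p_true)
```

In practice P' is itself estimated from a finite sample, so the recovered P is only approximate; the exact arithmetic here just isolates the inversion step.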

slide-8
SLIDE 8

A Generalization

• RR Matrices [Warner 65], [R. Agrawal 05], [S. Agrawal 05]
• RR Matrix can be arbitrary
• Can we find optimal RR matrices?

$$
M = \begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{pmatrix}
$$

OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008

slide-9
SLIDE 9

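What is an optimal matrix?

• Which of the following is better?

$$
M_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\qquad
M_2 = \begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}
$$

• Privacy: M2 is better
• Utility: M1 is better
• So, what is an optimal matrix?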

slide-10
SLIDE 10

Optimal RR Matrix

• An RR matrix M is optimal if no other RR matrix’s privacy and utility are both better than M’s (i.e., no other matrix dominates M).
• Privacy Quantification
• Utility Quantification
• A number of privacy and utility metrics have been proposed.
  • Privacy: how accurately one can estimate individual info.
  • Utility: how accurately we can estimate aggregate info.

slide-11
SLIDE 11

Metrics

• Privacy: accuracy of the estimate of individual values
• Utility: difference between the original probability and the estimated probability

slide-12
SLIDE 12

Optimization Methods

• Approach 1: Weighted sum: w1 · Privacy + w2 · Utility
• Approach 2
  • Fix Privacy, find M with the optimal Utility.
  • Fix Utility, find M with the optimal Privacy.
  • Challenge: Difficult to generate M with a fixed privacy or utility.
• Proposed Approach: Multi-Objective Optimization

slide-13
SLIDE 13

Optimization algorithm

• Evolutionary Multi-Objective Optimization (EMOO)
• The algorithm
  • Start with a set of initial RR matrices
  • Repeat the following steps in each iteration
    • Mating: select two RR matrices in the pool
    • Crossover: exchange several columns between the two RR matrices
    • Mutation: change some values in an RR matrix
    • Meet the privacy bound: filter the resultant matrices
    • Evaluate the fitness value for the new RR matrices

Note: the fitness value is defined in terms of privacy and utility metrics.
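The crossover and mutation steps can be sketched as follows. This is an illustration only, not the algorithm from the cited work: it assumes each column of an RR matrix is a probability distribution (the report distribution for one true value), which is why whole columns can be exchanged without breaking validity. The privacy-bound filter and fitness evaluation are left out because the slides do not fix concrete metrics; all function names and the mutation scale are hypothetical.

```python
import random

def random_rr_matrix(n, rng):
    """Random column-stochastic matrix, stored as a list of columns."""
    cols = []
    for _ in range(n):
        raw = [rng.random() for _ in range(n)]
        total = sum(raw)
        cols.append([v / total for v in raw])
    return cols

def crossover(a, b, rng):
    """Exchange a random subset of whole columns between two matrices."""
    child_a, child_b = [c[:] for c in a], [c[:] for c in b]
    for j in range(len(a)):
        if rng.random() < 0.5:
            child_a[j], child_b[j] = child_b[j], child_a[j]
    return child_a, child_b

def mutate(m, rng, scale=0.1):
    """Perturb one column, clip at zero, and renormalize it to sum to 1."""
    child = [c[:] for c in m]
    j = rng.randrange(len(child))
    raw = [max(v + rng.uniform(-scale, scale), 0.0) for v in child[j]]
    total = sum(raw)
    if total > 0:
        child[j] = [v / total for v in raw]
    return child

rng = random.Random(42)
pool = [random_rr_matrix(3, rng) for _ in range(4)]
parent_a, parent_b = rng.sample(pool, 2)               # mating
child_a, child_b = crossover(parent_a, parent_b, rng)  # crossover
child_a = mutate(child_a, rng)                         # mutation
# Each column of the child still sums to 1 (up to rounding).
print([round(sum(col), 6) for col in child_a])
```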

slide-14
SLIDE 14

Illustration

slide-15
SLIDE 15

Output of Optimization

[Figure: RR matrices M1 to M8 plotted in the objective space, with axes Privacy and Utility, each running from Worse to Better]

The optimal set is often plotted in the objective space as a Pareto front.
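Extracting the Pareto front from a set of scored matrices is a small computation. The sketch below uses the standard dominance test (at least as good on both objectives, strictly better on one); the scores are made-up numbers for illustration, with higher meaning better on both axes.

```python
def dominates(x, y):
    """x dominates y: no worse on both objectives and strictly better on one.
    Each point is (privacy, utility); higher is better for both."""
    return x[0] >= y[0] and x[1] >= y[1] and (x[0] > y[0] or x[1] > y[1])

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return {name: s for name, s in candidates.items()
            if not any(dominates(o, s) for o in candidates.values())}

# Hypothetical (privacy, utility) scores.
scores = {"M1": (0.0, 1.0), "M2": (1.0, 0.0), "M3": (0.5, 0.6), "M4": (0.4, 0.5)}
print(sorted(pareto_front(scores)))  # ['M1', 'M2', 'M3']
```

M4 is dropped because M3 beats it on both objectives, while the two extremes M1 and M2 both stay on the front.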

slide-16
SLIDE 16

For the first attribute of the Adult data

slide-17
SLIDE 17

Privacy Preserving Data Mining Techniques

• Protecting sensitive raw data
  • Randomization (additive noise)
  • Geometric perturbation and projection (multiplicative noise)
  • Randomized response technique
• Protecting sensitive knowledge (knowledge hiding)
  • Frequent itemset and association rule hiding
  • Downgrading classifier effectiveness

slide-18
SLIDE 18

Frequent Itemset Mining and Association Rule Mining

• Frequent itemset mining: finding frequent sets of items in a transaction data set
• Association rules: associations between items

slide-19
SLIDE 19

Frequent Itemset Mining and Association Rule Mining

• First proposed by Agrawal, Imielinski, and Swami in SIGMOD 1993
  • SIGMOD Test of Time Award 2003
    “This paper started a field of research. In addition to containing an innovative algorithm, its subject matter brought data mining to the attention of the database community … even led several years ago to an IBM commercial, featuring supermodels, that touted the importance of work such as that contained in this paper.”
• Apriori algorithm in VLDB 1994
  • #4 in the top 10 data mining algorithms in ICDM 2006

R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD ’93.
Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB ’94.

slide-20
SLIDE 20

February 19, 2009 20

Basic Concepts: Frequent Patterns and Association Rules

• Itemset: X = {x1, …, xk} (k-itemset)
• Frequent itemset: X whose support count is at least the minimum support count
  • Support count (absolute support): count of transactions containing X
• Association rule: A → B satisfying minimum support and minimum confidence
  • Support: probability that a transaction contains A ∪ B (every item of A and of B): s = P(A ∪ B)
  • Confidence: conditional probability that a transaction having A also contains B: c = P(B | A)
• Association rule mining process
  • Find all frequent patterns (more costly)
  • Generate strong association rules

[Figure: Venn diagram of “Customer buys diaper” and “Customer buys beer”, overlapping in “Customer buys both”]

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

slide-21
SLIDE 21


Illustration of Frequent Itemsets and Association Rules

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

Frequent itemsets (minimum support count = 3)?
{A:3, B:3, D:4, E:3, AD:3}

Association rules (minimum support = 50%, minimum confidence = 50%)?
A → D (60%, 100%)
D → A (60%, 75%)
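The answers on this slide can be checked mechanically. The sketch below enumerates every candidate itemset by brute force, which is fine for five transactions but is not the Apriori algorithm (it does no pruning). The function names are illustrative.

```python
from itertools import combinations

transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support_count(itemset, db):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in db.values() if itemset <= items)

def frequent_itemsets(db, min_count):
    """Brute-force enumeration of all itemsets meeting the support count."""
    all_items = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(all_items) + 1):
        for combo in combinations(all_items, k):
            count = support_count(frozenset(combo), db)
            if count >= min_count:
                result[frozenset(combo)] = count
    return result

def rules(freq, db, min_conf):
    """Split each frequent itemset into lhs -> rhs and keep confident rules."""
    n = len(db)
    out = []
    for itemset, count in freq.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                conf = count / support_count(lhs, db)
                if conf >= min_conf:
                    out.append(("".join(sorted(lhs)),
                                "".join(sorted(itemset - lhs)),
                                count / n, conf))
    return out

# Minimum support 50% of 5 transactions means a support count of at least 3.
freq = frequent_itemsets(transactions, min_count=3)
print({"".join(sorted(k)): v for k, v in freq.items()})
# {'A': 3, 'B': 3, 'D': 4, 'E': 3, 'AD': 3}
for lhs, rhs, sup, conf in rules(freq, transactions, min_conf=0.5):
    print(f"{lhs} -> {rhs} (support {sup:.0%}, confidence {conf:.0%})")
# A -> D (support 60%, confidence 100%)
# D -> A (support 60%, confidence 75%)
```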

slide-22
SLIDE 22

SIGMOD Ph.D. Workshop IDAR’07


Association Rule Hiding: what? why?

• Problem: hide sensitive association rules in data without losing non-sensitive rules
• Motivations: confidential rules may have serious adverse effects

slide-23
SLIDE 23


Problem statement

• Given
  • a database D to be released
  • a minimum support threshold “MST” and a minimum confidence threshold “MCT”
  • a set of association rules R mined from D
  • a set of sensitive rules Rh ⊂ R to be hidden
• Find a new database D’ such that
  • the rules in Rh cannot be mined from D’
  • as many of the rules in R − Rh as possible can still be mined

slide-24
SLIDE 24


Solutions

• Data modification approaches
  • Basic idea: data sanitization D -> D’
  • Approaches: distortion, blocking
  • Drawbacks
    • Cannot control hiding effects intuitively, lots of I/O
• Data reconstruction approaches
  • Basic idea: knowledge sanitization D -> K -> D’
  • Potential advantages
    • Can easily control the availability of rules and control the hiding effects directly and intuitively

slide-25
SLIDE 25

Distortion-based Techniques

Sample Database
[Table: binary transaction matrix over items A, B, C, D; the cell layout was lost in extraction]

Rule A→C has:
  Support(A→C) = 80%
  Confidence(A→C) = 100%

Distortion Algorithm

Distorted Database
[Table: the same matrix with some 1’s removed; the cell layout was lost in extraction]

Rule A→C has now:
  Support(A→C) = 40%
  Confidence(A→C) = 50%
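The effect of distortion on a rule can be reproduced on a small table. The database below is hypothetical: the slide's own table did not survive extraction, so these five rows were chosen only so that A→C starts at support 80% and confidence 100% and ends at 40% and 50% after two 1's are removed.

```python
def support_and_confidence(db, lhs, rhs):
    """Return (support, confidence) of the rule lhs -> rhs over a 0/1 table."""
    both = sum(1 for row in db if row[lhs] == 1 and row[rhs] == 1)
    lhs_count = sum(1 for row in db if row[lhs] == 1)
    return both / len(db), (both / lhs_count if lhs_count else 0.0)

# Hypothetical 5-transaction database over items A, B, C, D.
sample = [
    {"A": 1, "B": 1, "C": 1, "D": 0},
    {"A": 1, "B": 0, "C": 1, "D": 1},
    {"A": 1, "B": 0, "C": 1, "D": 0},
    {"A": 1, "B": 1, "C": 1, "D": 0},
    {"A": 0, "B": 0, "C": 0, "D": 1},
]
print(support_and_confidence(sample, "A", "C"))     # (0.8, 1.0)

# Distortion: turn C from 1 to 0 in two transactions that support A -> C.
distorted = [dict(row) for row in sample]
distorted[0]["C"] = 0
distorted[2]["C"] = 0
print(support_and_confidence(distorted, "A", "C"))  # (0.4, 0.5)
```

Removing C rather than A lowers the numerator of the confidence while leaving the count of A unchanged, so both support and confidence drop.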

slide-26
SLIDE 26

Side Effects

Before Hiding Process                  | After Hiding Process             | Side Effect
Rule Ri has had conf(Ri) > MCT         | Rule Ri has now conf(Ri) < MCT   | Rule Eliminated (Undesirable Side Effect)
Rule Ri has had conf(Ri) < MCT         | Rule Ri has now conf(Ri) > MCT   | Ghost Rule (Undesirable Side Effect)
Large Itemset I has had sup(I) > MST   | Itemset I has now sup(I) < MST   | Itemset Eliminated (Undesirable Side Effect)

slide-27
SLIDE 27

Distortion-based Techniques

Challenges/Goals:

To minimize the undesirable Side Effects that the hiding process causes to non-sensitive rules.

To minimize the number of 1’s that must be deleted from the database.

Algorithms must be linear in time as the database increases in size.

slide-28
SLIDE 28

Sensitive itemsets: ABC

slide-29
SLIDE 29

Data distortion [Atallah 99]

• Hardness result:
  • The distortion problem is NP-hard
• Heuristic search
  • Find items to remove and transactions to remove the items from

Disclosure Limitation of Sensitive Rules, M. Atallah, A. Elmagarmid, M. Ibrahim, E. Bertino, V. Verykios, 1999

slide-30
SLIDE 30
slide-31
SLIDE 31

Heuristic Approach

• A greedy bottom-up search through the ancestors (subsets) of the sensitive itemset for the parent with maximum support (why?)
  • At the end of the search, a 1-itemset is selected
• Search through the common transactions containing the item and the sensitive itemset for the transaction that affects the minimum number of 2-itemsets
• Delete the selected item from the identified transaction
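The three steps can be sketched as below. This is a loose reading of the slide, not the algorithm as published by Atallah et al. It makes three assumptions: "affects" is taken to mean the number of frequent 2-itemsets that contain the removed item and occur in the transaction, ties are broken by list order, and the loop repeats until the sensitive itemset falls below the threshold. The six-transaction database and all function names are hypothetical.

```python
def support(itemset, db):
    """Number of transactions containing the whole itemset."""
    return sum(1 for t in db if itemset <= t)

def pick_item(sensitive, db):
    """Walk from the sensitive itemset down to a single item, each time
    moving to the (k-1)-subset with maximum support."""
    current = frozenset(sensitive)
    while len(current) > 1:
        parents = [current - {i} for i in sorted(current)]
        current = max(parents, key=lambda p: support(p, db))
    return next(iter(current))

def pick_transaction(item, sensitive, db, min_count):
    """Among transactions supporting the sensitive itemset, choose the one
    where removing `item` touches the fewest frequent 2-itemsets."""
    def affected(t):
        return sum(1 for other in t - {item}
                   if support(frozenset({item, other}), db) >= min_count)
    candidates = [i for i, t in enumerate(db) if sensitive <= t]
    return min(candidates, key=lambda i: affected(db[i]))

def hide(sensitive, db, min_count):
    """Delete items one at a time until the sensitive itemset is infrequent."""
    db = [set(t) for t in db]
    sensitive = frozenset(sensitive)
    while support(sensitive, db) >= min_count:
        item = pick_item(sensitive, db)
        idx = pick_transaction(item, sensitive, db, min_count)
        db[idx].discard(item)
    return db

# Hypothetical database; the sensitive itemset ABC starts with support 3.
D = [set("ABCE"), set("ABC"), set("ABCD"), set("ABD"), set("AD"), set("ACD")]
sanitized = hide("ABC", D, min_count=3)
print(support(frozenset("ABC"), sanitized) < 3)
```

Choosing the highest-support item is one plausible answer to the slide's "why?": an item that appears in many transactions can lose one occurrence with less risk of pushing other itemsets below the threshold.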

slide-32
SLIDE 32
slide-33
SLIDE 33

Results comparison

slide-34
SLIDE 34

Blocking-based Techniques

Initial Database
[Table: binary transaction matrix over items A, B, C, D; the cell layout was lost in extraction]

Blocking Algorithm

New Database
[Table: the same matrix with two entries replaced by “?”; the cell layout was lost in extraction]

Support and confidence become marginal.
In New Database: 60% ≤ conf(A → C) ≤ 100%
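With blocked cells, a rule's confidence is only known as a range: the minimum and maximum over every way the "?" entries could be filled in. The sketch below computes that range by brute force on a hypothetical table (the slide's table did not survive extraction), chosen so that the range for A → C comes out as 60% to 100%.

```python
from itertools import product

def confidence_range(db, lhs, rhs):
    """Min and max of conf(lhs -> rhs) over all 0/1 fillings of '?' cells."""
    unknown = [(i, k) for i, row in enumerate(db)
               for k, v in row.items() if v == "?"]
    values = []
    for fill in product((0, 1), repeat=len(unknown)):
        rows = [dict(r) for r in db]
        for (i, k), v in zip(unknown, fill):
            rows[i][k] = v
        lhs_count = sum(1 for r in rows if r[lhs] == 1)
        both = sum(1 for r in rows if r[lhs] == 1 and r[rhs] == 1)
        if lhs_count:  # confidence is undefined when lhs never occurs
            values.append(both / lhs_count)
    return min(values), max(values)

# Hypothetical blocked database restricted to items A and C.
blocked = [
    {"A": 1, "C": 1},
    {"A": 1, "C": 1},
    {"A": 1, "C": 1},
    {"A": 1, "C": "?"},
    {"A": 1, "C": "?"},
    {"A": 0, "C": 1},
]
print(confidence_range(blocked, "A", "C"))  # (0.6, 1.0)
```

The enumeration is exponential in the number of blocked cells, so it only serves to illustrate the idea on small tables.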

slide-35
SLIDE 35


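Data reconstruction approach

[Figure: pipeline D → FS → FS’ → FP-tree → D’]

1. Frequent Set Mining: D → FS (corresponding to rules R)
2. Perform sanitization Algorithm: FS → FS’ (corresponding to rules R − Rh)
3. FP-tree-based Inverse Frequent Set Mining: FS’ → FP-tree → D’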

slide-36
SLIDE 36

2007-7-10 SIGMOD Ph.D. Workshop IDAR’07


The first two phases

• 1. Frequent set mining
  • Generate all frequent itemsets, with their supports and support counts (FS), from the original database D
• 2. Perform sanitization algorithm
  • Input: FS output in phase 1, R, Rh
  • Output: sanitized frequent itemsets FS’
  • Process
    • Select hiding strategy
    • Identify sensitive frequent sets
    • Perform sanitization

In the best case, the sanitization algorithm ensures that exactly the non-sensitive rule set R − Rh can be derived from FS’.

[Figure: FS → FS’, R → R − Rh]

slide-37
SLIDE 37


Example: the first two phases

Original Database: D

TID   Items
T1    ABCE
T2    ABC
T3    ABCD
T4    ABD
T5    AD
T6    ACD

σ = 4, MST = 66%, MCT = 75%

1. Frequent set mining

Frequent Itemsets: FS

Itemset   Support count   Support
A         6               100%
B         4               66%
C         4               66%
D         4               66%
AB        4               66%
AC        4               66%
AD        4               66%

Association Rules: R

Rule     Confidence   Support
B → A    100%         66%
C → A    100%         66%
D → A    100%         66%

2. Perform sanitization algorithm

Frequent Itemsets: FS’

Itemset   Support count   Support
A         6               100%
C         4               66%
D         4               66%
AC        4               66%
AD        4               66%

Association Rules: R − Rh

Rule     Confidence   Support
C → A    100%         66%
D → A    100%         66%
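The step from FS to FS’ in this example can be reproduced with one very simple hiding strategy: drop every frequent itemset that contains a chosen item of the sensitive rule. That strategy is an assumption made for illustration; the slides do not say which strategy the example uses, only that FS’ lacks B and AB. The sketch also derives rules from itemset supports alone and handles only 2-itemsets, which is all the example contains.

```python
# Frequent itemsets and support counts from the example (6 transactions).
FS = {"A": 6, "B": 4, "C": 4, "D": 4, "AB": 4, "AC": 4, "AD": 4}
N, MCT = 6, 0.75

def rules_from(fs, min_conf):
    """Derive rules x -> y from 2-itemsets using only the stored supports.
    Returns {rule: (confidence, support)}."""
    out = {}
    for itemset, count in fs.items():
        if len(itemset) != 2:
            continue  # sketch limitation: only 2-itemsets are split
        for lhs in itemset:
            rhs = itemset.replace(lhs, "")
            if lhs in fs and count / fs[lhs] >= min_conf:
                out[f"{lhs} -> {rhs}"] = (count / fs[lhs], count / N)
    return out

def sanitize(fs, victim_item):
    """Assumed strategy: remove every itemset containing the victim item."""
    return {k: v for k, v in fs.items() if victim_item not in k}

R = rules_from(FS, MCT)
FS_prime = sanitize(FS, "B")  # hide the sensitive rule B -> A
print(sorted(R))                          # ['B -> A', 'C -> A', 'D -> A']
print(sorted(FS_prime))                   # ['A', 'AC', 'AD', 'C', 'D']
print(sorted(rules_from(FS_prime, MCT)))  # ['C -> A', 'D -> A']
```

Removing B itself, not just AB, keeps FS’ consistent: since A occurs in every transaction, any database in which B is frequent would make AB frequent again.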

slide-38
SLIDE 38

Open research questions

• Optimal solution
• Itemsets sanitization
  • The support and confidence of the rules in R − Rh should remain unchanged as much as possible
• Integrating data protection and knowledge (rule) protection

slide-39
SLIDE 39

Coming up

• Cryptographic protocols for privacy preserving distributed data mining