Privacy preserving data mining – randomized response and association rule hiding
Li Xiong
CS573 Data Privacy and Anonymity
Partial slides credit: W. Du, Syracuse University, Y. Gao, Peking University
Roadmap:
- Randomization (additive noise)
- Geometric perturbation and projection (multiplicative noise)
- Randomized response technique

Randomized response: categorical data perturbation in the data collection model
Data cannot be shared directly because of privacy concerns
Do you smoke? The respondent flips a biased coin, unseen by the interviewer, with P(Head) = θ, θ ≠ 0.5:
- Head: give the true answer ("Yes")
- Tail: give the opposite answer
The observed proportion of "Yes" answers is P*(Yes) = P(Yes)·θ + (1 − P(Yes))·(1 − θ), so the true proportion can be recovered as P(Yes) = (P*(Yes) − (1 − θ)) / (2θ − 1).

For multi-attribute records, the same biased coin (P(Head) = θ) is applied to the whole answer vector:
- Head: send the true answer E, e.g. 110
- Tail: send the false (complemented) answer !E, e.g. 001
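The estimation formula above can be checked with a small simulation. This is a minimal sketch, assuming Warner's one-question model with a coin bias θ = 0.7; the function and variable names are illustrative, not from the cited papers.

```python
import random

def randomized_response(truth, theta):
    """With probability theta report the true answer; otherwise lie."""
    return truth if random.random() < theta else not truth

def estimate_true_yes(reported, theta):
    """Invert P*(Yes) = P(Yes)*theta + (1 - P(Yes))*(1 - theta);
    requires theta != 0.5."""
    p_star = sum(reported) / len(reported)
    return (p_star - (1 - theta)) / (2 * theta - 1)

random.seed(0)
theta = 0.7                                                 # biased coin
truths = [random.random() < 0.3 for _ in range(100_000)]    # 30% true "Yes"
reported = [randomized_response(t, theta) for t in truths]
print(round(estimate_true_yes(reported, theta), 2))         # close to 0.30
```

No individual answer reveals the respondent's truth, yet the aggregate proportion is recovered accurately.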
Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003
OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008
Example RR matrix M (all entries 1/3):
| 1/3 1/3 1/3 |
| 1/3 1/3 1/3 |
| 1/3 1/3 1/3 |
Privacy quantification and utility quantification:
- Privacy: how accurately one can estimate the original (private) values from the disguised data
- Utility: how accurately we can estimate aggregate information (e.g., the distribution of the original data)
Optimization: fix privacy and find the matrix M with the optimal utility, or fix utility and find M with the optimal privacy. Challenge: it is difficult to generate M with a fixed privacy (or utility) level.
Evolutionary optimization:
- Start with a set of initial RR matrices
- Repeat the following steps in each iteration:
  - Mating: select two RR matrices from the pool
  - Crossover: exchange several columns between the two RR matrices
  - Mutation: change some values in an RR matrix
  - Meet the privacy bound: filter the resulting matrices
  - Evaluate the fitness value for the new RR matrices
Note: the fitness value is defined in terms of the privacy and utility metrics.
[Figure: candidate matrices M1–M8 plotted in the privacy–utility objective space, from worse to better]
The optimal set is often plotted in the objective space as the Pareto front.
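The evolutionary loop above can be sketched in a few lines. This is a toy illustration, not the actual OptRR implementation: the `privacy` and `utility` functions here are simplistic stand-ins for the paper's metrics, and the pool/filter parameters are made up.

```python
import random

K = 3  # number of categories

def random_rr_matrix():
    """A random column-stochastic K x K RR matrix (each column sums to 1)."""
    m = [[random.random() + 1e-6 for _ in range(K)] for _ in range(K)]
    for j in range(K):
        s = sum(m[i][j] for i in range(K))
        for i in range(K):
            m[i][j] /= s
    return m

def privacy(m):
    """Stand-in metric: average off-diagonal (mixing) mass per column."""
    return sum(m[i][j] for i in range(K) for j in range(K) if i != j) / K

def utility(m):
    """Stand-in metric: average diagonal mass (near-identity matrices
    make aggregate estimation more accurate)."""
    return sum(m[i][i] for i in range(K)) / K

def crossover(a, b):
    """Exchange one randomly chosen column between two parent matrices."""
    j = random.randrange(K)
    child = [row[:] for row in a]
    for i in range(K):
        child[i][j] = b[i][j]
    return child

def mutate(m, rate=0.2):
    """Perturb one column, then renormalise it to stay column-stochastic."""
    child = [row[:] for row in m]
    j = random.randrange(K)
    for i in range(K):
        child[i][j] += random.uniform(0, rate)
    s = sum(child[i][j] for i in range(K))
    for i in range(K):
        child[i][j] /= s
    return child

def evolve(pool_size=20, generations=200, privacy_bound=0.3):
    pool = [random_rr_matrix() for _ in range(pool_size)]
    for _ in range(generations):
        a, b = random.sample(pool, 2)             # mating
        child = mutate(crossover(a, b))           # crossover + mutation
        if privacy(child) >= privacy_bound:       # privacy-bound filter
            pool.append(child)
            pool.sort(key=utility, reverse=True)  # fitness: utility here
            pool = pool[:pool_size]
    return pool[0]

random.seed(1)
best = evolve()
```

Because crossover swaps whole columns and mutation renormalises, every candidate stays a valid (column-stochastic) RR matrix throughout the search.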
Roadmap:
- Randomization (additive noise)
- Geometric perturbation and projection (multiplicative noise)
- Randomized response technique
- Frequent itemset and association rule hiding
- Downgrading classifier effectiveness
Frequent itemset mining: finding frequent sets of items in a transaction data set. Association rule mining: finding associations between items.
First proposed by Agrawal, Imielinski, and Swami in SIGMOD 1993
SIGMOD Test of Time Award 2003
“This paper started a field of research. In addition to containing an innovative algorithm, its subject matter brought data mining to the attention of the database community … even led several years ago to an IBM commercial, featuring supermodels, that touted the importance of work such as that contained in this paper. ”
Apriori algorithm in VLDB 1994
#4 in the top 10 data mining algorithms in ICDM 2006
References: Rakesh Agrawal, Tomasz Imielinski, Arun Swami. Mining Association Rules between Sets of Items in Large Databases. In SIGMOD '93. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
February 19, 2009
- Itemset: X = {x1, …, xk} (a k-itemset)
- Support count (absolute support): the number of transactions that contain X
- Frequent itemset: an itemset X whose support count meets a minimum support threshold
- Association rule: A ⇒ B, with minimum support and confidence
  - Support: the probability that a transaction contains A ∪ B: s = P(A ∪ B)
  - Confidence: the conditional probability that a transaction containing A also contains B: c = P(B | A)
Association rule mining process:
1. Find all frequent itemsets (the more costly step)
2. Generate strong association rules from them
[Figure: Venn diagram — customers buying diaper, beer, or both]

Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F
Frequent itemsets (minimum support count = 3) ?
Association rules (minimum support = 50%, minimum confidence = 50%) ?
Frequent itemsets: {A:3, B:3, D:4, E:3, AD:3}
Association rules: A ⇒ D (support 60%, confidence 100%); D ⇒ A (support 60%, confidence 75%)
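The answers above can be verified by brute force on the five-transaction table. This sketch enumerates all candidate itemsets rather than using Apriori's level-wise pruning, which is fine at this scale:

```python
from itertools import combinations

transactions = {
    10: {'A', 'B', 'D'},
    20: {'A', 'C', 'D'},
    30: {'A', 'D', 'E'},
    40: {'B', 'E', 'F'},
    50: {'B', 'C', 'D', 'E', 'F'},
}

def support_count(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions.values() if itemset <= t)

# All frequent itemsets with minimum support count 3 (brute force;
# Apriori would prune candidates having an infrequent subset).
items = sorted(set().union(*transactions.values()))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        c = support_count(set(combo))
        if c >= 3:
            frequent[frozenset(combo)] = c

print({''.join(sorted(fs)): c for fs, c in frequent.items()})
# {'A': 3, 'B': 3, 'D': 4, 'E': 3, 'AD': 3}

# Strong rules from the only frequent 2-itemset {A, D}:
n = len(transactions)
for lhs, rhs in ((frozenset('A'), frozenset('D')),
                 (frozenset('D'), frozenset('A'))):
    sup = support_count(lhs | rhs) / n
    conf = support_count(lhs | rhs) / support_count(lhs)
    print(sorted(lhs), '=>', sorted(rhs),
          f'support={sup:.0%}', f'confidence={conf:.0%}')
# ['A'] => ['D'] support=60% confidence=100%
# ['D'] => ['A'] support=60% confidence=75%
```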
SIGMOD Ph.D. Workshop IDAR’07
Given: a database D to be released, minimum thresholds MST and MCT, the set of association rules R mined from D, and a set of sensitive rules Rh ⊆ R. Produce a sanitized database D' such that:
- the rules in Rh cannot be mined from D'
- as many of the rules in R − Rh as possible can still be mined from D'
Data sanitization. Basic idea: sanitize the data directly, D → D'. Approaches: distortion, blocking. Drawbacks: the hiding effects cannot be controlled intuitively, and the process requires lots of I/O.
Knowledge sanitization. Basic idea: sanitize the mined knowledge, D → K → D'. Potential advantages: the availability of rules and the hiding effects can be controlled directly, intuitively, and easily.
Sample database D
[Table: the sample transactions as a 0/1 matrix over items A, B, C, D]
Rule A ⇒ C has:
- Support(A ⇒ C) = 80%
- Confidence(A ⇒ C) = 100%
Distorted database D'
[Table: the same 0/1 matrix with some 1's flipped to 0]
Rule A ⇒ C now has:
- Support(A ⇒ C) = 40%
- Confidence(A ⇒ C) = 50%
Distortion Algorithm
Before hiding process | After hiding process | Side effect
Rule Ri had conf(Ri) > MCT | Rule Ri now has conf(Ri) < MCT | Rule eliminated (undesirable side effect)
Rule Ri had conf(Ri) < MCT | Rule Ri now has conf(Ri) > MCT | Ghost rule (undesirable side effect)
Large itemset I had sup(I) > MST | Itemset I now has sup(I) < MST | Itemset eliminated (undesirable side effect)
Goals:
- Minimize the undesirable side effects that the hiding process causes to non-sensitive rules.
- Minimize the number of 1's that must be deleted from the database.
- Algorithms must remain linear in time as the database grows in size.
Example sensitive itemset: ABC. The distortion problem is NP-hard, so heuristics are used: find which items to remove and which transactions to remove them from.
Disclosure Limitation of Sensitive Rules, M. Atallah, A. Elmagarmid, M. Ibrahim, E. Bertino, V. Verykios, 1999
At the end of the search, a 1-itemset (the item to delete) is selected.
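A minimal sketch of the distortion idea, assuming the simplest greedy strategy: keep deleting the consequent item from transactions that support the sensitive rule until its confidence drops below MCT. The function name, victim-selection order, and toy database are illustrative, not the heuristic of Atallah et al.

```python
def confidence(db, lhs, rhs):
    """conf(lhs => rhs) over a list of transaction sets."""
    supp_lhs = sum(1 for t in db if lhs <= t)
    supp_both = sum(1 for t in db if (lhs | rhs) <= t)
    return supp_both / supp_lhs if supp_lhs else 0.0

def hide_rule(db, lhs, rhs, mct):
    """While conf(lhs => rhs) >= MCT, pick a transaction supporting
    lhs | rhs and delete the consequent items (flip their 1's to 0)."""
    while confidence(db, lhs, rhs) >= mct:
        victim = next(t for t in db if (lhs | rhs) <= t)
        victim -= rhs
    return db

# Toy database: rule A => C starts at support 80%, confidence 100%.
db = [{'A', 'C'}, {'A', 'C'}, {'A', 'C'}, {'A', 'C'}, {'B'}]
hide_rule(db, {'A'}, {'C'}, mct=0.75)
print(confidence(db, {'A'}, {'C'}))  # 0.5 (below MCT = 75%)
```

After two deletions, A ⇒ C falls from 80%/100% to 40%/50%, matching the distorted-database numbers on the earlier slide.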
Blocking algorithm: instead of deleting 1's, replace selected values with unknowns ("?").
[Tables: initial 0/1 database and the new database with some entries replaced by "?"]
With unknowns, support and confidence become marginal (interval-valued). In the new database: 60% ≤ conf(A ⇒ C) ≤ 100%.
Knowledge sanitization framework:
1. Frequent set mining: mine the original database D to obtain the frequent itemsets FS (and the rule set R).
2. Sanitization algorithm: transform FS into sanitized itemsets FS' so that Rh is hidden and R − Rh is preserved.
3. Inverse mining: generate the released database D' from FS', using an FP-tree-based method.
2007-7-10 SIGMOD Ph.D. Workshop IDAR’07
Phase 1: generate all frequent itemsets FS, with their supports and support counts, from the original database D.
Phase 2:
- Input: the FS output of phase 1, R, and Rh
- Output: sanitized frequent itemsets FS'
- Process: select a hiding strategy, identify the sensitive frequent itemsets, and perform the sanitization
In the best case, the sanitization algorithm ensures that from FS' we can recover exactly the non-sensitive rule set R − Rh.
Original database D (σ = 4, MST = 66%, MCT = 75%):

TID | Items
T1 | ABCE
T2 | ABC
T3 | ABCD
T4 | ABD
T5 | AD
T6 | ACD

Frequent itemsets FS: A:6 (100%), B:4 (66%), C:4 (66%), D:4 (66%), AB:4 (66%), AC:4 (66%), AD:4 (66%)

Association rules R:
B ⇒ A (confidence 100%, support 66%)
C ⇒ A (confidence 100%, support 66%)
D ⇒ A (confidence 100%, support 66%)

Sanitized frequent itemsets FS': A:6 (100%), C:4 (66%), D:4 (66%), AC:4 (66%), AD:4 (66%)

Association rules R − Rh:
C ⇒ A (confidence 100%, support 66%)
D ⇒ A (confidence 100%, support 66%)
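The FS → FS' step of this example can be reproduced directly: mine the frequent itemsets from D, then drop every itemset containing B to hide the sensitive rule B ⇒ A. The brute-force mining and the "drop all supersets of the sensitive item" strategy are a simplification for illustration:

```python
from itertools import combinations

db = {'T1': set('ABCE'), 'T2': set('ABC'), 'T3': set('ABCD'),
      'T4': set('ABD'), 'T5': set('AD'), 'T6': set('ACD')}
min_count = 4  # sigma = 4 (MST = 66% of 6 transactions)

def count(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in db.values() if itemset <= t)

# Phase 1: mine the frequent itemsets FS.
items = sorted(set().union(*db.values()))
fs = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        if count(set(combo)) >= min_count:
            fs[frozenset(combo)] = count(set(combo))

# Phase 2: hide rule B => A by dropping every itemset containing B.
fs_prime = {s: c for s, c in fs.items() if 'B' not in s}

print(sorted(''.join(sorted(s)) for s in fs))
# ['A', 'AB', 'AC', 'AD', 'B', 'C', 'D']
print(sorted(''.join(sorted(s)) for s in fs_prime))
# ['A', 'AC', 'AD', 'C', 'D']
```

FS and FS' match the tables above, and generating rules from FS' yields exactly R − Rh = {C ⇒ A, D ⇒ A}.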
Open issues in itemset sanitization:
- Finding the optimal solution: the support and confidence of the rules in R − Rh should remain unchanged as much as possible
- Integrating data protection and knowledge (rule) protection