August 2002 MASK Presentation (VLDB) 1
Maintaining Data Privacy in Association Rule Mining
Shariq Rizvi
Indian Institute of Technology, Bombay
Joint work with: Jayant Haritsa, Indian Institute of Science
“Action movies released in July rarely bomb at the box office”
“amazon.com: If you are buying Macbeth, you may want to read The Count of Monte Cristo”
Major Illnesses (YES / NO): Diabetes, Lung Cancer, Myopia
Insurance premium for the children may be increased because lung cancer is suspected to be genetically transmitted.
90% of all PhD students don’t do research!
On supermarket purchases, indicates which items are typically bought together
80 percent of customers purchasing coffee also purchased milk.
Coffee ⇒ Milk (0.8)
To ensure statistical significance, we also need to compute the “support”: coffee and milk are purchased together by 60 percent of customers.
Coffee ⇒ Milk (0.8,0.6)
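The two numbers attached to the rule can be computed directly from a transaction list. The following is a minimal sketch; the 20 toy baskets are made up so that the result matches the slide's (0.8, 0.6), and the function names are illustrative only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent together) over support of antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical data: 12 baskets with coffee and milk, 3 with coffee only, 5 with neither.
transactions = [{"coffee", "milk"}] * 12 + [{"coffee"}] * 3 + [{"bread"}] * 5

rule_support = support(transactions, {"coffee", "milk"})          # 12 / 20
rule_confidence = confidence(transactions, {"coffee"}, {"milk"})  # 12 / 15
print(f"Coffee => Milk ({rule_confidence:.1f}, {rule_support:.1f})")
```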
User Data
Data Mining Algorithm
Aggregate Models Privacy! Privacy!
User-visibility of privacy
Data aggregation-time efficiency
Mining-time efficiency
Data Privacy vs. Accurate Models
A Distortion Procedure
A Reconstruction Procedure
0s represent absence of the item in the transaction
1s represent presence of the item in the transaction
Global data swapping? (privacy not “user-visible”)
Data perturbation?
Flip each bit with probability 1-p (p = 0.1 means 90% of the bits are flipped)
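The distortion step can be sketched in a few lines, assuming each bit of a customer tuple is handled independently: it is kept with probability p and flipped with probability 1-p. The function name and the example tuple are illustrative, not taken from the slides.

```python
import random


def distort_tuple(bits, p, rng=random):
    """Keep each 0/1 bit with probability p; flip it with probability 1 - p."""
    return [b if rng.random() < p else 1 - b for b in bits]


rng = random.Random(0)           # seeded only to make the example repeatable
original = [1, 0, 1, 1]          # hypothetical tuple: 1 = bought, 0 = not bought
distorted = distort_tuple(original, p=0.9, rng=rng)
print(original, "->", distorted)
```

With p = 0.9 about one bit in ten is flipped; with p = 0.1 about nine in ten are.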
[Figure: an original customer tuple and its distorted version over the items MS Office, Diet Coke, Insulin, Diapers; 1 = bought, 0 = not bought]
column
Pr{ Yi= 1| Xi= 1} x Pr{ Xi= 1| Yi= 1} + Pr{ Yi= 0| Xi= 1} x Pr{ Xi= 1| Yi= 0}
$$R(p, s_i) = \frac{s_i\,p^2}{s_i\,p + (1-s_i)(1-p)} + \frac{s_i\,(1-p)^2}{s_i\,(1-p) + (1-s_i)\,p}$$
s_i = support for item i
p = distortion parameter
[Plot: R(p, s_i) for given s_i]
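The reconstruction probability can be evaluated numerically as a sanity check. This sketch assumes the two-term form that follows from applying Bayes' rule to the conditional probabilities above, with each bit kept with probability p; the function name is illustrative.

```python
def reconstruction_probability(p, s):
    """R(p, s): probability of correctly guessing a true 1 from its distorted bit.

    p is the probability that a bit is kept; s is the support of the item.
    """
    # Distorted bit is 1: Pr{Y=1|X=1} * Pr{X=1|Y=1}
    seen_one = (s * p * p) / (s * p + (1 - s) * (1 - p))
    # Distorted bit is 0: Pr{Y=0|X=1} * Pr{X=1|Y=0}
    seen_zero = (s * (1 - p) ** 2) / (s * (1 - p) + (1 - s) * p)
    return seen_one + seen_zero


for p in (0.5, 0.7, 0.9):
    print(p, reconstruction_probability(p, s=0.01))
```

At p = 0.5 the distorted bit carries no information and R collapses to the prior s; the function is also symmetric in p and 1-p.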
$$P(p, s_i) = (1 - R(p, s_i)) \times 100$$
[Plot: P(p, s_i) for s_i = 0.01]
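The shape of the privacy curve can be reproduced with a short sweep over p. This is a sketch under the same assumption as the formula for R (each bit kept with probability p); the helper names are illustrative and the sampled p values are arbitrary.

```python
def reconstruction_probability(p, s):
    """R(p, s) for an item with support s when each bit is kept with probability p."""
    return (s * p * p) / (s * p + (1 - s) * (1 - p)) + \
           (s * (1 - p) ** 2) / (s * (1 - p) + (1 - s) * p)


def privacy(p, s):
    """P(p, s) = (1 - R(p, s)) * 100, as a percentage."""
    return (1 - reconstruction_probability(p, s)) * 100


# Sweep the distortion parameter for an item with support 0.01.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}  privacy = {privacy(p, 0.01):.1f}%")
```

Privacy peaks at p = 0.5 and falls off symmetrically towards p = 0 and p = 1.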
The Playground!
[Figure: customer tuples over MS Office, Diet Coke, Insulin, Diapers shown with 90% distortion (p = 0.1) and with 10% distortion (p = 0.9)]
More visible distortion ⇒ Happier Customer?
(Mining Associations with Secrecy Konstraints)
1. F = ∅
2. Cands = set of all items
3. Length = 1
4. While Cands ≠ ∅:
   1. Count the 2^Length components for each c ∈ Cands
   2. Reconstruct the support for each c ∈ Cands
   3. Add all frequent itemsets to F
   4. Cands = Apriori-Gen(Cands)
   5. Length = Length + 1
5. Return F
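The loop above can be sketched in Python as follows. This is not the authors' implementation: it assumes items are 0-based column indices, rows are lists of distorted 0/1 bits, each bit was kept independently with probability p (p ≠ 0.5), and candidates are generated from the frequent itemsets of the current level. The support reconstruction uses the row of the inverse distortion matrix for the all-ones pattern, written in closed form under the independence assumption.

```python
from itertools import combinations


def reconstruct_support(distorted, itemset, p):
    """Estimate the original support of itemset from distorted 0/1 rows."""
    n = len(itemset)
    counts = {}  # distorted count of each of the 2^n bit patterns
    for row in distorted:
        pattern = tuple(row[i] for i in itemset)
        counts[pattern] = counts.get(pattern, 0) + 1
    # Coefficient of a pattern with d zeros: p^(n-d) * (p-1)^d / (2p-1)^n
    total = 0.0
    for pattern, c in counts.items():
        d = n - sum(pattern)
        total += (p ** (n - d)) * ((p - 1) ** d) * c
    return total / ((2 * p - 1) ** n) / len(distorted)


def apriori_gen(frequent, length):
    """Candidates of size length+1 whose size-length subsets are all frequent."""
    items = sorted({i for f in frequent for i in f})
    prev = set(frequent)
    return [c for c in combinations(items, length + 1)
            if all(sub in prev for sub in combinations(c, length))]


def mask(distorted, num_items, p, min_support):
    """Return itemsets whose reconstructed support is at least min_support."""
    frequent = []
    cands = [(i,) for i in range(num_items)]
    length = 1
    while cands:
        level = [c for c in cands
                 if reconstruct_support(distorted, c, p) >= min_support]
        frequent.extend(level)
        cands = apriori_gen(level, length)  # assumption: join frequent sets only
        length += 1
    return frequent


# With p = 1.0 nothing is distorted, so the result is plain Apriori.
rows = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [0, 1, 1]]
print(mask(rows, num_items=3, p=1.0, min_support=0.5))
```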
Counters
c_0, c_1 = 0,1 counts in the original column
c^D_0, c^D_1 = 0,1 counts in the distorted column
p = distortion parameter
C = M^{-1} C^D
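For a single column the reconstruction is a 2x2 linear solve. The sketch below assumes M = [[p, 1-p], [1-p, p]] acting on the count vector (c_1, c_0), which is what keeping a bit with probability p implies; the slide text does not spell M out, so treat that layout and the function name as assumptions.

```python
def reconstruct_column(c1_d, c0_d, p):
    """Solve C = M^-1 C^D for one column, with M = [[p, 1-p], [1-p, p]]."""
    det = 2 * p - 1  # determinant of M: p^2 - (1-p)^2
    if det == 0:
        raise ValueError("p = 0.5 makes M singular; counts cannot be reconstructed")
    c1 = (p * c1_d - (1 - p) * c0_d) / det
    c0 = (p * c0_d - (1 - p) * c1_d) / det
    return c1, c0


# Hypothetical column: 30 ones and 70 zeros, distorted with p = 0.9.
# The expected distorted counts are 0.9*30 + 0.1*70 = 34 ones and 66 zeros.
c1, c0 = reconstruct_column(34, 66, p=0.9)
print(c1, c0)
```

On real data the distorted counts fluctuate around their expectation, so the reconstructed counts are estimates rather than exact values.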
C = original 2^n counts
C^D = distorted 2^n counts (e.g. counts for 00, 01, 10, 11 for a 2-itemset)
m_{i,j} = probability that a tuple of the form j distorts to a tuple of the form i
“010” tuple distorts to a “001” = p x (1-p) x (1-p)
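The entries m_{i,j} can be generated mechanically from the number of positions in which the two bit patterns differ. A minimal sketch, assuming independent per-bit distortion with keep-probability p; the function names and the string encoding of patterns are illustrative.

```python
def distortion_probability(src, dst, p):
    """Probability that bit-string src distorts to dst (each bit kept with prob. p)."""
    flips = sum(a != b for a, b in zip(src, dst))
    return p ** (len(src) - flips) * (1 - p) ** flips


def distortion_matrix(n, p):
    """M[i][j] = probability that pattern j distorts to pattern i, for n-itemsets."""
    patterns = [format(k, f"0{n}b") for k in range(2 ** n)]
    return [[distortion_probability(j, i, p) for j in patterns] for i in patterns]


# The slide's example: "010" -> "001" keeps one bit and flips two.
print(distortion_probability("010", "001", p=0.9))  # p * (1-p) * (1-p)

M = distortion_matrix(2, p=0.9)  # 4x4 matrix over 00, 01, 10, 11
```

Each column of M sums to 1, since every original pattern must distort to some pattern.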
and p
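$$\rho = \frac{1}{|F|} \sum_{f} \frac{|\mathit{rec\_sup}_f - \mathit{act\_sup}_f|}{\mathit{act\_sup}_f} \times 100$$
$$\sigma^{+} = \frac{|R - F|}{|F|} \times 100 \quad \text{(false positives)}$$
$$\sigma^{-} = \frac{|F - R|}{|F|} \times 100 \quad \text{(false negatives)}$$
R = reconstructed set of frequent itemsets
F = actual set of frequent itemsets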
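The three error measures are straightforward to compute once the actual and reconstructed results are available. The sketch below assumes supports are held in dictionaries keyed by itemset and that a reconstructed support exists for every itemset in F; the function names and sample values are made up for illustration.

```python
def support_error(actual_sup, reconstructed_sup):
    """rho: mean relative support error, in percent, over the itemsets in F."""
    total = sum(abs(reconstructed_sup[f] - actual_sup[f]) / actual_sup[f]
                for f in actual_sup)
    return total / len(actual_sup) * 100


def identity_errors(actual, reconstructed):
    """(sigma_plus, sigma_minus): false positives and false negatives, in percent."""
    sigma_plus = len(reconstructed - actual) / len(actual) * 100
    sigma_minus = len(actual - reconstructed) / len(actual) * 100
    return sigma_plus, sigma_minus


# Hypothetical results for illustration only.
act = {("a",): 0.50, ("b",): 0.20}
rec = {("a",): 0.55, ("b",): 0.18}
rho = support_error(act, rec)

F = {("a",), ("b",), ("c",), ("d",)}
R = {("a",), ("b",), ("e",)}
sigma_plus, sigma_minus = identity_errors(F, R)
print(rho, sigma_plus, sigma_minus)
```

Because both identity errors are normalised by |F|, sigma_plus can exceed 100 when the reconstruction reports many more itemsets than actually exist.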
500 items 0.6 million tuples
1000 items 1 million tuples
Level   |F|    ρ      σ−      σ+
1       249    5.9    4.0     2.8
2       239    3.9    6.7     7.1
3       73     2.6    11.0    9.6
4       4      1.4            25.0
Level   |F|    ρ      σ−      σ+
1       249    19.0   7.2     15.7
2       239    33.6   20.1    1907.5
3       73     32.9   30.1    2308.2
4       4      7.6    50.0    400.0
Level   |F|    ρ      σ−      σ+
1       249    6.1    1.2     0.4
2       239    4.0    1.3     23.4
3       73     2.9            45.2
4       4      1.4            75.0
The 2^n x 2^n matrix in C = M^{-1} C^D has only n+1 distinct entries
count(11) = a_0 count^D(00) + a_1 count^D(01) + a_2 count^D(10) + a_3 count^D(11), with a_1 = a_2
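The repeated coefficients can be seen by writing out the row of M^{-1} that reconstructs the all-ones count. This sketch assumes independent per-bit distortion, so that M is the n-fold Kronecker power of [[p, 1-p], [1-p, p]] and its inverse is the Kronecker power of the 2x2 inverse; each coefficient then depends only on how many zeros the distorted pattern contains. The function name is illustrative.

```python
def inverse_row_all_ones(n, p):
    """Coefficients a_j of count^D(j) in the reconstruction of count(11...1).

    Requires p != 0.5. A pattern with d zeros gets p^(n-d) * (p-1)^d / (2p-1)^n.
    """
    scale = (2 * p - 1) ** n
    coeffs = {}
    for k in range(2 ** n):
        pattern = format(k, f"0{n}b")
        zeros = pattern.count("0")
        coeffs[pattern] = (p ** (n - zeros)) * ((p - 1) ** zeros) / scale
    return coeffs


a = inverse_row_all_ones(2, p=0.9)
for pattern in ("00", "01", "10", "11"):
    print(pattern, a[pattern])
```

For n = 2 the patterns 01 and 10 share one coefficient, leaving n+1 = 3 distinct values, so distorted counts with the same number of zeros can be pooled before the weighted sum.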
August 2002 MASK Presentation (VLDB) 37
Example (pass 2):
count D(11) = dbsize
11 are already being counted
(p~ 0.9) by a factor of 4
August 2002 MASK Presentation (VLDB) 38
Work