Maintaining Data Privacy in Association Rule Mining


SLIDE 1

August 2002 MASK Presentation (VLDB) 1

Maintaining Data Privacy in Association Rule Mining

Shariq Rizvi

Indian Institute of Technology, Bombay Joint work with: Jayant Haritsa Indian Institute of Science

SLIDE 2

A Typical Web-Service Form

SLIDE 3

The Good Side

  • Better aggregate models
  • Improved customer services

“Action movies released in July rarely bomb at the box office.” “amazon.com: If you are buying Macbeth, you may want to read The Count of Monte Cristo.”

SLIDE 4

The Dark Side

  • Breach of data privacy

Major Illnesses (YES/NO): Diabetes, Lung Cancer, Myopia

The insurance premium for the children may be increased, because lung cancer is suspected to be genetically transmitted.

SLIDE 5

The Dark Side (contd)

  • Discovery of sensitive models

90% of all PhD students don’t do research!

SLIDE 6

The Nuclear Power Equivalence

How do we get all the good without suffering from the bad?

SLIDE 7

Our Focus

Addressing privacy concerns in the context of Boolean Association Rule Mining

SLIDE 8

Association Rules

  • Co-occurence of events:

Over supermarket purchases, they indicate which items are typically bought together.

80 percent of customers purchasing coffee also purchased milk.

Coffee ⇒ Milk (0.8)

To ensure statistical significance, we also need to compute the “support”: coffee and milk are purchased together by 60 percent of customers.

Coffee ⇒ Milk (0.8,0.6)
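The confidence/support pair above can be computed with a short sketch (toy baskets, not data from the talk; `rule_stats` is an illustrative helper):

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (confidence, support) for the rule antecedent => consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions
               if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n                      # fraction buying both
    confidence = both / ante if ante else 0.0  # fraction of antecedent buyers
    return confidence, support

# 5 toy baskets: 4 contain coffee, 3 of those also contain milk.
baskets = [
    {"coffee", "milk"}, {"coffee", "milk"}, {"coffee", "milk", "bread"},
    {"coffee"}, {"tea", "milk"},
]
print(rule_stats(baskets, {"coffee"}, {"milk"}))  # → (0.75, 0.6)
```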

SLIDE 9

Frequent Itemsets

  • T = set of transactions
  • I = set of items
  • supmin – user-specified threshold

“X ⊆ I is frequent if more than supmin transactions in T support X”

SLIDE 10

Privacy and BAR Mining

  • Preventing discovery of sensitive rules
      • Atallah et al [KDEX 1999]
      • Saygin, Verykios, Clifton [SIGMOD Record 2001]
      • Dasseni, Verykios [IHW 2001]
      • Saygin et al [RIDE 2002]
  • Preventing disclosure of data
      • Our work
      • Concurrent work by Evfimievski et al [KDD 2002]

[Diagram: User Data → Data Mining Algorithm → Aggregate Models, with privacy concerns flagged at the data-disclosure and model stages]

SLIDE 11

Requirements for Mining with Data Privacy

  • High privacy
      • User-visibility of privacy
  • Highly accurate models
  • Efficiency
      • Data aggregation-time efficiency
      • Mining-time efficiency

SLIDE 12

Conflicting Goals

Data Privacy vs. Accurate Models

SLIDE 13

The Game Plan

User Data → (A Distortion Procedure) → Distorted Data → (A Reconstruction Procedure: Our Algorithm) → Pretty Accurate Models

SLIDE 14

Outline

  • Privacy by data distortion
  • Mining the distorted database (MASK)
  • Experimental Evaluation
  • Run-time Optimizations
  • Conclusions, Limitations and Future Work
SLIDE 15

Distortion Procedure

  • View the database as a matrix of 0s and 1s

0s represent absence of the item in the transaction 1s represent presence of the item in the transaction

Global data swapping? (privacy not “user-visible”) Data perturbation?

  • Independently flip some entries in the matrix: don’t flip with probability p, flip with probability 1−p (p = 0.1 ⇒ 90% flips)
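A minimal sketch of this flipping step (the function name and fixed seed are mine, for reproducibility):

```python
import random

def distort(row, p, rng=None):
    """Keep each 0/1 entry with probability p, flip it with probability 1-p."""
    rng = rng or random.Random(42)  # fixed seed: reproducible sketch
    return [b if rng.random() < p else 1 - b for b in row]

row = [1, 0, 1, 1, 0, 0, 0, 1]
print(distort(row, 0.1))  # with p = 0.1, roughly 90% of entries flip
```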

SLIDE 16

Torvald’s Dilemma

[Table: Torvald’s original customer tuple vs. its distorted tuple over the items MS Office, Diet Coke, Insulin, Diapers; 1 = bought, 0 = not bought]

SLIDE 17

Privacy Breach Measure

  • Reconstruction probability of a ‘1’ in the ith column:

Pr{Yi = 1 | Xi = 1} × Pr{Xi = 1 | Yi = 1} + Pr{Yi = 0 | Xi = 1} × Pr{Xi = 1 | Yi = 0}

SLIDE 18

Reconstruction Probability of a ‘1’

R(p, si) = (si · p²) / (si · p + (1 − si)(1 − p)) + (si · (1 − p)²) / (si · (1 − p) + (1 − si) · p)

si = support for item i; p = distortion parameter

[Plot: R(p, si) for a given si]

SLIDE 19

Privacy Measure

P(p, si) = (1 − R(p, si)) × 100

[Plot: P(p, si) for si = 0.01 — “The Playground!”]
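Both quantities can be computed directly. The expression for R(p, si) below is my transcription of the slide’s garbled formula, recovered from the conditional-probability definition on the previous slide, so treat it as an assumption:

```python
def reconstruction_prob(p, s):
    """R(p, s): probability of correctly reconstructing a '1' in a column
    with support s, under distortion parameter p (assumed transcription)."""
    return (s * p**2 / (s * p + (1 - s) * (1 - p))
            + s * (1 - p)**2 / (s * (1 - p) + (1 - s) * p))

def privacy(p, s):
    """P(p, s) = (1 - R(p, s)) x 100, the privacy measure."""
    return (1 - reconstruction_prob(p, s)) * 100

# For a rare item (s = 1%), p = 0.9 keeps privacy above 90%.
print(round(privacy(0.9, 0.01), 1))  # → 92.5
```

Note R is symmetric in p and 1−p, which matches the slide’s later remark that p = 0.9 and p = 0.1 behave symmetrically.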

SLIDE 20

Data Distortion and Psychology

[Table: distorted customer tuples over MS Office, Diet Coke, Insulin, Diapers at p = 0.1 (90% distortion) and p = 0.9 (10% distortion)]

More visible distortion ⇒ Happier customer?

SLIDE 21

Outline

  • Privacy by data distortion
  • Mining the distorted database (MASK)
  • Experimental Evaluation
  • Run-time Optimizations
  • Conclusions, Limitations and Future Work
SLIDE 22

MASK

(Mining Associations with Secrecy Konstraints)

1. F = ∅
2. Cands = set of all items
3. Length = 1
4. While Cands ≠ ∅:
   1. Count the 2^Length combinations for each c ∈ Cands
   2. Reconstruct the support of each c ∈ Cands
   3. Add all frequent itemsets to F
   4. Cands = Apriori-Gen(Cands)
   5. Length = Length + 1
5. Return F
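A structural Python sketch of this loop. `reconstruct_support` is passed in as a stand-in for the matrix-inversion step described later (a hypothetical callback, not the paper’s API), and Apriori-Gen is simplified to “all (Length+1)-subsets of the surviving items”:

```python
from itertools import combinations

def mask(items, p, sup_min, reconstruct_support):
    """Skeleton of the MASK loop; reconstruct_support(c, p) -> support."""
    F = []                                   # frequent itemsets found so far
    cands = [frozenset([i]) for i in items]  # start from all 1-itemsets
    length = 1
    while cands:
        # "Count the 2^Length combinations" happens inside the callback.
        frequent = [c for c in cands if reconstruct_support(c, p) >= sup_min]
        F.extend(frequent)
        # Simplified Apriori-Gen: (length+1)-subsets of surviving items.
        universe = sorted({i for f in frequent for i in f})
        length += 1
        cands = [frozenset(c) for c in combinations(universe, length)]
    return F

# Toy run with fabricated supports standing in for reconstruction.
supports = {frozenset({"a"}): 0.8, frozenset({"b"}): 0.7,
            frozenset({"a", "b"}): 0.6}
rec = lambda c, p: supports.get(c, 0.0)
print(mask(["a", "b", "c"], 0.9, 0.5, rec))  # {a}, {b}, {a, b}
```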

SLIDE 23

Counters

  • 2^n counters for an n-itemset
  • {c00, c01, c10, c11} for a 2-itemset
  • {c000, c001, c010, c011, c100, c101, c110, c111} for a 3-itemset
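Tallying these counters over a toy 0/1 matrix can be sketched as follows (`combo_counts` is an illustrative helper, not from the talk):

```python
from collections import Counter

def combo_counts(rows, itemset_cols):
    """Return counts keyed by bit-pattern, e.g. c01 under key '01'."""
    return Counter("".join(str(r[c]) for c in itemset_cols) for r in rows)

rows = [[1, 1], [1, 0], [0, 1], [1, 1], [0, 0]]
print(combo_counts(rows, [0, 1]))  # c11=2, c10=1, c01=1, c00=1
```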

SLIDE 24

MASK

(Mining Associations with Secrecy Konstraints)

1. F = ∅
2. Cands = set of all items
3. Length = 1
4. While Cands ≠ ∅:
   1. Count the 2^Length combinations for each c ∈ Cands
   2. Reconstruct the support of each c ∈ Cands
   3. Add all frequent itemsets to F
   4. Cands = Apriori-Gen(Cands)
   5. Length = Length + 1
5. Return F

SLIDE 25

Support Reconstruction for 1-itemsets

c0, c1 = 0/1 counts in the original column
cD0, cD1 = 0/1 counts in the distorted column
p = distortion parameter

C = M⁻¹CD
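For a single column the system is 2×2 and can be inverted by hand, taking M = [[p, 1−p], [1−p, p]] as implied by the flip probabilities. The explicit inverse below is my derivation (valid only for p ≠ 0.5):

```python
def reconstruct_1item(c0_d, c1_d, p):
    """Recover estimated original counts (c0, c1) from distorted counts."""
    det = 2 * p - 1                       # det of [[p, 1-p], [1-p, p]]
    c0 = (p * c0_d - (1 - p) * c1_d) / det
    c1 = (p * c1_d - (1 - p) * c0_d) / det
    return c0, c1

# Forward-distort exact expected counts, then reconstruct them.
p, c0, c1 = 0.9, 700, 300
c0_d = p * c0 + (1 - p) * c1              # expected distorted 0-count
c1_d = p * c1 + (1 - p) * c0              # expected distorted 1-count
print(reconstruct_1item(c0_d, c1_d, p))   # recovers (700, 300) up to rounding
```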

SLIDE 26

Support Reconstruction for an n-itemset

C = M⁻¹CD

C = original 2^n counts; CD = distorted 2^n counts (e.g. counts for 00, 01, 10, 11 for a 2-itemset)

M = {mi,j}, where mi,j = probability that a tuple of the form j distorts to a tuple of the form i

  • e.g. m1,2 for a 3-itemset is the probability that a “010” tuple distorts to a “001” tuple = p × (1−p) × (1−p)
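Building M from this definition can be sketched as below. The bit-to-index convention (pattern string read as a binary number, so “010” is index 2 and “001” is index 1) is my assumption for the sketch:

```python
def distortion_matrix(n, p):
    """m[i][j] = probability that pattern j distorts into pattern i,
    with each bit kept with probability p, flipped with 1-p."""
    size = 2 ** n
    m = [[1.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            for bit in range(n):
                same = ((i >> bit) & 1) == ((j >> bit) & 1)
                m[i][j] *= p if same else (1 - p)
    return m

M = distortion_matrix(3, 0.9)
# Slide example: "010" -> "001" has probability p * (1-p) * (1-p) ≈ 0.009.
print(M[1][2])
```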

SLIDE 27

The Big Picture

  • User-visible privacy
  • Value of p is pre-decided
  • Data-miner gets both the distorted data and p
  • Reconstruction of supports
SLIDE 28

Outline

  • Privacy by data distortion
  • Mining the distorted database (MASK)
  • Experimental Evaluation
  • Run-time Optimizations
  • Conclusions, Limitations and Future Work
SLIDE 29

Error Metrics

  • Support Error

ρ = (1/|F|) · Σ f∈F |rec_sup_f − act_sup_f| / act_sup_f × 100

  • Identity Error

σ+ = |R − F| / |F| × 100 (false positives)
σ− = |F − R| / |F| × 100 (false negatives)

R = reconstructed set of frequent itemsets; F = actual set of frequent itemsets
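Both metrics in a short sketch (toy supports; the dict/set-based signatures are mine):

```python
def support_error(act, rec):
    """rho: mean relative support error over the actual frequent set F,
    in percent; act and rec map itemset -> support."""
    F = act.keys()
    return 100 / len(F) * sum(abs(rec.get(f, 0) - act[f]) / act[f] for f in F)

def identity_errors(F, R):
    """(sigma+, sigma-): false positives and false negatives, as % of |F|."""
    fp = 100 * len(R - F) / len(F)
    fn = 100 * len(F - R) / len(F)
    return fp, fn

act = {"a": 0.5, "b": 0.4}
rec = {"a": 0.55, "c": 0.3}
print(support_error(act, rec))                  # mean of a 10% and a 100% error
print(identity_errors({"a", "b"}, {"a", "c"}))  # one FP, one FN out of |F| = 2
```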

SLIDE 30

The Setup

  • Scaled real dataset (BMS-WebView): 500 items, 0.6 million tuples
  • Synthetic dataset (IBM Almaden): 1000 items, 1 million tuples
  • Experiments across p and supmin values
  • Low supmin values are tough nuts
SLIDE 31

Results with p= 0.9, supmin= 0.25%

Level   |F|   ρ     σ−     σ+
1       249   5.9   4.0    2.8
2       239   3.9   6.7    7.1
3       73    2.6   11.0   9.6
4       4     —     1.4    25.0

SLIDE 32

Results with p= 0.7, supmin= 0.25%

Level   |F|   ρ      σ−     σ+
1       249   19.0   7.2    15.7
2       239   33.6   20.1   1907.5
3       73    32.9   30.1   2308.2
4       4     7.6    50.0   400.0

SLIDE 33

Effect of Relaxation p= 0.9, supmin= 0.25%

Level   |F|   ρ     σ−    σ+
1       249   6.1   1.2   0.4
2       239   4.0   1.3   23.4
3       73    —     2.9   45.2
4       4     —     1.4   75.0

  • 10% relaxation in supmin
SLIDE 34

Summary of Experiments

  • “Window of opportunity”: around p = 0.9 (symmetrically, 0.1)
  • Unusable models as p → 0.5
  • Significant loss of privacy as p → 1 or 0
  • Most identity errors occur near the supmin boundary
  • Low errors at higher levels
SLIDE 35

Outline

  • Distortion and Reconstruction
  • Privacy Metric
  • MASK Algorithm
  • Experimental Evaluation
  • Run-time Optimizations
  • Conclusions, Limitations and Future Work
SLIDE 36

Linear Number of Counters

  • Each row of the 2^n × 2^n matrix M in C = M⁻¹CD has only n+1 distinct entries
  • Example (n = 2): count(11) = a0·countD(00) + a1·countD(01) + a2·countD(10) + a3·countD(11), with a1 = a2
  • Only n+1 counters for an n-itemset
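The n+1-distinct-entries property holds because mi,j depends only on how many bits differ between patterns i and j. A quick self-contained check (`row_of_m` is an illustrative helper):

```python
def row_of_m(i, n, p):
    """Row i of the 2^n x 2^n distortion matrix: entry j is
    (1-p)^d * p^(n-d), where d = Hamming distance between i and j."""
    row = []
    for j in range(2 ** n):
        d = bin(i ^ j).count("1")            # bits that must flip
        row.append(((1 - p) ** d) * (p ** (n - d)))
    return row

n, p = 4, 0.9
for i in range(2 ** n):
    assert len(set(row_of_m(i, n, p))) == n + 1   # only n+1 distinct values
print("every row has", n + 1, "distinct entries")
```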
SLIDE 37

Cutting Down on Counting

Example (pass 2):

  • countD(00) + countD(01) + countD(10) + countD(11) = dbsize
  • Disregard ‘00’ counts, since 01, 10 and 11 are already being counted
  • Speeds up pass 2 in experimental runs (p ≈ 0.9) by a factor of 4
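A sketch of the implied-‘00’ trick for one item pair (helper name and toy rows are mine):

```python
def counts_with_implied_00(rows, c1, c2):
    """Count 11/10/01 explicitly; derive 00 from dbsize minus the rest."""
    n11 = n10 = n01 = 0
    for r in rows:
        a, b = r[c1], r[c2]
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
    return {"11": n11, "10": n10, "01": n01,
            "00": len(rows) - n11 - n10 - n01}

rows = [[1, 1], [1, 0], [0, 0], [0, 1], [0, 0]]
print(counts_with_implied_00(rows, 0, 1))  # '00' count derived, not counted
```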

SLIDE 38

Outline

  • Distortion and Reconstruction
  • Privacy Metric
  • MASK Algorithm
  • Experimental Evaluation
  • Run-time Optimizations
  • Conclusions, Limitations and Future Work

SLIDE 39

Conclusions

  • Simple probabilistic distortion of data: “user-visible”
  • Achieves the conflicting goals of privacy and model accuracy
  • Optimizations significantly reduce time and space complexity

SLIDE 40

Limitations

  • Even with the optimizations, the time complexity is high compared to standard (non-privacy-preserving) mining
  • Does not take into account the re-interrogation of data with mining results [KDD02]

SLIDE 41

Future Work

  • Improvements in running time
  • Refinement of privacy estimates
  • Extensions to generalized and quantitative association rules

SLIDE 42

Take Away

Like Reagan to Gorbachev on monitoring nuclear reductions, “Trust, but verify”, our motto is:

“Trust, but distort”