

SLIDE 1

Simulatability

“The enemy knows the system”, Claude Shannon

CompSci 590.03 Instructor: Ashwin Machanavajjhala

Lecture 6 : 590.03 Fall 12

SLIDE 2

Announcements

  • Please meet with me at least 2 times before you finalize your project (deadline Sep 28).

SLIDE 3

Recap – L-Diversity

  • The link between identity and attribute value is the sensitive information: “Does Bob have Cancer? Heart disease? Flu?” “Does Umeko have Cancer? Heart disease? Flu?”

  • Adversary knows ≤ L-2 negation statements.

“Umeko does not have Heart Disease.”

– Data Publisher may not know exact adversarial knowledge

  • Privacy is breached when identity can be linked to an attribute value with high probability:

    Pr[ “Bob has Cancer” | published table, adv. knowledge] > t

SLIDE 4

Zip    Age   Nat.  Disease
1306*  <=40  *     Heart
1306*  <=40  *     Flu
1306*  <=40  *     Cancer
1306*  <=40  *     Cancer
1485*  >40   *     Cancer
1485*  >40   *     Heart
1485*  >40   *     Flu
1485*  >40   *     Flu
1305*  <=40  *     Heart
1305*  <=40  *     Flu
1305*  <=40  *     Cancer
1305*  <=40  *     Cancer

Recap – 3-Diverse Table


L-Diversity Principle: Every group of tuples with the same Q-ID values has ≥ L distinct sensitive values of roughly equal proportions.
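The principle can be checked mechanically. A minimal sketch in Python, using only the “≥ L distinct sensitive values per Q-ID group” reading of the principle (the “roughly equal proportions” clause is deliberately left out); the function and column names are illustrative:

```python
def is_l_diverse(rows, l, qid_cols, sens_col):
    """Simplest reading of the principle: every group of tuples sharing
    the same Q-ID values must contain at least l distinct sensitive values."""
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in qid_cols)
        groups.setdefault(key, set()).add(row[sens_col])
    return all(len(vals) >= l for vals in groups.values())

# The 3-diverse table from this slide (Nat. is always *)
table = (
    [{"zip": "1306*", "age": "<=40", "disease": d}
     for d in ("Heart", "Flu", "Cancer", "Cancer")]
    + [{"zip": "1485*", "age": ">40", "disease": d}
       for d in ("Cancer", "Heart", "Flu", "Flu")]
    + [{"zip": "1305*", "age": "<=40", "disease": d}
       for d in ("Heart", "Flu", "Cancer", "Cancer")]
)
print(is_l_diverse(table, 3, ["zip", "age"], "disease"))   # True
```

Each of the three Q-ID groups contains {Heart, Flu, Cancer}, so the table passes for L = 3 but would fail for L = 4.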


SLIDE 5

Outline

  • Simulatable Auditing
  • Minimality Attack in anonymization
  • Simulatable algorithms for anonymization


SLIDE 6

Query Auditing

The database holds numeric values (say, salaries of employees). It either truthfully answers a query or denies answering. Queries are MIN, MAX, and SUM over subsets of the database. Question: when should queries be allowed or denied?

[Diagram: a Researcher poses a query; the Database decides “Safe to publish? Yes/No” before answering.]

SLIDE 7

Why should we deny queries?

  • Q1: Ben’s sensitive value? – DENY
  • Q2: Max sensitive value of males? – ANSWER: 2
  • Q3: Max sensitive value of 1st-year PhD students? – ANSWER: 3
  • But Q3 + Q2 => Xi = 3

Name  1st year PhD  Gender  Sensitive value
Ben   Y             M       1
Bha   N             M       1
Ios   Y             M       1
Jan   N             M       2
Jian  Y             M       2
Jie   N             M       1
Joe   N             M       2
Moh   N             M       1
Son   N             F       1
Xi    Y             F       3
Yao   N             M       2
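The inference on this slide can be replayed directly on the table. A small sketch (variable names are illustrative): the answer to Q3 exceeds every male’s possible value from Q2, so it must come from a 1st-year non-male, and Xi is the only candidate:

```python
# The table from this slide as (name, first_year, gender, sensitive_value)
people = [
    ("Ben", "Y", "M", 1), ("Bha", "N", "M", 1), ("Ios", "Y", "M", 1),
    ("Jan", "N", "M", 2), ("Jian", "Y", "M", 2), ("Jie", "N", "M", 1),
    ("Joe", "N", "M", 2), ("Moh", "N", "M", 1), ("Son", "N", "F", 1),
    ("Xi", "Y", "F", 3), ("Yao", "N", "M", 2),
]

q2 = max(v for _, _, g, v in people if g == "M")      # answered: 2
q3 = max(v for _, fy, _, v in people if fy == "Y")    # answered: 3

# Every 1st-year male's value is <= q2 < q3, so the maximum must come from
# a 1st-year non-male; Xi is the only such person, hence Xi's value is q3.
non_male_first_years = [n for n, fy, g, _ in people if fy == "Y" and g != "M"]
if q3 > q2 and len(non_male_first_years) == 1:
    print(f"{non_male_first_years[0]} has sensitive value {q3}")
    # Xi has sensitive value 3
```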

SLIDE 8

Value-Based Auditing

  • Let a1, a2, …, ak be the answers to previous queries Q1, Q2, …, Qk.
  • Let ak+1 be the answer to Qk+1.

  ai = f(ci1 x1, ci2 x2, …, cin xn),  i = 1 … k+1
  where cim = 1 if Qi depends on xm, and 0 otherwise.
Check whether any xj has a unique solution.
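For MAX queries, one simple realization of this check is interval propagation: each answer caps the variables it covers, and a variable is pinned whenever it is the only one in a query that can still attain that query’s answer. A sketch (sound but deliberately incomplete; names are illustrative):

```python
import math

def compromised(queries, answers):
    """Value-based audit for MAX queries (a sketch): return the indices of
    variables whose values are uniquely determined by the answers so far.
    Queries are sets of 0-based variable indices."""
    n = max(max(q) for q in queries) + 1
    ub = [math.inf] * n                       # tightest upper bound per x_i
    for q, a in zip(queries, answers):
        for i in q:
            ub[i] = min(ub[i], a)
    leaked = set()
    for q, a in zip(queries, answers):
        # The answer a must be attained by some variable of q whose upper
        # bound still allows it; a single candidate means that x_i = a.
        candidates = [i for i in q if ub[i] >= a]
        if len(candidates) == 1:
            leaked.add(candidates[0])
    return leaked

# The running example of the next slides:
# max(x1..x5) = 10, then max(x1..x4) = 8 pins x5 = 10.
print(compromised([{0, 1, 2, 3, 4}, {0, 1, 2, 3}], [10, 8]))   # {4}
```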


SLIDE 9

Value-based Auditing

  • Data Values: {x1, x2 , x3 , x4 , x5}, Queries: MAX.
  • Allow query if value of xi can’t be inferred.


SLIDE 10

Value-based Auditing

  • Data Values: {x1, x2 , x3 , x4 , x5}, Queries: MAX.
  • Allow query if value of xi can’t be inferred.

max(x1, x2, x3, x4, x5)   Ans: 10

  • -∞ ≤ x1 … x5 ≤ 10

SLIDE 11

Value-based Auditing

  • Data Values: {x1, x2 , x3 , x4 , x5}, Queries: MAX.
  • Allow query if value of xi can’t be inferred.

max(x1, x2, x3, x4, x5)   Ans: 10
max(x1, x2, x3, x4)       Ans: 8 → DENY

  • -∞ ≤ x1 … x4 ≤ 8  =>  x5 = 10

SLIDE 12

Value-based Auditing

  • Data Values: {x1, x2 , x3 , x4 , x5}, Queries: MAX.
  • Allow query if value of xi can’t be inferred.

max(x1, x2, x3, x4, x5)   Ans: 10
max(x1, x2, x3, x4)       Ans: 8 → DENY

Denial means some value can be compromised!

SLIDE 13

Value-based Auditing

  • Data Values: {x1, x2 , x3 , x4 , x5}, Queries: MAX.
  • Allow query if value of xi can’t be inferred.

max(x1, x2, x3, x4, x5)   Ans: 10
max(x1, x2, x3, x4)       Ans: 8 → DENY

What could max(x1, x2, x3, x4) be?

SLIDE 14

Value-based Auditing

  • Data Values: {x1, x2 , x3 , x4 , x5}, Queries: MAX.
  • Allow query if value of xi can’t be inferred.

max(x1, x2, x3, x4, x5)   Ans: 10
max(x1, x2, x3, x4)       Ans: 8 → DENY

From the first answer, max(x1, x2, x3, x4) ≤ 10.

SLIDE 15

Value-based Auditing

  • Data Values: {x1, x2 , x3 , x4 , x5}, Queries: MAX.
  • Allow query if value of xi can’t be inferred.

max(x1, x2, x3, x4, x5)   Ans: 10
max(x1, x2, x3, x4)       Ans: 8 → DENY

If max(x1, x2, x3, x4) = 10, then there is no privacy breach.

SLIDE 16

Value-based Auditing

  • Data Values: {x1, x2 , x3 , x4 , x5}, Queries: MAX.
  • Allow query if value of xi can’t be inferred.

max(x1, x2, x3, x4, x5)   Ans: 10
max(x1, x2, x3, x4)       Ans: 8 → DENY

Hence, max(x1, x2, x3, x4) < 10  =>  x5 = 10!

SLIDE 17

Value-based Auditing

  • Data Values: {x1, x2 , x3 , x4 , x5}, Queries: MAX.
  • Allow query if value of xi can’t be inferred.

max(x1, x2, x3, x4, x5)   Ans: 10
max(x1, x2, x3, x4)       Ans: 8 → DENY

Hence, max(x1, x2, x3, x4) < 10  =>  x5 = 10!

Denials leak information. The attack occurred because the privacy analysis did not assume that the attacker knows the algorithm.

SLIDE 18

Simulatable Auditing [Kenthapadi et al PODS ‘05]

  • An auditor is simulatable if the decision to deny a query Qk is made based only on information already available to the attacker.
    – It can use queries Q1, Q2, …, Qk and answers a1, a2, …, ak-1.
    – It cannot use ak or the actual data to make the decision.

  • Denials provably do not leak information.
    – The attacker could equivalently determine whether the query would be denied.
    – The attacker can mimic, or simulate, the auditor.

SLIDE 19

Simulatable Auditing Algorithm

  • Data Values: {x1, x2 , x3 , x4 , x5}, Queries: MAX.
  • Allow query if value of xi can’t be inferred.

max(x1, x2, x3, x4, x5)   Ans: 10
max(x1, x2, x3, x4)       Before computing the answer:

  Ans > 10  =>  not possible
  Ans = 10  =>  -∞ ≤ x1 … x4 ≤ 10   SAFE
  Ans < 10  =>  x5 = 10             UNSAFE

=> DENY
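The case analysis above can be sketched in code. This is a partial sketch, not the full auditor of Kenthapadi et al.: it checks one sufficient condition for denial (some feasible answer would pin a value), using only past queries and answers, never the new answer or the data:

```python
def simulatable_deny(past_queries, past_answers, new_query):
    """Simulatable MAX auditor (a partial sketch): decide whether to deny
    using only past queries and answers. Queries are sets of variable ids."""
    for q, a in zip(past_queries, past_answers):
        outside = q - new_query
        # If exactly one variable of a past MAX query lies outside the new
        # query, an answer below a would force that variable to equal a,
        # so *some* feasible answer is unsafe -> deny without looking.
        if len(outside) == 1:
            return True
    return False

# Running example: after max(x1..x5) = 10, the auditor denies max(x1..x4)
# before computing its answer -- exactly the case analysis on the slide.
print(simulatable_deny([{1, 2, 3, 4, 5}], [10], {1, 2, 3, 4}))   # True
```

Because the decision reads nothing the attacker does not already have, the attacker can simulate it, so the denial itself reveals nothing.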

SLIDE 20

Summary of Simulatable Auditing

  • In some (many!) cases, the decision to deny must be based only on past queries and answers.

  • Denials can leak information if the adversary does not know all the information used to decide whether to deny the query.

SLIDE 21

Outline

  • Simulatable Auditing
  • Minimality Attack in anonymization
  • Simulatable algorithms for anonymization


SLIDE 22

Minimality attack on Generalization algorithms

  • Algorithms for K-anonymity, L-diversity, T-closeness, etc. try to maximize utility.
    – Find a minimally generalized table in the lattice that satisfies privacy and maximizes utility.

  • But … the attacker also knows this algorithm!

SLIDE 23

Example Minimality attack [Wong et al VLDB07]

  • Dataset with one quasi-identifier and 2 values, q1 and q2.
  • q1, q2 generalize to Q.
  • Sensitive attribute: Cancer – yes/no.
  • We want to ensure P[Cancer = yes] < ½.
    – It is OK to learn that an individual does not have Cancer.

  • Published Table:

    QID  Cancer
    Q    Yes
    Q    Yes
    Q    No
    Q    No
    q2   No
    q2   No

SLIDE 24

Which input datasets could have led to the published table?

Output dataset ({q1, q2} → Q, “2-diverse”):

    QID  Cancer
    Q    Yes
    Q    Yes
    Q    No
    Q    No
    q2   No
    q2   No

Possible input datasets with 3 occurrences of q1:

    QID  Cancer        QID  Cancer
    q1   Yes           q1   Yes
    q1   Yes           q1   No
    q1   No            q1   No
    q2   No            q2   Yes
    q2   No            q2   No
    q2   No            q2   No

SLIDE 25

Which input datasets could have led to the published table?

Output dataset ({q1, q2} → Q, “2-diverse”):

    QID  Cancer
    Q    Yes
    Q    Yes
    Q    No
    Q    No
    q2   No
    q2   No

Possible input dataset with 3 occurrences of q1:

    QID  Cancer
    q1   Yes
    Q    No
    Q    No
    q2   Yes
    q2   No
    q2   No

This is a better generalization!

SLIDE 26

Which input datasets could have led to the published table?

Output dataset ({q1, q2} → Q, “2-diverse”):

    QID  Cancer
    Q    Yes
    Q    Yes
    Q    No
    Q    No
    q2   No
    q2   No

Possible input datasets with 1 occurrence of q1:

    QID  Cancer        QID  Cancer
    q2   Yes           q2   Yes
    q1   Yes           q2   Yes
    q2   No            q1   No
    q2   No            q2   No
    q2   No            q2   No
    q2   No            q2   No

SLIDE 27

Which input datasets could have led to the published table?

Output dataset ({q1, q2} → Q, “2-diverse”):

    QID  Cancer
    Q    Yes
    Q    Yes
    Q    No
    Q    No
    q2   No
    q2   No

Possible input dataset with 3 occurrences of q1:

    QID  Cancer
    q2   Yes
    Q    No
    Q    No
    q2   Yes
    q2   No
    q2   No

This is a better generalization!

SLIDE 28

Which input datasets could have led to the published table?

Output dataset ({q1, q2} → Q, “2-diverse”):

    QID  Cancer
    Q    Yes
    Q    Yes
    Q    No
    Q    No
    q2   No
    q2   No

    QID  Cancer
    q2   Yes
    Q    No
    Q    No
    q2   Yes
    q2   No
    q2   No

There must be exactly two tuples with q1.

SLIDE 29

Which input datasets could have led to the published table?

Output dataset ({q1, q2} → Q, “2-diverse”):

    QID  Cancer
    Q    Yes
    Q    Yes
    Q    No
    Q    No
    q2   No
    q2   No

Possible input datasets with 2 occurrences of q1:

    QID  Cancer        QID  Cancer        QID  Cancer
    q1   Yes           q2   Yes           q1   Yes
    q1   Yes           q2   Yes           q2   Yes
    q2   No            q1   No            q1   No
    q2   No            q1   No            q2   No
    q2   No            q2   No            q2   No
    q2   No            q2   No            q2   No

Already satisfies privacy.

SLIDE 30

Which input datasets could have led to the published table?

Output dataset ({q1, q2} → Q, “2-diverse”):

    QID  Cancer
    Q    Yes
    Q    Yes
    Q    No
    Q    No
    q2   No
    q2   No

Possible input datasets with 2 occurrences of q1:

    QID  Cancer        QID  Cancer
    q1   Yes           q2   Yes
    q1   Yes           q2   Yes
    q2   No            q1   No
    q2   No            q1   No
    q2   No            q2   No
    q2   No            q2   No

Learning Cancer = No is OK. Hence, this is private.

SLIDE 31

Which input datasets could have led to the published table?

Output dataset ({q1, q2} → Q, “2-diverse”):

    QID  Cancer
    Q    Yes
    Q    Yes
    Q    No
    Q    No
    q2   No
    q2   No

Possible input dataset with 2 occurrences of q1:

    QID  Cancer
    q1   Yes
    q1   Yes
    q2   No
    q2   No
    q2   No
    q2   No

This is the ONLY input that results in the output!

P[Cancer = yes | q1] = 1
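The case analysis of the preceding slides can be replayed by brute force. A sketch, with loud assumptions: the publisher is modeled as generalizing the fewest tuples to Q such that no QID group has P[Cancer = yes] above 1/2 (the published Q group sits at exactly 1/2, so ≤ 1/2 is treated as acceptable here); `minimal_publish`, `breaches`, and the tie-breaking order are all illustrative simplifications, not the exact algorithm of Wong et al.:

```python
from itertools import product, combinations

YES, NO = "Yes", "No"
DISEASES = [YES, YES, NO, NO, NO, NO]        # sensitive values of the 6 tuples

def breaches(rows):
    """True if some QID group has P[Cancer = yes] > 1/2."""
    groups = {}
    for q, d in rows:
        groups.setdefault(q, []).append(d)
    return any(g.count(YES) / len(g) > 0.5 for g in groups.values())

def minimal_publish(qids):
    """Generalize the fewest tuples to 'Q' so that no group breaches."""
    rows = list(zip(qids, DISEASES))
    for k in range(len(rows) + 1):           # fewest generalized tuples first
        for idxs in combinations(range(len(rows)), k):
            out = [("Q" if i in idxs else q, d) for i, (q, d) in enumerate(rows)]
            if not breaches(out):
                return sorted(out)

observed = sorted([("Q", YES), ("Q", YES), ("Q", NO), ("Q", NO),
                   ("q2", NO), ("q2", NO)])
inputs = [q for q in product(["q1", "q2"], repeat=6)
          if minimal_publish(q) == observed]
print(inputs)   # [('q1', 'q1', 'q2', 'q2', 'q2', 'q2')]
```

Exactly one input survives the filter, and in it both q1 tuples have Cancer = Yes, reproducing P[Cancer = yes | q1] = 1.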

SLIDE 32

Outline

  • Simulatable Auditing
  • Minimality Attack in anonymization
  • Transparent Anonymization: simulatable algorithms for anonymization

SLIDE 33

Transparent Anonymization

  • Assume that the adversary knows the algorithm that is being used.

O: the output table
I(O, A): the input tables that result in O under algorithm A
I: all possible input tables

SLIDE 34

Transparent Anonymization

  • Privacy must be guaranteed with respect to I(O, A).
    – Probabilities must be computed assuming I(O, A) is the actual set of all possible input tables.

  • What is an efficient algorithm for Transparent Anonymization?
    – For L-diversity?

SLIDE 35

Ace Algorithm [Xiao et al TODS’10]

Step 1: Assign – Based only on the sensitive values, construct (in a randomized fashion) an intermediate L-diverse generalization.

Step 2: Split – Based only on the quasi-identifier values (without looking at the sensitive values), deterministically refine the intermediate solution to maximize utility.

SLIDE 36

Step 1: Assign

  • Input Table


SLIDE 37

Step 1: Assign

  • St is the set of all tuples (grouped by sensitive value)
  • Iteratively,

– Remove α tuples each from the β (≥L) most frequent sensitive values


SLIDE 38

Step 1: Assign

  • St is the set of all tuples (grouped by sensitive value)
  • Iteratively,

– Remove α tuples each from the β (≥L) most frequent sensitive values

– 1st iteration β=2, α=2


SLIDE 39

Step 1: Assign

  • St is the set of all tuples (grouped by sensitive value)
  • Iteratively,

– Remove α tuples each from the β (≥L) most frequent sensitive values

– 2nd iteration β=2, α=1


SLIDE 40

Step 1: Assign

  • St is the set of all tuples (grouped by sensitive value)
  • Iteratively,

– Remove α tuples each from the β (≥L) most frequent sensitive values

– 3rd iteration β=2, α=1

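The iterations above can be sketched in code. Note the simplifications: Ace's Assign is randomized and chooses β ≥ L values and α tuples per value; this deterministic sketch fixes α = 1 and β = L, and the function names are illustrative:

```python
from collections import defaultdict

def assign(tuples_, sens, l):
    """Ace 'Assign' step (a simplified sketch): while at least l sensitive
    values still have tuples left, move one tuple of each of the l most
    frequent values into a fresh bucket."""
    groups = defaultdict(list)
    for t in tuples_:
        groups[sens(t)].append(t)
    buckets = []
    while sum(1 for g in groups.values() if g) >= l:
        top = sorted((v for v in groups if groups[v]),
                     key=lambda v: -len(groups[v]))[:l]
        buckets.append([groups[v].pop() for v in top])
    leftover = [t for g in groups.values() for t in g]   # Ace folds these
    return buckets, leftover                             # into earlier buckets

# The table from the intermediate-generalization slide
data = [("Ann", "Dyspepsia"), ("Bob", "Dyspepsia"), ("Gill", "Flu"),
        ("Ed", "Flu"), ("Don", "Bronchitis"), ("Fred", "Gastritis"),
        ("Hera", "Diabetes"), ("Cate", "Gastritis")]
buckets, rest = assign(data, lambda t: t[1], 2)
# Every bucket contains 2 distinct diseases, i.e. is 2-diverse.
```

Taking from the most frequent values first keeps any single value from dominating, so every bucket ends up with L distinct sensitive values.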

SLIDE 41

Intermediate Generalization

Name  Age  Zip    Disease
Ann   21   10000  Dyspepsia
Bob   27   18000  Dyspepsia
Gill  60   63000  Flu
Ed    54   60000  Flu
Don   32   35000  Bronchitis
Fred  60   63000  Gastritis
Hera  60   63000  Diabetes
Cate  32   35000  Gastritis

SLIDE 42

Step 2: Split

  • If a bucket contains α > 1 tuples of each sensitive value, split it into two buckets, Ba and Bb, such that:
    – Pick 1 ≤ αa < α tuples from each sensitive value in bucket B and put them in bucket Ba; the remaining tuples go to Bb.
    – The division (Ba, Bb) is optimal in terms of utility.

Name  Age  Zip
Ann   21   10000
Bob   27   18000
Gill  60   63000
Ed    54   60000
Don   32   35000
Fred  60   63000
Hera  60   63000
Cate  32   35000
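A sketch of the Split rule above. Assumptions: the utility metric (sum of QID spreads of the two halves) and the brute-force enumeration are illustrative choices, not the paper's; crucially, the cost depends only on quasi-identifier values, never on which tuple carries which sensitive value:

```python
from collections import defaultdict
from itertools import combinations, product

def split(bucket, sens, qid):
    """Ace 'Split' step (sketch): if every sensitive value occurs alpha > 1
    times, enumerate the divisions that put 1 <= alpha_a < alpha tuples of
    each value into B_a (rest into B_b) and keep the cheapest division."""
    by_val = defaultdict(list)
    for t in bucket:
        by_val[sens(t)].append(t)
    alpha = min(len(ts) for ts in by_val.values())
    if alpha < 2:
        return [bucket]                      # nothing to split

    def spread(ts):
        qs = [qid(t) for t in ts]
        return max(qs) - min(qs)

    best, best_cost = None, float("inf")
    vals = list(by_val)
    for aa in range(1, alpha):
        per_val = [combinations(by_val[v], aa) for v in vals]
        for pick in product(*per_val):
            ba = [t for chosen in pick for t in chosen]
            bb = [t for t in bucket if t not in ba]
            cost = spread(ba) + spread(bb)   # QID-only utility
            if cost < best_cost:
                best, best_cost = [ba, bb], cost
    return best

bucket = [("Ann", 21, "Dyspepsia"), ("Bob", 27, "Dyspepsia"),
          ("Gill", 60, "Flu"), ("Ed", 54, "Flu")]
ba, bb = split(bucket, sens=lambda t: t[2], qid=lambda t: t[1])
# Each half keeps at least one tuple of every sensitive value,
# so both halves remain 2-diverse.
```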

SLIDE 43

Why does the Ace algorithm satisfy Transparent L-Diversity?

  • Privacy must be guaranteed with respect to I(O, A).
    – Probabilities must be computed assuming I(O, A) is the actual set of all possible input tables.

O: the output table
I(O, A): the input tables that result in O under algorithm A
I: all possible input tables

SLIDE 44

Ace algorithm analysis

Lemma 1: The assign step satisfies transparent L-diversity. Proof (sketch):

  • Consider an intermediate output Int.
  • Suppose there is some input table T such that Assign(T) = Int.
  • Any other table T’, where the sensitive values of 2 individuals in the same group are swapped, also leads to the same intermediate output Int.

SLIDE 45

Ace algorithm analysis


Both tables result in the same intermediate output.

SLIDE 46

Ace algorithm analysis

Lemma 1: The assign step satisfies transparent L-diversity. Proof (sketch):

  • Consider an intermediate output Int.
  • Suppose there is some input table T such that Assign(T) = Int.
  • Any other table T’, where the sensitive values of 2 individuals in the same group are swapped, also leads to the same intermediate output.
  • The set of input tables I(Int, A) contains all possible assignments of diseases to individuals within each group of Int.

SLIDE 47

Ace algorithm analysis

Lemma 1: The assign step satisfies transparent L-diversity. Proof (sketch):

  • The set of input tables I(Int, A) contains all possible assignments of diseases to individuals in each group of Int.
  • P[Ann has dyspepsia | I(Int, A) and Int] = 1/2

Name  Age  Zip    Disease
Ann   21   10000  Dyspepsia
Bob   27   18000  Dyspepsia
Gill  60   63000  Flu
Ed    54   60000  Flu

SLIDE 48

Ace algorithm analysis

Lemma 2: The split phase also satisfies transparent L-diversity. Proof (sketch):

  • I(Int, Assign) contains all tables where an individual is assigned an arbitrary sensitive value within the same group in Int.
  • Suppose some input table T ∈ I(Int, Assign) results in the final output O after Split.

SLIDE 49

Ace algorithm analysis

  • Split does not depend on the sensitive values.

[Figure: the group {Ann, Gill, Bob, Ed} with sensitive values {dyspepsia, flu} splits into buckets {Ann, Bob} and {Gill, Ed}; after swapping sensitive values between individuals, the split produces the same buckets.]

SLIDE 50

Ace algorithm analysis

If T ∈ I(Int, Assign) and it results in O after Split, then T’ ∈ I(Int, Assign) and it also results in O after Split.

SLIDE 51

Ace algorithm analysis

  • Lemma 2: The Split phase also satisfies transparent L-diversity.

Proof (sketch):

  • Let T’ be generated by “swapping diseases” in some bucket.
  • If T ∈ I(Int, Assign) and it results in O after Split, then T’ ∈ I(Int, Assign) and it also results in O after Split.
  • For any individual, it is equally likely that the sensitive value is one of ≥ L choices.
  • Therefore, P[individual has disease | I(O, Ace)] < 1/L

SLIDE 52

Summary

  • Many systems assume privacy/security is guaranteed because the adversary does not know the algorithm.
    – This is bad …

  • Simulatable algorithms avoid this problem.
    – Ideally, choices made by the algorithm should be simulatable by the adversary.

  • Anonymization algorithms are also susceptible to adversaries who know the algorithm or the objective function.

  • Transparent anonymization limits the inference an attacker (who knows the algorithm) can make about sensitive values.

SLIDE 53

Next Class

  • Composition of privacy
  • Differential Privacy


SLIDE 54

References

  • A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, “L-Diversity: Privacy beyond k-anonymity”, ICDE 2006
  • K. Kenthapadi, N. Mishra, K. Nissim, “Simulatable Auditing”, PODS 2005
  • R. Wong, A. Fu, K. Wang, J. Pei, “Minimality attack in privacy preserving data publishing”, VLDB 2007
  • X. Xiao, Y. Tao, N. Koudas, “Transparent Anonymization: Thwarting adversaries who know the algorithm”, TODS 2010