- Bee-Chung Chen 2007 beechung@cs.wisc.edu
Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge
Bee-Chung Chen, Kristen LeFevre (University of Wisconsin – Madison)
Raghu Ramakrishnan (Yahoo! Research)
Name   Age  Gender  Zipcode  Disease
Ann    20   F       12345    AIDS
Bob    24   M       12342    Flu
Cary   23   F       12344    Flu
Dick   27   M       12343    AIDS
Ed     35   M       12412    Flu
Frank  34   M       12433    Cancer
Gary   31   M       12453    Cancer
Tom    38   M       12455    ?
Release candidate D* (QI table and sensitive table, linked only by group):

Group  Age  Gender  Zipcode          Group  Disease
1      20   F       12345  (Ann)     1      AIDS
1      24   M       12342  (Bob)     1      Flu
1      23   F       12344  (Cary)    1      Flu
1      27   M       12343  (Dick)    1      AIDS
2      35   M       12412  (Ed)      2      Flu
2      34   M       12433  (Frank)   2      Cancer
2      31   M       12453  (Gary)    2      Cancer
2      38   M       12455  (Tom)     2      AIDS

Pr(Tom has AIDS | above data and above knowledge) = 1
– This probability is the adversary's confidence that person t has sensitive value s, after he sees the released dataset
– Equivalent definition: D* is safe if max_{t,s} Pr(t has s | D*, Adversarial Knowledge) < c
– Prior work following this intuition: [Machanavajjhala et al., 2006; Martin et al., 2007; Xiao and Tao, 2006]
Maximum breach probability
How to describe various kinds of adversarial knowledge
– We provide intuitive knowledge expressions that cover three kinds of common adversarial knowledge
How to analyze data safety in the presence of various kinds of adversarial knowledge
– We propose a skyline tool for what-if analysis in the "knowledge space"
How to generate a safe dataset to release
– We develop algorithms (based on a "congregation" property) that are orders of magnitude faster than the best known dynamic-programming technique [Martin et al., 2007]
– How the privacy breach is defined
Original dataset D:

Name   Age  Gender  Zipcode  Disease
Ann    20   F       12345    AIDS
Bob    24   M       12342    Flu
Cary   23   F       12344    Flu
Dick   27   M       12343    AIDS
Ed     35   M       12412    Flu
Frank  34   M       12433    Cancer
Gary   31   M       12453    Cancer
Tom    38   M       12455    AIDS

Release candidate D*:

Group  Age  Gender  Zipcode          Group  Disease
1      20   F       12345  (Ann)     1      AIDS
1      24   M       12342  (Bob)     1      Flu
1      23   F       12344  (Cary)    1      Flu
1      27   M       12343  (Dick)    1      AIDS
2      35   M       12412  (Ed)      2      Flu
2      34   M       12433  (Frank)   2      Cancer
2      31   M       12453  (Gary)    2      Cancer
2      38   M       12455  (Tom)     2      AIDS
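The grouping mechanism that turns D into D* can be sketched as follows; `make_release_candidate` and the tuple layout are illustrative names for this sketch, not the paper's API:

```python
# Sketch of the grouping mechanism producing a release candidate D* from the
# original dataset D: identities are dropped, and quasi-identifiers and
# sensitive values are published in two tables linked only by a group id.

def make_release_candidate(records, groups):
    """records: (name, age, gender, zipcode, disease) tuples;
    groups: lists of record indices forming a partition of the records."""
    qi_table, sensitive_table = [], []
    for gid, idxs in enumerate(groups, start=1):
        for i in idxs:
            name, age, gender, zipcode, disease = records[i]
            qi_table.append((gid, age, gender, zipcode))   # identity dropped
            sensitive_table.append((gid, disease))         # linked by group only
    return qi_table, sensitive_table

D = [("Ann", 20, "F", 12345, "AIDS"), ("Bob", 24, "M", 12342, "Flu"),
     ("Cary", 23, "F", 12344, "Flu"), ("Dick", 27, "M", 12343, "AIDS"),
     ("Ed", 35, "M", 12412, "Flu"), ("Frank", 34, "M", 12433, "Cancer"),
     ("Gary", 31, "M", 12453, "Cancer"), ("Tom", 38, "M", 12455, "AIDS")]
qi, sens = make_release_candidate(D, [[0, 1, 2, 3], [4, 5, 6, 7]])
```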
Sensitive attribute: single-valued (in the talk); set-valued (in the paper)
Reconstruction
A reconstruction of D* is intuitively a possible original dataset (possible world) that would generate D* by using the grouping mechanism
Release candidate D*:

Group  Age  Gender  Zipcode          Group  Disease
1      20   F       12345  (Ann)     1      AIDS
1      24   M       12342  (Bob)     1      Flu
1      23   F       12344  (Cary)    1      Flu
1      27   M       12343  (Dick)    1      AIDS
2      35   M       12412  (Ed)      2      Flu
2      34   M       12433  (Frank)   2      Cancer
2      31   M       12453  (Gary)    2      Cancer
2      38   M       12455  (Tom)     2      AIDS

Reconstructions of Group 2 (fix the QI rows, permute the sensitive values), e.g.:
  Ed = AIDS, Frank = Cancer, Gary = Cancer, Tom = Flu
  Ed = Flu, Frank = Cancer, Gary = Cancer, Tom = AIDS
Assumption: Without any additional knowledge, every reconstruction is equally likely
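The reconstructions of a single QI-group can be enumerated directly, which is a convenient way to see how many equally likely possible worlds D* admits (Group 2 of the running example is used):

```python
from itertools import permutations

# Enumerate reconstructions of one QI-group: the QI rows stay fixed and the
# group's multiset of sensitive values is permuted over them.

def reconstructions(people, diseases):
    """All distinct assignments of the group's diseases to its people."""
    return {tuple(zip(people, perm)) for perm in set(permutations(diseases))}

group2 = reconstructions(["Ed", "Frank", "Gary", "Tom"],
                         ["Flu", "Cancer", "Cancer", "AIDS"])
print(len(group2))  # 12 = 4!/2!, since "Cancer" appears twice
```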
E.g., K = (Tom[S] ≠ Cancer) ∧ (Ed[S] = Flu)
– Knowledge expressions may also include variables
E.g., K = (Tom[S] ≠ x) ∧ (u[S] ≠ y) ∧ (v[S] = s → Tom[S] = s)
– Maximum breach probability
The maximization is over variables t, u, v, s, x, y, by substituting them with constants in the dataset

Pr( Tom[S] = AIDS | K, D* ) ≡ (# of reconstructions of D* that satisfy K ∧ (Tom[S] = AIDS)) / (# of reconstructions of D* that satisfy K)
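This counting definition can be checked by brute force on the running example, using the knowledge K = (Tom[S] ≠ Cancer) ∧ (Ed[S] = Flu) from the earlier slide:

```python
from itertools import permutations

# Compute Pr(Tom[S] = AIDS | K, D*) by counting: enumerate the equally
# likely reconstructions of Tom's QI-group and keep those consistent with K.

people = ["Ed", "Frank", "Gary", "Tom"]
diseases = ["Flu", "Cancer", "Cancer", "AIDS"]

def satisfies_K(world):                     # world: dict person -> disease
    return world["Tom"] != "Cancer" and world["Ed"] == "Flu"

worlds = [dict(zip(people, p)) for p in set(permutations(diseases))]
consistent = [w for w in worlds if satisfies_K(w)]
breach = sum(w["Tom"] == "AIDS" for w in consistent) / len(consistent)
print(breach)  # 1.0: K leaves only one reconstruction, in which Tom has AIDS
```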
max Pr( t[S] = s | D*, K ) < c
– K is a conjunction of m implications
E.g., K = (u1[S] = x1 → v1[S] = y1) ∧ … ∧ (um[S] = xm → vm[S] = ym)
– Not intuitive: What is the practical meaning of m implications?
– Some limitations: Some simple knowledge cannot be expressed
– Computing breach probability is NP-hard
– Useful (intuitive & cover common adversarial knowledge)
– Computationally feasible
– Tradeoff between expressiveness and feasibility
Assume a person has only one record in the dataset in this talk (multiple sensitive values per person are covered in the paper)
Release candidate D*:

Group  Age  Gender  Zipcode          Group  Disease
1      20   F       12345  (Ann)     1      AIDS
1      24   M       12342  (Bob)     1      Flu
1      23   F       12344  (Cary)    1      Flu
1      27   M       12343  (Dick)    1      AIDS
2      35   M       12412  (Ed)      2      Flu
2      34   M       12433  (Frank)   2      Cancer
2      31   M       12453  (Gary)    2      Cancer
2      38   M       12455  (Tom)     2      AIDS
– Knowledge about the target: ℓ sensitive values that t does not have
  t[S] ≠ x1 ∧ … ∧ t[S] ≠ xℓ
– Knowledge about others: the sensitive values of k other people
  u1[S] = y1 ∧ … ∧ uk[S] = yk
– Knowledge about relationships: a group of m people who have the same sensitive value as t
  (v1[S] = s → t[S] = s) ∧ … ∧ (vm[S] = s → t[S] = s)
– No matter what those ℓ sensitive values, those k people, and those m people are, the adversary should not be able to predict any person t to have any sensitive value s with confidence ≥ c
– Skyline privacy criterion – Skyline exploratory tool
[Figure: the (ℓ, k, m) knowledge space, with points (3, 4, 2) and (4, 3, 2) shown on the plane m = 2]
Example: (ℓ, k, m) = (3, 4, 2) and c = 0.5. A release candidate is safe if no adversary whose knowledge is bounded by (ℓ, k, m) = (3, 4, 2) can predict any person t to have any sensitive value s with confidence ≥ 0.5
k-anonymity and ℓ-diversity are two special cases of this criterion
[Figure: skyline formed by the points (3, 4, 2), (2, 7, 2), (5, 2, 2) on the plane m = 2 of the (ℓ, k, m) knowledge space]
– The data owner specifies a set of skyline points
– The system checks whether a release candidate is safe

– Given a release candidate
– Find the set of skyline points such that the release candidate is safe w.r.t. any point beneath the skyline, and unsafe w.r.t. any point above the skyline
[Figure: the (ℓ, k, m) knowledge space on the plane m = 2, split by the skyline into a safe region beneath it and an unsafe region above it]
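The "beneath/above the skyline" relation is coordinate-wise dominance, so the skyline itself is just the set of maximal safe points. A minimal sketch (the point set below is made up for illustration):

```python
# Skyline in the (l, k, m) knowledge space: point p dominates q when p is at
# least as large in every coordinate; the knowledge skyline is the set of
# maximal safe points, so anything dominated by a skyline point is safe.

def dominates(p, q):
    """True if p represents at least as much knowledge as q in every dimension."""
    return all(a >= b for a, b in zip(p, q))

def skyline(safe_points):
    return [p for p in safe_points
            if not any(dominates(q, p) for q in safe_points if q != p)]

safe = [(3, 4, 2), (2, 7, 2), (5, 2, 2), (1, 1, 1), (2, 4, 2)]
print(skyline(safe))  # [(3, 4, 2), (2, 7, 2), (5, 2, 2)]
```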
– SkylineCheck (in this talk)
– SkylineAnonymize (in the paper)
– SkylineFind (in the technical report)
– ℒt,s(ℓ,k,m) = Kt(ℓ) ∧ Ku(k) ∧ Kv,t(m)
– Variables: person variables t, u1, …, uk, v1, …, vm and sensitive-value variables s, x1, …, xℓ
– How to find the variable assignment that maximizes the breach probability
– Variables: person variables t, u1, …, uk, v1, …, vm and sensitive-value variables s, x1, …, xℓ
– Consider all possible ways of assigning person variables to QI-groups
– For each assignment of person variables, find the assignment of sensitive-value variables that maximizes the breach probability
Release candidate D*:

Group 1: 20 F 12345; 24 M 12342; 23 F 12344; 27 M 12343 | Diseases: AIDS, Flu, Flu, AIDS
Group 2: 35 M 12412; 34 M 12433; 31 M 12453; 38 M 12455 | Diseases: Flu, Cancer, Cancer, AIDS
Group 3: 20 F 12345; 24 M 12342; 23 F 12344; 27 M 12343 | Diseases: AIDS, Flu, Flu, AIDS
Group 4: 35 M 12412; 34 M 12433; 31 M 12453; 38 M 12455 | Diseases: Flu, Cancer, Cancer, AIDS

Example assignment of person variables:
Group 1: t, u1
Group 2: u2, v1, v2
Group 3: u3, u4
Group 4: v3, v4
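The brute-force search can be made concrete for the smallest interesting case (ℓ, k, m) = (1, 1, 1) on the two-group D* from the earlier slides. This is only an illustrative enumeration, not the paper's algorithm (the congregation property is what makes the real algorithms fast):

```python
from itertools import permutations, product

# Brute force for (l, k, m) = (1, 1, 1): substitute the person variables
# t, u, v and value variables s, x with dataset constants, keep only
# substitutions whose knowledge holds in the original data, and count
# reconstructions to get each breach probability.

groups = [(["Ann", "Bob", "Cary", "Dick"], ["AIDS", "Flu", "Flu", "AIDS"]),
          (["Ed", "Frank", "Gary", "Tom"], ["Flu", "Cancer", "Cancer", "AIDS"])]
truth = {"Ann": "AIDS", "Bob": "Flu", "Cary": "Flu", "Dick": "AIDS",
         "Ed": "Flu", "Frank": "Cancer", "Gary": "Cancer", "Tom": "AIDS"}

per_group = [{tuple(zip(ppl, p)) for p in permutations(ds)} for ppl, ds in groups]
WORLDS = [dict(x for grp in combo for x in grp) for combo in product(*per_group)]

people, values = list(truth), sorted(set(truth.values()))
best = 0.0
for t, u, v, s, x in product(people, people, people, values, values):
    if len({t, u, v}) < 3 or truth[t] == x:      # knowledge must hold in D ...
        continue
    if truth[v] == s and truth[t] != s:          # ... including the implication
        continue
    cons = [w for w in WORLDS
            if w[t] != x                         # t[S] != x         (l = 1)
            and w[u] == truth[u]                 # u[S] = y          (k = 1)
            and (w[v] != s or w[t] == s)]        # v[S]=s -> t[S]=s  (m = 1)
    best = max(best, sum(w[t] == s for w in cons) / len(cons))
print(best)  # 1.0: some (1,1,1)-adversary can pin down a person's disease
```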
– Variables: person variables t, u1, …, uk, v1, …, vm
– Congregation property: in a maximizing assignment,
  – All u1, …, uk would congregate in a single QI-group
  – All v1, …, vm would congregate in a single QI-group
  – t would be in one of the above two QI-groups
Release candidate D*:

Group 1: 20 F 12345; 24 M 12342; 23 F 12344; 27 M 12343 | Diseases: AIDS, Flu, Flu, AIDS
Group 2: 35 M 12412; 34 M 12433; 31 M 12453; 38 M 12455 | Diseases: Flu, Cancer, Cancer, AIDS
Group 3: 20 F 12345; 24 M 12342; 23 F 12344; 27 M 12343 | Diseases: AIDS, Flu, Flu, AIDS
Group 4: 35 M 12412; 34 M 12433; 31 M 12453; 38 M 12455 | Diseases: Flu, Cancer, Cancer, AIDS

Example assignment of person variables:
Group 2: t, u1, …, uk
Group 4: v1, …, vm
– Case 1:
max Pr(…) = 1 / [ (minA CF1(A)) + 1]
– Case 2:
max Pr(…) = 1 / [ (minB CF2(B))⋅(minC CF3(C)) + 1]
– Case 3:
max Pr(…) = 1 / [ (minD CF4(D)) ⋅ (minE CF5(E)) + 1 ]
(For a fixed QI-group, CF1, …, CF5 are closed-form formulas)
– Scan the dataset once
– During the scan, update the 5 sufficient statistics for each skyline point
– Compute the maximum breach probability based on these statistics
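A minimal illustration of the scan structure, without reproducing the closed-form formulas CF1–CF5 from the paper: for the zero-knowledge case (ℓ, k, m) = (0, 0, 0), the breach probability for a value in a group is just its relative frequency there, so one pass collecting per-group value counts suffices.

```python
from collections import Counter

# One scan maintains per-group sensitive-value counts (sufficient
# statistics); the check below is the zero-knowledge special case, where
# max Pr(t[S] = s) in a group equals the top value's relative frequency.

def unsafe_groups(sensitive_table, c):
    """sensitive_table: iterable of (group_id, value); c: confidence threshold."""
    counts = {}
    for gid, value in sensitive_table:           # the single scan
        counts.setdefault(gid, Counter())[value] += 1
    return [gid for gid, ctr in counts.items()
            if ctr.most_common(1)[0][1] / sum(ctr.values()) >= c]

table = [(1, "AIDS"), (1, "Flu"), (1, "Flu"), (1, "AIDS"),
         (2, "Flu"), (2, "Cancer"), (2, "Cancer"), (2, "AIDS")]
print(unsafe_groups(table, 0.5))  # [1, 2]: each group's top value hits 2/4
```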
– Adaptation of the Mondrian algorithm by LeFevre et al. – The congregation property makes the adaptation easy
[Figure: Mondrian recursively partitions the QI domain, e.g., Gender into M / F, Zipcode into 53***, 54***, 55***, 56***, and Age into < 40 / ≥ 40, producing the QI-groups of D*]
[Martin et al., 2007]
[Figure: improvement ratio (y-axis, roughly 50–200) vs. number of records (1M–5M)]
Improvement ratio = Execution time of DP / Execution time of ours
[Figure: elapsed time (sec) vs. dataset size (millions of records, up to 100M), for knowledge thresholds (ℓ,k,m) = (0,1000,0) and (ℓ,k,m) = (3,1000,10)]
Confidence threshold: 1
Main memory size: 512 MB
Record size: 44 bytes per record (100M records ≈ 4.4 GB)
– Useful expressions that satisfy the congregation property
– Other kinds of adversarial knowledge
– Other kinds of data
[Figure: improvement ratio over DP, four panels:
  (ℓ: x-axis 4–16, k = 10, m = 10), ratio roughly 50–200
  (ℓ = 10, k = 10, m = 10), ratio roughly 50–200 vs. number of records 1M–5M
  (ℓ = 10, k: x-axis 8–32, m = 10), ratio roughly 200–1000
  (ℓ = 10, k = 10, m: x-axis 4–16), ratio roughly 50–250]
Improvement ratio = Execution time of DP / Execution time of ours
– Size: 45,222 records
– Sensitive attribute: Occupation
– D* is only safe for an adversary with knowledge beneath the knowledge skyline
– E.g., if the adversary knows 5 people's occupations, he may predict some person t's occupation with confidence ≥ 0.95
Knowledge skyline of D*, as (ℓ, k, m) points: (0, 4, 0), (1, 3, 1), (2, 2, 2), (3, 1, 2), (2, 1, 3), (4, 0, 3), (3, 0, 4)
– Each QI-group has at least k people – k-Anonymity is a special case of our 3D privacy criterion with knowledge (0, k−2, 0) and confidence 1
– Each QI-group has ℓ well-represented sensitive values – (c,ℓ)-Diversity is a special case of our 3D privacy criterion with knowledge (ℓ−2, 0, 0) and confidence c/(c+1)
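The two special cases can be checked directly on a release candidate. The helper names below are illustrative, and "distinct sensitive values" is a simplified stand-in for ℓ-diversity's "well-represented" notion:

```python
# Per-group checks for the two special cases: k-anonymity requires every
# QI-group to have at least k records; the (simplified) l-diversity check
# requires at least l distinct sensitive values per group.

def is_k_anonymous(group_sizes, k):
    """Every QI-group has at least k records."""
    return all(size >= k for size in group_sizes)

def is_l_diverse(group_values, l):
    """Every QI-group has at least l distinct sensitive values (simplified)."""
    return all(len(set(vals)) >= l for vals in group_values)

groups = [["AIDS", "Flu", "Flu", "AIDS"], ["Flu", "Cancer", "Cancer", "AIDS"]]
print(is_k_anonymous([len(g) for g in groups], 4))  # True
print(is_l_diverse(groups, 3))                      # False: group 1 has 2 values
```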
– Add noise to query outputs so that no one can tell whether a record is in the original dataset with a high probability
– Xiao and Tao (SIGMOD’06 and VLDB’06) – Li et al. (ICDE’07)
– Require complete independence between sensitive information and the released dataset
Machanavajjhala and Gehrke (PODS’06)
– Bound the asymptotic probability of the answer of a Boolean query given views when the domain size → ∞
– K = (A1[S] = C1 ↔ B1[S] = D1) ∧ … ∧ (Am[S] = Cm ↔ Bm[S] = Dm)