CS573 Data Privacy and Security Midterm Review
Li Xiong
Department of Mathematics and Computer Science Emory University
Principles of Data Security: CIA Triad
– Confidentiality: prevent the disclosure of information to unauthorized users
11/8/2016 3
Original Data → De-identification / anonymization → Sanitized Records
Original Data → Differentially Private Data Sharing → Statistics / Models / Synthetic Records
11/9/2016 8
Original Data → Encryption → Encrypted Data
Encrypted Data → Computation / Queries
[Figure: parties holding inputs x1, x2, x3, …, xn jointly compute f(x1, x2, …, xn)]
Original records:
Race     Zipcode  Diagnosis
Caucas   78712    Flu
Asian    78705    Shingles
Caucas   78754    Flu
Asian    78705    Acne
AfrAm    78705    Acne
Caucas   78705    Flu

Anonymized records:
Race         Zipcode  Diagnosis
Caucas       787XX    Flu
Asian/AfrAm  78705    Shingles
Caucas       787XX    Flu
Asian/AfrAm  78705    Acne
Asian/AfrAm  78705    Acne
Caucas       787XX    Flu

Quasi-identifiers (QID) = race, zipcode
Sensitive attribute = diagnosis
K-anonymity: the size of each QID group is at least k
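The k-anonymity condition can be checked mechanically by grouping rows on their quasi-identifier values. The following is a minimal sketch, not a reference implementation; the function names and the tuple layout (race, zipcode, diagnosis) are assumptions chosen for illustration, and the rows are the anonymized records from the example.

```python
from collections import Counter

# Anonymized records from the example: (race, zipcode, diagnosis).
rows = [
    ("Caucas", "787XX", "Flu"),
    ("Asian/AfrAm", "78705", "Shingles"),
    ("Caucas", "787XX", "Flu"),
    ("Asian/AfrAm", "78705", "Acne"),
    ("Asian/AfrAm", "78705", "Acne"),
    ("Caucas", "787XX", "Flu"),
]


def qid_group_sizes(table, qid_indices):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[i] for i in qid_indices) for row in table)


def is_k_anonymous(table, qid_indices, k):
    """True when every quasi-identifier group contains at least k rows."""
    return min(qid_group_sizes(table, qid_indices).values()) >= k


# Race and zipcode are columns 0 and 1; both groups contain three rows.
print(is_k_anonymous(rows, (0, 1), 3))
```

Note that the Caucas/787XX group satisfies the size requirement even though every row in it has the same diagnosis.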
External record known to the attacker:
…                  …       …
Rusty Shackleford  Caucas  78705
…                  …       …

Anonymized records:
Race         Zipcode  Diagnosis
Caucas       787XX    Flu
Asian/AfrAm  78705    Shingles
Caucas       787XX    Flu
Asian/AfrAm  78705    Acne
Asian/AfrAm  78705    Acne
Caucas       787XX    Flu
Problem: sensitive attributes are not “diverse” within each quasi-identifier group
Race         Zipcode  Diagnosis
Caucas       787XX    Flu
Caucas       787XX    Shingles
Caucas       787XX    Acne
Caucas       787XX    Flu
Caucas       787XX    Acne
Caucas       787XX    Flu
Asian/AfrAm  78XXX    Flu
Asian/AfrAm  78XXX    Flu
Asian/AfrAm  78XXX    Acne
Asian/AfrAm  78XXX    Shingles
Asian/AfrAm  78XXX    Acne
Entropy l-diversity: the entropy of the sensitive attribute within each quasi-identifier group must be at least log(l)
[Machanavajjhala et al. ICDE ‘06]
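As a rough illustration of the entropy criterion, the sketch below computes the entropy of the diagnosis column in each quasi-identifier group of the table above and compares it with log(l), which is the threshold used by Machanavajjhala et al. The function names and the row layout are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Table from the example: (race, zipcode, diagnosis).
rows = (
    [("Caucas", "787XX", d) for d in ["Flu", "Shingles", "Acne", "Flu", "Acne", "Flu"]]
    + [("Asian/AfrAm", "78XXX", d) for d in ["Flu", "Flu", "Acne", "Shingles", "Acne"]]
)


def group_entropies(table):
    """Entropy (natural log) of the sensitive attribute in each QID group."""
    groups = defaultdict(list)
    for race, zipcode, diagnosis in table:
        groups[(race, zipcode)].append(diagnosis)
    entropies = {}
    for qid, values in groups.items():
        n = len(values)
        entropies[qid] = -sum(
            (c / n) * math.log(c / n) for c in Counter(values).values()
        )
    return entropies


def is_entropy_l_diverse(table, l):
    """True when every group's entropy is at least log(l)."""
    return all(h >= math.log(l) for h in group_entropies(table).values())


print(group_entropies(rows))
print(is_entropy_l_diverse(rows, 2))
```

Each group holds three distinct diagnoses, but because the values are not uniformly distributed the entropy of both groups falls slightly below log(3).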
Original dataset    Anonymization A    Anonymization B
…  HIV-             Q1  HIV+           Q1  HIV-
…  HIV-             Q1  HIV-           Q1  HIV-
…  HIV-             Q1  HIV+           Q1  HIV-
…  HIV-             Q1  HIV-           Q1  HIV+
…  HIV-             Q1  HIV+           Q1  HIV-
…  HIV+             Q1  HIV-           Q1  HIV-
…  HIV-             Q2  HIV-           Q2  HIV-
…  HIV-             Q2  HIV-           Q2  HIV-
…  HIV-             Q2  HIV-           Q2  HIV-
…  HIV-             Q2  HIV-           Q2  HIV-
…  HIV-             Q2  HIV-           Q2  HIV-
…  HIV-             Q2  HIV-           Q2  Flu

Original dataset: 99% have HIV-
50% HIV-: the quasi-identifier group is “diverse”, yet this leaks a ton of information
99% HIV-: the quasi-identifier group is not “diverse”, yet the anonymized database does not leak anything
Race         Zipcode  Diagnosis
Caucas       787XX    Flu
Caucas       787XX    Shingles
Caucas       787XX    Acne
Caucas       787XX    Flu
Caucas       787XX    Acne
Caucas       787XX    Flu
Asian/AfrAm  78XXX    Flu
Asian/AfrAm  78XXX    Flu
Asian/AfrAm  78XXX    Acne
Asian/AfrAm  78XXX    Shingles
Asian/AfrAm  78XXX    Acne
[Li et al. ICDE ‘07]
Distribution of sensitive attributes within each quasi-identifier group should be “close” to their distribution in the entire original database
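A hedged sketch of the closeness check follows. Li et al. measure closeness with the Earth Mover's Distance; for a categorical attribute where all values are equally far apart, that distance reduces to the variational distance used here. The function names are illustrative assumptions, and the data is an HIV+/HIV- table with one half-positive group and one all-negative group.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Empirical distribution of a list of categorical values."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}


def variational_distance(p, q):
    """Half the L1 distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def max_group_distance(table):
    """Largest distance between a group's distribution and the overall one.

    Each row of `table` is (qid_group, sensitive_value).
    """
    overall = distribution([s for _, s in table])
    groups = defaultdict(list)
    for qid, s in table:
        groups[qid].append(s)
    return max(
        variational_distance(distribution(v), overall) for v in groups.values()
    )


def is_t_close(table, t):
    return max_group_distance(table) <= t


table = [("Q1", s) for s in ["HIV+", "HIV-"] * 3] + [("Q2", "HIV-")] * 6
print(max_group_distance(table))
```

In this table each group's distribution differs from the overall distribution (25% HIV+) by 0.25, so the table is t-close only for t of at least 0.25.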
Original records → Original histogram → Perturbed histogram with differential privacy
Neighboring databases: D, D’
Global Sensitivity
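One common way to turn global sensitivity into a perturbed histogram is the Laplace mechanism, which adds noise with scale sensitivity / ε to each answer. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the sensitivity of 1 assumes that D and D’ differ by adding or removing one record, so exactly one histogram count changes by one.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answers, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to each answer."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [x + laplace_noise(scale, rng) for x in true_answers]


# Hypothetical histogram counts; sensitivity 1 under add/remove-one-record.
histogram = [12, 7, 30, 4]
print(laplace_mechanism(histogram, sensitivity=1.0, epsilon=0.5))
```

A smaller ε gives a larger noise scale and therefore stronger privacy at the cost of accuracy.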
Inputs → Outputs: sample output r with a utility score function u(D,r)
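Sampling an output according to a utility score is the idea behind the exponential mechanism, which picks r with probability proportional to exp(ε · u(D,r) / (2Δu)). The sketch below is a minimal illustration; the function name, the diagnosis example, and the counting utility (with sensitivity Δu = 1) are assumptions, not taken from the slides.

```python
import math
import random


def exponential_mechanism(database, outputs, utility, sensitivity, epsilon, rng=None):
    """Sample r with probability proportional to exp(eps * u(D, r) / (2 * sensitivity))."""
    rng = rng or random.Random()
    scores = [utility(database, r) for r in outputs]
    # Shifting by the maximum score leaves the distribution unchanged
    # and keeps the exponentials from overflowing.
    top = max(scores)
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(outputs, weights=weights, k=1)[0]


# Hypothetical example: privately pick the most common diagnosis.
diagnoses = ["Flu", "Shingles", "Flu", "Acne", "Acne", "Flu"]


def count(db, r):
    """Utility u(D, r): number of records with value r (changes by at most 1 per record)."""
    return db.count(r)


print(exponential_mechanism(diagnoses, ["Flu", "Shingles", "Acne"], count, 1.0, 1.0))
```

With a small ε the three outputs are almost equally likely; as ε grows the mechanism concentrates on the highest-utility output.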
Tutorial: Differential Privacy in the Wild, Module 2
Name   Age  Income  HIV+
Frank  42   30K     Y
Bob    31   60K     Y
Mary   28   20K     Y
…      …    …       …
[Figure: histogram release through a DP Interface, with two accesses at ε/2-DP each. Labels: Original Records; DP unit Histogram (histogram counts with differential privacy); Multi-dimensional partitioning (histogram for partitioning); DP V-optimal Histogram (histogram counts with differential privacy)]
Parametric methods (joint distribution difficult to model)
– Fit the data to a distribution, make inferences about parameters
– e.g. PrivacyOnTheMap
Non-parametric methods (only work well for low-dimensional data)
– Learn empirical distribution through histograms
– e.g. PSD, Privelet, FP, P-HP
Original data → Histogram → Perturbation → Synthetic data
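The histogram pipeline (original data, histogram, perturbation, synthetic data) can be sketched in a few lines. This is a simplified illustration and not any of the named systems: it assumes a small, publicly known domain of cells, Laplace noise with sensitivity 1 per histogram, and hypothetical function names.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_synthetic(records, domain, epsilon, n_samples, rng=None):
    """Build a noisy histogram over `domain` and sample synthetic records from it."""
    rng = rng or random.Random()
    counts = Counter(records)
    noisy = []
    for cell in domain:  # every cell gets noise, including empty ones
        value = counts[cell] + laplace_noise(1.0 / epsilon, rng)
        noisy.append(max(value, 0.0))  # clipping is post-processing
    if sum(noisy) == 0:
        noisy = [1.0] * len(domain)  # fall back to uniform sampling
    return rng.choices(domain, weights=noisy, k=n_samples)


# Hypothetical one-dimensional data: age brackets.
ages = ["20-29", "30-39", "40-49", "30-39", "20-29", "30-39"]
domain = ["20-29", "30-39", "40-49", "50-59"]
print(dp_synthetic(ages, domain, epsilon=1.0, n_samples=6))
```

The number of cells grows exponentially with the number of attributes, which is one reason such non-parametric methods are described as working well only for low-dimensional data.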
A semi-parametric method
– Non-parametric estimation for each dimension
– Parametric estimation for dependence

Original data set
Age  Hours/week  Income
42   64          30K
31   82          60K
28   40          20K
43   36          80K
…    …           …

Original data set → DP Marginal Histograms (Age, Hours/week, Income) + Dependence structure → DP synthetic data set

DP synthetic data set
Age  Hours/week  Income
42   64          30K
31   82          60K
28   40          20K
43   36          80K
…    …           …
Metrics:
– Random range-count queries with random query predicates covering all attributes
– Relative error:
– Absolute error:
Database D
ID    Record
100   a→c→d
200   b→c→d
300   a→b→c→e→d
400   d→b
500   a→d→c→d

Scan D → C1: cand 1-seqs
Sequence  Sup.
{a}       3
{b}       3
{c}       4
{d}       4
{e}       1

F1: freq 1-seqs
Sequence  Sup.
{a}       3
{b}       3
{c}       4
{d}       4

C2: cand 2-seqs (generated from F1)
{a→a} {a→b} {a→c} {a→d}
{b→a} {b→b} {b→c} {b→d}
{c→a} {c→b} {c→c} {c→d}
{d→a} {d→b} {d→c} {d→d}

Scan D → supports of C2 (candidates not listed have support 0)
Sequence  Sup.
{a→b}     1
{a→c}     3
{a→d}     3
{b→c}     2
{b→d}     2
{c→d}     4
{d→b}     1
{d→c}     1
{d→d}     1

F2: freq 2-seqs
Sequence  Sup.
{a→c}     3
{a→d}     3
{c→d}     4

C3: cand 3-seqs (generated from F2)
{a→c→d}

Scan D → F3: freq 3-seqs
Sequence  Sup.
{a→c→d}   3
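The level-wise procedure in this example (generate candidates from the previous level's frequent sequences, scan D for supports, keep those meeting the threshold) can be sketched as follows. This is an illustrative sketch: the function names are assumptions, records are written as strings of single-letter items, and the minimum support of 3 is inferred from the example rather than stated in it.

```python
# Database D from the example, one string of items per record.
db = ["acd", "bcd", "abced", "db", "adcd"]


def contains(record, pattern):
    """True if `pattern` occurs in `record` as a (possibly non-contiguous) subsequence."""
    remaining = iter(record)
    return all(item in remaining for item in pattern)


def support(database, pattern):
    """Number of records that contain the pattern."""
    return sum(contains(record, pattern) for record in database)


def mine(database, min_sup):
    """Return the frequent sequences level by level: [F1, F2, F3, ...]."""
    items = sorted({x for record in database for x in record})
    candidates = [(x,) for x in items]
    levels = []
    while candidates:
        frequent = [c for c in candidates if support(database, c) >= min_sup]
        if not frequent:
            break
        levels.append(frequent)
        # Join step: extend a with the last item of b when a minus its first
        # item equals b minus its last item.
        candidates = [
            a + b[-1:] for a in frequent for b in frequent if a[1:] == b[:-1]
        ]
    return levels


for k, level in enumerate(mine(db, min_sup=3), start=1):
    print(f"F{k}:", ["→".join(seq) for seq in level])
```

For length-1 sequences the join condition is trivially true, so the same rule produces all sixteen 2-sequence candidates from four frequent items.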
Database D
ID    Record
100   a→c→d
200   b→c→d
300   a→b→c→e→d
400   d→b
500   a→d→c→d

Scan D → C1: cand 1-seqs, supports perturbed with Lap(|C1| / ε1)
Sequence  Sup.
{a}       3
{b}       3
{c}       4
{d}       4
{e}       1
Noise values shown: 0.2, 0.4, 0.8

F1: freq 1-seqs
Sequence  Noisy Sup.
{a}       3.2
{c}       4.4
{d}       3.5

C2: cand 2-seqs (generated from F1)
{a→a} {a→c} {a→d}
{c→a} {c→c} {c→d}
{d→a} {d→c} {d→d}

Scan D → supports of C2 perturbed with Lap(|C2| / ε2)
F2: freq 2-seqs
Sequence  Sup.  noise  Noisy Sup.
{a→c}     3     0.3    3.3
{a→d}     3     0.2    3.2
{c→d}     4     0.2    4.2
{d→c}     1     2.1    3.1
Other noise values shown: 0.2, 0.8, 0.3

C3: cand 3-seqs (generated from F2)
Sequence  Sup.
{a→c→d}   3
{a→d→c}   1
Scan D → supports of C3 perturbed with Lap(|C3| / ε3)
Noise value shown: 0.3

F3: freq 3-seqs
Sequence  Noisy Sup.
{a→c→d}   3
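A hedged sketch of the noisy variant: at level k the supports of all candidates in Ck are perturbed with Lap(|Ck| / εk) before thresholding, which matches the noise scales shown above (one record can change each of the |Ck| supports by at most 1). The function names, the public item domain passed as `items`, and the per-level budgets are assumptions for illustration.

```python
import random

db = ["acd", "bcd", "abced", "db", "adcd"]


def contains(record, pattern):
    """True if `pattern` is a (possibly non-contiguous) subsequence of `record`."""
    remaining = iter(record)
    return all(item in remaining for item in pattern)


def support(database, pattern):
    return sum(contains(record, pattern) for record in database)


def laplace(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mine(database, items, min_sup, epsilons, rng=None):
    """Level-wise mining with noisy supports; epsilons[k-1] is the budget for level k."""
    rng = rng or random.Random()
    candidates = [(x,) for x in items]  # public item domain, not read from the data
    levels = []
    for eps in epsilons:
        if not candidates:
            break
        scale = len(candidates) / eps  # Lap(|C_k| / eps_k)
        noisy = {c: support(database, c) + laplace(scale, rng) for c in candidates}
        frequent = {c: s for c, s in noisy.items() if s >= min_sup}
        levels.append(frequent)
        keys = list(frequent)
        candidates = [a + b[-1:] for a in keys for b in keys if a[1:] == b[:-1]]
    return levels


for k, level in enumerate(private_mine(db, "abcde", 3, [1.0, 1.0, 1.0]), start=1):
    print(f"F{k}:", {"→".join(seq): round(s, 1) for seq, s in level.items()})
```

Because the noise scale grows with the number of candidates, pruning the candidate set before adding noise directly improves accuracy.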
– Reduce sequence length
Original Database → Partition → 1st sample database, 2nd sample database, ……, mth sample database
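$$F\ \mathrm{score} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

$$RE = \mathrm{median}_{X} \left\{ \frac{|sup'_x - sup_x|}{sup_x} \right\}$$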
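Both utility metrics are straightforward to compute once the true and released results are available. The sketch below assumes that the F score compares the released set of frequent sequences against the true set, and that the relative error RE is the median, over the released sequences, of |sup' − sup| / sup; the function names and the small example values are illustrative assumptions.

```python
from statistics import median


def f_score(true_frequent, released):
    """Harmonic mean of precision and recall of the released frequent sequences."""
    true_frequent, released = set(true_frequent), set(released)
    hits = len(true_frequent & released)
    if hits == 0:
        return 0.0
    precision = hits / len(released)
    recall = hits / len(true_frequent)
    return 2 * precision * recall / (precision + recall)


def relative_error(true_sup, noisy_sup):
    """Median over released sequences x of |sup'_x - sup_x| / sup_x."""
    return median(
        abs(noisy_sup[x] - true_sup[x]) / true_sup[x] for x in noisy_sup
    )


# Illustrative 2-sequence results: true supports and released noisy supports.
true_sup = {"a→c": 3, "a→d": 3, "c→d": 4, "d→c": 1}
noisy_sup = {"a→c": 3.3, "a→d": 3.2, "c→d": 4.2, "d→c": 3.1}
true_frequent = ["a→c", "a→d", "c→d"]

print(f_score(true_frequent, noisy_sup))
print(relative_error(true_sup, noisy_sup))
```

Using the median rather than the mean keeps a single badly perturbed sequence, such as d→c here, from dominating the error.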
Finance.com, Fashion.com, WeirdStuff.com, . . .
D, true values:      Disease (Y/N)  Y Y N Y N N
With probability p, report the true value; with probability 1-p, report the flipped value
O, reported values:  Disease (Y/N)  Y N N N Y N
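The report-or-flip rule can be simulated directly, and the collector can still recover an unbiased estimate of the true proportion because the expected fraction of reported Y values is p·π + (1−p)·(1−π), where π is the true fraction. The sketch below is illustrative; the function names are assumptions and p must differ from 0.5 for the estimate to be defined.

```python
import random


def randomize(value, p, rng):
    """Report the true bit with probability p, the flipped bit otherwise."""
    return value if rng.random() < p else not value


def estimate_true_fraction(reports, p):
    """Invert observed = p * pi + (1 - p) * (1 - pi) to estimate pi."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)


rng = random.Random(42)
truth = [True, True, False, True, False, False]  # Disease column: Y Y N Y N N
reports = [randomize(v, p=0.75, rng=rng) for v in truth]
print(reports)
print(estimate_true_fraction(reports, p=0.75))
```

With only six records the estimate is very noisy; the estimator becomes useful when many users report.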
[Figure: m → E → Ek(m) → D → m, with an attacker between E and D]
m: plaintext
k: encryption key
k’: decryption key
Ek(m): ciphertext
Dk’(Ek(m)) = m
– attacker knows E and D – attacker doesn’t know the (decryption) key
– to systematically recover plaintext from ciphertext – to deduce the (decryption) key
– ciphertext-only – known-plaintext – (adaptive) chosen-plaintext – (adaptive) chosen-ciphertext
[Figure: encryption with a shared key K_{A-B}]
plaintext message, m → encryption algorithm (key K_{A-B}) → ciphertext c = K_{A-B}(m) → decryption algorithm (key K_{A-B}) → plaintext m = K_{A-B}(c)
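The round trip with a shared key, encrypting m to c and decrypting c back to m, can be illustrated with a toy XOR cipher. This is only a sketch of the relationship, not a recommended cipher: the function names are hypothetical, and the key is assumed to be random, as long as the message, and never reused (a one-time pad).

```python
import secrets


def keygen(length):
    """Shared key K_AB: `length` random bytes known to both parties."""
    return secrets.token_bytes(length)


def encrypt(key, message):
    """c = K_AB(m): XOR each message byte with the matching key byte."""
    return bytes(m ^ k for m, k in zip(message, key))


def decrypt(key, ciphertext):
    """m = K_AB(c): XOR is its own inverse, so the same key undoes it."""
    return bytes(c ^ k for c, k in zip(ciphertext, key))


message = b"midterm review"
key = keygen(len(message))
ciphertext = encrypt(key, message)
print(ciphertext.hex())
print(decrypt(key, ciphertext))
```

Here the encryption and decryption keys are identical, which is the shared-key setting; in the general model the two keys may differ.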
[Figure: parties holding inputs x1, x2, x3, …, xn jointly compute f(x1, x2, …, xn)]
[Goldreich-Micali-Wigderson 1987]
[Figure: two parties with inputs x1 and x2; outputs f1(x1,x2) and f2(x1,x2)]
S inputs two bits m0, m1; R inputs the index (0 or 1) of one of S’s bits
R learns his chosen bit, S learns nothing
– S does not learn which bit R has chosen; R does not learn the value of the bit he did not choose
[Rabin 1981]
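To make the functionality concrete, here is a toy 1-out-of-2 oblivious transfer. It follows a well-known RSA-based construction in the style of Even, Goldreich and Lempel, not Rabin's original protocol, and it runs both parties inside one function for readability. The tiny primes, the function name, and the variable names are assumptions for illustration; the parameters are far too small to be secure.

```python
import secrets

# Toy RSA key pair owned by the sender S (insecure sizes, illustration only).
P, Q, E_PUB = 1009, 1013, 65537
N = P * Q
D_PRIV = pow(E_PUB, -1, (P - 1) * (Q - 1))


def oblivious_transfer(m0, m1, choice):
    """R learns m_choice; S cannot tell which message was chosen."""
    # S: publish two random values.
    x = [secrets.randbelow(N), secrets.randbelow(N)]

    # R: blind the chosen random value with a fresh secret k.
    k = secrets.randbelow(N)
    v = (x[choice] + pow(k, E_PUB, N)) % N

    # S: derive one candidate key per message; exactly one equals k,
    # but S does not know which.
    k0 = pow((v - x[0]) % N, D_PRIV, N)
    k1 = pow((v - x[1]) % N, D_PRIV, N)
    c0 = (m0 + k0) % N
    c1 = (m1 + k1) % N

    # R: unmask the chosen message; the other stays hidden behind an unknown key.
    return ((c0, c1)[choice] - k) % N


print(oblivious_transfer(0, 1, choice=1))
```

The value v looks uniformly random to S regardless of the choice, which is what hides R's index in this construction.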
Encrypted Data → Computation / Queries
– C1 holds encrypted data E(a), E(b) – C2 holds decryption key sk
– C1 and C2 do not obtain anything about the data and result
– Utilize additive homomorphic property – Use random shares to ensure C2 only has access to decrypted data with random shares
– secure multiplication, secure comparison, …
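The two-server setting can be illustrated with a secure multiplication sketch: C1 masks E(a) and E(b) with random shares using the additive homomorphic property, C2 decrypts only the masked values and multiplies them, and C1 removes the mask terms under encryption. The sketch uses a toy Paillier cryptosystem as the additively homomorphic scheme, which is an assumption (the slides do not name one); the tiny primes and all function names are illustrative, and both parties are simulated in one process.

```python
import math
import secrets


def keygen(p=1009, q=1013):
    """Toy Paillier keys: public n, secret (lam, mu). Insecure sizes."""
    n = p * q
    lam = (p - 1) * (q - 1)
    return n, (lam, pow(lam, -1, n))


def encrypt(n, m):
    """E(m) = (1 + n)^m * r^n mod n^2, using (1 + n)^m = 1 + m*n mod n^2."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def decrypt(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def secure_multiply(n, sk, enc_a, enc_b):
    """Return E(a*b) from E(a), E(b); `sk` is used only in the C2 step."""
    n2 = n * n
    # C1: add random shares homomorphically, E(a + ra) and E(b + rb).
    ra, rb = secrets.randbelow(n), secrets.randbelow(n)
    masked_a = enc_a * encrypt(n, ra) % n2
    masked_b = enc_b * encrypt(n, rb) % n2
    # C2: decrypts only masked values, multiplies, and re-encrypts.
    h = decrypt(n, sk, masked_a) * decrypt(n, sk, masked_b) % n
    enc_h = encrypt(n, h)
    # C1: a*b = h - a*rb - b*ra - ra*rb, computed under encryption.
    result = enc_h * pow(enc_a, n - rb, n2) % n2
    result = result * pow(enc_b, n - ra, n2) % n2
    return result * encrypt(n, -ra * rb) % n2


n, sk = keygen()
enc_product = secure_multiply(n, sk, encrypt(n, 12), encrypt(n, 34))
print(decrypt(n, sk, enc_product))
```

The final decryption is shown only to check the result; in the protocol the product stays encrypted at C1, and C2 sees nothing but values masked by ra and rb.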