CS573 Data Privacy and Security
Anonymization Methods
Li Xiong
Today
- Permutation-based anonymization methods (cont.)
- Other privacy principles for microdata publishing
- Statistical databases
Anonymization methods
- Non-perturbative: do not distort the data
  – Generalization
  – Suppression
- Perturbative: distort the data
  – Microaggregation/clustering
  – Additive noise
- Anatomization and permutation
  – De-associate the relationship between QID and the sensitive attribute
Concept of the Anatomy Algorithm
- Release 2 tables: a quasi-identifier table (QIT) and a sensitive table (ST)
- Use the same QI groups (satisfying l-diversity), but replace the sensitive attribute values with a Group-ID column
- Then produce a sensitive table (ST) with per-group statistics

QIT:
Tuple ID | Age | Sex | Zipcode | Group-ID
1 | 23 | M | 11000 | 1
2 | 27 | M | 13000 | 1
3 | 35 | M | 59000 | 1
4 | 59 | M | 12000 | 1
5 | 61 | F | 54000 | 2
6 | 65 | F | 25000 | 2
7 | 65 | F | 25000 | 2
8 | 70 | F | 30000 | 2

ST:
Group-ID | Disease | Count
1 | dyspepsia | 2
1 | pneumonia | 2
2 | bronchitis | 1
2 | flu | 2
2 | stomach ache | 1
Specifications of Anatomy

DEFINITION 3. (Anatomy) Given an l-diverse partition of the microdata into QI groups, anatomy creates a QIT and an ST. The QIT has the schema
  (A1^qi, A2^qi, ..., Ad^qi, Group-ID),
with one record per tuple carrying its QI values and the ID of its group. The ST has the schema
  (Group-ID, A^s, Count),
with one record (j, v, c_j(v)) for each QI group j and each distinct sensitive value v appearing in it, where c_j(v) is the number of tuples in group j carrying value v.
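To make the construction concrete, here is a minimal Python sketch (the `anatomize` helper and the input encoding are my own, not the paper's), applied to the running example:

```python
from collections import Counter

# A minimal sketch of the anatomy construction, assuming an l-diverse
# partition into QI groups is already given. `groups` maps a Group-ID
# to a list of (qi_tuple, sensitive_value) pairs.

def anatomize(groups):
    qit, st = [], []
    for gid, records in groups.items():
        for qi, _ in records:
            qit.append(qi + (gid,))         # QI values plus Group-ID
        counts = Counter(s for _, s in records)
        for value, count in counts.items():
            st.append((gid, value, count))  # (Group-ID, value, Count)
    return qit, st

# The 2-diverse partition of the slides' microdata (Age, Sex, Zipcode):
groups = {
    1: [((23, "M", 11000), "pneumonia"), ((27, "M", 13000), "dyspepsia"),
        ((35, "M", 59000), "dyspepsia"), ((59, "M", 12000), "pneumonia")],
    2: [((61, "F", 54000), "flu"), ((65, "F", 25000), "stomach ache"),
        ((65, "F", 25000), "flu"), ((70, "F", 30000), "bronchitis")],
}
qit, st = anatomize(groups)
```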
Privacy properties

THEOREM 1. Given a pair of QIT and ST, an adversary can infer the sensitive value of any individual with probability at most 1/l. In the running example (l = 2), group 1 has 4 tuples with counts dyspepsia: 2 and pneumonia: 2, so the adversary's best guess for Bob's disease succeeds with probability 2/4 = 1/2.
The adversary can only reconstruct the following candidate tuples by joining QIT and ST (Age | Sex | Zipcode | Group-ID | Disease | Count):

23 | M | 11000 | 1 | dyspepsia | 2
23 | M | 11000 | 1 | pneumonia | 2
27 | M | 13000 | 1 | dyspepsia | 2
27 | M | 13000 | 1 | pneumonia | 2
35 | M | 59000 | 1 | dyspepsia | 2
35 | M | 59000 | 1 | pneumonia | 2
59 | M | 12000 | 1 | dyspepsia | 2
59 | M | 12000 | 1 | pneumonia | 2
61 | F | 54000 | 2 | bronchitis | 1
61 | F | 54000 | 2 | flu | 2
61 | F | 54000 | 2 | stomach ache | 1
65 | F | 25000 | 2 | bronchitis | 1
65 | F | 25000 | 2 | flu | 2
65 | F | 25000 | 2 | stomach ache | 1
65 | F | 25000 | 2 | bronchitis | 1
65 | F | 25000 | 2 | flu | 2
65 | F | 25000 | 2 | stomach ache | 1
70 | F | 30000 | 2 | bronchitis | 1
70 | F | 30000 | 2 | flu | 2
70 | F | 30000 | 2 | stomach ache | 1
Comparison with generalization
- Compare with generalization under two assumptions:
  – A1: the adversary knows the QI values of the target individual
  – A2: the adversary also knows that the individual is definitely in the microdata
- If A1 and A2 are true, anatomy is as good as generalization: the 1/l bound still holds
- If A1 is true and A2 is false, generalization is stronger
- If A1 and A2 are false, generalization is still stronger
Preserving Data Correlation
- Examine the correlation between Age and Disease in T using a probability density function (pdf)
- Example: the pdf of tuple t1

Table 1 (microdata T):
Tuple ID | Age | Sex | Zipcode | Disease
1 (Bob)   | 23 | M | 11000 | pneumonia
2         | 27 | M | 13000 | dyspepsia
3         | 35 | M | 59000 | dyspepsia
4         | 59 | M | 12000 | pneumonia
5         | 61 | F | 54000 | flu
6         | 65 | F | 25000 | stomach ache
7 (Alice) | 65 | F | 25000 | flu
8         | 70 | F | 30000 | bronchitis
Preserving Data Correlation
- To reconstruct an approximate pdf of t1 from the generalization table:

Table 2 (generalization):
Tuple ID | Age | Sex | Zipcode | Disease
1 | [21,60] | M | [10001,60000] | pneumonia
2 | [21,60] | M | [10001,60000] | dyspepsia
3 | [21,60] | M | [10001,60000] | dyspepsia
4 | [21,60] | M | [10001,60000] | pneumonia
5 | [61,70] | F | [10001,60000] | flu
6 | [61,70] | F | [10001,60000] | stomach ache
7 | [61,70] | F | [10001,60000] | flu
8 | [61,70] | F | [10001,60000] | bronchitis
Preserving Data Correlation
- To reconstruct an approximate pdf of t1 from the QIT and ST tables:

QIT:
Tuple ID | Age | Sex | Zipcode | Group-ID
1 | 23 | M | 11000 | 1
2 | 27 | M | 13000 | 1
3 | 35 | M | 59000 | 1
4 | 59 | M | 12000 | 1
5 | 61 | F | 54000 | 2
6 | 65 | F | 25000 | 2
7 | 65 | F | 25000 | 2
8 | 70 | F | 30000 | 2

ST:
Group-ID | Disease | Count
1 | dyspepsia | 2
1 | pneumonia | 2
2 | bronchitis | 1
2 | flu | 2
2 | stomach ache | 1
Preserving Data Correlation
- For a more rigorous comparison, calculate the "distance" between the approximate pdf p̃_t and the exact pdf p_t as the squared L2 distance:
    D(p̃_t, p_t) = Σ_x (p̃_t(x) − p_t(x))²
- The distance for anatomy is 0.5, while the distance for generalization is 22.5
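As a sanity check on these numbers, here is a small sketch (variable names are illustrative) that rebuilds t1's approximate pdf from ST and reproduces the 0.5 distance:

```python
# t1 (Bob) is in QI group 1; ST says group 1 holds dyspepsia x2 and
# pneumonia x2, so the reconstructed pdf splits the mass evenly.
st_group1 = {"dyspepsia": 2, "pneumonia": 2}
size = sum(st_group1.values())
approx = {d: c / size for d, c in st_group1.items()}  # 0.5 / 0.5

exact = {"dyspepsia": 0.0, "pneumonia": 1.0}  # t1 really has pneumonia

dist = sum((approx[d] - exact[d]) ** 2 for d in exact)
print(dist)  # 0.5, matching the slide; generalization scores far worse
```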
Preserving Data Correlation
- Idea: measure the error for each tuple t as the squared L2 distance between its approximate and exact pdfs:
    Err(t) = Σ_x (p̃_t(x) − p_t(x))²
- Objective: minimize the total re-construction error (RCE) over all tuples in T:
    RCE = Σ_{t∈T} Err(t)

Algorithm: Nearly-Optimal Anatomizing Algorithm
Experiments
- Dataset CENSUS containing the personal information of 500k American adults, with 9 discrete attributes
- Created two sets of tables:
  – Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI attributes and Occupation as the sensitive attribute
  – Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI attributes and Salary as the sensitive attribute
Today
- Permutation-based anonymization methods (cont.)
- Other privacy principles for microdata publishing
- Statistical databases
- Differential privacy
Attacks on k-Anonymity
- k-Anonymity does not provide privacy if
  – Sensitive values in an equivalence class lack diversity
  – The attacker has background knowledge

A 3-anonymous patient table:
Zipcode | Age | Disease
476** | 2*  | Heart Disease
476** | 2*  | Heart Disease
476** | 2*  | Heart Disease
4790* | ≥40 | Flu
4790* | ≥40 | Heart Disease
4790* | ≥40 | Cancer
476** | 3*  | Heart Disease
476** | 3*  | Cancer
476** | 3*  | Cancer

Homogeneity attack: Bob (Zipcode 47678, Age 27) falls in the (476**, 2*) class, where every record is Heart Disease.
Background knowledge attack: Carl (Zipcode 47673, Age 36) falls in the (476**, 3*) class; an attacker who knows Carl is unlikely to have heart disease concludes he has cancer.
l-Diversity [Machanavajjhala et al. ICDE '06]

Race | Zipcode | Disease
Caucas | 787XX | Flu
Caucas | 787XX | Shingles
Caucas | 787XX | Acne
Caucas | 787XX | Flu
Caucas | 787XX | Acne
Caucas | 787XX | Acne
Caucas | 787XX | Flu
Asian/AfrAm | 78XXX | Flu
Asian/AfrAm | 78XXX | Flu
Asian/AfrAm | 78XXX | Acne
Asian/AfrAm | 78XXX | Shingles
Asian/AfrAm | 78XXX | Acne
Asian/AfrAm | 78XXX | Flu

Sensitive attributes must be "diverse" within each quasi-identifier equivalence class.
Distinct l-Diversity
- Each equivalence class has at least l well-represented sensitive values
- Doesn't prevent probabilistic inference attacks
  – Example: in an equivalence class of 10 records where 8 records have HIV and 2 records have other values, an attacker infers HIV with 80% confidence
Other Versions of l-Diversity
- Probabilistic l-diversity
  – The frequency of the most frequent value in an equivalence class is bounded by 1/l
- Entropy l-diversity
  – The entropy of the distribution of sensitive values in each equivalence class is at least log(l)
- Recursive (c,l)-diversity
  – r_1 < c (r_l + r_{l+1} + … + r_m), where r_i is the frequency of the i-th most frequent value
  – Intuition: the most frequent value does not appear too frequently
A small checker for these variants appears below.
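A minimal Python sketch (helper names are mine) checking three of these variants against the skewed 10-record class from the previous slide:

```python
from collections import Counter
from math import log

def distinct_l_diverse(values, l):
    # distinct l-diversity: at least l distinct sensitive values
    return len(set(values)) >= l

def probabilistic_l_diverse(values, l):
    # probabilistic l-diversity: max value frequency bounded by 1/l
    top = Counter(values).most_common(1)[0][1]
    return top / len(values) <= 1.0 / l

def entropy_l_diverse(values, l):
    # entropy l-diversity: entropy of the class distribution >= log(l)
    n = len(values)
    h = -sum((c / n) * log(c / n) for c in Counter(values).values())
    return h >= log(l)

eq_class = ["HIV"] * 8 + ["flu", "cancer"]   # the skewed class above
print(distinct_l_diverse(eq_class, 2))       # True: 3 distinct values
print(probabilistic_l_diverse(eq_class, 2))  # False: 0.8 > 1/2
print(entropy_l_diverse(eq_class, 2))        # False: ~0.64 < log 2
```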
Neither Necessary, Nor Sufficient

Original dataset (overall, 99% have cancer):
2 | Cancer
2 | Cancer
2 | Cancer
2 | Flu
2 | Cancer

Anonymization A:
Q1 | Flu
Q1 | Flu
Q1 | Cancer
Q1 | Flu
Q1 | Cancer

Anonymization B:
Q1 | Flu
Q1 | Cancer
Q1 | Cancer
Q1 | Cancer
Q1 | Cancer

- 50% cancer ⇒ quasi-identifier group is "diverse" (Anonymization A), yet it reveals far more than the 99% baseline
- 99% cancer ⇒ quasi-identifier group is not "diverse" (Anonymization B), yet it reveals nothing beyond the 99% baseline
Limitations of l-Diversity
- Example: sensitive attribute is HIV+ (1%) or HIV- (99%)
– Very different degrees of sensitivity!
- l-diversity is unnecessary
– 2-diversity is unnecessary for an equivalence class that contains only HIV- records
- l-diversity is difficult to achieve
– Suppose there are 10000 records in total, 1% (100) of them HIV+
– To have distinct 2-diversity, every equivalence class needs at least one HIV+ record, so there can be at most 10000 × 1% = 100 equivalence classes
Skewness Attack
- Example: sensitive attribute is HIV+ (1%) or HIV- (99%)
- Consider an equivalence class that contains an equal number of HIV+ and HIV- records
  – Diverse, but potentially violates privacy: a 50% chance of being HIV+ is far above the 1% baseline
- l-diversity does not differentiate:
  – Equivalence class 1: 49 HIV+ and 1 HIV-
  – Equivalence class 2: 1 HIV+ and 49 HIV-

l-diversity does not consider the overall distribution of sensitive values!
Sensitive Attribute Disclosure (Similarity Attack)

Bob: Zipcode 47678, Age 27

A 3-diverse patient table:
Zipcode | Age | Salary | Disease
476** | 2*  | 20K  | Gastric Ulcer
476** | 2*  | 30K  | Gastritis
476** | 2*  | 40K  | Stomach Cancer
4790* | ≥40 | 50K  | Gastritis
4790* | ≥40 | 100K | Flu
4790* | ≥40 | 70K  | Bronchitis
476** | 3*  | 60K  | Bronchitis
476** | 3*  | 80K  | Pneumonia
476** | 3*  | 90K  | Stomach Cancer

Conclusion:
1. Bob's salary is in [20K, 40K], which is relatively low
2. Bob has some stomach-related disease

l-diversity does not consider semantics of sensitive values!
t-Closeness: A New Privacy Measure
- Rationale: model how the adversary's belief about an individual's sensitive value is refined by each piece of released information
  – B0: belief based on external knowledge alone
  – B1: belief after learning the overall distribution Q of sensitive values in the table
  – B2: belief after learning the distribution Pi of sensitive values in the individual's equivalence class
- Observations
  – Q is public or can be derived, so the gain from B0 to B1 is unavoidable
  – The real threat is the potential knowledge gain about specific individuals from Q to Pi (B1 to B2)
- Principle
  – The distance between Q and Pi should be bounded by a threshold t
t-Closeness [Li et al. ICDE '07]

Race | Zipcode | Disease
Caucas | 787XX | Flu
Caucas | 787XX | Shingles
Caucas | 787XX | Acne
Caucas | 787XX | Flu
Caucas | 787XX | Acne
Caucas | 787XX | Acne
Caucas | 787XX | Flu
Asian/AfrAm | 78XXX | Flu
Asian/AfrAm | 78XXX | Flu
Asian/AfrAm | 78XXX | Acne
Asian/AfrAm | 78XXX | Shingles
Asian/AfrAm | 78XXX | Acne
Asian/AfrAm | 78XXX | Flu

Distribution of sensitive attributes within each quasi-identifier group should be "close" to their distribution in the entire original database.
Distance Measures
- P = (p1, p2, …, pm), Q = (q1, q2, …, qm)
- Trace distance: D[P,Q] = (1/2) Σ_i |p_i − q_i|
- KL divergence: D[P,Q] = Σ_i p_i log(p_i / q_i)
- None of these measures reflects the semantic distance among values
- Example: Q = {3K,4K,5K,6K,7K,8K,9K,10K,11K}, P1 = {3K,4K,5K}, P2 = {6K,8K,11K}
- Intuitively, D[P1,Q] > D[P2,Q], but these measures cannot see it (checked in the sketch below)
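A quick numeric check (my own sketch) that trace distance and KL divergence assign the same score to P1 and P2, which motivates EMD on the next slides:

```python
from math import log

# Salary domain 3K..11K; Q is uniform, P1 = {3K,4K,5K}, P2 = {6K,8K,11K}.
q  = [1/9] * 9
p1 = [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0]
p2 = [0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3]

def trace(p, q):
    # trace (variational) distance: half the L1 distance
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    # KL divergence, summing only over the support of p
    return sum(a * log(a / b) for a, b in zip(p, q) if a > 0)

print(trace(p1, q), trace(p2, q))  # both ~0.667: indistinguishable
print(kl(p1, q), kl(p2, q))        # both ~1.099 (= log 3): indistinguishable
```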
Earth Mover's Distance
- If the distributions are interpreted as two different ways of piling up a certain amount of dirt over region D, EMD is the minimum cost of turning one pile into the other
  – The cost is the amount of dirt moved times the distance by which it is moved
  – Assumes the two piles hold the same amount of dirt
- Extensions exist for comparing distributions with different total masses
  – Allow a partial match: leftover "dirt" is discarded without cost
  – Allow mass to be created or destroyed, but with a cost penalty
Earth Mover's Distance
- Formulation
  – P = (p1, p2, …, pm), Q = (q1, q2, …, qm)
  – d_ij: the ground distance between element i of P and element j of Q
  – Find a flow F = [f_ij], where f_ij is the flow of mass from element i of P to element j of Q, that minimizes the overall work
      WORK(P, Q, F) = Σ_i Σ_j d_ij f_ij
    subject to the constraints
      f_ij ≥ 0                          (1 ≤ i, j ≤ m)
      p_i − Σ_j f_ij + Σ_j f_ji = q_i   (1 ≤ i ≤ m)
      Σ_i Σ_j f_ij = Σ_i p_i = Σ_i q_i = 1
How to Calculate EMD (Cont'd)
- EMD for categorical attributes
  – Use hierarchical distance, defined over a generalization hierarchy of the values
  – Hierarchical distance is a metric
Earth Mover's Distance
- Example
  – P1 = {3K,4K,5K}, Q = {3K,4K,5K,6K,7K,8K,9K,10K,11K}
  – Move 1/9 probability for each of the following pairs:
    - 3K→6K, 3K→7K: cost 1/9 × (3+4)/8
    - 4K→8K, 4K→9K: cost 1/9 × (4+5)/8
    - 5K→10K, 5K→11K: cost 1/9 × (5+6)/8
  – Total cost: 1/9 × 27/8 = 0.375
  – With P2 = {6K,8K,11K}, the total cost is 1/9 × 12/8 = 0.167 < 0.375. This makes more sense than the other two distance measures.
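For ordered numerical attributes, this EMD reduces to a closed form over cumulative differences, D[P,Q] = (1/(m−1)) Σ_i |Σ_{j≤i} (p_j − q_j)|. The sketch below (function name is mine) reproduces both totals:

```python
def ordered_emd(p, q):
    """EMD between two distributions over the same ordered domain."""
    assert len(p) == len(q)
    total, cum = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi      # running surplus/deficit of "dirt"
        total += abs(cum)   # it must be carried one more step
    return total / (len(p) - 1)

# Salary domain 3K..11K; Q uniform, P1 = {3K,4K,5K}, P2 = {6K,8K,11K}.
q  = [1/9] * 9
p1 = [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0]
p2 = [0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3]
print(ordered_emd(p1, q))  # ~0.375
print(ordered_emd(p2, q))  # ~0.167
```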
Experiments
- Goal
  – To show that l-diversity does not provide sufficient privacy protection (the similarity attack)
  – To show that the efficiency and data quality of using t-closeness are comparable with other privacy measures
- Setup
  – Adult dataset from the UC Irvine ML repository
  – 30162 tuples, 9 attributes (2 sensitive attributes)
  – Algorithm: Incognito
Experiments
- Comparison of privacy measurements
  – k-Anonymity
  – Entropy l-diversity
  – Recursive (c,l)-diversity
  – k-Anonymity with t-closeness
Experiments
- Efficiency
– The efficiency of using t-closeness is comparable with other privacy measurements
Experiments
- Data utility
  – Metrics: discernibility metric; minimum average group size
  – The data quality of using t-closeness is comparable with other privacy measurements
Anonymous, "t-Close" Dataset
- This is k-anonymous, l-diverse and t-close…
- …so secure, right?
What Does Attacker Know?
- Bob is Caucasian and I heard he was admitted to hospital with flu…
- …and I know three other Caucasians admitted to hospital with acne or shingles…
k-Anonymity and Partition-based Notions
- Syntactic
  – Focuses on data transformation, not on what can be learned from the anonymized dataset
  – A "k-anonymous" dataset can leak sensitive information
- "Quasi-identifier" fallacy
  – Assumes a priori that the attacker will not know certain information about his target
Today
- Permutation-based anonymization methods (cont.)
- Other privacy principles for microdata publishing
- Statistical databases
  – Definitions and early methods
  – Output perturbation and differential privacy
Statistical Data Release
- Originated from the study of statistical databases
- A statistical database is a database which provides statistics on subsets of records
- OLAP vs. OLTP
- Statistical queries may compute the SUM, MEAN, MEDIAN, COUNT, MAX, and MIN of records
Types of Statistical Databases
- Static: made once and never changes (example: U.S. Census)
- Dynamic: changes continuously to reflect real-time data (example: most online research databases)
- Centralized: one database; Decentralized: multiple decentralized databases
- General purpose: like census; Special purpose: like bank, hospital, academia, etc.
Data Compromise
- Exact compromise: a user is able to determine the exact value of a sensitive attribute of an individual
- Partial compromise: a user is able to obtain an estimator for a sensitive attribute with a bounded variance
- Positive compromise: determine that an attribute has a particular value
- Negative compromise: determine that an attribute does not have a particular value
- Relative compromise: determine the ranking of some confidential values
A small differencing example follows this list.
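To make exact compromise concrete, here is a minimal sketch (hypothetical data and helper) of the classic differencing attack, where two individually innocuous SUM queries reveal one person's value:

```python
# A statistical database that answers SUM over subsets of records.
salaries = {"Alice": 54000, "Bob": 11000, "Carol": 25000, "Dave": 30000}

def query_sum(predicate):
    # each query alone looks like a harmless aggregate statistic
    return sum(v for name, v in salaries.items() if predicate(name))

everyone    = query_sum(lambda name: True)
all_but_bob = query_sum(lambda name: name != "Bob")
print(everyone - all_but_bob)  # 11000: Bob's exact salary is compromised
```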
Statistical Quality of Information
- Bias: difference between the unperturbed statistic and the expected value of its perturbed estimate
- Precision: variance of the estimators obtained by users
- Consistency: lack of contradictions and paradoxes
  – Contradictions: different responses to the same query; average differs from sum/count
  – Paradox: negative count
Methods
- Query restriction
- Data perturbation/anonymization
- Output perturbation

(Diagrams: in data perturbation, users' queries are answered against a perturbed copy of the original database; in output perturbation, queries run against the original database and noise is added to the results before they are returned to users. A sketch of output perturbation follows.)
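A minimal sketch of output perturbation (illustrative names, not a specific system): the true answer is computed on the original database and Laplace noise is added to the released result. Calibrating such noise to the query is the core idea behind differential privacy, covered next.

```python
import random

def laplace(scale):
    # difference of two i.i.d. exponential variates is Laplace(0, scale)
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_count(records, predicate, scale=1.0):
    true_answer = sum(1 for r in records if predicate(r))  # exact, internal
    return true_answer + laplace(scale)                    # perturbed, released

salaries = [11000, 13000, 59000, 12000, 54000, 25000, 30000]
print(noisy_count(salaries, lambda s: s > 20000))  # about 4, plus noise
```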
Statistical data release vs. data anonymization
- Data anonymization is one technique that can be used to build a statistical database
- Other techniques, such as query restriction and output perturbation, can also be used to build a statistical database or release statistical data
- Different privacy principles can be used