CS573 Data Privacy and Security
Anonymization Methods
Li Xiong
Today
- Permutation-based anonymization methods (cont.)
- Other privacy principles for microdata publishing
- Statistical databases
Anonymization methods
- Non-perturbative: do not distort the data
  – Generalization
  – Suppression
- Perturbative: distort the data
  – Microaggregation/clustering
  – Additive noise
- Anatomization and permutation
  – De-associate the relationship between QID and the sensitive attribute
Concept of the Anatomy Algorithm
- Release 2 tables: a quasi-identifier table (QIT) and a sensitive table (ST)
- Use the same QI groups (satisfying l-diversity), but replace the sensitive attribute values with a Group-ID column
- Then produce a sensitive table (ST) with per-group statistics

QIT:
Tuple ID | Age | Sex | Zipcode | Group-ID
1 | 23 | M | 11000 | 1
2 | 27 | M | 13000 | 1
3 | 35 | M | 59000 | 1
4 | 59 | M | 12000 | 1
5 | 61 | F | 54000 | 2
6 | 65 | F | 25000 | 2
7 | 65 | F | 25000 | 2
8 | 70 | F | 30000 | 2

ST:
Group-ID | Disease | Count
1 | dyspepsia | 2
1 | pneumonia | 2
2 | bronchitis | 1
2 | flu | 2
2 | stomach ache | 1
Specifications of Anatomy

DEFINITION 3. (Anatomy) Given an l-diverse partition of the microdata into QI groups, anatomy creates a QIT and an ST. The QIT has the schema
  (A1^qi, A2^qi, ..., Ad^qi, Group-ID),
with one record per tuple carrying its QI values and the ID of its group. The ST has the schema
  (Group-ID, A^s, Count),
with one record (j, v, c_j(v)) for each QI group j and each distinct sensitive value v appearing in it, where c_j(v) is the number of tuples in group j carrying value v.
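To make the construction concrete, here is a minimal Python sketch (the `anatomize` helper and the input encoding are my own, not the paper's), applied to the running example:

```python
from collections import Counter

# A minimal sketch of the anatomy construction, assuming an l-diverse
# partition into QI groups is already given. `groups` maps a Group-ID
# to a list of (qi_tuple, sensitive_value) pairs.

def anatomize(groups):
    qit, st = [], []
    for gid, records in groups.items():
        for qi, _ in records:
            qit.append(qi + (gid,))         # QI values plus Group-ID
        counts = Counter(s for _, s in records)
        for value, count in counts.items():
            st.append((gid, value, count))  # (Group-ID, value, Count)
    return qit, st

# The 2-diverse partition of the slides' microdata (Age, Sex, Zipcode):
groups = {
    1: [((23, "M", 11000), "pneumonia"), ((27, "M", 13000), "dyspepsia"),
        ((35, "M", 59000), "dyspepsia"), ((59, "M", 12000), "pneumonia")],
    2: [((61, "F", 54000), "flu"), ((65, "F", 25000), "stomach ache"),
        ((65, "F", 25000), "flu"), ((70, "F", 30000), "bronchitis")],
}
qit, st = anatomize(groups)
```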
Privacy properties

THEOREM 1. Given a pair of QIT and ST, an adversary can infer the sensitive value of any individual with probability at most 1/l. In the running example (l = 2), group 1 has 4 tuples with counts dyspepsia: 2 and pneumonia: 2, so the adversary's best guess for Bob's disease succeeds with probability 2/4 = 1/2.
The adversary can only reconstruct the following candidate tuples by joining QIT and ST (Age | Sex | Zipcode | Group-ID | Disease | Count):

23 | M | 11000 | 1 | dyspepsia | 2
23 | M | 11000 | 1 | pneumonia | 2
27 | M | 13000 | 1 | dyspepsia | 2
27 | M | 13000 | 1 | pneumonia | 2
35 | M | 59000 | 1 | dyspepsia | 2
35 | M | 59000 | 1 | pneumonia | 2
59 | M | 12000 | 1 | dyspepsia | 2
59 | M | 12000 | 1 | pneumonia | 2
61 | F | 54000 | 2 | bronchitis | 1
61 | F | 54000 | 2 | flu | 2
61 | F | 54000 | 2 | stomach ache | 1
65 | F | 25000 | 2 | bronchitis | 1
65 | F | 25000 | 2 | flu | 2
65 | F | 25000 | 2 | stomach ache | 1
65 | F | 25000 | 2 | bronchitis | 1
65 | F | 25000 | 2 | flu | 2
65 | F | 25000 | 2 | stomach ache | 1
70 | F | 30000 | 2 | bronchitis | 1
70 | F | 30000 | 2 | flu | 2
70 | F | 30000 | 2 | stomach ache | 1
Comparison with generalization
- Compare with generalization under two assumptions:
  – A1: the adversary knows the QI values of the target individual
  – A2: the adversary also knows that the individual is definitely in the microdata
- If A1 and A2 are true, anatomy is as good as generalization: the 1/l bound still holds
- If A1 is true and A2 is false, generalization is stronger
- If A1 and A2 are false, generalization is still stronger
Preserving Data Correlation
- Examine the correlation between Age and Disease in T using a probability density function (pdf)
- Example: the pdf of tuple t1

Table 1 (microdata T):
Tuple ID | Age | Sex | Zipcode | Disease
1 (Bob)   | 23 | M | 11000 | pneumonia
2         | 27 | M | 13000 | dyspepsia
3         | 35 | M | 59000 | dyspepsia
4         | 59 | M | 12000 | pneumonia
5         | 61 | F | 54000 | flu
6         | 65 | F | 25000 | stomach ache
7 (Alice) | 65 | F | 25000 | flu
8         | 70 | F | 30000 | bronchitis
Preserving Data Correlation
- To reconstruct an approximate pdf of t1 from the generalization table:

Table 2 (generalization):
Tuple ID | Age | Sex | Zipcode | Disease
1 | [21,60] | M | [10001,60000] | pneumonia
2 | [21,60] | M | [10001,60000] | dyspepsia
3 | [21,60] | M | [10001,60000] | dyspepsia
4 | [21,60] | M | [10001,60000] | pneumonia
5 | [61,70] | F | [10001,60000] | flu
6 | [61,70] | F | [10001,60000] | stomach ache
7 | [61,70] | F | [10001,60000] | flu
8 | [61,70] | F | [10001,60000] | bronchitis
Preserving Data Correlation
- To reconstruct an approximate pdf of t1 from the QIT and ST tables:

QIT:
Tuple ID | Age | Sex | Zipcode | Group-ID
1 | 23 | M | 11000 | 1
2 | 27 | M | 13000 | 1
3 | 35 | M | 59000 | 1
4 | 59 | M | 12000 | 1
5 | 61 | F | 54000 | 2
6 | 65 | F | 25000 | 2
7 | 65 | F | 25000 | 2
8 | 70 | F | 30000 | 2

ST:
Group-ID | Disease | Count
1 | dyspepsia | 2
1 | pneumonia | 2
2 | bronchitis | 1
2 | flu | 2
2 | stomach ache | 1
Preserving Data Correlation
- For a more rigorous comparison, calculate the "distance" between the approximate pdf p̃_t and the exact pdf p_t as the squared L2 distance:
    D(p̃_t, p_t) = Σ_x (p̃_t(x) − p_t(x))²
- The distance for anatomy is 0.5, while the distance for generalization is 22.5
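As a sanity check on these numbers, here is a small sketch (variable names are illustrative) that rebuilds t1's approximate pdf from ST and reproduces the 0.5 distance:

```python
# t1 (Bob) is in QI group 1; ST says group 1 holds dyspepsia x2 and
# pneumonia x2, so the reconstructed pdf splits the mass evenly.
st_group1 = {"dyspepsia": 2, "pneumonia": 2}
size = sum(st_group1.values())
approx = {d: c / size for d, c in st_group1.items()}  # 0.5 / 0.5

exact = {"dyspepsia": 0.0, "pneumonia": 1.0}  # t1 really has pneumonia

dist = sum((approx[d] - exact[d]) ** 2 for d in exact)
print(dist)  # 0.5, matching the slide; generalization scores far worse
```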
Preserving Data Correlation
- Idea: measure the error for each tuple t as the squared L2 distance between its approximate and exact pdfs:
    Err(t) = Σ_x (p̃_t(x) − p_t(x))²
- Objective: minimize the total re-construction error (RCE) over all tuples in T:
    RCE = Σ_{t∈T} Err(t)

Algorithm: Nearly-Optimal Anatomizing Algorithm
Experiments
- Dataset CENSUS containing the personal information of 500k American adults, with 9 discrete attributes
- Created two sets of tables:
  – Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI attributes and Occupation as the sensitive attribute
  – Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI attributes and Salary as the sensitive attribute
Today
- Permutation-based anonymization methods (cont.)
- Other privacy principles for microdata publishing
- Statistical databases
- Differential privacy
Attacks on k-Anonymity
- k-Anonymity does not provide privacy if
  – Sensitive values in an equivalence class lack diversity
  – The attacker has background knowledge

A 3-anonymous patient table:
Zipcode | Age | Disease
476** | 2*  | Heart Disease
476** | 2*  | Heart Disease
476** | 2*  | Heart Disease
4790* | ≥40 | Flu
4790* | ≥40 | Heart Disease
4790* | ≥40 | Cancer
476** | 3*  | Heart Disease
476** | 3*  | Cancer
476** | 3*  | Cancer

Homogeneity attack: Bob (Zipcode 47678, Age 27) falls in the (476**, 2*) class, where every record is Heart Disease.
Background knowledge attack: Carl (Zipcode 47673, Age 36) falls in the (476**, 3*) class; an attacker who knows Carl is unlikely to have heart disease concludes he has cancer.
l-Diversity [Machanavajjhala et al. ICDE '06]

Race | Zipcode | Disease
Caucas | 787XX | Flu
Caucas | 787XX | Shingles
Caucas | 787XX | Acne
Caucas | 787XX | Flu
Caucas | 787XX | Acne
Caucas | 787XX | Acne
Caucas | 787XX | Flu
Asian/AfrAm | 78XXX | Flu
Asian/AfrAm | 78XXX | Flu
Asian/AfrAm | 78XXX | Acne
Asian/AfrAm | 78XXX | Shingles
Asian/AfrAm | 78XXX | Acne
Asian/AfrAm | 78XXX | Flu

Sensitive attributes must be "diverse" within each quasi-identifier equivalence class.
Distinct l-Diversity
- Each equivalence class has at least l well-represented sensitive values
- Doesn't prevent probabilistic inference attacks
  – Example: in an equivalence class of 10 records where 8 records have HIV and 2 records have other values, an attacker infers HIV with 80% confidence
Other Versions of l-Diversity
- Probabilistic l-diversity
  – The frequency of the most frequent value in an equivalence class is bounded by 1/l
- Entropy l-diversity
  – The entropy of the distribution of sensitive values in each equivalence class is at least log(l)
- Recursive (c,l)-diversity
  – r_1 < c (r_l + r_{l+1} + … + r_m), where r_i is the frequency of the i-th most frequent value
  – Intuition: the most frequent value does not appear too frequently
A small checker for these variants appears below.
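A minimal Python sketch (helper names are mine) checking three of these variants against the skewed 10-record class from the previous slide:

```python
from collections import Counter
from math import log

def distinct_l_diverse(values, l):
    # distinct l-diversity: at least l distinct sensitive values
    return len(set(values)) >= l

def probabilistic_l_diverse(values, l):
    # probabilistic l-diversity: max value frequency bounded by 1/l
    top = Counter(values).most_common(1)[0][1]
    return top / len(values) <= 1.0 / l

def entropy_l_diverse(values, l):
    # entropy l-diversity: entropy of the class distribution >= log(l)
    n = len(values)
    h = -sum((c / n) * log(c / n) for c in Counter(values).values())
    return h >= log(l)

eq_class = ["HIV"] * 8 + ["flu", "cancer"]   # the skewed class above
print(distinct_l_diverse(eq_class, 2))       # True: 3 distinct values
print(probabilistic_l_diverse(eq_class, 2))  # False: 0.8 > 1/2
print(entropy_l_diverse(eq_class, 2))        # False: ~0.64 < log 2
```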
Neither Necessary, Nor Sufficient

Original dataset (overall, 99% have cancer):
2 | Cancer
2 | Cancer
2 | Cancer
2 | Flu
2 | Cancer

Anonymization A:
Q1 | Flu
Q1 | Flu
Q1 | Cancer
Q1 | Flu
Q1 | Cancer

Anonymization B:
Q1 | Flu
Q1 | Cancer
Q1 | Cancer
Q1 | Cancer
Q1 | Cancer

- 50% cancer ⇒ quasi-identifier group is "diverse" (Anonymization A), yet it reveals far more than the 99% baseline
- 99% cancer ⇒ quasi-identifier group is not "diverse" (Anonymization B), yet it reveals nothing beyond the 99% baseline
Limitations of l-Diversity
- Example: sensitive attribute is HIV+ (1%) or HIV- (99%)
– Very different degrees of sensitivity!
- l-diversity is unnecessary
– 2-diversity is unnecessary for an equivalence class that contains only HIV- records
- l-diversity is difficult to achieve
– Suppose there are 10000 records in total, 1% (100) of them HIV+
– To have distinct 2-diversity, every equivalence class needs at least one HIV+ record, so there can be at most 10000 × 1% = 100 equivalence classes
Skewness Attack
- Example: sensitive attribute is HIV+ (1%) or HIV- (99%)
- Consider an equivalence class that contains an equal number of HIV+ and HIV- records
  – Diverse, but potentially violates privacy: a 50% chance of being HIV+ is far above the 1% baseline
- l-diversity does not differentiate:
  – Equivalence class 1: 49 HIV+ and 1 HIV-
  – Equivalence class 2: 1 HIV+ and 49 HIV-

l-diversity does not consider the overall distribution of sensitive values!
Sensitive Attribute Disclosure (Similarity Attack)

Bob: Zipcode 47678, Age 27

A 3-diverse patient table:
Zipcode | Age | Salary | Disease
476** | 2*  | 20K  | Gastric Ulcer
476** | 2*  | 30K  | Gastritis
476** | 2*  | 40K  | Stomach Cancer
4790* | ≥40 | 50K  | Gastritis
4790* | ≥40 | 100K | Flu
4790* | ≥40 | 70K  | Bronchitis
476** | 3*  | 60K  | Bronchitis
476** | 3*  | 80K  | Pneumonia
476** | 3*  | 90K  | Stomach Cancer

Conclusion:
1. Bob's salary is in [20K, 40K], which is relatively low
2. Bob has some stomach-related disease

l-diversity does not consider semantics of sensitive values!
t-Closeness: A New Privacy Measure
- Rationale: model how the adversary's belief about an individual's sensitive value is refined by each piece of released information
  – B0: belief based on external knowledge alone
  – B1: belief after learning the overall distribution Q of sensitive values in the table
  – B2: belief after learning the distribution Pi of sensitive values in the individual's equivalence class
- Observations
  – Q is public or can be derived, so the gain from B0 to B1 is unavoidable
  – The real threat is the potential knowledge gain about specific individuals from Q to Pi (B1 to B2)
- Principle
  – The distance between Q and Pi should be bounded by a threshold t
t-Closeness [Li et al. ICDE '07]

Race | Zipcode | Disease
Caucas | 787XX | Flu
Caucas | 787XX | Shingles
Caucas | 787XX | Acne
Caucas | 787XX | Flu
Caucas | 787XX | Acne
Caucas | 787XX | Acne
Caucas | 787XX | Flu
Asian/AfrAm | 78XXX | Flu
Asian/AfrAm | 78XXX | Flu
Asian/AfrAm | 78XXX | Acne
Asian/AfrAm | 78XXX | Shingles
Asian/AfrAm | 78XXX | Acne
Asian/AfrAm | 78XXX | Flu

Distribution of sensitive attributes within each quasi-identifier group should be "close" to their distribution in the entire original database.
Distance Measures
- P = (p1, p2, …, pm), Q = (q1, q2, …, qm)
- Trace distance: D[P,Q] = (1/2) Σ_i |p_i − q_i|
- KL divergence: D[P,Q] = Σ_i p_i log(p_i / q_i)
- None of these measures reflects the semantic distance among values
- Example: Q = {3K,4K,5K,6K,7K,8K,9K,10K,11K}, P1 = {3K,4K,5K}, P2 = {6K,8K,11K}
- Intuitively, D[P1,Q] > D[P2,Q], but these measures cannot see it (checked in the sketch below)
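A quick numeric check (my own sketch) that trace distance and KL divergence assign the same score to P1 and P2, which motivates EMD on the next slides:

```python
from math import log

# Salary domain 3K..11K; Q is uniform, P1 = {3K,4K,5K}, P2 = {6K,8K,11K}.
q  = [1/9] * 9
p1 = [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0]
p2 = [0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3]

def trace(p, q):
    # trace (variational) distance: half the L1 distance
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    # KL divergence, summing only over the support of p
    return sum(a * log(a / b) for a, b in zip(p, q) if a > 0)

print(trace(p1, q), trace(p2, q))  # both ~0.667: indistinguishable
print(kl(p1, q), kl(p2, q))        # both ~1.099 (= log 3): indistinguishable
```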
Earth Mover's Distance
- If the distributions are interpreted as two different ways of piling up a certain amount of dirt over region D, EMD is the minimum cost of turning one pile into the other
  – The cost is the amount of dirt moved times the distance by which it is moved
  – Assumes the two piles hold the same amount of dirt
- Extensions exist for comparing distributions with different total masses
  – Allow a partial match: leftover "dirt" is discarded without cost
  – Allow mass to be created or destroyed, but with a cost penalty
Earth Mover's Distance
- Formulation
  – P = (p1, p2, …, pm), Q = (q1, q2, …, qm)
  – d_ij: the ground distance between element i of P and element j of Q
  – Find a flow F = [f_ij], where f_ij is the flow of mass from element i of P to element j of Q, that minimizes the overall work
      WORK(P, Q, F) = Σ_i Σ_j d_ij f_ij
    subject to the constraints
      f_ij ≥ 0                          (1 ≤ i, j ≤ m)
      p_i − Σ_j f_ij + Σ_j f_ji = q_i   (1 ≤ i ≤ m)
      Σ_i Σ_j f_ij = Σ_i p_i = Σ_i q_i = 1
How to Calculate EMD (Cont'd)
- EMD for categorical attributes
  – Use hierarchical distance, defined over a generalization hierarchy of the values
  – Hierarchical distance is a metric
Earth Mover's Distance
- Example
  – P1 = {3K,4K,5K}, Q = {3K,4K,5K,6K,7K,8K,9K,10K,11K}
  – Move 1/9 probability for each of the following pairs:
    - 3K→6K, 3K→7K: cost 1/9 × (3+4)/8
    - 4K→8K, 4K→9K: cost 1/9 × (4+5)/8
    - 5K→10K, 5K→11K: cost 1/9 × (5+6)/8
  – Total cost: 1/9 × 27/8 = 0.375
  – With P2 = {6K,8K,11K}, the total cost is 1/9 × 12/8 = 0.167 < 0.375. This makes more sense than the other two distance measures.
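For ordered numerical attributes, this EMD reduces to a closed form over cumulative differences, D[P,Q] = (1/(m−1)) Σ_i |Σ_{j≤i} (p_j − q_j)|. The sketch below (function name is mine) reproduces both totals:

```python
def ordered_emd(p, q):
    """EMD between two distributions over the same ordered domain."""
    assert len(p) == len(q)
    total, cum = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi      # running surplus/deficit of "dirt"
        total += abs(cum)   # it must be carried one more step
    return total / (len(p) - 1)

# Salary domain 3K..11K; Q uniform, P1 = {3K,4K,5K}, P2 = {6K,8K,11K}.
q  = [1/9] * 9
p1 = [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0]
p2 = [0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3]
print(ordered_emd(p1, q))  # ~0.375
print(ordered_emd(p2, q))  # ~0.167
```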
Experiments
- Goal
  – To show that l-diversity does not provide sufficient privacy protection (the similarity attack)
  – To show that the efficiency and data quality of using t-closeness are comparable with other privacy measures
- Setup
  – Adult dataset from the UC Irvine ML repository
  – 30162 tuples, 9 attributes (2 sensitive attributes)
  – Algorithm: Incognito
Experiments
- Comparison of privacy measurements
  – k-Anonymity
  – Entropy l-diversity
  – Recursive (c,l)-diversity
  – k-Anonymity with t-closeness
Experiments
- Efficiency
– The efficiency of using t-closeness is comparable with other privacy measurements
Experiments
- Data utility
  – Metrics: discernibility metric; minimum average group size
  – The data quality of using t-closeness is comparable with other privacy measurements
Anonymous, "t-Close" Dataset
- This is k-anonymous, l-diverse and t-close…
- …so secure, right?
What Does Attacker Know?
- Bob is Caucasian and I heard he was admitted to hospital with flu…
- …and I know three other Caucasians admitted to hospital with acne or shingles…
k-Anonymity and Partition-based Notions
- Syntactic
  – Focuses on data transformation, not on what can be learned from the anonymized dataset
  – A "k-anonymous" dataset can leak sensitive information
- "Quasi-identifier" fallacy
  – Assumes a priori that the attacker will not know certain information about his target
Today
- Permutation-based anonymization methods (cont.)
- Other privacy principles for microdata publishing
- Statistical databases
  – Definitions and early methods
  – Output perturbation and differential privacy
Statistical Data Release
- Originated from the study of statistical databases
- A statistical database is a database which provides statistics on subsets of records
- OLAP vs. OLTP
- Statistical queries may compute the SUM, MEAN, MEDIAN, COUNT, MAX, and MIN of records
Types of Statistical Databases
- Static: made once and never changes (example: U.S. Census)
- Dynamic: changes continuously to reflect real-time data (example: most online research databases)
- Centralized: one database; Decentralized: multiple decentralized databases
- General purpose: like census; Special purpose: like bank, hospital, academia, etc.
Data Compromise
- Exact compromise: a user is able to determine the exact value of a sensitive attribute of an individual
- Partial compromise: a user is able to obtain an estimator for a sensitive attribute with a bounded variance
- Positive compromise: determine that an attribute has a particular value
- Negative compromise: determine that an attribute does not have a particular value
- Relative compromise: determine the ranking of some confidential values
A small differencing example follows this list.
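To make exact compromise concrete, here is a minimal sketch (hypothetical data and helper) of the classic differencing attack, where two individually innocuous SUM queries reveal one person's value:

```python
# A statistical database that answers SUM over subsets of records.
salaries = {"Alice": 54000, "Bob": 11000, "Carol": 25000, "Dave": 30000}

def query_sum(predicate):
    # each query alone looks like a harmless aggregate statistic
    return sum(v for name, v in salaries.items() if predicate(name))

everyone    = query_sum(lambda name: True)
all_but_bob = query_sum(lambda name: name != "Bob")
print(everyone - all_but_bob)  # 11000: Bob's exact salary is compromised
```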
Statistical Quality of Information
- Bias: difference between the unperturbed statistic and the expected value of its perturbed estimate
- Precision: variance of the estimators obtained by users
- Consistency: lack of contradictions and paradoxes
  – Contradictions: different responses to the same query; average differs from sum/count
  – Paradox: negative count
Methods
- Query restriction
- Data perturbation/anonymization
- Output perturbation

(Diagrams: in data perturbation, users' queries are answered against a perturbed copy of the original database; in output perturbation, queries run against the original database and noise is added to the results before they are returned to users. A sketch of output perturbation follows.)
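A minimal sketch of output perturbation (illustrative names, not a specific system): the true answer is computed on the original database and Laplace noise is added to the released result. Calibrating such noise to the query is the core idea behind differential privacy, covered next.

```python
import random

def laplace(scale):
    # difference of two i.i.d. exponential variates is Laplace(0, scale)
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_count(records, predicate, scale=1.0):
    true_answer = sum(1 for r in records if predicate(r))  # exact, internal
    return true_answer + laplace(scale)                    # perturbed, released

salaries = [11000, 13000, 59000, 12000, 54000, 25000, 30000]
print(noisy_count(salaries, lambda s: s > 20000))  # about 4, plus noise
```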
Statistical data release vs. data anonymization
- Data anonymization is one technique that can be used to build a statistical database
- Other techniques, such as query restriction and output perturbation, can also be used to build a statistical database or release statistical data
- Different privacy principles can be used