

SLIDE 1

CS573 Data Privacy and Security
Anonymization methods

Li Xiong

SLIDE 2

Today

  • Permutation-based anonymization methods (cont.)
  • Other privacy principles for microdata publishing
  • Statistical databases
SLIDE 3

Anonymization methods

  • Non-perturbative: don't distort the data
    – Generalization
    – Suppression
  • Perturbative: distort the data
    – Microaggregation/clustering
    – Additive noise
  • Anatomization and permutation
    – De-associate relationship between QID and sensitive attribute

SLIDE 4

Concept of the Anatomy Algorithm

  • Release 2 tables: a quasi-identifier table (QIT) and a sensitive table (ST)
  • Use the same QI groups (satisfying l-diversity), replacing the sensitive
    attribute values with a Group-ID column
  • Then produce a sensitive table with group-level statistics

QIT:
  tuple ID  Age  Sex  Salary  Group-ID
  1         23   M    11000   1
  2         27   M    13000   1
  3         35   M    59000   1
  4         59   M    12000   1
  5         61   F    54000   2
  6         65   F    25000   2
  7         65   F    25000   2
  8         70   F    30000   2

ST:
  Group-ID  Disease       Count
  1         headache      2
  1         pneumonia     2
  2         bronchitis    1
  2         flu           2
  2         stomach ache  1

SLIDE 5

Specifications of Anatomy

DEFINITION 3 (Anatomy). Given an l-diverse partition of the microdata table, anatomy creates a QIT and an ST. The QIT is constructed with schema (QI1, …, QId, Group-ID); the ST is constructed with schema (Group-ID, Sensitive-Value, Count).
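To make the two schemas concrete, here is a minimal Python sketch of the construction (the anatomize helper is hypothetical, and the per-tuple disease assignment is invented only so that it reproduces the ST counts on the previous slide):

```python
from collections import Counter

def anatomize(partition):
    """Sketch of the Anatomy construction. 'partition' is an l-diverse
    partition: a list of QI groups, each a list of (qi_tuple, sensitive_value)
    pairs. Returns QIT rows (QI values..., Group-ID) and ST rows
    (Group-ID, sensitive value, count)."""
    qit, st = [], []
    for gid, group in enumerate(partition, start=1):
        for qi_values, _ in group:
            qit.append(qi_values + (gid,))            # QI values + Group-ID
        for value, count in Counter(s for _, s in group).items():
            st.append((gid, value, count))            # per-group statistics
    return qit, st

# The two groups from the previous slide; the disease-to-tuple assignment
# here is illustrative, chosen only to match the ST counts.
partition = [
    [((23, "M", 11000), "headache"), ((27, "M", 13000), "headache"),
     ((35, "M", 59000), "pneumonia"), ((59, "M", 12000), "pneumonia")],
    [((61, "F", 54000), "flu"), ((65, "F", 25000), "flu"),
     ((65, "F", 25000), "bronchitis"), ((70, "F", 30000), "stomach ache")],
]
qit, st = anatomize(partition)
# st -> [(1, 'headache', 2), (1, 'pneumonia', 2),
#        (2, 'flu', 2), (2, 'bronchitis', 1), (2, 'stomach ache', 1)]
```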

SLIDE 6

Privacy properties

THEOREM 1. Given a pair of QIT and ST, the probability of inferring the sensitive value of any individual is at most 1/l.

Joined view of QIT and ST:
  Age  Sex  Salary  Group-ID  Disease      Count
  23   M    11000   1         dyspepsia    2
  23   M    11000   1         pneumonia    2
  27   M    13000   1         dyspepsia    2
  27   M    13000   1         pneumonia    2
  35   M    59000   1         dyspepsia    2
  35   M    59000   1         pneumonia    2
  59   M    12000   1         dyspepsia    2
  59   M    12000   1         pneumonia    2
  61   F    54000   2         bronchitis   1
  61   F    54000   2         flu          2
  61   F    54000   2         stomachache  1
  65   F    25000   2         bronchitis   1
  65   F    25000   2         flu          2
  65   F    25000   2         stomachache  1
  65   F    25000   2         bronchitis   1
  65   F    25000   2         flu          2
  65   F    25000   2         stomachache  1
  70   F    30000   2         bronchitis   1
  70   F    30000   2         flu          2
  70   F    30000   2         stomachache  1
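As a worked instance of the bound: an adversary who knows Bob's QI values (23, M, 11000) learns from the joined view only that Bob is in group 1, so

$$P(\text{Bob has pneumonia}) = \frac{c(\text{group 1}, \text{pneumonia})}{|\text{group 1}|} = \frac{2}{4} = \frac{1}{2} = \frac{1}{l} \quad (l = 2).$$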

SLIDE 7

Comparison with generalization

  • Compare with generalization under two assumptions:
    – A1: the adversary knows the QI values of the target individual
    – A2: the adversary also knows that the individual is definitely in the
      microdata table
  • If A1 and A2 are true, anatomy is as good as generalization: the 1/l bound
    holds
  • If A1 is true and A2 is false, generalization is stronger
  • If A1 and A2 are false, generalization is still stronger

SLIDE 8

Preserving Data Correlation

  • Examine the correlation between Age and Disease in T using a probability
    density function (pdf)
  • Example: the pdf of tuple t1

Table 1 (microdata):
  tuple ID   Age  Sex  Salary  Disease
  1 (Bob)    23   M    11000   pneumonia
  2          27   M    13000   dyspepsia
  3          35   M    59000   dyspepsia
  4          59   M    12000   pneumonia
  5          61   F    54000   flu
  6          65   F    25000   stomach pain
  7 (Alice)  65   F    25000   flu
  8          70   F    30000   bronchitis

SLIDE 9

Preserving Data Correlation

  • To reconstruct an approximate pdf of t1 from the generalization table:

Table 2 (generalization):
  tuple ID  Age      Sex  Salary          Disease
  1         [21,60]  M    [10001, 60000]  pneumonia
  2         [21,60]  M    [10001, 60000]  dyspepsia
  3         [21,60]  M    [10001, 60000]  dyspepsia
  4         [21,60]  M    [10001, 60000]  pneumonia
  5         [61,70]  F    [10001, 60000]  flu
  6         [61,70]  F    [10001, 60000]  stomach pain
  7         [61,70]  F    [10001, 60000]  flu
  8         [61,70]  F    [10001, 60000]  bronchitis

SLIDE 10

Preserving Data Correlation

  • To reconstruct an approximate pdf of t1 from the QIT and ST tables:

QIT:
  tuple ID  Age  Sex  Salary  Group-ID
  1         23   M    11000   1
  2         27   M    13000   1
  3         35   M    59000   1
  4         59   M    12000   1
  5         61   F    54000   2
  6         65   F    25000   2
  7         65   F    25000   2
  8         70   F    30000   2

ST:
  Group-ID  Disease       Count
  1         headache      2
  1         pneumonia     2
  2         bronchitis    1
  2         flu           2
  2         stomach ache  1
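A worked sketch of this reconstruction (assuming, as the two tables suggest, that each group's ST counts are spread uniformly over the group's tuples): Bob's tuple 1 carries Group-ID 1, so

$$\tilde{p}_{t_1}(\text{headache}) = \frac{2}{4} = 0.5, \qquad \tilde{p}_{t_1}(\text{pneumonia}) = \frac{2}{4} = 0.5 .$$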

SLIDE 11

Preserving Data Correlation

  • For a more rigorous comparison, calculate the “distance” between the
    approximate and exact pdfs with the equation below. The distance for
    anatomy is 0.5, while the distance for generalization is 22.5.
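A plausible form of this equation, assuming the squared-L2 distance between the approximate pdf and the exact pdf of a tuple t:

$$\mathrm{dist}(\tilde{p}_t, p_t) = \sum_{x} \bigl(\tilde{p}_t(x) - p_t(x)\bigr)^2 .$$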

SLIDE 12

Preserving Data Correlation

  • Idea: measure the re-construction error of each tuple’s approximate pdf
    (see the formula below)
  • Objective: aggregate the error over all tuples in T and obtain a minimal
    re-construction error (RCE)
  • Algorithm: the Nearly-Optimal Anatomizing Algorithm
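Under the same squared-L2 assumption as above, the objective can be written as the per-tuple error summed over all tuples of the microdata table T:

$$\mathrm{RCE}(T) = \sum_{t \in T} \mathrm{dist}(\tilde{p}_t, p_t).$$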

SLIDE 13

Experiments

  • Dataset: CENSUS, containing the personal information of 500k American
    adults in 9 discrete attributes
  • Created two sets of tables:
    – Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses
      the first d attributes as QI-attributes and Occupation as the sensitive
      attribute
    – Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses
      the first d attributes as QI-attributes and Salary as the sensitive
      attribute

SLIDE 14

Experiments

SLIDE 15

Today

  • Permutation-based anonymization methods (cont.)
  • Other privacy principles for microdata publishing
  • Statistical databases
  • Differential privacy
SLIDE 16

Attacks on k-Anonymity

  • k-Anonymity does not provide privacy if
    – Sensitive values in an equivalence class lack diversity
    – The attacker has background knowledge

A 3-anonymous patient table:
  Zipcode  Age  Disease
  476**    2*   Heart Disease
  476**    2*   Heart Disease
  476**    2*   Heart Disease
  4790*    ≥40  Flu
  4790*    ≥40  Heart Disease
  4790*    ≥40  Cancer
  476**    3*   Heart Disease
  476**    3*   Cancer
  476**    3*   Cancer

  • Homogeneity attack: Bob (Zipcode 47678, Age 27) matches the first
    equivalence class, where every record has Heart Disease
  • Background knowledge attack: Carl (Zipcode 47673, Age 36) matches the
    third equivalence class

SLIDE 17

l-Diversity [Machanavajjhala et al. ICDE ‘06]

  • Sensitive attributes must be “diverse” within each quasi-identifier
    equivalence class

  Caucas       787XX  Flu
  Caucas       787XX  Shingles
  Caucas       787XX  Acne
  Caucas       787XX  Flu
  Caucas       787XX  Acne
  Caucas       787XX  Acne
  Caucas       787XX  Flu
  Asian/AfrAm  78XXX  Flu
  Asian/AfrAm  78XXX  Flu
  Asian/AfrAm  78XXX  Acne
  Asian/AfrAm  78XXX  Shingles
  Asian/AfrAm  78XXX  Acne
  Asian/AfrAm  78XXX  Flu

SLIDE 18

Distinct l-Diversity

  • Each equivalence class has at least l well-represented sensitive values
  • Doesn’t prevent probabilistic inference attacks
    – Example: an equivalence class with 10 records, of which 8 have HIV and
      2 have other values; an adversary infers HIV with 80% confidence

SLIDE 19

Other Versions of l-Diversity

  • Probabilistic l-diversity
    – The frequency of the most frequent value in an equivalence class is
      bounded by 1/l
  • Entropy l-diversity
    – The entropy of the distribution of sensitive values in each equivalence
      class is at least log(l)
  • Recursive (c,l)-diversity
    – r1 < c(rl + rl+1 + … + rm), where ri is the frequency of the ith most
      frequent value
    – Intuition: the most frequent value does not appear too frequently
      (a sketch comparing the three variants follows this list)
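A minimal Python sketch contrasting the three checks (the function names and the example equivalence class are illustrative, not from the slides):

```python
import math
from collections import Counter

def distinct_l(values, l):
    """Distinct l-diversity: at least l distinct sensitive values."""
    return len(set(values)) >= l

def entropy_l(values, l):
    """Entropy l-diversity: entropy of the class's value distribution is
    at least log(l) (natural log used consistently on both sides)."""
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
    return entropy >= math.log(l)

def recursive_cl(values, c, l):
    """Recursive (c,l)-diversity: r1 < c*(rl + ... + rm), where the
    frequencies r1 >= r2 >= ... >= rm are sorted in descending order."""
    r = sorted(Counter(values).values(), reverse=True)
    return len(r) >= l and r[0] < c * sum(r[l - 1:])

eq_class = ["HIV"] * 8 + ["flu", "acne"]   # the skewed class from slide 18
print(distinct_l(eq_class, 3))             # True: 3 distinct values
print(entropy_l(eq_class, 3))              # False: 0.64 < log(3) ~ 1.10
print(recursive_cl(eq_class, 2, 3))        # False: 8 < 2*1 fails
```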

SLIDE 20

Neither Necessary, Nor Sufficient

[Tables: an original dataset of Cancer/Flu records, and a dataset in which
99% of the records have cancer]

SLIDE 21

Neither Necessary, Nor Sufficient

[Tables: the original datasets alongside Anonymization A, whose
quasi-identifier groups Q1 and Q2 mix Cancer and Flu records]

  • 50% cancer ⇒ quasi-identifier group is “diverse”

SLIDE 22

Neither Necessary, Nor Sufficient

[Tables: the original datasets alongside Anonymizations A and B]

  • 50% cancer ⇒ quasi-identifier group is “diverse”
  • 99% cancer ⇒ quasi-identifier group is not “diverse”

SLIDE 23

Limitations of l-Diversity

  • Example: sensitive attribute is HIV+ (1%) or HIV- (99%)
    – Very different degrees of sensitivity!
  • l-diversity is unnecessary
    – 2-diversity is unnecessary for an equivalence class that contains only
      HIV- records
  • l-diversity is difficult to achieve
    – Suppose there are 10000 records in total
    – To have distinct 2-diversity, there can be at most 10000 × 1% = 100
      equivalence classes

SLIDE 24

Skewness Attack

  • Example: sensitive attribute is HIV+ (1%) or HIV- (99%)
  • Consider an equivalence class that contains an equal number of HIV+ and
    HIV- records
    – Diverse, but potentially violates privacy!
  • l-diversity does not differentiate:
    – Equivalence class 1: 49 HIV+ and 1 HIV-
    – Equivalence class 2: 1 HIV+ and 49 HIV-

l-diversity does not consider the overall distribution of sensitive values!

SLIDE 25

Sensitive Attribute Disclosure

A 3-diverse patient table:
  Zipcode  Age  Salary  Disease
  476**    2*   20K     Gastric Ulcer
  476**    2*   30K     Gastritis
  476**    2*   40K     Stomach Cancer
  4790*    ≥40  50K     Gastritis
  4790*    ≥40  100K    Flu
  4790*    ≥40  70K     Bronchitis
  476**    3*   60K     Bronchitis
  476**    3*   80K     Pneumonia
  476**    3*   90K     Stomach Cancer

  • Similarity attack: Bob (Zipcode 47678, Age 27) matches the first
    equivalence class
  • Conclusion:
    1. Bob’s salary is in [20k,40k], which is relatively low
    2. Bob has some stomach-related disease

l-diversity does not consider the semantics of sensitive values!

SLIDE 26

t-Closeness: A New Privacy Measure

  • Rationale: belief evolution
    – B0: the adversary’s prior belief about an individual’s sensitive value
    – B1: belief after learning the overall distribution Q of sensitive values
    – B2: belief after learning the distribution Pi of sensitive values in the
      individual’s equivalence class
  • Observations
    – Q is public or can be derived
    – The potential knowledge gain from Q and Pi concerns specific individuals
  • Principle
    – The distance between Q and Pi should be bounded by a threshold t

SLIDE 27

t-Closeness [Li et al. ICDE ‘07]

  • Distribution of sensitive attributes within each quasi-identifier group
    should be “close” to their distribution in the entire original database

  Caucas       787XX  Flu
  Caucas       787XX  Shingles
  Caucas       787XX  Acne
  Caucas       787XX  Flu
  Caucas       787XX  Acne
  Caucas       787XX  Acne
  Caucas       787XX  Flu
  Asian/AfrAm  78XXX  Flu
  Asian/AfrAm  78XXX  Flu
  Asian/AfrAm  78XXX  Acne
  Asian/AfrAm  78XXX  Shingles
  Asian/AfrAm  78XXX  Acne
  Asian/AfrAm  78XXX  Flu

SLIDE 28

Distance Measures

  • P = (p1, p2, …, pm), Q = (q1, q2, …, qm)
  • Trace distance (see below)
  • KL-divergence (see below)
  • None of these measures reflects the semantic distance among values
  • Example: Q = {3K, 4K, 5K, 6K, 7K, 8K, 9K, 10K, 11K},
    P1 = {3K, 4K, 5K}, P2 = {5K, 7K, 10K}
  • Intuitively, D[P1, Q] > D[P2, Q]
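For reference, the standard definitions of these two measures over P and Q:

$$D_{\mathrm{trace}}[P, Q] = \tfrac{1}{2} \sum_{i=1}^{m} |p_i - q_i|, \qquad D_{\mathrm{KL}}[P, Q] = \sum_{i=1}^{m} p_i \log \frac{p_i}{q_i}.$$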
SLIDE 29

Earth Mover’s Distance

  • If the distributions are interpreted as two different ways of piling up a
    certain amount of dirt over region D, EMD is the minimum cost of turning
    one pile into the other
    – The cost is the amount of dirt moved × the distance by which it is moved
    – Assumes the two piles hold the same amount of dirt
  • Extensions exist for comparing distributions with different total masses:
    – Allow a partial match and discard leftover “dirt” without cost
    – Allow mass to be created or destroyed, but with a cost penalty

SLIDE 30

Earth Mover’s Distance

  • Formulation
    – P = (p1, p2, …, pm), Q = (q1, q2, …, qm)
    – dij: the ground distance between element i of P and element j of Q
    – Find a flow F = [fij], where fij is the flow of mass from element i of P
      to element j of Q, that minimizes the overall work subject to the
      constraints given below
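In the standard EMD formulation (as used in the t-closeness paper), the work and constraints are:

$$\mathrm{WORK}(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij} f_{ij}$$

subject to

$$f_{ij} \ge 0 \;(1 \le i, j \le m), \qquad p_i - \sum_{j=1}^{m} f_{ij} + \sum_{j=1}^{m} f_{ji} = q_i \;(1 \le i \le m), \qquad \sum_{i=1}^{m} \sum_{j=1}^{m} f_{ij} = \sum_{i=1}^{m} p_i = \sum_{j=1}^{m} q_j = 1.$$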

SLIDE 31

How to calculate EMD (cont’d)

  • EMD for categorical attributes
    – Hierarchical distance
    – Hierarchical distance is a metric

SLIDE 32

Earth Mover’s Distance

  • Example
    – P1 = {3k, 4k, 5k} and Q = {3k, 4k, 5k, 6k, 7k, 8k, 9k, 10k, 11k}
    – Move 1/9 probability for each of the following pairs:
      • 3k→6k, 3k→7k: cost 1/9 × (3+4)/8
      • 4k→8k, 4k→9k: cost 1/9 × (4+5)/8
      • 5k→10k, 5k→11k: cost 1/9 × (5+6)/8
    – Total cost: 1/9 × 27/8 = 0.375
    – With P2 = {6k, 8k, 11k}, the total cost is 1/9 × 12/8 = 0.167 < 0.375.
      This makes more sense than the other two distance measures (checked in
      the sketch below).
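These totals can be checked mechanically: for one-dimensional distributions of equal total mass, EMD reduces to a sum of absolute cumulative differences. A minimal Python sketch (the function and the encoding of the example are mine):

```python
def emd_1d(p, q, norm):
    """EMD between two distributions on the same ordered domain, with
    ground distance |i - j| / norm. In one dimension this equals the sum
    of absolute cumulative differences, scaled by the unit distance."""
    total, cum = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi               # net mass that must still flow right
        total += abs(cum)
    return total / norm

# Domain {3k, 4k, ..., 11k}; ground distances are normalized by 8
q  = [1/9] * 9
p1 = [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0]     # {3k, 4k, 5k}
p2 = [0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3]     # {6k, 8k, 11k}
print(round(emd_1d(p1, q, 8), 3))          # 0.375, as computed above
print(round(emd_1d(p2, q, 8), 3))          # 0.167
```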

SLIDE 33

Experiments

  • Goal
    – To show that l-diversity does not provide sufficient privacy protection
      (the similarity attack)
    – To show that the efficiency and data quality of using t-closeness are
      comparable with other privacy measures
  • Setup
    – Adult dataset from the UC Irvine ML repository
    – 30162 tuples, 9 attributes (2 sensitive attributes)
    – Algorithm: Incognito

SLIDE 34

Experiments

  • Comparison of privacy measurements
    – k-Anonymity
    – Entropy l-diversity
    – Recursive (c,l)-diversity
    – k-Anonymity with t-closeness

SLIDE 35

Experiments

  • Efficiency

– The efficiency of using t-closeness is comparable with other privacy measurements

SLIDE 36

Experiments

  • Data utility

– Metrics: discernibility metric; minimum average group size
– The data quality of using t-closeness is comparable with other privacy measurements

SLIDE 37
Anonymous, “t-Close” Dataset

  • This is k-anonymous, l-diverse and t-close…
  • …so secure, right?

SLIDE 38
What Does Attacker Know?

  • Bob is Caucasian and I heard he was admitted to hospital with flu…
SLIDE 39
What Does Attacker Know?

  • Bob is Caucasian and I heard he was admitted to hospital…
  • And I know three other Caucasians admitted to hospital with Acne or
    Shingles…

SLIDE 40

k-Anonymity and Partition-based notions

  • Syntactic
    – Focuses on data transformation, not on what can be learned from the
      anonymized dataset
    – A “k-anonymous” dataset can leak sensitive information
  • “Quasi-identifier” fallacy
    – Assumes a priori that the attacker will not know certain information
      about his target

SLIDE 41

Today

  • Permutation-based anonymization methods (cont.)
  • Other privacy principles for microdata publishing
  • Statistical databases
    – Definitions and early methods
    – Output perturbation and differential privacy

SLIDE 42
Statistical Data Release

  • Originated from the study of statistical databases
  • A statistical database is a database which provides statistics on subsets
    of records
  • OLAP vs. OLTP
  • Statistics may be performed to compute the SUM, MEAN, MEDIAN, COUNT, MAX,
    and MIN of records

SLIDE 43

Types of Statistical Databases

  • Static – made once and never changes
    – Example: U.S. Census
  • Dynamic – changes continuously to reflect real-time data
    – Example: most online research databases

SLIDE 44

Types of Statistical Databases

  • Centralized – one database
  • Decentralized – multiple decentralized databases
  • General purpose – like census
  • Special purpose – like bank, hospital, academia, etc.

SLIDE 45
Data Compromise

  • Exact compromise – a user is able to determine the exact value of a
    sensitive attribute of an individual
  • Partial compromise – a user is able to obtain an estimator for a sensitive
    attribute with a bounded variance
  • Positive compromise – determine that an attribute has a particular value
  • Negative compromise – determine that an attribute does not have a
    particular value
  • Relative compromise – determine the ranking of some confidential values

SLIDE 46

Statistical Quality of Information

  • Bias – difference between the unperturbed statistic and the expected value
    of its perturbed estimate
  • Precision – variance of the estimators obtained by users
  • Consistency – lack of contradictions and paradoxes
    – Contradictions: different responses to the same query; average differs
      from sum/count
    – Paradox: negative count

SLIDE 47

Methods

  • Query restriction
  • Data perturbation/anonymization
  • Output perturbation

SLIDE 48

Data Perturbation

[Diagram: users send queries to a perturbed version of the database and
receive results]

SLIDE 49

Output Perturbation

[Diagram: users 1 and 2 query the original database; noise is added to each
query result before it is returned]
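As a minimal sketch of the idea, here is output perturbation of a COUNT query using Laplace noise (the Laplace choice anticipates the differential-privacy material on the agenda; the function name and parameters are illustrative, not a mechanism specified on this slide):

```python
import random

def perturbed_count(records, predicate, epsilon=0.1):
    """Output-perturbation sketch: answer a COUNT query from the original
    data, then add Laplace noise with scale sensitivity/epsilon (a count
    has sensitivity 1) before releasing the result."""
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two iid exponentials is Laplace-distributed
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

ages = [23, 27, 35, 59, 61, 65, 65, 70]
print(perturbed_count(ages, lambda a: a >= 60))   # true answer 4, plus noise
```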

SLIDE 50

Statistical data release vs. data anonymization

  • Data anonymization is one technique that can be used to build a
    statistical database
  • Other techniques, such as query restriction and output perturbation, can
    be used to build a statistical database or to release statistical data
  • Different privacy principles can be used
SLIDE 51

Security Methods

  • Query restriction (early methods)
    – Query size control
    – Query set overlap control
    – Query auditing
  • Data perturbation/anonymization
  • Output perturbation