SLIDE 1

Anonymization Algorithms - Other techniques, metrics, and extended scenarios

Li Xiong

CS573 Data Privacy and Anonymity

SLIDE 2

So far

• k-anonymity (protect identity disclosure)
• Anonymization algorithms
  ◦ Generalization and suppression
  ◦ Microaggregation and clustering
• Privacy principles beyond k-anonymity
  ◦ l-diversity, t-closeness (protect attribute disclosure)
  ◦ m-invariance (protect continuous publishing)

SLIDE 3

Agenda

• Other anonymization technique
  ◦ Anatomization
• Information metrics
• Extended scenarios

SLIDE 4

Anonymization methods

• Non-perturbative: don't distort the data
  ◦ Generalization
  ◦ Suppression
• Perturbative: distort the data
  ◦ Microaggregation/clustering
  ◦ Additive noise
• Anatomization and permutation
  ◦ De-associate relationship between QID and sensitive attribute

SLIDE 5

tuple ID     Age   Sex   Zipcode   Disease
1 (Bob)      23    M     11000     pneumonia
2            27    M     13000     dyspepsia
3            35    M     59000     dyspepsia
4            59    M     12000     pneumonia
5            61    F     54000     flu
6            65    F     25000     stomach pain
7 (Alice)    65    F     25000     flu
8            70    F     30000     bronchitis
Table 1

tuple ID   Age       Sex   Zipcode          Disease
1          [21,60]   M     [10001, 60000]   pneumonia
2          [21,60]   M     [10001, 60000]   dyspepsia
3          [21,60]   M     [10001, 60000]   dyspepsia
4          [21,60]   M     [10001, 60000]   pneumonia
5          [61,70]   F     [10001, 60000]   flu
6          [61,70]   F     [10001, 60000]   stomach pain
7          [61,70]   F     [10001, 60000]   flu
8          [61,70]   F     [10001, 60000]   bronchitis
Table 2

Problems with k-anonymity and l-diversity

Query A: SELECT COUNT(*) FROM Microdata WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode BETWEEN 10001 AND 20000

SLIDE 6

Querying generalized table

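R1 and R2 are the anonymized QID groups, viewed as rectangles in the Age × Zipcode plane
RQ is the range of query A
p = Area(R1 ∩ RQ)/Area(R1) = (10*10)/(50*40) = 0.05, assuming tuples are uniformly distributed within R1 (Zipcode measured in thousands)
Estimated answer for A: 2(0.05) = 0.1, where 2 is the number of pneumonia tuples in R1; the actual answer on Table 1 is 1 (Bob)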
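
The estimate above can be sketched in a few lines. This is a minimal illustration under the uniform-distribution assumption, not code from the lecture; the names `overlap` and `estimate_count` and the dictionary layout are hypothetical, and the query's lower age bound is taken to be 0.

```python
def overlap(lo1, hi1, lo2, hi2):
    """Number of integers shared by the intervals [lo1, hi1] and [lo2, hi2]."""
    return max(0, min(hi1, hi2) - max(lo1, lo2) + 1)


def estimate_count(groups, disease, age_range, zip_range):
    """Estimate a COUNT(*) answer from generalized QI groups,
    assuming tuples are spread uniformly inside each group's rectangle."""
    total = 0.0
    for g in groups:
        age_frac = overlap(*g["age"], *age_range) / (g["age"][1] - g["age"][0] + 1)
        zip_frac = overlap(*g["zip"], *zip_range) / (g["zip"][1] - g["zip"][0] + 1)
        total += g["diseases"].count(disease) * age_frac * zip_frac
    return total


# The two QI groups of the generalized table (Table 2).
groups = [
    {"age": (21, 60), "zip": (10001, 60000),
     "diseases": ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"]},
    {"age": (61, 70), "zip": (10001, 60000),
     "diseases": ["flu", "stomach pain", "flu", "bronchitis"]},
]

# Query A: pneumonia, Age <= 30, Zipcode in [10001, 20000].
estimate = estimate_count(groups, "pneumonia", (0, 30), (10001, 20000))
print(round(estimate, 3))
```

The group rectangle is much larger than the query rectangle, which is why the estimate falls far below the true count of 1.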

SLIDE 7

Concept of the Anatomy Algorithm

Release 2 tables: a quasi-identifier table (QIT) and a sensitive table (ST)

Use the same QI groups (which satisfy l-diversity) and replace the sensitive attribute values with a Group-ID column

Then produce a sensitive table with Disease statistics

tuple ID   Age   Sex   Zipcode   Group-ID
1          23    M     11000     1
2          27    M     13000     1
3          35    M     59000     1
4          59    M     12000     1
5          61    F     54000     2
6          65    F     25000     2
7          65    F     25000     2
8          70    F     30000     2
QIT

Group-ID   Disease        Count
1          dyspepsia      2
1          pneumonia      2
2          bronchitis     1
2          flu            2
2          stomach ache   1
ST

SLIDE 8

Concept of the Anatomy Algorithm

Does it satisfy k-anonymity? l-diversity?
Query results?

tuple ID   Age   Sex   Zipcode   Group-ID
1          23    M     11000     1
2          27    M     13000     1
3          35    M     59000     1
4          59    M     12000     1
5          61    F     54000     2
6          65    F     25000     2
7          65    F     25000     2
8          70    F     30000     2
QIT

Group-ID   Disease        Count
1          dyspepsia      2
1          pneumonia      2
2          bronchitis     1
2          flu            2
2          stomach ache   1
ST

SELECT COUNT(*) FROM Microdata WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode BETWEEN 10001 AND 20000
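
One way to answer this query from the published QIT and ST is sketched below. It is an illustration rather than the lecture's own code: because the QIT keeps exact QI values, the predicate on Age and Zipcode is evaluated exactly, and each matching tuple contributes the fraction of its group that has the queried disease. The function name `estimate_count` and the data layout are hypothetical, and the ST uses dyspepsia and pneumonia for group 1, as in the microdata.

```python
from collections import Counter

# QIT rows: (age, sex, zipcode, group_id)
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]

# ST rows: (group_id, disease) -> count
st = {
    (1, "dyspepsia"): 2, (1, "pneumonia"): 2,
    (2, "bronchitis"): 1, (2, "flu"): 2, (2, "stomach ache"): 1,
}


def estimate_count(qit, st, disease, max_age, zip_lo, zip_hi):
    """Sum, over QIT tuples matching the QI predicate, the probability
    that the tuple carries the given disease (count / group size)."""
    group_size = Counter(g for *_, g in qit)
    total = 0.0
    for age, _sex, zipcode, g in qit:
        if age <= max_age and zip_lo <= zipcode <= zip_hi:
            total += st.get((g, disease), 0) / group_size[g]
    return total


estimate = estimate_count(qit, st, "pneumonia", 30, 10001, 20000)
print(estimate)
```

Tuples 1 and 2 match the QI predicate, and each has a 2/4 chance of pneumonia, so the estimate equals the true answer of 1.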

SLIDE 9

Specifications of Anatomy

T is the representation of the microdata to be published

T has d QI attributes A^qi_1, A^qi_2, ..., A^qi_d and a sensitive attribute A^s

Each A^qi_i (1 ≤ i ≤ d) is either numerical or categorical, but A^s can only be categorical because of l-diversity

For a tuple t in T, t[i] (1 ≤ i ≤ d) is the A^qi_i value of t and t[d + 1] is its A^s value

Hence t can be regarded as a point in a (d + 1)-dimensional data space, denoted DS

SLIDE 10

Specifications of Anatomy cont.

DEFINITION 1. (Partition/QI-group)
A partition consists of several subsets of T such that each tuple belongs to exactly one subset
The subsets are known as QI-groups and are denoted QI_1, QI_2, ..., QI_m

SLIDE 11

Specifications of Anatomy cont.

DEFINITION 2. (l-diverse partition)
A partition is l-diverse if every QI-group QI_j satisfies c_j(v)/|QI_j| ≤ 1/l, where
v is the most frequent sensitive value in QI_j
c_j(v) is the number of tuples in QI_j whose sensitive value is v
|QI_j| is the number of tuples in QI_j
Example: c_1(dyspepsia) = c_1(pneumonia) = 2 and c_2(flu) = 2, with |QI_1| = |QI_2| = 4, so the condition 2/4 ≤ 1/2 holds and the partition is 2-diverse
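
Definition 2 translates directly into a check. The sketch below is illustrative only; the function name `is_l_diverse` is hypothetical, and each QI-group is represented simply as a non-empty list of its sensitive values.

```python
from collections import Counter
from fractions import Fraction


def is_l_diverse(partition, l):
    """Return True if, in every QI-group, the most frequent sensitive value
    accounts for at most a 1/l fraction of the group's tuples."""
    for group in partition:
        most_frequent_count = Counter(group).most_common(1)[0][1]
        if Fraction(most_frequent_count, len(group)) > Fraction(1, l):
            return False
    return True


# Sensitive values of the two QI-groups in the running example.
partition = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "stomach pain", "flu", "bronchitis"],
]

print(is_l_diverse(partition, 2))  # 2/4 <= 1/2 in both groups
print(is_l_diverse(partition, 3))  # 2/4 > 1/3, so not 3-diverse
```

Exact fractions are used to avoid floating-point surprises at the boundary case where the ratio equals 1/l.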

SLIDE 12

Specifications of Anatomy cont.

DEFINITION 3. (Anatomy)
Given an l-diverse partition, anatomy creates a QIT table and an ST table
QIT has the schema (A^qi_1, A^qi_2, ..., A^qi_d, Group-ID)
ST has the schema (Group-ID, A^s, Count)
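
A minimal sketch of Definition 3, assuming the partition is already given as lists of tuple IDs. The function name `build_qit_st` and the record layout are hypothetical choices for illustration.

```python
from collections import Counter


def build_qit_st(records, partition):
    """records: dict tuple_id -> (qi_values, sensitive_value)
    partition: list of QI-groups, each a list of tuple ids.
    Returns (QIT rows, ST rows) following the schemas in Definition 3."""
    qit, st = [], []
    for group_id, members in enumerate(partition, start=1):
        # QIT: exact QI values plus the group id, sensitive value dropped.
        for tid in members:
            qi_values, _sensitive = records[tid]
            qit.append((*qi_values, group_id))
        # ST: per-group counts of each sensitive value.
        counts = Counter(records[tid][1] for tid in members)
        for value, count in sorted(counts.items()):
            st.append((group_id, value, count))
    return qit, st


records = {
    1: ((23, "M", 11000), "pneumonia"), 2: ((27, "M", 13000), "dyspepsia"),
    3: ((35, "M", 59000), "dyspepsia"), 4: ((59, "M", 12000), "pneumonia"),
    5: ((61, "F", 54000), "flu"),       6: ((65, "F", 25000), "stomach pain"),
    7: ((65, "F", 25000), "flu"),       8: ((70, "F", 30000), "bronchitis"),
}
qit, st = build_qit_st(records, [[1, 2, 3, 4], [5, 6, 7, 8]])
for row in st:
    print(row)
```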

SLIDE 13

Privacy properties

THEOREM 1. Given a pair of QIT and ST, an adversary can correctly infer the sensitive value of any individual with probability at most 1/l

Age   Sex   Zipcode   Group-ID   Disease       Count
23    M     11000     1          dyspepsia     2
23    M     11000     1          pneumonia     2
27    M     13000     1          dyspepsia     2
27    M     13000     1          pneumonia     2
35    M     59000     1          dyspepsia     2
35    M     59000     1          pneumonia     2
59    M     12000     1          dyspepsia     2
59    M     12000     1          pneumonia     2
61    F     54000     2          bronchitis    1
61    F     54000     2          flu           2
61    F     54000     2          stomachache   1
65    F     25000     2          bronchitis    1
65    F     25000     2          flu           2
65    F     25000     2          stomachache   1
65    F     25000     2          bronchitis    1
65    F     25000     2          flu           2
65    F     25000     2          stomachache   1
70    F     30000     2          bronchitis    1
70    F     30000     2          flu           2
70    F     30000     2          stomachache   1
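
The joined table shows what an adversary sees for each individual: every disease in the individual's group, with its count. A small sketch of the resulting inference, assuming the adversary has no other knowledge (the function name `posterior` is hypothetical):

```python
from fractions import Fraction

# ST rows: (group_id, disease, count)
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "stomachache", 1),
]


def posterior(group_id, st):
    """Probability of each disease for an individual known to be in the group:
    the disease's count divided by the group size."""
    counts = {disease: count for g, disease, count in st if g == group_id}
    size = sum(counts.values())
    return {disease: Fraction(count, size) for disease, count in counts.items()}


# Bob (23, M, 11000) joins with group 1.
bob = posterior(1, st)
print(bob)
print(max(bob.values()) <= Fraction(1, 2))  # bound of Theorem 1 with l = 2
```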

SLIDE 14

Comparison with generalization

Compare with generalization on two assumptions:

A1: the adversary has the QI-values of the target individual
A2: the adversary also knows that the individual is definitely in the microdata
If A1 and A2 are true, anatomy is as good as generalization: the 1/l bound holds
If A1 is true and A2 is false, generalization is stronger
If A1 and A2 are false, generalization is still stronger

SLIDE 15

Preserving Data Correlation

Examine the correlation between Age and Disease in T using a probability density function (pdf)

Example: t1

tuple ID     Age   Sex   Zipcode   Disease
1 (Bob)      23    M     11000     pneumonia
2            27    M     13000     dyspepsia
3            35    M     59000     dyspepsia
4            59    M     12000     pneumonia
5            61    F     54000     flu
6            65    F     25000     stomach pain
7 (Alice)    65    F     25000     flu
8            70    F     30000     bronchitis
Table 1

SLIDE 16

Preserving Data Correlation cont.

To re-construct an approximate pdf of t1 from the generalization table:

tuple ID   Age       Sex   Zipcode          Disease
1          [21,60]   M     [10001, 60000]   pneumonia
2          [21,60]   M     [10001, 60000]   dyspepsia
3          [21,60]   M     [10001, 60000]   dyspepsia
4          [21,60]   M     [10001, 60000]   pneumonia
5          [61,70]   F     [10001, 60000]   flu
6          [61,70]   F     [10001, 60000]   stomach pain
7          [61,70]   F     [10001, 60000]   flu
8          [61,70]   F     [10001, 60000]   bronchitis
Table 2

SLIDE 17

Preserving Data Correlation cont.

To re-construct an approximate pdf of t1 from the QIT and ST tables:

tuple ID   Age   Sex   Zipcode   Group-ID
1          23    M     11000     1
2          27    M     13000     1
3          35    M     59000     1
4          59    M     12000     1
5          61    F     54000     2
6          65    F     25000     2
7          65    F     25000     2
8          70    F     30000     2
QIT

Group-ID   Disease        Count
1          dyspepsia      2
1          pneumonia      2
2          bronchitis     1
2          flu            2
2          stomach ache   1
ST

SLIDE 18

Preserving Data Correlation cont.

For a more rigorous comparison, calculate the “L2 distance” between the actual pdf of t1 and each re-constructed pdf

The distance for anatomy is 0.5, while the distance for generalization is 22.5

Anatomy provides better re-constructions of the probability density functions of all tuples.
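
The anatomy figure can be checked with a short sketch, assuming the pdf of t1 is a point mass at (23, pneumonia), the anatomy re-construction splits that mass evenly between the two diseases of group 1, and the distance is the sum of squared differences (this is the form that reproduces 0.5). The function name `squared_l2` is hypothetical. The generalization figure is not recomputed here, because it depends on how the generalized Age interval is integrated, which the slide does not specify.

```python
def squared_l2(p, q):
    """Sum of squared differences between two discrete pdfs given as dicts
    mapping a point (age, disease) to its probability mass."""
    points = set(p) | set(q)
    return sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in points)


# Actual pdf of t1 (Bob): all mass at Age 23 with pneumonia.
actual = {(23, "pneumonia"): 1.0}

# Re-construction from QIT and ST: Age is exact, disease is 2/4 each.
anatomy = {(23, "pneumonia"): 0.5, (23, "dyspepsia"): 0.5}

error = squared_l2(actual, anatomy)
print(error)
```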

SLIDE 19

Preserving Data Correlation cont.

Measure the re-construction error of the pdf of each tuple t

Objective: minimize the total re-construction error (RCE) over all tuples t in T
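
A sketch of how these two quantities can be written, consistent with the “L2 distance” used on the previous slide. The symbols are assumptions introduced here: G_t is the actual pdf of tuple t, \tilde{G}_t is the pdf re-constructed from the published tables, and DS is the (d + 1)-dimensional data space.

```latex
\mathrm{Err}_t = \int_{x \in DS} \bigl( \tilde{G}_t(x) - G_t(x) \bigr)^2 \, dx
\qquad
\mathrm{RCE} = \sum_{t \in T} \mathrm{Err}_t
```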

SLIDE 20

Nearly-Optimal Anatomizing Algorithm

The authors propose an efficient algorithm for anatomizing tables that minimizes the RCE

The resulting QIT and ST achieve an RCE that deviates from the lower bound by only a factor < 1 + 1/n, where n is the size of T

The algorithm has linear I/O complexity O(n/b), where b is the page size

SLIDE 21

Nearly-Optimal Anatomizing Algorithm cont.

PROPERTY 1. At the end of the group-creation phase, each non-empty bucket has only one tuple.

PROPERTY 2. The set S' always includes at least one QI-group.

PROPERTY 3. After the residue-assignment phase, each QI-group has at least l tuples with distinct sensitive attribute values.
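
The properties refer to a two-phase, bucket-based procedure. The sketch below is one reading of it, not the authors' pseudocode: tuples are hashed into buckets by sensitive value; the group-creation phase repeatedly draws one tuple from each of the l largest buckets; the residue-assignment phase places each leftover tuple in a group that does not yet contain its sensitive value. Tie-breaking by value name and picking the first eligible group (rather than a random one) are choices made here, and the function name `anatomize_partition` is hypothetical.

```python
from collections import defaultdict


def anatomize_partition(tuples, l):
    """tuples: list of (tuple_id, sensitive_value).
    Returns a list of QI-groups, each a list of (tuple_id, sensitive_value)."""
    n = len(tuples)
    buckets = defaultdict(list)
    for tid, value in tuples:
        buckets[value].append(tid)
    # An l-diverse partition needs every value to occur at most n/l times.
    if any(len(b) * l > n for b in buckets.values()):
        raise ValueError("no l-diverse partition exists for this table")

    groups = []
    # Group-creation phase: one tuple from each of the l largest buckets.
    while sum(1 for b in buckets.values() if b) >= l:
        non_empty = [v for v in buckets if buckets[v]]
        largest = sorted(non_empty, key=lambda v: (-len(buckets[v]), v))[:l]
        groups.append([(buckets[v].pop(), v) for v in largest])

    # Residue-assignment phase: leftovers go to a group lacking their value.
    for value, bucket in buckets.items():
        for tid in bucket:
            target = next(g for g in groups if all(v != value for _, v in g))
            target.append((tid, value))
    return groups


table = [(1, "pneumonia"), (2, "dyspepsia"), (3, "dyspepsia"), (4, "pneumonia"),
         (5, "flu"), (6, "stomach pain"), (7, "flu"), (8, "bronchitis")]
groups = anatomize_partition(table, 2)
for g in groups:
    print(g)
```

On the running example every tuple is placed during group creation; a table such as a, a, b, b, c with l = 2 leaves one residue tuple, which joins an existing group.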

SLIDE 22

Experiments

Dataset: CENSUS, containing the personal information of 500k American adults, with 9 discrete attributes

Created two sets of microdata tables

Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute A^s
Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute A^s

SLIDE 23

Experiments cont.

SLIDE 27

Conclusion

Anatomy was designed to overcome the information loss of generalization while still preserving privacy

Anatomy has a significantly lower error rate than generalization

Several items require further research
Multiple sensitive attributes
Effective mining of patterns in microdata

SLIDE 28

Agenda

• Other anonymization technique
  ◦ Anatomization
• Information metrics
• Extended scenarios

SLIDE 29

Information Metrics

• General purpose metrics
• Special purpose metrics
• Trade-off metrics

SLIDE 30

General Purpose Metrics

• General idea: measure “similarity” between the original data and the anonymized data
• Minimal distortion metric (Samarati 2001; Sweeney 2002; Wang and Fung 2006)
  ◦ Charge a penalty to each instance of a value generalized or suppressed (independently of other records)
• ILoss (Xiao and Tao 2006)
  ◦ Charge a penalty when a specific value is generalized
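
A simplified sketch of the minimal distortion idea, assuming a unit penalty for every cell whose published value differs from the original. The metric as usually defined charges per level of generalization, so counting each changed cell once is a simplification made here, and the function name `minimal_distortion` is hypothetical.

```python
def minimal_distortion(original, anonymized):
    """Count the QI cells that were generalized or suppressed,
    charging one unit per changed cell (simplifying assumption)."""
    return sum(
        1
        for orig_row, anon_row in zip(original, anonymized)
        for orig_value, anon_value in zip(orig_row, anon_row)
        if orig_value != anon_value
    )


# QI columns (Age, Sex, Zipcode) of the microdata and of its generalization.
original = [
    (23, "M", 11000), (27, "M", 13000), (35, "M", 59000), (59, "M", 12000),
    (61, "F", 54000), (65, "F", 25000), (65, "F", 25000), (70, "F", 30000),
]
anonymized = (
    [("[21,60]", "M", "[10001,60000]")] * 4
    + [("[61,70]", "F", "[10001,60000]")] * 4
)

md = minimal_distortion(original, anonymized)
print(md)  # Age and Zipcode changed in all 8 rows, Sex unchanged
```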

SLIDE 31

General Purpose Metrics cont.

• Discernibility Metric (DM) (K-OPTIMIZE, Mondrian, l-diversity …)
  ◦ Charge a penalty to each record for being indistinguishable from other records
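
In its common form, DM charges each record the size of the group of records it cannot be told apart from, so the total is the sum of squared group sizes. The sketch below assumes that form and ignores the separate penalty usually applied to suppressed records; the function name `discernibility` is hypothetical.

```python
from collections import Counter


def discernibility(qi_rows):
    """Sum of squared sizes of the groups of identical QI rows."""
    return sum(size * size for size in Counter(qi_rows).values())


original = [
    (23, "M", 11000), (27, "M", 13000), (35, "M", 59000), (59, "M", 12000),
    (61, "F", 54000), (65, "F", 25000), (65, "F", 25000), (70, "F", 30000),
]
anonymized = (
    [("[21,60]", "M", "[10001,60000]")] * 4
    + [("[61,70]", "F", "[10001,60000]")] * 4
)

print(discernibility(original))    # six singletons and one pair
print(discernibility(anonymized))  # two groups of four
```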

SLIDE 32

Special Purpose Metrics

• Classification: Classification metric (CM) (Iyengar 2002)
  ◦ Charge a penalty for each record suppressed or generalized to a group in which the record’s class is not the majority class
• Query
  ◦ Query error: count queries
  ◦ Query imprecision: overlapped range
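
The classification metric can be sketched as follows. The group keys and class labels below are made-up illustration data, not from the lecture, and dividing the total penalty by the number of records is an assumption about the normalization; the function name `classification_metric` is hypothetical.

```python
from collections import Counter, defaultdict


def classification_metric(records):
    """records: list of (group_key, class_label, suppressed).
    A record is penalized if it is suppressed or if its class is not
    the majority class of its group; the result is the penalized fraction."""
    labels_by_group = defaultdict(Counter)
    for group, label, suppressed in records:
        if not suppressed:
            labels_by_group[group][label] += 1
    majority = {g: c.most_common(1)[0][0] for g, c in labels_by_group.items()}
    penalties = sum(
        1
        for group, label, suppressed in records
        if suppressed or label != majority[group]
    )
    return penalties / len(records)


# Hypothetical records: two groups, one suppressed record.
records = [
    ("A", "yes", False), ("A", "yes", False), ("A", "yes", False), ("A", "no", False),
    ("B", "no", False), ("B", "no", False), ("B", "yes", False), ("B", "no", True),
]
cm = classification_metric(records)
print(cm)
```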

SLIDE 33

Extended Scenarios

• Multiple release publishing
• Continuous release publishing
• Collaborative/distributed publishing

SLIDE 34

Other types of data

• High dimensional transaction data
  ◦ Market basket, web queries
• Moving objects data
  ◦ Location based services