Anonymization Algorithms - Other techniques, metrics, and extended - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Anonymization Algorithms - Other techniques, metrics, and extended scenarios

Li Xiong

CS573 Data Privacy and Anonymity

slide-2
SLIDE 2

So far

  • k-anonymity (protect identity disclosure)
  • Anonymization algorithms
    • Generalization and suppression
    • Microaggregation and clustering
  • Privacy principles beyond k-anonymity
    • l-diversity, t-closeness (protect attribute disclosure)
    • m-invariance (protect continuous publishing)

slide-3
SLIDE 3

Agenda

  • Other anonymization technique
    • Anatomization
  • Information metrics
  • Extended scenarios

slide-4
SLIDE 4

Anonymization methods

  • Non-perturbative: don't distort the data
    • Generalization
    • Suppression
  • Perturbative: distort the data
    • Microaggregation/clustering
    • Additive noise
  • Anatomization and permutation
    • De-associate relationship between QID and sensitive attribute

slide-5
SLIDE 5

Problems with k-anonymity and l-diversity

tuple ID   Age  Sex  Zipcode  Disease
1 (Bob)    23   M    11000    pneumonia
2          27   M    13000    dyspepsia
3          35   M    59000    dyspepsia
4          59   M    12000    pneumonia
5          61   F    54000    flu
6          65   F    25000    stomach pain
7 (Alice)  65   F    25000    flu
8          70   F    30000    bronchitis
table 1

tuple ID  Age      Sex  Zipcode         Disease
1         [21,60]  M    [10001, 60000]  pneumonia
2         [21,60]  M    [10001, 60000]  dyspepsia
3         [21,60]  M    [10001, 60000]  dyspepsia
4         [21,60]  M    [10001, 60000]  pneumonia
5         [61,70]  F    [10001, 60000]  flu
6         [61,70]  F    [10001, 60000]  stomach pain
7         [61,70]  F    [10001, 60000]  flu
8         [61,70]  F    [10001, 60000]  bronchitis
table 2

Query A: SELECT COUNT(*) FROM Microdata WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode IN [10001,20000]

slide-6
SLIDE 6

Querying generalized table

  • R1 and R2 are the anonymized QID groups
  • RQ is the query range
  • p = Area(R1 ∩ RQ)/Area(R1) = (10*10)/(50*40) = 0.05
  • Estimated answer for A: 2(0.05) = 0.1, whereas the actual answer in table 1 is 1 (tuple 1)
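The estimate above can be reproduced with a short sketch. It assumes, as the slide's area ratio does, that tuples are spread uniformly over each generalized QI rectangle; the function name `overlap_fraction` and the lower Age bound of 0 for the query are illustrative choices, not part of the slides.

```python
def overlap_fraction(group_box, query_box):
    """Fraction of a QI-group's rectangle that the query rectangle covers.

    Each box is a list of inclusive integer (low, high) ranges, one per attribute.
    """
    frac = 1.0
    for (g_lo, g_hi), (q_lo, q_hi) in zip(group_box, query_box):
        width = g_hi - g_lo + 1
        inter = min(g_hi, q_hi) - max(g_lo, q_lo) + 1
        frac *= max(inter, 0) / width
    return frac

# QI-group R1 from table 2: Age in [21,60], Zipcode in [10001,60000]
r1 = [(21, 60), (10001, 60000)]
# Query A: Age <= 30 (lower bound 0 assumed), Zipcode in [10001,20000]
query_a = [(0, 30), (10001, 20000)]

pneumonia_in_r1 = 2               # pneumonia tuples in R1 (R2 has none)
p = overlap_fraction(r1, query_a)
estimate = pneumonia_in_r1 * p
print(p, estimate)
```

The estimate is far from the true count of 1 because the generalized table hides where tuples fall inside the rectangle.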
slide-7
SLIDE 7

Concept of the Anatomy Algorithm

  • Release 2 tables: a quasi-identifier table (QIT) and a sensitive table (ST)
  • Use the same QI groups (satisfying l-diversity); replace the sensitive attribute values with a Group-ID column
  • Then produce a sensitive table with Disease statistics

tuple ID  Age  Sex  Zipcode  Group-ID
1         23   M    11000    1
2         27   M    13000    1
3         35   M    59000    1
4         59   M    12000    1
5         61   F    54000    2
6         65   F    25000    2
7         65   F    25000    2
8         70   F    30000    2
QIT

Group-ID  Disease       Count
1         dyspepsia     2
1         pneumonia     2
2         bronchitis    1
2         flu           2
2         stomach pain  1
ST

slide-8
SLIDE 8

Concept of the Anatomy Algorithm

  • Does it satisfy k-anonymity? l-diversity?
  • Query results?

tuple ID  Age  Sex  Zipcode  Group-ID
1         23   M    11000    1
2         27   M    13000    1
3         35   M    59000    1
4         59   M    12000    1
5         61   F    54000    2
6         65   F    25000    2
7         65   F    25000    2
8         70   F    30000    2
QIT

Group-ID  Disease       Count
1         dyspepsia     2
1         pneumonia     2
2         bronchitis    1
2         flu           2
2         stomach pain  1
ST

SELECT COUNT(*) FROM Microdata WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode IN [10001,20000]
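One way to answer the query from the two published tables is sketched below. It assumes each QIT tuple carries a given disease with probability Count / (group size), and the ST values are transcribed using the disease names of table 1 (dyspepsia, stomach pain). The helper name `estimate_count` is hypothetical.

```python
# QIT rows: (age, sex, zipcode, group_id); ST rows: (group_id, disease, count)
qit = [(23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1),
       (59, "M", 12000, 1), (61, "F", 54000, 2), (65, "F", 25000, 2),
       (65, "F", 25000, 2), (70, "F", 30000, 2)]
st = [(1, "dyspepsia", 2), (1, "pneumonia", 2), (2, "bronchitis", 1),
      (2, "flu", 2), (2, "stomach pain", 1)]

def estimate_count(qit, st, disease, qi_predicate):
    """Expected number of tuples matching the QI predicate and the disease."""
    group_size = {}
    for *_, gid in qit:
        group_size[gid] = group_size.get(gid, 0) + 1
    disease_count = {gid: c for gid, d, c in st if d == disease}
    total = 0.0
    for age, sex, zipcode, gid in qit:
        if qi_predicate(age, sex, zipcode):
            # exact QI values are published, so only the disease is uncertain
            total += disease_count.get(gid, 0) / group_size[gid]
    return total

answer = estimate_count(
    qit, st, "pneumonia",
    lambda age, sex, zipcode: age <= 30 and 10001 <= zipcode <= 20000)
print(answer)
```

Tuples 1 and 2 satisfy the QI conditions and each contributes 2/4, so the estimate equals the true count of 1 for this query.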

slide-9
SLIDE 9

Specifications of Anatomy

  • T is the table representing the microdata to be published
  • T has d QI attributes A^qi_1, A^qi_2, ..., A^qi_d and a sensitive attribute A^s
  • Each A^qi_i (1 ≤ i ≤ d) is either numerical or categorical, but A^s can only be categorical because of l-diversity
  • For a tuple t in T, t[i] (1 ≤ i ≤ d) is its A^qi_i value and t[d + 1] is its A^s value
  • Hence each tuple t can be regarded as a point in a (d + 1)-dimensional data space DS

slide-10
SLIDE 10

Specifications of Anatomy cont.

DEFINITION 1. (Partition/QI-group) A partition consists of several subsets of T such that each tuple belongs to exactly one subset. The subsets are known as QI-groups and are denoted QI_1, QI_2, ..., QI_m.

slide-11
SLIDE 11

Specifications of Anatomy cont.

DEFINITION 2. (l-diverse partition) A partition is l-diverse if every QI-group QI_j satisfies c_j(v)/|QI_j| ≤ 1/l, where v is the most frequent sensitive value in QI_j, c_j(v) is the number of tuples in QI_j with value v, and |QI_j| is the number of tuples in QI_j.
Example: c_1(dyspepsia) = c_1(pneumonia) = 2, c_2(flu) = 2, and |QI_1| = |QI_2| = 4, so the partition is 2-diverse: 2/4 ≤ 1/2.
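The condition in Definition 2 can be checked mechanically. The sketch below represents each QI-group only by its list of sensitive values; the function name `is_l_diverse` is illustrative.

```python
from collections import Counter

def is_l_diverse(partition, l):
    """True if, in every QI-group, the most frequent sensitive value
    accounts for at most 1/l of the group's tuples."""
    for group in partition:
        most_frequent = Counter(group).most_common(1)[0][1]
        # integer form of c_j(v) / |QI_j| <= 1/l, avoiding float rounding
        if most_frequent * l > len(group):
            return False
    return True

# Sensitive values of the two QI-groups in the running example
partition = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "stomach pain", "flu", "bronchitis"],
]
two_diverse = is_l_diverse(partition, 2)
three_diverse = is_l_diverse(partition, 3)
print(two_diverse, three_diverse)
```

The example partition passes for l = 2 but fails for l = 3, since 2/4 exceeds 1/3.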

slide-12
SLIDE 12

Specifications of Anatomy cont.

DEFINITION 3. (Anatomy) Given an l-diverse partition, anatomy creates a QIT table and an ST table.
QIT has the schema (A^qi_1, A^qi_2, ..., A^qi_d, Group-ID)
ST has the schema (Group-ID, A^s, Count)
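A minimal sketch of Definition 3 follows, assuming the partition is already given as a list of QI-groups whose members are (QI values, sensitive value) pairs. The name `anatomy_tables` and the alphabetical ordering of ST rows are choices made here for illustration.

```python
from collections import Counter

def anatomy_tables(partition):
    """Build the QIT and ST tables from a partition.

    partition: list of QI-groups, each a list of (qi_tuple, sensitive_value).
    """
    qit, st = [], []
    for gid, group in enumerate(partition, start=1):
        for qi, _ in group:
            qit.append(qi + (gid,))            # QI values plus Group-ID
        counts = Counter(s for _, s in group)
        for value, count in sorted(counts.items()):
            st.append((gid, value, count))     # (Group-ID, A^s, Count)
    return qit, st

partition = [
    [((23, "M", 11000), "pneumonia"), ((27, "M", 13000), "dyspepsia"),
     ((35, "M", 59000), "dyspepsia"), ((59, "M", 12000), "pneumonia")],
    [((61, "F", 54000), "flu"), ((65, "F", 25000), "stomach pain"),
     ((65, "F", 25000), "flu"), ((70, "F", 30000), "bronchitis")],
]
qit, st = anatomy_tables(partition)
for row in st:
    print(row)
```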

slide-13
SLIDE 13

Privacy properties

THEOREM 1. Given a pair of QIT and ST, an adversary can correctly infer the sensitive value of any individual with probability at most 1/l.

Age  Sex  Zipcode  Group-ID  Disease       Count
23   M    11000    1         dyspepsia     2
23   M    11000    1         pneumonia     2
27   M    13000    1         dyspepsia     2
27   M    13000    1         pneumonia     2
35   M    59000    1         dyspepsia     2
35   M    59000    1         pneumonia     2
59   M    12000    1         dyspepsia     2
59   M    12000    1         pneumonia     2
61   F    54000    2         bronchitis    1
61   F    54000    2         flu           2
61   F    54000    2         stomach pain  1
65   F    25000    2         bronchitis    1
65   F    25000    2         flu           2
65   F    25000    2         stomach pain  1
65   F    25000    2         bronchitis    1
65   F    25000    2         flu           2
65   F    25000    2         stomach pain  1
70   F    30000    2         bronchitis    1
70   F    30000    2         flu           2
70   F    30000    2         stomach pain  1
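The table above is the join of QIT and ST on Group-ID. As a sketch, the join can be computed and each row annotated with the inference probability Count / (group size), which is how the 1/l bound can be read off. The function name `join_with_probability` is hypothetical.

```python
from collections import Counter

qit = [(23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1),
       (59, "M", 12000, 1), (61, "F", 54000, 2), (65, "F", 25000, 2),
       (65, "F", 25000, 2), (70, "F", 30000, 2)]
st = [(1, "dyspepsia", 2), (1, "pneumonia", 2), (2, "bronchitis", 1),
      (2, "flu", 2), (2, "stomach pain", 1)]

def join_with_probability(qit, st):
    """Join QIT and ST on Group-ID; append P(disease | tuple) to each row."""
    group_size = Counter(row[-1] for row in qit)
    joined = []
    for *qi, gid in qit:
        for st_gid, disease, count in st:
            if st_gid == gid:
                joined.append((*qi, gid, disease, count, count / group_size[gid]))
    return joined

rows = join_with_probability(qit, st)
worst_case = max(row[-1] for row in rows)
print(len(rows), worst_case)
```

With l = 2, no row has probability above 1/2, which matches the statement of Theorem 1 on this example.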

slide-14
SLIDE 14

Comparison with generalization

  • Compare with generalization under two assumptions:
    A1: the adversary has the QI-values of the target individual
    A2: the adversary also knows that the individual is definitely in the microdata
  • If A1 and A2 are true, anatomy is as good as generalization (the 1/l bound holds)
  • If A1 is true and A2 is false, generalization is stronger
  • If A1 and A2 are false, generalization is still stronger

slide-15
SLIDE 15

Preserving Data Correlation

  • Examine the correlation between Age and Disease in T using a probability density function (pdf)
  • Example: t1

tuple ID   Age  Sex  Zipcode  Disease
1 (Bob)    23   M    11000    pneumonia
2          27   M    13000    dyspepsia
3          35   M    59000    dyspepsia
4          59   M    12000    pneumonia
5          61   F    54000    flu
6          65   F    25000    stomach pain
7 (Alice)  65   F    25000    flu
8          70   F    30000    bronchitis
table 1

slide-16
SLIDE 16

Preserving Data Correlation cont.

  • To re-construct an approximate pdf of t1 from the generalization table:

tuple ID  Age      Sex  Zipcode         Disease
1         [21,60]  M    [10001, 60000]  pneumonia
2         [21,60]  M    [10001, 60000]  dyspepsia
3         [21,60]  M    [10001, 60000]  dyspepsia
4         [21,60]  M    [10001, 60000]  pneumonia
5         [61,70]  F    [10001, 60000]  flu
6         [61,70]  F    [10001, 60000]  stomach pain
7         [61,70]  F    [10001, 60000]  flu
8         [61,70]  F    [10001, 60000]  bronchitis
table 2

slide-17
SLIDE 17

Preserving Data Correlation cont.

  • To re-construct an approximate pdf of t1 from the QIT and ST tables:

tuple ID  Age  Sex  Zipcode  Group-ID
1         23   M    11000    1
2         27   M    13000    1
3         35   M    59000    1
4         59   M    12000    1
5         61   F    54000    2
6         65   F    25000    2
7         65   F    25000    2
8         70   F    30000    2
QIT

Group-ID  Disease       Count
1         dyspepsia     2
1         pneumonia     2
2         bronchitis    1
2         flu           2
2         stomach pain  1
ST

slide-18
SLIDE 18

Preserving Data Correlation cont.

  • For a more rigorous comparison, calculate the "L2 distance" between each reconstructed pdf and the actual pdf of t1. The distance for anatomy is 0.5, while the distance for generalization is 22.5
  • Anatomy provides better re-constructions of the probability density functions of all tuples.
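The anatomy figure of 0.5 can be reproduced under stated assumptions: the pdf of t1 is treated as a discrete distribution over (Age, Disease) points, the anatomy reconstruction places Count / (group size) on each disease at the exact age 23, and the distance is the sum of squared differences. These modelling choices are assumptions made for this sketch. The generalization figure depends on how the Age interval is measured, which the slides do not specify, so it is not reproduced here.

```python
def squared_l2(p, q):
    """Sum of squared differences between two discrete pdfs given as dicts."""
    points = set(p) | set(q)
    return sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in points)

# Actual pdf of t1 (Bob): all mass on (Age 23, pneumonia)
actual_t1 = {(23, "pneumonia"): 1.0}

# Anatomy reconstruction: the age is exact; the disease follows the ST
# counts of group 1 (2 dyspepsia and 2 pneumonia out of 4 tuples)
anatomy_t1 = {(23, "pneumonia"): 2 / 4, (23, "dyspepsia"): 2 / 4}

anatomy_error = squared_l2(actual_t1, anatomy_t1)
print(anatomy_error)
```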

slide-19
SLIDE 19

Preserving Data Correlation cont.

  • Measure the error of each tuple's reconstructed pdf
  • Objective: obtain a minimal re-construction error (RCE) over all tuples t in T

slide-20
SLIDE 20

Nearly-Optimal Anatomizing Algorithm

  • The authors propose an efficient algorithm for anatomizing tables that minimizes the RCE
  • The resulting QIT and ST achieve an RCE that deviates from the lower bound only by a factor of less than 1 + 1/n, where n is the size of T
  • The algorithm has linear I/O complexity O(n/b), where b is the page size

slide-21
SLIDE 21

Nearly-Optimal Anatomizing Algorithm cont.

PROPERTY 1. At the end of the group-creation phase, each non-empty bucket has only one tuple.
PROPERTY 2. The set S' always includes at least one QI-group.
PROPERTY 3. After the residue-assignment phase, each QI-group has at least l tuples with distinct sensitive attribute values.
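The slides name the two phases but do not list the steps, so the following is only a sketch of a bucket-based procedure consistent with the properties above. It assumes that tuples are hashed into buckets by sensitive value, that each new QI-group takes one tuple from each of the l largest buckets, and that no sensitive value occurs in more than n/l tuples. The function name `anatomize` and the random choice of target group are illustrative.

```python
import random
from collections import defaultdict

def anatomize(tuples, l, seed=0):
    """Partition (qi_values, sensitive_value) tuples into QI-groups.

    Assumes no sensitive value occurs in more than len(tuples)/l tuples.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)              # one bucket per sensitive value
    for t in tuples:
        buckets[t[1]].append(t)

    groups = []
    # Group-creation phase: while at least l buckets are non-empty,
    # take one tuple from each of the l currently largest buckets.
    while True:
        non_empty = [v for v in buckets if buckets[v]]
        if len(non_empty) < l:
            break
        non_empty.sort(key=lambda v: len(buckets[v]), reverse=True)
        groups.append([buckets[v].pop() for v in non_empty[:l]])

    # Residue-assignment phase: put each leftover tuple into a group
    # that does not yet contain its sensitive value.
    for value, bucket in buckets.items():
        for t in bucket:
            candidates = [g for g in groups if all(s != value for _, s in g)]
            rng.choice(candidates).append(t)
    return groups

table1 = [((23, "M", 11000), "pneumonia"), ((27, "M", 13000), "dyspepsia"),
          ((35, "M", 59000), "dyspepsia"), ((59, "M", 12000), "pneumonia"),
          ((61, "F", 54000), "flu"), ((65, "F", 25000), "stomach pain"),
          ((65, "F", 25000), "flu"), ((70, "F", 30000), "bronchitis")]
groups_l2 = anatomize(table1, l=2)
groups_l3 = anatomize(table1, l=3)
print(len(groups_l2), len(groups_l3))
```

The groups produced this way need not match the two groups of four used in the running example; that partition is only one of many l-diverse partitions of table 1.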

slide-22
SLIDE 22

Experiments

  • The CENSUS dataset contains the personal information of 500k American adults, with 9 discrete attributes
  • Two sets of microdata tables were created:
    Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute A^s
    Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute A^s

slide-23
SLIDE 23

Experiments cont.

slide-24
SLIDE 24

Experiments cont.

slide-25
SLIDE 25

Experiments cont.

slide-26
SLIDE 26

Experiments cont.

slide-27
SLIDE 27

Conclusion

  • Anatomy was designed to overcome generalization's problem of losing too much information, while still preserving privacy
  • Anatomy has a significantly lower error rate than generalization
  • Several items require further research:
    • Multiple sensitive attributes
    • Effective mining of patterns in microdata
slide-28
SLIDE 28

Agenda

  • Other anonymization technique
    • Anatomization
  • Information metrics
  • Extended scenarios

slide-29
SLIDE 29

Information Metrics

  • General purpose metrics
  • Special purpose metrics
  • Trade-off metrics

slide-30
SLIDE 30

General Purpose Metrics

  • General idea: measure "similarity" between the original data and the anonymized data
  • Minimal distortion metric (Samarati 2001; Sweeney 2002; Wang and Fung 2006)
    • Charge a penalty to each instance of a value generalized or suppressed (independently of other records)
  • ILoss (Xiao and Tao 2006)
    • Charge a penalty when a specific value is generalized
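Both metrics can be sketched in a few lines. For minimal distortion, this sketch charges one unit per generalized or suppressed cell, which assumes a single generalization step per cell. For ILoss it uses the commonly cited form (|v_g| - 1) / |D_A|, where |v_g| is the number of domain values covered by the generalized value and |D_A| is the domain size. The Age domain of 50 values is a hypothetical choice for the example.

```python
def minimal_distortion(original, anonymized):
    """Count cells that differ between the original and anonymized tables
    (one penalty unit per generalized or suppressed cell)."""
    return sum(o != a
               for orig_row, anon_row in zip(original, anonymized)
               for o, a in zip(orig_row, anon_row))

def iloss(covered_values, domain_size):
    """ILoss of one generalized value: 0 if the value is left exact."""
    return (covered_values - 1) / domain_size

original = [(23, "M", 11000), (27, "M", 13000)]
anonymized = [("[21,60]", "M", "[10001,60000]"),
              ("[21,60]", "M", "[10001,60000]")]
md = minimal_distortion(original, anonymized)

# Age 23 generalized to [21,60] covers 40 ages; the domain size of 50
# (ages 21..70) is assumed for illustration only
age_loss = iloss(covered_values=40, domain_size=50)
print(md, age_loss)
```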

slide-31
SLIDE 31

General Purpose Metrics cont.

  • Discernibility Metric (DM) (K-OPTIMIZE, Mondrian, l-diversity …)
    • Charge a penalty to each record for being indistinguishable from other records
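In its usual form, DM charges each record a penalty equal to the size of its QI-group, so the total is the sum of squared group sizes. The sketch below assumes no records are suppressed; the function name `discernibility` is illustrative.

```python
def discernibility(groups):
    """Sum over QI-groups of |group|^2: every record is charged the
    number of records it cannot be distinguished from (itself included)."""
    return sum(len(group) ** 2 for group in groups)

# The two QI-groups of the generalized table (tuple IDs)
table2_groups = [[1, 2, 3, 4], [5, 6, 7, 8]]
dm = discernibility(table2_groups)

# For comparison: no generalization at all, one group per record
dm_identity = discernibility([[i] for i in range(1, 9)])
print(dm, dm_identity)
```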

slide-32
SLIDE 32

Special Purpose Metrics

  • Classification: Classification metric (CM) (Iyengar 2002)
    • Charge a penalty for each record suppressed or generalized to a group in which the record's class is not the majority class
  • Query
    • Query error: count queries
    • Query imprecision: overlapped range
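The classification metric can be sketched as follows, assuming a unit penalty per suppressed record and per record whose class label differs from its group's majority class, normalized by the total number of records. The class labels in the example are hypothetical.

```python
from collections import Counter

def classification_metric(groups, suppressed=0):
    """groups: list of QI-groups, each a list of class labels.
    suppressed: number of records removed from the release."""
    penalty = suppressed
    total = suppressed
    for labels in groups:
        majority_count = Counter(labels).most_common(1)[0][1]
        penalty += len(labels) - majority_count   # records outside the majority
        total += len(labels)
    return penalty / total

# Hypothetical class labels for two QI-groups of four records each
groups = [["yes", "yes", "no", "yes"], ["no", "no", "no", "yes"]]
cm = classification_metric(groups)
cm_with_suppression = classification_metric(groups, suppressed=2)
print(cm, cm_with_suppression)
```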

slide-33
SLIDE 33

Extended Scenarios

  • Multiple release publishing
  • Continuous release publishing
  • Collaborative/distributed publishing

slide-34
SLIDE 34

Other types of data

  • High dimensional transaction data
    • Market basket, web queries
  • Moving objects data
    • Location based services
  • Textual data