Anonymization Algorithms - Microaggregation and Clustering, Li Xiong



slide-1
SLIDE 1

Anonymization Algorithms - Microaggregation and Clustering

Li Xiong

CS573 Data Privacy and Anonymity

slide-2
SLIDE 2

Anonymization using Microaggregation or Clustering

• Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002
• Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005
• Achieving anonymity via clustering, Aggarwal, PODS 2006
• Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

slide-3
SLIDE 3

Anonymization Methods

Perturbative: distort the data
  • statistics computed on the perturbed dataset should not differ significantly from those computed on the original
  • examples: microaggregation, additive noise

Non-perturbative: don't distort the data
  • generalization: combine several categories to form a new, less specific category
  • suppression: remove the values of a few attributes in some records, or remove entire records

slide-4
SLIDE 4

Types of data

• Continuous: attribute is numeric and arithmetic operations can be performed on it
• Categorical: attribute takes values over a finite set and standard arithmetic operations don't make sense
  • Ordinal: ordered range of categories; ≤, min and max operations are meaningful
  • Nominal: unordered; only the equality comparison is meaningful

slide-5
SLIDE 5

Measure tradeoffs

k-Anonymity: a dataset satisfies k-anonymity for k > 1 if at least k records exist for each combination of quasi-identifier values that occurs in the dataset.

Assuming k-anonymity is enough protection against disclosure risk, one can concentrate on information loss measures.
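The definition can be checked mechanically by counting how often each quasi-identifier combination occurs. The following is a minimal sketch; the function name `is_k_anonymous` and the toy records are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination that occurs is shared by >= k records."""
    counts = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical toy table with generalized age and salary ranges.
toy_table = [
    {"age": "20-30", "salary": "50-100"},
    {"age": "20-30", "salary": "50-100"},
    {"age": "20-30", "salary": "50-100"},
    {"age": "30-40", "salary": "100-150"},
    {"age": "30-40", "salary": "100-150"},
]

two_anonymous = is_k_anonymous(toy_table, ["age", "salary"], 2)    # smallest group has 2 records
three_anonymous = is_k_anonymous(toy_table, ["age", "salary"], 3)  # so k = 3 fails
```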

slide-6
SLIDE 6

Critique of Generalization/Suppression

• Satisfying k-anonymity using generalization and suppression is NP-hard
• Computational cost of finding the optimal generalization
• How to determine the subset of appropriate generalizations
  • depends on the semantics of the categories and the intended use of the data
  • e.g., ZIP code: {08201, 08205} -> 0820* makes sense; {08201, 05201} -> 0*201 doesn't

slide-7
SLIDE 7

Problems cont.

• How to apply a generalization
  • globally: may generalize records that don't need it
  • locally: difficult to automate and analyze; the number of possible generalizations is even larger
• Generalization and suppression are unsuitable for continuous data
  • a numeric attribute becomes categorical and loses its numeric semantics

slide-8
SLIDE 8

Problems cont.

• How to optimally combine generalization and suppression is unknown
• Use of suppression is not homogeneous
  • suppress entire records, or only some attributes of some records
  • blank a suppressed value, or replace it with a neutral value

slide-9
SLIDE 9

Microaggregation/Clustering

• Two steps:
  • Partition the original dataset into clusters of similar records, each containing at least k records
  • For each cluster, compute an aggregation operation and use its result to replace the original records
    • e.g., mean for continuous data, median for categorical data

slide-10
SLIDE 10

Advantages:

• A unified approach, unlike the combination of generalization and suppression
• Near-optimal heuristics exist
• Doesn't generate new categories
• Suitable for continuous data without removing their numeric semantics

slide-11
SLIDE 11

Advantages cont.

• Reduces data distortion
  • K-anonymity requires an attribute to be generalized or suppressed, even if all but one tuple in the set have the same value.
  • Clustering allows a cluster center to be published instead, “enabling us to release more information.”

slide-12
SLIDE 12

Original Table

Name    Age  Salary
Amy     25   50
Brian   27   60
Carol   29   100
David   35   110
Evelyn  39   120

slide-13
SLIDE 13

2-Anonymity with Generalization

Name    Age    Salary
Amy     20-30  50-100
Brian   20-30  50-100
Carol   20-30  50-100
David   30-40  100-150
Evelyn  30-40  100-150

Generalization allows pre-specified ranges

slide-14
SLIDE 14

2-Anonymity with Clustering

Name    Age      Salary
Amy     [25-29]  [50-100]
Brian   [25-29]  [50-100]
Carol   [25-29]  [50-100]
David   [35-39]  [110-120]
Evelyn  [35-39]  [110-120]

Cluster centers published: (27, 70) and (37, 115)
  27 = (25+27+29)/3, 70 = (50+60+100)/3
  37 = (35+39)/2, 115 = (110+120)/2
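The cluster centers above are plain per-attribute means. A minimal sketch that reproduces them from the original table follows; the helper name `center` and the dictionary layout are illustrative assumptions.

```python
# (age, salary) pairs from the original table, grouped as on the slide.
clusters = {
    "cluster 1": [(25, 50), (27, 60), (29, 100)],   # Amy, Brian, Carol
    "cluster 2": [(35, 110), (39, 120)],            # David, Evelyn
}

def center(records):
    """Per-attribute arithmetic mean of a list of equal-length tuples."""
    n = len(records)
    return tuple(sum(column) / n for column in zip(*records))

centers = {name: center(rows) for name, rows in clusters.items()}
```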

slide-15
SLIDE 15

Another example: no common value among each attribute

slide-16
SLIDE 16

Generalization vs. clustering

• A generalized version of the table would need to suppress all attributes.
• A clustered version of the table would publish the cluster center as (1, 1, 1, 1) and the radius as 1.

slide-17
SLIDE 17

Anonymization using Microaggregation or Clustering

• Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002
• Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005
• Achieving anonymity via clustering, Aggarwal, PODS 2006
• Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

slide-18
SLIDE 18

Multivariate microaggregation algorithm

• MDAV-generic: generic version of the MDAV (Maximum Distance to Average Vector) algorithm from previous papers
• Works with any type of data (continuous, ordinal, nominal), aggregation operator and distance calculation

slide-19
SLIDE 19

MDAV-generic(R: dataset, k: integer)

while |R| ≥ 3k
  1. compute the average record ~x of all records in R
  2. find the most distant record xr from ~x
  3. find the most distant record xs from xr
  4. form two clusters: one from xr and the k-1 records closest to xr, the other from xs and the k-1 records closest to xs
  5. remove the two clusters from R and continue with the remaining records
end while
if 2k ≤ |R| ≤ 3k-1
  1. compute the average record ~x of the remaining records in R
  2. find the most distant record xr from ~x
  3. form a cluster from xr and the k-1 records closest to xr
  4. form another cluster containing the remaining records
else (fewer than 2k records in R)
  form a new cluster from the remaining records
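The partitioning loop can be sketched in Python for purely continuous records. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance, skips standardization, takes the k-1 nearest neighbours of the extreme record xr in the final step, and uses hypothetical helper names (`mean_vector`, `farthest`, `take_cluster`, `mdav`).

```python
import math

def mean_vector(points):
    """Average record: per-attribute mean of all points."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def farthest(points, ref):
    """Point with the largest Euclidean distance from `ref`."""
    return max(points, key=lambda p: math.dist(p, ref))

def take_cluster(remaining, seed, k):
    """Remove and return the k records of `remaining` closest to `seed` (seed included)."""
    remaining.sort(key=lambda p: math.dist(p, seed))
    cluster, remaining[:] = remaining[:k], remaining[k:]
    return cluster

def mdav(points, k):
    remaining = list(points)
    clusters = []
    while len(remaining) >= 3 * k:
        xr = farthest(remaining, mean_vector(remaining))
        xs = farthest(remaining, xr)
        clusters.append(take_cluster(remaining, xr, k))
        clusters.append(take_cluster(remaining, xs, k))
    if len(remaining) >= 2 * k:
        xr = farthest(remaining, mean_vector(remaining))
        clusters.append(take_cluster(remaining, xr, k))
    if remaining:
        clusters.append(remaining)  # leftover records form the last cluster
    return clusters

# (age, salary) records from the earlier example table, with k = 2.
table_clusters = mdav([(25, 50), (27, 60), (29, 100), (35, 110), (39, 120)], 2)
# Hypothetical one-dimensional data that exercises the while loop.
line_clusters = mdav([(0,), (1,), (2,), (10,), (11,), (20,), (21,)], 2)
```

Starting from n ≥ k records, every cluster produced this way has at least k members, because the loop removes exactly 2k records per pass and only runs while at least 3k remain.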

slide-20
SLIDE 20

MDAV-generic for continuous attributes

• Use arithmetic mean and Euclidean distance
• Standardize attributes (subtract the mean and divide by the standard deviation) to give them equal weight when computing distances
• After MDAV-generic, destandardize attributes
  • $x_{ij}$ is the value of the k-anonymized jth attribute for the ith record
  • $m_1^0(j)$ and $m_2(j)$ are the mean and variance of the k-anonymized jth attribute
  • $u_1^0(j)$ and $u_2(j)$ are the mean and variance of the original jth attribute
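The destandardization formula itself is not part of this transcript. One plausible reading, given the four moments listed, is a linear rescaling that gives the anonymized attribute the same mean and variance as the original one. The sketch below implements that assumption; the function name `destandardize` and the sample values are hypothetical.

```python
import statistics

def destandardize(anonymized_col, original_col):
    """Rescale one anonymized attribute so its mean and variance match the original's.

    Assumption: moment matching via x' = sqrt(u2 / m2) * (x - m1) + u1.
    Requires the anonymized column to have non-zero variance.
    """
    m1 = statistics.fmean(anonymized_col)       # mean of the k-anonymized attribute
    m2 = statistics.pvariance(anonymized_col)   # variance of the k-anonymized attribute
    u1 = statistics.fmean(original_col)         # mean of the original attribute
    u2 = statistics.pvariance(original_col)     # variance of the original attribute
    scale = (u2 / m2) ** 0.5
    return [scale * (x - m1) + u1 for x in anonymized_col]

original_ages = [25, 27, 29, 35, 39]
anonymized_ages = [-1.0, -1.0, -1.0, 1.5, 1.5]  # hypothetical standardized cluster means
restored_ages = destandardize(anonymized_ages, original_ages)
```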

slide-21
SLIDE 21

MDAV-generic for ordinal attributes

• The distance between two categories a and b (with a ≤ b) in an attribute $V_i$:

  $d_{ord}(a, b) = \frac{|\{i \mid a \le i < b\}|}{|D(V_i)|}$

  i.e., the number of categories separating a and b divided by the number of categories in the attribute

Nominal attributes

• The distance between two values is defined according to equality: 0 if they're equal, else 1
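Both categorical distances are short enough to write out directly. In this minimal sketch the ordered category list `EDUCATION` is a made-up example, and the function names are illustrative.

```python
def ordinal_distance(a, b, categories):
    """Number of categories i with a <= i < b, divided by the number of categories."""
    i, j = sorted((categories.index(a), categories.index(b)))
    return (j - i) / len(categories)

def nominal_distance(a, b):
    """0 if the two values are equal, else 1."""
    return 0 if a == b else 1

# Hypothetical ordinal attribute, listed from lowest to highest category.
EDUCATION = ["primary", "secondary", "bachelor", "master"]

d_far = ordinal_distance("primary", "bachelor", EDUCATION)    # two categories apart
d_same = ordinal_distance("master", "master", EDUCATION)
d_nominal = nominal_distance("USA", "India")
```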

slide-22
SLIDE 22

Empirical Results

• Continuous attributes
  • From the U.S. Current Population Survey (1995)
  • 1080 records described by 13 continuous attributes
  • Computed k-anonymity for k = 3, ..., 9 and quasi-identifiers with 6 and 13 attributes
• Categorical attributes
  • From the U.S. Housing Survey (1993)
  • Three ordinal and eight nominal attributes
  • Computed k-anonymity for k = 2, ..., 9 and quasi-identifiers with 3, 4, 8 and 11 attributes

slide-23
SLIDE 23

IL measures for continuous attributes

• IL1 = mean variation of individual attribute values between the original and k-anonymous datasets
• IL2 = mean variation of attribute means in both datasets
• IL3 = mean variation of attribute variances
• IL4 = mean variation of attribute covariances
• IL5 = mean variation of attribute Pearson's correlations
• IL6 = 100 times the average of IL1-IL5

slide-24
SLIDE 24

• MDAV-generic preserves means and variances
• The impact on the non-preserved statistics grows with the quasi-identifier length, as one would expect
• For a fixed quasi-identifier length, the impact on the non-preserved statistics grows with k

slide-25
SLIDE 25

IL measures for categorical attributes

• Dist: direct comparison of original and protected values using a categorical distance
• CTBIL': mean variation of frequencies in contingency tables for original and protected data (based on another paper by Domingo-Ferrer and Torra)
• ACTBIL': CTBIL' divided by the total number of cells in all considered tables
• EBIL: entropy-based information loss (based on another paper by Domingo-Ferrer and Torra)

slide-26
SLIDE 26

Ordinal attribute protection using median

slide-27
SLIDE 27

Ordinal attribute protection using convex median

slide-28
SLIDE 28

Anonymization using Microaggregation or Clustering

• Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002
• Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005
• Achieving anonymity via clustering, Aggarwal, PODS 2006
• Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

slide-29
SLIDE 29

r-Clustering

• Attributes from a table are first redefined as points in a metric space.
• These points are clustered, and then the cluster centers are published, rather than the original quasi-identifiers.
• r is the lower bound on the number of members in each cluster.
• r is used instead of k to denote the minimum degree of anonymity because k is typically used in clustering to denote the number of clusters.

slide-30
SLIDE 30

Data published for clusters

• Three features are published for the clustered data
  • the quasi-identifying attributes of the cluster center
  • the number of points within the cluster
  • the set of sensitive values for the cluster (which remain unchanged, as with k-anonymity)
• A measure of the quality of the clusters will also be published.

slide-31
SLIDE 31

Defining the records in metric space

• Some attributes, such as age and height, are easily mapped to a metric space.
• Others, such as ZIP code, may first need to be converted, for example to longitude and latitude.
• Some attributes may need to be scaled, such as location, which may differ by thousands of miles.
• Some attributes, such as race or nationality, may not convert to points in a metric space easily.

slide-32
SLIDE 32

How to measure the quality of the cluster

• Quality measures how much the clustering distorts the original data.
• Maximum radius (r-GATHER problem)
  • Maximum radius over all clusters
• Cellular cost (r-CELLULAR CLUSTERING problem)
  • Each cluster incurs a “facility cost” to set up the cluster center.
  • Each cluster incurs a “service cost” equal to the radius times the number of points in the cluster.
  • The cellular cost is the sum of the facility and service costs over all clusters.

slide-33
SLIDE 33

[Figure: points arranged in three clusters: 25 points with radius 10, 14 points with radius 8, and 17 points with radius 7]

slide-34
SLIDE 34

Cluster quality measurements

• Maximum radius = 10
• Facility cost plus service cost:
  • Facility cost = f(c)
  • Service cost = (17 x 7) + (14 x 8) + (25 x 10) = 481
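Both quality measures can be recomputed from the three clusters in the figure. In this minimal sketch the dictionary layout and function names are illustrative, and the facility cost is passed in as a function because the slides only give it as f(c).

```python
# The three clusters from the figure: number of points and radius.
figure_clusters = [
    {"points": 25, "radius": 10},
    {"points": 14, "radius": 8},
    {"points": 17, "radius": 7},
]

def max_radius(clusters):
    """Objective of the r-GATHER problem."""
    return max(c["radius"] for c in clusters)

def service_cost(clusters):
    """Sum over clusters of radius times number of points."""
    return sum(c["points"] * c["radius"] for c in clusters)

def cellular_cost(clusters, facility_cost):
    """Facility cost of every cluster plus the total service cost."""
    return sum(facility_cost(c) for c in clusters) + service_cost(clusters)

radius = max_radius(figure_clusters)
service = service_cost(figure_clusters)
# Hypothetical flat facility cost of 5 per cluster, chosen only for illustration.
total = cellular_cost(figure_clusters, lambda c: 5)
```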

slide-35
SLIDE 35

r-GATHER problem

• “The r-Gather problem is to cluster n points in a metric space into a set of clusters, such that each cluster has at least r points. The objective is to minimize the maximum radius among the clusters.”

slide-36
SLIDE 36

“Outlier” points

• r-GATHER and r-CELLULAR CLUSTERING, like k-anonymity, are sensitive to outlier points (i.e., points which are far removed from the rest of the data).
• The clustering solutions in this paper are generalized to allow an e fraction of outliers to be removed from the data; that is, an e fraction of the tuples can be suppressed.

slide-37
SLIDE 37

(r,e)-GATHER Clustering

• The (r,e)-GATHER clustering formulation of the problem allows an e fraction of the points (the outliers) to remain unclustered (i.e., these tuples are suppressed).
• The paper gives a polynomial-time algorithm that provides a 4-approximation for the (r,e)-GATHER problem.

slide-38
SLIDE 38

r-CELLULAR CLUSTERING defined

• The CELLULAR CLUSTERING problem is to arrange n points into clusters such that each cluster has at least r points and the total cellular cost is minimized.

slide-39
SLIDE 39

(r,e)-CELLULAR CLUSTERING

• There is also an (r,e)-CELLULAR CLUSTERING problem in which an e fraction of the points can be excluded.
• The details of the constant-factor approximation for this problem are deferred to the full version of the paper.

slide-40
SLIDE 40

Anonymization using Microaggregation or Clustering

• Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002
• Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005
• Achieving anonymity via clustering, Aggarwal, PODS 2006
• Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

slide-41
SLIDE 41

Anonymization And Clustering

• k-Member Clustering Problem
  • From a given set of n records, find a set of clusters such that
    • each cluster contains at least k records, and
    • the total intra-cluster distance is minimized.
  • The problem is NP-complete

slide-42
SLIDE 42

Distance Metrics

• Distance metric for records
  • Measures the dissimilarity between two data points
  • Sum of all dissimilarities between corresponding attributes
    • Numerical values
    • Categorical values
slide-43
SLIDE 43

Distance between two numerical values

• Definition

Let D be a finite numeric domain. Then the normalized distance between two values $v_i, v_j \in D$ is defined as:

$\delta_N(v_i, v_j) = \frac{|v_i - v_j|}{|D|}$

where |D| is the domain size measured by the difference between the maximum and minimum values in D.

     Age  Country  Occupation    Salary  Diagnosis
r1   41   USA      Armed-Forces  ≥50K    Cancer
r2   57   India    Tech-support  <50K    Flu
r3   40   Canada   Teacher       <50K    Obesity
r4   38   Iran     Tech-support  ≥50K    Flu
r5   24   Brazil   Doctor        ≥50K    Cancer
r6   45   Greece   Salesman      <50K    Fever

Example 1: The distance between r1 and r2 with respect to the Age attribute is |57-41|/|57-24| = 16/33 = 0.4848...
Example 2: The distance between r5 and r6 with respect to the Age attribute is |24-45|/|57-24| = 21/33 = 0.6363...
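The two Age examples can be reproduced with a one-line function. This is a minimal sketch; as in the examples, the domain size is taken from the smallest and largest Age in the table, and the function name is illustrative.

```python
def numeric_distance(v1, v2, domain_min, domain_max):
    """Absolute difference normalized by the domain size (max - min)."""
    return abs(v1 - v2) / (domain_max - domain_min)

# Age column of the example table.
ages = {"r1": 41, "r2": 57, "r3": 40, "r4": 38, "r5": 24, "r6": 45}
lo, hi = min(ages.values()), max(ages.values())   # 24 and 57, so |D| = 33

d_r1_r2 = numeric_distance(ages["r1"], ages["r2"], lo, hi)   # 16/33
d_r5_r6 = numeric_distance(ages["r5"], ages["r6"], lo, hi)   # 21/33
```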

slide-44
SLIDE 44

Distance between two categorical values

• Values are equally different from each other:
  • 0 if they are the same
  • 1 if they are different
• Relationships between values can be easily captured in a taxonomy tree.

[Figures: taxonomy tree of Country; taxonomy tree of Occupation]

slide-45
SLIDE 45

Distance between two categorical values

• Definition

Let D be a categorical domain and $T_D$ be a taxonomy tree defined for D. The normalized distance between two values $v_i, v_j \in D$ is defined as:

$\delta_C(v_i, v_j) = \frac{H(\Lambda(v_i, v_j))}{H(T_D)}$

where $\Lambda(x, y)$ is the subtree rooted at the lowest common ancestor of x and y, and H(T) represents the height of tree T.

[Figure: taxonomy tree of Country]

Example: The distance between India and USA is 3/3 = 1. The distance between India and Iran is 2/3 ≈ 0.67.
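The taxonomy-based distance (height of the subtree rooted at the lowest common ancestor, divided by the height of the whole tree) can be sketched as follows. The Country tree on the slide is a figure that is not reproduced in this transcript, so the tree below is a hypothetical one, built only to be consistent with the two worked examples; all node and function names are assumptions.

```python
# Hypothetical taxonomy, stored as child -> parent. Not the tree from the slide.
PARENT = {
    "America": "World", "Asia": "World",
    "North America": "America", "South America": "America",
    "West Asia": "Asia", "South Asia": "Asia",
    "USA": "North America", "Canada": "North America",
    "Brazil": "South America",
    "Iran": "West Asia", "India": "South Asia",
}

def ancestors(node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def height(node):
    """Height of the subtree rooted at `node` (a leaf has height 0)."""
    children = [c for c, p in PARENT.items() if p == node]
    return 0 if not children else 1 + max(height(c) for c in children)

def taxonomy_distance(a, b, root="World"):
    """Height of the lowest-common-ancestor subtree over the height of the tree."""
    on_path_b = set(ancestors(b))
    lca = next(n for n in ancestors(a) if n in on_path_b)
    return height(lca) / height(root)

d_india_usa = taxonomy_distance("India", "USA")     # common ancestor is the root
d_india_iran = taxonomy_distance("India", "Iran")   # common ancestor is "Asia"
```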

slide-46
SLIDE 46

Distance between two records

• Definition

Let $Q_T = \{N_1, \dots, N_m, C_1, \dots, C_n\}$ be the quasi-identifier of table T, where $N_i$ (i = 1, ..., m) is an attribute with a numeric domain and $C_j$ (j = 1, ..., n) is an attribute with a categorical domain. The distance between two records $r_1, r_2 \in T$ is defined as:

$\Delta(r_1, r_2) = \sum_{i=1}^{m} \delta_N(r_1[N_i], r_2[N_i]) + \sum_{j=1}^{n} \delta_C(r_1[C_j], r_2[C_j])$

where $\delta_N$ is the distance function for numeric attributes and $\delta_C$ is the distance function for categorical attributes.

slide-47
SLIDE 47

Distance between two records, continued

[Figures: taxonomy tree of Country; taxonomy tree of Occupation]

     Age  Country  Occupation    Salary  Diagnosis
r1   41   USA      Armed-Forces  ≥50K    Cancer
r2   57   India    Tech-support  <50K    Flu
r3   40   Canada   Teacher       <50K    Obesity
r4   38   Iran     Tech-support  ≥50K    Flu
r5   24   Brazil   Doctor        ≥50K    Cancer
r6   45   Greece   Salesman      <50K    Fever

Example: The distance between r1 and r2 is (16/33) + (3/3) + 1 = 2.485. The distance between r1 and r3 is (1/33) + (1/3) + 1 = 1.364.
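The record distance is the sum of the per-attribute distances over the quasi-identifier (Age, Country, Occupation in the example). The sketch below is a minimal illustration: because the taxonomy trees are figures that are not available here, the categorical distances are supplied as small lookup tables whose values are read off the worked example. The helper names `lookup_distance` and `record_distance` are assumptions.

```python
def lookup_distance(pairs):
    """Build a symmetric categorical distance from {(a, b): distance}."""
    table = {frozenset(p): d for p, d in pairs.items()}
    def dist(a, b):
        return 0.0 if a == b else table[frozenset((a, b))]
    return dist

def record_distance(r1, r2, numeric, categorical):
    """numeric: {attr: (domain_min, domain_max)}; categorical: {attr: distance function}."""
    total = 0.0
    for attr, (lo, hi) in numeric.items():
        total += abs(r1[attr] - r2[attr]) / (hi - lo)
    for attr, dist in categorical.items():
        total += dist(r1[attr], r2[attr])
    return total

r1 = {"age": 41, "country": "USA", "occupation": "Armed-Forces"}
r2 = {"age": 57, "country": "India", "occupation": "Tech-support"}
r3 = {"age": 40, "country": "Canada", "occupation": "Teacher"}

numeric = {"age": (24, 57)}
categorical = {
    # Distances taken from the worked example: 3/3 and 1/3 for Country, 1 for Occupation.
    "country": lookup_distance({("USA", "India"): 1.0, ("USA", "Canada"): 1 / 3}),
    "occupation": lookup_distance({("Armed-Forces", "Tech-support"): 1.0,
                                   ("Armed-Forces", "Teacher"): 1.0}),
}

d12 = record_distance(r1, r2, numeric, categorical)
d13 = record_distance(r1, r3, numeric, categorical)
```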

slide-48
SLIDE 48

Cost Function - Information loss (IL)

• The amount of distortion (i.e., information loss) caused by the generalization process.
  Note: Records in each cluster are generalized to share the same quasi-identifier value, which represents every original quasi-identifier value in the cluster.
• Definition: Let e = {r1, . . . , rk} be a cluster (i.e., equivalence class). Then the amount of information loss in e, denoted by IL(e), is defined as:

$IL(e) = |e| \cdot D(e), \qquad D(e) = \sum_{i=1}^{m} \frac{MAX_{N_i} - MIN_{N_i}}{|N_i|} + \sum_{j=1}^{n} \frac{H(\Lambda(\cup_{C_j}))}{H(T_{C_j})}$

where |e| is the number of records in e, $|N_i|$ represents the size of numeric domain $N_i$, $MAX_{N_i}$ and $MIN_{N_i}$ are the largest and smallest values of $N_i$ in e, $\Lambda(\cup_{C_j})$ is the subtree rooted at the lowest common ancestor of every value of $C_j$ in e, and H(T) is the height of tree T.
slide-49
SLIDE 49

Cost Function - Information loss (IL)

[Figure: taxonomy tree of Country]

     Age  Country  Occupation    Salary  Diagnosis
r1   41   USA      Armed-Forces  ≥50K    Cancer
r2   57   India    Tech-support  <50K    Flu
r3   40   Canada   Teacher       <50K    Obesity
r4   38   Iran     Tech-support  ≥50K    Flu
r5   24   Brazil   Doctor        ≥50K    Cancer
r6   45   Greece   Salesman      <50K    Fever

Example

Cluster e1

Age  Country  Occupation    Salary  Diagnosis
41   USA      Armed-Forces  ≥50K    Cancer
40   Canada   Teacher       <50K    Obesity
24   Brazil   Doctor        ≥50K    Cancer

D(e1) = (41-24)/33 + (2/3) + 1 = 2.1818...
IL(e1) = 3 • D(e1) = 3 • 2.1818... = 6.5454...

Cluster e2

Age  Country  Occupation    Salary  Diagnosis
41   USA      Armed-Forces  ≥50K    Cancer
57   India    Tech-support  <50K    Flu
24   Brazil   Doctor        ≥50K    Cancer

D(e2) = (57-24)/33 + (3/3) + 1 = 3
IL(e2) = 3 • D(e2) = 3 • 3 = 9
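The two IL values can be recomputed from the cluster contents. In this minimal sketch the numeric term is computed from the Age range of each cluster, while the categorical terms (subtree height over tree height) are passed in as the values stated in the example, since the taxonomy trees are not available in this transcript. The function name and argument layout are assumptions.

```python
def information_loss(cluster, numeric, categorical_terms):
    """IL(e) = |e| * D(e).

    numeric: {attr: (domain_min, domain_max)}
    categorical_terms: {attr: H(common-ancestor subtree) / H(tree)} for this cluster
    """
    d = sum(
        (max(r[a] for r in cluster) - min(r[a] for r in cluster)) / (hi - lo)
        for a, (lo, hi) in numeric.items()
    )
    d += sum(categorical_terms.values())
    return len(cluster) * d

numeric = {"age": (24, 57)}
e1 = [{"age": 41}, {"age": 40}, {"age": 24}]   # r1, r3, r5
e2 = [{"age": 41}, {"age": 57}, {"age": 24}]   # r1, r2, r5

# Categorical terms as given in the example: Country 2/3 and 3/3, Occupation 1.
il_e1 = information_loss(e1, numeric, {"country": 2 / 3, "occupation": 1.0})
il_e2 = information_loss(e2, numeric, {"country": 1.0, "occupation": 1.0})
```

Under this measure e1 is the better cluster: it keeps the three records within a narrower Age range and a lower node of the Country taxonomy.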

slide-50
SLIDE 50

Greedy k-member clustering algorithm

slide-51
SLIDE 51

Diversity Metrics

The Equal Diversity metric (ED)
• assumes all sensitive attribute values are equally sensitive
• φ(e, s) = 1 if every record in e has the same s value; φ(e, s) = 0 otherwise
• Modification to the greedy algorithm: [formula not included in this transcript]

The Sensitive Diversity metric (SD)
• assumes there are two types of values in a sensitive attribute:
  • truly-sensitive
  • not-so-sensitive
• ψ(e, s) = 1 if every record in e has the same s value and that value is truly-sensitive; ψ(e, s) = 0 otherwise
• Modification to the greedy algorithm: [formula not included in this transcript]
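The two indicator functions can be written directly from their definitions. This minimal sketch covers only φ and ψ, not the modified greedy criterion, whose formula is not part of this transcript. The set of truly-sensitive values is a hypothetical choice made for the example.

```python
def phi(cluster, s):
    """1 if every record in the cluster has the same value for sensitive attribute s, else 0."""
    values = {r[s] for r in cluster}
    return 1 if len(values) == 1 else 0

def psi(cluster, s, truly_sensitive):
    """1 if every record shares one value of s and that value is truly sensitive, else 0."""
    values = {r[s] for r in cluster}
    return 1 if len(values) == 1 and values <= truly_sensitive else 0

TRULY_SENSITIVE = {"Cancer"}   # hypothetical classification of diagnoses

all_cancer = [{"diagnosis": "Cancer"}, {"diagnosis": "Cancer"}]
all_flu = [{"diagnosis": "Flu"}, {"diagnosis": "Flu"}]
mixed = [{"diagnosis": "Cancer"}, {"diagnosis": "Flu"}]
```

For example, `phi(all_flu, "diagnosis")` is 1 while `psi(all_flu, "diagnosis", TRULY_SENSITIVE)` is 0, because a cluster that uniformly reveals a not-so-sensitive value is only penalized by the ED metric.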

slide-52
SLIDE 52

Classification metric (CM)

• Preserves the correlation between the quasi-identifier and class labels (non-sensitive values)

$CM = \frac{\sum_{r} Penalty(r)}{N}$

where N is the total number of records, and Penalty(row r) = 1 if r is suppressed or the class label of r is different from the class label of the majority in the equivalence group, and 0 otherwise.

Modification to the greedy algorithm: [formula not included in this transcript]
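The penalty rule can be turned into a short computation, assuming CM is the total penalty divided by N, as the phrase "N is the total number of records" suggests. This is a minimal sketch; the function name, the input layout (one list of class labels per equivalence group) and the sample labels are illustrative assumptions.

```python
from collections import Counter

def classification_metric(groups, suppressed=0):
    """Fraction of penalized records.

    groups: list of lists of class labels, one list per equivalence group.
    suppressed: number of suppressed records, each of which is penalized.
    """
    penalty = suppressed
    total = suppressed
    for labels in groups:
        majority_count = Counter(labels).most_common(1)[0][1]
        penalty += len(labels) - majority_count   # records outside the majority class
        total += len(labels)
    return penalty / total

# Hypothetical class labels for two equivalence groups of three records each.
example_groups = [[">=50K", "<50K", ">=50K"], ["<50K", "<50K", ">=50K"]]

cm_plain = classification_metric(example_groups)                  # 2 penalized of 6
cm_suppressed = classification_metric(example_groups, suppressed=2)   # 4 penalized of 8
```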

slide-53
SLIDE 53

Experimental Results

• Experimental Setup
  • Data: Adult dataset from the UC Irvine Machine Learning Repository
    • 10 attributes (2 numeric, 7 categorical, 1 class)
  • Compared with 2 other algorithms
    • Median partitioning (Mondrian algorithm)
    • k-Nearest neighbor

slide-54
SLIDE 54

Experimental Results

slide-55
SLIDE 55

Conclusion

• Transforming the k-anonymity problem into the k-member clustering problem
• Overall, the greedy algorithm produced better results than the other algorithms, at the cost of efficiency