SLIDE 1 Anonymization Algorithms - Microaggregation and Clustering
Li Xiong
CS573 Data Privacy and Anonymity
SLIDE 2
Anonymization using Microaggregation or Clustering
Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002
Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005
Achieving anonymity via clustering, Aggarwal, PODS 2006
Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007
SLIDE 3 Anonymization Methods
Perturbative: distort the data
statistics computed on the perturbed dataset should not differ
significantly from the original
microaggregation, additive noise
Non-perturbative: don't distort the data
generalization: combine several categories to form new less
specific category
suppression: remove values of a few attributes in some
records, or entire records
SLIDE 4 Types of data
Continuous: attribute is numeric and arithmetic operations can be performed on it
Categorical: attribute takes values over a finite set and
standard arithmetic operations don't make sense
Ordinal: ordered range of categories
≤, min and max operations are meaningful
Nominal: unordered
only equality comparison operation is meaningful
SLIDE 5
Measure tradeoffs
k-Anonymity: a dataset satisfies k-anonymity for k > 1 if at least k records exist for each combination of quasi-identifier values
Assuming k-anonymity is enough protection against disclosure risk, one can concentrate on information loss measures
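A minimal sketch of the k-anonymity condition stated above. The table, the attribute names (`age`, `zip`, `diagnosis`) and the function name `is_k_anonymous` are hypothetical and chosen only for illustration.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k records."""
    combos = Counter(tuple(rec[attr] for attr in quasi_identifiers) for rec in records)
    return all(count >= k for count in combos.values())

# Hypothetical toy table: "age" and "zip" are quasi-identifiers, "diagnosis" is sensitive.
table = [
    {"age": "20-30", "zip": "0820*", "diagnosis": "Flu"},
    {"age": "20-30", "zip": "0820*", "diagnosis": "Cancer"},
    {"age": "30-40", "zip": "0820*", "diagnosis": "Flu"},
    {"age": "30-40", "zip": "0820*", "diagnosis": "Obesity"},
    {"age": "30-40", "zip": "0820*", "diagnosis": "Fever"},
]

print(is_k_anonymous(table, ["age", "zip"], 2))  # True: group sizes are 2 and 3
print(is_k_anonymous(table, ["age", "zip"], 3))  # False: one group has only 2 records
```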
SLIDE 6 Critique of Generalization/Suppression
Satisfying k-anonymity using generalization and suppression is NP-hard
Computational cost of finding the optimal generalization
How to determine the subset of appropriate generalizations
semantics of categories and intended use of data
e.g., ZIP code:
{08201, 08205} -> 0820* makes sense
{08201, 05201} -> 0*201 doesn't
SLIDE 7 Problems cont.
How to apply a generalization
globally
may generalize records that don't need it
locally
difficult to automate and analyze
number of generalizations is even larger
Generalization and suppression on
continuous data are unsuitable
a numeric attribute becomes categorical and loses its numeric semantics
SLIDE 8 Problems cont.
How to optimally combine generalization and
suppression is unknown
Use of suppression is not homogeneous
suppress entire records or only some attributes of some records
blank a suppressed value or replace it with a neutral value
SLIDE 9 Microaggregation/Clustering
Two steps:
Partition original dataset into clusters of similar
records containing at least k records
For each cluster, compute an aggregation operation and use it to replace the original records
e.g., mean for continuous data, median for
categorical data
SLIDE 10 Advantages:
a unified approach, unlike combination of generalization
and suppression
Near-optimal heuristics exist
Doesn't generate new categories
Suitable for continuous data without removing their numeric semantics
SLIDE 11 Advantages cont.
Reduces data distortion
K-anonymity requires an attribute to be
generalized or suppressed, even if all but one tuple in the set have the same value.
Clustering allows a cluster center to be
published instead, “enabling us to release more information.”
SLIDE 12
Original Table
Name | Age | Salary
Amy | 25 | 50
Brian | 27 | 60
Carol | 29 | 100
David | 35 | 110
Evelyn | 39 | 120
SLIDE 13
2-Anonymity with Generalization
Name | Age | Salary
Amy | 20-30 | 50-100
Brian | 20-30 | 50-100
Carol | 20-30 | 50-100
David | 30-40 | 100-150
Evelyn | 30-40 | 100-150
Generalization allows pre-specified ranges
SLIDE 14
2-Anonymity with Clustering
Name | Age | Salary
Amy | [25-29] | [50-100]
Brian | [25-29] | [50-100]
Carol | [25-29] | [50-100]
David | [35-39] | [110-120]
Evelyn | [35-39] | [110-120]
Cluster centers ([27,70], [37,115]) published
27 = (25+27+29)/3
70 = (50+60+100)/3
37 = (35+39)/2
115 = (110+120)/2
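The cluster centers on this slide are component-wise means. A short sketch, assuming the two clusters are {Amy, Brian, Carol} and {David, Evelyn} as in the table; the helper name `centroid` is illustrative.

```python
def centroid(cluster):
    """Component-wise arithmetic mean of a list of equal-length numeric tuples."""
    n = len(cluster)
    return tuple(sum(column) / n for column in zip(*cluster))

# (Age, Salary) pairs from the original table, grouped as on the slide.
clusters = [
    [(25, 50), (27, 60), (29, 100)],  # Amy, Brian, Carol
    [(35, 110), (39, 120)],           # David, Evelyn
]

centers = [centroid(c) for c in clusters]
print(centers)  # [(27.0, 70.0), (37.0, 115.0)]
```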
SLIDE 15
Another example: no common value among each attribute
SLIDE 16
Generalization vs. clustering
Generalized version of the table would need to
suppress all attributes.
Clustered Version of the table would publish the
cluster center as (1, 1, 1, 1), and the radius as 1.
SLIDE 18 Multivariate microaggregation algorithm
MDAV-generic: Generic version of MDAV algorithm
(Maximum Distance to Average Vector) from previous papers
Works with any type of data (continuous, ordinal,
nominal), aggregation operator and distance calculation
SLIDE 19 MDAV-generic(R: dataset, k: integer)
while |R| ≥ 3k
  1. compute average record ~x of all records in R
  2. find most distant record xr from ~x
  3. find most distant record xs from xr
  4. form two clusters: one from xr and the k-1 records closest to xr, the other from xs and the k-1 records closest to xs
  5. remove the two clusters from R and continue with the remaining dataset
end while
if 2k ≤ |R| ≤ 3k-1
  1. compute average record ~x of remaining records in R
  2. find the most distant record xr from ~x
  3. form a cluster from xr and the k-1 records closest to xr
  4. form another cluster containing the remaining records
else (fewer than 2k records in R)
  form a new cluster from the remaining records
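A runnable sketch of the procedure for continuous data, assuming records are numeric tuples, Euclidean distance, and that each cluster is seeded by the extreme record together with its k-1 nearest neighbours. The function names are illustrative, not taken from the paper.

```python
import math

def mean_record(records):
    """Average record: component-wise mean."""
    n = len(records)
    return tuple(sum(col) / n for col in zip(*records))

def farthest(records, point):
    """Record at maximum Euclidean distance from point."""
    return max(records, key=lambda r: math.dist(r, point))

def take_cluster(records, seed, k):
    """Remove and return the k records closest to seed (the seed itself included)."""
    cluster = sorted(records, key=lambda r: math.dist(r, seed))[:k]
    for r in cluster:
        records.remove(r)
    return cluster

def mdav(records, k):
    remaining = list(records)
    clusters = []
    while len(remaining) >= 3 * k:
        xr = farthest(remaining, mean_record(remaining))
        xs = farthest(remaining, xr)
        clusters.append(take_cluster(remaining, xr, k))
        clusters.append(take_cluster(remaining, xs, k))
    if len(remaining) >= 2 * k:
        xr = farthest(remaining, mean_record(remaining))
        clusters.append(take_cluster(remaining, xr, k))
    if remaining:
        clusters.append(remaining)  # leftover records form the last cluster
    return clusters

# Hypothetical 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (5, 5)]
result = mdav(points, 2)
print([len(c) for c in result])
```

Every cluster has at least k records as long as the input has at least k records, because each pass of the loop removes exactly 2k records from a set of at least 3k.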
SLIDE 20 MDAV-generic for continuous attributes
use arithmetic mean and Euclidean distance
standardize attributes (subtract mean and divide by standard deviation) to give them equal weight for computing distances
After MDAV-generic, destandardize attributes
$x_{ij}$ is the value of the k-anonymized jth attribute for the ith record
$m_1^0(j)$ and $m_2(j)$ are the mean and variance of the k-anonymized jth attribute
$u_1^0(j)$ and $u_2(j)$ are the mean and variance of the original jth attribute
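One way to read the destandardization step is as an affine rescaling that gives the k-anonymized attribute the mean and variance of the original attribute. The sketch below assumes exactly that moment-matching map, which may differ in detail from the formula on the slide. It uses population variance, and the name `destandardize` is illustrative.

```python
import math
from statistics import fmean, pvariance

def destandardize(anonymized, original):
    """Rescale anonymized values so their mean and variance match the original attribute."""
    m1, m2 = fmean(anonymized), pvariance(anonymized)  # moments of the k-anonymized attribute
    u1, u2 = fmean(original), pvariance(original)      # moments of the original attribute
    scale = math.sqrt(u2 / m2)                         # assumes m2 > 0
    return [scale * (x - m1) + u1 for x in anonymized]

# Hypothetical values: original ages and standardized cluster means after microaggregation.
original_age = [25, 27, 29, 35, 39]
anonymized_age = [-1.0, -1.0, -1.0, 1.5, 1.5]

restored = destandardize(anonymized_age, original_age)
print(fmean(restored), pvariance(restored))
```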
SLIDE 21 MDAV-generic for ordinal attributes
The distance between two categories a and b in an
attribute Vi:
$d_{ord}(a,b) = |\{i \mid a \le i < b\}| / |D(V_i)|$
i.e., the number of categories separating a and b divided
by the number of categories in the attribute
Nominal attributes
The distance between two values is defined according to
equality: 0 if they're equal, else 1
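The two categorical distances can be sketched as follows, assuming an ordinal attribute is given as a list of its categories in order. The category names are hypothetical.

```python
def d_ord(a, b, categories):
    """Number of categories from the smaller of a, b up to (excluding) the larger, over |D|."""
    i, j = sorted((categories.index(a), categories.index(b)))
    return (j - i) / len(categories)

def d_nom(a, b):
    """Nominal distance: 0 if equal, else 1."""
    return 0 if a == b else 1

levels = ["low", "medium", "high", "very high"]  # hypothetical ordinal domain
print(d_ord("low", "high", levels))    # 0.5
print(d_nom("USA", "India"))           # 1
```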
SLIDE 22 Empirical Results
Continuous attributes
From the U.S. Current Population Survey (1995)
1080 records described by 13 continuous attributes
Computed k-anonymity for k = 3, ..., 9 and quasi-identifiers with 6 and 13 attributes
Categorical attributes
From the U.S. Housing Survey (1993)
Three ordinal and eight nominal attributes
Computed k-anonymity for k = 2, ..., 9 and quasi-identifiers with 3, 4, 8 and 11 attributes
SLIDE 23 IL measures for continuous attributes
IL1 = mean variation of individual attributes in original
and k-anonymous datasets
IL2 = mean variation of attribute means in both datasets
IL3 = mean variation of attribute variances
IL4 = mean variation of attribute covariances
IL5 = mean variation of attribute Pearson's correlations
IL6 = 100 times the average of IL1-IL5
SLIDE 24
MDAV-generic preserves means and variances
The impact on the non-preserved statistics grows with the quasi-identifier length, as one would expect
For a fixed quasi-identifier length, the impact on the non-preserved statistics grows with k
SLIDE 25 IL measures for categorical attributes
Dist: direct comparison of original and protected values
using a categorical distance
CTBIL': mean variation of frequencies in contingency
tables for original and protected data (based on another paper by Domingo-Ferrer and Torra)
ACTBIL': CTBIL' divided by the total number of cells
in all considered tables
EBIL: Entropy-based information loss (based on another
paper by Domingo-Ferrer and Torra)
SLIDE 26
Ordinal attribute protection using median
SLIDE 27
Ordinal attribute protection using convex median
SLIDE 29
r-Clustering
Attributes from a table are first redefined as
points in metric space.
These points are clustered, and then the cluster
centers are published, rather than the original quasi-identifiers.
r is the lower bound on the number of members
in each cluster.
r is used instead of k to denote the minimum
degree of anonymity because k is typically used in clustering to denote the number of clusters.
SLIDE 30 Data published for clusters
Three features are published for the clustered
data
the quasi-identifying attributes of the cluster center
the number of points within the cluster
the set of sensitive values for the cluster (which remain unchanged, as with k-anonymity)
A measure of the quality of the clusters will
also be published.
SLIDE 31
Defining the records in metric space
Some attributes, such as age and height, are
easily mapped to metric space.
Others, such as zip may first need to be
converted, for example to longitude and latitude.
Some attributes may need to be scaled, such as
location, which may differ by thousands of miles.
Some attributes such as race or nationality may
not convert to points in metric space easily.
SLIDE 32 How to measure the quality of the cluster
Measures how much it distorts the original data.
Maximum radius (r-GATHER problem)
Maximum radius of all clusters
Cellular cost (r-CELLULAR CLUSTERING problem)
Each cluster incurs a “facility cost” to set up the
cluster center.
Each cluster incurs a “service cost” which is equal to
the radius times the number of points in the cluster
Sum of the facility and services costs for each of the
clusters.
SLIDE 33 Points arranged in clusters
Three clusters: 25 points, radius 10; 14 points, radius 8; 17 points, radius 7
SLIDE 34 Cluster quality measurements
Maximum radius = 10
Facility cost plus service cost:
Facility cost = f(c)
Service cost = (17 x 7) + (14 x 8) + (25 x 10) = 481
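The two quality measures for the three clusters on the previous slide can be computed as below. The facility cost f(c) is not given a value on the slide, so the sketch takes it as a parameter and assumes, purely for illustration, the same cost for every cluster.

```python
# (number of points, radius) for each cluster on the slide.
clusters = [(25, 10), (14, 8), (17, 7)]

max_radius = max(radius for _, radius in clusters)
service_cost = sum(points * radius for points, radius in clusters)

def cellular_cost(clusters, facility_cost):
    """Sum of facility and service costs; facility_cost per cluster is an assumed constant."""
    return sum(facility_cost + points * radius for points, radius in clusters)

print(max_radius)    # 10
print(service_cost)  # 481
```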
SLIDE 35 r-GATHER problem
“The r-Gather problem is to cluster n points in a metric space into a set of clusters, such that each cluster has at least r points. The objective is to minimize the maximum radius among the clusters.”
SLIDE 36
“Outlier” points
r-GATHER and r-CELLULAR CLUSTERING,
like k-anonymity, are sensitive to outlier points (i.e., points which are far removed from the rest of the data).
The clustering solutions in this paper are
generalized to allow an e fraction of outliers to be removed from the data, that is, e fraction of the tuples can be suppressed.
SLIDE 37
(r,e)-GATHER Clustering
The (r, e)-GATHER clustering formulation of
the problem allows an e fraction of the outlier points to be unclustered (i.e., these tuples are suppressed).
The paper finds that there is a polynomial-time algorithm that provides a 4-approximation for the (r,e)-GATHER problem.
SLIDE 38
r-CELLULAR CLUSTERING defined
The r-CELLULAR CLUSTERING problem is to arrange n points into clusters such that each cluster has at least r points and the total cellular cost is minimized.
SLIDE 39 (r,e)-CELLULAR CLUSTERING
There is also an (r,e)-CELLULAR CLUSTERING problem in which an e fraction of the points can be excluded.
The details of the constant-factor
approximation of this problem are deferred to the full version of this paper.
SLIDE 41
Anonymization And Clustering
k-Member Clustering Problem
From a given set of n records, find a set of clusters such
that
Each cluster contains at least k records, and
The total intra-cluster distance is minimized.
The problem is NP-complete
SLIDE 42
Distance Metrics
Distance metric for records
Measure the dissimilarities between two data
points
Sum of all dissimilarities between corresponding
attributes.
- Numerical values
- Categorical values
SLIDE 43
Distance between two numerical values
Definition
Let D be a finite numeric domain. Then the normalized distance between two values $v_i, v_j \in D$ is defined as:
$\delta_N(v_i, v_j) = |v_i - v_j| / |D|$
where |D| is the domain size measured by the difference between the maximum and minimum values in D.
Record | Age | Country | Occupation | Salary | Diagnosis
r1 | 41 | USA | Armed-Forces | ≥50K | Cancer
r2 | 57 | India | Tech-support | <50K | Flu
r3 | 40 | Canada | Teacher | <50K | Obesity
r4 | 38 | Iran | Tech-support | ≥50K | Flu
r5 | 24 | Brazil | Doctor | ≥50K | Cancer
r6 | 45 | Greece | Salesman | <50K | Fever
Example 1: Distance between r1 and r2 with respect to the Age attribute is |57-41|/|57-24| = 16/33 = 0.4848...
Example 2: Distance between r5 and r6 with respect to the Age attribute is |24-45|/|57-24| = 21/33 = 0.6363...
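The two Age examples can be reproduced with a few lines, taking the Age column of the six-record table as the domain. The function name `numeric_distance` is illustrative.

```python
ages = [41, 57, 40, 38, 24, 45]  # Age column, r1..r6

def numeric_distance(vi, vj, domain):
    """|vi - vj| divided by the domain size (max - min)."""
    return abs(vi - vj) / (max(domain) - min(domain))

print(numeric_distance(41, 57, ages))  # r1 vs r2: 16/33
print(numeric_distance(24, 45, ages))  # r5 vs r6: 21/33
```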
SLIDE 44
Distance between two categorical values
Equally different to each other
0 if they are the same
1 if they are different
Relationships can be easily captured in a taxonomy tree.
Taxonomy tree of Country
Taxonomy tree of Occupation
SLIDE 45
Distance between two categorical values
Definition
Let D be a categorical domain and $T_D$ be a taxonomy tree defined for D. The normalized distance between two values $v_i, v_j \in D$ is defined as:
$\delta_C(v_i, v_j) = H(\Lambda(v_i, v_j)) / H(T_D)$
where $\Lambda(x, y)$ is the subtree rooted at the lowest common ancestor of x and y, and H(T) represents the height of tree T.
Taxonomy tree of Country
Example: The distance between India and USA is 3/3 = 1.
The distance between India and Iran is 2/3 = 0.66...
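A sketch of the taxonomy-based distance. The Country tree itself is not reproduced in the text, so the tree below is a hypothetical one of height 3, built only to be consistent with the two examples; the node names and the helper functions are assumptions.

```python
# Hypothetical taxonomy: parent -> children. Leaves do not appear as keys.
TREE = {
    "Country": ["America", "Asia"],
    "America": ["North America", "South America"],
    "Asia": ["West Asia", "East Asia"],
    "North America": ["USA", "Canada"],
    "South America": ["Brazil"],
    "West Asia": ["Iran"],
    "East Asia": ["India"],
}

def height(node):
    """Height of the subtree rooted at node (a leaf has height 0)."""
    children = TREE.get(node, [])
    return 0 if not children else 1 + max(height(c) for c in children)

def leaves(node):
    """Set of leaf values below node."""
    children = TREE.get(node, [])
    return {node} if not children else set().union(*(leaves(c) for c in children))

def lowest_common_ancestor(values, node="Country"):
    """Descend from the root while a single child still covers all values."""
    for child in TREE.get(node, []):
        if set(values) <= leaves(child):
            return lowest_common_ancestor(values, child)
    return node

def categorical_distance(vi, vj):
    return height(lowest_common_ancestor([vi, vj])) / height("Country")

print(categorical_distance("India", "USA"))   # 1.0
print(categorical_distance("India", "Iran"))  # 2/3
```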
SLIDE 46
Distance between two records
Definition
Let $Q_T = \{N_1, \ldots, N_m, C_1, \ldots, C_n\}$ be the quasi-identifier of table T, where $N_i$ (i = 1, ..., m) is an attribute with a numeric domain and $C_j$ (j = 1, ..., n) is an attribute with a categorical domain. The distance of two records $r_1, r_2 \in T$ is defined as:
$\Delta(r_1, r_2) = \sum_{i=1}^{m} \delta_N(r_1[N_i], r_2[N_i]) + \sum_{j=1}^{n} \delta_C(r_1[C_j], r_2[C_j])$
where $\delta_N$ is the distance function for numeric attributes, and $\delta_C$ is the distance function for categorical attributes.
SLIDE 47
Distance between two records Continued…
Taxonomy tree of Country
Taxonomy tree of Occupation
Record | Age | Country | Occupation | Salary | Diagnosis
r1 | 41 | USA | Armed-Forces | ≥50K | Cancer
r2 | 57 | India | Tech-support | <50K | Flu
r3 | 40 | Canada | Teacher | <50K | Obesity
r4 | 38 | Iran | Tech-support | ≥50K | Flu
r5 | 24 | Brazil | Doctor | ≥50K | Cancer
r6 | 45 | Greece | Salesman | <50K | Fever
Example
The distance between r1 and r2 is (16/33) + (3/3) + 1 ≈ 2.485.
The distance between r1 and r3 is (1/33) + (1/3) + 1 ≈ 1.364.
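The record-distance examples can be checked with the sketch below. It makes two assumptions not stated in the text: the Country taxonomy has height 3 with every leaf at the same depth (so the height of the lowest common ancestor equals the tree height minus the shared path length), and the Occupation values used here sit under different top-level branches, which is modelled as a flat tree so that distinct occupations are at distance 1, as in the examples. All node names are hypothetical.

```python
AGE_RANGE = 57 - 24  # domain size of Age in the six-record table

# Hypothetical root-to-leaf paths (root omitted); all leaves at the same depth.
COUNTRY = {
    "USA": ("America", "North America", "USA"),
    "Canada": ("America", "North America", "Canada"),
    "Brazil": ("America", "South America", "Brazil"),
    "India": ("Asia", "East Asia", "India"),
    "Iran": ("Asia", "West Asia", "Iran"),
}
# Assumed flat taxonomy: any two distinct occupations meet only at the root.
OCCUPATION = {name: (name,) for name in ("Armed-Forces", "Tech-support", "Teacher")}

def taxonomy_distance(a, b, paths):
    """Height of the lowest common ancestor's subtree over the tree height."""
    pa, pb = paths[a], paths[b]
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    return (len(pa) - shared) / len(pa)

def record_distance(r1, r2):
    """Sum of the Age, Country and Occupation distances."""
    return (abs(r1[0] - r2[0]) / AGE_RANGE
            + taxonomy_distance(r1[1], r2[1], COUNTRY)
            + taxonomy_distance(r1[2], r2[2], OCCUPATION))

r1 = (41, "USA", "Armed-Forces")
r2 = (57, "India", "Tech-support")
r3 = (40, "Canada", "Teacher")
print(round(record_distance(r1, r2), 3))  # 2.485
print(round(record_distance(r1, r3), 3))  # 1.364
```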
SLIDE 48
Cost Function - Information loss (IL)
The amount of distortion (i.e., information loss) caused by the generalization process.
Note: Records in each cluster are generalized to share the same quasi-identifier value that represents every original quasi-identifier value in the cluster.
Definition: Let e = {r1, . . . , rk} be a cluster (i.e., equivalence class). Then the amount of information loss in e, denoted by IL(e), is defined as:
$IL(e) = |e| \cdot D(e), \quad D(e) = \sum_{i=1}^{m} \frac{\max_{N_i} - \min_{N_i}}{|N_i|} + \sum_{j=1}^{n} \frac{H(\Lambda(\cup C_j))}{H(T_{C_j})}$
where |e| is the number of records in e, $|N_i|$ represents the size of numeric domain $N_i$, $\max_{N_i}$ and $\min_{N_i}$ are the largest and smallest values of $N_i$ in e, $\Lambda(\cup C_j)$ is the subtree rooted at the lowest common ancestor of every value of $C_j$ in e, and H(T) is the height of tree T.
SLIDE 49
Cost Function - Information loss (IL)
Taxonomy tree of Country
Record | Age | Country | Occupation | Salary | Diagnosis
r1 | 41 | USA | Armed-Forces | ≥50K | Cancer
r2 | 57 | India | Tech-support | <50K | Flu
r3 | 40 | Canada | Teacher | <50K | Obesity
r4 | 38 | Iran | Tech-support | ≥50K | Flu
r5 | 24 | Brazil | Doctor | ≥50K | Cancer
r6 | 45 | Greece | Salesman | <50K | Fever
Example
Cluster e1
Age | Country | Occupation | Salary | Diagnosis
41 | USA | Armed-Forces | ≥50K | Cancer
40 | Canada | Teacher | <50K | Obesity
24 | Brazil | Doctor | ≥50K | Cancer
D(e1) = (41-24)/33 + (2/3) + 1 = 2.1818...
IL(e1) = 3 • D(e1) = 3 • 2.1818... = 6.5454...
Cluster e2
Age | Country | Occupation | Salary | Diagnosis
41 | USA | Armed-Forces | ≥50K | Cancer
57 | India | Tech-support | <50K | Flu
24 | Brazil | Doctor | ≥50K | Cancer
D(e2) = (57-24)/33 + (3/3) + 1 = 3
IL(e2) = 3 • D(e2) = 3 • 3 = 9
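The information-loss values for e1 and e2 can be reproduced with the sketch below. As before, the taxonomies are hypothetical: a Country tree of height 3 with all leaves at the same depth, and a flat Occupation tree, chosen only so that the terms match the slide's arithmetic.

```python
AGE_RANGE = 57 - 24  # domain size of Age in the six-record table

# Hypothetical root-to-leaf paths (root omitted); all leaves at the same depth.
COUNTRY = {
    "USA": ("America", "North America", "USA"),
    "Canada": ("America", "North America", "Canada"),
    "Brazil": ("America", "South America", "Brazil"),
    "India": ("Asia", "East Asia", "India"),
}
OCCUPATION = {name: (name,) for name in
              ("Armed-Forces", "Tech-support", "Teacher", "Doctor")}

def shared_prefix_len(paths):
    """Number of leading path levels on which all paths agree."""
    n = 0
    for level in zip(*paths):
        if len(set(level)) > 1:
            break
        n += 1
    return n

def information_loss(cluster):
    """IL(e) = |e| * D(e) for records of the form (age, country, occupation)."""
    ages = [r[0] for r in cluster]
    d = (max(ages) - min(ages)) / AGE_RANGE
    for column, paths in ((1, COUNTRY), (2, OCCUPATION)):
        value_paths = [paths[r[column]] for r in cluster]
        tree_height = len(value_paths[0])
        d += (tree_height - shared_prefix_len(value_paths)) / tree_height
    return len(cluster) * d

e1 = [(41, "USA", "Armed-Forces"), (40, "Canada", "Teacher"), (24, "Brazil", "Doctor")]
e2 = [(41, "USA", "Armed-Forces"), (57, "India", "Tech-support"), (24, "Brazil", "Doctor")]
print(round(information_loss(e1), 4))  # 6.5455
print(round(information_loss(e2), 4))  # 9.0
```

Under this measure e1 is the better cluster: its records share a continent and a narrower age range, so less generalization is needed.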
SLIDE 50
Greedy k-member clustering algorithm
SLIDE 51
Diversity Metrics
The Equal Diversity metric (ED)
assumes all sensitive attribute values are equally sensitive
where φ(e, s) = 1 if every record in e has the same s value; φ(e, s) = 0, otherwise
Modification to the greedy algorithm:
Sensitive Diversity metric (SD)
assumes there are two types of values in a sensitive attribute:
truly-sensitive
not-so-sensitive
where ψ(e, s) = 1 if every record in e has the same s value that is truly-sensitive; ψ(e, s) = 0, otherwise
Modification to the greedy algorithm:
SLIDE 52
Classification metric (CM)
preserve the correlation between quasi-identifier and class labels (non-sensitive values)
$CM = \frac{\sum_{r} Penalty(row\ r)}{N}$
where N is the total number of records, and Penalty(row r) = 1 if r is suppressed or the class label of r is different from the class label of the majority in the equivalence group, and 0 otherwise.
Modification to the greedy algorithm:
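As an illustration of the CM penalty defined on this slide (not of the greedy-algorithm modification), the sketch below assumes CM is the total penalty divided by N, with each equivalence group given as a list of class labels and suppressed rows passed as a count. The function name and the sample labels are hypothetical.

```python
from collections import Counter

def classification_metric(groups, suppressed_rows=0):
    """Total penalty over N: one per suppressed row and per row outside its group's majority class."""
    penalty = suppressed_rows
    total = suppressed_rows
    for labels in groups:
        majority_count = Counter(labels).most_common(1)[0][1]
        penalty += len(labels) - majority_count
        total += len(labels)
    return penalty / total

# Hypothetical equivalence groups, each listing the class labels of its records.
groups = [["≥50K", "≥50K", "<50K"], ["<50K", "<50K"]]
print(classification_metric(groups))  # 0.2
```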
SLIDE 53
Experimental Results
Experimental Setup
Data: Adult dataset from the UC Irvine Machine Learning Repository
10 attributes (2 numeric, 7 categorical, 1 class)
Compare with 2 other algorithms
Median partitioning (Mondrian algorithm)
k-Nearest neighbor
SLIDE 54
Experimental Results
SLIDE 55
Conclusion
Transforming the k-anonymity problem to the
k-member clustering problem
Overall the Greedy Algorithm produced better
results compared to other algorithms at the cost of efficiency