Anonymization Algorithms - Microaggregation and Clustering, Li Xiong

Anonymization Algorithms - Microaggregation and Clustering Li Xiong CS573 Data Privacy and Anonymity

Anonymization using Microaggregation or Clustering  Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002  Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005  Achieving anonymity via clustering, Aggarwal, PODS 2006  Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

Anonymization Methods  Perturbative: distort the data  statistics computed on the perturbed dataset should not differ significantly from the original  microaggregation, additive noise  Non-perturbative: don't distort the data  generalization: combine several categories to form new less specific category  suppression: remove values of a few attributes in some records, or entire records

Types of data  Continuous: attribute is numeric and arithmetic operations can be performed on it  Categorical: attribute takes values over a finite set and standard arithmetic operations don't make sense  Ordinal: ordered range of categories  ≤, min and max operations are meaningful  Nominal: unordered  only the equality comparison operation is meaningful

Measure tradeoffs  k-Anonymity: a dataset satisfies k-anonymity for k > 1 if at least k records exist for each combination of quasi-identifier values that occurs in it  Assuming k-anonymity is enough protection against disclosure risk, one can concentrate on information loss measures
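
The definition above can be checked mechanically. The following is a minimal sketch, assuming records are stored as dictionaries; the function name `is_k_anonymous` and the toy table are hypothetical and not taken from the slides.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in `records` is shared by at least k records."""
    counts = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical toy data, only for illustration.
table = [
    {"age": "20-30", "zip": "0820*", "disease": "flu"},
    {"age": "20-30", "zip": "0820*", "disease": "cold"},
    {"age": "30-40", "zip": "0820*", "disease": "flu"},
]
print(is_k_anonymous(table, ["age", "zip"], 2))      # third record is alone in its group
print(is_k_anonymous(table[:2], ["age", "zip"], 2))  # both records share one group
```

Only the quasi-identifier columns enter the check; the sensitive attribute (`disease` here) is ignored.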

Critique of Generalization/Suppression  Achieving optimal k-anonymity (minimum information loss) using generalization and suppression is NP-hard  Computational cost of finding the optimal generalization  How to determine the subset of appropriate generalizations  semantics of categories and intended use of data  e.g., ZIP code:  {08201, 08205} -> 0820* makes sense  {08201, 05201} -> 0*201 doesn't

Problems cont.  How to apply a generalization  globally  may generalize records that don't need it  locally  difficult to automate and analyze  number of generalizations is even larger  Generalization and suppression on continuous data are unsuitable  a numeric attribute becomes categorical and loses its numeric semantics

 Problems cont.  How to optimally combine generalization and suppression is unknown  Use of suppression is not homogeneous  suppress entire records or only some attributes of some records  blank a suppressed value or replace it with a neutral value

Microaggregation/Clustering  Two steps:  Partition the original dataset into clusters of similar records, each containing at least k records  For each cluster, compute an aggregation operation and use its result to replace the original records  e.g., mean for continuous data, median for ordinal data, mode for nominal data

Advantages:  a unified approach, unlike combination of generalization and suppression  Near-optimal heuristics exist  Doesn't generate new categories  Suitable for continuous data without removing their numeric semantics

Advantages cont.  Reduces data distortion  k-anonymity by generalization/suppression requires an attribute to be generalized or suppressed, even if all but one tuple in the set have the same value.  Clustering allows a cluster center to be published instead, “enabling us to release more information.”

Original Table
Name    Age  Salary
Amy     25   50
Brian   27   60
Carol   29   100
David   35   110
Evelyn  39   120

2-Anonymity with Generalization
Name    Age    Salary
Amy     20-30  50-100
Brian   20-30  50-100
Carol   20-30  50-100
David   30-40  100-150
Evelyn  30-40  100-150
Generalization allows pre-specified ranges

2-Anonymity with Clustering
Name    Age      Salary
Amy     [25-29]  [50-100]
Brian   [25-29]  [50-100]
Carol   [25-29]  [50-100]
David   [35-39]  [110-120]
Evelyn  [35-39]  [110-120]
Cluster 1 center: age 27 = (25+27+29)/3, salary 70 = (50+60+100)/3
Cluster 2 center: age 37 = (35+39)/2, salary 115 = (110+120)/2
Cluster centers ([27,70], [37,115]) published
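
The cluster centers in the clustering example can be recomputed from the original table. This is a small sketch of the aggregation step only: the partition is copied from the slide rather than computed, and the helper name `centroid` is hypothetical.

```python
# Original table from the slides: name -> (age, salary).
original = {
    "Amy": (25, 50), "Brian": (27, 60), "Carol": (29, 100),
    "David": (35, 110), "Evelyn": (39, 120),
}
# Partition used on the slide (each cluster has at least k = 2 records).
clusters = [["Amy", "Brian", "Carol"], ["David", "Evelyn"]]

def centroid(names):
    """Attribute-wise arithmetic mean of the named records."""
    rows = [original[n] for n in names]
    return tuple(sum(col) / len(rows) for col in zip(*rows))

centers = [centroid(c) for c in clusters]
# Every record is released as the center of its cluster.
released = {n: centers[i] for i, c in enumerate(clusters) for n in c}
print(centers)  # [(27.0, 70.0), (37.0, 115.0)]
```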

Another example: no common value among each attribute

Generalization vs. clustering  The generalized version of the table would need to suppress all attributes.  The clustered version of the table would publish the cluster center as (1, 1, 1, 1), and the radius as 1.

Multivariate microaggregation algorithm  MDAV-generic: Generic version of MDAV algorithm (Maximum Distance to Average Vector) from previous papers  Works with any type of data (continuous, ordinal, nominal), aggregation operator and distance calculation

MDAV-generic(R: dataset, k: integer)
  while |R| ≥ 3k
    1. compute the average record ~x of all records in R
    2. find the most distant record x_r from ~x
    3. find the most distant record x_s from x_r
    4. form two clusters: x_r with the k-1 records closest to x_r, and x_s with the k-1 records closest to x_s
    5. remove both clusters from R and continue with the remaining records
  end while
  if 2k ≤ |R| ≤ 3k-1
    1. compute the average record ~x of the remaining records in R
    2. find the most distant record x_r from ~x
    3. form a cluster from x_r and the k-1 records closest to x_r
    4. form another cluster containing the remaining records
  else (fewer than 2k records in R)
    form a new cluster from the remaining records
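
The MDAV-generic procedure can be sketched in Python for the continuous case. This is an illustrative sketch, not the authors' implementation: it assumes numeric records, the arithmetic mean and Euclidean distance, breaks ties arbitrarily, and expects at least k records. The names `mean_record`, `mdav` and `microaggregate` are hypothetical.

```python
import math

def mean_record(rows):
    """Arithmetic mean of a list of equal-length numeric records."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def mdav(X, k):
    """Partition the record indices of X into clusters of at least k
    records (assumes len(X) >= k). Returns a list of index lists."""
    remaining = list(range(len(X)))
    clusters = []

    def farthest(point):
        # Index of the remaining record most distant from `point`.
        return max(remaining, key=lambda i: math.dist(X[i], point))

    def pop_cluster(point):
        # Move the k remaining records nearest to `point` into a new cluster.
        nonlocal remaining
        ordered = sorted(remaining, key=lambda i: math.dist(X[i], point))
        clusters.append(ordered[:k])
        remaining = ordered[k:]

    while len(remaining) >= 3 * k:
        x_r = X[farthest(mean_record([X[i] for i in remaining]))]
        x_s = X[farthest(x_r)]
        pop_cluster(x_r)
        pop_cluster(x_s)
    if len(remaining) >= 2 * k:
        x_r = X[farthest(mean_record([X[i] for i in remaining]))]
        pop_cluster(x_r)
    if remaining:
        clusters.append(remaining)  # between k and 2k-1 records are left
    return clusters

def microaggregate(X, k):
    """Replace every record by the mean record of its cluster."""
    out = [None] * len(X)
    for cluster in mdav(X, k):
        center = mean_record([X[i] for i in cluster])
        for i in cluster:
            out[i] = center
    return out

# Age/salary table from the slides, k = 2.
X = [[25, 50], [27, 60], [29, 100], [35, 110], [39, 120]]
print(mdav(X, 2))
print(microaggregate(X, 2))
```

Because this sketch skips standardization, salary dominates the distances and the partition it finds need not match the one shown on the clustering slide.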

 MDAV-generic for continuous attributes  use arithmetic mean and Euclidean distance  standardize attributes (subtract mean and divide by standard deviation) to give them equal weight for computing distances  After MDAV-generic, destandardize attributes  x_ij is the value of the k-anonymized jth attribute for the ith record  m_1^0(j) and m_2(j) are the mean and variance of the k-anonymized jth attribute  u_1^0(j) and u_2(j) are the mean and variance of the original jth attribute
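
The destandardization formula itself did not survive in this transcript. The sketch below assumes the usual linear rescaling that makes the released attribute match the original mean and variance, x'_ij = sqrt(u_2(j) / m_2(j)) * (x_ij - m_1^0(j)) + u_1^0(j); this is a reconstruction from the symbol definitions, not a quotation. It also assumes population variance and non-constant columns. The function names are hypothetical.

```python
import math
from statistics import fmean, pvariance

def standardize(column):
    """Subtract the mean and divide by the standard deviation."""
    mu, sd = fmean(column), math.sqrt(pvariance(column))
    return [(x - mu) / sd for x in column]

def destandardize(anonymized, original):
    """Rescale a k-anonymized column so that its mean and variance
    equal those of the original column (assumed formula, see above)."""
    m1, m2 = fmean(anonymized), pvariance(anonymized)  # k-anonymized moments
    u1, u2 = fmean(original), pvariance(original)      # original moments
    scale = math.sqrt(u2 / m2)
    return [scale * (x - m1) + u1 for x in anonymized]

ages = [25, 27, 29, 35, 39]       # original column from the slides
ages_anon = [27, 27, 27, 37, 37]  # cluster means from the clustering slide
restored = destandardize(ages_anon, ages)
print(fmean(restored), pvariance(restored))
```

Records in the same cluster still share a single value after the rescaling, so the cluster sizes, and hence k-anonymity, are unaffected.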

 MDAV-generic for ordinal attributes  The distance between two categories a ≤ b in an attribute V_i:  d_ord(a,b) = |{i | a ≤ i < b}| / |D(V_i)|  i.e., the number of categories separating a and b divided by the number of categories in the attribute  Nominal attributes  The distance between two values is defined according to equality: 0 if they're equal, else 1
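
Both categorical distances are short enough to write out directly. In this sketch the ordered category list stands in for D(V_i); the education scale is a hypothetical example, not data from the slides.

```python
def d_ord(a, b, categories):
    """Ordinal distance: number of categories from the smaller value up to
    (but excluding) the larger one, divided by the number of categories.
    `categories` lists all categories of the attribute in order."""
    i, j = sorted((categories.index(a), categories.index(b)))
    return (j - i) / len(categories)

def d_nom(a, b):
    """Nominal distance: 0 if the values are equal, else 1."""
    return 0 if a == b else 1

# Hypothetical ordinal attribute with five categories.
education = ["primary", "secondary", "bachelor", "master", "phd"]
print(d_ord("primary", "master", education))  # 0.6
print(d_nom("red", "blue"))                   # 1
```

Sorting the two positions makes `d_ord` symmetric, so the caller does not need to pass the smaller category first.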

Empirical Results  Continuous attributes  From the U.S. Current Population Survey (1995)  1080 records described by 13 continuous attributes  Computed k-anonymity for k = 3, ..., 9 and quasi-identifiers with 6 and 13 attributes  Categorical attributes  From the U.S. Housing Survey (1993)  Three ordinal and eight nominal attributes  Computed k-anonymity for k = 2, ..., 9 and quasi-identifiers with 3, 4, 8 and 11 attributes

 IL measures for continuous attributes  IL1 = mean variation of individual attributes in original and k-anonymous datasets  IL2 = mean variation of attribute means in both datasets  IL3 = mean variation of attribute variances  IL4 = mean variation of attribute covariances  IL5 = mean variation of attribute Pearson's correlations  IL6 = 100 times the average of IL1-IL5
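
The slides do not define "mean variation". The sketch below assumes it is the average relative absolute difference |x - x'| / |x| between paired original and protected quantities, and shows only IL1 and IL2 under that assumption; the function names are hypothetical and original values are assumed to be nonzero.

```python
from statistics import fmean

def mean_variation(orig, prot):
    """Average relative absolute difference between paired values
    (assumed definition; requires nonzero original values)."""
    return fmean(abs(o - p) / abs(o) for o, p in zip(orig, prot))

def il1(X, Xp):
    # Compare every individual cell of the two datasets.
    return mean_variation([v for row in X for v in row],
                          [v for row in Xp for v in row])

def il2(X, Xp):
    # Compare the per-attribute means of the two datasets.
    return mean_variation([fmean(c) for c in zip(*X)],
                          [fmean(c) for c in zip(*Xp)])

# Original and clustered age/salary tables from the slides.
X = [[25, 50], [27, 60], [29, 100], [35, 110], [39, 120]]
Xp = [[27, 70], [27, 70], [27, 70], [37, 115], [37, 115]]
print(il1(X, Xp))
print(il2(X, Xp))  # 0.0, since replacing records by cluster means keeps attribute means
```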

 MDAV-generic preserves means and variances  The impact on the non-preserved statistics grows with the quasi-identifier length, as one would expect  For a fixed quasi-identifier length, the impact on the non-preserved statistics grows with k

 IL measures for categorical attributes  Dist: direct comparison of original and protected values using a categorical distance  CTBIL': mean variation of frequencies in contingency tables for original and protected data (based on another paper by Domingo-Ferrer and Torra)  ACTBIL': CTBIL' divided by the total number of cells in all considered tables  EBIL: Entropy-based information loss (based on another paper by Domingo-Ferrer and Torra)

Ordinal attribute protection using median

Ordinal attribute protection using convex median

r-Clustering  Attributes from a table are first redefined as points in metric space.  These points are clustered, and then the cluster centers are published, rather than the original quasi-identifiers.  r is the lower bound on the number of members in each cluster.  r is used instead of k to denote the minimum degree of anonymity because k is typically used in clustering to denote the number of clusters.

Data published for clusters  Three features are published for the clustered data  the quasi-identifying attributes of the cluster center  the number of points within the cluster  the set of sensitive values for the cluster (which remain unchanged, as with k-anonymity)  A measure of the quality of the clusters will also be published.
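
The published features for one cluster can be assembled as in the sketch below. Two choices here are assumptions, since the slides say only that a center and a quality measure are released: the center is taken as the centroid of the points, and quality is reported as the radius (the largest distance from the center to a member). The function name and the sample data are hypothetical.

```python
import math

def publish_cluster(points, sensitive_values):
    """Build the released summary for one cluster of quasi-identifier points."""
    center = tuple(sum(c) / len(points) for c in zip(*points))
    radius = max(math.dist(p, center) for p in points)
    return {
        "center": center,                       # quasi-identifiers of the center
        "size": len(points),                    # number of points in the cluster
        "sensitive": sorted(sensitive_values),  # sensitive values, unchanged
        "radius": radius,                       # assumed quality measure
    }

# Hypothetical cluster with r = 2 members.
summary = publish_cluster([(0, 0), (2, 0)], ["flu", "cold"])
print(summary)
```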
